<small>Copyright 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.<br>
This is AWS Content subject to the terms of the Customer Agreement</small>

# Module 4: End-to-End ROBO-Reviewer Pipeline

Welcome to the complete ROBO-Reviewer system! This notebook runs the full automated pipeline: it takes your generated videos and their prompts, checks how well each video matches its prompt, scores video quality, and produces evaluation reports.

## What This Pipeline Does

**Input**: Generated videos with their prompts  
**Output**: Comprehensive evaluation reports with scores and insights

**Automated Process**:
1. 🎬 **Video Discovery**: Finds your generated videos and prompts
2. 🖼️ **Frame Sampling**: Intelligently extracts representative frames
3. ❓ **Content Alignment**: Evaluates prompt-video alignment using Q&A
4. ⭐ **Quality Assessment**: Scores video quality using LLM-as-Judge
5. 📊 **Report Generation**: Creates comprehensive evaluation reports

## Install Dependencies

First, let's install the required packages for this notebook.

In [None]:
!pip install -q matplotlib opencv-python Pillow tqdm pandas

## Setup and Configuration

In [None]:
import boto3
import json
import time
from datetime import datetime
import pandas as pd
from IPython.display import HTML, display

from utils.content_alignment import *
from utils.quality_assessment import *
from utils.video_processing import *

In [None]:
# AWS Configuration
session = boto3.Session()
s3_client = session.client('s3')

# Import S3 bucket configuration utility
from utils.config import get_s3_bucket

# Get S3 bucket name
S3_BUCKET = get_s3_bucket(session)

# Load configuration for video prefix
with open('config.json', 'r') as f:
    config = json.load(f)

VIDEO_PREFIX = config['video_prefix']

# Evaluation Configuration
MODEL_ID = "us.amazon.nova-premier-v1:0"

# Focus areas for Q&A evaluation
FOCUS_AREAS = [
    "subject_alignment",
    "background_alignment", 
    "color_accuracy",
    "activity_alignment",
    "spatial_relationships"
]

print("🚀 ROBO-Reviewer Pipeline Initialized")
print(f"📁 S3 Bucket: {S3_BUCKET}")
print(f"🎬 Video Location: {VIDEO_PREFIX}")
print(f"🤖 Model: {MODEL_ID}")
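
If `config.json` is missing or has no `video_prefix` entry, the configuration cell stops with a bare `FileNotFoundError` or `KeyError`. A minimal sketch of a more explicit loader is shown here; the helper name `load_video_prefix` is hypothetical (it is not part of the workshop utilities), and it assumes `video_prefix` should be a non-empty string.

```python
import json
from pathlib import Path


def load_video_prefix(config_path="config.json"):
    """Return the 'video_prefix' value from a JSON config file.

    Hypothetical helper: raises ValueError with a readable message when the
    file is missing or the key is absent or empty.
    """
    path = Path(config_path)
    if not path.is_file():
        raise ValueError(f"Config file not found: {path}")

    with path.open("r", encoding="utf-8") as f:
        config = json.load(f)

    # The notebook expects a JSON object at the top level.
    if not isinstance(config, dict):
        raise ValueError(f"Expected a JSON object in {path}")

    prefix = config.get("video_prefix")
    if not isinstance(prefix, str) or not prefix.strip():
        raise ValueError(f"'video_prefix' is missing or empty in {path}")
    return prefix
```

With this in place, `VIDEO_PREFIX = load_video_prefix()` could stand in for the `open`/`json.load` lines, failing early with a message that names the problem.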

## Step 1: Video Discovery

First, let's discover all the videos and their corresponding prompts in your S3 bucket.

In [None]:
def discover_videos_and_prompts(bucket_name, prefix):
    """Discover video files and their corresponding prompt files"""
    
    try:
        # List all objects in the bucket with the given prefix
        response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
        
        if 'Contents' not in response:
            print(f"❌ No files found in s3://{bucket_name}/{prefix}")
            return []
        
        # Find video files
        video_files = []
        for obj in response['Contents']:
            key = obj['Key']
            if key.endswith('.mp4'):
                video_uri = f"s3://{bucket_name}/{key}"
                # Swap only the trailing extension; str.replace would also touch '.mp4' earlier in the key
                prompt_key = key[:-len('.mp4')] + '_prompt.txt'
                
                # Check if corresponding prompt file exists
                try:
                    prompt_response = s3_client.get_object(Bucket=bucket_name, Key=prompt_key)
                    prompt_text = prompt_response['Body'].read().decode('utf-8')
                    
                    video_files.append({
                        'video_uri': video_uri,
                        'prompt': prompt_text,
                        'video_name': key.split('/')[-1].replace('.mp4', '')
                    })
                except Exception as e:
                    print(f"⚠️  Could not read prompt file for {key} ({e}), skipping...")
        
        return video_files
        
    except Exception as e:
        print(f"❌ Error discovering videos: {e}")
        return []

# Discover videos
print("🔍 Discovering videos and prompts...")
video_data = discover_videos_and_prompts(S3_BUCKET, VIDEO_PREFIX)

if video_data:
    print(f"✅ Found {len(video_data)} video(s) with prompts:")
    for i, data in enumerate(video_data, 1):
        print(f"   {i}. {data['video_name']}")
        print(f"      Prompt: {data['prompt'][:100]}{'...' if len(data['prompt']) > 100 else ''}")
        print()
else:
    print("❌ No videos found. Please check your S3 bucket and prefix configuration.")
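
A single `list_objects_v2` call returns at most 1,000 keys, so the discovery function above only sees the first page when a prefix holds more objects than that. The sketch below shows one way to walk every page with a boto3 paginator. The helper name `list_keys_with_suffix` is hypothetical, and the client is passed in as an argument so the function can be tried without touching the notebook's globals.

```python
def list_keys_with_suffix(s3_client, bucket_name, prefix, suffix=".mp4"):
    """Return every key under `prefix` that ends in `suffix`, across all pages."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # A page has no 'Contents' entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                keys.append(obj["Key"])
    return keys
```

In this notebook it might be called as `list_keys_with_suffix(s3_client, S3_BUCKET, VIDEO_PREFIX)`, with the prompt lookup for each key left as it is.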

## Step 2: Automated Evaluation Pipeline

Now let's run the complete evaluation pipeline for each discovered video. This will:
- Generate Q&A pairs for content alignment
- Evaluate video quality using LLM-as-Judge
- Save results to S3

In [None]:
def run_complete_evaluation(video_info):
    """Run both content alignment and quality evaluation for a video"""
    
    video_uri = video_info['video_uri']
    video_name = video_info['video_name']
    prompt = video_info['prompt']
    
    print(f"\n🎬 Evaluating: {video_name}")
    print(f"📝 Prompt: {prompt}")
    print("=" * 80)
    
    results = {
        'video_name': video_name,
        'video_uri': video_uri,
        'prompt': prompt,
        'evaluation_timestamp': datetime.now().isoformat()
    }
    
    try:
        # 1. Content Alignment Evaluation (Q&A)
        print("\n❓ Running Content Alignment Evaluation...")
        alignment_results = evaluation_pipeline(
            s3_video_uri=video_uri,
            boto3_session=session,
            model_id=MODEL_ID,
            focus_areas=FOCUS_AREAS
        )
        
        if alignment_results and video_uri in alignment_results:
            results['content_alignment'] = alignment_results[video_uri]
            
            # Calculate overall alignment score
            total_score = sum(alignment_results[video_uri].values())
            max_score = len(FOCUS_AREAS) * 5  # 5 questions per focus area
            alignment_percentage = (total_score / max_score) * 100
            results['alignment_score'] = alignment_percentage
            
            print(f"✅ Content Alignment: {alignment_percentage:.1f}% ({total_score}/{max_score})")
        else:
            print("❌ Content alignment evaluation failed")
            results['content_alignment'] = {}
            results['alignment_score'] = 0
        
        # 2. Quality Assessment (LLM-as-Judge)
        print("\n⭐ Running Quality Assessment...")
        quality_results = video_quality_evaluation_pipeline(
            s3_video_uri=video_uri,
            boto3_session=session,
            model_id=MODEL_ID,
            temporal_consistency_flag=True,
            aesthetic_quality_flag=True,
            technical_quality_flag=True,
            motion_effects_flag=True
        )
        
        if quality_results:
            results['quality_assessment'] = quality_results
            
            # Calculate overall quality score
            quality_scores = [metrics['score'] for metrics in quality_results.values()]
            avg_quality = sum(quality_scores) / len(quality_scores)
            results['quality_score'] = avg_quality
            
            print(f"✅ Quality Score: {avg_quality:.1f}/5.0")
            for metric, data in quality_results.items():
                print(f"   {metric.replace('_', ' ').title()}: {data['score']}/5")
        else:
            print("❌ Quality assessment failed")
            results['quality_assessment'] = {}
            results['quality_score'] = 0
        
        print(f"\n🎯 Overall Evaluation Complete for {video_name}")
        return results
        
    except Exception as e:
        print(f"❌ Error evaluating {video_name}: {e}")
        results['error'] = str(e)
        return results

# Run evaluation for all discovered videos
if video_data:
    print("🚀 Starting Complete Evaluation Pipeline...")
    print(f"📊 Will evaluate {len(video_data)} video(s)")
    
    all_results = []
    
    for i, video_info in enumerate(video_data, 1):
        print(f"\n{'='*20} Video {i}/{len(video_data)} {'='*20}")
        result = run_complete_evaluation(video_info)
        all_results.append(result)
        
        # Add a small delay between evaluations
        if i < len(video_data):
            print("\n⏳ Waiting 5 seconds before next evaluation...")
            time.sleep(5)
    
    print("\n🎉 All evaluations complete!")
else:
    print("❌ No videos to evaluate")
    all_results = []
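
The overall alignment percentage computed in `run_complete_evaluation` is the sum of the per-focus-area scores divided by the maximum possible score. The worked example below repeats that arithmetic on made-up scores. The function name `alignment_percentage` and the sample values are illustrative only, and the cap of 5 points per focus area follows the comment in the code above.

```python
def alignment_percentage(focus_scores, focus_areas, questions_per_area=5):
    """Convert per-focus-area scores into an overall percentage.

    Mirrors the notebook's formula: total / (areas * questions_per_area) * 100.
    """
    if not focus_areas:
        return 0.0
    total_score = sum(focus_scores.values())
    max_score = len(focus_areas) * questions_per_area
    return (total_score / max_score) * 100


# Illustrative scores, not real evaluation output.
example_scores = {
    "subject_alignment": 5,
    "background_alignment": 4,
    "color_accuracy": 3,
    "activity_alignment": 5,
    "spatial_relationships": 3,
}
areas = list(example_scores)

# 5 + 4 + 3 + 5 + 3 = 20 points out of 5 areas * 5 = 25
print(f"{alignment_percentage(example_scores, areas):.1f}%")  # 80.0%
```

Because the denominator uses the configured focus areas, an area missing from the results lowers the percentage rather than being ignored.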

## Step 3: Generate Comprehensive Report

Let's create a comprehensive HTML report with all evaluation results.

In [None]:
def generate_evaluation_report(results_list):
    """Generate a comprehensive HTML evaluation report"""
    
    if not results_list:
        return "<h2>No evaluation results to display</h2>"
    
    # Calculate summary statistics
    valid_results = [r for r in results_list if 'error' not in r]
    
    if not valid_results:
        return "<h2>No valid evaluation results</h2>"
    
    avg_alignment = sum(r.get('alignment_score', 0) for r in valid_results) / len(valid_results)
    avg_quality = sum(r.get('quality_score', 0) for r in valid_results) / len(valid_results)
    
    # Start building HTML report
    html = f"""
    <style>
        .report-container {{ font-family: Arial, sans-serif; max-width: 1200px; margin: 0 auto; }}
        .header {{ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin-bottom: 20px; }}
        .summary {{ background: #f8f9fa; padding: 15px; border-radius: 8px; margin-bottom: 20px; }}
        .video-card {{ border: 1px solid #ddd; border-radius: 8px; margin-bottom: 20px; overflow: hidden; }}
        .video-header {{ background: #343a40; color: white; padding: 15px; }}
        .video-content {{ padding: 15px; }}
        .score-grid {{ display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 15px; margin: 15px 0; }}
        .score-card {{ background: #e9ecef; padding: 10px; border-radius: 5px; text-align: center; }}
        .score-high {{ background: #d4edda; color: #155724; }}
        .score-medium {{ background: #fff3cd; color: #856404; }}
        .score-low {{ background: #f8d7da; color: #721c24; }}
        .prompt-box {{ background: #f1f3f4; padding: 10px; border-left: 4px solid #4285f4; margin: 10px 0; }}
        .metric-details {{ margin-top: 15px; }}
        .metric-item {{ margin: 8px 0; padding: 8px; background: #f8f9fa; border-radius: 4px; }}
    </style>
    
    <div class="report-container">
        <div class="header">
            <h1>🤖 ROBO-Reviewer Evaluation Report</h1>
            <p>Generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
        </div>
        
        <div class="summary">
            <h2>📊 Summary Statistics</h2>
            <div class="score-grid">
                <div class="score-card">
                    <h3>Videos Evaluated</h3>
                    <h2>{len(valid_results)}</h2>
                </div>
                <div class="score-card {'score-high' if avg_alignment >= 80 else 'score-medium' if avg_alignment >= 60 else 'score-low'}">
                    <h3>Avg Content Alignment</h3>
                    <h2>{avg_alignment:.1f}%</h2>
                </div>
                <div class="score-card {'score-high' if avg_quality >= 4 else 'score-medium' if avg_quality >= 3 else 'score-low'}">
                    <h3>Avg Quality Score</h3>
                    <h2>{avg_quality:.1f}/5.0</h2>
                </div>
            </div>
        </div>
    """
    
    # Add individual video results
    for i, result in enumerate(valid_results, 1):
        alignment_score = result.get('alignment_score', 0)
        quality_score = result.get('quality_score', 0)
        
        # Determine score classes
        alignment_class = 'score-high' if alignment_score >= 80 else 'score-medium' if alignment_score >= 60 else 'score-low'
        quality_class = 'score-high' if quality_score >= 4 else 'score-medium' if quality_score >= 3 else 'score-low'
        
        html += f"""
        <div class="video-card">
            <div class="video-header">
                <h2>🎬 Video {i}: {result['video_name']}</h2>
            </div>
            <div class="video-content">
                <div class="prompt-box">
                    <strong>📝 Original Prompt:</strong><br>
                    {result['prompt']}
                </div>
                
                <div class="score-grid">
                    <div class="score-card {alignment_class}">
                        <h3>Content Alignment</h3>
                        <h2>{alignment_score:.1f}%</h2>
                    </div>
                    <div class="score-card {quality_class}">
                        <h3>Quality Score</h3>
                        <h2>{quality_score:.1f}/5.0</h2>
                    </div>
                </div>
        """
        
        # Add content alignment details
        if 'content_alignment' in result and result['content_alignment']:
            html += "<div class='metric-details'><h4>❓ Content Alignment Details:</h4>"
            for focus_area, score in result['content_alignment'].items():
                html += f"<div class='metric-item'><strong>{focus_area.replace('_', ' ').title()}:</strong> {score}/5</div>"
            html += "</div>"
        
        # Add quality assessment details
        if 'quality_assessment' in result and result['quality_assessment']:
            html += "<div class='metric-details'><h4>⭐ Quality Assessment Details:</h4>"
            for metric, data in result['quality_assessment'].items():
                metric_name = metric.replace('_', ' ').title()
                html += f"""
                <div class='metric-item'>
                    <strong>{metric_name}:</strong> {data['score']}/5<br>
                    <small><em>{data.get('justification', 'No justification provided')}</em></small>
                </div>
                """
            html += "</div>"
        
        html += "</div></div>"  # Close video-content and video-card
    
    html += "</div>"  # Close report-container
    
    return html

# Generate and display the report
if all_results:
    print("📊 Generating comprehensive evaluation report...")
    report_html = generate_evaluation_report(all_results)
    
    # Display the report
    display(HTML(report_html))
    
    # Create final_report directory if it doesn't exist
    import os
    os.makedirs('final_report', exist_ok=True)
    
    # Save report to file
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    report_filename = f"final_report/robo_reviewer_report_{timestamp}.html"
    
    with open(report_filename, 'w', encoding='utf-8') as f:
        f.write(report_html)
    
    print(f"\n💾 Report saved as: {report_filename}")
else:
    print("❌ No results to generate report")
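
The report inserts each prompt into the HTML as-is, so a prompt containing characters such as `<`, `>`, or `&` could be rendered incorrectly. A minimal sketch of escaping the text first is shown below, using `html.escape` from the standard library. The helper name `render_prompt_box` is hypothetical. `escape` is imported by name because `generate_evaluation_report` uses `html` as a local variable, which would shadow the module.

```python
from html import escape


def render_prompt_box(prompt):
    """Return the prompt wrapped in the report's prompt-box markup, HTML-escaped."""
    safe_prompt = escape(prompt)  # converts &, <, > and quotes to HTML entities
    return (
        '<div class="prompt-box">'
        f"<strong>Original Prompt:</strong><br>{safe_prompt}"
        "</div>"
    )


print(render_prompt_box('A robot holding a sign that says "<Hello & welcome>"'))
```

The same `escape` call could be applied to the video name and to the justification text returned by the quality assessment, since both also come from outside the notebook.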

## Step 4: Export Results for Further Analysis

Let's also create a structured data export for further analysis.

In [None]:
# Create a structured DataFrame for analysis
if all_results:
    # Prepare data for DataFrame
    df_data = []
    
    for result in all_results:
        if 'error' not in result:
            row = {
                'video_name': result['video_name'],
                'prompt': result['prompt'],
                'alignment_score': result.get('alignment_score', 0),
                'quality_score': result.get('quality_score', 0),
                'evaluation_timestamp': result['evaluation_timestamp']
            }
            
            # Add content alignment scores
            if 'content_alignment' in result:
                for focus_area, score in result['content_alignment'].items():
                    row[f'alignment_{focus_area.replace(" ", "_")}'] = score
            
            # Add quality scores
            if 'quality_assessment' in result:
                for metric, data in result['quality_assessment'].items():
                    row[f'quality_{metric}'] = data['score']
            
            df_data.append(row)
    
    # Create DataFrame
    if df_data:
        df = pd.DataFrame(df_data)
        
        print("📈 Evaluation Results Summary:")
        print("=" * 50)
        display(df[['video_name', 'alignment_score', 'quality_score']].round(2))
        
        # Save to CSV in final_report folder
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        csv_filename = f"final_report/robo_reviewer_results_{timestamp}.csv"
        df.to_csv(csv_filename, index=False)
        
        print(f"\n💾 Results exported to: {csv_filename}")
        
        # Save detailed results as JSON in final_report folder
        json_filename = f"final_report/robo_reviewer_detailed_{timestamp}.json"
        with open(json_filename, 'w', encoding='utf-8') as f:
            json.dump(all_results, f, indent=2, ensure_ascii=False)
        
        print(f"💾 Detailed results saved as: {json_filename}")
    else:
        print("❌ No valid results to export")
else:
    print("❌ No results to export")
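
The JSON export relies on every value in `all_results` being a type the `json` module can serialize. If the evaluation utilities ever return something else (this is an assumption; `Decimal` and `datetime` are used here only as examples), `json.dump` raises `TypeError`. One hedged way to guard against that is a `default=` fallback, sketched below with a hypothetical helper named `to_jsonable`.

```python
import json
from datetime import datetime
from decimal import Decimal


def to_jsonable(value):
    """Fallback for json.dump: convert types the json module does not handle."""
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, datetime):
        return value.isoformat()
    # Last resort: store the string representation rather than failing.
    return str(value)


sample = {"score": Decimal("4.5"), "evaluated_at": datetime(2025, 1, 1, 12, 0)}
print(json.dumps(sample, default=to_jsonable))
# {"score": 4.5, "evaluated_at": "2025-01-01T12:00:00"}
```

Passing `default=to_jsonable` to the existing `json.dump(all_results, ...)` call would apply the same fallback to the detailed export.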

## Step 5: Insights and Recommendations

Based on the evaluation results, let's generate some insights about Nova Reel's performance.

In [None]:
def generate_insights(results_list):
    """Generate insights and recommendations based on evaluation results"""
    
    valid_results = [r for r in results_list if 'error' not in r and r.get('alignment_score', 0) > 0]
    
    if not valid_results:
        return "No valid results for analysis"
    
    insights = []
    
    # Overall performance analysis
    avg_alignment = sum(r.get('alignment_score', 0) for r in valid_results) / len(valid_results)
    avg_quality = sum(r.get('quality_score', 0) for r in valid_results) / len(valid_results)
    
    insights.append(f"📊 **Overall Performance Analysis**")
    insights.append(f"   • Average Content Alignment: {avg_alignment:.1f}%")
    insights.append(f"   • Average Quality Score: {avg_quality:.1f}/5.0")
    
    # Performance categorization
    if avg_alignment >= 80:
        insights.append(f"   • ✅ Excellent content alignment - Nova Reel is following prompts very well")
    elif avg_alignment >= 60:
        insights.append(f"   • ⚠️  Good content alignment - Some room for prompt optimization")
    else:
        insights.append(f"   • ❌ Content alignment needs improvement - Consider refining prompts")
    
    if avg_quality >= 4:
        insights.append(f"   • ✅ High quality video generation")
    elif avg_quality >= 3:
        insights.append(f"   • ⚠️  Moderate quality - Some technical aspects could be improved")
    else:
        insights.append(f"   • ❌ Quality concerns - May need different generation parameters")
    
    # Focus area analysis
    if len(valid_results) > 0 and 'content_alignment' in valid_results[0]:
        focus_scores = {}
        for focus_area in FOCUS_AREAS:
            scores = [r['content_alignment'].get(focus_area, 0) for r in valid_results if 'content_alignment' in r]
            if scores:
                focus_scores[focus_area] = sum(scores) / len(scores)
        
        if focus_scores:
            insights.append(f"\n🎯 **Focus Area Performance**")
            sorted_areas = sorted(focus_scores.items(), key=lambda x: x[1], reverse=True)
            
            best_area = sorted_areas[0]
            worst_area = sorted_areas[-1]
            
            insights.append(f"   • 🏆 Strongest: {best_area[0].replace('_', ' ').title()} ({best_area[1]:.1f}/5.0)")
            insights.append(f"   • 📈 Needs work: {worst_area[0].replace('_', ' ').title()} ({worst_area[1]:.1f}/5.0)")
    
    # Quality metrics analysis
    if len(valid_results) > 0 and 'quality_assessment' in valid_results[0]:
        quality_metrics = {}
        for result in valid_results:
            if 'quality_assessment' in result:
                for metric, data in result['quality_assessment'].items():
                    if metric not in quality_metrics:
                        quality_metrics[metric] = []
                    quality_metrics[metric].append(data['score'])
        
        if quality_metrics:
            insights.append(f"\n⭐ **Quality Metrics Performance**")
            for metric, scores in quality_metrics.items():
                avg_score = sum(scores) / len(scores)
                metric_name = metric.replace('_', ' ').title()
                status = "✅" if avg_score >= 4 else "⚠️" if avg_score >= 3 else "❌"
                insights.append(f"   • {status} {metric_name}: {avg_score:.1f}/5.0")
    
    # Recommendations
    insights.append(f"\n💡 **Recommendations**")
    
    if avg_alignment < 70:
        insights.append(f"   • Consider more specific and detailed prompts")
        insights.append(f"   • Add technical specifications (4k, cinematic, etc.)")
        insights.append(f"   • Include camera movement descriptions")
    
    if avg_quality < 3.5:
        insights.append(f"   • Experiment with different seeds for better quality")
        insights.append(f"   • Try shorter, more focused prompts")
        insights.append(f"   • Consider adjusting video generation parameters")
    
    insights.append(f"   • Use this evaluation data to iterate and improve prompts")
    insights.append(f"   • Focus on improving the lowest-scoring focus areas")
    
    return "\n".join(insights)

# Generate and display insights
if all_results:
    print("🧠 Generating Insights and Recommendations...")
    print("=" * 60)
    insights = generate_insights(all_results)
    print(insights)
else:
    print("❌ No results available for insights generation")
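
Both the HTML report and the insights apply the same two-threshold rule to decide whether a score is high, medium, or low: 80 and 60 for alignment percentages, 4 and 3 for quality scores out of 5. The sketch below pulls that rule into one small function; the name `rate` is hypothetical and the sample values are illustrative.

```python
def rate(value, high, medium):
    """Classify a score as 'high', 'medium', or 'low' against two thresholds."""
    if value >= high:
        return "high"
    if value >= medium:
        return "medium"
    return "low"


# Thresholds taken from the cells above.
print(rate(72.0, high=80, medium=60))  # medium
print(rate(4.2, high=4, medium=3))     # high
print(rate(2.5, high=4, medium=3))     # low
```

Keeping the rule in one place means a threshold change only has to be made once, for example by building the CSS class as `f"score-{rate(alignment_score, 80, 60)}"`.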

## Summary

🎉 **Congratulations!** You've successfully completed the end-to-end ROBO-Reviewer pipeline!

### What You've Accomplished:

✅ **Automated Video Discovery**: Found and processed videos with their prompts  
✅ **Content Alignment Evaluation**: Measured how well videos match their prompts  
✅ **Quality Assessment**: Evaluated technical and aesthetic video quality  
✅ **Comprehensive Reporting**: Generated detailed HTML and data reports  
✅ **Actionable Insights**: Received recommendations for improvement  

### Files Generated:
- 📊 **HTML Report**: Visual evaluation report with scores and details
- 📈 **CSV Export**: Structured data for further analysis
- 📋 **JSON Details**: Complete evaluation results with all metadata

### Next Steps:
1. **Analyze Results**: Use the insights to understand Nova Reel's strengths and limitations
2. **Optimize Prompts**: Apply recommendations to improve future video generation
3. **Scale Up**: Use this pipeline to evaluate larger batches of videos
4. **Customize**: Modify evaluation criteria for your specific use cases

### The Power of ROBO-Reviewer:
You now have an automated system that can:
- Process hundreds of videos without manual intervention
- Provide objective, consistent evaluation criteria
- Generate professional reports for stakeholders
- Identify patterns and improvement opportunities
- Scale video evaluation to enterprise levels

**This is the future of AI video evaluation - automated, objective, and scalable!** 🚀