# End-to-End Video to Summary Pipeline

This notebook demonstrates the complete pipeline from video input to natural language summary, combining scene graph generation and text summarization.

## Overview

1. **Video Loading**: Load video data and extract frames
2. **Object Detection**: Detect objects in each frame
3. **Scene Graph Generation**: Generate relationships between objects
4. **Text Conversion**: Convert scene graphs to natural language
5. **Summarization**: Generate concise summaries
6. **Visualization**: Display the complete pipeline results

## Prerequisites

Make sure you have the required dependencies installed:
```bash
pip install torch torchvision opencv-python matplotlib networkx transformers
```


In [None]:
# Import required libraries
import sys
import os
import torch
import cv2
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import json
from pathlib import Path
from typing import List, Dict, Any, Tuple
from datetime import datetime

# Add the src directory to the path
sys.path.append(str(Path.cwd().parent / "src"))

# Import m3sgg components
from m3sgg.core.models.sttran import STTran
from m3sgg.core.datasets.action_genome import AG, cuda_collate_fn
from m3sgg.core.config import Config
from m3sgg.core.object_detector import detector
from m3sgg.core.evaluation_recall import BasicSceneGraphEvaluator
from m3sgg.language.summarization.summarize import linearize_triples, summarize_sentences

print("Libraries imported successfully!")


## 1. Complete Pipeline Class

Let's create a comprehensive pipeline class that handles the entire video-to-summary process.


In [None]:
class VideoToSummaryPipeline:
    """Complete pipeline for converting videos to natural language summaries.
    
    This class handles the entire process from video input to text summary,
    including object detection, scene graph generation, and text summarization.
    """
    
    def __init__(self, config, model_path=None, device=None):
        """Initialize the pipeline.
        
        :param config: Configuration object
        :param model_path: Path to pre-trained model
        :param device: Device to run inference on
        """
        self.config = config
        self.device = device or torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.model_path = model_path
        
        # Initialize components
        self.dataset = None
        self.dataloader = None
        self.obj_detector = None
        self.sgg_model = None
        self.obj_classes = None
        self.rel_classes = None
        
        # Results storage
        self.results = []
        
    def setup_dataset(self):
        """Set up the dataset and data loader."""
        try:
            self.dataset = AG(
                mode="test",
                datasize=self.config.datasize,
                data_path=self.config.data_path,
                filter_nonperson_box_frame=True,
                filter_small_box=False if self.config.mode == "predcls" else True,
            )
            
            self.dataloader = torch.utils.data.DataLoader(
                self.dataset, shuffle=False, num_workers=0, collate_fn=cuda_collate_fn
            )
            
            self.obj_classes = self.dataset.obj_classes
            self.rel_classes = self.dataset.rel_classes
            
            print(f"Dataset loaded: {len(self.dataset)} samples")
            return True
            
        except Exception as e:
            print(f"Error loading dataset: {e}")
            return False
    
    def setup_models(self):
        """Set up object detector and scene graph model."""
        success = True
        
        # Object detector
        try:
            self.obj_detector = detector(
                pretrained=True,
                object_classes=self.obj_classes,
                use_cuda=torch.cuda.is_available(),
                confidence_threshold=0.3
            )
            print("Object detector initialized")
        except Exception as e:
            print(f"Error initializing object detector: {e}")
            success = False
        
        # Scene graph model
        try:
            model_config = {
                'obj_classes': self.obj_classes,
                'rel_classes': self.rel_classes,
                'mode': self.config.mode,
                'num_gpus': 1 if torch.cuda.is_available() else 0,
                'device': self.device
            }
            
            self.sgg_model = STTran(**model_config)
            self.sgg_model.to(self.device)
            self.sgg_model.eval()
            
            print("Scene graph model initialized")
        except Exception as e:
            print(f"Error initializing scene graph model: {e}")
            success = False
        
        return success
    
    def process_video(self, max_frames=5):
        """Process a video and generate summaries.
        
        :param max_frames: Maximum number of frames to process
        :return: List of processed frame results
        """
        if not self.dataloader:
            print("Dataset not loaded. Call setup_dataset() first.")
            return []
        
        processed_frames = []
        
        try:
            # Get first batch
            batch = next(iter(self.dataloader))
            
            for frame_idx in range(min(max_frames, len(batch))):
                entry = batch[frame_idx]
                
                # Move data to device
                for key in entry:
                    if isinstance(entry[key], torch.Tensor):
                        entry[key] = entry[key].to(self.device)
                
                # Object detection
                if self.obj_detector:
                    try:
                        detections = self.obj_detector(entry)
                        entry.update(detections)
                    except Exception as e:
                        print(f"Object detection failed for frame {frame_idx}: {e}")
                
                # Scene graph generation
                if self.sgg_model:
                    try:
                        with torch.no_grad():
                            predictions = self.sgg_model(entry)
                        
                        # Convert to triples (simplified)
                        triples = self._extract_triples(entry, predictions)
                        
                        # Convert to text
                        sentences = linearize_triples(triples)
                        
                        # Generate summary
                        summary = self._generate_summary(sentences)
                        
                        frame_result = {
                            'frame_idx': frame_idx,
                            'triples': triples,
                            'sentences': sentences,
                            'summary': summary,
                            'objects': entry.get('labels', []),
                            'boxes': entry.get('boxes', [])
                        }
                        
                        processed_frames.append(frame_result)
                        print(f"Processed frame {frame_idx + 1}/{max_frames}")
                        
                    except Exception as e:
                        print(f"Scene graph generation failed for frame {frame_idx}: {e}")
        
        except Exception as e:
            print(f"Error processing video: {e}")
        
        self.results = processed_frames
        return processed_frames
    
    def _extract_triples(self, entry, predictions):
        """Extract scene graph triples from predictions.
        
        :param entry: Input data
        :param predictions: Model predictions
        :return: List of (subject, predicate, object) triples
        """
        # This is a simplified implementation
        # In practice, you would extract actual triples from the predictions
        triples = []
        
        objects = entry.get('labels', [])
        if len(objects) < 2:
            return triples
        
        # Generate some sample triples for demonstration
        for i in range(min(len(objects), 3)):
            for j in range(i + 1, min(len(objects), 3)):
                if i < len(self.obj_classes) and j < len(self.obj_classes):
                    subject = self.obj_classes[objects[i]]
                    obj = self.obj_classes[objects[j]]
                    predicate = "near"  # Simplified relationship
                    triples.append((subject, predicate, obj))
        
        return triples
    
    def _generate_summary(self, sentences):
        """Generate summary from sentences.
        
        :param sentences: List of sentences
        :return: Summary text
        """
        if not sentences:
            return "No objects or relationships detected."
        
        try:
            summary = summarize_sentences(sentences, model_type="t5")
            return summary
        except Exception as e:
            print(f"Summarization error: {e}")
            return "Error generating summary."
    
    def visualize_results(self, max_frames=4):
        """Visualize the pipeline results.
        
        :param max_frames: Maximum number of frames to display
        """
        if not self.results:
            print("No results to visualize. Process video first.")
            return
        
        n_frames = min(len(self.results), max_frames)
        fig, axes = plt.subplots(2, n_frames, figsize=(4 * n_frames, 8))
        if n_frames == 1:
            axes = axes.reshape(2, 1)
        
        for i, result in enumerate(self.results[:n_frames]):
            # Top row: Scene graph
            ax1 = axes[0, i]
            self._plot_scene_graph(result, ax1)
            
            # Bottom row: Text summary
            ax2 = axes[1, i]
            self._plot_text_summary(result, ax2)
        
        plt.tight_layout()
        plt.show()
    
    def _plot_scene_graph(self, result, ax):
        """Plot scene graph for a frame.
        
        :param result: Frame result data
        :param ax: Matplotlib axis
        """
        G = nx.DiGraph()
        
        # Add nodes
        objects = result.get('objects', [])
        for i, obj_id in enumerate(objects):
            if obj_id < len(self.obj_classes):
                obj_name = self.obj_classes[obj_id]
                G.add_node(i, label=obj_name)
        
        # Add edges
        triples = result.get('triples', [])
        for subject, predicate, obj in triples:
            # Find node indices
            subj_idx = None
            obj_idx = None
            for i, obj_id in enumerate(objects):
                if i < len(self.obj_classes) and self.obj_classes[obj_id] == subject:
                    subj_idx = i
                if i < len(self.obj_classes) and self.obj_classes[obj_id] == obj:
                    obj_idx = i
            
            if subj_idx is not None and obj_idx is not None:
                G.add_edge(subj_idx, obj_idx, label=predicate)
        
        # Draw graph
        if G.nodes():
            pos = nx.spring_layout(G, k=1, iterations=50)
            nx.draw_networkx_nodes(G, pos, node_color='lightblue', 
                                  node_size=1000, ax=ax)
            nx.draw_networkx_edges(G, pos, edge_color='gray', 
                                  arrows=True, arrowsize=20, ax=ax)
            nx.draw_networkx_labels(G, pos, 
                                  {i: G.nodes[i]['label'] for i in G.nodes()}, 
                                  font_size=8, ax=ax)
        
        ax.set_title(f"Frame {result['frame_idx']}")
        ax.axis('off')
    
    def _plot_text_summary(self, result, ax):
        """Plot text summary for a frame.
        
        :param result: Frame result data
        :param ax: Matplotlib axis
        """
        ax.text(0.05, 0.95, f"Summary:\\n{result['summary']}", 
                transform=ax.transAxes, fontsize=10, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8))
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.axis('off')
    
    def export_results(self, output_path="video_to_summary_results.json"):
        """Export results to JSON file.
        
        :param output_path: Path to save results
        :return: Exported data
        """
        export_data = {
            "metadata": {
                "timestamp": datetime.now().isoformat(),
                "total_frames": len(self.results),
                "object_classes": self.obj_classes,
                "relationship_classes": self.rel_classes
            },
            "results": self.results
        }
        
        with open(output_path, 'w') as f:
            json.dump(export_data, f, indent=2)
        
        print(f"Results exported to {output_path}")
        return export_data

print("VideoToSummaryPipeline class defined successfully!")


## 2. Initialize and Run Pipeline

Now let's set up the pipeline and run it on video data.


In [None]:
# Configuration
config = Config()
config.data_path = "../data/action_genome"  # Adjust path as needed
config.mode = "sgdet"  # Scene graph detection mode
config.datasize = 50  # Use smaller dataset for demo

# Check if CUDA is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Initialize pipeline
pipeline = VideoToSummaryPipeline(config, device=device)
print("Pipeline initialized successfully!")


In [None]:
# Setup dataset
print("Setting up dataset...")
dataset_success = pipeline.setup_dataset()

if dataset_success:
    print("Dataset setup successful!")
    print(f"Object classes: {len(pipeline.obj_classes)}")
    print(f"Relationship classes: {len(pipeline.rel_classes)}")
else:
    print("Dataset setup failed. Please check data path and availability.")


In [None]:
# Setup models
print("Setting up models...")
models_success = pipeline.setup_models()

if models_success:
    print("Models setup successful!")
else:
    print("Models setup failed. Some components may not be available.")


In [None]:
# Process video
print("Processing video...")
results = pipeline.process_video(max_frames=4)

if results:
    print(f"Successfully processed {len(results)} frames!")
    
    # Display results summary
    print("\nResults Summary:")
    print("=" * 40)
    for i, result in enumerate(results):
        print(f"Frame {i + 1}:")
        print(f"  Objects: {len(result['objects'])}")
        print(f"  Triples: {len(result['triples'])}")
        print(f"  Sentences: {len(result['sentences'])}")
        print(f"  Summary: {result['summary'][:100]}{'...' if len(result['summary']) > 100 else ''}")
        print()
else:
    print("Video processing failed. Check error messages above.")


## 3. Visualize Results

Let's visualize the complete pipeline results showing both scene graphs and text summaries.


In [None]:
# Visualize results
if results:
    print("Visualizing results...")
    pipeline.visualize_results(max_frames=4)
else:
    print("No results to visualize.")


## 4. Export and Analyze Results

Export the results and perform some analysis on the pipeline output.


In [None]:
# Export results
if results:
    print("Exporting results...")
    export_data = pipeline.export_results("end_to_end_results.json")
    
    print(f"Exported {len(export_data['results'])} frames")
    print(f"Object classes: {len(export_data['metadata']['object_classes'])}")
    print(f"Relationship classes: {len(export_data['metadata']['relationship_classes'])}")
    
    # Analyze results
    print("\nAnalysis:")
    print("-" * 20)
    
    total_objects = sum(len(r['objects']) for r in results)
    total_triples = sum(len(r['triples']) for r in results)
    total_sentences = sum(len(r['sentences']) for r in results)
    
    print(f"Total objects detected: {total_objects}")
    print(f"Total triples generated: {total_triples}")
    print(f"Total sentences generated: {total_sentences}")
    print(f"Average objects per frame: {total_objects / len(results):.1f}")
    print(f"Average triples per frame: {total_triples / len(results):.1f}")
    print(f"Average sentences per frame: {total_sentences / len(results):.1f}")
    
    # Show sample summaries
    print("\nSample Summaries:")
    print("-" * 20)
    for i, result in enumerate(results[:3]):  # Show first 3
        print(f"Frame {i + 1}: {result['summary']}")
        print()
else:
    print("No results to export.")


## Summary

This notebook demonstrated the complete end-to-end pipeline from video to natural language summary:

1. **Pipeline Design**: Created a comprehensive `VideoToSummaryPipeline` class
2. **Data Processing**: Loaded video data and set up object detection
3. **Scene Graph Generation**: Generated relationships between detected objects
4. **Text Conversion**: Converted scene graphs to natural language
5. **Summarization**: Generated concise summaries using T5 model
6. **Visualization**: Displayed both scene graphs and text summaries
7. **Export**: Saved results for further analysis

### Key Features

- **Modular Design**: Separate components for each pipeline stage
- **Error Handling**: Robust error handling throughout the pipeline
- **Visualization**: Combined scene graph and text visualization
- **Export Functionality**: JSON export for further analysis
- **Configurable**: Easy to modify parameters and models

### Next Steps

- Try different scene graph models (Tempura, SceneLLM, etc.)
- Experiment with different summarization models
- Add temporal consistency across frames
- Implement evaluation metrics
- Optimize for real-time processing

### Troubleshooting

If you encounter issues:

1. **Data Issues**: Ensure Action Genome dataset is properly set up
2. **Model Loading**: Check model availability and compatibility
3. **Memory Issues**: Reduce batch size or number of frames
4. **CUDA Issues**: Verify GPU availability and model compatibility
