# Model Comparison and Evaluation

This notebook builds a small evaluation framework and uses it to compare scene graph generation models on accuracy and inference time.

## Overview

1. **Evaluation Framework**: Define a `ModelEvaluator` class that scores predictions against ground truth
2. **Setup and Test Data**: Load the Action Genome dataset and build test samples
3. **Model Evaluation**: Run each model through the evaluator
4. **Comparison and Visualization**: Rank the models and plot accuracy and inference time
5. **Export and Summary**: Save the results to JSON and print a summary

## Prerequisites

Make sure you have the required dependencies installed:
```bash
pip install torch torchvision matplotlib seaborn pandas numpy scikit-learn
```
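
Before running the cells below, it can help to confirm that these packages are importable in the current kernel. The following is a minimal sketch using only the standard library; `REQUIRED_MODULES` and `find_missing` are hypothetical names introduced here, and the list assumes the usual import names (note that `scikit-learn` is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages in the pip command above.
# Assumption: scikit-learn is imported under the name `sklearn`.
REQUIRED_MODULES = ["torch", "torchvision", "matplotlib", "seaborn",
                    "pandas", "numpy", "sklearn"]


def find_missing(modules):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

This only checks that each package can be found; it does not verify versions or CUDA support.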


In [None]:
# Import required libraries
import sys
import os
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional
import json
from datetime import datetime
import time

# Add the src directory to the path
sys.path.append(str(Path.cwd().parent / "src"))

# Import m3sgg components
from m3sgg.core.models.sttran import STTran
from m3sgg.core.models.vlm.scene_graph_generator import VLMSceneGraphGenerator
from m3sgg.core.datasets.action_genome import AG, cuda_collate_fn
from m3sgg.core.config import Config
from m3sgg.core.evaluation_recall import BasicSceneGraphEvaluator

print("Libraries imported successfully!")


## 1. Evaluation Framework

Let's create an evaluation framework that scores each model, ranks the results, plots them, and exports them to JSON.


In [None]:
class ModelEvaluator:
    """Comprehensive evaluation framework for scene graph generation models."""
    
    def __init__(self, obj_classes, rel_classes):
        """Initialize the evaluator.
        
        :param obj_classes: List of object class names
        :param rel_classes: List of relationship class names
        """
        self.obj_classes = obj_classes
        self.rel_classes = rel_classes
        self.results = {}
        
    def evaluate_model(self, model, model_name, test_data, metrics=None):
        """Evaluate a single model on test data.
        
        :param model: Model to evaluate
        :param model_name: Name of the model
        :param test_data: Test dataset
        :param metrics: Names of metrics to report; this demo computes only
            'accuracy' and 'inference_time'
        :return: Evaluation results
        """
        if metrics is None:
            metrics = ['accuracy', 'inference_time']
        
        print(f"Evaluating {model_name}...")
        
        results = {
            'model_name': model_name,
            'metrics': {},
            'predictions': [],
            'ground_truth': [],
            'inference_times': []
        }
        
        total_samples = len(test_data)
        correct_predictions = 0
        total_predictions = 0
        
        for i, sample in enumerate(test_data):
            try:
                # Time only the model call so scoring overhead is excluded
                start_time = time.perf_counter()
                prediction = self._get_model_prediction(model, sample)
                elapsed = time.perf_counter() - start_time
                
                # Get ground truth
                ground_truth = self._get_ground_truth(sample)
                
                # Compute metrics
                sample_metrics = self._compute_sample_metrics(prediction, ground_truth)
                
                # Update counters
                if sample_metrics['correct']:
                    correct_predictions += 1
                total_predictions += 1
                
                # Record results
                results['predictions'].append(prediction)
                results['ground_truth'].append(ground_truth)
                results['inference_times'].append(elapsed)
                
            except Exception as e:
                # Skip failed samples; recording 0 here would drag the mean time down
                print(f"Error evaluating sample {i}: {e}")
        
        # Compute overall metrics (mean time covers successful samples only)
        times = results['inference_times']
        results['metrics']['accuracy'] = correct_predictions / total_predictions if total_predictions > 0 else 0
        results['metrics']['inference_time'] = float(np.mean(times)) if times else 0.0
        results['metrics']['total_samples'] = total_samples
        results['metrics']['successful_predictions'] = total_predictions
        
        self.results[model_name] = results
        return results
    
    def _get_model_prediction(self, model, sample):
        """Get prediction from model (simplified for demo).
        
        :param model: Model to use
        :param sample: Input sample
        :return: Model prediction
        """
        # Placeholder: echoes the labels stored in the sample and ignores `model`.
        # Replace this with a real forward pass before drawing conclusions.
        return {
            'objects': sample.get('objects', []),
            'relationships': sample.get('relationships', []),
            'confidence': 0.8  # Simulated confidence
        }
    
    def _get_ground_truth(self, sample):
        """Get ground truth from sample.
        
        :param sample: Input sample
        :return: Ground truth data
        """
        return {
            'objects': sample.get('gt_objects', []),
            'relationships': sample.get('gt_relationships', [])
        }
    
    def _compute_sample_metrics(self, prediction, ground_truth):
        """Compute metrics for a single sample.
        
        :param prediction: Model prediction
        :param ground_truth: Ground truth data
        :return: Sample metrics
        """
        # Simplified metric computation
        pred_objects = set(prediction.get('objects', []))
        gt_objects = set(ground_truth.get('objects', []))
        
        pred_rels = set(prediction.get('relationships', []))
        gt_rels = set(ground_truth.get('relationships', []))
        
        # Object score: fraction of ground-truth objects recovered (recall, despite the key name)
        object_accuracy = len(pred_objects.intersection(gt_objects)) / len(gt_objects) if gt_objects else 0
        
        # Relationship score: fraction of ground-truth relationships recovered
        rel_accuracy = len(pred_rels.intersection(gt_rels)) / len(gt_rels) if gt_rels else 0
        
        # A sample counts as correct when both scores exceed 0.5
        correct = object_accuracy > 0.5 and rel_accuracy > 0.5
        
        return {
            'object_accuracy': object_accuracy,
            'relationship_accuracy': rel_accuracy,
            'correct': correct
        }
    
    def compare_models(self, model_names=None):
        """Compare multiple models.
        
        :param model_names: List of model names to compare
        :return: Comparison results
        """
        if model_names is None:
            model_names = list(self.results.keys())
        
        comparison = {
            'models': model_names,
            'metrics': {},
            'rankings': {}
        }
        
        # Extract metrics for comparison
        for metric in ['accuracy', 'inference_time']:
            comparison['metrics'][metric] = {}
            for model_name in model_names:
                if model_name in self.results:
                    comparison['metrics'][metric][model_name] = self.results[model_name]['metrics'].get(metric, 0)
        
        # Create rankings
        for metric in comparison['metrics']:
            sorted_models = sorted(
                comparison['metrics'][metric].items(),
                key=lambda x: x[1],
                reverse=(metric != 'inference_time')  # Lower is better for inference time
            )
            comparison['rankings'][metric] = [model for model, _ in sorted_models]
        
        return comparison
    
    def visualize_comparison(self, comparison, save_path=None):
        """Visualize model comparison results.
        
        :param comparison: Comparison results
        :param save_path: Path to save the plot
        """
        fig, axes = plt.subplots(1, 2, figsize=(15, 6))
        
        # Accuracy comparison
        models = comparison['models']
        accuracies = [comparison['metrics']['accuracy'].get(model, 0) for model in models]
        
        axes[0].bar(models, accuracies, color='skyblue', alpha=0.7)
        axes[0].set_title('Model Accuracy Comparison')
        axes[0].set_ylabel('Accuracy')
        axes[0].set_ylim(0, 1)
        axes[0].tick_params(axis='x', rotation=45)
        
        # Inference time comparison
        times = [comparison['metrics']['inference_time'].get(model, 0) for model in models]
        
        axes[1].bar(models, times, color='lightcoral', alpha=0.7)
        axes[1].set_title('Model Inference Time Comparison')
        axes[1].set_ylabel('Inference Time (seconds)')
        axes[1].tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches='tight')
        
        plt.show()
    
    def export_results(self, output_path="model_evaluation_results.json"):
        """Export evaluation results.
        
        :param output_path: Path to save results
        :return: Exported data
        """
        export_data = {
            'metadata': {
                'timestamp': datetime.now().isoformat(),
                'total_models': len(self.results),
                'object_classes': self.obj_classes,
                'relationship_classes': self.rel_classes
            },
            'results': self.results
        }
        
        with open(output_path, 'w') as f:
            json.dump(export_data, f, indent=2)
        
        print(f"Results exported to {output_path}")
        return export_data

print("ModelEvaluator class defined successfully!")


## 2. Setup and Test Data

Let's set up the evaluation environment and create test data.


In [None]:
# Configuration
config = Config()
config.data_path = "../data/action_genome"  # Adjust path as needed
config.mode = "sgdet"  # Scene graph detection mode
config.datasize = 100  # Use larger dataset for evaluation

# Check if CUDA is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load dataset
try:
    dataset = AG(
        mode="test",
        datasize=config.datasize,
        data_path=config.data_path,
        filter_nonperson_box_frame=True,
        filter_small_box=config.mode != "predcls",
    )
    
    obj_classes = dataset.obj_classes
    rel_classes = dataset.rel_classes
    
    print("Dataset loaded successfully!")
    print(f"Object classes: {len(obj_classes)}")
    print(f"Relationship classes: {len(rel_classes)}")
    
except Exception as e:
    print(f"Error loading dataset: {e}")
    print("Using default classes for demonstration")
    obj_classes = ["person", "car", "bicycle", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light"]
    rel_classes = ["above", "below", "in front of", "behind", "next to", "on", "in", "at", "with", "near"]

# Create synthetic test data (predictions mirror the ground truth, so every sample scores as correct)
test_data = []
for i in range(10):  # Create 10 test samples
    sample = {
        'objects': obj_classes[:3],  # First 3 object classes
        'relationships': [(obj_classes[0], rel_classes[0], obj_classes[1])],
        'gt_objects': obj_classes[:3],
        'gt_relationships': [(obj_classes[0], rel_classes[0], obj_classes[1])]
    }
    test_data.append(sample)

print(f"Created {len(test_data)} test samples")


## 3. Model Evaluation

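Now let's run each model through the evaluation framework. The `model` entries below are `None` placeholders and `_get_model_prediction` returns the labels stored in each sample, so every model receives the same accuracy until real models are plugged in.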


In [None]:
# Initialize evaluator
evaluator = ModelEvaluator(obj_classes, rel_classes)

# Define models to evaluate
models_to_evaluate = [
    {
        'name': 'STTran',
        'model': None,  # Would be actual STTran model
        'description': 'Spatial-Temporal Transformer baseline'
    },
    {
        'name': 'VLM-SGG',
        'model': None,  # Would be actual VLM model
        'description': 'Vision-Language Model for scene graph generation'
    },
    {
        'name': 'Tempura',
        'model': None,  # Would be actual Tempura model
        'description': 'Temporal relationship modeling with uncertainty'
    },
    {
        'name': 'SceneLLM',
        'model': None,  # Would be actual SceneLLM model
        'description': 'Large language model integration'
    }
]

print("Models to evaluate:")
for model_info in models_to_evaluate:
    print(f"- {model_info['name']}: {model_info['description']}")

# Evaluate each model
print("\nStarting model evaluation...")
for model_info in models_to_evaluate:
    try:
        # Simulate model evaluation
        results = evaluator.evaluate_model(
            model=model_info['model'],
            model_name=model_info['name'],
            test_data=test_data
        )
        print(f"✓ {model_info['name']} evaluation completed")
    except Exception as e:
        print(f"✗ Error evaluating {model_info['name']}: {e}")

print("\nModel evaluation completed!")


## 4. Model Comparison and Visualization

Let's compare the evaluated models and create visualizations.


In [None]:
# Compare models
print("Comparing models...")
comparison = evaluator.compare_models()

if comparison['models']:
    print("\nModel Comparison Results:")
    print("=" * 50)
    
    # Display accuracy results
    print("\nAccuracy Rankings:")
    for i, model in enumerate(comparison['rankings']['accuracy'], 1):
        accuracy = comparison['metrics']['accuracy'].get(model, 0)
        print(f"{i}. {model}: {accuracy:.3f}")
    
    # Display inference time results
    print("\nInference Time Rankings (lower is better):")
    for i, model in enumerate(comparison['rankings']['inference_time'], 1):
        time_val = comparison['metrics']['inference_time'].get(model, 0)
        print(f"{i}. {model}: {time_val:.3f}s")
    
    # Create visualizations
    print("\nCreating visualizations...")
    evaluator.visualize_comparison(comparison, save_path="model_comparison.png")
    
    # Create detailed comparison table
    print("\nDetailed Comparison Table:")
    print("-" * 80)
    print(f"{'Model':<15} {'Accuracy':<10} {'Inference Time':<15} {'Rank (Acc)':<12} {'Rank (Time)':<12}")
    print("-" * 80)
    
    for model in comparison['models']:
        accuracy = comparison['metrics']['accuracy'].get(model, 0)
        time_val = comparison['metrics']['inference_time'].get(model, 0)
        acc_rank = comparison['rankings']['accuracy'].index(model) + 1
        time_rank = comparison['rankings']['inference_time'].index(model) + 1
        
        print(f"{model:<15} {accuracy:<10.3f} {time_val:<15.3f} {acc_rank:<12} {time_rank:<12}")
    
    print("-" * 80)
    
else:
    print("No models evaluated successfully.")


## 5. Export Results and Summary

Finally, let's export the evaluation results and provide a comprehensive summary.


In [None]:
# Export results
print("Exporting evaluation results...")
export_data = evaluator.export_results("model_evaluation_results.json")

print(f"Exported results for {len(export_data['results'])} models")

# Summary
print("\nModel Evaluation Summary:")
print("=" * 50)
print(f"Total models evaluated: {len(export_data['results'])}")
print(f"Test samples: {len(test_data)}")
print(f"Object classes: {len(obj_classes)}")
print(f"Relationship classes: {len(rel_classes)}")

if comparison['models']:
    print("\nBest Performing Models:")
    print("-" * 30)
    
    # Best accuracy
    best_acc_model = comparison['rankings']['accuracy'][0]
    best_acc_value = comparison['metrics']['accuracy'].get(best_acc_model, 0)
    print(f"Best Accuracy: {best_acc_model} ({best_acc_value:.3f})")
    
    # Best inference time
    best_time_model = comparison['rankings']['inference_time'][0]
    best_time_value = comparison['metrics']['inference_time'].get(best_time_model, 0)
    print(f"Best Inference Time: {best_time_model} ({best_time_value:.3f}s)")
    
    # Overall best (balanced)
    print("\nOverall Rankings (balanced accuracy and speed):")
    print("-" * 50)
    
    # Calculate balanced scores
    balanced_scores = {}
    fastest_time = min(comparison['metrics']['inference_time'].values(), default=0)
    for model in comparison['models']:
        accuracy = comparison['metrics']['accuracy'].get(model, 0)
        time_val = comparison['metrics']['inference_time'].get(model, 0)
        
        # Put both scores on a 0-1 scale (higher is better for both)
        acc_score = accuracy
        # Speed relative to the fastest model: 1.0 for the fastest, lower for slower ones.
        # The small constant avoids division by zero.
        time_score = (fastest_time + 1e-6) / (time_val + 1e-6)
        
        # Weighted average (equal weights)
        balanced_score = 0.5 * acc_score + 0.5 * time_score
        balanced_scores[model] = balanced_score
    
    # Sort by balanced score
    sorted_balanced = sorted(balanced_scores.items(), key=lambda x: x[1], reverse=True)
    
    for i, (model, score) in enumerate(sorted_balanced, 1):
        print(f"{i}. {model}: {score:.3f}")

print("\nEvaluation completed successfully!")


## Summary

This notebook built an evaluation and comparison framework for scene graph generation models:

1. **Evaluation Framework**: Created a flexible `ModelEvaluator` class for standardized evaluation
2. **Model Setup**: Registered four placeholder model entries (STTran, VLM-SGG, Tempura, SceneLLM) for comparison
3. **Metrics**: Implemented accuracy, mean inference time, and a balanced score that combines the two
4. **Visualization**: Created comparative visualizations and rankings
5. **Export**: Saved results for further analysis

### Key Features

- **Standardized Evaluation**: Consistent evaluation across different models
- **Multiple Metrics**: Accuracy, inference time, and balanced scoring
- **Visualization**: Clear comparison charts and tables
- **Flexible Framework**: Easy to add new models and metrics
- **Export Functionality**: JSON export for further analysis

### Evaluation Metrics

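- **Accuracy**: Fraction of evaluated samples in which more than half of the ground-truth objects and more than half of the ground-truth relationships are recovered
- **Inference Time**: Mean wall-clock time per evaluated sample, in seconds
- **Balanced Score**: Equal-weight average of accuracy and a speed score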
- **Rankings**: Comparative performance across models
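
The accuracy above is a thresholded, per-sample score. A finer-grained alternative is to report precision, recall, and F1 over the predicted and ground-truth sets. The following is a minimal sketch, assuming relationships are hashable `(subject, predicate, object)` tuples as in the demo test data; `triplet_prf` is a hypothetical helper introduced here, not part of `m3sgg`.

```python
def triplet_prf(predicted, ground_truth):
    """Precision, recall, and F1 between two collections of hashable items."""
    pred, gt = set(predicted), set(ground_truth)
    true_positives = len(pred & gt)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gt) if gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative triplets only; one of the two predictions matches the ground truth.
pred_rels = [("person", "on", "bicycle"), ("person", "near", "car")]
gt_rels = [("person", "on", "bicycle"), ("person", "behind", "bus")]
scores = triplet_prf(pred_rels, gt_rels)
print(scores)
```

Unlike the boolean `correct` flag, these values distinguish a model that predicts too many relationships (low precision) from one that misses them (low recall).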

### Model Comparison

The framework allows for easy comparison of:
- Traditional models (STTran, Tempura)
- VLM- and LLM-based models (VLM-SGG, SceneLLM)
- Different architectures and approaches
- Performance vs. speed trade-offs
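
One way to make the performance vs. speed trade-off explicit is to score speed relative to the fastest model, so that both terms lie between 0 and 1 before they are averaged. The sketch below is one possible formulation under that assumption; `balanced_ranking`, `acc_weight`, and the model names are hypothetical, and the numbers are made up for illustration rather than measured.

```python
def balanced_ranking(accuracy, seconds, acc_weight=0.5):
    """Rank models by a weighted mix of accuracy and relative speed.

    Speed is scored as fastest_time / model_time, so the fastest model
    gets 1.0 and a model twice as slow gets 0.5.
    """
    fastest = min(seconds.values())
    scores = {}
    for name, acc in accuracy.items():
        speed = fastest / seconds[name] if seconds[name] > 0 else 0.0
        scores[name] = acc_weight * acc + (1 - acc_weight) * speed
    # Highest combined score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers for illustration only; not measured results.
accuracy = {"model_a": 0.80, "model_b": 0.60}
seconds = {"model_a": 0.40, "model_b": 0.10}
ranking = balanced_ranking(accuracy, seconds)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Raising `acc_weight` toward 1.0 favors the more accurate model, while lowering it favors the faster one, so the weight should reflect how the models will actually be deployed.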

### Next Steps

- Add more sophisticated metrics (e.g. Recall@K over triplets, or ROUGE and BLEU for text outputs)
- Implement domain-specific evaluations
- Add temporal consistency metrics
- Create interactive visualizations
- Integrate with automated benchmarking
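
As a starting point for the temporal consistency item, one simple option is the mean Jaccard similarity between the relationship sets predicted for consecutive frames. This is a sketch of one possible definition, not an established metric from this project; `jaccard` and `temporal_consistency` are hypothetical helpers, and each frame is assumed to be a list of hashable `(subject, predicate, object)` tuples.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def temporal_consistency(frames):
    """Mean Jaccard similarity between relationship sets of consecutive frames."""
    sets = [set(frame) for frame in frames]
    if len(sets) < 2:
        return 1.0  # A single frame has nothing to be inconsistent with
    pairs = zip(sets, sets[1:])
    return sum(jaccard(a, b) for a, b in pairs) / (len(sets) - 1)


# Illustrative predictions for three consecutive frames.
frames = [
    [("person", "on", "bicycle")],
    [("person", "on", "bicycle"), ("person", "near", "car")],
    [("person", "near", "car")],
]
score = temporal_consistency(frames)
print(f"Temporal consistency: {score:.2f}")
```

A high score only means predictions change little between frames; it says nothing about whether they are correct, so it is best read alongside an accuracy-style metric.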

### Troubleshooting

If you encounter issues:

1. **Model Loading**: Ensure all models are properly initialized
2. **Data Issues**: Verify test data format and availability
3. **Memory Issues**: Reduce batch size or use smaller models
4. **Evaluation Errors**: Check model compatibility and data format
5. **Visualization Issues**: Ensure matplotlib and seaborn are installed
