# VLM Model Evaluation Notebook

Compare SmolVLM2 and PerceptionLM models across video and image benchmarks.

## Models
| Model | HuggingFace Path | Parameters |
|-------|------------------|------------|
| SmolVLM2-256M | `HuggingFaceTB/SmolVLM2-256M-Video-Instruct` | 256M |
| SmolVLM2-500M | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` | 500M |
| SmolVLM2-2.2B | `HuggingFaceTB/SmolVLM2-2.2B-Instruct` | 2.2B |
| PerceptionLM-1B | `facebook/Perception-LM-1B` | 1B |
| PerceptionLM-3B | `facebook/Perception-LM-3B` | 3B |

## Benchmarks (19 total)
- **Video (5)**: Video-MME, MLVU, MVBench, WorldSense, TempCompass
- **Image/Document (9)**: TextVQA, DocVQA, ChartQA, MMMU, MathVista, OCRBench, AI2D, ScienceQA, MMStar
- **PLM-VideoBench (5)**: Fine-Grained QA, Smart Glasses QA, Region Captioning, Region Temporal Localization, Region Dense Captioning

## GPU Requirements
- **T4 (Free tier)**: Works with sequential model loading and reduced batch sizes
- **A100 (Pro)**: Faster evaluation with larger batch sizes

**Runtime:** Go to `Runtime > Change runtime type` and select a GPU.
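
To see why sequential loading matters on a T4, here is a rough back-of-the-envelope estimate of weight memory alone. It assumes 2 bytes per parameter (16-bit weights) and the nominal parameter counts from the Models table; activations, KV cache, and image/video buffers are ignored, so real usage will be higher. The helper name `weight_memory_gib` is purely illustrative.

```python
# Rough weight-only memory estimate. Parameter counts are the nominal
# figures from the Models table, not measured values.
NOMINAL_PARAMS = {
    "SmolVLM2-256M": 256e6,
    "SmolVLM2-500M": 500e6,
    "SmolVLM2-2.2B": 2.2e9,
    "PerceptionLM-1B": 1e9,
    "PerceptionLM-3B": 3e9,
}


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / (1024**3)


for name, params in NOMINAL_PARAMS.items():
    print(f"{name}: ~{weight_memory_gib(params):.2f} GiB of 16-bit weights")

total = sum(weight_memory_gib(p) for p in NOMINAL_PARAMS.values())
print(f"All five resident at once: ~{total:.2f} GiB")
```

Under these assumptions, keeping all five models resident would take roughly 13 GiB before any activations are allocated, which leaves very little headroom on a 15 GB card. Loading one model at a time avoids that.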

## 1. Setup Environment

In [None]:
# Check GPU
!nvidia-smi

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(
        f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB"
    )

In [None]:
# Install dependencies
%pip install -q torch torchvision torchaudio
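# Quote version specifiers so the shell does not treat ">" as a redirect.
# The Auto classes imported in Section 5 need a recent transformers release;
# upgrade further if that import fails.
%pip install -q "transformers>=4.40.0" "accelerate>=0.27.0"
%pip install -q "lmms-eval>=0.2.0"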
%pip install -q decord av pillow einops safetensors
%pip install -q pandas matplotlib seaborn tabulate tqdm

# Try to install flash-attention (optional, improves performance on A100)
%pip install -q flash-attn --no-build-isolation 2>/dev/null || echo "Flash attention not installed (optional, A100 only)"

In [None]:
# Clone repository for evaluation utilities
!git clone https://github.com/YOUR_USERNAME/smolvlm_sandbox.git 2>/dev/null || echo "Repository already exists"
%cd smolvlm_sandbox

# Add to path
import sys

sys.path.insert(0, "/content/smolvlm_sandbox")

## 2. GPU Detection & Configuration

In [None]:
import gc
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

import torch


class GPUTier(Enum):
    T4_FREE = "t4_free"  # 15GB VRAM
    A100_PRO = "a100_pro"  # 40GB VRAM
    OTHER = "other"


@dataclass
class ColabConfig:
    """Colab-specific configuration based on GPU tier."""

    gpu_tier: GPUTier
    gpu_name: str
    vram_gb: float

    # Batch sizes per model size category
    batch_sizes: Dict[str, int] = field(default_factory=dict)

    # Memory management
    unload_between_models: bool = True
    unload_between_benchmarks: bool = False

    # Model loading
    device_map: str = "auto"
    dtype: str = "bfloat16"
    use_flash_attn: bool = False

    # Evaluation settings
    max_frames_video: int = 32

    @classmethod
    def detect(cls) -> "ColabConfig":
        """Auto-detect GPU and create appropriate config."""
        if not torch.cuda.is_available():
            raise RuntimeError(
                "No GPU available. Enable GPU in Runtime > Change runtime type"
            )

        gpu_name = torch.cuda.get_device_name(0)
        vram_bytes = torch.cuda.get_device_properties(0).total_memory
        vram_gb = vram_bytes / (1024**3)

        # Detect GPU tier
        if "T4" in gpu_name:
            tier = GPUTier.T4_FREE
        elif "A100" in gpu_name:
            tier = GPUTier.A100_PRO
        else:
            tier = GPUTier.OTHER

        # Configure based on tier
        if tier == GPUTier.T4_FREE:
            return cls(
                gpu_tier=tier,
                gpu_name=gpu_name,
                vram_gb=vram_gb,
                batch_sizes={
                    "256m": 8,
                    "500m": 4,
                    "1b": 2,
                    "2.2b": 1,
                    "3b": 1,
                },
                unload_between_models=True,
                unload_between_benchmarks=True,
                device_map="auto",
                dtype="bfloat16",
                use_flash_attn=False,
                max_frames_video=16,
            )
        elif tier == GPUTier.A100_PRO:
            return cls(
                gpu_tier=tier,
                gpu_name=gpu_name,
                vram_gb=vram_gb,
                batch_sizes={
                    "256m": 32,
                    "500m": 16,
                    "1b": 8,
                    "2.2b": 4,
                    "3b": 4,
                },
                unload_between_models=False,
                unload_between_benchmarks=False,
                device_map="auto",
                dtype="bfloat16",
                use_flash_attn=True,
                max_frames_video=32,
            )
        else:
            # Conservative defaults for unknown GPUs
            return cls(
                gpu_tier=tier,
                gpu_name=gpu_name,
                vram_gb=vram_gb,
                batch_sizes={
                    "256m": 4,
                    "500m": 2,
                    "1b": 1,
                    "2.2b": 1,
                    "3b": 1,
                },
                unload_between_models=True,
                unload_between_benchmarks=True,
                device_map="auto",
                dtype="bfloat16",
                use_flash_attn=False,
                max_frames_video=16,
            )

    def get_batch_size(self, model_size: str) -> int:
        """Get batch size for a model size."""
        size_map = {
            "256m": "256m",
            "500m": "500m",
            "1b": "1b",
            "2.2b": "2.2b",
            "2b": "2.2b",
            "3b": "3b",
        }
        normalized = size_map.get(model_size.lower(), "1b")
        return self.batch_sizes.get(normalized, 1)


def clear_gpu_memory():
    """Aggressively clear GPU memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()


# Detect and display configuration
print("=" * 60)
print("GPU Detection")
print("=" * 60)

config = ColabConfig.detect()

print(f"GPU: {config.gpu_name}")
print(f"VRAM: {config.vram_gb:.1f} GB")
print(f"Tier: {config.gpu_tier.value}")
print(f"Batch sizes: {config.batch_sizes}")
print(f"Unload between models: {config.unload_between_models}")
print(f"Max video frames: {config.max_frames_video}")
print("=" * 60)

## 3. Model & Benchmark Selection

In [None]:
# Benchmark definitions
VIDEO_BENCHMARKS = [
    "videomme",
    "mlvu",
    "mvbench",
    "worldsense",
    "tempcompass",
]

PLM_VIDEOBENCH = [
    "plm_fgqa",
    "plm_sgqa",
    "plm_rcap",
    "plm_rtloc",
    "plm_rdcap",
]

IMAGE_BENCHMARKS = [
    "textvqa",
    "docvqa",
    "chartqa",
    "mmmu_val",
    "mathvista_testmini",
    "ocrbench",
    "ai2d",
    "scienceqa_img",
    "mmstar",
]

BENCHMARK_GROUPS = {
    "video": VIDEO_BENCHMARKS,
    "image": IMAGE_BENCHMARKS,
    "plm": PLM_VIDEOBENCH,
    "all": VIDEO_BENCHMARKS + PLM_VIDEOBENCH + IMAGE_BENCHMARKS,
}


def resolve_benchmark_names(benchmark_str: str) -> List[str]:
    """Resolve benchmark string to list of task names."""
    if benchmark_str.lower() in BENCHMARK_GROUPS:
        return BENCHMARK_GROUPS[benchmark_str.lower()]
    return [b.strip() for b in benchmark_str.split(",")]


# ============================================
# USER CONFIGURATION - Modify these settings
# ============================================

# Models to evaluate (comment/uncomment to include/exclude)
MODELS_TO_EVALUATE = [
    # "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",  # Optional: smallest/fastest model
    "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
    # "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    "facebook/Perception-LM-1B",
    # "facebook/Perception-LM-3B",
]

# Benchmark selection: "all", "video", "image", "plm", or comma-separated list
BENCHMARK_MODE = "all"  # Options: "all", "video", "image", "plm"

# For custom selection, use:
# BENCHMARK_MODE = "videomme,textvqa,mmmu_val"

# Output directory (Google Drive)
OUTPUT_DIR = "/content/drive/MyDrive/vlm_evaluation_results"

# Resume from checkpoint if available
RESUME_FROM_CHECKPOINT = True

# ============================================
# Resolve benchmark selection
# ============================================

benchmarks_to_run = resolve_benchmark_names(BENCHMARK_MODE)

# Display configuration
print("=" * 60)
print("Evaluation Configuration")
print("=" * 60)

print(f"\nModels ({len(MODELS_TO_EVALUATE)}):")
for model in MODELS_TO_EVALUATE:
    print(f"  - {model}")

print(f"\nBenchmarks ({len(benchmarks_to_run)}):")

# Group by category
video_selected = [b for b in benchmarks_to_run if b in VIDEO_BENCHMARKS]
image_selected = [b for b in benchmarks_to_run if b in IMAGE_BENCHMARKS]
plm_selected = [b for b in benchmarks_to_run if b in PLM_VIDEOBENCH]

if video_selected:
    print(f"  Video ({len(video_selected)}): {', '.join(video_selected)}")
if image_selected:
    print(f"  Image ({len(image_selected)}): {', '.join(image_selected)}")
if plm_selected:
    print(f"  PLM ({len(plm_selected)}): {', '.join(plm_selected)}")

# Estimate time
total_evaluations = len(MODELS_TO_EVALUATE) * len(benchmarks_to_run)
est_time_per_eval = 10 if config.gpu_tier == GPUTier.A100_PRO else 20  # minutes
est_total_hours = (total_evaluations * est_time_per_eval) / 60

print(f"\nTotal evaluations: {total_evaluations}")
print(f"Estimated time: {est_total_hours:.1f} hours")

if config.gpu_tier == GPUTier.T4_FREE and est_total_hours > 12:
    print("\n WARNING: This may exceed Colab free tier time limits.")
    print("Consider running in batches or using Colab Pro.")
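
A typo in a custom `BENCHMARK_MODE` list would otherwise only surface once the evaluation loop reaches that task. A small pre-flight check can catch it earlier. This is only a sketch: `find_unknown_benchmarks` is a hypothetical helper, and the known-task set is copied from the lists above rather than queried from lmms-eval.

```python
from typing import List, Set

# Task names copied from the benchmark lists defined above.
KNOWN_TASKS: Set[str] = {
    "videomme", "mlvu", "mvbench", "worldsense", "tempcompass",
    "plm_fgqa", "plm_sgqa", "plm_rcap", "plm_rtloc", "plm_rdcap",
    "textvqa", "docvqa", "chartqa", "mmmu_val", "mathvista_testmini",
    "ocrbench", "ai2d", "scienceqa_img", "mmstar",
}


def find_unknown_benchmarks(benchmark_str: str) -> List[str]:
    """Return names from a comma-separated list that are not known tasks."""
    names = [b.strip() for b in benchmark_str.split(",") if b.strip()]
    return [n for n in names if n not in KNOWN_TASKS]


# "mmmu-val" uses a hyphen instead of an underscore, so it is reported.
unknown = find_unknown_benchmarks("videomme, textvqa, mmmu-val")
if unknown:
    print(f"Unrecognized benchmark names: {unknown}")
```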

## 4. Mount Google Drive & Checkpoint Setup

In [None]:
import json
import os
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Tuple

from google.colab import drive

# Mount Google Drive
drive.mount("/content/drive")

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"Output directory: {OUTPUT_DIR}")

# Checkpoint file for resuming
CHECKPOINT_FILE = os.path.join(OUTPUT_DIR, "evaluation_checkpoint.json")


@dataclass
class EvaluationCheckpoint:
    """Track evaluation progress for resumption."""

    completed: List[Tuple[str, str]]  # List of (model, benchmark) tuples
    results: Dict[str, Dict[str, Dict]]  # model -> benchmark -> metrics
    started_at: str
    last_updated: str

    @classmethod
    def load_or_create(cls, path: str) -> "EvaluationCheckpoint":
        """Load existing checkpoint or create new one."""
        if os.path.exists(path):
            with open(path) as f:
                data = json.load(f)
            return cls(
                completed=[tuple(x) for x in data.get("completed", [])],
                results=data.get("results", {}),
                started_at=data.get("started_at", datetime.now().isoformat()),
                last_updated=data.get("last_updated", datetime.now().isoformat()),
            )
        return cls(
            completed=[],
            results={},
            started_at=datetime.now().isoformat(),
            last_updated=datetime.now().isoformat(),
        )

    def save(self, path: str):
        """Save checkpoint to file."""
        self.last_updated = datetime.now().isoformat()
        with open(path, "w") as f:
            json.dump(
                {
                    "completed": self.completed,
                    "results": self.results,
                    "started_at": self.started_at,
                    "last_updated": self.last_updated,
                },
                f,
                indent=2,
                default=str,
            )

    def mark_complete(self, model: str, benchmark: str, metrics: Dict):
        """Mark an evaluation as complete."""
        key = (model, benchmark)
        if key not in self.completed:
            self.completed.append(key)

        if model not in self.results:
            self.results[model] = {}
        self.results[model][benchmark] = metrics

    def is_complete(self, model: str, benchmark: str) -> bool:
        """Check if an evaluation is already complete."""
        return (model, benchmark) in self.completed

    def get_remaining(
        self, models: List[str], benchmarks: List[str]
    ) -> List[Tuple[str, str]]:
        """Get list of remaining evaluations."""
        all_evals = [(m, b) for m in models for b in benchmarks]
        return [e for e in all_evals if e not in self.completed]


# Load or create checkpoint
if RESUME_FROM_CHECKPOINT:
    checkpoint = EvaluationCheckpoint.load_or_create(CHECKPOINT_FILE)
    completed_count = len(checkpoint.completed)
    if completed_count > 0:
        print(
            f"Resuming from checkpoint: {completed_count} evaluations already complete"
        )
else:
    checkpoint = EvaluationCheckpoint(
        completed=[],
        results={},
        started_at=datetime.now().isoformat(),
        last_updated=datetime.now().isoformat(),
    )

remaining = checkpoint.get_remaining(MODELS_TO_EVALUATE, benchmarks_to_run)
print(f"Remaining evaluations: {len(remaining)}")
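
The `tuple(x)` conversion in `load_or_create` matters because JSON has no tuple type: each `(model, benchmark)` pair is written as a two-element list, and a list never compares equal to a tuple. The standalone snippet below illustrates the round trip with a made-up entry.

```python
import json

completed = [("facebook/Perception-LM-1B", "textvqa")]

# Serialize and parse again, as save() and load_or_create() would.
restored = json.loads(json.dumps({"completed": completed}))["completed"]

print(type(restored[0]).__name__)       # the pair comes back as a list
print(restored[0] in completed)         # list vs. tuple: no match
print(tuple(restored[0]) in completed)  # matches again after conversion
```

Without the conversion, `is_complete` would report every finished evaluation as still pending after a resume.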

## 5. Evaluation Runner

In [None]:
import traceback

import torch
from tqdm.notebook import tqdm
from transformers import AutoModelForImageTextToText, AutoProcessor


class ColabEvaluationRunner:
    """Colab-optimized evaluation runner with memory management."""

    def __init__(
        self,
        config: ColabConfig,
        output_dir: str,
        checkpoint: EvaluationCheckpoint,
    ):
        self.config = config
        self.output_dir = output_dir
        self.checkpoint = checkpoint
        self.current_model = None
        self.current_processor = None
        self.current_model_path = None

    def load_model(self, model_path: str):
        """Load a model, unloading previous if necessary."""
        if self.current_model_path == model_path:
            return  # Already loaded

        # Unload previous model
        if self.current_model is not None:
            print(f"Unloading {self.current_model_path}...")
            self.unload_model()

        print(f"Loading {model_path}...")

        # Determine settings
        dtype = torch.bfloat16 if self.config.dtype == "bfloat16" else torch.float32
        attn_impl = "flash_attention_2" if self.config.use_flash_attn else "eager"

        # Load model
        try:
            self.current_model = AutoModelForImageTextToText.from_pretrained(
                model_path,
                torch_dtype=dtype,
                device_map=self.config.device_map,
                attn_implementation=attn_impl,
                trust_remote_code=True,
            )
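        except Exception as e:
            # Fall back to the default attention implementation
            # (for example when flash-attn is not installed)
            print(f"  Load with attn_implementation={attn_impl!r} failed: {e}")
            print("  Retrying with the default attention implementation...")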
            self.current_model = AutoModelForImageTextToText.from_pretrained(
                model_path,
                torch_dtype=dtype,
                device_map=self.config.device_map,
                trust_remote_code=True,
            )

        self.current_model.eval()

        # Load processor
        self.current_processor = AutoProcessor.from_pretrained(
            model_path,
            trust_remote_code=True,
        )

        self.current_model_path = model_path
        print("Model loaded successfully")

    def unload_model(self):
        """Unload current model to free memory."""
        if self.current_model is not None:
            del self.current_model
            del self.current_processor
            self.current_model = None
            self.current_processor = None
            self.current_model_path = None
            clear_gpu_memory()

    def run_single_evaluation(
        self,
        model_path: str,
        benchmark: str,
    ) -> Dict:
        """Run a single model-benchmark evaluation using lmms-eval."""

        # Check if already complete
        if self.checkpoint.is_complete(model_path, benchmark):
            print(f"  Skipping {benchmark} (already complete)")
            return self.checkpoint.results.get(model_path, {}).get(benchmark, {})

        # Get batch size
        model_size = self._get_model_size(model_path)
        batch_size = self.config.get_batch_size(model_size)

        # Reduce batch size for video benchmarks
        is_video = benchmark in VIDEO_BENCHMARKS or benchmark in PLM_VIDEOBENCH
        if is_video:
            batch_size = max(1, batch_size // 2)

        print(f"  Running {benchmark} (batch_size={batch_size})...")

        try:
            from lmms_eval import evaluator
            from lmms_eval.tasks import TaskManager

            # Build model args
            model_args = f"pretrained={model_path}"
            if self.config.dtype == "bfloat16":
                model_args += ",dtype=bfloat16"
            model_args += f",max_frames_num={self.config.max_frames_video}"

            # Initialize task manager
            task_manager = TaskManager()

            # Run evaluation
            results = evaluator.simple_evaluate(
                model="vlm",
                model_args=model_args,
                tasks=[benchmark],
                batch_size=batch_size,
                log_samples=True,
                task_manager=task_manager,
            )

            # Extract metrics
            metrics = results.get("results", {}).get(benchmark, {})

            # Save to checkpoint
            self.checkpoint.mark_complete(model_path, benchmark, metrics)
            self.checkpoint.save(CHECKPOINT_FILE)

            print(f"    Success: {self._format_metrics(metrics)}")
            return metrics

        except Exception as e:
            print(f"    Exception: {str(e)[:100]}")
            traceback.print_exc()
            return {"status": "error", "error": str(e)}

        finally:
            # Clear cache after video benchmarks
            if is_video:
                clear_gpu_memory()

    def run_all(
        self,
        models: List[str],
        benchmarks: List[str],
    ) -> Dict[str, Dict[str, Dict]]:
        """Run all evaluations with progress tracking."""

        remaining = self.checkpoint.get_remaining(models, benchmarks)
        total = len(remaining)

        print(f"\nStarting evaluation: {total} remaining")
        print("=" * 60)

        # Group by model for efficiency (minimize model reloads)
        model_to_benchmarks = {}
        for model, benchmark in remaining:
            if model not in model_to_benchmarks:
                model_to_benchmarks[model] = []
            model_to_benchmarks[model].append(benchmark)

        # Progress bar
        pbar = tqdm(total=total, desc="Evaluating")

        for model_path in models:
            if model_path not in model_to_benchmarks:
                continue

            model_benchmarks = model_to_benchmarks[model_path]
            print(f"\n Model: {model_path}")
            print("-" * 40)

            for benchmark in model_benchmarks:
                self.run_single_evaluation(model_path, benchmark)
                pbar.update(1)

                # Clear cached GPU memory between benchmarks if configured (T4);
                # the model itself is only unloaded after all its benchmarks
                if self.config.unload_between_benchmarks:
                    clear_gpu_memory()

            # Unload model after all benchmarks
            if self.config.unload_between_models:
                self.unload_model()

        pbar.close()
        print("\n" + "=" * 60)
        print("Evaluation complete!")

        return self.checkpoint.results

    def _get_model_size(self, model_path: str) -> str:
        """Extract model size from path."""
        path_lower = model_path.lower()
        if "256m" in path_lower:
            return "256m"
        elif "500m" in path_lower:
            return "500m"
        elif "2.2b" in path_lower or "2b" in path_lower:
            return "2.2b"
        elif "1b" in path_lower:
            return "1b"
        elif "3b" in path_lower:
            return "3b"
        return "1b"  # Default

    def _format_metrics(self, metrics: Dict) -> str:
        """Format metrics for display."""
        if not metrics:
            return "No metrics"

        parts = []
        for k, v in list(metrics.items())[:3]:
            if isinstance(v, float):
                parts.append(f"{k}={v:.3f}")
            else:
                parts.append(f"{k}={v}")
        return ", ".join(parts)


# Create runner
runner = ColabEvaluationRunner(
    config=config,
    output_dir=OUTPUT_DIR,
    checkpoint=checkpoint,
)

print("Evaluation runner ready.")
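
The substring checks in `_get_model_size` depend on the order of the branches (for instance, `"2b"` would also match inside a hypothetical `"12b"` name). One alternative, sketched below under the assumption that the size appears as a separate token such as `-256M` or `-2.2B`, is to extract it with a regular expression. `parse_model_size` is an illustrative name, not part of the runner.

```python
import re
from typing import Optional

# A size token is digits (optionally with a decimal part) followed by m or b,
# and must not be glued to other letters, digits, or a dot on the left,
# nor to letters or digits on the right.
_SIZE_TOKEN = re.compile(r"(?<![a-z0-9.])(\d+(?:\.\d+)?)([mb])(?![a-z0-9])")


def parse_model_size(model_path: str) -> Optional[str]:
    """Return a size tag such as '256m' or '2.2b', or None if none is found."""
    match = _SIZE_TOKEN.search(model_path.lower())
    return f"{match.group(1)}{match.group(2)}" if match else None


for path in [
    "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    "facebook/Perception-LM-1B",
]:
    print(path, "->", parse_model_size(path))
```

The trailing `2` in `SmolVLM2` is skipped because it is attached to letters, so only the dedicated size token is picked up.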

## 6. Inference Test (Sanity Check)

Test that all models can load and generate outputs before running full benchmarks.

In [None]:
from io import BytesIO

import requests
from PIL import Image

# Download test image
TEST_IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"

print("Downloading test image...")
response = requests.get(TEST_IMAGE_URL, timeout=30)
response.raise_for_status()  # Fail early on a bad download
test_image = Image.open(BytesIO(response.content)).convert("RGB")
print(f"Test image size: {test_image.size}")

# Display test image
display(test_image.resize((256, 256)))

In [None]:
def test_model_inference(model_path: str, image: Image.Image) -> dict:
    """Test inference for a single model."""
    result = {
        "model": model_path,
        "status": "unknown",
        "output": None,
        "error": None,
    }

    try:
        print(f"\nTesting: {model_path}")
        print("-" * 50)

        # Load model
        runner.load_model(model_path)

        # Prepare prompt
        prompt = "Describe this image briefly."

        # SmolVLM2 and PerceptionLM share the same chat-template flow here
        messages = [
            {
                "role": "user",
                "content": [{"type": "image"}, {"type": "text", "text": prompt}],
            }
        ]
        text = runner.current_processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        # Cast floating-point inputs (pixel values) to the model dtype so they
        # match the bfloat16 weights; integer token ids are left unchanged
        inputs = runner.current_processor(
            images=[image],
            text=text,
            return_tensors="pt",
        ).to(runner.current_model.device, dtype=runner.current_model.dtype)

        # Generate
        with torch.no_grad():
            outputs = runner.current_model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False,
            )

        # Decode
        generated_text = runner.current_processor.batch_decode(
            outputs[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
        )[0]

        result["status"] = "PASS"
        result["output"] = generated_text.strip()

        print("  Status: PASS")
        # Truncate long outputs for display
        output_preview = result["output"]
        if len(output_preview) > 100:
            output_preview = output_preview[:100] + "..."
        print(f"  Output: {output_preview}")

    except Exception as e:
        result["status"] = "FAIL"
        result["error"] = str(e)
        print("  Status: FAIL")
        print(f"  Error: {str(e)[:200]}")

    finally:
        # Unload model
        if config.unload_between_models:
            runner.unload_model()

    return result


# Run inference tests
print("=" * 60)
print("INFERENCE SANITY CHECK")
print("=" * 60)

inference_results = []
for model_path in MODELS_TO_EVALUATE:
    result = test_model_inference(model_path, test_image)
    inference_results.append(result)

# Summary
print("\n" + "=" * 60)
print("INFERENCE TEST SUMMARY")
print("=" * 60)

passed = sum(1 for r in inference_results if r["status"] == "PASS")
failed = sum(1 for r in inference_results if r["status"] == "FAIL")

for r in inference_results:
    status_icon = "OK" if r["status"] == "PASS" else "X"
    model_name = r["model"].split("/")[-1]
    print(f"  [{status_icon}] {model_name}")

print(f"\nPassed: {passed}/{len(inference_results)}")

if failed > 0:
    print(f"\n WARNING: {failed} model(s) failed inference test.")
    print(
        "You may want to remove failed models from MODELS_TO_EVALUATE before running benchmarks."
    )
else:
    print("\n All models passed inference test! Ready to run benchmarks.")

## 7. Execute Evaluation

Run the full benchmark evaluation. This will automatically resume from checkpoint if interrupted.

In [None]:
# Run all evaluations
# This will automatically resume from checkpoint if interrupted

results = runner.run_all(
    models=MODELS_TO_EVALUATE,
    benchmarks=benchmarks_to_run,
)

print(f"\nResults saved to: {OUTPUT_DIR}")
print(f"Checkpoint file: {CHECKPOINT_FILE}")

## 8. Results Aggregation

In [None]:
import pandas as pd


def aggregate_results(results: Dict[str, Dict[str, Dict]]) -> pd.DataFrame:
    """Aggregate results into a comparison DataFrame."""

    rows = []
    for model_path, benchmarks in results.items():
        model_name = model_path.split("/")[-1]  # Short name

        for benchmark, metrics in benchmarks.items():
            if isinstance(metrics, dict) and "status" not in metrics:
                # Get primary metric
                primary = None
                for key in ["accuracy", "acc", "exact_match", "bleu", "anls"]:
                    if key in metrics:
                        primary = metrics[key]
                        break

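                if primary is None:
                    # Fall back to the first numeric value, skipping string
                    # entries such as task aliases
                    primary = next(
                        (v for v in metrics.values() if isinstance(v, (int, float))),
                        None,
                    )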

                # Determine category
                if benchmark in VIDEO_BENCHMARKS:
                    category = "Video"
                elif benchmark in IMAGE_BENCHMARKS:
                    category = "Image"
                elif benchmark in PLM_VIDEOBENCH:
                    category = "PLM"
                else:
                    category = "Other"

                rows.append(
                    {
                        "model": model_name,
                        "model_path": model_path,
                        "benchmark": benchmark,
                        "category": category,
                        "score": primary if isinstance(primary, (int, float)) else 0.0,
                        **{
                            k: v
                            for k, v in metrics.items()
                            if isinstance(v, (int, float))
                        },
                    }
                )

    return pd.DataFrame(rows)


# Create results DataFrame
df = aggregate_results(checkpoint.results)

if len(df) > 0:
    # Save to CSV
    csv_path = os.path.join(OUTPUT_DIR, "results_summary.csv")
    df.to_csv(csv_path, index=False)
    print(f"Results saved to: {csv_path}")

    # Display summary
    print("\n" + "=" * 60)
    print("Results Summary")
    print("=" * 60)
    display(df)
else:
    print("No results to display yet.")
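
The key lookup in `aggregate_results` only matches exact names such as `acc`. If the harness reports keys with a filter suffix (for example `acc,none`, which is an assumption about the output format and not something this notebook verifies), a lookup on the part before the comma is more forgiving. The sketch below uses a hypothetical helper, `pick_primary_metric`, and a made-up metrics dictionary.

```python
from typing import Dict, Optional

PREFERRED_METRICS = ["accuracy", "acc", "exact_match", "bleu", "anls"]


def pick_primary_metric(metrics: Dict[str, object]) -> Optional[float]:
    """Pick a headline score, tolerating keys with a ',filter' suffix."""
    numeric: Dict[str, float] = {}
    for key, value in metrics.items():
        base = key.split(",")[0]
        # Keep real numbers only; skip booleans and standard-error entries.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        if base.endswith("_stderr"):
            continue
        numeric[base] = float(value)

    for name in PREFERRED_METRICS:
        if name in numeric:
            return numeric[name]
    # Fall back to the first numeric entry, if any.
    return next(iter(numeric.values()), None)


# Made-up example in the assumed "metric,filter" format.
example = {"alias": "textvqa", "exact_match,none": 0.5, "exact_match_stderr,none": 0.01}
print(pick_primary_metric(example))
```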

## 9. Comparison Table

In [None]:
from tabulate import tabulate


def create_comparison_table(df: pd.DataFrame) -> pd.DataFrame:
    """Create a pivot table comparing models across benchmarks."""
    if len(df) == 0:
        return pd.DataFrame()

    # Pivot: benchmarks as rows, models as columns
    pivot = df.pivot_table(
        index="benchmark",
        columns="model",
        values="score",
        aggfunc="first",
    )

    # Format scores as percentages
    pivot = pivot.applymap(
        lambda x: f"{x:.1%}"
        if pd.notna(x) and x <= 1
        else (f"{x:.2f}" if pd.notna(x) else "N/A")
    )

    # Add benchmark category
    def get_category(benchmark):
        if benchmark in VIDEO_BENCHMARKS:
            return "Video"
        elif benchmark in IMAGE_BENCHMARKS:
            return "Image"
        elif benchmark in PLM_VIDEOBENCH:
            return "PLM"
        return "Other"

    pivot["Category"] = pivot.index.map(get_category)

    # Reorder columns
    cols = ["Category"] + [c for c in pivot.columns if c != "Category"]
    pivot = pivot[cols]

    # Sort by category then benchmark name
    pivot = pivot.sort_values(["Category", pivot.index.name])

    return pivot


if len(df) > 0:
    # Create comparison table
    comparison = create_comparison_table(df)

    print("\n" + "=" * 80)
    print("Model Comparison Table")
    print("=" * 80)

    # Display as formatted table
    print(tabulate(comparison, headers="keys", tablefmt="grid"))

    # Save to file
    table_path = os.path.join(OUTPUT_DIR, "comparison_table.txt")
    with open(table_path, "w") as f:
        f.write(tabulate(comparison, headers="keys", tablefmt="grid"))
    print(f"\nTable saved to: {table_path}")
else:
    print("No results to display yet.")

## 10. Bar Charts

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Set style
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("husl")


def plot_benchmark_comparison(
    df: pd.DataFrame,
    category: str = None,
    figsize: tuple = (14, 6),
    save_path: str = None,
):
    """Create bar chart comparing models on benchmarks."""
    if len(df) == 0:
        print("No data to plot.")
        return

    # Filter by category if specified
    plot_df = df.copy()
    if category:
        plot_df = plot_df[plot_df["category"] == category]

    if plot_df.empty:
        print(f"No data for category: {category}")
        return

    # Create figure
    fig, ax = plt.subplots(figsize=figsize)

    # Create grouped bar chart
    benchmarks = plot_df["benchmark"].unique()
    models = plot_df["model"].unique()
    x = np.arange(len(benchmarks))
    width = 0.8 / len(models)

    colors = plt.cm.tab10(np.linspace(0, 1, len(models)))

    for i, model in enumerate(models):
        model_data = plot_df[plot_df["model"] == model]
        scores = []
        for b in benchmarks:
            val = model_data[model_data["benchmark"] == b]["score"].values
            scores.append(val[0] if len(val) > 0 else 0)

        offset = (i - len(models) / 2 + 0.5) * width
        bars = ax.bar(
            [xi + offset for xi in x], scores, width, label=model, color=colors[i]
        )

        # Add value labels
        for bar, score in zip(bars, scores):
            if score > 0:
                label = f"{score:.1%}" if score <= 1 else f"{score:.1f}"
                ax.annotate(
                    label,
                    xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
                    ha="center",
                    va="bottom",
                    fontsize=7,
                    rotation=45,
                )

    ax.set_xlabel("Benchmark")
    ax.set_ylabel("Score")
    title = f"Model Comparison{f' - {category} Benchmarks' if category else ''}"
    ax.set_title(title)
    ax.set_xticks(x)
    ax.set_xticklabels(benchmarks, rotation=45, ha="right")
    ax.legend(loc="upper right", bbox_to_anchor=(1.15, 1))
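    # Leave headroom for labels; expand the axis when scores are not in [0, 1]
    ax.set_ylim(0, max(1.1, plot_df["score"].max() * 1.1))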

    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches="tight")
        print(f"Saved: {save_path}")

    plt.show()


if len(df) > 0:
    print("=" * 60)
    print("Benchmark Comparison Charts")
    print("=" * 60)

    # Plot by category
    for category in ["Video", "Image", "PLM"]:
        category_df = df[df["category"] == category]
        if not category_df.empty:
            plot_benchmark_comparison(
                df,
                category=category,
                figsize=(12, 6),
                save_path=os.path.join(
                    OUTPUT_DIR, f"comparison_{category.lower()}.png"
                ),
            )
else:
    print("No results to plot yet.")

## 11. Radar Chart
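
The radar chart places one axis per benchmark category at evenly spaced angles and repeats the first point so that each model's outline closes. A minimal sketch of that geometry using only the standard library; the scores below are made-up illustration values, not results:

```python
import math

# Hypothetical per-category averages for one model (illustrative values only).
category_scores = {"Image": 0.50, "PLM": 0.25, "Video": 0.75}

categories = list(category_scores)
n = len(categories)

# One axis per category, evenly spaced around the circle (radians).
angles = [2 * math.pi * i / n for i in range(n)]
values = [category_scores[c] for c in categories]

# Repeat the first point so the plotted outline returns to its start.
closed_angles = angles + angles[:1]
closed_values = values + values[:1]

for name, angle, value in zip(categories, angles, values):
    print(f"{name:>5}: angle={math.degrees(angle):6.1f} deg, score={value:.2f}")
```

With fewer than three categories the outline collapses to a line or a point, which is why the plotting function bails out early in that case.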

In [None]:
def plot_radar_chart(
    df: pd.DataFrame,
    figsize: tuple = (10, 10),
    save_path: str = None,
):
    """Create radar chart showing model capabilities across benchmark categories."""
    if len(df) == 0:
        print("No data to plot.")
        return

    # Calculate average score per model per category.
    # Categories a model was not evaluated on are filled with 0, so they plot at the center.
    category_scores = (
        df.groupby(["model", "category"])["score"].mean().unstack(fill_value=0)
    )

    # Setup radar chart
    categories = list(category_scores.columns)
    N = len(categories)

    if N < 3:
        print("Need at least 3 categories for radar chart.")
        return

    # Compute angle for each category
    angles = [n / float(N) * 2 * np.pi for n in range(N)]
    angles += angles[:1]  # Close the loop

    # Create figure
    fig, ax = plt.subplots(figsize=figsize, subplot_kw=dict(polar=True))

    # Plot each model
    colors = plt.cm.tab10(np.linspace(0, 1, len(category_scores)))

    for idx, (model, scores) in enumerate(category_scores.iterrows()):
        values = scores.tolist()
        values += values[:1]  # Close the loop

        ax.plot(angles, values, "o-", linewidth=2, label=model, color=colors[idx])
        ax.fill(angles, values, alpha=0.1, color=colors[idx])

    # Set category labels
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories, size=12)

    # Set y-axis limits
    ax.set_ylim(0, 1)

    # Add legend
    ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))

    plt.title("Model Capabilities Across Benchmark Categories", size=14, y=1.08)

    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches="tight")
        print(f"Saved: {save_path}")

    plt.show()


if len(df) > 0:
    plot_radar_chart(df, save_path=os.path.join(OUTPUT_DIR, "radar_chart.png"))
else:
    print("No results to plot yet.")

## 12. Leaderboard
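
The leaderboard's `Overall Avg` is a mean over every benchmark row for a model, so a category with more benchmarks (for example the nine image benchmarks) weighs more than one with fewer. A small sketch contrasting that with a per-category (macro) average, assuming the same `model`, `category`, `benchmark`, and `score` columns used in this notebook; the rows are synthetic, not real results:

```python
import pandas as pd

# Synthetic rows for a hypothetical model (illustrative values only).
toy = pd.DataFrame(
    [
        {"model": "toy-model", "category": "Image", "benchmark": "a", "score": 0.8},
        {"model": "toy-model", "category": "Image", "benchmark": "b", "score": 0.6},
        {"model": "toy-model", "category": "Image", "benchmark": "c", "score": 0.7},
        {"model": "toy-model", "category": "Video", "benchmark": "d", "score": 0.3},
    ]
)

# Row-level mean: every benchmark counts once, so Image dominates.
overall_avg = toy["score"].mean()

# Macro mean: average the per-category means, so each category counts once.
per_category = toy.groupby("category")["score"].mean()
macro_avg = per_category.mean()

print(f"Overall (per-benchmark) average: {overall_avg:.1%}")
print(f"Macro (per-category) average:    {macro_avg:.1%}")
```

Which of the two is the better ranking key depends on whether each benchmark or each category should count equally; the cell below uses the per-benchmark mean.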

In [None]:
def create_leaderboard(df: pd.DataFrame) -> pd.DataFrame:
    """Create ranked leaderboard of models."""
    if len(df) == 0:
        return pd.DataFrame()

    leaderboard_rows = []

    for model in df["model"].unique():
        model_df = df[df["model"] == model]

        row = {"Model": model, "Overall Avg": model_df["score"].mean()}
        # Per-category averages; None marks categories this model was not run on
        for cat in ["Video", "Image", "PLM"]:
            cat_scores = model_df.loc[model_df["category"] == cat, "score"]
            row[f"{cat} Avg"] = cat_scores.mean() if not cat_scores.empty else None
        row["Benchmarks"] = len(model_df)
        leaderboard_rows.append(row)

    leaderboard = pd.DataFrame(leaderboard_rows)
    leaderboard = leaderboard.sort_values("Overall Avg", ascending=False)
    leaderboard = leaderboard.reset_index(drop=True)
    leaderboard.index = leaderboard.index + 1  # Rank starting from 1
    leaderboard.index.name = "Rank"

    # Format percentages
    for col in ["Overall Avg", "Video Avg", "Image Avg", "PLM Avg"]:
        leaderboard[col] = leaderboard[col].apply(
            lambda x: f"{x:.1%}" if pd.notna(x) else "N/A"
        )

    return leaderboard


if len(df) > 0:
    # Create and display leaderboard
    leaderboard = create_leaderboard(df)

    print("\n" + "=" * 60)
    print("MODEL LEADERBOARD")
    print("=" * 60)
    print(tabulate(leaderboard, headers="keys", tablefmt="fancy_grid"))

    # Save leaderboard
    leaderboard_path = os.path.join(OUTPUT_DIR, "leaderboard.csv")
    leaderboard.to_csv(leaderboard_path)
    print(f"\nLeaderboard saved to: {leaderboard_path}")
else:
    print("No results to display yet.")

## 13. Download Results
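
The archive step walks the output directory and stores each file under a path relative to that directory, so the zip unpacks without the absolute Drive or `/content` prefix. A self-contained sketch of that pattern on a temporary directory; the file names are placeholders, and `build_archive` is a hypothetical helper, not part of the notebook:

```python
import os
import tempfile
import zipfile


def build_archive(source_dir: str, archive_path: str) -> list:
    """Zip every file under source_dir using paths relative to source_dir."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for root, _dirs, filenames in os.walk(source_dir):
            for filename in filenames:
                file_path = os.path.join(root, filename)
                zipf.write(file_path, os.path.relpath(file_path, source_dir))
    # Re-open the archive to list what was stored.
    with zipfile.ZipFile(archive_path) as zipf:
        return sorted(zipf.namelist())


with tempfile.TemporaryDirectory() as workdir:
    source_dir = os.path.join(workdir, "results")
    os.makedirs(os.path.join(source_dir, "charts"))
    for rel in ("leaderboard.csv", os.path.join("charts", "radar_chart.png")):
        with open(os.path.join(source_dir, rel), "w") as fh:
            fh.write("placeholder")

    # Keep the archive outside source_dir so it does not try to include itself.
    names = build_archive(source_dir, os.path.join(workdir, "results.zip"))
    print(names)
```

Writing the zip outside the directory being walked avoids the archive picking itself up mid-write.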

In [None]:
import zipfile


def create_results_archive(output_dir: str) -> str:
    """Create a zip archive of all results."""

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    archive_name = f"vlm_evaluation_results_{timestamp}"
    archive_path = f"/content/{archive_name}.zip"

    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(output_dir):
            for file in files:
                file_path = os.path.join(root, file)
                arcname = os.path.relpath(file_path, output_dir)
                zipf.write(file_path, arcname)

    return archive_path


# Create archive
archive_path = create_results_archive(OUTPUT_DIR)
print(f"Archive created: {archive_path}")

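# Download (Colab only); outside Colab the archive simply stays at archive_path
try:
    from google.colab import files

    files.download(archive_path)
    print("\nDownload started. Check your browser's download folder.")
except ImportError:
    print("\nNot running in Colab; copy the archive manually from the path above.")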

## 14. Summary

### Evaluation Complete!

#### Results Summary
- All results saved to Google Drive
- Comparison tables and visualizations generated
- Checkpoint saved for potential resumption

#### Files Generated:
- `results_summary.csv` - Raw results data
- `comparison_table.txt` - Formatted comparison table
- `leaderboard.csv` - Ranked model leaderboard
- `comparison_*.png` - Bar charts by category
- `radar_chart.png` - Radar chart of capabilities
- `evaluation_checkpoint.json` - Checkpoint for resumption

#### Next Steps:
1. Review the comparison charts to identify model strengths
2. Use the leaderboard to select the best model for your use case
3. Consider running additional benchmarks if needed
4. Fine-tune promising models on your specific task
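
Because `leaderboard.csv` stores the averages as formatted strings such as `45.2%` or `N/A`, they need converting back to numbers before any further filtering or ranking. A minimal sketch with the standard `csv` module; the inline CSV text and model names are made-up stand-ins for the saved file, and `parse_percent` is a hypothetical helper:

```python
import csv
import io

# Stand-in for the contents of leaderboard.csv (illustrative values only).
sample_csv = """Rank,Model,Overall Avg,Video Avg,Image Avg,PLM Avg,Benchmarks
1,model-a,52.0%,48.5%,55.5%,N/A,14
2,model-b,47.5%,50.0%,45.0%,N/A,14
"""


def parse_percent(text):
    """Convert '52.0%' to 0.52; return None for 'N/A' or empty cells."""
    text = text.strip()
    if not text or text == "N/A":
        return None
    return float(text.rstrip("%")) / 100


rows = list(csv.DictReader(io.StringIO(sample_csv)))
video_scores = {row["Model"]: parse_percent(row["Video Avg"]) for row in rows}

# Pick the model with the highest video average, skipping missing values.
best_video = max(
    (m for m, s in video_scores.items() if s is not None),
    key=lambda m: video_scores[m],
)
print(best_video, video_scores[best_video])
```

In this toy data the top model overall is not the top model on video, which is the kind of mismatch worth checking before choosing a model for a specific use case.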