# Cross-Model Comparison and Analysis - Google Colab Standalone

This notebook aggregates and compares results from **all model families**.

## Prerequisites:
Run notebooks 01-04 first and download their CSV result files:
- phowhisper_results_*.csv
- whisper_results_*.csv
- wav2vec2_results_*.csv
- wav2vn_results_*.csv (optional - mock results)

## This Notebook Will:
1. Upload and combine all CSV results
2. Identify best models overall and per-dataset
3. Perform statistical analysis (ANOVA)
4. Create comprehensive visualizations
5. Generate deployment recommendations

**Runtime**: A CPU runtime is sufficient (no models are loaded)

## Step 1: Install Dependencies

In [None]:
print('[SETUP] Installing packages...')
!pip install -q pandas numpy matplotlib seaborn scipy
print('[OK] All packages installed!')

## Step 2: Upload Result Files

Upload the CSV files you downloaded from notebooks 01-04:

In [None]:
from google.colab import files
import pandas as pd
import io

print('[INFO] Please upload CSV result files from notebooks 01-04')
print('[INFO] You can select multiple files at once')
print()

uploaded = files.upload()

print(f'\n[OK] Uploaded {len(uploaded)} file(s)')
for filename in uploaded.keys():
    print(f'  - {filename}')
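
Outside Colab, `files.upload()` is not available. It returns a dict that maps each filename to that file's bytes, so the same structure can be built from files on disk. The sketch below assumes the CSVs sit in one local directory; the helper name `load_local_results` and the default glob pattern are illustrative assumptions, not part of notebooks 01-04.

```python
from pathlib import Path


def load_local_results(directory, pattern='*_results_*.csv'):
    """Return {filename: bytes}, mirroring the shape of files.upload()."""
    # Hypothetical helper: the pattern follows the file names listed in the prerequisites.
    return {
        path.name: path.read_bytes()
        for path in sorted(Path(directory).glob(pattern))
    }
```

Assigning the result to `uploaded` (for example `uploaded = load_local_results('.')`) lets the remaining cells run unchanged.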

## Step 3: Load and Combine Results

In [None]:
import io

import pandas as pd

# Load all uploaded CSV files
all_results = []

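for filename, content in uploaded.items():
    try:
        df = pd.read_csv(io.BytesIO(content))
        if df.empty:
            print(f'[WARN] Skipping {filename}: no rows')
            continue

        # Infer the model family from the filename, falling back to the first model name
        first_model = str(df['model'].iloc[0]).lower() if 'model' in df.columns else ''
        if 'phowhisper' in filename.lower() or 'phowhisper' in first_model: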
            df['model_family'] = 'PhoWhisper'
        elif 'whisper' in filename.lower() and 'pho' not in filename.lower():
            df['model_family'] = 'Whisper'
        elif 'wav2vec2' in filename.lower():
            df['model_family'] = 'Wav2Vec2'
        elif 'wav2vn' in filename.lower():
            df['model_family'] = 'Wav2Vn'
        else:
            df['model_family'] = 'Unknown'
        
        all_results.append(df)
        print(f'[OK] Loaded {filename}: {len(df)} rows, family={df["model_family"].iloc[0]}')
    except Exception as e:
        print(f'[ERROR] Failed to load {filename}: {e}')

# Combine all results
if all_results:
    combined_df = pd.concat(all_results, ignore_index=True)
    print(f'\n[OK] Combined {len(all_results)} files into {len(combined_df)} total results')
    print(f'[INFO] Model families: {combined_df["model_family"].unique().tolist()}')
    print(f'[INFO] Datasets: {combined_df["dataset"].unique().tolist()}')
    print(f'[INFO] Models: {len(combined_df["model"].unique())}')
else:
    print('[ERROR] No results loaded!')
    combined_df = pd.DataFrame()
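
The family detection above depends on the order of substring checks, because `phowhisper` contains `whisper`. Pulling the rule into a small function makes that order easy to test in isolation. This is a minimal sketch of the same rule; the function name `infer_model_family` is an assumption for illustration.

```python
def infer_model_family(filename, first_model=''):
    """Map a result filename (and optionally the first model name) to a family label."""
    name = filename.lower()
    model = first_model.lower()
    # 'phowhisper' must be checked before 'whisper', which it contains as a substring.
    if 'phowhisper' in name or 'phowhisper' in model:
        return 'PhoWhisper'
    if 'whisper' in name and 'pho' not in name:
        return 'Whisper'
    if 'wav2vec2' in name:
        return 'Wav2Vec2'
    if 'wav2vn' in name:
        return 'Wav2Vn'
    return 'Unknown'


for example in ['phowhisper_results_a.csv', 'whisper_results_a.csv', 'other.csv']:
    print(example, '->', infer_model_family(example))
```

Files that end up as `Unknown` are still combined, so it is worth checking the printed family for each upload before reading the per-family averages.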

## Step 4: Display Complete Results

In [None]:
if not combined_df.empty:
    print('[INFO] Complete Evaluation Results')
    print('='*100)
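    display_cols = ['model_family', 'model', 'dataset', 'WER', 'CER', 'MER', 'RTF', 'samples_processed']
    # Show only the columns that are actually present in the uploaded results
    display_cols = [c for c in display_cols if c in combined_df.columns]
    print(combined_df[display_cols].to_string(index=False))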
    
    # Summary statistics
    print('\n\n[CHART] Average Performance by Model Family:')
    print('='*80)
    family_avg = combined_df.groupby('model_family')[['WER', 'CER', 'MER', 'RTF']].mean()
    print(family_avg.to_string())
    
    print('\n\n[CHART] Average Performance by Model:')
    print('='*80)
    model_avg = combined_df.groupby('model')[['WER', 'CER', 'MER', 'RTF']].mean().sort_values('WER')
    print(model_avg.to_string())
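
The cell above selects a fixed list of columns. If none of the uploaded files contains one of them, the selection raises a `KeyError`. A quick check of which columns are present can be sketched as follows; `split_columns` and `example_columns` are hypothetical names, and the example list only stands in for `combined_df.columns`.

```python
def split_columns(available, wanted):
    """Return (present, missing) lists, preserving the order of `wanted`."""
    available = set(available)
    present = [c for c in wanted if c in available]
    missing = [c for c in wanted if c not in available]
    return present, missing


wanted = ['model_family', 'model', 'dataset', 'WER', 'CER', 'MER', 'RTF', 'samples_processed']
# Stand-in for combined_df.columns; the real set depends on the uploaded files.
example_columns = ['model', 'dataset', 'WER', 'CER', 'RTF', 'model_family']
present, missing = split_columns(example_columns, wanted)
print('present:', present)
print('missing:', missing)
```

In the notebook, `split_columns(combined_df.columns, wanted)` would give the same two lists for the real data.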

## Step 5: Find Best Models

In [None]:
if not combined_df.empty:
    print('[TARGET] BEST MODELS OVERALL')
    print('='*80)
    
    # Best by WER
    best_wer_idx = combined_df['WER'].idxmin()
    best_wer = combined_df.loc[best_wer_idx]
    print(f'\n[1] Best WER:')
    print(f'    Model: {best_wer["model"]}')
    print(f'    Dataset: {best_wer["dataset"]}')
    print(f'    WER: {best_wer["WER"]:.4f}')
    print(f'    CER: {best_wer["CER"]:.4f}')
    print(f'    RTF: {best_wer["RTF"]:.4f}')
    
    # Best by RTF (fastest)
    best_rtf_idx = combined_df['RTF'].idxmin()
    best_rtf = combined_df.loc[best_rtf_idx]
    print(f'\n[2] Fastest (Best RTF):')
    print(f'    Model: {best_rtf["model"]}')
    print(f'    Dataset: {best_rtf["dataset"]}')
    print(f'    WER: {best_rtf["WER"]:.4f}')
    print(f'    RTF: {best_rtf["RTF"]:.4f}')
    
    # Best balanced: lowest WER * RTF product (both metrics are lower-is-better)
    combined_df['balance_score'] = combined_df['WER'] * combined_df['RTF']
    best_balance_idx = combined_df['balance_score'].idxmin()
    best_balance = combined_df.loc[best_balance_idx]
    print(f'\n[3] Best Balanced (Speed + Accuracy):')
    print(f'    Model: {best_balance["model"]}')
    print(f'    Dataset: {best_balance["dataset"]}')
    print(f'    WER: {best_balance["WER"]:.4f}')
    print(f'    RTF: {best_balance["RTF"]:.4f}')
    
    # Best per dataset
    print('\n\n[LIST] Best Model per Dataset (by WER):')
    print('='*80)
    for dataset in combined_df['dataset'].unique():
        dataset_df = combined_df[combined_df['dataset'] == dataset]
        best_idx = dataset_df['WER'].idxmin()
        best = dataset_df.loc[best_idx]
        print(f'\n{dataset}:')
        print(f'  Model: {best["model"]}')
        print(f'  WER: {best["WER"]:.4f} | CER: {best["CER"]:.4f} | RTF: {best["RTF"]:.4f}')
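
The `balance_score` above collapses WER and RTF into a single product, which depends on the scale of each metric. A complementary view, offered here only as a sketch and not as part of the original analysis, is the Pareto front: the results that no other result beats on both WER and RTF at once. The function name and the toy rows below are assumptions for illustration.

```python
def pareto_front(rows):
    """Return the (name, wer, rtf) rows that are not dominated by any other row.

    A row is dominated if another row is at least as good on both metrics
    and strictly better on at least one (lower is better for both).
    """
    front = []
    for name, wer, rtf in rows:
        dominated = any(
            o_wer <= wer and o_rtf <= rtf and (o_wer < wer or o_rtf < rtf)
            for _, o_wer, o_rtf in rows
        )
        if not dominated:
            front.append((name, wer, rtf))
    return front


# Toy values, not measured results.
toy_rows = [('model_a', 0.10, 0.50), ('model_b', 0.20, 0.20), ('model_c', 0.30, 0.60)]
front = pareto_front(toy_rows)
print(front)
```

In this toy data `model_c` drops out because `model_a` is better on both metrics, while `model_a` and `model_b` each win on one metric and both stay on the front.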

## Step 6: Statistical Analysis

In [None]:
from scipy import stats

if not combined_df.empty and len(combined_df['model_family'].unique()) >= 2:
    print('[INFO] Statistical Significance Testing (ANOVA)')
    print('='*80)
    
    # ANOVA for WER across model families
    # Drop missing WER values first: a single NaN makes f_oneway return NaN
    families = [group['WER'].dropna().values for _, group in combined_df.groupby('model_family')]
    f_stat, p_value = stats.f_oneway(*families)
    
    print(f'\nWER across Model Families:')
    print(f'  F-statistic: {f_stat:.4f}')
    print(f'  P-value: {p_value:.6f}')
    
    if p_value < 0.05:
        print(f'  Result: Statistically SIGNIFICANT difference (p < 0.05)')
        print(f'  Interpretation: Mean WER differs between at least two model families')
    else:
        print(f'  Result: No significant difference detected (p >= 0.05)')
        print(f'  Interpretation: No evidence that model families differ (this does not show they perform the same)')
    
    # Correlation analysis
    print(f'\n\n[CHART] Metric Correlations:')
    metric_cols = ['WER', 'CER', 'MER', 'WIL', 'WIP', 'SER', 'RTF']
    available_metrics = [m for m in metric_cols if m in combined_df.columns]
    correlation = combined_df[available_metrics].corr()
    print(correlation.to_string())
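
To make the F-statistic reported above less opaque, the following sketch computes the one-way ANOVA ratio by hand: the between-group mean square divided by the within-group mean square. It is meant as an illustration of what `stats.f_oneway` measures, using toy numbers rather than real WER values; the function name `one_way_f` is an assumption.

```python
def one_way_f(groups):
    """One-way ANOVA F-statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each observation around its own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_f(toy_groups)
print(f'F = {f_value:.4f}')  # F = 13.0000
```

A large F means the family averages are spread out relative to the scatter inside each family. With only a few results per family, the within-group estimate is noisy, so the p-value should be read with caution.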

## Step 7: Comprehensive Visualizations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

if not combined_df.empty:
    # Plot 1: WER Comparison
    fig, ax = plt.subplots(figsize=(16, 8))
    pivot_wer = combined_df.pivot_table(index='dataset', columns='model', values='WER', aggfunc='mean')
    pivot_wer.plot(kind='bar', ax=ax, width=0.8)
    ax.set_title('Word Error Rate (WER) - All Models Comparison', fontsize=16, fontweight='bold')
    ax.set_xlabel('Dataset', fontsize=13)
    ax.set_ylabel('WER (Lower is Better)', fontsize=13)
    ax.legend(title='Model', bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=9)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Plot 2: Model Family Boxplots
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    sns.boxplot(data=combined_df, x='model_family', y='WER', ax=axes[0, 0])
    axes[0, 0].set_title('WER Distribution by Model Family', fontweight='bold')
    axes[0, 0].set_ylabel('WER')
    
    sns.boxplot(data=combined_df, x='model_family', y='CER', ax=axes[0, 1])
    axes[0, 1].set_title('CER Distribution by Model Family', fontweight='bold')
    axes[0, 1].set_ylabel('CER')
    
    sns.boxplot(data=combined_df, x='model_family', y='RTF', ax=axes[1, 0])
    axes[1, 0].set_title('RTF Distribution by Model Family', fontweight='bold')
    axes[1, 0].set_ylabel('RTF')
    axes[1, 0].axhline(y=1.0, color='r', linestyle='--', linewidth=1, label='Real-time')
    axes[1, 0].legend()
    
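    # WIP is optional in the result files, so only plot it when present
    if 'WIP' in combined_df.columns:
        sns.boxplot(data=combined_df, x='model_family', y='WIP', ax=axes[1, 1])
        axes[1, 1].set_title('WIP Distribution by Model Family', fontweight='bold')
        axes[1, 1].set_ylabel('WIP (Higher is Better)')
    else:
        axes[1, 1].set_visible(False)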
    
    for ax in axes.flat:
        ax.set_xlabel('Model Family')
    
    plt.tight_layout()
    plt.show()
    
    # Plot 3: Speed vs Accuracy Scatter
    fig, ax = plt.subplots(figsize=(14, 8))
    
    for family in combined_df['model_family'].unique():
        family_df = combined_df[combined_df['model_family'] == family]
        ax.scatter(family_df['RTF'], family_df['WER'], label=family, s=100, alpha=0.6)
    
    ax.axvline(x=1.0, color='red', linestyle='--', linewidth=2, alpha=0.5, label='Real-time threshold')
    ax.set_xlabel('Real-Time Factor (RTF) - Lower is Faster', fontsize=13)
    ax.set_ylabel('Word Error Rate (WER) - Lower is Better', fontsize=13)
    ax.set_title('Speed vs Accuracy Trade-off: RTF vs WER', fontsize=16, fontweight='bold')
    ax.legend(loc='best')
    ax.grid(True, alpha=0.3)
    ax.text(0.05, 0.95, 'IDEAL\n(Fast + Accurate)', transform=ax.transAxes, 
           fontsize=12, verticalalignment='top', 
           bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.3))
    plt.tight_layout()
    plt.show()
    
    # Plot 4: Comprehensive Heatmap
    fig, ax = plt.subplots(figsize=(18, max(12, len(combined_df) * 0.4)))
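    # Use only the metric columns that exist in the uploaded results.
    # Note: the reversed colormap marks high values red, so WIP (higher is better) reads inverted.
    heatmap_cols = [c for c in ['WER', 'CER', 'MER', 'WIL', 'WIP', 'SER', 'RTF'] if c in combined_df.columns]
    heatmap_data = combined_df.set_index(['model', 'dataset'])[heatmap_cols]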
    sns.heatmap(heatmap_data, annot=True, fmt='.3f', cmap='RdYlGn_r', 
                cbar_kws={'label': 'Metric Value'}, ax=ax, linewidths=0.5)
    ax.set_title('Comprehensive Metrics Heatmap - All Models & Datasets', fontsize=16, fontweight='bold')
    ax.set_xlabel('Metric', fontsize=13)
    ax.set_ylabel('Model + Dataset', fontsize=13)
    plt.tight_layout()
    plt.show()
    
    print('[OK] All visualizations generated!')
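
The heatmap above shares one color scale across all metrics, so a metric with a wider range (RTF can exceed 1.0, for instance) can wash out the differences in the others. One possible remedy, sketched here under that assumption, is to rescale each metric to [0, 1] before coloring while still annotating the raw values. The helper name and toy numbers are illustrative only.

```python
def min_max_by_metric(table):
    """Rescale each metric's values to [0, 1]; constant columns map to 0.0."""
    scaled = {}
    for metric, values in table.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        scaled[metric] = [(v - lo) / span if span else 0.0 for v in values]
    return scaled


# Toy values, not measured results.
toy = {'WER': [0.10, 0.20, 0.30], 'RTF': [0.5, 2.5, 1.5]}
scaled = min_max_by_metric(toy)
print(scaled)
```

On a pandas DataFrame the same rescaling is roughly `(df - df.min()) / (df.max() - df.min())`, which could be passed as the heatmap data with the original frame supplied through `annot`.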

## Step 8: Production Deployment Recommendations

In [None]:
if not combined_df.empty:
    print('[TARGET] PRODUCTION DEPLOYMENT RECOMMENDATIONS')
    print('='*80)
    
    # Best for accuracy
    model_wer_avg = combined_df.groupby('model')['WER'].mean().sort_values()
    best_accuracy = model_wer_avg.index[0]
    best_accuracy_wer = model_wer_avg.iloc[0]
    
    # Best for speed
    model_rtf_avg = combined_df.groupby('model')['RTF'].mean().sort_values()
    best_speed = model_rtf_avg.index[0]
    best_speed_rtf = model_rtf_avg.iloc[0]
    
    # Best balanced
    model_scores = combined_df.groupby('model').agg({'WER': 'mean', 'RTF': 'mean'})
    model_scores['balance'] = model_scores['WER'] * model_scores['RTF']
    best_balanced = model_scores['balance'].idxmin()
    balanced_wer = model_scores.loc[best_balanced, 'WER']
    balanced_rtf = model_scores.loc[best_balanced, 'RTF']
    
    print(f'\n1. BEST FOR ACCURACY (Lowest WER):')
    print(f'   Recommendation: {best_accuracy}')
    print(f'   Average WER: {best_accuracy_wer:.4f}')
    print(f'   Use case: Offline transcription, high accuracy requirements')
    
    print(f'\n2. BEST FOR SPEED (Lowest RTF):')
    print(f'   Recommendation: {best_speed}')
    print(f'   Average RTF: {best_speed_rtf:.4f}')
    print(f'   Use case: Real-time transcription, latency-sensitive applications')
    
    print(f'\n3. BEST BALANCED (Speed + Accuracy):')
    print(f'   Recommendation: {best_balanced}')
    print(f'   Average WER: {balanced_wer:.4f}')
    print(f'   Average RTF: {balanced_rtf:.4f}')
    print(f'   Use case: General-purpose transcription')
    
    print(f'\n4. DATASET-SPECIFIC RECOMMENDATIONS:')
    for dataset in combined_df['dataset'].unique():
        dataset_df = combined_df[combined_df['dataset'] == dataset]
        best_model = dataset_df.loc[dataset_df['WER'].idxmin(), 'model']
        best_wer = dataset_df['WER'].min()
        print(f'   {dataset}: {best_model} (WER: {best_wer:.4f})')
    
    print(f'\n' + '='*80)
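
For the real-time use case, the lowest average RTF alone may pick a model that is fast but inaccurate. A constraint-based alternative, shown here only as a sketch, is to keep the models that run faster than real time (RTF below 1.0, the threshold drawn in the plots) and then choose the lowest WER among them. The function name, the `max_rtf` parameter, and the toy rows are assumptions for illustration.

```python
def best_under_rtf(rows, max_rtf=1.0):
    """Return the (model, wer, rtf) row with the lowest WER among rows with rtf < max_rtf.

    Returns None when no row satisfies the speed constraint.
    """
    eligible = [row for row in rows if row[2] < max_rtf]
    return min(eligible, key=lambda row: row[1]) if eligible else None


# Toy values, not measured results.
toy_rows = [('model_a', 0.08, 1.40), ('model_b', 0.12, 0.30), ('model_c', 0.15, 0.10)]
choice = best_under_rtf(toy_rows)
print(choice)
```

Here `model_a` has the best WER but is slower than real time, so the constraint selects `model_b`. Rows in this shape could be built from the per-model averages, for example with `model_scores[['WER', 'RTF']].itertuples(name=None)`.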

## Step 9: Save Combined Results

In [None]:
from datetime import datetime

if not combined_df.empty:
    # Save combined results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f'/content/combined_results_{timestamp}.csv'
    combined_df.to_csv(output_file, index=False)
    
    print(f'[OK] Combined results saved to: {output_file}')
    print(f'[INFO] Total rows: {len(combined_df)}')
    print(f'[INFO] Columns: {list(combined_df.columns)}')
    
    # Download
    try:
        files.download(output_file)
        print(f'\n[OK] File ready for download!')
    except Exception:
        print(f'\n[INFO] File saved at: {output_file}')

## Summary

### What This Notebook Did:
1. Uploaded and combined CSV results from notebooks 01-04
2. Identified best models overall and per-dataset
3. Performed statistical significance testing (ANOVA)
4. Created comprehensive comparison visualizations
5. Generated production deployment recommendations
6. Saved combined results to CSV

### Key Insights:
- **Best Overall**: (See recommendations above)
- **Fastest**: (See speed analysis)
- **Best Balanced**: (See balanced recommendations)
- **Statistical Significance**: (See ANOVA results)

### Files Generated:
- combined_results_*.csv - All results in one file

---

**Vietnamese ASR Evaluation Framework v1.0 - Cross-Model Comparison Edition**