# PhoWhisper Model Evaluation - Google Colab Standalone

This notebook evaluates **5 PhoWhisper models** on Vietnamese datasets.

## Models Evaluated:
- vinai/PhoWhisper-tiny
- vinai/PhoWhisper-base
- vinai/PhoWhisper-small
- vinai/PhoWhisper-medium
- vinai/PhoWhisper-large

## Features:
- Complete standalone execution (no external files needed)
- Downloads datasets from HuggingFace automatically
- Calculates comprehensive metrics (WER, CER, MER, WIL, WIP, SER, RTF)
- Generates visualizations
- Exports results as CSV for cross-model comparison

**Runtime**: GPU recommended (T4 or better)

## Step 1: Install Dependencies

In [None]:
print('[SETUP] Installing required packages...')
!pip install -q transformers==4.57.1 torch torchcodec torchaudio librosa soundfile jiwer datasets accelerate pandas matplotlib seaborn scipy tqdm
print('[OK] All packages installed successfully!')

# Check GPU availability
import torch
print(f'\n[INFO] CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'[INFO] GPU: {torch.cuda.get_device_name(0)}')
else:
    print('[WARNING] Running on CPU - evaluation will be slower')

[SETUP] Installing required packages...
[OK] All packages installed successfully!

[INFO] CUDA available: True
[INFO] GPU: NVIDIA GeForce RTX 3050 Laptop GPU


## Step 2: Embedded Helper Functions

All necessary code from the Vietnamese ASR framework is embedded below:

In [2]:
# Embedded Vietnamese ASR Evaluation Code
# This cell contains all necessary functions from src/ modules

import time
import warnings
import numpy as np
import pandas as pd
from pathlib import Path
from dataclasses import dataclass
from typing import List, Dict, Optional
from tqdm.auto import tqdm

warnings.filterwarnings('ignore')

# ============================================================================
# METRICS MODULE
# ============================================================================

from jiwer import wer, cer, mer, wil, wip, process_words

class ASRMetrics:
    """Metrics calculator for ASR evaluation."""
    
    @staticmethod
    def calculate_all_metrics(references: List[str], hypotheses: List[str]) -> Dict:
        """Calculate all metrics for batch of utterances."""
        # Join all references and hypotheses
        ref_text = ' '.join(references)
        hyp_text = ' '.join(hypotheses)
        
        # Calculate error details
        output = process_words(ref_text, hyp_text)
        
        return {
            'wer': wer(ref_text, hyp_text),
            'cer': cer(ref_text, hyp_text),
            'mer': mer(ref_text, hyp_text),
            'wil': wil(ref_text, hyp_text),
            'wip': wip(ref_text, hyp_text),
            'ser': sum(1 for r, h in zip(references, hypotheses) if r != h) / len(references),
            'insertions': output.insertions,
            'deletions': output.deletions,
            'substitutions': output.substitutions
        }

class RTFTimer:
    """Context manager for measuring Real-Time Factor."""
    def __init__(self):
        self.start_time = None
        self.elapsed_time = None
    
    def __enter__(self):
        self.start_time = time.time()
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.elapsed_time = time.time() - self.start_time

# ============================================================================
# DATASET LOADER MODULE
# ============================================================================

from datasets import load_dataset, Audio
import soundfile as sf
import tempfile

@dataclass
class AudioSample:
    """Data structure for audio samples."""
    audio_path: str
    transcription: str
    duration: float = 0.0
    sample_rate: int = 16000
    dataset: str = ''
    split: str = ''
    speaker_id: Optional[str] = None
    metadata: Optional[Dict] = None

def load_huggingface_dataset(dataset_name: str, max_samples: int = None) -> Dict[str, List[AudioSample]]:
    """
    Load dataset from HuggingFace Hub.
    
    Supported datasets:
    - ViMD: nguyendv02/ViMD_Dataset
    - BUD500: linhtran92/viet_bud500
    - LSVSC: doof-ferb/LSVSC
    - VLSP2020: doof-ferb/vlsp2020_vinai_100h
    - VietMed: leduckhai/VietMed
    """
    # Dataset configurations
    configs = {
        'ViMD': {'id': 'nguyendv02/ViMD_Dataset', 'splits': ['train', 'test', 'valid'], 'audio_col': 'audio', 'text_col': 'text'},
        'BUD500': {'id': 'linhtran92/viet_bud500', 'splits': ['train', 'validation', 'test'], 'audio_col': 'audio', 'text_col': 'transcription'},
        'LSVSC': {'id': 'doof-ferb/LSVSC', 'splits': ['train', 'validation', 'test'], 'audio_col': 'audio', 'text_col': 'transcription'},
        'VLSP2020': {'id': 'doof-ferb/vlsp2020_vinai_100h', 'splits': ['train'], 'audio_col': 'audio', 'text_col': 'transcription'},
        'VietMed': {'id': 'leduckhai/VietMed', 'splits': ['train', 'test', 'dev'], 'audio_col': 'audio', 'text_col': 'text'}
    }
    
    if dataset_name not in configs:
        raise ValueError(f"Unknown dataset: {dataset_name}. Available: {list(configs.keys())}")
    
    config = configs[dataset_name]
    print(f"[INFO] Loading {dataset_name} from HuggingFace Hub...")
    
    samples_by_split = {'train': [], 'val': [], 'test': []}
    temp_dir = Path(tempfile.gettempdir()) / 'asr_audio' / dataset_name
    temp_dir.mkdir(parents=True, exist_ok=True)
    
    for split in config['splits']:
        try:
            # Load split
            print(f"  Loading {split} split...")
            dataset = load_dataset(config['id'], split=split, trust_remote_code=True)
            
            # Cast audio to 16kHz
            if config['audio_col'] in dataset.column_names:
                dataset = dataset.cast_column(config['audio_col'], Audio(sampling_rate=16000))
            
            # Limit samples if specified
            if max_samples and len(dataset) > max_samples:
                dataset = dataset.select(range(max_samples))
            
            # Convert to AudioSample objects
            samples = []
            for idx, item in enumerate(tqdm(dataset, desc=f"{split}", leave=False)):
                try:
                    audio_data = item[config['audio_col']]
                    
                    # Save audio to temporary file
                    audio_path = str(temp_dir / f"{split}_{idx}.wav")
                    sf.write(audio_path, audio_data['array'], audio_data['sampling_rate'])
                    
                    # Get duration
                    duration = len(audio_data['array']) / audio_data['sampling_rate']
                    
                    # Get transcription
                    transcription = str(item[config['text_col']]).strip().lower()
                    
                    sample = AudioSample(
                        audio_path=audio_path,
                        transcription=transcription,
                        duration=duration,
                        sample_rate=audio_data['sampling_rate'],
                        dataset=dataset_name,
                        split=split
                    )
                    samples.append(sample)
                except Exception as e:
                    print(f"    [WARNING] Failed to process sample {idx}: {e}")
                    continue
            
            # Map to standard split names
            if split in ['train', 'training']:
                samples_by_split['train'].extend(samples)
            elif split in ['val', 'validation', 'dev', 'valid']:
                samples_by_split['val'].extend(samples)
            elif split in ['test', 'testing']:
                samples_by_split['test'].extend(samples)
            
            print(f"  [OK] Loaded {len(samples)} samples from {split}")
        except Exception as e:
            print(f"  [WARNING] Failed to load {split}: {e}")
    
    return samples_by_split

# ============================================================================
# MODEL EVALUATOR MODULE
# ============================================================================

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

class PhoWhisperModel:
    """PhoWhisper model wrapper."""
    
    def __init__(self, model_id: str):
        self.model_id = model_id
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.processor = None
        self.model = None
    
    def load_model(self):
        """Load model from HuggingFace."""
        print(f"[INFO] Loading {self.model_id}...")
        try:
            self.processor = WhisperProcessor.from_pretrained(self.model_id)
            self.model = WhisperForConditionalGeneration.from_pretrained(self.model_id)
            self.model.to(self.device)
            self.model.eval()
            print(f"[OK] Model loaded on {self.device}")
        except Exception as e:
            print(f"[ERROR] Failed to load model: {e}")
            raise
    
    def transcribe(self, audio_path: str) -> str:
        """Transcribe audio file."""
        try:
            # Load audio
            audio, sr = librosa.load(audio_path, sr=16000)
            
            # Process audio
            input_features = self.processor(
                audio,
                sampling_rate=16000,
                return_tensors="pt"
            ).input_features.to(self.device)
            
            # Generate transcription
            with torch.no_grad():
                predicted_ids = self.model.generate(input_features)
            
            transcription = self.processor.batch_decode(
                predicted_ids,
                skip_special_tokens=True
            )[0]
            
            return transcription.strip().lower()
        except Exception as e:
            print(f"[ERROR] Transcription failed: {e}")
            return ""

print('[OK] All helper functions loaded successfully!')

  from .autonotebook import tqdm as notebook_tqdm


[OK] All helper functions loaded successfully!


## Step 3: Configuration

In [3]:
from datetime import datetime

# Model configuration - Select models to evaluate
MODELS_TO_TEST = [
    'vinai/PhoWhisper-tiny',
    'vinai/PhoWhisper-base',
    'vinai/PhoWhisper-small',
    # 'vinai/PhoWhisper-medium',  # Uncomment if you have enough GPU memory (16GB+)
    # 'vinai/PhoWhisper-large',   # Uncomment if you have enough GPU memory (24GB+)
]

# Dataset configuration - Select datasets to evaluate
DATASETS_TO_TEST = [
    'ViMD',
    # 'BUD500',     # Uncomment to include (slower)
    # 'LSVSC',      # Uncomment to include (slower)
    # 'VLSP2020',   # Uncomment to include (slower)
    # 'VietMed',    # Uncomment to include (slower)
]

# Sampling configuration
MAX_SAMPLES_PER_SPLIT = 50  # Limit samples for faster evaluation. Set to None for full dataset

# Output configuration
TIMESTAMP = datetime.now().strftime("%Y%m%d_%H%M%S")
OUTPUT_CSV = f"/content/phowhisper_results_{TIMESTAMP}.csv"

print(f'[CONFIG] Models to evaluate: {len(MODELS_TO_TEST)}')
print(f'[CONFIG] Datasets to evaluate: {len(DATASETS_TO_TEST)}')
print(f'[CONFIG] Max samples per split: {MAX_SAMPLES_PER_SPLIT or "All"}')
print(f'[CONFIG] Output file: {OUTPUT_CSV}')

[CONFIG] Models to evaluate: 3
[CONFIG] Datasets to evaluate: 1
[CONFIG] Max samples per split: 50
[CONFIG] Output file: /content/phowhisper_results_20251102_213332.csv


## Step 4: Load Datasets

In [None]:
# Load all datasets
datasets_loaded = {}

for dataset_name in DATASETS_TO_TEST:
    print(f"\n{'='*60}")
    print(f"Loading dataset: {dataset_name}")
    print(f"{'='*60}")
    
    try:
        splits = load_huggingface_dataset(dataset_name, max_samples=MAX_SAMPLES_PER_SPLIT)
        datasets_loaded[dataset_name] = splits
        
        print(f"\n[OK] {dataset_name} loaded:")
        print(f"  - Train: {len(splits['train'])} samples")
        print(f"  - Val: {len(splits['val'])} samples")
        print(f"  - Test: {len(splits['test'])} samples")
    except Exception as e:
        print(f"[ERROR] Failed to load {dataset_name}: {e}")
        datasets_loaded[dataset_name] = {'train': [], 'val': [], 'test': []}

print(f"\n[OK] Successfully loaded {len(datasets_loaded)} datasets")

`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'nguyendv02/ViMD_Dataset' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.



Loading dataset: ViMD
[INFO] Loading ViMD from HuggingFace Hub...
  Loading train split...


Repo card metadata block was not found. Setting CardData to empty.
Downloading data: 100%|██████████| 103/103 [00:00<00:00, 119.83files/s]
Generating train split: 2190 examples [00:54, 39.96 examples/s]
`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'nguyendv02/ViMD_Dataset' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.


  Loading test split...


Repo card metadata block was not found. Setting CardData to empty.
Downloading data: 100%|██████████| 103/103 [00:00<00:00, 242.08files/s]
Generating train split: 0 examples [00:00, ? examples/s]

## Step 5: Run Evaluation

This will evaluate all models on all datasets. **This may take 30-60 minutes depending on GPU and dataset size.**

In [None]:
# Main evaluation loop
results = []
metrics_calculator = ASRMetrics()

total_start_time = time.time()

for model_id in MODELS_TO_TEST:
    print(f"\n\n{'='*70}")
    print(f"EVALUATING MODEL: {model_id}")
    print(f"{'='*70}\n")
    
    # Load model
    try:
        model = PhoWhisperModel(model_id)
        model.load_model()
    except Exception as e:
        print(f"[ERROR] Skipping {model_id}: {e}")
        continue
    
    # Evaluate on each dataset
    for dataset_name, splits in datasets_loaded.items():
        # Use test split for evaluation
        test_samples = splits['test']
        
        if not test_samples:
            print(f"[WARNING] No test samples for {dataset_name}, skipping...")
            continue
        
        print(f"\n[INFO] Evaluating on {dataset_name} ({len(test_samples)} samples)...")
        
        # Transcribe all samples
        references = []
        hypotheses = []
        audio_durations = []
        processing_times = []
        
        for sample in tqdm(test_samples, desc=f"{dataset_name}", leave=False):
            try:
                # Transcribe with timing
                with RTFTimer() as timer:
                    hypothesis = model.transcribe(sample.audio_path)
                
                references.append(sample.transcription)
                hypotheses.append(hypothesis)
                audio_durations.append(sample.duration)
                processing_times.append(timer.elapsed_time)
            except Exception as e:
                print(f"  [WARNING] Failed to process sample: {e}")
                continue
        
        # Calculate metrics
        if references and hypotheses:
            metrics = metrics_calculator.calculate_all_metrics(references, hypotheses)
            
            # Calculate RTF
            total_audio_duration = sum(audio_durations)
            total_processing_time = sum(processing_times)
            rtf = total_processing_time / total_audio_duration if total_audio_duration > 0 else 0
            
            # Store results
            result = {
                'model': model_id,
                'dataset': dataset_name,
                'samples_processed': len(references),
                'WER': metrics['wer'],
                'CER': metrics['cer'],
                'MER': metrics['mer'],
                'WIL': metrics['wil'],
                'WIP': metrics['wip'],
                'SER': metrics['ser'],
                'RTF': rtf,
                'insertions': metrics['insertions'],
                'deletions': metrics['deletions'],
                'substitutions': metrics['substitutions'],
                'total_audio_duration': total_audio_duration,
                'total_processing_time': total_processing_time
            }
            results.append(result)
            
            print(f"  [OK] WER: {metrics['wer']:.4f} | CER: {metrics['cer']:.4f} | RTF: {rtf:.4f}")
        else:
            print(f"  [WARNING] No valid results for {dataset_name}")
    
    # Free memory
    del model
    torch.cuda.empty_cache()

total_evaluation_time = time.time() - total_start_time

print(f"\n\n{'='*70}")
print(f"EVALUATION COMPLETE!")
print(f"{'='*70}")
print(f"Total time: {total_evaluation_time:.2f}s ({total_evaluation_time/60:.2f} minutes)")
print(f"Total results: {len(results)}")

## Step 6: Display Results

In [None]:
# Create results DataFrame
results_df = pd.DataFrame(results)

# Display all results
print("\n[INFO] Complete Evaluation Results:")
print("="*100)
display_cols = ['model', 'dataset', 'WER', 'CER', 'MER', 'RTF', 'samples_processed']
print(results_df[display_cols].to_string(index=False))

# Summary statistics
print("\n\n[CHART] Average Performance by Model:")
print("="*80)
model_avg = results_df.groupby('model')[['WER', 'CER', 'MER', 'RTF']].mean()
print(model_avg.to_string())

# Find best model
best_idx = results_df['WER'].idxmin()
best = results_df.loc[best_idx]
print("\n\n[TARGET] Best Model (Lowest WER):")
print("="*60)
print(f"Model: {best['model']}")
print(f"Dataset: {best['dataset']}")
print(f"WER: {best['WER']:.4f}")
print(f"CER: {best['CER']:.4f}")
print(f"RTF: {best['RTF']:.4f}")

## Step 7: Visualizations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# Plot 1: WER Comparison
plt.figure(figsize=(14, 6))
pivot_wer = results_df.pivot(index='dataset', columns='model', values='WER')
pivot_wer.plot(kind='bar', ax=plt.gca())
plt.title('Word Error Rate (WER) Comparison - PhoWhisper Models', fontsize=14, fontweight='bold')
plt.xlabel('Dataset', fontsize=12)
plt.ylabel('WER (Lower is Better)', fontsize=12)
plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Plot 2: CER Comparison
plt.figure(figsize=(14, 6))
pivot_cer = results_df.pivot(index='dataset', columns='model', values='CER')
pivot_cer.plot(kind='bar', ax=plt.gca())
plt.title('Character Error Rate (CER) Comparison - PhoWhisper Models', fontsize=14, fontweight='bold')
plt.xlabel('Dataset', fontsize=12)
plt.ylabel('CER (Lower is Better)', fontsize=12)
plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Plot 3: RTF Comparison
plt.figure(figsize=(14, 6))
pivot_rtf = results_df.pivot(index='dataset', columns='model', values='RTF')
pivot_rtf.plot(kind='bar', ax=plt.gca())
plt.axhline(y=1.0, color='r', linestyle='--', label='Real-time threshold')
plt.title('Real-Time Factor (RTF) Comparison - PhoWhisper Models', fontsize=14, fontweight='bold')
plt.xlabel('Dataset', fontsize=12)
plt.ylabel('RTF (Lower is Better, <1.0 = Real-time)', fontsize=12)
plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Plot 4: Metrics Heatmap
plt.figure(figsize=(16, 10))
heatmap_data = results_df.set_index(['model', 'dataset'])[['WER', 'CER', 'MER', 'WIL', 'WIP', 'SER', 'RTF']]
sns.heatmap(heatmap_data, annot=True, fmt='.4f', cmap='RdYlGn_r', cbar_kws={'label': 'Metric Value'})
plt.title('All Metrics Heatmap - PhoWhisper Models', fontsize=14, fontweight='bold')
plt.xlabel('Metric', fontsize=12)
plt.ylabel('Model + Dataset', fontsize=12)
plt.tight_layout()
plt.show()

## Step 8: Export Results to CSV

**Download this CSV file** to use in the cross-model comparison notebook (05).

In [None]:
# Save to CSV
results_df.to_csv(OUTPUT_CSV, index=False)
print(f"[OK] Results saved to: {OUTPUT_CSV}")
print(f"\n[INFO] Download this file to use in notebook 05 (cross-model comparison)")
print(f"[INFO] File location: {OUTPUT_CSV}")

# Display download link (for Colab)
try:
    from google.colab import files
    print("\n[ACTION] Click below to download the results:")
    files.download(OUTPUT_CSV)
except ImportError:
    print("[INFO] Not running in Colab - file saved locally")

## Summary

### What This Notebook Did:
1. Installed all required packages
2. Loaded datasets from HuggingFace Hub
3. Downloaded and evaluated PhoWhisper models
4. Calculated comprehensive metrics (WER, CER, MER, WIL, WIP, SER, RTF)
5. Generated visualization plots
6. Exported results to CSV

### Next Steps:
1. Download the CSV file from `/content/phowhisper_results_*.csv`
2. Run notebooks 02 (Whisper), 03 (Wav2Vec2), 04 (Wav2Vn) similarly
3. Upload all CSV files to notebook 05 for cross-model comparison

### Key Findings:
- Best model: (See results above)
- Average WER: (See summary statistics)
- Real-time capable: (Models with RTF < 1.0)

---

**Vietnamese ASR Evaluation Framework v1.0 - Standalone Colab Edition**