# Vietnamese ASR Evaluation - HuggingFace Integration

This notebook demonstrates how to use HuggingFace datasets with the Vietnamese ASR evaluation pipeline.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/quangnt03/vietnamese-asr-benchmark/blob/main/huggingface_integration_example.ipynb)

**Note**: This notebook is compatible with Google Colab. When run on Colab, the setup cells below clone the repository and install its dependencies automatically.

## [CONFIG] Google Colab Setup

Run these cells if you're using Google Colab. They will:
1. Detect whether the notebook is running on Colab
2. Clone the repository and switch into its directory
3. Install all required dependencies
4. Verify that the key project files are present

In [1]:
# Check if running on Google Colab
try:
    import google.colab
    IN_COLAB = True
    print("[OK] Running on Google Colab")
except ImportError:
    IN_COLAB = False
    print("[OK] Running locally")

[OK] Running locally


In [2]:
# Clone repository and install dependencies (Colab only)
if IN_COLAB:
    print("Setting up environment for Google Colab...\n")
    
    # Clone the repository
    print("Cloning repository...")
    !git clone https://github.com/quangnt03/vietnamese-asr-benchmark.git
    
    # Change to repository directory
    import os
    os.chdir('vietnamese-asr-benchmark')
    print(f"[OK] Changed directory to: {os.getcwd()}")
    
    # Install dependencies
    print("\n[PACKAGE] Installing dependencies...")
    !pip install -q -r requirements.txt
    
    print("\n[OK] Setup complete! You can now run the notebook.")
else:
    print("Skipping Colab setup (running locally)")

Skipping Colab setup (running locally)


In [3]:
# Verify installation
if IN_COLAB:
    import sys
    from pathlib import Path
    
    print("Verifying installation...")
    print(f"Python version: {sys.version.split()[0]}")
    print(f"Working directory: {Path.cwd()}")
    
    # Check if key files exist in new structure
    key_files = ['src/metrics.py', 'src/dataset_loader.py',
                 'src/model_evaluator.py', 'src/visualization.py',
                 'scripts/fetch_datasets.py', 'configs/dataset_profile.json']
    for file in key_files:
        if Path(file).exists():
            print(f"  [OK] {file}")
        else:
            print(f"  [FAILED] {file} - NOT FOUND")
    
    print("\n[OK] Verification complete")

## 1. Setup and Imports

In [4]:
import sys
import pandas as pd
import numpy as np
from pathlib import Path

# Add the parent of the current working directory to sys.path so that the
# `src` and `scripts` packages can be imported when running locally
sys.path.insert(0, str(Path.cwd().parent))

# Import custom modules
from src.metrics import ASRMetrics, format_metrics_report
from src.dataset_loader import DatasetManager, HuggingFaceDatasetLoader
from src.model_evaluator import ModelEvaluator, ModelFactory
from src.visualization import ASRVisualizer
from scripts.fetch_datasets import HuggingFaceDatasetFetcher
from datasets import load_dataset


## 2. List Available HuggingFace Datasets

In [None]:
import os
# Change the working directory to the parent directory (expected to be the repository root)
os.chdir(Path.cwd().parent)
os.getcwd()

'/media/pc/shared'

In [11]:
# Make sure the current working directory (expected to be the repository root) is importable
sys.path.insert(0, str(Path.cwd()))

fetcher = HuggingFaceDatasetFetcher(cache_dir="./data/huggingface_cache", config_file="configs/dataset_profile.json")

# List available datasets
fetcher.list_available_datasets()

FileNotFoundError: Dataset configuration file not found: configs/dataset_profile.json
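The `FileNotFoundError` above most likely means that the relative path `configs/dataset_profile.json` was resolved against a working directory other than the repository root. A minimal sketch of one way to locate the root before constructing the fetcher is shown here. `find_repo_root` is a hypothetical helper, not part of the repository, and it assumes the config file sits at `configs/dataset_profile.json` under the root.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "configs/dataset_profile.json") -> Optional[Path]:
    """Hypothetical helper: walk upward from `start` until `marker` is found.

    Returns the first directory that contains the marker file, or None if
    no ancestor contains it.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None


repo_root = find_repo_root(Path.cwd())
if repo_root is None:
    print("Config file not found; check that you are inside the repository checkout.")
else:
    print(f"Repository root: {repo_root}")
```

If a root is found, calling `os.chdir(repo_root)` before creating `HuggingFaceDatasetFetcher` should let the relative `config_file` and `cache_dir` paths resolve as intended.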

## 3. Fetch Datasets from HuggingFace

### Option A: Fetch a single dataset (quick test)

In [None]:
# Fetch Common Voice Vietnamese (limited to 10 samples for quick test)
datasets = fetcher.fetch_dataset(
    dataset_key='common_voice_vi',
    splits=['test'],  # Only test split
    max_samples=10,
    save_to_disk=True
)

print(f"\nFetched {sum(len(ds) for ds in datasets.values())} samples")

### Option B: Fetch multiple datasets

In [None]:
# Fetch multiple datasets (10 samples each for testing)
dataset_results = fetcher.fetch_multiple_datasets(
    dataset_keys=['common_voice_vi', 'vivos'],
    max_samples=10
)

## 4. Load HuggingFace Datasets with DatasetManager

In [None]:
# Initialize dataset manager
manager = DatasetManager(base_data_dir="./data")

# Load HuggingFace datasets
datasets = manager.load_all_datasets(
    use_huggingface=True,
    hf_datasets=['common_voice_vi', 'vivos']
)

# Get statistics
stats_df = manager.get_dataset_statistics()
print("\nDataset Statistics:")
display(stats_df)

## 5. Explore a HuggingFace Dataset Sample

In [None]:
# Get a sample from Common Voice
if 'common_voice_vi' in datasets and len(datasets['common_voice_vi']) > 0:
    sample = datasets['common_voice_vi'][0]
    
    print("Sample Information:")
    print(f"  Audio Path: {sample.audio_path}")
    print(f"  Transcription: {sample.transcription}")
    print(f"  Duration: {sample.duration:.2f}s")
    print(f"  Sample Rate: {sample.sample_rate}Hz")
    print(f"  Dataset: {sample.dataset}")
    print(f"  Split: {sample.split}")
    
    # Check if audio array is available
    if sample.metadata and 'audio_array' in sample.metadata:
        audio_array = sample.metadata['audio_array']
        print(f"  Audio array shape: {audio_array.shape if hasattr(audio_array, 'shape') else len(audio_array)}")
else:
    print("No Common Voice dataset loaded. Please fetch it first.")
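The attributes printed above suggest the shape of the sample object. The sketch below is a hypothetical stand-in whose fields are inferred only from that cell; the real class in `src/dataset_loader.py` may define more fields or different types. It also shows the same "array attached or not" check as a small function.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class AudioSampleSketch:
    """Hypothetical stand-in for the pipeline's sample object (fields inferred from usage)."""
    audio_path: str
    transcription: str
    duration: float      # seconds
    sample_rate: int     # Hz
    dataset: str
    split: str
    metadata: Optional[Dict[str, Any]] = None


def audio_length(sample: AudioSampleSketch) -> Optional[int]:
    """Return the number of audio samples stored in metadata, or None if no array is attached."""
    if not sample.metadata or "audio_array" not in sample.metadata:
        return None
    return len(sample.metadata["audio_array"])


# One second of silence at 16 kHz, stored as a plain list for illustration.
demo = AudioSampleSketch(
    audio_path="example.wav",  # placeholder path; no file is read
    transcription="xin chào",
    duration=1.0,
    sample_rate=16000,
    dataset="common_voice_vi",
    split="test",
    metadata={"audio_array": [0.0] * 16000},
)

print(f"{demo.dataset}/{demo.split}: {demo.duration:.2f}s, {audio_length(demo)} samples")
```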

## 6. Prepare Train/Val/Test Splits

In [None]:
# Prepare splits
splits = manager.prepare_train_test_splits(
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15
)

# Show split sizes
for dataset_name, split_data in splits.items():
    print(f"\n{dataset_name}:")
    for split_name, samples in split_data.items():
        print(f"  {split_name}: {len(samples)} samples")
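The internals of `prepare_train_test_splits` are not shown in this notebook. As a rough illustration of what a 70/15/15 ratio split involves, here is a minimal sketch that assumes a seeded shuffle followed by contiguous slicing. `split_by_ratio` and its `seed` parameter are hypothetical names, and the pipeline's own method may behave differently (for example, by respecting the splits that a HuggingFace dataset already ships with).

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_by_ratio(
    items: Sequence[T],
    train_ratio: float = 0.7,
    val_ratio: float = 0.15,
    test_ratio: float = 0.15,
    seed: int = 42,
) -> Dict[str, List[T]]:
    """Hypothetical helper: shuffle `items` reproducibly and cut them into three parts."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("train_ratio + val_ratio + test_ratio must equal 1.0")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global random state is untouched

    n = len(shuffled)
    n_train = min(round(n * train_ratio), n)
    n_val = min(round(n * val_ratio), n - n_train)  # never run past the end

    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to the test split
    }


demo_splits = split_by_ratio(list(range(100)))
for name, part in demo_splits.items():
    print(f"{name}: {len(part)} samples")
```

Assigning the remainder to the test split guarantees that every item lands in exactly one split, even when the ratios do not divide the dataset size evenly.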

## 7. Load ASR Models

In [None]:
# List available models
print("Available models:")
for model_key in ModelFactory.get_available_models():
    config = ModelFactory.MODEL_CONFIGS[model_key]
    print(f"  {model_key}: {config.name}")

In [None]:
# Load specific models for evaluation
evaluator = ModelEvaluator(
    models_to_evaluate=['phowhisper-small']  # Start with one model
)

evaluator.load_models()
models = evaluator.get_loaded_models()

## 8. Evaluate a Model on HuggingFace Dataset

In [None]:
from tqdm.notebook import tqdm

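def evaluate_on_hf_dataset(model, samples, max_samples=5):
    """
    Evaluate a model on HuggingFace dataset samples.

    Transcribes at most `max_samples` samples and returns a tuple of
    (metrics, references, hypotheses).
    """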
    references = []
    hypotheses = []
    
    for sample in tqdm(samples[:max_samples], desc="Transcribing"):
        # The model can handle AudioSample objects with audio arrays
        hypothesis = model.transcribe(sample)
        references.append(sample.transcription)
        hypotheses.append(hypothesis)
        
        # Show example
        if len(references) == 1:
            print(f"\nExample:")
            print(f"  Reference:  {sample.transcription}")
            print(f"  Hypothesis: {hypothesis}")
    
    # Calculate metrics
    calculator = ASRMetrics()
    metrics = calculator.calculate_batch_metrics(references, hypotheses)
    
    return metrics, references, hypotheses

# Run evaluation
if models and splits and 'common_voice_vi' in splits:
    model_name = list(models.keys())[0]
    test_samples = splits['common_voice_vi']['test']
    
    print(f"\nEvaluating {model_name} on Common Voice Vietnamese (test set)...\n")
    metrics, refs, hyps = evaluate_on_hf_dataset(
        models[model_name],
        test_samples,
        max_samples=5
    )
    
    print(format_metrics_report(metrics, f"{model_name} on Common Voice VI"))
else:
    print("Models or datasets not loaded. Please run previous cells.")
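`ASRMetrics.calculate_batch_metrics` is used as a black box above. For readers who want to see what WER and CER measure, the following is a minimal sketch based on Levenshtein edit distance, pooled over the whole batch. It is not the implementation in `src/metrics.py`; it assumes whitespace tokenization for WER, counts spaces as characters for CER, and applies no text normalization.

```python
from typing import Callable, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_error_rate(
    references: Sequence[str],
    hypotheses: Sequence[str],
    tokenize: Callable[[str], Sequence],
) -> float:
    """Total edits divided by total reference tokens across all pairs."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    return errors / total if total else 0.0


def wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    return corpus_error_rate(references, hypotheses, str.split)


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    return corpus_error_rate(references, hypotheses, list)


refs = ["xin chào việt nam"]
hyps = ["xin chào viết nam"]
print(f"WER: {wer(refs, hyps):.3f}")
print(f"CER: {cer(refs, hyps):.3f}")
```

For Vietnamese text, the CER value depends on whether both strings use the same Unicode normalization form, so normalizing references and hypotheses consistently before scoring is worth considering.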

## 9. Compare Multiple HuggingFace Datasets

In [None]:
# Evaluate across all loaded HuggingFace datasets
if models:
    model_name = list(models.keys())[0]
    model = models[model_name]
    
    results = []
    
    for dataset_name, split_data in splits.items():
        test_samples = split_data['test'][:5]  # Limit to 5 samples
        
        print(f"\nEvaluating on {dataset_name}...")
        metrics, _, _ = evaluate_on_hf_dataset(model, test_samples, max_samples=5)
        
        results.append({
            'Dataset': dataset_name,
            'Model': model_name,
            'WER': metrics['wer'],
            'CER': metrics['cer'],
            'MER': metrics['mer'],
            'SER': metrics['ser']
        })
    
    # Create comparison DataFrame
    results_df = pd.DataFrame(results)
    print("\n" + "="*70)
    print("Comparison Across HuggingFace Datasets")
    print("="*70)
    display(results_df)
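Of the four columns in the comparison table, `SER` is the coarsest. Assuming it stands for sentence error rate (the notebook does not define it), it can be sketched as the fraction of utterances whose hypothesis differs from the reference in any word. `sentence_error_rate` below is a hypothetical helper, not the pipeline's own function, and it compares whitespace-split tokens without any further normalization.

```python
from typing import Sequence


def sentence_error_rate(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Fraction of reference/hypothesis pairs that are not an exact word-level match."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        return 0.0
    wrong = sum(
        1 for ref, hyp in zip(references, hypotheses)
        if ref.split() != hyp.split()  # extra whitespace alone is not counted as an error
    )
    return wrong / len(references)


refs = ["a b", "c d", "e f", "g h"]
hyps = ["a b", "c x", "e  f", "g h"]
print(f"SER: {sentence_error_rate(refs, hyps):.2f}")
```

With only 5 samples per dataset, as in the cell above, a single wrong utterance moves this metric by 20 percentage points, so the table is best read as a smoke test and not as a benchmark result.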

## 10. Visualize Results

In [None]:
# Create visualizations
if 'results_df' in locals() and len(results_df) > 0:
    visualizer = ASRVisualizer(output_dir="./hf_evaluation_plots")
    
    # Create metric comparison plot
    visualizer.plot_metric_comparison(results_df, metric='wer')
    visualizer.plot_metric_comparison(results_df, metric='cer')
    
    print("\nPlots saved to: ./hf_evaluation_plots/")

## 11. Using HuggingFace Datasets Directly (Advanced)

In [None]:
# Direct loading from HuggingFace (without our wrapper)
from datasets import load_dataset, Audio

# Load Common Voice Vietnamese directly
print("Loading Common Voice Vietnamese directly from HuggingFace...")
raw_dataset = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "vi",
    split="test",
    trust_remote_code=True
)

# Take a small sample
raw_dataset = raw_dataset.select(range(min(5, len(raw_dataset))))

# Cast audio to 16kHz
raw_dataset = raw_dataset.cast_column("audio", Audio(sampling_rate=16000))

print(f"\nLoaded {len(raw_dataset)} samples")
print(f"\nFeatures: {raw_dataset.features}")

# Show first sample
if len(raw_dataset) > 0:
    sample = raw_dataset[0]
    print("\nFirst sample:")
    print(f"  Sentence: {sample['sentence']}")
    print(f"  Audio array length: {len(sample['audio']['array'])}")
    print(f"  Sample rate: {sample['audio']['sampling_rate']}Hz")

## 12. Export Results

In [None]:
# Export evaluation results
if 'results_df' in locals():
    # CSV
    results_df.to_csv('./hf_evaluation_results.csv', index=False)
    print("[OK] Results saved to: ./hf_evaluation_results.csv")
    
    # JSON
    results_df.to_json('./hf_evaluation_results.json', orient='records', indent=2)
    print("[OK] Results saved to: ./hf_evaluation_results.json")

## Summary

This notebook demonstrated:

1. [OK] Listing available Vietnamese datasets on HuggingFace
2. [OK] Fetching datasets from HuggingFace Hub
3. [OK] Loading HuggingFace datasets with our pipeline
4. [OK] Exploring dataset samples
5. [OK] Preparing train/val/test splits
6. [OK] Evaluating ASR models on HuggingFace datasets
7. [OK] Comparing performance across datasets
8. [OK] Visualizing results
9. [OK] Direct HuggingFace datasets usage
10. [OK] Exporting results

### Next Steps

**For CLI evaluation with HuggingFace datasets:**
```bash
# Fetch datasets first
python scripts/fetch_datasets.py --fetch common_voice_vi vivos --max-samples 100

# Run evaluation
python main_evaluation.py \
    --use-huggingface \
    --hf-datasets common_voice_vi vivos \
    --models phowhisper-small whisper-small
```

**For full-scale evaluation:**
```bash
# Fetch all datasets (no sample limit)
python scripts/fetch_datasets.py --fetch-all

# Run complete evaluation
python main_evaluation.py --use-huggingface
```