# Wav2Vn Model Evaluation

This notebook evaluates the Wav2Vn model on five Vietnamese speech datasets:
- ViMD (Vietnamese Multi-Dialect dataset)
- BUD500 (about 500 hours of Vietnamese speech)
- LSVSC (Large-Scale Vietnamese Speech Corpus)
- VLSP 2020
- VietMed

## Wav2Vn Model
- `wav2vn` - Vietnamese ASR model

## [WARNING] Important Note

**Wav2Vn is not publicly available on HuggingFace Hub.**  
This notebook currently uses mock transcription for demonstration purposes.

If you have access to the Wav2Vn model:
1. Update `src/model_evaluator.py` with the correct model ID
2. Implement the actual loading logic in `Wav2VnModel.load_model()`
3. Re-run this notebook for real evaluation

**Compatible with**: Local & Google Colab  
**Report output**: `/docs/reports/`

## 1. Environment Setup & Dependencies
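
The setup cell in this section relies on `detect_environment()` from `src/notebook_utils.py`, whose implementation is not shown in this notebook. As a rough sketch only, one common way to tell Colab from a local run is to check whether the `google.colab` package has been loaded. The function name `detect_environment_sketch` and the return values `"colab"` and `"local"` are assumptions made for illustration.

```python
import sys


def detect_environment_sketch() -> str:
    """Return 'colab' inside Google Colab, otherwise 'local' (hypothetical helper)."""
    # Colab preloads the google.colab package, so finding it in sys.modules
    # is a widely used, informal signal that the code runs on Colab.
    return "colab" if "google.colab" in sys.modules else "local"


print(f"[INFO] Running in: {detect_environment_sketch()}")
```

A check like this has no side effects, so it is safe to run before any dependency installation.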

In [None]:
# Cell 1: Environment detection and setup
%load_ext autoreload
%autoreload 3
import sys
from pathlib import Path

# Import notebook utilities
try:
    from src.notebook_utils import (
        detect_environment,
        setup_paths,
        install_dependencies,
        print_environment_info,
        ReportGenerator,
        create_evaluation_summary
    )
except ImportError:
    # When the notebook runs from the notebooks/ directory, src/ lives one
    # level up, so add the parent directory to the import path and retry
    notebook_dir = Path.cwd()
    if notebook_dir.name == 'notebooks':
        sys.path.insert(0, str(notebook_dir.parent))
    from src.notebook_utils import (
        detect_environment,
        setup_paths,
        install_dependencies,
        print_environment_info,
        ReportGenerator,
        create_evaluation_summary
    )

# Detect environment
ENV = detect_environment()
print(f"[INFO] Running in: {ENV}")

# Install dependencies if needed (mainly for Colab)
install_dependencies(ENV)

# Setup paths
PATHS = setup_paths()
print(f"\n[OK] Project root: {PATHS['project_root']}")
print(f"[OK] Data directory: {PATHS['data_dir']}")
print(f"[OK] Config file: {PATHS['config_file']}")
print(f"[OK] Reports directory: {PATHS['reports_dir']}")

In [None]:
# Cell 2: Print environment info
print_environment_info()

In [None]:
# Cell 3: Import project modules
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from datetime import datetime
import time
from tqdm.auto import tqdm

# Import project modules with proper src. prefix for IDE type hints
from src.dataset_loader import DatasetManager, AudioSample
from src.model_evaluator import ModelEvaluator, ModelFactory
from src.metrics import ASRMetrics, RTFTimer
from src.visualization import ASRVisualizer

print("[OK] All modules imported successfully")

## 2. Configuration
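
The configuration cell in this section sets `TRAIN_RATIO`, `VAL_RATIO`, and `TEST_RATIO`. How `DatasetManager` applies them is not shown here, so the following is only a sketch of how such ratios could be validated and turned into sample counts. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, train_ratio, val_ratio, test_ratio):
    """Turn split ratios into sample counts that add up to n_samples."""
    # Compare with a tolerance: floating-point ratios rarely sum to exactly 1.0.
    if not math.isclose(train_ratio + val_ratio + test_ratio, 1.0):
        raise ValueError("split ratios must sum to 1.0")
    n_train = round(n_samples * train_ratio)
    n_val = round(n_samples * val_ratio)
    # Give the remainder to the test split so no sample is dropped.
    n_test = n_samples - n_train - n_val
    return n_train, n_val, n_test


print(split_sizes(1000, 0.7, 0.15, 0.15))  # (700, 150, 150)
```

Assigning the remainder to the last split keeps the three counts consistent with the dataset size even when rounding would otherwise lose or duplicate a sample.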

In [None]:
# Cell 4: Configuration
# Model configuration
MODEL_FAMILY = "Wav2Vn"
MODELS_TO_TEST = [
    "wav2vn",  # Note: This uses mock transcription (not publicly available)
]

# Dataset configuration
DATASETS_TO_TEST = [
    "ViMD",
    "BUD500",
    "LSVSC",
    "VLSP2020",
    "VietMed"
]

# Evaluation configuration
MAX_SAMPLES_PER_DATASET = None  # None = all samples, or set to e.g., 50 for quick testing
TRAIN_RATIO = 0.7
VAL_RATIO = 0.15
TEST_RATIO = 0.15

# Output configuration
TIMESTAMP = datetime.now().strftime("%Y%m%d_%H%M%S")
OUTPUT_DIR = PATHS['output_dir'] / f"wav2vn_{TIMESTAMP}"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("[WARNING] Wav2Vn is not publicly available - using mock transcription")
print(f"[CONFIG] Model family: {MODEL_FAMILY}")
print(f"[CONFIG] Models to test: {len(MODELS_TO_TEST)}")
print(f"[CONFIG] Datasets to test: {len(DATASETS_TO_TEST)}")
print(f"[CONFIG] Max samples per dataset: {MAX_SAMPLES_PER_DATASET or 'All'}")
print(f"[CONFIG] Output directory: {OUTPUT_DIR}")

## 3. Load Datasets

In [None]:
# Cell 5: Initialize dataset manager
dataset_manager = DatasetManager(config_file=PATHS['config_file'])
print("[OK] Dataset manager initialized")

In [None]:
# Cell 6: Load all datasets
datasets_loaded = {}
dataset_stats = []

for dataset_name in tqdm(DATASETS_TO_TEST, desc="Loading datasets"):
    try:
        # Load dataset
        samples = dataset_manager.load_dataset(
            dataset_name=dataset_name
        )
        # Get test split
        test_samples = samples['test']
        
        # Limit samples if specified
        if MAX_SAMPLES_PER_DATASET:
            test_samples = test_samples[:MAX_SAMPLES_PER_DATASET]
        
        datasets_loaded[dataset_name] = test_samples
        
        # Collect stats ('Test Samples' is the full test split,
        # 'Used Samples' is what remains after the optional limit)
        dataset_stats.append({
            'Dataset': dataset_name,
            'Total Samples': len(samples['train']) + len(samples['val']) + len(samples['test']),
            'Test Samples': len(samples['test']),
            'Used Samples': len(test_samples)
        })
        
        print(f"[OK] {dataset_name}: {len(test_samples)} test samples loaded")
    except Exception as e:
        print(f"[WARNING] Failed to load {dataset_name}: {e}")
        datasets_loaded[dataset_name] = []

# Display stats
stats_df = pd.DataFrame(dataset_stats)
print("\n[INFO] Dataset Statistics:")
print(stats_df.to_string(index=False))

## 4. Initialize Models

In [None]:
# Cell 7: Initialize model evaluator
model_evaluator = ModelEvaluator()
metrics_calculator = ASRMetrics()

print("[OK] Model evaluator and metrics calculator initialized")

## 5. Run Evaluation

**Note:** Because this notebook uses mock transcription for Wav2Vn, all results in this section are simulated.
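
The evaluation loop reports a real-time factor (RTF): total processing time divided by total audio duration, so values below 1 mean faster than real time. The project's `RTFTimer` is not shown in this notebook. The sketch below uses a hypothetical stand-in, `SimpleTimer`, that exposes the same `elapsed_time` attribute the loop reads.

```python
import math
import time


class SimpleTimer:
    """Hypothetical stand-in for RTFTimer: times the body of a with-block."""

    def __enter__(self):
        self._start = time.perf_counter()
        self.elapsed_time = 0.0
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed_time = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised inside the block


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; NaN when there is no audio."""
    if audio_seconds <= 0:
        return math.nan
    return processing_seconds / audio_seconds


with SimpleTimer() as timer:
    sum(range(100_000))  # stand-in for model.transcribe(...)

print(f"elapsed: {timer.elapsed_time:.6f}s")
print(real_time_factor(2.0, 10.0))  # 0.2
```

Returning NaN for zero-length audio is a choice made for this sketch; the notebook's own loop reports 0 in that case.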

In [None]:
# Cell 8: Main evaluation loop
results = []
total_start_time = time.time()
total_samples_processed = 0

# Iterate through each model
for model_name in MODELS_TO_TEST:
    print(f"\n{'='*60}")
    print(f"[INFO] Evaluating model: {model_name}")
    print(f"[WARNING] Using mock transcription (Wav2Vn not publicly available)")
    print(f"{'='*60}")
    
    # Load model
    try:
        model = ModelFactory.create_model(model_name)
        model.load_model()
        print(f"[OK] Model loaded (mock implementation)")
    except Exception as e:
        print(f"[ERROR] Failed to load model {model_name}: {e}")
        continue
    
    # Evaluate on each dataset
    for dataset_name, test_samples in datasets_loaded.items():
        if not test_samples:
            print(f"[WARNING] Skipping {dataset_name} - no samples")
            continue
        
        print(f"\n[INFO] Testing on {dataset_name} ({len(test_samples)} samples)...")
        
        # Prepare for evaluation
        references = []
        hypotheses = []
        audio_durations = []
        processing_times = []
        
        # Process each sample
        for sample in tqdm(test_samples, desc=f"{dataset_name}", leave=False):
            try:
                # Transcribe with RTF measurement
                with RTFTimer() as timer:
                    hypothesis = model.transcribe(sample.audio_path)
                
                # Measure the audio duration first: if this fails, nothing has
                # been appended yet and the result lists stay aligned
                import librosa
                duration = librosa.get_duration(path=sample.audio_path)

                # Store results
                references.append(sample.transcription)
                hypotheses.append(hypothesis)
                audio_durations.append(duration)
                processing_times.append(timer.elapsed_time)
                
                total_samples_processed += 1
            except Exception as e:
                print(f"[WARNING] Failed to process sample {sample.file_id}: {e}")
                continue
        
        # Calculate metrics
        if references and hypotheses:
            metrics = metrics_calculator.calculate_all_metrics(
                references=references,
                hypotheses=hypotheses
            )
            
            # Calculate RTF
            total_audio_duration = sum(audio_durations)
            total_processing_time = sum(processing_times)
            rtf = total_processing_time / total_audio_duration if total_audio_duration > 0 else 0
            
            # Store results
            result = {
                'model': model_name,
                'dataset': dataset_name,
                'samples_processed': len(references),
                'WER': metrics['wer'],
                'CER': metrics['cer'],
                'MER': metrics['mer'],
                'WIL': metrics['wil'],
                'WIP': metrics['wip'],
                'SER': metrics['ser'],
                'RTF': rtf,
                'insertions': metrics['insertions'],
                'deletions': metrics['deletions'],
                'substitutions': metrics['substitutions'],
                'total_audio_duration': total_audio_duration,
                'total_processing_time': total_processing_time
            }
            results.append(result)
            
            print(f"[OK] WER: {metrics['wer']:.4f} | CER: {metrics['cer']:.4f} | RTF: {rtf:.4f}")
            print(f"[NOTE] Results are from mock transcription")
        else:
            print(f"[WARNING] No valid results for {dataset_name}")

total_evaluation_time = time.time() - total_start_time

print(f"\n\n{'='*60}")
print(f"[OK] Evaluation completed!")
print(f"[INFO] Total time: {total_evaluation_time:.2f}s ({total_evaluation_time/60:.2f} minutes)")
print(f"[INFO] Total samples processed: {total_samples_processed}")
print(f"[WARNING] Results are from mock transcription - not real Wav2Vn performance")
print(f"{'='*60}")

## 6. Results Analysis

**Disclaimer:** These results are from mock transcription and do not represent actual Wav2Vn performance.
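
For readers unfamiliar with the headline metrics: WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The project's `ASRMetrics` class is not shown here and may normalize text differently. The functions below are a minimal textbook sketch, and their names are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution (chào -> chao) and one deletion (nam) over 4 reference words.
print(wer("xin chào việt nam", "xin chao việt"))  # 0.5
```

Because the denominator is the reference length, both metrics can exceed 1.0 when the hypothesis contains many insertions.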

In [None]:
# Cell 9: Create results DataFrame
results_df = pd.DataFrame(results)

# Display results
print("[INFO] Complete Results (Mock Transcription):")
print(results_df.to_string(index=False))

# Save to CSV
csv_path = OUTPUT_DIR / f"wav2vn_results_{TIMESTAMP}.csv"
results_df.to_csv(csv_path, index=False)
print(f"\n[OK] Results saved to: {csv_path}")
print(f"[NOTE] These are mock results - update model_evaluator.py for real evaluation")

In [None]:
# Cell 10: Summary statistics
print("\n[CHART] Average Performance by Model:")
model_avg = results_df.groupby('model')[['WER', 'CER', 'MER', 'RTF']].mean()
print(model_avg.to_string())

print("\n[CHART] Average Performance by Dataset:")
dataset_avg = results_df.groupby('dataset')[['WER', 'CER', 'MER', 'RTF']].mean()
print(dataset_avg.to_string())

print("\n[WARNING] Results are simulated - not actual Wav2Vn performance")

In [None]:
# Cell 11: Find the best result (lowest WER across all model/dataset pairs)
best_wer_idx = results_df['WER'].idxmin()
best_model_info = results_df.loc[best_wer_idx]

print("[TARGET] Best Result (Lowest WER) - Mock Results:")
print(f"  Model: {best_model_info['model']}")
print(f"  Dataset: {best_model_info['dataset']}")
print(f"  WER: {best_model_info['WER']:.4f}")
print(f"  CER: {best_model_info['CER']:.4f}")
print(f"  RTF: {best_model_info['RTF']:.4f}")

## 7. Visualizations

**Note:** The visualizations in this section are based on mock transcription results.

In [None]:
# Cell 12: Create visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Create plots directory
plots_dir = OUTPUT_DIR / "plots"
plots_dir.mkdir(exist_ok=True)

# Initialize visualizer
visualizer = ASRVisualizer(output_dir=str(plots_dir))

print("[OK] Visualizer initialized")

In [None]:
# Cell 13: WER comparison plot
plt.figure(figsize=(14, 6))
pivot_wer = results_df.pivot(index='dataset', columns='model', values='WER')
pivot_wer.plot(kind='bar', ax=plt.gca())
plt.title('Word Error Rate (WER) - Wav2Vn Model (Mock Results)', fontsize=14, fontweight='bold')
plt.xlabel('Dataset', fontsize=12)
plt.ylabel('WER (Lower is Better)', fontsize=12)
plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig(plots_dir / 'wer_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
print("[OK] WER comparison plot saved (mock results)")

In [None]:
# Cell 14: CER comparison plot
plt.figure(figsize=(14, 6))
pivot_cer = results_df.pivot(index='dataset', columns='model', values='CER')
pivot_cer.plot(kind='bar', ax=plt.gca())
plt.title('Character Error Rate (CER) - Wav2Vn Model (Mock Results)', fontsize=14, fontweight='bold')
plt.xlabel('Dataset', fontsize=12)
plt.ylabel('CER (Lower is Better)', fontsize=12)
plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig(plots_dir / 'cer_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
print("[OK] CER comparison plot saved (mock results)")

## 8. Generate Report

In [None]:
# Cell 15: Generate comprehensive report
report_generator = ReportGenerator(reports_dir=PATHS['reports_dir'])

# Prepare report data
report_data = {
    'models': MODELS_TO_TEST,
    'datasets': DATASETS_TO_TEST,
    'metrics_summary': {i: row.to_dict() for i, row in results_df.iterrows()},
    'best_model': {
        'model_name': best_model_info['model'],
        'dataset': best_model_info['dataset'],
        'WER': best_model_info['WER'],
        'CER': best_model_info['CER'],
        'RTF': best_model_info['RTF']
    },
    'evaluation_time': total_evaluation_time,
    'total_samples': total_samples_processed
}

# Generate Markdown report
report_path = report_generator.generate_model_report(
    model_family=MODEL_FAMILY,
    results=report_data,
    output_filename=f"Báo_cáo_Wav2Vn_{TIMESTAMP}.md"
)

# Save JSON results
json_path = report_generator.save_results_json(
    results=report_data,
    filename=f"wav2vn_results_{TIMESTAMP}.json"
)

print(f"\n[OK] Markdown report: {report_path}")
print(f"[OK] JSON results: {json_path}")
print(f"[WARNING] Report contains mock results only")

In [None]:
# Cell 16: Print evaluation summary
summary = create_evaluation_summary(
    model_family=MODEL_FAMILY,
    models_tested=MODELS_TO_TEST,
    datasets_tested=DATASETS_TO_TEST,
    total_samples=total_samples_processed,
    total_time=total_evaluation_time
)
print(summary)
print("\n[WARNING] Results are from mock transcription")

## 9. How to Enable Real Wav2Vn Evaluation

To use the actual Wav2Vn model (when available):

### Step 1: Update Model Configuration

Edit `src/model_evaluator.py`:

```python
MODEL_CONFIGS = {
    'wav2vn': ModelConfig(
        name='Wav2Vn',
        model_id='your-org/wav2vn-actual-model-id',  # Update this
        model_type='wav2vn'
    ),
}
```
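
The real `ModelConfig` definition lives in `src/model_evaluator.py` and is not reproduced in this notebook. As an illustration only, a configuration entry with the three fields used above could be modeled as a small dataclass with a lookup helper. `ModelConfigSketch`, `MODEL_CONFIGS_SKETCH`, and `get_config` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfigSketch:
    """Hypothetical mirror of the three fields used in the snippet above."""
    name: str
    model_id: str
    model_type: str


MODEL_CONFIGS_SKETCH = {
    "wav2vn": ModelConfigSketch(
        name="Wav2Vn",
        model_id="your-org/wav2vn-actual-model-id",  # placeholder, as above
        model_type="wav2vn",
    ),
}


def get_config(key):
    """Look up a config and fail with a readable message for unknown keys."""
    try:
        return MODEL_CONFIGS_SKETCH[key]
    except KeyError:
        known = sorted(MODEL_CONFIGS_SKETCH)
        raise KeyError(f"unknown model '{key}'; known models: {known}") from None


print(get_config("wav2vn").model_id)
```

A frozen dataclass keeps the entries immutable, which avoids accidental edits to a shared configuration during a run.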

### Step 2: Implement Loading Logic

Edit the `Wav2VnModel` class:

```python
class Wav2VnModel(BaseASRModel):
    def load_model(self):
        # Replace mock implementation with actual loading
        from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
        self.processor = Wav2Vec2Processor.from_pretrained(self.config.model_id)
        self.model = Wav2Vec2ForCTC.from_pretrained(self.config.model_id)
        # ... etc
```
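
If the real model turns out to be CTC-based, as the `Wav2Vec2ForCTC` class above suggests, transcription comes down to taking the most likely token per audio frame and then collapsing that sequence. In `transformers` this is normally handled by the processor's `batch_decode`. The pure-Python sketch below only illustrates the collapsing rule; the toy vocabulary and the blank id of 0 are assumptions.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Greedy CTC decoding: merge repeated ids, then drop blank tokens."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary (assumed): id 0 is the CTC blank.
vocab = {0: "<pad>", 1: "c", 2: "h", 3: "à", 4: "o"}
frames = [1, 1, 0, 2, 2, 2, 0, 3, 0, 0, 4, 4]
print(ctc_greedy_decode(frames, vocab))  # chào
```

The blank token is what allows genuinely doubled letters: `[1, 0, 1]` decodes to `cc`, while `[1, 1]` decodes to a single `c`.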

### Step 3: Re-run This Notebook

After updating the code, restart the kernel and re-run all cells.

## 10. Export & Conclusion

In [None]:
# Cell 17: Final summary
print("[OK] Wav2Vn Evaluation Complete!")
print("\n[INFO] Generated outputs:")
print(f"  1. Results CSV: {csv_path}")
print(f"  2. Markdown Report: {report_path}")
print(f"  3. JSON Results: {json_path}")
print(f"  4. Visualizations: {plots_dir}/")
print("\n[NOTE] All files are saved in:")
print(f"  - Results: {OUTPUT_DIR}")
print(f"  - Reports: {PATHS['reports_dir']}")
print("\n" + "="*60)
print("[WARNING] IMPORTANT REMINDER")
print("="*60)
print("These results are from MOCK TRANSCRIPTION only.")
print("Wav2Vn is not publicly available on HuggingFace.")
print("\nTo evaluate the real Wav2Vn model:")
print("1. Obtain access to the Wav2Vn model")
print("2. Update src/model_evaluator.py with correct model ID")
print("3. Implement actual loading logic in Wav2VnModel class")
print("4. Re-run this notebook")
print("="*60)