# Ghana Speech-to-Speech Pipeline

## Unified Multilingual Model for Akan, Ewe, Ga, and Dagbani

---

This comprehensive notebook covers the complete pipeline for building a Speech-to-Speech (S2S) system for Ghanaian languages:

1. **Part 0**: Setup & System Verification
2. **Part 1**: Dataset Download & Organization
3. **Part 2**: Data Processing & Preparation
4. **Part 3**: Model Training (ASR, TTS, Translation)
5. **Part 4**: Unified Pipeline & Inference
6. **Part 5**: Evaluation & Benchmarking
7. **Part 6**: Deployment & Serving

---

### Architecture Overview

```
+------------------+     +------------------+     +------------------+
|    THE EAR       |     |    THE BRAIN     |     |    THE MOUTH     |
|    (ASR)         | --> |    (Translation) | --> |    (TTS)         |
|    Meta MMS      |     |    NLLB-200      |     |    XTTS v2       |
+------------------+     +------------------+     +------------------+
     Speech             Text (Twi)         Text (English)        Speech
```
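
The three stages in the diagram compose as a simple chain. The snippet below is a minimal sketch of that flow using placeholder callables; `run_s2s`, `fake_asr`, `fake_translate`, and `fake_tts` are hypothetical names for illustration and are not the APIs of MMS, NLLB-200, or XTTS v2.

```python
from typing import Callable


def run_s2s(
    audio: bytes,
    asr: Callable[[bytes], str],
    translate: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> dict:
    """Chain ASR -> translation -> TTS and keep the intermediate results."""
    source_text = asr(audio)
    target_text = translate(source_text)
    speech = tts(target_text)
    return {"source_text": source_text, "target_text": target_text, "speech": speech}


# Hypothetical stand-ins so the sketch runs without downloading any model.
def fake_asr(audio: bytes) -> str:
    return "maakye"


def fake_translate(text: str) -> str:
    return {"maakye": "good morning"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_s2s(b"\x00\x01", fake_asr, fake_translate, fake_tts)
print(result["target_text"])
```

Keeping the intermediate text in the result makes it easier to see which stage introduced an error when the final audio sounds wrong.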

### Supported Languages
- **Akan (Twi/Fante)** - aka
- **Ewe** - ewe
- **Ga** - gaa
- **Dagbani** - dag
- **English** - eng

---

**Hardware Requirements:**
- GPU: NVIDIA RTX 3090/4090 (24GB VRAM) recommended
- RAM: 32GB+ recommended
- Storage: 250GB+ free space for datasets

**Estimated Time:**
- Dataset Download: 2-6 hours (depending on connection)
- ASR Training: 2-4 hours
- TTS Training: 4-8 hours
- Total: ~10-20 hours

---
# Part 0: Setup & System Verification
---

## 0.1 GPU/CUDA Check

In [None]:
import torch
import sys
import platform

print("=" * 60)
print("SYSTEM INFORMATION")
print("=" * 60)
print(f"Python Version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    
    # Set default device
    device = "cuda"
else:
    print("WARNING: No GPU detected! Training will be very slow.")
    device = "cpu"

print(f"\nUsing device: {device}")
print("=" * 60)

## 0.2 Storage Verification

In [None]:
import shutil
from pathlib import Path

# Check available disk space
total, used, free = shutil.disk_usage(".")

print("STORAGE INFORMATION")
print("=" * 60)
print(f"Total: {total / 1024**3:.1f} GB")
print(f"Used: {used / 1024**3:.1f} GB")
print(f"Free: {free / 1024**3:.1f} GB")

# Minimum recommended space
MIN_SPACE_GB = 100  # For sample mode
FULL_SPACE_GB = 300  # For full datasets

if free / 1024**3 < MIN_SPACE_GB:
    print(f"\nWARNING: Less than {MIN_SPACE_GB}GB free. Consider using SAMPLE_MODE=True")
elif free / 1024**3 < FULL_SPACE_GB:
    print(f"\nNOTE: Less than {FULL_SPACE_GB}GB free. Full dataset download may not fit.")
else:
    print("\nStorage: OK for full dataset download")
print("=" * 60)

## 0.3 Install Dependencies

In [None]:
# Install all required packages
# Uncomment and run if packages are not installed

# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# !pip install transformers accelerate datasets peft bitsandbytes
# !pip install TTS  # Coqui TTS
# !pip install librosa soundfile scipy
# !pip install pandas openpyxl tqdm requests
# !pip install jiwer evaluate sacrebleu  # For evaluation
# !pip install gradio fastapi uvicorn python-multipart  # For serving

print("Dependencies check...")

required_packages = [
    "torch", "transformers", "datasets", "accelerate",
    "librosa", "soundfile", "pandas", "tqdm", "requests"
]

missing = []
for pkg in required_packages:
    try:
        __import__(pkg)
        print(f"  [OK] {pkg}")
    except ImportError:
        print(f"  [MISSING] {pkg}")
        missing.append(pkg)

if missing:
    print(f"\nPlease install missing packages: pip install {' '.join(missing)}")
else:
    print("\nAll required packages installed!")

## 0.4 Configuration Setup

In [None]:
# Import project configuration
import sys
from pathlib import Path

# Add project root to path
PROJECT_ROOT = Path.cwd()
sys.path.insert(0, str(PROJECT_ROOT))

from config import config, print_config, DATA_DIR, MODEL_DIR, OUTPUT_DIR

# ============================================================================
# CONFIGURATION - MODIFY THESE SETTINGS
# ============================================================================

# Toggle sample mode for quick testing (uses ~20GB instead of ~200GB)
SAMPLE_MODE = True  # Set to False for full training
SAMPLE_SIZE = 1000  # Samples per language in sample mode

# Target languages
LANGUAGES = ["aka", "ewe", "gaa", "dag"]  # Akan, Ewe, Ga, Dagbani

# Update config
config.dataset.sample_mode = SAMPLE_MODE
config.dataset.sample_size = SAMPLE_SIZE
config.dataset.languages = LANGUAGES

# Print configuration
print_config()

print(f"\nSAMPLE_MODE: {SAMPLE_MODE}")
print(f"SAMPLE_SIZE: {SAMPLE_SIZE}")
print(f"LANGUAGES: {LANGUAGES}")

---
# Part 1: Dataset Download & Organization
---

This section downloads the required datasets:
- **UGSpeechData**: ~5,400 hours of Ghanaian speech (for ASR)
- **BibleTTS**: High-quality studio recordings (for TTS)
- **FISD**: Financial inclusion speech dataset (for Ga and domain adaptation)

## 1.1 Initialize Downloader

In [None]:
from utils.data_processing import DatasetDownloader

# Create downloader instance
downloader = DatasetDownloader()

print(f"Download directory: {downloader.output_dir}")
print(f"Download directory exists: {downloader.output_dir.exists()}")

## 1.2 Download UGSpeechData (ASR Training Data)

In [None]:
# Download UGSpeechData - transcribed portions only (much smaller)
# Full dataset is ~336GB, transcribed subset is ~50GB

print("Downloading UGSpeechData (transcribed portions)...")
print("This will download metadata and transcribed audio for:")
print("  - Akan (~18,000 files, ~104 hours)")
print("  - Ewe (~19,000 files, ~106 hours)")
print("  - Dagbani (similar size)")
print("\n")

# Uncomment to download
# ugspeech_paths = downloader.download_ugspeechdata(
#     languages=["akan", "ewe", "dagbani"],
#     transcribed_only=True  # Only download transcribed subset
# )

# For now, we'll set up the expected paths
ugspeech_paths = {
    "akan": DATA_DIR / "raw" / "ugspeechdata" / "akan",
    "ewe": DATA_DIR / "raw" / "ugspeechdata" / "ewe",
    "dagbani": DATA_DIR / "raw" / "ugspeechdata" / "dagbani"
}

print("Expected UGSpeechData paths:")
for lang, path in ugspeech_paths.items():
    exists = "[EXISTS]" if path.exists() else "[NOT FOUND]"
    print(f"  {lang}: {path} {exists}")

## 1.3 Download BibleTTS (TTS Training Data)

In [None]:
# Download BibleTTS - high-quality studio recordings
# Each language is ~15-20GB

print("Downloading BibleTTS...")
print("High-quality single-speaker recordings for TTS training:")
print("  - Asante Twi (~15GB)")
print("  - Akuapem Twi (~16GB)")
print("  - Ewe (~19GB)")
print("\n")

# Uncomment to download
# bibletis_paths = downloader.download_bibletis(
#     languages=["asante_twi", "akuapem_twi", "ewe"]
# )

# Expected paths
bibletis_paths = {
    "asante_twi": DATA_DIR / "raw" / "bibletis" / "asante_twi",
    "akuapem_twi": DATA_DIR / "raw" / "bibletis" / "akuapem_twi",
    "ewe": DATA_DIR / "raw" / "bibletis" / "ewe"
}

print("Expected BibleTTS paths:")
for lang, path in bibletis_paths.items():
    exists = "[EXISTS]" if path.exists() else "[NOT FOUND]"
    print(f"  {lang}: {path} {exists}")

## 1.4 Download FISD (Ga Language Data)

In [None]:
# Download FISD - Financial Inclusion Speech Dataset
# Important for Ga language which is missing from BibleTTS

print("Downloading FISD (Financial Inclusion Speech Dataset)...")
print("Telephony-quality speech for mobile applications:")
print("  - Ga (~148 hours total)")
print("  - Asante Twi")
print("  - Akuapem Twi")
print("  - Fante")
print("\n")

# Uncomment to download
# fisd_paths = downloader.download_fisd(
#     languages=["ga", "asante_twi"]
# )

# Expected paths
fisd_paths = {
    "ga": DATA_DIR / "raw" / "fisd" / "ga",
    "asante_twi": DATA_DIR / "raw" / "fisd" / "asante_twi"
}

print("Expected FISD paths:")
for lang, path in fisd_paths.items():
    exists = "[EXISTS]" if path.exists() else "[NOT FOUND]"
    print(f"  {lang}: {path} {exists}")

## 1.5 Dataset Overview

In [None]:
import os
from pathlib import Path

def get_folder_size(path):
    """Calculate folder size in GB."""
    total = 0
    if path.exists():
        for f in path.rglob("*"):
            if f.is_file():
                total += f.stat().st_size
    return total / 1024**3

def count_audio_files(path, extensions=(".wav", ".mp3", ".flac")):
    """Count audio files in directory."""
    count = 0
    if path.exists():
        for ext in extensions:
            count += len(list(path.rglob(f"*{ext}")))
    return count

print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)

datasets = {
    "UGSpeechData": DATA_DIR / "raw" / "ugspeechdata",
    "BibleTTS": DATA_DIR / "raw" / "bibletis",
    "FISD": DATA_DIR / "raw" / "fisd"
}

for name, path in datasets.items():
    print(f"\n{name}:")
    print(f"  Path: {path}")
    print(f"  Exists: {path.exists()}")
    if path.exists():
        size = get_folder_size(path)
        files = count_audio_files(path)
        print(f"  Size: {size:.2f} GB")
        print(f"  Audio Files: {files:,}")

print("\n" + "=" * 60)

---
# Part 2: Data Processing & Preparation
---

This section processes the downloaded datasets into formats suitable for training:
- **ASR**: 16kHz, mono, normalized audio + transcriptions
- **TTS**: 22.05kHz, mono, trimmed audio + metadata.csv

## 2.1 Initialize Processors

In [None]:
from utils.data_processing import (
    AudioProcessor, 
    ASRDatasetFormatter, 
    TTSDatasetFormatter
)

# Initialize processors
audio_processor = AudioProcessor(
    target_sr_asr=16000,  # 16kHz for ASR
    target_sr_tts=22050   # 22.05kHz for TTS
)

asr_formatter = ASRDatasetFormatter(audio_processor)
tts_formatter = TTSDatasetFormatter(audio_processor)

print("Audio Processor initialized")
print(f"  ASR sample rate: {audio_processor.target_sr_asr} Hz")
print(f"  TTS sample rate: {audio_processor.target_sr_tts} Hz")

## 2.2 Process Audio for ASR

In [None]:
# Process UGSpeechData for ASR training
# This resamples to 16kHz and normalizes audio

from config import config

asr_data_dir = config.dataset.asr_data_dir
asr_data_dir.mkdir(parents=True, exist_ok=True)

print("Processing audio for ASR training...")
print(f"Output directory: {asr_data_dir}")
print("\nThis step:")
print("  1. Resamples audio to 16kHz")
print("  2. Converts to mono")
print("  3. Normalizes amplitude")
print("  4. Saves as WAV format")

# Example processing (uncomment when data is downloaded)
# for lang in ["akan", "ewe", "dagbani"]:
#     input_dir = DATA_DIR / "raw" / "ugspeechdata" / lang / "transcribed"
#     output_dir = asr_data_dir / lang / "wavs"
#     
#     if input_dir.exists():
#         print(f"\nProcessing {lang}...")
#         processed = audio_processor.batch_process(
#             input_dir=input_dir,
#             output_dir=output_dir,
#             mode="asr",
#             show_progress=True
#         )
#         print(f"  Processed {len(processed)} files")
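
To make the mono conversion and amplitude normalization steps concrete, here is a dependency-free sketch operating on plain Python lists. The helpers `to_mono` and `peak_normalize` are hypothetical; the project's `AudioProcessor` is not shown in this notebook and may well use a different method (for example RMS normalization through an audio library). Resampling is left out because it needs a proper filter.

```python
def to_mono(frames):
    """Average the channels of each frame, e.g. [(L, R), ...] -> [m, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


stereo = [(0.2, 0.4), (-0.5, -0.5), (0.1, 0.0)]
mono = to_mono(stereo)
normalized = peak_normalize(mono)
print(normalized)
```

A target peak slightly below 1.0 leaves headroom so that later processing does not clip.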

## 2.3 Process Audio for TTS

In [None]:
# Process BibleTTS for TTS training
# This creates the XTTS-compatible format

tts_data_dir = config.dataset.tts_data_dir
tts_data_dir.mkdir(parents=True, exist_ok=True)

print("Processing audio for TTS training...")
print(f"Output directory: {tts_data_dir}")
print("\nThis step:")
print("  1. Resamples audio to 22.05kHz")
print("  2. Converts to mono")
print("  3. Trims leading/trailing silence")
print("  4. Creates metadata.csv in XTTS format")
print("\nXTTS metadata format: filename|text|speaker|language")

# Example processing (uncomment when data is downloaded)
# for lang in ["asante_twi", "akuapem_twi", "ewe"]:
#     input_dir = DATA_DIR / "raw" / "bibletis" / lang
#     
#     if input_dir.exists():
#         print(f"\nProcessing {lang} for TTS...")
#         
#         # Parse BibleTTS structure
#         df = tts_formatter.parse_bibletis_structure(input_dir, lang)
#         
#         # Sample if in sample mode
#         if config.dataset.sample_mode and len(df) > config.dataset.sample_size:
#             df = df.sample(n=config.dataset.sample_size, random_state=42)
#         
#         # Create XTTS dataset
#         output_dir = tts_data_dir / lang
#         tts_formatter.create_xtts_dataset(df, output_dir)
#         
#         print(f"  Created TTS dataset with {len(df)} samples")
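
The `filename|text|speaker|language` layout mentioned above can be produced with a few lines of standard-library code. This is only a sketch: `write_metadata` and the file names are hypothetical, and the real `tts_formatter.create_xtts_dataset` may add a header row or use different columns.

```python
import tempfile
from pathlib import Path


def write_metadata(rows, path):
    """Write (filename, text, speaker, language) rows as pipe-delimited lines."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for filename, text, speaker, language in rows:
            # Remove stray pipes and collapse whitespace so every row
            # stays on one line with exactly four fields.
            clean = " ".join(text.replace("|", " ").split())
            f.write("|".join([filename, clean, speaker, language]) + "\n")
    return path


rows = [
    ("wavs/twi_0001.wav", "Maakye,   wo ho te sen?", "Twi_Speaker", "en"),
    ("wavs/twi_0002.wav", "Good morning, welcome to Ghana!", "Twi_Speaker", "en"),
]
out = write_metadata(rows, Path(tempfile.mkdtemp()) / "metadata.csv")
print(out.read_text(encoding="utf-8"))
```

Cleaning the text before writing matters because a single unescaped `|` inside a transcript would shift every later column in that row.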

## 2.4 Create Unified Multi-Language Dataset

In [None]:
# Merge all language TTS datasets into one unified dataset
# This is the "One Model" approach

print("Creating unified multi-language TTS dataset...")
print("\nThis combines all language datasets with distinct speakers:")
print("  - Twi_Speaker (from Asante/Akuapem Twi)")
print("  - Ewe_Speaker (from Ewe)")
print("  - Ga_Speaker (from FISD Ga)")
print("  - Dagbani_Speaker (from UGSpeechData)")

# Example merging (uncomment when individual datasets are ready)
# tts_datasets = [
#     tts_data_dir / "asante_twi",
#     tts_data_dir / "akuapem_twi",
#     tts_data_dir / "ewe",
# ]
# 
# existing_datasets = [d for d in tts_datasets if d.exists()]
# 
# if existing_datasets:
#     unified_dir = tts_data_dir / "unified"
#     tts_formatter.merge_datasets(existing_datasets, unified_dir)
#     print(f"\nUnified dataset created at: {unified_dir}")

## 2.5 Create HuggingFace Datasets

In [None]:
# Create HuggingFace Dataset objects for ASR training
from datasets import Dataset, Audio, DatasetDict
import pandas as pd

print("Creating HuggingFace Dataset objects...")
print("\nThese will be used for training with the Trainer API.")

# Example dataset creation (uncomment when data is processed)
# def create_asr_dataset(lang):
#     """Create HuggingFace dataset for a language."""
#     audio_dir = asr_data_dir / lang / "wavs"
#     metadata_path = DATA_DIR / "raw" / "ugspeechdata" / lang / f"{lang.capitalize()}.xlsx"
#     
#     # Parse metadata
#     df = asr_formatter.parse_ugspeechdata_metadata(
#         metadata_path, audio_dir, lang
#     )
#     
#     # Create dataset
#     dataset = asr_formatter.create_hf_dataset(df)
#     
#     # Split into train/val/test
#     splits = asr_formatter.create_train_val_test_split(dataset)
#     
#     return splits
# 
# # Create datasets for each language
# asr_datasets = {}
# for lang in ["akan", "ewe", "dagbani"]:
#     try:
#         asr_datasets[lang] = create_asr_dataset(lang)
#         print(f"  Created dataset for {lang}")
#     except Exception as e:
#         print(f"  Error creating dataset for {lang}: {e}")
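
The train/val/test split can be illustrated without any dataset library. The sketch below shuffles indices with a fixed seed so the split is reproducible; `split_indices` and its default fractions are assumptions, since the behaviour of `asr_formatter.create_train_val_test_split` is not shown here.

```python
import random


def split_indices(n, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle range(n) deterministically and cut it into train/val/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": indices[:n_test],
        "val": indices[n_test : n_test + n_val],
        "train": indices[n_test + n_val :],
    }


splits = split_indices(1000)
print({name: len(idx) for name, idx in splits.items()})
```

For speech data it is often worth splitting by speaker rather than by utterance so that the same voice does not appear in both train and test; that refinement is not shown here.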

---
# Part 3: Model Training
---

This section covers training/fine-tuning the three components:
- **3A**: ASR - The Ear (Meta MMS)
- **3B**: TTS - The Mouth (XTTS v2)
- **3C**: Translation - The Brain (NLLB-200)

---
## Part 3A: ASR Training (The Ear)

Fine-tune Meta's MMS (Massively Multilingual Speech) model using LoRA adapters for memory efficiency.

### 3A.1 Load MMS Model with Quantization

In [None]:
from transformers import Wav2Vec2ForCTC, AutoProcessor, BitsAndBytesConfig
import torch

# Model ID
ASR_MODEL_ID = "facebook/mms-1b-all"

print(f"Loading ASR model: {ASR_MODEL_ID}")
print("Using 8-bit quantization for memory efficiency...")

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

# Load processor
asr_processor = AutoProcessor.from_pretrained(ASR_MODEL_ID)

# Load model with quantization
asr_model = Wav2Vec2ForCTC.from_pretrained(
    ASR_MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto"
)

print(f"\nModel loaded!")
print(f"Model parameters: {asr_model.num_parameters():,}")

### 3A.2 Apply LoRA Adapters

In [None]:
from peft import LoraConfig, get_peft_model

# LoRA configuration
# This makes only a small fraction (~1% or less) of parameters trainable.
# No task_type is set: PEFT's task-specific wrappers forward text-model
# arguments such as input_ids, which Wav2Vec2ForCTC (input_values) does not accept.
peft_config = LoraConfig(
    inference_mode=False,
    r=16,              # Rank of update matrices
    lora_alpha=32,     # Scaling factor
    lora_dropout=0.1,  # Dropout probability
    target_modules=["q_proj", "v_proj"]  # Target attention layers
)

# Apply LoRA
asr_model = get_peft_model(asr_model, peft_config)

# Print trainable parameters
asr_model.print_trainable_parameters()

print("\nLoRA adapters applied!")
print("Only attention layers will be fine-tuned.")
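
A quick back-of-the-envelope calculation shows why LoRA trains so few parameters. For a weight of shape `d_out x d_in`, LoRA adds two small matrices with `r * (d_in + d_out)` parameters in total. The hidden size and layer count below are assumed values for a 1B-parameter wav2vec 2.0 encoder; confirm them against `asr_model.config` before relying on the numbers.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed architecture values; verify with asr_model.config.
HIDDEN_SIZE = 1280
NUM_LAYERS = 48
TARGETS_PER_LAYER = 2  # q_proj and v_proj
RANK = 16

per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total = per_module * TARGETS_PER_LAYER * NUM_LAYERS
print(f"LoRA parameters per module: {per_module:,}")
print(f"LoRA parameters in total:   {total:,}")
```

Under these assumptions the adapters add a few million parameters on top of roughly one billion frozen ones. With `lora_alpha=32` and `r=16`, the low-rank update is scaled by `lora_alpha / r = 2`.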

### 3A.3 Set Target Language

In [None]:
# Set the target language for training
# MMS uses ISO 639-3 codes
TARGET_LANG = "aka"  # Akan

print(f"Setting target language: {TARGET_LANG}")

# Set language in processor
asr_processor.tokenizer.set_target_lang(TARGET_LANG)

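# Load the MMS language adapter on the underlying Wav2Vec2 model.
# After get_peft_model(), asr_model.load_adapter() is PEFT's own method
# (for LoRA weights), not the MMS language-adapter loader.
asr_model.get_base_model().load_adapter(TARGET_LANG)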

print(f"Language adapter loaded for: {TARGET_LANG}")

### 3A.4 Training Configuration

In [None]:
from transformers import TrainingArguments

# Training arguments optimized for RTX 3090/4090
training_args = TrainingArguments(
    output_dir=str(MODEL_DIR / "asr" / "checkpoints"),
    
    # Batch size settings
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 16
    
    # Learning rate
    learning_rate=1e-4,
    warmup_steps=100,
    
    # Training duration
    num_train_epochs=5,
    max_steps=2000,  # Takes precedence over num_train_epochs when > 0
    
    # Memory optimization
    fp16=True,
    gradient_checkpointing=True,
    
    # Logging & Saving
    logging_steps=50,
    save_steps=200,
    eval_steps=200,
    save_total_limit=3,
    
    # Evaluation
    eval_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    
    # Other
    remove_unused_columns=False,
    push_to_hub=False,
)

print("Training arguments configured:")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Max steps: {training_args.max_steps}")
print(f"  FP16: {training_args.fp16}")
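
Because `max_steps` takes precedence over `num_train_epochs`, it helps to check how many passes over the data 2,000 steps actually represent. The sketch below does that arithmetic; `schedule_summary` is a hypothetical helper, the sample count of 1,000 is illustrative, and the Trainer's own rounding of steps per epoch may differ by a step.

```python
import math


def schedule_summary(num_samples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Relate optimizer steps to passes over the training set."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "epochs_covered": max_steps / steps_per_epoch,
    }


summary = schedule_summary(num_samples=1000, per_device_batch=4, grad_accum=4, max_steps=2000)
print(summary)
```

On a small sample-mode dataset, 2,000 steps can mean dozens of epochs, which raises the risk of overfitting; lowering `max_steps` or relying on `num_train_epochs` alone may be more appropriate there.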

### 3A.5 Data Collator

In [None]:
import torch
from dataclasses import dataclass
from typing import Dict, List, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator for CTC-based ASR training.
    Pads inputs and labels to the maximum length in the batch.
    """
    processor: AutoProcessor
    padding: Union[bool, str] = True
    max_length: int = 16000 * 30  # 30 seconds max
    
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Separate inputs and labels
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        
        # Pad inputs
        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            return_tensors="pt"
        )
        
        # Pad labels with the tokenizer; processor.pad routes `labels` to it
        # (replaces the deprecated as_target_processor() context manager)
        labels_batch = self.processor.pad(
            labels=label_features,
            padding=self.padding,
            return_tensors="pt"
        )
        
        # Replace padding with -100 for CTC loss
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )
        
        batch["labels"] = labels
        
        return batch

# Create data collator
data_collator = DataCollatorCTCWithPadding(processor=asr_processor)
print("Data collator created")
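
The label handling in the collator comes down to one idea: pad every label sequence to the batch maximum with `-100`, so that padded positions are excluded from the loss. A plain-Python sketch of that step, with the hypothetical helper `pad_labels`:

```python
def pad_labels(label_seqs, pad_value=-100):
    """Right-pad each label sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in label_seqs)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in label_seqs]


batch = pad_labels([[7, 3, 9], [4], [5, 5]])
print(batch)
```

The collator above gets the same result with tensors: it pads with the tokenizer's pad token first and then overwrites every position where the attention mask is 0.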

### 3A.6 Evaluation Metrics

In [None]:
import evaluate
import numpy as np

# Load WER metric
wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    """Compute Word Error Rate (WER) for evaluation."""
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)
    
    # Replace -100 with pad token id
    pred.label_ids[pred.label_ids == -100] = asr_processor.tokenizer.pad_token_id
    
    # Decode predictions and references
    pred_str = asr_processor.batch_decode(pred_ids)
    label_str = asr_processor.batch_decode(pred.label_ids, group_tokens=False)
    
    # Compute WER
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    
    return {"wer": wer}

print("WER metric configured")

### 3A.7 Train ASR Model

In [None]:
from transformers import Trainer

print("=" * 60)
print("ASR TRAINING")
print("=" * 60)
print("\nTo train the ASR model, you need to:")
print("1. Download and process the datasets (Parts 1-2)")
print("2. Create train/eval datasets")
print("3. Uncomment and run the training code below")

# Uncomment to train (requires datasets to be ready)
# trainer = Trainer(
#     model=asr_model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
#     tokenizer=asr_processor.feature_extractor,
#     data_collator=data_collator,
#     compute_metrics=compute_metrics,
# )
# 
# # Train
# trainer.train()
# 
# # Save model
# trainer.save_model(str(MODEL_DIR / "asr" / "final"))
# print("\nASR model saved!")

---
## Part 3B: TTS Training (The Mouth)

Fine-tune XTTS v2 to speak Ghanaian languages.

### 3B.1 Install Coqui TTS

In [None]:
# Install Coqui TTS if not already installed
# !pip install TTS

# Install espeak-ng for phonemization
# Linux: sudo apt-get install espeak-ng
# Windows: Download from https://github.com/espeak-ng/espeak-ng/releases

try:
    from TTS.api import TTS
    print("Coqui TTS imported successfully")
except ImportError:
    print("Please install Coqui TTS: pip install TTS")

### 3B.2 Download Base XTTS Model

In [None]:
from TTS.utils.manage import ModelManager
import os

# XTTS v2 model name
TTS_MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"

print(f"Downloading base XTTS v2 model...")
print("This will download ~2-3GB on first run.")

# Download model
model_manager = ModelManager()

# Check if already downloaded
model_path = os.path.join(
    model_manager.output_prefix,
    TTS_MODEL_NAME.replace("/", "--")
)

if not os.path.exists(model_path):
    print("Model not found locally.")
    print("Uncomment the next line to download it:")
    # model_manager.download_model(TTS_MODEL_NAME)
else:
    print(f"Model already downloaded at: {model_path}")

### 3B.3 TTS Training Configuration

In [None]:
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig

# Dataset configuration
DATASET_PATH = str(config.dataset.tts_data_dir / "unified")
OUTPUT_PATH = str(MODEL_DIR / "tts")

# Create configs
print("TTS Training Configuration")
print("=" * 60)
print(f"Dataset path: {DATASET_PATH}")
print(f"Output path: {OUTPUT_PATH}")
print("\nTraining settings:")
print(f"  Batch size: 2 (small for memory)")
print(f"  Epochs: 10")
print(f"  Learning rate: 5e-6")
print(f"  Carrier language: 'en' (English base)")
print("\nWhat gets trained:")
print(f"  GPT layers: Yes (learns language patterns)")
print(f"  HiFi-GAN vocoder: No (already good)")
print(f"  Speaker encoder: No (uses cloning)")

### 3B.4 Train TTS Model

In [None]:
# Aliased so it does not shadow transformers.Trainer from Part 3A
from trainer import Trainer as CoquiTrainer, TrainerArgs
from TTS.tts.models.xtts import Xtts

print("=" * 60)
print("TTS TRAINING")
print("=" * 60)
print("\nTo train the TTS model, you need to:")
print("1. Download and process BibleTTS datasets")
print("2. Create unified TTS dataset with metadata.csv")
print("3. Uncomment and run the training code below")

# Uncomment to train (requires datasets to be ready)
# Note: Coqui's published XTTS fine-tuning recipe goes through GPTTrainer/GPTTrainerConfig;
# treat the settings below as a sketch and check them against your installed TTS version.
# 
# # Load config (named xtts_config so the project-level `config` is not overwritten)
# xtts_config = XttsConfig()
# xtts_config.load_json(f"{model_path}/config.json")
# 
# # Configure dataset
# xtts_config.dataset_config = BaseDatasetConfig(
#     formatter="coqui",
#     dataset_name="ghana_unified",
#     path=DATASET_PATH,
#     meta_file_train="metadata.csv",
#     language="en"
# )
# 
# # Training settings
# xtts_config.batch_size = 2
# xtts_config.epochs = 10
# xtts_config.lr = 5e-6
# xtts_config.output_path = OUTPUT_PATH
# 
# # What to train
# xtts_config.train_gpt = True
# xtts_config.train_hifi_gan = False
# xtts_config.train_speaker_encoder = False
# 
# # Initialize model
# model = Xtts.init_from_config(xtts_config)
# model.load_checkpoint(xtts_config, checkpoint_dir=model_path, eval=True)
# 
# # Train
# trainer = CoquiTrainer(
#     TrainerArgs(),
#     xtts_config,
#     output_path=OUTPUT_PATH,
#     model=model,
# )
# 
# trainer.fit()
# print("\nTTS training complete!")

---
## Part 3C: Translation Setup (The Brain)

Set up NLLB-200 for translation between languages.

### 3C.1 Load NLLB Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# NLLB model - using distilled version for speed
MT_MODEL_ID = "facebook/nllb-200-distilled-600M"
# For better quality, use: "facebook/nllb-200-3.3B"

print(f"Loading translation model: {MT_MODEL_ID}")

mt_tokenizer = AutoTokenizer.from_pretrained(MT_MODEL_ID)
mt_model = AutoModelForSeq2SeqLM.from_pretrained(MT_MODEL_ID).to(device)

print(f"Translation model loaded!")
print(f"Parameters: {mt_model.num_parameters():,}")

### 3C.2 Language Codes Reference

In [None]:
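# NLLB language codes for Ghanaian languages
# Caution: NLLB-200 ships codes for Akan (aka_Latn), Twi (twi_Latn), and Ewe (ewe_Latn).
# gaa_Latn and dag_Latn do not appear to be in the released NLLB-200 language list.
# An unknown code resolves to the <unk> token and gives unreliable output, so Ga and
# Dagbani likely need a fine-tuned or extended model. Check the tokenizer before use.
NLLB_CODES = {
    "Akan (Twi)": "aka_Latn",
    "Ewe": "ewe_Latn",
    "Ga": "gaa_Latn",
    "Dagbani": "dag_Latn",
    "English": "eng_Latn",
    "French": "fra_Latn",
    "Hausa": "hau_Latn",
}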

print("NLLB Language Codes:")
print("=" * 40)
for lang, code in NLLB_CODES.items():
    print(f"  {lang:15} -> {code}")

### 3C.3 Translation Function

In [None]:
def translate(text, source_lang="aka_Latn", target_lang="eng_Latn"):
    """
    Translate text between languages using NLLB.
    
    Args:
        text: Text to translate
        source_lang: Source language NLLB code
        target_lang: Target language NLLB code
    
    Returns:
        Translated text
    """
    # Set source language
    mt_tokenizer.src_lang = source_lang
    
    # Tokenize
    inputs = mt_tokenizer(text, return_tensors="pt").to(device)
    
    # Get target language token ID
    target_token_id = mt_tokenizer.convert_tokens_to_ids(target_lang)
    
    # Generate
    with torch.no_grad():
        outputs = mt_model.generate(
            **inputs,
            forced_bos_token_id=target_token_id,
            max_length=200,
            num_beams=4
        )
    
    # Decode
    translation = mt_tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    
    return translation

print("Translation function ready!")
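
`translate` looks up the target language with `convert_tokens_to_ids`, which silently returns the unknown-token ID for a code the tokenizer does not know. A small guard can catch that before generation. The sketch below uses a stub tokenizer so it runs offline; `is_known_lang_code`, `StubTokenizer`, and the token IDs are hypothetical, while `convert_tokens_to_ids` and `unk_token_id` are the real tokenizer members the check relies on.

```python
def is_known_lang_code(tokenizer, code):
    """Return True if `code` maps to a real token rather than the unknown token."""
    token_id = tokenizer.convert_tokens_to_ids(code)
    return token_id is not None and token_id != tokenizer.unk_token_id


class StubTokenizer:
    """Hypothetical stand-in exposing the two members the check relies on."""

    unk_token_id = 3
    _vocab = {"eng_Latn": 10, "ewe_Latn": 11}  # arbitrary IDs for illustration

    def convert_tokens_to_ids(self, token):
        return self._vocab.get(token, self.unk_token_id)


stub = StubTokenizer()
for code in ["eng_Latn", "ewe_Latn", "xxx_Latn"]:
    print(code, is_known_lang_code(stub, code))
```

With the real tokenizer the call would be `is_known_lang_code(mt_tokenizer, "gaa_Latn")`; running it over every value in `NLLB_CODES` shows which languages the base model can actually target.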

### 3C.4 Test Translation

In [None]:
# Test translations
print("Testing Translation")
print("=" * 60)

# English to Twi
english_text = "Good morning, how are you today?"
twi_translation = translate(english_text, "eng_Latn", "aka_Latn")
print(f"\nEnglish: {english_text}")
print(f"Twi:     {twi_translation}")

# Twi to English (if you have Twi text)
# twi_text = "Maakye, wo ho te sen?"
# english_back = translate(twi_text, "aka_Latn", "eng_Latn")
# print(f"\nTwi:     {twi_text}")
# print(f"English: {english_back}")

---
# Part 4: Unified Pipeline & Inference
---

Combine all components into a single Speech-to-Speech pipeline.

## 4.1 Initialize Pipeline

In [None]:
from utils.pipeline import GhanaS2SPipeline

print("Initializing Ghana S2S Pipeline...")
print("This will load all models (LID, ASR, Translation, TTS)")
print("\nNote: This requires ~15-20GB GPU memory")

# Initialize pipeline
# Set paths to fine-tuned models if available
pipeline = GhanaS2SPipeline(
    device="cuda" if torch.cuda.is_available() else "cpu",
    asr_model_path=None,  # Use base MMS (or path to fine-tuned)
    tts_model_path=None,  # Use base XTTS (or path to fine-tuned)
    load_lid=True,        # Enable automatic language detection
    load_asr=True,
    load_tts=True,
    load_translation=True,
    use_8bit=True  # Memory optimization
)

## 4.2 Test Individual Components

In [None]:
# Test ASR (requires audio file)
# result = pipeline.listen("test_audio.wav", language="aka")
# print(f"Transcription: {result.text}")

# Test Translation
result = pipeline.think("Hello, how are you?", source_lang="eng", target_lang="aka")
print(f"Translation: {result.translated_text}")

# Test TTS
result = pipeline.speak(
    text="Good morning, welcome to Ghana!",
    output_path="test_output.wav",
    speaker="Twi_Speaker"
)
print(f"Audio saved to: {result.audio_path}")

## 4.3 Full Pipeline Demo

In [None]:
# Full Speech-to-Speech pipeline
# This requires an input audio file

print("Full S2S Pipeline Demo")
print("=" * 60)
print("\nTo run the full pipeline:")
print("1. Record or upload an audio file")
print("2. Specify source and target languages")
print("3. Optionally provide a speaker reference for voice cloning")

# Example usage (uncomment with actual audio file)
# result = pipeline.run_pipeline(
#     audio_input="input_english.wav",
#     source_lang="eng",
#     target_lang="aka",  # Translate to Twi
#     speaker_ref=None,   # Optional: 6-second reference clip
#     translate=True
# )
# 
# print(f"\nResults:")
# print(f"  Transcription: {result.transcription.text}")
# print(f"  Translation: {result.translation.translated_text}")
# print(f"  Output audio: {result.synthesis.audio_path}")
# print(f"  Total latency: {result.total_latency:.2f}s")

## 4.4 Audio Recording Widget (for Notebooks)

In [None]:
# Audio recording for Jupyter/Colab
# This creates a "Record" button in the notebook

try:
    from IPython.display import Javascript, Audio, display
    from google.colab import output
    from base64 import b64decode
    
    RECORD_JS = """
    const sleep = time => new Promise(resolve => setTimeout(resolve, time))
    const b2text = blob => new Promise(resolve => {
      const reader = new FileReader()
      reader.onloadend = e => resolve(e.srcElement.result)
      reader.readAsDataURL(blob)
    })
    var record = time => new Promise(async resolve => {
      stream = await navigator.mediaDevices.getUserMedia({ audio: true })
      recorder = new MediaRecorder(stream)
      chunks = []
      recorder.ondataavailable = e => chunks.push(e.data)
      recorder.start()
      await sleep(time)
      recorder.onstop = async ()=>{
        blob = new Blob(chunks)
        text = await b2text(blob)
        resolve(text)
      }
      recorder.stop()
    })
    """
    
    def record_audio(seconds=5):
        """Record audio for specified seconds."""
        display(Javascript(RECORD_JS))
        s = output.eval_js(f'record({seconds * 1000})')
        b = b64decode(s.split(',')[1])
        
        # MediaRecorder produces a compressed container (typically WebM or Ogg/Opus),
        # not WAV, so save it as such and convert (e.g. with ffmpeg) if WAV is needed.
        with open('recorded_audio.webm', 'wb') as f:
            f.write(b)
        
        return 'recorded_audio.webm'
    
    print("Audio recording available! Use record_audio(seconds) to record.")
    
except ImportError:
    print("Audio recording not available in this environment.")
    print("Please upload audio files manually.")

## 4.5 Automatic Language Detection

In [None]:
# Automatic Language Detection using MMS-LID
# The pipeline can automatically detect which Ghanaian language is being spoken

# Example 1: Detect language from audio file
# audio_path = "test_audio.wav"
# detected_lang = pipeline.detect_language(audio_path)
# print(f"Detected language: {detected_lang}")

# Example 2: Get confidence scores for top languages
# scores = pipeline.detect_language(audio_path, top_k=5, return_all_scores=True)
# for lang, score in scores.items():
#     print(f"  {lang}: {score:.2%}")

# Example 3: Transcribe with auto-detection
# result = pipeline.listen("audio.wav", language="auto")
# print(f"Detected: {result.language}, Text: {result.text}")

# Example 4: Full S2S pipeline with auto-detection
# result = pipeline.run_pipeline(
#     audio_input="input.wav",
#     source_lang="auto",  # Auto-detect input language
#     target_lang="eng"    # Translate to English
# )

print("Language Detection Available!")
print("\nSupported languages for auto-detection:")
for code, name in pipeline.SUPPORTED_LANGUAGES.items():
    print(f"  {code}: {name}")
print("\nUsage: pipeline.detect_language('audio.wav')")
print("   or: pipeline.listen('audio.wav', language='auto')")
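A general-purpose language-ID model can assign probability to languages the pipeline does not handle. The sketch below keeps only the supported codes and renormalizes the scores. It assumes `detect_language(..., return_all_scores=True)` returns a dict mapping language codes to probabilities, as Example 2 suggests. `restrict_scores` is a hypothetical helper.

```python
# Codes taken from the notebook's Supported Languages list
SUPPORTED = {"aka", "ewe", "gaa", "dag", "eng"}


def restrict_scores(scores: dict, allowed=SUPPORTED) -> dict:
    """Drop unsupported languages and renormalize the remaining scores."""
    kept = {lang: score for lang, score in scores.items() if lang in allowed}
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing supported was detected
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return {lang: score / total for lang, score in ranked}


example = restrict_scores({"aka": 0.3, "fra": 0.5, "ewe": 0.1, "hau": 0.1})
for lang, score in example.items():
    print(f"  {lang}: {score:.2%}")
```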

---
# Part 5: Evaluation & Benchmarking
---

## 5.1 ASR Evaluation (WER)

In [None]:
import evaluate

# Load metrics
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def evaluate_asr(predictions, references):
    """
    Evaluate ASR performance.
    
    Args:
        predictions: List of predicted transcriptions
        references: List of ground truth transcriptions
    
    Returns:
        Dictionary with WER and CER scores
    """
    wer = wer_metric.compute(predictions=predictions, references=references)
    cer = cer_metric.compute(predictions=predictions, references=references)
    
    return {
        "wer": wer * 100,  # Percentage
        "cer": cer * 100
    }

print("ASR Evaluation Metrics:")
print("  - WER (Word Error Rate): Lower is better")
print("  - CER (Character Error Rate): Lower is better")
print("\nTarget: WER < 20% for production quality")
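To make the two metrics concrete: WER is the word-level edit distance between prediction and reference divided by the number of reference words, and CER is the same at the character level. The dependency-free sketch below scores a single utterance. It is for illustration only and does not replace the `evaluate` metrics loaded above.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, prediction: str) -> float:
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def char_error_rate(reference: str, prediction: str) -> float:
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(prediction)) / len(reference)
```

For example, dropping one word from a four-word reference gives a WER of 0.25, which the function above reports as 25% after multiplying by 100.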

## 5.2 Translation Evaluation (BLEU)

In [None]:
from sacrebleu import corpus_bleu

def evaluate_translation(predictions, references):
    """
    Evaluate translation quality using BLEU score.
    
    Args:
        predictions: List of translated texts
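        references: List of reference translations (one per prediction), or a
            list of lists holding several references per prediction
    
    Returns:
        Dictionary with the BLEU score and n-gram precisions
    """
    # sacrebleu expects reference streams: each stream is a list with one
    # reference per prediction, so per-prediction lists are transposed.
    if isinstance(references[0], str):
        references = [references]
    else:
        references = [list(stream) for stream in zip(*references)]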
    
    bleu = corpus_bleu(predictions, references)
    
    return {
        "bleu": bleu.score,
        "precisions": bleu.precisions
    }

print("Translation Evaluation Metrics:")
print("  - BLEU Score: Higher is better (0-100)")
print("\nTarget: BLEU > 20 for reasonable quality")
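sacrebleu's `corpus_bleu(hypotheses, references)` takes references as a list of streams, where each stream holds one reference per hypothesis. The sketch below shows that layout with a hypothetical helper, `to_reference_streams`. It accepts either a flat list of strings or per-prediction lists of alternatives.

```python
def to_reference_streams(references):
    """Convert references into the stream layout used by sacrebleu."""
    if not references:
        raise ValueError("references must not be empty")
    if isinstance(references[0], str):
        # One reference per prediction: a single stream
        return [list(references)]
    n_refs = len(references[0])
    if any(len(refs) != n_refs for refs in references):
        raise ValueError("every prediction needs the same number of references")
    # Transpose: stream k holds the k-th reference of every prediction
    return [list(stream) for stream in zip(*references)]


print(to_reference_streams(["good morning", "thank you"]))
print(to_reference_streams([["good morning", "morning"], ["thank you", "thanks"]]))
```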

## 5.3 End-to-End Latency Test

In [None]:
import time

def benchmark_pipeline(pipeline, audio_path, num_runs=5):
    """
    Benchmark pipeline latency.
    
    Args:
        pipeline: GhanaS2SPipeline instance
        audio_path: Path to test audio
        num_runs: Number of runs for averaging
    
    Returns:
        Dictionary with latency statistics
    """
    latencies = []
    
    for _ in range(num_runs):
        # perf_counter is monotonic and has higher resolution than time.time()
        start = time.perf_counter()
        
        pipeline.run_pipeline(
            audio_input=audio_path,
            source_lang="eng",
            target_lang="aka",
            translate=True
        )
        
        latencies.append(time.perf_counter() - start)
    
    return {
        "mean_latency": sum(latencies) / len(latencies),
        "min_latency": min(latencies),
        "max_latency": max(latencies)
    }

print("Latency Benchmarking:")
print("  - Target: < 3 seconds for interactive use")
print("  - Acceptable: < 10 seconds for offline processing")
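The first call to a model often includes one-off costs such as weight loading or CUDA initialization, and a mean can hide occasional slow runs. The sketch below adds warm-up runs and reports the median and 95th percentile. `time_callable` and `summarize` are hypothetical helpers that take any zero-argument callable, so they run without the pipeline.

```python
import math
import statistics
import time


def time_callable(fn, num_runs=5, warmup=1):
    """Time fn() num_runs times after discarding warmup runs."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return mean, median, nearest-rank p95, min, and max in seconds."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "min": ordered[0],
        "max": ordered[-1],
    }
```

With the pipeline loaded, the callable could be something like `lambda: pipeline.run_pipeline(audio_input="test.wav", source_lang="eng", target_lang="aka", translate=True)`.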

---
# Part 6: Deployment & Serving
---

## 6A: Gradio Web Interface

In [None]:
from utils.serving import create_gradio_interface, launch_gradio

print("Launching Gradio Interface...")
print("\nFeatures:")
print("  - Microphone recording")
print("  - File upload")
print("  - Language selection")
print("  - Voice cloning with reference audio")

# Launch Gradio (uncomment to run)
# interface = create_gradio_interface(pipeline)
# interface.launch(
#     share=False,  # Set True for public link
#     server_port=7860
# )

## 6B: FastAPI REST Endpoints

In [None]:
from utils.serving import create_fastapi_app

print("FastAPI Endpoints:")
print("=" * 60)
print("\nAvailable endpoints:")
print("  POST /api/transcribe    - Speech to text")
print("  POST /api/translate     - Text translation")
print("  POST /api/synthesize    - Text to speech")
print("  POST /api/speech-to-speech - Full pipeline")
print("  GET  /api/languages     - List supported languages")
print("  GET  /health            - Health check")

# Create app (for running with uvicorn)
# app = create_fastapi_app(pipeline)

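# To run with uvicorn, expose a module-level `app` built from a loaded pipeline
# (for example in a hypothetical main.py containing `app = create_fastapi_app(pipeline)`), then:
#   uvicorn main:app --host 0.0.0.0 --port 8000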
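A client call to the translation endpoint could look like the sketch below, using only the standard library. The JSON field names (`text`, `source_lang`, `target_lang`) are assumptions for illustration. The actual request schema is defined in `utils.serving` and may differ.

```python
import json
import urllib.request


def build_translate_request(base_url, text, source_lang, target_lang):
    """Build (but do not send) a POST request for /api/translate."""
    # Field names below are hypothetical; check the server's schema
    payload = {"text": text, "source_lang": source_lang, "target_lang": target_lang}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_translate_request("http://localhost:8000", "Maakye", "aka", "eng")
print(request.get_method(), request.full_url)

# To send it once the server is running:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.loads(response.read()))
```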

## 6C: Docker Deployment

In [None]:
from utils.serving import generate_docker_files

print("Generating Docker Configuration...")

# Generate Docker files
# generate_docker_files()

print("\nDocker commands:")
print("  Build:  docker-compose build")
print("  Run:    docker-compose up")
print("  Stop:   docker-compose down")
print("\nThe API will be available at:")
print("  - REST API: http://localhost:8000")
print("  - Gradio:   http://localhost:7860")

---
# Part 7: Upload Models to HuggingFace
---

Share your trained models with the community by uploading to HuggingFace Hub.

## 7.1 Install HuggingFace Hub

In [None]:
# Install huggingface_hub if not already installed
# !pip install huggingface_hub

from huggingface_hub import HfApi, create_repo, upload_folder
from huggingface_hub import login as hf_login
import os
from pathlib import Path

print("HuggingFace Hub imported successfully")

## 7.2 Authenticate with HuggingFace

In [None]:
# Login to HuggingFace
# You need a HuggingFace account and access token
# Get your token from: https://huggingface.co/settings/tokens

# Option 1: Interactive login (will prompt for token)
# hf_login()

# Option 2: Login with token directly
# hf_login(token="your_token_here")

# Option 3: Set environment variable
# os.environ["HF_TOKEN"] = "your_token_here"

print("Authentication options:")
print("  1. Run: hf_login() - Interactive prompt")
print("  2. Run: hf_login(token='hf_xxx') - Direct token")
print("  3. Set: os.environ['HF_TOKEN'] = 'hf_xxx'")
print("\nGet your token at: https://huggingface.co/settings/tokens")
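Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. The sketch below reads the token from an environment variable instead and fails with a clear message if it is missing. `get_hf_token` is a hypothetical helper.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in env_var, or raise if it is not set."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set. Create a token at "
            "https://huggingface.co/settings/tokens and export it first."
        )
    return token
```

With the variable exported in the shell that starts Jupyter, authentication becomes `hf_login(token=get_hf_token())`.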

## 7.3 Configure Upload Settings

In [None]:
# ============================================================================
# UPLOAD CONFIGURATION - MODIFY THESE
# ============================================================================

# Your HuggingFace username or organization
HF_USERNAME = "your-username"  # Change this!

# Repository names for each model
ASR_REPO_NAME = "ghana-asr-mms-akan-ewe-ga-dagbani"
TTS_REPO_NAME = "ghana-tts-xtts-akan-ewe-ga-dagbani"

# Local model paths
ASR_MODEL_PATH = MODEL_DIR / "asr" / "final"
TTS_MODEL_PATH = MODEL_DIR / "tts"  # Will look for latest checkpoint

# Model metadata
# ISO 639-1 codes where one exists (ak, ee, en), otherwise ISO 639-3 (gaa, dag)
MODEL_CARD_LANGUAGES = ["ak", "ee", "gaa", "dag", "en"]
MODEL_TAGS = [
    "speech-recognition",
    "text-to-speech",
    "akan",
    "twi",
    "ewe",
    "ga",
    "dagbani",
    "ghana",
    "african-languages",
    "low-resource",
]

print(f"ASR Repo: {HF_USERNAME}/{ASR_REPO_NAME}")
print(f"TTS Repo: {HF_USERNAME}/{TTS_REPO_NAME}")
print(f"\nASR Model Path: {ASR_MODEL_PATH}")
print(f"TTS Model Path: {TTS_MODEL_PATH}")
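Since `HF_USERNAME` ships with a placeholder value, it can help to check it before any upload is attempted. The sketch below builds the repository ID and rejects the placeholder. `build_repo_id` and `PLACEHOLDER_USERNAMES` are hypothetical names, and the checks are deliberately simpler than the Hub's own validation.

```python
PLACEHOLDER_USERNAMES = {"", "your-username"}


def build_repo_id(username: str, repo_name: str) -> str:
    """Combine username and repo name into 'username/repo_name'."""
    username = username.strip()
    repo_name = repo_name.strip()
    if username in PLACEHOLDER_USERNAMES:
        raise ValueError("Set HF_USERNAME to your own account or organization first")
    if not repo_name or "/" in repo_name:
        raise ValueError("repo_name must be a non-empty name without '/'")
    return f"{username}/{repo_name}"


print(build_repo_id("my-org", "ghana-asr-mms-akan-ewe-ga-dagbani"))
```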

## 7.4 Create Model Cards

In [None]:
def create_asr_model_card(repo_name: str, languages: list) -> str:
    """Create a model card for the ASR model."""
    lang_yaml = '\n'.join([f'- {lang}' for lang in languages])
    model_card = f"""---
language:
{lang_yaml}
license: cc-by-nc-4.0
tags:
- automatic-speech-recognition
- mms
- akan
- twi
- ewe
- ga
- dagbani
- ghana
- african-languages
- pytorch
- lora
datasets:
- UGSpeechData
base_model: facebook/mms-1b-all
pipeline_tag: automatic-speech-recognition
---

# Ghana ASR: Multilingual Speech Recognition for Ghanaian Languages

This model is a fine-tuned version of [facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all) for Ghanaian languages.

## Supported Languages

| Language | Code | Hours Trained |
|----------|------|---------------|
| Akan (Twi/Fante) | aka | ~100 |
| Ewe | ewe | ~100 |
| Ga | gaa | ~50 |
| Dagbani | dag | ~100 |

## Usage

```python
from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch
import librosa

model = Wav2Vec2ForCTC.from_pretrained("{HF_USERNAME}/{repo_name}")
processor = AutoProcessor.from_pretrained("{HF_USERNAME}/{repo_name}")

# Set target language
processor.tokenizer.set_target_lang("aka")  # Akan
model.load_adapter("aka")

# Load audio
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Transcribe
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```

## Training

- **Base Model**: facebook/mms-1b-all
- **Fine-tuning Method**: LoRA adapters
- **Dataset**: UGSpeechData (University of Ghana)
- **Training Hardware**: NVIDIA RTX 3090/4090

## Citation

If you use this model, please cite:

```bibtex
@misc{{ghana-asr,
  title={{Ghana ASR: Multilingual Speech Recognition for Ghanaian Languages}},
  year={{2024}},
  url={{https://huggingface.co/{HF_USERNAME}/{repo_name}}}
}}
```
"""
    return model_card


def create_tts_model_card(repo_name: str, languages: list) -> str:
    """Create a model card for the TTS model."""
    lang_yaml = '\n'.join([f'- {lang}' for lang in languages])
    model_card = f"""---
language:
{lang_yaml}
license: cc-by-nc-4.0
tags:
- text-to-speech
- xtts
- akan
- twi
- ewe
- ga
- dagbani
- ghana
- african-languages
- voice-cloning
datasets:
- BibleTTS
base_model: coqui/XTTS-v2
pipeline_tag: text-to-speech
---

# Ghana TTS: Multilingual Text-to-Speech for Ghanaian Languages

This model is a fine-tuned version of XTTS v2 for Ghanaian languages.

## Supported Languages/Speakers

| Speaker Name | Language |
|--------------|----------|
| Twi_Speaker | Akan (Twi) |
| Ewe_Speaker | Ewe |
| Ga_Speaker | Ga |
| Dagbani_Speaker | Dagbani |

## Usage

```python
from TTS.api import TTS

# Load model
tts = TTS(model_path="{HF_USERNAME}/{repo_name}")

# Generate speech
tts.tts_to_file(
    text="Maakye! Wo ho te sen?",
    speaker="Twi_Speaker",
    language="en",  # Carrier language
    file_path="output.wav"
)
```

## Training

- **Base Model**: XTTS v2
- **Dataset**: BibleTTS (OpenSLR)
- **Training Hardware**: NVIDIA RTX 3090/4090

## Citation

```bibtex
@misc{{ghana-tts,
  title={{Ghana TTS: Multilingual Text-to-Speech for Ghanaian Languages}},
  year={{2024}},
  url={{https://huggingface.co/{HF_USERNAME}/{repo_name}}}
}}
```
"""
    return model_card


print("Model card templates created!")
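The Hub reads the metadata block between the two `---` lines at the top of `README.md`, so it is worth checking that a generated card opens and closes that block correctly. The sketch below splits a card into metadata and body and reads simple list values. Both helpers are hypothetical and handle only the flat layout used by the templates above, not general YAML.

```python
def split_front_matter(card: str):
    """Return (metadata_text, body_text) for a card starting with '---'."""
    lines = card.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("model card must start with a '---' line")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            metadata = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return metadata, body
    raise ValueError("front matter is not closed with '---'")


def list_values(metadata: str, key: str):
    """Collect the '- item' lines that directly follow 'key:'."""
    values = []
    collecting = False
    for line in metadata.splitlines():
        if collecting:
            if line.startswith("- "):
                values.append(line[2:].strip())
                continue
            break  # the list for this key has ended
        if line.startswith(f"{key}:"):
            collecting = True
    return values


sample = "---\nlanguage:\n- ak\n- ee\nlicense: cc-by-nc-4.0\n---\n\n# Title\n"
metadata, body = split_front_matter(sample)
print(list_values(metadata, "language"))
```

The same check could be applied to the output of `create_asr_model_card(...)` before writing it to disk.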

## 7.5 Upload ASR Model

In [None]:
def upload_asr_model(
    model_path: Path,
    repo_name: str,
    username: str,
    private: bool = False
):
    """
    Upload ASR model to HuggingFace Hub.
    
    Args:
        model_path: Path to trained model directory
        repo_name: Repository name on HuggingFace
        username: HuggingFace username or organization
        private: Whether to make the repo private
    """
    from huggingface_hub import HfApi, create_repo, upload_folder
    
    repo_id = f"{username}/{repo_name}"
    
    print(f"Uploading ASR model to: {repo_id}")
    print(f"Model path: {model_path}")
    
    # Check if model exists
    if not model_path.exists():
        print(f"ERROR: Model not found at {model_path}")
        print("Please train the model first (Part 3A).")
        return None
    
    # Create repository
    try:
        create_repo(repo_id, private=private, exist_ok=True)
        print(f"Repository created/exists: {repo_id}")
    except Exception as e:
        print(f"Error creating repo: {e}")
        return None
    
    # Create and save model card
    model_card = create_asr_model_card(repo_name, MODEL_CARD_LANGUAGES)
    readme_path = model_path / "README.md"
    with open(readme_path, "w", encoding="utf-8") as f:
        f.write(model_card)
    print("Model card created")
    
    # Upload folder
    api = HfApi()
    api.upload_folder(
        folder_path=str(model_path),
        repo_id=repo_id,
        repo_type="model",
    )
    
    print(f"\nUpload complete!")
    print(f"View model at: https://huggingface.co/{repo_id}")
    return repo_id


# Uncomment to upload
# upload_asr_model(ASR_MODEL_PATH, ASR_REPO_NAME, HF_USERNAME)

print("ASR upload function ready.")
print("Run: upload_asr_model(ASR_MODEL_PATH, ASR_REPO_NAME, HF_USERNAME)")
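Before uploading a whole folder, a dry run that lists what would be sent can catch stray optimizer states or intermediate checkpoints. The sketch below walks a directory and flags files above a size threshold. `list_upload_files` is a hypothetical helper, and the demo uses a temporary directory rather than a real model.

```python
import tempfile
from pathlib import Path


def list_upload_files(folder, large_file_bytes=5 * 1024**3):
    """Return (relative_path, size_in_bytes, is_large) for every file in folder."""
    folder = Path(folder)
    report = []
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            report.append((path.relative_to(folder).as_posix(), size, size > large_file_bytes))
    return report


# Demo on a throwaway directory with a tiny threshold
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}", encoding="utf-8")
    (Path(tmp) / "adapter").mkdir()
    (Path(tmp) / "adapter" / "weights.bin").write_bytes(b"\x00" * 2048)
    report = list_upload_files(tmp, large_file_bytes=1024)

for name, size, is_large in report:
    print(f"{name}: {size} bytes{' (large)' if is_large else ''}")
```

On a trained model this would be called as `list_upload_files(ASR_MODEL_PATH)`.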

## 7.6 Upload TTS Model

In [None]:
def upload_tts_model(
    model_path: Path,
    repo_name: str,
    username: str,
    private: bool = False
):
    """
    Upload TTS model to HuggingFace Hub.
    
    Args:
        model_path: Path to trained model directory
        repo_name: Repository name on HuggingFace
        username: HuggingFace username or organization
        private: Whether to make the repo private
    """
    from huggingface_hub import HfApi, create_repo, upload_folder
    import glob
    
    repo_id = f"{username}/{repo_name}"
    
    print(f"Uploading TTS model to: {repo_id}")
    
    # Find the latest checkpoint
    if not model_path.exists():
        print(f"ERROR: Model directory not found at {model_path}")
        print("Please train the model first (Part 3B).")
        return None
    
    # Look for checkpoint directories (XTTS saves with timestamps)
    checkpoints = list(model_path.glob("run-*"))
    if checkpoints:
        latest_checkpoint = max(checkpoints, key=lambda x: x.stat().st_mtime)
        print(f"Found checkpoint: {latest_checkpoint}")
        upload_path = latest_checkpoint
    else:
        upload_path = model_path
    
    # Create repository
    try:
        create_repo(repo_id, private=private, exist_ok=True)
        print(f"Repository created/exists: {repo_id}")
    except Exception as e:
        print(f"Error creating repo: {e}")
        return None
    
    # Create and save model card
    model_card = create_tts_model_card(repo_name, MODEL_CARD_LANGUAGES)
    readme_path = upload_path / "README.md"
    with open(readme_path, "w", encoding="utf-8") as f:
        f.write(model_card)
    print("Model card created")
    
    # Upload folder
    api = HfApi()
    api.upload_folder(
        folder_path=str(upload_path),
        repo_id=repo_id,
        repo_type="model",
    )
    
    print(f"\nUpload complete!")
    print(f"View model at: https://huggingface.co/{repo_id}")
    return repo_id


# Uncomment to upload
# upload_tts_model(TTS_MODEL_PATH, TTS_REPO_NAME, HF_USERNAME)

print("TTS upload function ready.")
print("Run: upload_tts_model(TTS_MODEL_PATH, TTS_REPO_NAME, HF_USERNAME)")
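The checkpoint selection above picks the most recently modified `run-*` entry. The standalone sketch below shows the same idea, with an added assumption that only directories should count. `find_latest_run` is a hypothetical helper, and the demo sets modification times explicitly so the result is predictable.

```python
import os
import tempfile
from pathlib import Path


def find_latest_run(model_dir, pattern="run-*"):
    """Return the most recently modified directory matching pattern, or None."""
    candidates = [p for p in Path(model_dir).glob(pattern) if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    older = Path(tmp) / "run-older"
    newer = Path(tmp) / "run-newer"
    older.mkdir()
    newer.mkdir()
    # Set access/modification times explicitly (seconds since the epoch)
    os.utime(older, (1_000, 1_000))
    os.utime(newer, (2_000, 2_000))
    latest_name = find_latest_run(tmp).name
    empty_result = find_latest_run(tmp, pattern="missing-*")

print(latest_name)
```

Modification time is only a heuristic. If a run directory is touched after training (for example by writing a README into it), it can outrank a newer run.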

## 7.7 Upload Both Models

In [None]:
def upload_all_models(username: str, private: bool = False):
    """Upload both ASR and TTS models to HuggingFace."""
    print("=" * 60)
    print("UPLOADING ALL MODELS TO HUGGINGFACE")
    print("=" * 60)
    
    results = {}
    
    # Upload ASR
    print("\n[1/2] Uploading ASR Model...")
    asr_result = upload_asr_model(ASR_MODEL_PATH, ASR_REPO_NAME, username, private)
    results['asr'] = asr_result
    
    # Upload TTS
    print("\n[2/2] Uploading TTS Model...")
    tts_result = upload_tts_model(TTS_MODEL_PATH, TTS_REPO_NAME, username, private)
    results['tts'] = tts_result
    
    # Summary
    print("\n" + "=" * 60)
    print("UPLOAD SUMMARY")
    print("=" * 60)
    for model, result in results.items():
        status = "SUCCESS" if result else "FAILED"
        print(f"  {model.upper()}: {status}")
        if result:
            print(f"    URL: https://huggingface.co/{result}")
    
    return results


# Uncomment to upload both models
# upload_all_models(HF_USERNAME, private=False)

print("Ready to upload! Steps:")
print("  1. Set HF_USERNAME to your HuggingFace username")
print("  2. Run: hf_login() to authenticate")
print("  3. Run: upload_all_models(HF_USERNAME)")

## 7.8 Push Model with Git LFS (Alternative Method)

In [None]:
# Alternative: Use Git LFS for large files
# This is useful if you have very large model files

print("""
ALTERNATIVE: Git LFS Upload
===========================

For very large models, you can use Git LFS:

1. Install Git LFS:
   git lfs install

2. Clone your HuggingFace repo:
   git clone https://huggingface.co/YOUR_USERNAME/YOUR_REPO
   cd YOUR_REPO

3. Copy model files:
   cp -r /path/to/model/* .

4. Track large files with LFS:
   git lfs track "*.bin"
   git lfs track "*.safetensors"
   git lfs track "*.pth"

5. Commit and push:
   git add .
   git commit -m "Add model files"
   git push

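This method is an option for very large model files. For files > 5GB, also run
`huggingface-cli lfs-enable-largefiles .` inside the cloned repo before pushing.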
""")

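Each `git lfs track` command in step 4 appends a rule to the repository's `.gitattributes` file. The sketch below generates those rules for a list of patterns, which can be handy for checking an existing `.gitattributes` against what the model folder needs. `lfs_attribute_lines` is a hypothetical helper.

```python
def lfs_attribute_lines(patterns):
    """Return the .gitattributes rules that `git lfs track` writes for patterns."""
    return [f"{pattern} filter=lfs diff=lfs merge=lfs -text" for pattern in patterns]


for line in lfs_attribute_lines(["*.bin", "*.safetensors", "*.pth"]):
    print(line)
```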
---
# Appendix
---

## A1: Troubleshooting Guide

In [None]:
print("""
TROUBLESHOOTING GUIDE
=====================

1. OUT OF MEMORY (OOM) ERROR
   - Reduce batch_size (try 1 or 2)
   - Enable gradient_checkpointing
   - Use 8-bit quantization
   - Clear cache: torch.cuda.empty_cache()

2. SLOW TRAINING
   - Enable fp16 (mixed precision)
   - Increase num_workers for data loading
   - Use gradient accumulation instead of larger batch

3. ROBOTIC TTS OUTPUT
   - Train for more epochs
   - Use cleaner audio data
   - Try different speaker reference

4. POOR ASR ACCURACY
   - Add more training data
   - Fine-tune longer
   - Mix clean and noisy data

5. WRONG PRONUNCIATION
   - Phonetize text input
   - Add more examples in training data
   - Lower learning rate
""")

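Reducing `batch_size` to avoid OOM errors need not change the effective batch size if gradient accumulation makes up the difference. The arithmetic is sketched below. The function names are hypothetical, and the resulting values would map onto whatever batch-size and accumulation settings the training configuration exposes.

```python
import math


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Samples contributing to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def accumulation_steps_for(target_batch, per_device_batch, num_devices=1):
    """Smallest number of accumulation steps that reaches target_batch."""
    per_step = per_device_batch * num_devices
    if per_step <= 0:
        raise ValueError("per_device_batch and num_devices must be positive")
    return math.ceil(target_batch / per_step)


steps = accumulation_steps_for(target_batch=32, per_device_batch=2)
print(f"batch_size=2 with {steps} accumulation steps "
      f"-> effective batch {effective_batch_size(2, steps)}")
```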
## A2: Language Code Reference

In [None]:
print("""
LANGUAGE CODE REFERENCE
=======================

Language        MMS (ASR)    NLLB (Translation)    TTS Speaker
-----------     ---------    ------------------    -----------
Akan (Twi)      aka          aka_Latn              Twi_Speaker
Ewe             ewe          ewe_Latn              Ewe_Speaker
Ga              gaa          gaa_Latn              Ga_Speaker
Dagbani         dag          dag_Latn              Dagbani_Speaker
Dagaare         dga          N/A                   N/A
English         eng          eng_Latn              N/A
Hausa           hau          hau_Latn              N/A
""")

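The reference table can also be kept as a lookup structure so that each stage of the pipeline gets the code format it expects. In the sketch below, the values are copied from the table as given and are not independently verified against each model's language list. `LANGUAGE_CODES` and `nllb_code` are hypothetical names.

```python
# Keys are the MMS (ASR) codes; values mirror the table's other columns
LANGUAGE_CODES = {
    "aka": {"name": "Akan (Twi)", "nllb": "aka_Latn", "tts_speaker": "Twi_Speaker"},
    "ewe": {"name": "Ewe", "nllb": "ewe_Latn", "tts_speaker": "Ewe_Speaker"},
    "gaa": {"name": "Ga", "nllb": "gaa_Latn", "tts_speaker": "Ga_Speaker"},
    "dag": {"name": "Dagbani", "nllb": "dag_Latn", "tts_speaker": "Dagbani_Speaker"},
    "dga": {"name": "Dagaare", "nllb": None, "tts_speaker": None},
    "eng": {"name": "English", "nllb": "eng_Latn", "tts_speaker": None},
    "hau": {"name": "Hausa", "nllb": "hau_Latn", "tts_speaker": None},
}


def nllb_code(mms_code: str) -> str:
    """Map an MMS code to its NLLB code, failing loudly when none is listed."""
    entry = LANGUAGE_CODES.get(mms_code)
    if entry is None:
        raise ValueError(f"Unknown language code: {mms_code}")
    if entry["nllb"] is None:
        raise ValueError(f"No NLLB code listed for {entry['name']}")
    return entry["nllb"]


print(nllb_code("ewe"))
```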
## A3: Model Card & Citations

In [None]:
print("""
MODEL CITATIONS
===============

MMS (Massively Multilingual Speech):
  @article{pratap2023mms,
    title={Scaling Speech Technology to 1,000+ Languages},
    author={Pratap, Vineel and others},
    journal={arXiv preprint arXiv:2305.13516},
    year={2023}
  }

NLLB-200:
  @article{costa2022no,
    title={No Language Left Behind},
    author={Costa-juss{\\`a}, Marta R and others},
    journal={arXiv preprint arXiv:2207.04672},
    year={2022}
  }

XTTS:
  @misc{coqui-ai-tts,
    author={Coqui AI},
    title={XTTS: Cross-Lingual Text-to-Speech},
    year={2023},
    url={https://github.com/coqui-ai/TTS}
  }

UGSpeechData:
  @article{ugspeechdata2024,
    title={UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages},
    author={University of Ghana HCI Lab},
    year={2024}
  }
""")

## A4: Next Steps & Improvements

In [None]:
print("""
FUTURE IMPROVEMENTS
===================

1. ADD MORE LANGUAGES
   - Dagaare, Ikposo from UGSpeechData
   - Nzema, Dangme from other sources

2. IMPROVE TTS QUALITY
   - Collect more studio-quality recordings
   - Train language-specific vocoders
   - Add emotion/prosody control

3. REDUCE LATENCY
   - Implement streaming ASR
   - Use smaller distilled models
   - Optimize with TensorRT/ONNX

4. EXPAND DOMAINS
   - Healthcare terminology
   - Agricultural content
   - Educational materials

5. MOBILE DEPLOYMENT
   - Quantize models for mobile
   - Create Android/iOS SDK
   - Offline capability
""")

---
## End of Notebook

**Congratulations!** You've completed the Ghana Speech-to-Speech Pipeline tutorial.

For questions or contributions, please open an issue on the project repository.

---