<a href="https://colab.research.google.com/github/samiabat/thai-colab/blob/copilot%2Ffix-8cbcd932-40d8-4df6-b956-9da8ebff2d14/almostworking-Thai_TTS_Finetune_MMS_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🇹🇭 Thai TTS Fine-tuning (MMS-TTS → VITS) — Google Colab Notebook

This notebook fine-tunes **Meta MMS-TTS (Thai)** on your own dataset using the **`finetune-hf-vits`** recipe.
It assumes you already have:
- a folder of WAV files (ideally mono 16 kHz), and
- a transcript file (CSV/TSV) or a way to map each audio to its Thai text.

> **Note:** MMS-TTS is released under **CC-BY-NC 4.0**. If you need commercial use, consider training from scratch or a different base.

## ✅ Recent Fixes
- ✅ Fixed configuration format to match `run_vits_finetuning.py` expectations
- ✅ Fixed dataset column naming: corrected `file_name` → `path` to match training script expectations
- ✅ Resolved `ValueError: You are trying to load a dataset that was saved using save_to_disk` issue by switching to CSV format
- ✅ Fixed `Some keys are not used by the HfArgumentParser` error
- ✅ **NEW**: Fixed audio loading issues by implementing disk-based audio loading
- ✅ **NEW**: Updated to use correct repository: `samiabat/finetune-hf-vits` instead of `ylacombe/finetune-hf-vits`

## 🔧 How Audio Loading Works Now

The training script now automatically detects CSV files and loads audio from disk instead of using HuggingFace datasets:
- **CSV Format**: Your dataset should be a CSV with `path` and `text` columns
- **Audio Loading**: Audio files are loaded directly from disk using librosa
- **Robust**: Handles various audio formats and sampling rates automatically
- **Memory Efficient**: Audio is loaded on-demand during training
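The on-demand loading described above can be sketched as follows. This is a minimal illustration, not the training script's actual code: the class name `LazyAudioCSV` is hypothetical, and the standard-library `wave` module stands in for librosa so the sketch runs without extra dependencies (it therefore only reads 16-bit PCM WAV files).

```python
import csv
import wave
from array import array
from pathlib import Path


class LazyAudioCSV:
    """Index a `path,text` CSV up front; decode audio only when an item is requested."""

    def __init__(self, csv_path):
        with open(csv_path, newline="", encoding="utf-8") as f:
            # Only paths and transcripts are held in memory, never the waveforms
            self.rows = [(Path(r["path"]), r["text"]) for r in csv.DictReader(f)]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        path, text = self.rows[idx]
        with wave.open(str(path), "rb") as w:
            if w.getsampwidth() != 2:
                raise ValueError(f"{path}: this sketch only reads 16-bit PCM")
            sr = w.getframerate()
            n_channels = w.getnchannels()
            pcm = array("h", w.readframes(w.getnframes()))
        # Keep the first channel and scale int16 samples to floats in [-1, 1)
        samples = [s / 32768.0 for s in pcm[::n_channels]]
        return {"audio": samples, "sampling_rate": sr, "text": text}
```

Because the waveform is decoded inside `__getitem__`, memory use grows with the batch being processed rather than with the size of the whole dataset.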


In [1]:

import torch, sys, platform
print('Python:', sys.version)
print('Platform:', platform.platform())
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('CUDA device:', torch.cuda.get_device_name(0))
else:
    print('⚠️ No GPU detected. In Colab: Runtime → Change runtime type → T4/V100/A100 GPU')


Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
Platform: Linux-6.1.123+-x86_64-with-glibc2.35
CUDA available: True
CUDA device: Tesla T4


In [2]:

!pip -q install --upgrade pip
# Core
!pip -q install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip -q install transformers accelerate datasets soundfile librosa pythainlp pandas jiwer
# Recipe
!git clone -q https://github.com/samiabat/finetune-hf-vits.git
%cd finetune-hf-vits
!pip -q install -r requirements.txt
%cd -


/content/finetune-hf-vits
/content


In [3]:

from google.colab import drive
drive.mount('/content/drive')
print('✅ Google Drive mounted')


Mounted at /content/drive
✅ Google Drive mounted



## Configure your paths

- `DATA_ROOT`: the directory containing your WAV files (recursively scanned).
- Either provide `TRANSCRIPT_CSV` (a CSV with `path,text` columns) **or** let the notebook build one from sidecar `.txt` files stored next to each WAV.
- Output processed dataset and configs will be written under `WORK_DIR`.


In [4]:

from pathlib import Path

# ✅ EDIT THESE
DATA_ROOT = Path('/content/drive/MyDrive/cloned-thai-dataset/audio-data')   # folder with your wavs
TRANSCRIPT_CSV = Path('/content/drive/MyDrive/cloned-thai-dataset/metadata.csv')  # CSV with columns: path,text (path absolute or relative to DATA_ROOT). Leave None if you don't have it.
WORK_DIR = Path('/content/drive/MyDrive/thai_tts/work')    # where to write normalized data, configs, checkpoints

# MMS language code for Thai
LANG_CODE = 'tha'

WORK_DIR.mkdir(parents=True, exist_ok=True)
print('DATA_ROOT  =', DATA_ROOT)
print('CSV        =', TRANSCRIPT_CSV)
print('WORK_DIR   =', WORK_DIR)


DATA_ROOT  = /content/drive/MyDrive/cloned-thai-dataset/audio-data
CSV        = /content/drive/MyDrive/cloned-thai-dataset/metadata.csv
WORK_DIR   = /content/drive/MyDrive/thai_tts/work
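Since a CSV entry may be an absolute path or a path relative to some base folder, one way to resolve it is to try candidate base folders in order. The helper name `resolve_audio_path` and the choice of base folders are assumptions for illustration, not part of the recipe.

```python
from pathlib import Path


def resolve_audio_path(entry, base_dirs):
    """Return the first existing file for a CSV entry, or None if nothing matches."""
    p = Path(entry)
    # Absolute entries are used as-is; relative ones are tried under each base folder
    candidates = [p] if p.is_absolute() else [Path(b) / p for b in base_dirs]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

For example, `resolve_audio_path(row_path, [DATA_ROOT, TRANSCRIPT_CSV.parent])` would look under the audio folder first and then next to the CSV.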



### (Optional) Create `metadata.csv` if you don't already have one

If you **don't** have a CSV, this cell creates a simple CSV by scanning for `.wav` files and reading a paired `.txt` with the same basename for text (e.g., `utt001.wav` + `utt001.txt`).  
Adjust as needed for your layout.


In [5]:

import pandas as pd
from pathlib import Path

def build_metadata_from_sidecar_txt(data_root: Path, out_csv: Path):
    rows = []
    for wav in data_root.rglob('*.wav'):
        txt = wav.with_suffix('.txt')
        if txt.exists():
            text = txt.read_text(encoding='utf-8').strip()
            rows.append({'path': str(wav.resolve()), 'text': text})
    if not rows:
        raise ValueError('No (wav, txt) pairs found. Provide TRANSCRIPT_CSV instead.')
    df = pd.DataFrame(rows)
    df.to_csv(out_csv, index=False)
    return out_csv

if str(TRANSCRIPT_CSV).lower() == 'none':
    TRANSCRIPT_CSV = WORK_DIR / 'metadata.csv'
    created = build_metadata_from_sidecar_txt(DATA_ROOT, TRANSCRIPT_CSV)
    print('Created CSV at', created)
else:
    print('Using existing metadata CSV:', TRANSCRIPT_CSV)


Using existing metadata CSV: /content/drive/MyDrive/cloned-thai-dataset/metadata.csv



## Preprocess audio & normalize Thai text

- Resample/convert to **mono 16 kHz WAV**
- Light Thai normalization (using PyThaiNLP)
- Filter clips (1–12 s recommended)
- Produce a cleaned `metadata_clean.csv`


In [6]:

import librosa, soundfile as sf, pandas as pd, numpy as np, os
from pythainlp.tokenize import word_tokenize

PROC_AUDIO_DIR = WORK_DIR / 'wavs_16k_mono'
PROC_AUDIO_DIR.mkdir(exist_ok=True, parents=True)
OUT_CSV = WORK_DIR / 'metadata_clean.csv'

MIN_DUR = 1.0
MAX_DUR = 12.0
TARGET_SR = 16000

def normalize_thai(text: str) -> str:
    # Minimal normalization: segment into words (can help prosody) and drop
    # whitespace-only tokens so runs of spaces collapse to a single space
    seg = word_tokenize(text.strip(), engine='newmm')
    return ' '.join(tok for tok in seg if tok.strip())

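df = pd.read_csv(TRANSCRIPT_CSV)
# Accept either a 'path' or a 'file_name' column for the audio location
audio_col = 'path' if 'path' in df.columns else 'file_name'
clean_rows = []

for i, row in df.iterrows():
    src = Path(str(row[audio_col]))
    if not src.is_absolute():
        # Resolve relative entries against the folder that contains the CSV
        src = Path(TRANSCRIPT_CSV).parent / src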
    text = str(row['text'])
    if not src.exists():
        print('Skip (missing):', src)
        continue
    try:
        wav, sr = librosa.load(src, sr=None, mono=True)
        dur = len(wav)/sr
        if dur < MIN_DUR or dur > MAX_DUR:
            continue
        if sr != TARGET_SR:
            wav = librosa.resample(wav, orig_sr=sr, target_sr=TARGET_SR)
            sr = TARGET_SR
        # Write processed wav under WORK_DIR, mirroring file name
        out_wav = PROC_AUDIO_DIR / f"{src.stem}_16k.wav"
        sf.write(out_wav, wav, sr, subtype='PCM_16')
        clean_rows.append({'path': str(out_wav), 'text': normalize_thai(text)})
    except Exception as e:
        print('Error:', src, e)

clean_df = pd.DataFrame(clean_rows)
clean_df.to_csv(OUT_CSV, index=False)
print('Wrote cleaned CSV:', OUT_CSV)
print('Total usable clips:', len(clean_df))
clean_df.head()


Wrote cleaned CSV: /content/drive/MyDrive/thai_tts/work/metadata_clean.csv
Total usable clips: 521


Unnamed: 0,file_name,text
0,/content/drive/MyDrive/thai_tts/work/wavs_16k_...,เช้า ต้อง เห็น หน้า เย็น ก็ ต้อง ...
1,/content/drive/MyDrive/thai_tts/work/wavs_16k_...,ดิ้น ดิ้น
2,/content/drive/MyDrive/thai_tts/work/wavs_16k_...,กำจัด
3,/content/drive/MyDrive/thai_tts/work/wavs_16k_...,เคาะ ไล่ อากาศ
4,/content/drive/MyDrive/thai_tts/work/wavs_16k_...,เว็บ ดีไซน์
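To confirm that preprocessing produced mono 16 kHz, 16-bit files within the 1–12 s window, a quick check along these lines can be run over the output folder. The helper `check_wav` is hypothetical and uses the standard-library `wave` module rather than librosa, so it only understands plain PCM WAV files.

```python
import wave

TARGET_SR = 16000
MIN_DUR, MAX_DUR = 1.0, 12.0


def check_wav(path):
    """Return a list of problems found in one processed WAV (empty list means OK)."""
    with wave.open(str(path), "rb") as w:
        sr, channels, width = w.getframerate(), w.getnchannels(), w.getsampwidth()
        duration = w.getnframes() / sr
    problems = []
    if sr != TARGET_SR:
        problems.append(f"sample rate {sr} != {TARGET_SR}")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if width != 2:
        problems.append(f"{8 * width}-bit samples, expected 16-bit PCM")
    if not (MIN_DUR <= duration <= MAX_DUR):
        problems.append(f"duration {duration:.2f}s outside {MIN_DUR}-{MAX_DUR}s")
    return problems
```

Looping over `PROC_AUDIO_DIR.glob('*.wav')` and printing any non-empty result is usually enough to spot files that slipped through.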



### Prepare dataset for training

The `finetune-hf-vits` script can read from a HuggingFace dataset or from local CSV.  
This step creates a cleaned CSV file that can be directly used by the training script.


In [7]:

import pandas as pd

# Read the CSV and create a cleaned version for training
df = pd.read_csv(OUT_CSV)

# Ensure column names match training script expectations
# The training script expects 'path' and 'text' columns
if 'file_name' in df.columns:
    df = df.rename(columns={'file_name': 'path'})

# Save the cleaned CSV that can be directly used by the training script
cleaned_csv_path = WORK_DIR / 'metadata_clean.csv'
df.to_csv(cleaned_csv_path, index=False)
print(f'Saved cleaned CSV for training at: {cleaned_csv_path}')
print(f'Dataset shape: {df.shape}')
print(f'Columns: {list(df.columns)}')





## Prepare MMS-TTS (Thai) checkpoint for training

This uses `convert_original_discriminator_checkpoint.py` from the recipe to create a trainable VITS-style checkpoint.


In [8]:

%cd /content/finetune-hf-vits

!python convert_original_discriminator_checkpoint.py   --language_code {LANG_CODE}   --pytorch_dump_folder_path /content/mms-tha-train

# (Optional) list files
!ls -lah /content/mms-tha-train

%cd /content


/content/finetune-hf-vits


## Training config

Adjust batch size, learning rate, max steps/epochs to your GPU and dataset size.


In [9]:

import json, os
cfg = {
  # Model arguments
  "model_name_or_path": "/content/mms-tha-train",
  
  # Data arguments - using CSV instead of save_to_disk format
  "dataset_name": str(WORK_DIR / 'metadata_clean.csv'),
  "audio_column_name": "path",
  "text_column_name": "text",
  "train_split_name": "train",
  "eval_split_name": "train",
  
  # Training arguments
  "output_dir": "/content/tts-tha-checkpoints",
  "do_train": True,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 2,
  "learning_rate": 0.0001,
  "max_steps": 20000,
  "logging_steps": 50,
  "save_steps": 1000,
  "eval_steps": 1000,
  "warmup_steps": 500,
  "fp16": True
}
os.makedirs('/content/config', exist_ok=True)
with open('/content/config/train_thai.json', 'w') as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
print(open('/content/config/train_thai.json').read())


{
  "model_name_or_path": "/content/mms-tha-train",
  "dataset_name": "/content/drive/MyDrive/thai_tts/work/metadata_clean.csv",
  "audio_column_name": "path",
  "text_column_name": "text",
  "train_split_name": "train",
  "eval_split_name": "train",
  "output_dir": "/content/tts-tha-checkpoints",
  "do_train": true,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 2,
  "learning_rate": 0.0001,
  "max_steps": 20000,
  "logging_steps": 50,
  "save_steps": 1000,
  "eval_steps": 1000,
  "warmup_steps": 500,
  "fp16": true
}
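To get a feel for what these numbers mean for a given dataset, the effective batch size and the approximate number of passes over the data can be worked out directly. This sketch assumes `max_steps` counts optimizer updates (after gradient accumulation) and that training runs on a single GPU; the function name is hypothetical.

```python
import math


def training_budget(num_clips, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Rough training arithmetic; assumes max_steps counts optimizer updates."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_clips / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "approx_epochs": max_steps / steps_per_epoch,
    }


# Values from this notebook: 521 usable clips, batch 8, accumulation 2, 20000 steps
budget = training_budget(num_clips=521, per_device_batch=8, grad_accum=2, max_steps=20000)
print(budget)
```

Under these assumptions the effective batch is 16 clips and 20000 steps corresponds to roughly 600 passes over 521 clips, so a much smaller `max_steps` is enough for a first end-to-end test. Halving the batch size to 4 while doubling accumulation to 4 keeps the effective batch at 16 with lower peak memory.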



## Launch training

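The cells below first build the `monotonic_align` Cython extension that the recipe uses for alignment search, then start fine-tuning the Thai MMS-TTS model on your dataset.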


In [13]:
pwd

'/content'

In [14]:
%cd /content/finetune-hf-vits/monotonic_align

/content/finetune-hf-vits/monotonic_align


In [17]:
!python setup.py build_ext --inplace

In [18]:

%cd /content/finetune-hf-vits
!accelerate launch run_vits_finetuning.py /content/config/train_thai.json
%cd /content


/content/finetune-hf-vits
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`


## Quick inference test

Generate a sample audio using the fine-tuned checkpoint.


In [None]:

import numpy as np
import soundfile as sf
import torch
from transformers import pipeline

ckpt_dir = "/content/tts-tha-checkpoints"
tts = pipeline("text-to-speech", model=ckpt_dir, device=0 if torch.cuda.is_available() else -1)
sample_text = "สวัสดีครับ ยินดีที่ได้รู้จัก นี่คือระบบสังเคราะห์เสียงภาษาไทยที่ฝึกด้วยข้อมูลของเรา"
out = tts(sample_text)
# The pipeline may return audio with a leading batch axis; flatten to 1-D so
# soundfile writes a mono file instead of treating samples as channels
audio = np.squeeze(out["audio"])
sf.write('/content/sample_thai_tts.wav', audio, out["sampling_rate"], subtype='PCM_16')
print('Saved:', '/content/sample_thai_tts.wav')



## Save to Drive (and optionally push to Hugging Face Hub)


In [None]:

import shutil
drive_ckpt = str(WORK_DIR / 'checkpoints_mms_thai')
shutil.copytree('/content/tts-tha-checkpoints', drive_ckpt, dirs_exist_ok=True)
print('Copied checkpoints to:', drive_ckpt)

# Optional: push to the Hub (uncomment and set your repo)
# !pip -q install huggingface_hub
# from huggingface_hub import HfApi, create_repo, notebook_login
# notebook_login()  # paste a token with write access
# repo_id = "yourname/mms-thai-finetuned"
# create_repo(repo_id, private=True, exist_ok=True)
# HfApi().upload_folder(folder_path='/content/tts-tha-checkpoints', repo_id=repo_id, repo_type='model')



### Notes & Tips
- Aim for 1–3 hours of clean audio for decent voice cloning, and 5–10+ hours for robust style control.
- Keep clips between 1 and 12 seconds, and remove as much background noise as possible.
- If you run out of memory (OOM), reduce `per_device_train_batch_size`; increase `gradient_accumulation_steps` by the same factor to keep the effective batch size unchanged.
- To validate the pipeline quickly, start with a small `max_steps`, confirm that training and inference run end to end, then scale up.

### 🎯 Current Approach: CSV-based Audio Loading
This notebook now uses CSV-based audio loading by default, which is more robust:
1. ✅ Audio files are loaded directly from disk (not stored in HuggingFace datasets)
2. ✅ Better memory management and compatibility
3. ✅ Automatic audio format conversion and resampling
4. ✅ The training script automatically detects `.csv` files and uses the new loading method

**Your CSV should have these columns:**
- `path`: Full path to the audio file
- `text`: Corresponding text transcription

**Example CSV:**
```
path,text
/path/to/audio1.wav,สวัสดีครับ
/path/to/audio2.wav,ขอบคุณมากครับ
```
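Before launching training, it can help to check a CSV against the format above. The following is a minimal sketch using only the standard library; the function name `validate_metadata` is hypothetical and the checks are an assumption about what is worth catching early, not something the training script performs.

```python
import csv
from pathlib import Path

REQUIRED = ("path", "text")


def validate_metadata(csv_path, check_files=True):
    """Return (row_number, message) pairs for every problem in a `path,text` CSV."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
        if missing:
            return [(1, f"missing column(s): {', '.join(missing)}")]
        for n, row in enumerate(reader, start=2):  # row 1 is the header
            if not (row["text"] or "").strip():
                problems.append((n, "empty text"))
            if check_files and not Path(row["path"] or "").is_file():
                problems.append((n, f"audio not found: {row['path']}"))
    return problems
```

An empty list means every row has a transcript and points at an existing file; pass `check_files=False` to validate only the columns and text.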
