# Fine-Tuning XTTS-v2 for the Hindi Language #

## Creating Your Dataset:


### Dataset of this Notebook:
The VoxPopuli dataset was chosen for Hindi TTS training and modified to match the LJSpeech format.

Simply discard any audio that is of significantly worse quality than the rest. Examples of unpromising source audio include: background noise (e.g., coughing, clapping, laughter), excessive clipping visible in Audacity's waveform view, and poor-quality recordings with a constant whine or other noise.
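As a rough complement to eyeballing waveforms in Audacity, the share of samples sitting at or near full scale can flag clips that deserve a closer look. The sketch below assumes 16-bit PCM WAV files; the function name `clipped_fraction` and the default threshold are illustrative choices, not part of any library.

```python
import array
import sys
import wave


def clipped_fraction(wav_path, threshold=32760):
    """Fraction of samples at or near full scale in a 16-bit PCM WAV file."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if len(samples) == 0:
        return 0.0
    # Count samples whose magnitude reaches the (assumed) clipping threshold.
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return clipped / len(samples)
```

A clip with a noticeably higher fraction than the rest of the dataset is worth opening in Audacity before deciding whether to cull it.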

### Making an LJSpeech Style Dataset:
An LJSpeech-style dataset is a directory that contains two things: a metadata.csv file and a directory called 'wavs' that holds your voice recordings. Each line of metadata.csv has three fields separated by a pipe character (|):

1. The name of an audio file (without the .wav extension)
2. The text for that file. E.g., "Jane Eyre by Charlotte Bronte. Chapter 1."
3. The normalised text. E.g., "Jane Eyre by Charlotte Bronte. Chapter one."

**If you are fine-tuning XTTS-v2 you don't need to worry about normalising your text, because it gets done for you automatically at training time. So your second and third columns can be identical.**
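To make the layout concrete, here is a minimal sketch that builds metadata lines in the LJSpeech convention, where the three fields are separated by a pipe character. The helper names `metadata_line` and `write_metadata` are hypothetical, and the example clip ID and text are placeholders rather than entries from the real dataset.

```python
def metadata_line(clip_id, text, normalized_text=None):
    """Build one 'id|text|normalized text' line for metadata.csv."""
    if normalized_text is None:
        # For XTTS-v2 fine-tuning the second and third columns can be identical.
        normalized_text = text
    for field in (clip_id, text, normalized_text):
        if "|" in field or "\n" in field:
            raise ValueError(f"field contains a pipe or newline: {field!r}")
    return f"{clip_id}|{text}|{normalized_text}"


def write_metadata(rows, path):
    """Write (clip_id, text) pairs to a metadata file, one line per clip."""
    with open(path, "w", encoding="utf-8") as f:
        for clip_id, text in rows:
            f.write(metadata_line(clip_id, text) + "\n")


# Placeholder example: the clip ID is the WAV file name without its extension.
print(metadata_line("0001_030", "नमस्ते"))
```

Rejecting pipes and newlines inside a field keeps each line parseable, since the format has no quoting mechanism.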

### Note on Model Performance:
Some degree of repetition/mushy-mouth sound seems to be inherent to the model. Even the pre-trained voices that come packaged with TTS suffer from this problem to a small extent. There are two ways I'm aware of to improve performance (these are already covered in other parts of this and my other notebook, but I'm repeating them here because they are pretty important):

1. Improve the quality of your training data. Cull problematic items. Get more training data if your dataset is really small.
2. The model does not generalise well to unseen sequence lengths. If you only fine-tune on 10s long audio clips and then try to produce a 1s clip at inference time, it will probably struggle. Make sure you have a good distribution of training lengths. Note that when you try to generate audio from a long text string, *the program automatically splits that long string into several shorter strings*, because the model cannot generate sequences of arbitrary length. If you are suffering from garbled/repetitious outputs, I recommend putting some print statements in the 'split_sentence' function in TTS.tts.layers.xtts.tokenizer. This will show you how your long text is being split up. If you see that your bad outputs only occur when the model is trying to generate audio for very short or very long sequences, then you know what needs to be addressed.
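One way to check whether the training clips cover a good range of lengths is to bucket their durations and look at the counts. This is a minimal sketch using only the standard library; the function names and the 2-second bin width are illustrative assumptions.

```python
import wave
from collections import Counter


def wav_duration_seconds(wav_path):
    """Duration of a WAV file in seconds, read from its header."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_histogram(durations, bin_seconds=2.0):
    """Map each bin's start time (in seconds) to the number of clips in that bin."""
    counts = Counter(int(d // bin_seconds) * bin_seconds for d in durations)
    return dict(sorted(counts.items()))


def print_histogram(histogram, bin_seconds=2.0):
    """Print a simple text bar chart of the histogram."""
    for start, count in histogram.items():
        print(f"{start:5.1f}-{start + bin_seconds:5.1f} s | {'#' * count} ({count})")


# Made-up durations for illustration only.
print_histogram(duration_histogram([1.06, 3.5, 3.9, 10.34]))
```

To run it on a real dataset, collect `wav_duration_seconds(path)` for every file in the 'wavs' directory and pass the list to `duration_histogram`. Empty or sparse bins point to lengths the model will rarely see during fine-tuning.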

In [1]:
!pip install git+https://github.com/coqui-ai/TTS

Collecting git+https://github.com/coqui-ai/TTS
  Cloning https://github.com/coqui-ai/TTS to /tmp/pip-req-build-8r49_zoc
  Running command git clone --filter=blob:none --quiet https://github.com/coqui-ai/TTS /tmp/pip-req-build-8r49_zoc
  Resolved https://github.com/coqui-ai/TTS to commit dbf1a08a0d4e47fdad6172e433eeb34bc6b13b4e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting scikit-learn>=1.3.0 (from TTS==0.22.0)
  Downloading scikit_learn-1.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting inflect>=5.6.0 (from TTS==0.22.0)
  Downloading inflect-7.4.0-py3-none-any.whl.metadata (21 kB)
Collecting anyascii>=0.3.0 (from TTS==0.22.0)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting packaging>=23.1 (from TTS==0.22.0)
  Using cached packaging-24.1-py3-none-any.whl.metadata (3.2 kB)
Collecting mutagen=

In [2]:
!pip install transformers==4.37.1

Collecting transformers==4.37.1
  Downloading transformers-4.37.1-py3-none-any.whl.metadata (129 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers==4.37.1)
  Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.37.1-py3-none-any.whl (8.4 MB)
Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.20.0
    Uninstalling tokenizers-0.20.0

In [None]:
###updated training zone####

In [3]:
from trainer import Trainer, TrainerArgs
#from trainer.logging.wandb_logger import WandbLogger
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs, GPTTrainer, GPTTrainerConfig, XttsAudioConfig
from TTS.utils.manage import ModelManager


import sys
import os
import wandb

### Monkey Patching for wandb (!!!) ###

XTTS-v2 uses TensorBoard for logging by default. wandb is officially supported, but in my experience it breaks things: after a few epochs it creates massive numbers of artifact files. For this reason I've monkey patched the offending method so that no artifacts are added.


In [4]:
from trainer.logging.wandb_logger import WandbLogger

In [5]:
def add_artifact(self, file_or_dir, name, artifact_type, aliases=None):
    ###instead of adding artifact, do nothing###
    print(f"========Ignoring artifact: {name} {file_or_dir}========")
    return


WandbLogger.add_artifact = add_artifact

In [6]:
# Logging parameters
RUN_NAME = "kaggletest"
PROJECT_NAME = "gore" 
DASHBOARD_LOGGER = "wandb" 
LOGGER_URI = None

### Dir for Training Run ###

Set the training run to store model files in the persistent /kaggle/working dir. 


In [7]:
OUT_PATH = '/kaggle/working/run/'
os.makedirs(OUT_PATH, exist_ok=True)

Retrieve the base model files.

In [8]:
# Define the path where XTTS v2.0.1 files will be downloaded
CHECKPOINTS_OUT_PATH = os.path.join(OUT_PATH, "XTTS_v2.0_original_model_files/")
os.makedirs(CHECKPOINTS_OUT_PATH, exist_ok=True)

# DVAE files
DVAE_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/dvae.pth"
MEL_NORM_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/mel_stats.pth"

# Set the path to the downloaded files
DVAE_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(DVAE_CHECKPOINT_LINK))
MEL_NORM_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(MEL_NORM_LINK))

# download DVAE files if needed
if not os.path.isfile(DVAE_CHECKPOINT) or not os.path.isfile(MEL_NORM_FILE):
    print(" > Downloading DVAE files!")
    ModelManager._download_model_files([MEL_NORM_LINK, DVAE_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True)

# Download XTTS v2.0 checkpoint if needed
TOKENIZER_FILE_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json"
XTTS_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/model.pth"

# XTTS transfer learning parameters: provide the paths of the XTTS model checkpoint that you want to fine-tune.
TOKENIZER_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(TOKENIZER_FILE_LINK))  # vocab.json file
XTTS_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CHECKPOINT_LINK))  # model.pth file

# download XTTS v2.0 files if needed
if not os.path.isfile(TOKENIZER_FILE) or not os.path.isfile(XTTS_CHECKPOINT):
    print(" > Downloading XTTS v2.0 files!")
    ModelManager._download_model_files(
        [TOKENIZER_FILE_LINK, XTTS_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True
    )

 > Downloading DVAE files!

100%|██████████| 1.07k/1.07k [00:00<00:00, 4.31kiB/s]

 > Downloading XTTS v2.0 files!

100%|██████████| 211M/211M [00:04<00:00, 50.1MiB/s]
100%|██████████| 361k/361k [00:00<00:00, 1.70MiB/s]

In [9]:
training_dir = "/kaggle/input/hindi-speech1"

### Batch Size ###

* BATCH_SIZE is the number of items loaded into VRAM/memory at once.

* GRAD_ACCUM_STEPS (spelled GRAD_ACUMM_STEPS in the code below) is the number of forward/backward passes, each over BATCH_SIZE items, that are accumulated before the optimizer updates the parameters. The effective batch size is therefore BATCH_SIZE * GRAD_ACCUM_STEPS.
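The relationship between the two settings is simple arithmetic. The sketch below works it through for the values used in this notebook; the helper names are illustrative only.

```python
import math


def effective_batch_size(batch_size, grad_accum_steps):
    """Number of items that contribute to each optimizer update."""
    return batch_size * grad_accum_steps


def accum_steps_for(target_effective, batch_size):
    """Accumulation steps needed to reach at least the target effective batch size."""
    return math.ceil(target_effective / batch_size)


# Values used in this notebook: BATCH_SIZE = 1, GRAD_ACUMM_STEPS = 252.
target = effective_batch_size(1, 252)
print(target)  # 252

# With more VRAM, a larger batch needs fewer accumulation steps for the same target.
for batch_size in (1, 2, 4):
    print(batch_size, accum_steps_for(target, batch_size))
```

Keeping the product constant while raising BATCH_SIZE trades memory for speed without changing how many items feed each parameter update.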


In [10]:

OPTIMIZER_WD_ONLY_ON_WEIGHTS = True  
START_WITH_EVAL = True  
BATCH_SIZE = 1
GRAD_ACUMM_STEPS = 252
LANGUAGE = "hi"


100%|██████████| 1.87G/1.87G [00:45<00:00, 54.5MiB/s]

In [59]:
import librosa

# Load the audio file
audio_path = "/kaggle/input/hindi-speech1/wavs/0001_030.wav"
audio, sample_rate = librosa.load(audio_path, sr=None)  # sr=None preserves the original sample rate

print(f"Sample rate: {sample_rate} Hz")

Sample rate: 8000 Hz


In [11]:
import os
import wave

wav_folder = "/kaggle/input/hindi-speech1/wavs"
sample_rate = 8000   # Replace with the actual sample rate of your dataset

min_length = float('inf')
max_length = 0

for wav_file in os.listdir(wav_folder):
    if wav_file.endswith('.wav'):
        with wave.open(os.path.join(wav_folder, wav_file), 'r') as wav:
            frames = wav.getnframes()
            duration = frames / float(wav.getframerate())  # in seconds
            length_in_samples = int(duration * sample_rate)

            # Update min and max lengths
            if length_in_samples < min_length:
                min_length = length_in_samples
            if length_in_samples > max_length:
                max_length = length_in_samples

print(f"Min length: {min_length} samples")
print(f"Max length: {max_length} samples")

Min length: 8480 samples
Max length: 82720 samples


### Dataset Config ###

Note that the lengths below are lengths of WAV files in samples. So if your WAV file has a sample rate of 22050 Hz, then a max_wav_length of 370000 is 370000/22050 = ~16.78 seconds long.
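The minimum and maximum lengths printed earlier were counted at the dataset's 8000 Hz rate, while the audio config in this notebook uses 22050 Hz. If the data loader resamples clips to the configured rate before applying the length limits (an assumption worth verifying against your installed TTS version), the same clips would be longer when counted in samples. A minimal sketch of the conversion, with an illustrative helper name:

```python
def rescale_length(n_samples, source_rate, target_rate):
    """Length in samples of the same audio after resampling, rounded up."""
    return (n_samples * target_rate + source_rate - 1) // source_rate


SOURCE_RATE = 8000    # measured sample rate of the dataset
TARGET_RATE = 22050   # sample_rate used in the audio config

for label, n_samples in (("min", 8480), ("max", 82720)):
    seconds = n_samples / SOURCE_RATE
    rescaled = rescale_length(n_samples, SOURCE_RATE, TARGET_RATE)
    print(f"{label}: {seconds:.2f} s -> {rescaled} samples at {TARGET_RATE} Hz")
```

Under that assumption, the longest clip (10.34 s) corresponds to 227997 samples at 22050 Hz, so a limit of 82720 would only admit clips up to about 3.75 s.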




In [12]:
model_args = GPTArgs(
    max_conditioning_length=82720,  # Longest clip in the dataset, in samples
    min_conditioning_length=8480,   # Shortest clip in the dataset, in samples
    debug_loading_failures=True,
    max_wav_length=82720,           # Set to the longest audio file in the dataset (82720 samples)
    max_text_length=1000,           # Increased to handle longer text inputs
    mel_norm_file=MEL_NORM_FILE,
    dvae_checkpoint=DVAE_CHECKPOINT,
    xtts_checkpoint=XTTS_CHECKPOINT,  
    tokenizer_file=TOKENIZER_FILE,
    gpt_num_audio_tokens=1026, 
    gpt_start_audio_token=1024,
    gpt_stop_audio_token=1025,      
    gpt_use_masking_gt_prompt_approach=True,
    gpt_use_perceiver_resampler=True,
)

### Audio Config ###

The Coqui TTS docs mention inspecting your data with the CheckSpectrograms.ipynb notebook to help decide on audio parameters. I think this is irrelevant for XTTS-v2, because it doesn't use the same audio config as some of the older Coqui models and doesn't have the same parameters.

The defaults are 22050 Hz for input and 24000 Hz for output.

In [13]:
audio_config = XttsAudioConfig(sample_rate=22050, dvae_sample_rate=22050, output_sample_rate=24000)

### Speaker Reference ###

This is the audio file that will be used for creating the conditioning latent and speaker embedding.

Choosing the right speaker reference is **VERY** important for XTTS-v2. It can completely change how your model will sound. Even two clips taken from the same recording of the same speaker can produce markedly different outputs. Unfortunately I can't provide an algorithm for selecting this. I recommend that you manually go through your dataset and select approximately 10 clips of your speaker where they are saying a full sentence with an intonation/rhythm/speed/style that sounds pretty good. Then just experiment with all of them and find one you like. This is especially important at inference time.

Note that you can give a speaker reference that 'doesn't belong' to your model. 

In [14]:
SPEAKER_REFERENCE = "/kaggle/input/hindi-speech1/wavs/0001_030.wav"

### Trainer Config ###

* Fine-tune for about 100,000 dataset items, but stop early if the test outputs sound good; listening is better than just monitoring loss.
* US male voices fine-tune faster; complex accents take longer.
* Keep test sentences consistent for comparison across model runs.
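To translate the 100,000-item guideline into epochs for this dataset, a rough calculation helps. The sketch below uses the 1099 files and the 2% evaluation split from this notebook, and assumes the evaluation count is truncated to an integer; the exact split produced by the loader may differ slightly.

```python
import math


def epochs_for_items(target_items, n_train):
    """Epochs needed for the model to see at least target_items training items."""
    return math.ceil(target_items / n_train)


n_files = 1099                           # files found in the dataset
eval_fraction = 0.02                     # eval_split_size used in this notebook
n_eval = int(n_files * eval_fraction)    # assumed truncation: 21 clips
n_train = n_files - n_eval               # 1078 clips

print(epochs_for_items(100_000, n_train))  # 93

# Approximate number of optimizer updates over those items,
# given an effective batch size of 1 * 252.
print(100_000 // 252)  # 396
```

So the `epochs = 1000` setting is far more than the guideline needs, which fits the plan of ending training manually once the test outputs sound good.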

In [15]:
SPEAKER_REFERENCE

'/kaggle/input/hindi-speech1/wavs/0001_030.wav'

In [16]:
config = GPTTrainerConfig(
    run_eval=True,
    epochs = 1000, # assuming you want to end training manually w/ keyboard interrupt
    output_path=OUT_PATH,
    model_args=model_args,
    run_name=RUN_NAME,
    project_name=PROJECT_NAME,
    run_description="""
        GPT XTTS training
        """,
    dashboard_logger=DASHBOARD_LOGGER,
    wandb_entity=None,
    logger_uri=LOGGER_URI,
    audio=audio_config,
    batch_size=BATCH_SIZE,
    batch_group_size=48,
    eval_batch_size=BATCH_SIZE,
    num_loader_workers=8, #consider decreasing if your jupyter env is crashing or similar
    eval_split_max_size=256, 
    print_step=50, 
    plot_step=100, 
    log_model_step=1000, 
    save_step=9999999999,  # A checkpoint is already saved every epoch. Set very high on Kaggle because the output dir is limited in size; alternatively, use (training set size / 2) for a checkpoint every half epoch.
    save_n_checkpoints=1,  # Increase this to keep multiple checkpoints rather than just 1
    save_checkpoints=False,  # False on Kaggle because the output dir is limited in size
    print_eval=False,
    optimizer="AdamW",
    optimizer_wd_only_on_weights=OPTIMIZER_WD_ONLY_ON_WEIGHTS,
    optimizer_params={"betas": [0.9, 0.96], "eps": 1e-8, "weight_decay": 1e-2},
    lr=5e-06,  
    lr_scheduler="MultiStepLR",
    lr_scheduler_params={"milestones": [50000 * 18, 150000 * 18, 300000 * 18], "gamma": 0.5, "last_epoch": -1},
    test_sentences=[ 
        {
            "text": "भारत एक विशाल और विविधतापूर्ण देश है।",
            "speaker_wav": SPEAKER_REFERENCE,
            "language": LANGUAGE,
        },
        {
            "text": "हिंदी भारत की सबसे अधिक बोली जाने वाली भाषाओं में से एक है।",
            "speaker_wav": SPEAKER_REFERENCE,
            "language": LANGUAGE,
        },
        {
            "text": "ताजमहल भारत का एक प्रसिद्ध ऐतिहासिक स्मारक है।",
            "speaker_wav": SPEAKER_REFERENCE,
            "language": LANGUAGE,
        }
        
    ],
) 

model = GPTTrainer.init_from_config(config)

>> DVAE weights restored from: /kaggle/working/run/XTTS_v2.0_original_model_files/dvae.pth


### Load Dataset ###

The evaluation set is 1% of the training data by default (this notebook uses 2%). This seems very low, but you will probably want to evaluate performance by listening to test outputs rather than just comparing loss values.

In [17]:
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="deduplicated_transcripts1.txt", language=LANGUAGE, path=training_dir
)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True, eval_split_size=0.02)

 | > Found 1099 files in /kaggle/input/hindi-speech1


### Train! ###



In [None]:
trainer = Trainer(
    TrainerArgs(
        restore_path=None,
        skip_train_epoch=False,
        start_with_eval=START_WITH_EVAL,
        grad_accum_steps=GRAD_ACUMM_STEPS,
    ),
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()

# Testing on Different Texts

In [20]:
!tts --text "वह उसके पीछे भागा" \
    --model_path "/kaggle/working/run/XTTS_v2.0_original_model_files" \
    --config_path "/kaggle/working/run/kaggletest-October-14-2024_05+30AM-0000000/config.json" \
    --out_path "/kaggle/working/run/kaggletest-October-14-2024_05+30AM-0000000/output_1.wav" \
    --language_idx hi \
    --speaker_wav "/kaggle/input/hindi-speech1/wavs/0001_030.wav"


RuntimeError: module was compiled against NumPy C-API version 0x10 (NumPy 1.23) but the running NumPy has C-API version 0xf. Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem.
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf . Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem .

Your fine-tuned model will be stored in /kaggle/working/run.

In [21]:
import os

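# Path to archive. Note: this directory holds the downloaded base model files (CHECKPOINTS_OUT_PATH); point it at the run directory to archive fine-tuned checkpoints instead.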
checkpoint_path = "/kaggle/working/run/XTTS_v2.0_original_model_files"

# Create a zip file name
zip_name = "checkpoint-1000.zip"

# Create the zip file
!zip -r /kaggle/working/{zip_name} {checkpoint_path}

print(f"Zip file created: /kaggle/working/{zip_name}")

  adding: kaggle/working/run/XTTS_v2.0_original_model_files/ (stored 0%)
  adding: kaggle/working/run/XTTS_v2.0_original_model_files/mel_stats.pth (deflated 37%)
  adding: kaggle/working/run/XTTS_v2.0_original_model_files/model.pth (deflated 7%)
  adding: kaggle/working/run/XTTS_v2.0_original_model_files/dvae.pth (deflated 7%)
  adding: kaggle/working/run/XTTS_v2.0_original_model_files/vocab.json (deflated 81%)
Zip file created: /kaggle/working/checkpoint-1000.zip
