# FINE-TUNING XTTS-v2 #

## Creating Your Dataset:


### Dataset:
The dataset was generated by scraping technical blogs and documentation, then pre-processing the scraped data to achieve clear, concise pronunciation, accuracy, and so on.

### Preparing your own Data:
Simply discard any audio that is of significantly worse quality than the rest. Examples of unpromising source audio include background noise (e.g., coughing, clapping, laughter), excessive clipping visible in Audacity's waveform view, and poor-quality recordings with a constant whine or other noise.
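
Clipping can also be flagged programmatically rather than by eye. The following is a minimal sketch, assuming 16-bit PCM WAV files; the helper names `clipped_fraction` and `write_sine`, the threshold of 32760, and the synthetic test tones are all hypothetical choices for illustration, not part of the original workflow.

```python
import array
import math
import os
import sys
import tempfile
import wave

def clipped_fraction(wav_path, threshold=32760):
    """Return the fraction of samples at or near full scale in a 16-bit PCM WAV."""
    with wave.open(wav_path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wf.readframes(wf.getnframes())
    samples = array.array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= threshold) / len(samples)

def write_sine(path, gain, sample_rate=16000, seconds=0.5):
    """Write a 220 Hz test tone; a gain above 1.0 is hard-clipped to the 16-bit range."""
    n = int(sample_rate * seconds)
    data = array.array("h", (
        max(-32768, min(32767, int(gain * 32767 * math.sin(2 * math.pi * 220 * i / sample_rate))))
        for i in range(n)
    ))
    if sys.byteorder == "big":
        data.byteswap()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(data.tobytes())

# Synthetic demonstration: one clean tone and one heavily overdriven tone.
tmp_dir = tempfile.mkdtemp()
clean_path = os.path.join(tmp_dir, "clean.wav")
loud_path = os.path.join(tmp_dir, "loud.wav")
write_sine(clean_path, gain=0.5)
write_sine(loud_path, gain=3.0)

clean_fraction = clipped_fraction(clean_path)
loud_fraction = clipped_fraction(loud_path)
print(f"clean: {clean_fraction:.3f}, overdriven: {loud_fraction:.3f}")
```

What fraction counts as "excessive" is a judgment call; comparing each file against the rest of the dataset is likely more useful than a fixed cut-off.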

### Making an LJSpeech Style Dataset:
The LJSpeech format is a directory that contains two things: a metadata.csv file (the dataset used later in this notebook names it metadata.txt) and a directory called 'wavs' that contains your voice recordings. Each line of the metadata file has three fields separated by `|`:

1. The name of an audio file (without the `.wav` extension)
2. The text for that file. E.g., "Jane Eyre by Charlotte Bronte. Chapter 1."
3. The normalised text. E.g., "Jane Eyre by Charlotte Bronte. Chapter one."
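
A minimal sketch of writing such a file, assuming the usual LJSpeech layout in which fields are separated by `|` and the first field is the file name without its extension. The helper name `write_metadata` and the example row are hypothetical.

```python
import os
import tempfile

def write_metadata(dataset_dir, rows, filename="metadata.csv"):
    """Write (file_id, text, normalised_text) rows as pipe-separated lines."""
    os.makedirs(os.path.join(dataset_dir, "wavs"), exist_ok=True)
    meta_path = os.path.join(dataset_dir, filename)
    with open(meta_path, "w", encoding="utf-8") as f:
        for file_id, text, normalised in rows:
            # "|" is the field separator, so it must not appear inside a field.
            for field in (file_id, text, normalised):
                if "|" in field or "\n" in field:
                    raise ValueError(f"field contains a separator: {field!r}")
            f.write(f"{file_id}|{text}|{normalised}\n")
    return meta_path

# Hypothetical example row; wavs/audio_1.wav would hold the matching recording.
rows = [
    (
        "audio_1",
        "Jane Eyre by Charlotte Bronte. Chapter 1.",
        "Jane Eyre by Charlotte Bronte. Chapter one.",
    ),
]
dataset_dir = tempfile.mkdtemp()
meta_path = write_metadata(dataset_dir, rows)

with open(meta_path, encoding="utf-8") as f:
    first_line = f.readline().rstrip("\n")
print(first_line)
```

Rejecting fields that contain the separator keeps each line parseable as exactly three columns.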



### Note on Model Performance:
Some degree of repetition and "mushy mouth" sound seems to be inherent to the model. Even the pre-trained voices that come packaged with TTS suffer from this problem to a small extent. There are two ways I'm aware of to improve performance (these are covered in other parts of this and my other notebook, but I'm repeating them here because they are important):

1. Improve the quality of your training data. Cull problematic items. Get more training data if your dataset is really small.
2. The model does not generalise well to unseen sequence lengths. If you only fine-tune on 10s long audio clips and then try to produce a 1s clip at inference time, it will probably struggle.
Make sure you have a good distribution of training lengths. Note that when you generate audio from a long text string, *the program automatically splits that string into several shorter strings*, because the model cannot generate sequences of arbitrary length. If you are getting garbled or repetitious outputs, I recommend putting some print statements in the 'split_sentence' function in TTS.tts.layers.xtts.tokenizer. This will show you how your long text is being split up. If the bad outputs only occur when the model is generating audio for very short or very long sequences, then you know what needs to be addressed.
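
One way to check the distribution of training lengths is to bucket the clips by duration. This is a minimal sketch using only the standard library, assuming uncompressed PCM WAV files in a `wavs` directory; the function names, the 2-second bucket width, and the silent demo clips are hypothetical.

```python
import os
import tempfile
import wave
from collections import Counter

def clip_seconds(wav_path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(wav_path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def duration_histogram(wavs_dir, bucket_seconds=2):
    """Count clips per duration bucket, keyed by the bucket's lower bound in seconds."""
    counts = Counter()
    for name in sorted(os.listdir(wavs_dir)):
        if name.lower().endswith(".wav"):
            seconds = clip_seconds(os.path.join(wavs_dir, name))
            counts[int(seconds // bucket_seconds) * bucket_seconds] += 1
    return dict(sorted(counts.items()))

def write_silence(path, seconds, sample_rate=16000):
    """Write a silent 16-bit mono clip, used here only as demo data."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(seconds * sample_rate))

# Demo with three synthetic clips of 1 s, 3 s and 3.5 s.
wavs_dir = tempfile.mkdtemp()
for index, seconds in enumerate([1.0, 3.0, 3.5]):
    write_silence(os.path.join(wavs_dir, f"clip_{index}.wav"), seconds)

histogram = duration_histogram(wavs_dir)
print(histogram)
```

Empty or nearly empty buckets at the short or long end are the lengths at which the fine-tuned model is most likely to struggle.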

In [1]:
!pip install git+https://github.com/coqui-ai/TTS

Collecting git+https://github.com/coqui-ai/TTS
  Cloning https://github.com/coqui-ai/TTS to /tmp/pip-req-build-bqvkm00r
  Running command git clone --filter=blob:none --quiet https://github.com/coqui-ai/TTS /tmp/pip-req-build-bqvkm00r
  Resolved https://github.com/coqui-ai/TTS to commit dbf1a08a0d4e47fdad6172e433eeb34bc6b13b4e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting scikit-learn>=1.3.0 (from TTS==0.22.0)
  Downloading scikit_learn-1.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting inflect>=5.6.0 (from TTS==0.22.0)
  Downloading inflect-7.4.0-py3-none-any.whl.metadata (21 kB)
Collecting anyascii>=0.3.0 (from TTS==0.22.0)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting packaging>=23.1 (from TTS==0.22.0)
  Using cached packaging-24.1-py3-none-any.whl.metadata (3.2 kB)
Collecting mutagen=

In [2]:
!pip install transformers==4.37.1

Collecting transformers==4.37.1
  Downloading transformers-4.37.1-py3-none-any.whl.metadata (129 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers==4.37.1)
  Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.37.1-py3-none-any.whl (8.4 MB)
Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.20.0
    Uninstalling tokenizers-0.20.0:

In [None]:
###updated training zone####

In [4]:
from trainer import Trainer, TrainerArgs
#from trainer.logging.wandb_logger import WandbLogger
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs, GPTTrainer, GPTTrainerConfig, XttsAudioConfig
from TTS.utils.manage import ModelManager

import sys
import os
import wandb

### Monkey Patching for wandb (!!!) ###

XTTS-v2 uses TensorBoard for logging by default. wandb is officially supported, but whenever I have used it, it started creating massive numbers of artifact files after a few epochs. For this reason I've monkey-patched the offending method so that no artifacts are added.


In [5]:
from trainer.logging.wandb_logger import WandbLogger

In [6]:
def add_artifact(self, file_or_dir, name, artifact_type, aliases=None):
    ###instead of adding artifact, do nothing###
    print(f"========Ignoring artifact: {name} {file_or_dir}========")
    return


WandbLogger.add_artifact = add_artifact

In [7]:
# Logging parameters
RUN_NAME = "kaggletest"
PROJECT_NAME = "gore" 
DASHBOARD_LOGGER = "wandb" 
LOGGER_URI = None

In [8]:
OUT_PATH = '/kaggle/working/run/'
os.makedirs(OUT_PATH, exist_ok=True)

Retrieve the base model files.

In [9]:
# Define the path where XTTS v2.0.1 files will be downloaded
CHECKPOINTS_OUT_PATH = os.path.join(OUT_PATH, "XTTS_v2.0_original_model_files/")
os.makedirs(CHECKPOINTS_OUT_PATH, exist_ok=True)

# DVAE files
DVAE_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/dvae.pth"
MEL_NORM_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/mel_stats.pth"

# Set the path to the downloaded files
DVAE_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(DVAE_CHECKPOINT_LINK))
MEL_NORM_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(MEL_NORM_LINK))

# download DVAE files if needed
if not os.path.isfile(DVAE_CHECKPOINT) or not os.path.isfile(MEL_NORM_FILE):
    print(" > Downloading DVAE files!")
    ModelManager._download_model_files([MEL_NORM_LINK, DVAE_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True)

# Download XTTS v2.0 checkpoint if needed
TOKENIZER_FILE_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json"
XTTS_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/model.pth"

# XTTS transfer learning parameters: you need to provide the paths of the XTTS model checkpoint that you want to fine-tune.
TOKENIZER_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(TOKENIZER_FILE_LINK))  # vocab.json file
XTTS_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CHECKPOINT_LINK))  # model.pth file

# download XTTS v2.0 files if needed
if not os.path.isfile(TOKENIZER_FILE) or not os.path.isfile(XTTS_CHECKPOINT):
    print(" > Downloading XTTS v2.0 files!")
    ModelManager._download_model_files(
        [TOKENIZER_FILE_LINK, XTTS_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True
    )

 > Downloading DVAE files!


 > Downloading XTTS v2.0 files!



In [10]:
training_dir = "/kaggle/input/tech-data"

### Batch Size ###

* BATCH_SIZE is the number of items loaded into VRAM/memory at once.

* GRAD_ACUMM_STEPS is the number of forward/backward passes of BATCH_SIZE items performed before the optimizer updates the parameters, so the effective batch size is BATCH_SIZE * GRAD_ACUMM_STEPS.
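
As a rough illustration of how the two settings trade off, the sketch below computes the effective batch size for the values used in this notebook and lists a few other combinations that give the same product. The helper name `equivalent_settings` is hypothetical, and whether a larger BATCH_SIZE fits in memory depends on your GPU.

```python
BATCH_SIZE = 1          # items per forward/backward pass
GRAD_ACUMM_STEPS = 252  # passes accumulated before one optimizer update

effective_batch_size = BATCH_SIZE * GRAD_ACUMM_STEPS
print(effective_batch_size)  # 252

def equivalent_settings(effective, max_batch_size):
    """List (batch_size, grad_accum_steps) pairs whose product equals `effective`."""
    return [
        (batch_size, effective // batch_size)
        for batch_size in range(1, max_batch_size + 1)
        if effective % batch_size == 0
    ]

# With more VRAM, a larger batch and fewer accumulation steps cover the same
# number of items per parameter update.
settings = equivalent_settings(effective_batch_size, max_batch_size=4)
print(settings)  # [(1, 252), (2, 126), (3, 84), (4, 63)]
```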

In [11]:

OPTIMIZER_WD_ONLY_ON_WEIGHTS = True  
START_WITH_EVAL = True  
BATCH_SIZE = 1
GRAD_ACUMM_STEPS = 252
LANGUAGE = "en"

### Dataset Config ###

Note that the lengths below are measured in audio samples, not seconds. So if your WAV files have a sample rate of 22050 Hz, a max_wav_length of 370000 corresponds to 370000/22050 = ~16.78 seconds.
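
The conversion can be written as a pair of small helpers. This is a sketch with hypothetical function names; it only restates the arithmetic above and shows that the same sample count means a different duration at a different sample rate.

```python
def samples_to_seconds(num_samples, sample_rate):
    """Convert a length in audio samples to seconds."""
    return num_samples / sample_rate

def seconds_to_samples(seconds, sample_rate):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * sample_rate))

print(round(samples_to_seconds(370000, 22050), 2))  # 16.78
print(samples_to_seconds(370000, 16000))            # 23.125
print(seconds_to_samples(3.0, 22050))               # 66150
```

If you change the sample rate in the audio config, it is worth recomputing the length limits so that they still describe the durations you intend.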




In [12]:
model_args = GPTArgs(
    max_conditioning_length=143677,  # the audio used for conditioning latents should be shorter than this
    min_conditioning_length=66150,  # and longer than this
    debug_loading_failures=True,  # prints to the console to help you find problems in your dataset
    max_wav_length=223997,  # set this to >= the longest audio in your dataset
    max_text_length=200, 
    mel_norm_file=MEL_NORM_FILE,
    dvae_checkpoint=DVAE_CHECKPOINT,
    xtts_checkpoint=XTTS_CHECKPOINT,  
    tokenizer_file=TOKENIZER_FILE,
    gpt_num_audio_tokens=1026, 
    gpt_start_audio_token=1024,
    gpt_stop_audio_token=1025,
    gpt_use_masking_gt_prompt_approach=True,
    gpt_use_perceiver_resampler=True,
)



### Audio Config ###

The default is 22050 Hz for input and 24000 Hz for output. The cell below overrides both input rates to 16000 Hz and keeps the 24000 Hz output rate.


In [13]:
audio_config = XttsAudioConfig(sample_rate=16000, dvae_sample_rate=16000, output_sample_rate=24000) 

### Speaker Reference ###

This is the audio file that will be used for creating the conditioning latent and speaker embedding. 

Choosing the right speaker reference is **VERY** important for XTTS-v2. It can completely change how your model will sound. Even two clips taken from the same recording of the same speaker can produce markedly different outputs.

In [14]:
SPEAKER_REFERENCE = "/kaggle/input/tech-data/wavs/audio_1.wav"

### Trainer Config ###

- Fine-tune for about 100,000 dataset items but stop early if test outputs sound good; listening is better than just monitoring loss.
- US male voices fine-tune faster; complex accents take longer.
- Keep test sentences consistent for comparison across model runs.
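
To translate "about 100,000 dataset items" into epochs, a back-of-the-envelope estimate can help. The sketch below assumes roughly 1128 training items (about 98% of the 1151 files found when this notebook loads its dataset) together with the batch settings used here; the function names are hypothetical, and how the trainer handles a final partial accumulation window is an assumption.

```python
import math

def epochs_for_items(target_items, train_items):
    """Number of full passes over the training set needed to see `target_items` items."""
    return math.ceil(target_items / train_items)

def updates_per_epoch(train_items, batch_size, grad_accum_steps):
    """Approximate optimizer updates per epoch, counting a final partial window as one."""
    return math.ceil(train_items / (batch_size * grad_accum_steps))

train_items = 1128  # assumption: ~98% of the 1151 files in this notebook's dataset

print(epochs_for_items(100_000, train_items))   # 89
print(updates_per_epoch(train_items, 1, 252))   # 5
```

With a dataset of this size, each epoch amounts to only a handful of parameter updates, which is one more reason to judge progress by listening to the test sentences.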








In [15]:
config = GPTTrainerConfig(
    run_eval=True,
    epochs=1000,  # deliberately high: training is meant to be stopped manually with a keyboard interrupt
    output_path=OUT_PATH,
    model_args=model_args,
    run_name=RUN_NAME,
    project_name=PROJECT_NAME,
    run_description="""
        GPT XTTS training
        """,
    dashboard_logger=DASHBOARD_LOGGER,
    wandb_entity=None,
    logger_uri=LOGGER_URI,
    audio=audio_config,
    batch_size=BATCH_SIZE,
    batch_group_size=48,
    eval_batch_size=BATCH_SIZE,
    num_loader_workers=8, #consider decreasing if your jupyter env is crashing or similar
    eval_split_max_size=256, 
    print_step=50, 
    plot_step=100, 
    log_model_step=1000, 
    # Set very high on Kaggle because the output dir is limited in size; a checkpoint is already saved every epoch.
    # Alternatively, set this to half the training-set size to get a checkpoint every half epoch.
    save_step=9999999999,
    save_n_checkpoints=1,  # increase this to keep multiple checkpoints rather than just one
    save_checkpoints=False,  # False on Kaggle because the output dir is limited in size
    print_eval=False,
    optimizer="AdamW",
    optimizer_wd_only_on_weights=OPTIMIZER_WD_ONLY_ON_WEIGHTS,
    optimizer_params={"betas": [0.9, 0.96], "eps": 1e-8, "weight_decay": 1e-2},
    lr=5e-06,  
    lr_scheduler="MultiStepLR",
    lr_scheduler_params={"milestones": [50000 * 18, 150000 * 18, 300000 * 18], "gamma": 0.5, "last_epoch": -1},
    test_sentences=[ 
        {
            "text": "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "speaker_wav": SPEAKER_REFERENCE, 
            "language": LANGUAGE,
        },
        {
            "text": "This cake is great. It's so delicious and moist.",
            "speaker_wav": SPEAKER_REFERENCE,
            "language": LANGUAGE,
        },
        {
            "text": "And soon, nothing more terrible, nothing more true, and specious stuff that says no rational being can fear a thing it will not feel, not seeing that this is what we fear.",
            "speaker_wav": SPEAKER_REFERENCE,
            "language": LANGUAGE,
        }
        
    ],
) 

model = GPTTrainer.init_from_config(config)



>> DVAE weights restored from: /kaggle/working/run/XTTS_v2.0_original_model_files/dvae.pth
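
The `lr_scheduler_params` in the config above place the first milestone at 50000 * 18 = 900000 scheduler steps. The sketch below uses the closed form of a MultiStepLR-style schedule (the base rate multiplied by `gamma` once for every milestone already reached) to show what that means for the learning rate. The function name `multistep_lr` is hypothetical, and whether the trainer advances the scheduler per optimizer update or per epoch is not shown in this notebook, so `step` is simply a count of scheduler steps.

```python
from bisect import bisect_right

def multistep_lr(base_lr, milestones, gamma, step):
    """Learning rate after `step` scheduler steps under a MultiStepLR-style schedule."""
    return base_lr * gamma ** bisect_right(sorted(milestones), step)

milestones = [50000 * 18, 150000 * 18, 300000 * 18]
print(milestones[0])  # 900000

lr_early = multistep_lr(5e-06, milestones, 0.5, 5000)
lr_at_first_milestone = multistep_lr(5e-06, milestones, 0.5, 900000)
print(lr_early)               # 5e-06
print(lr_at_first_milestone)  # 2.5e-06
```

For a short fine-tuning run that never reaches 900000 scheduler steps, the learning rate therefore stays at its initial value throughout.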




### Load Dataset ###

By default the evaluation set is 1% of the training data; the cell below sets eval_split_size=0.02, so 2% is held out.

In [16]:
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.txt", language=LANGUAGE, path=training_dir
)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True, eval_split_size=0.02)

 | > Found 1151 files in /kaggle/input/tech-data


### Train! ###

In [None]:
trainer = Trainer(
    TrainerArgs(
        restore_path=None,
        skip_train_epoch=False,
        start_with_eval=START_WITH_EVAL,
        grad_accum_steps=GRAD_ACUMM_STEPS,
    ),
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()

Your fine-tuned model will be stored in a run-specific subdirectory of /kaggle/working/run (for example, kaggletest-October-12-2024_06+41AM-0000000).

# Testing on different sentences

In [39]:
!tts --text "Text for TTS" \
    --model_path "/kaggle/working/run/XTTS_v2.0_original_model_files" \
    --config_path "/kaggle/working/run/kaggletest-October-12-2024_06+41AM-0000000/config.json" \
    --out_path "/kaggle/working/run/kaggletest-October-12-2024_06+41AM-0000000/output.wav" \
    --language_idx en \
    --speaker_wav "/kaggle/input/tech-data/wavs/audio_1.wav"


RuntimeError: module was compiled against NumPy C-API version 0x10 (NumPy 1.23) but the running NumPy has C-API version 0xf. Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem.