# **Parler TTS : TAMIL**

## 1. Installation

Clone the Dataspeech and Parler-TTS repositories from GitHub and install their dependencies.

In [None]:
!git clone https://github.com/huggingface/dataspeech.git
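# "!cd" runs in a throwaway subshell and does not change the notebook's directory, so the requirements file is referenced by path
!pip install --quiet -r ./dataspeech/requirements.txt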

In [None]:
!git clone https://github.com/huggingface/parler-tts.git
%cd parler-tts
!pip install --quiet -e .[train]

In [None]:
!pip install --upgrade protobuf wandb==0.16.6  # Updating the Protobuf library and Installing wandb (Weight & Biases)

In [None]:
!pip install huggingface_hub  # Installing huggingface_hub library

Log in to Hugging Face with an access token. Authentication is required for operations such as pushing models to the Hub or loading datasets.

In [None]:
!git config --global credential.helper store

In [None]:
from huggingface_hub import login
login()

In [None]:
# Switch to the dataspeech folder (the comment is kept on its own line so that %cd does not treat it as part of the path)
%cd ../dataspeech

## 2. Loading Dataset

In [None]:
from datasets import load_dataset
dataset = load_dataset("SPRINGLab/IndicTTS_Tamil")

In [None]:
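from IPython.display import Audio

# Print the transcript and play the audio of the first training sample.
# The column is assumed to be "text", matching --text_column_name and --prompt_column_name used later in this notebook.
sample = dataset["train"][0]
print(sample["text"])
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])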

In [None]:
# Print the transcript and play the audio of the second training sample
sample = dataset["train"][1]
print(sample["text"])
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

In [None]:
del dataset

## 3. Annotating the Dataset

The "main.py" script annotates the dataset with acoustic metadata: pitch, signal-to-noise ratio (SNR), reverberation (C50), and speaking rate. The following sections convert these values into text tags and speaker descriptions.
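As a reminder of what the SNR annotation measures, the following is a minimal sketch of the textbook definition (signal power over noise power, in decibels). It is not how Dataspeech computes the value: the script estimates SNR from the recording alone, whereas this toy example assumes the clean signal and the noise are known separately. The function name `snr_db` and the sample values are illustrative assumptions.

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from separately known signal and noise samples."""
    p_signal = sum(x * x for x in signal) / len(signal)  # mean signal power
    p_noise = sum(x * x for x in noise) / len(noise)     # mean noise power
    return 10 * math.log10(p_signal / p_noise)

# Toy data (assumption): noise amplitude is one tenth of the signal amplitude,
# so the power ratio is 100, which corresponds to about 20 dB.
toy_signal = [1.0, -1.0] * 50
toy_noise = [0.1, -0.1] * 50
toy_snr = snr_db(toy_signal, toy_noise)
print(round(toy_snr, 2))
```

Higher values indicate a cleaner recording.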

In [None]:
!python main.py "SPRINGLab/IndicTTS_Tamil" \
  --configuration "default" \
  --text_column_name "text" \
  --audio_column_name "audio" \
  --cpu_num_workers 1 \
  --num_workers_per_gpu_for_pitch 1 \
  --rename_column \
  --repo_id "IndicTTS-Tamil-tags"

In [None]:
# Checking out the Annotated Dataset

from datasets import load_dataset
dataset = load_dataset("SrihariGKS/IndicTTS-Tamil-tags")
print("SNR 1st sample:", dataset["train"][0]["snr"])
print("C50 2nd sample:", dataset["train"][1]["c50"])
del dataset

## 4. Metadata to text bins

The numeric annotations are then mapped to text bins: each continuous value (for example, noise level or speaking rate) is replaced by a categorical text label according to predefined bin edges. This standardizes the features so that they are easier to categorize and to use in the speaker descriptions.
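The idea of binning can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the logic of `metadata_to_text.py`: the bin edges, the labels, and the helper name `to_text_bin` are all hypothetical, and the real edges come from the JSON file passed via `--path_to_bin_edges`.

```python
import bisect

# Hypothetical bin edges and labels, for illustration only
SPEAKING_RATE_EDGES = [2.0, 3.5, 5.0]  # 3 edges split the range into 4 bins
SPEAKING_RATE_LABELS = ["very slowly", "slowly", "moderate speed", "quickly"]

def to_text_bin(value, edges, labels):
    """Return the label of the bin that `value` falls into."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    # bisect_right counts how many edges are <= value, which is the bin index
    return labels[bisect.bisect_right(edges, value)]

for rate in (1.0, 4.0, 6.0):
    print(rate, "->", to_text_bin(rate, SPEAKING_RATE_EDGES, SPEAKING_RATE_LABELS))
```

The same helper would work for any other numeric annotation, given its own edges and labels.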

In [None]:
!python ./scripts/metadata_to_text.py \
    "SrihariGKS/IndicTTS-Tamil-tags" \
    --repo_id "IndicTTS-Tamil-tags" \
    --configuration "default" \
    --cpu_num_workers 2 \
    --path_to_bin_edges "./examples/tags_to_annotations/v01_bin_edges.json" \
    --avoid_pitch_computation

In [None]:
# Checking whether the metadata has been successfully categorized

from datasets import load_dataset
dataset = load_dataset("SrihariGKS/IndicTTS-Tamil-tags")
print("Noise 1st sample:", dataset["train"][0]["noise"])
print("Speaking rate 2nd sample:", dataset["train"][1]["speaking_rate"])
del dataset

## 5. Creating a Dataset for Speaker Descriptions

The "run_prompt_creation" script uses a language model (here "google/gemma-2-2b-it") to turn each sample's text tags into a natural-language description of the speaker, covering attributes such as speaking style and recording quality. These descriptions are used during model training and again at inference time.
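Conceptually, each row's tags are combined into a request that a language model rewrites as a fluent description. The sketch below only illustrates that idea: the template wording, the tag values, and the helper name `build_description_request` are hypothetical and do not reproduce the prompt used by `run_prompt_creation.py`.

```python
def build_description_request(speaker_name, tags):
    """Combine one row's text tags into a request for a language model (hypothetical template)."""
    keywords = ", ".join(f"{key.replace('_', ' ')}: {value}" for key, value in tags.items())
    return (
        f"Describe how {speaker_name} speaks in one or two sentences, "
        f"using these keywords: {keywords}."
    )

# Hypothetical tag values for a single sample
example_tags = {
    "noise": "very clear",
    "speaking_rate": "slightly slowly",
    "pitch": "high-pitched",
}
request = build_description_request("Ananya", example_tags)
print(request)
```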

In [None]:
!python ./scripts/run_prompt_creation.py \
  --speaker_name "Ananya" \
  --is_single_speaker \
  --dataset_name "SrihariGKS/IndicTTS-Tamil-tags" \
  --output_dir "./tmp_Ananya" \
  --dataset_config_name "default" \
  --model_name_or_path "google/gemma-2-2b-it" \
  --per_device_eval_batch_size 12 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 2 \
  --push_to_hub \
  --hub_dataset_id "IndicTTS-Tamil-tagged" \
  --preprocessing_num_workers 2

In [None]:
# Checking out the Prompt Dataset

from datasets import load_dataset
dataset = load_dataset("SrihariGKS/IndicTTS-Tamil-tagged")
print("1st sample:", dataset["train"][0]["text_description"])
print("2nd sample:", dataset["train"][1]["text_description"])
del dataset

## 6. Fine-Tuning the Model

In [None]:
# Switch to the parler-tts folder (the comment is kept on its own line so that %cd does not treat it as part of the path)
%cd ../parler-tts

The "run_parler_tts_training" script fine-tunes Parler-TTS on the specified training and evaluation datasets. Launching it through "accelerate" enables efficient GPU-accelerated training, while the command-line flags set the dataset names, column names, batch sizes, and evaluation options.
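The batch-size flags in the training command interact: with gradient accumulation, the weights are updated once per group of accumulated batches. A small sketch of that arithmetic follows, using the values 2 and 18 from the command in this section. The device count, the dataset size, and both helper names are assumptions for illustration, and the step count is an estimate rather than the exact number the training script reports.

```python
import math

def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

def approx_updates_per_epoch(num_samples, per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Rough number of optimizer updates needed to see every sample once."""
    batch = effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices)
    return math.ceil(num_samples / batch)

# 2 samples per device and 18 accumulation steps, as in the training command; one GPU assumed
print(effective_batch_size(2, 18))            # 36 samples per update
# 3600 is a hypothetical dataset size, not the size of the Tamil dataset
print(approx_updates_per_epoch(3600, 2, 18))  # 100 updates per epoch
```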

In [None]:
# Flag overview. A comment placed after a trailing "\" breaks the shell line continuation,
# so the explanations are collected here instead of next to each flag.
# - Model: fine-tunes "ai4bharat/indic-parler-tts" and reuses its prompt and description tokenizers;
#   "ylacombe/dac_44khz" is the feature extractor for audio processing.
# - Data: audio and text come from "SPRINGLab/IndicTTS_Tamil"; descriptions and tags come from the
#   metadata dataset "SrihariGKS/IndicTTS-Tamil-tagged". Evaluation reuses the "train" split and is
#   limited to 2 samples for faster periodic evaluation.
# - Columns: "audio" (target audio), "text_description" (descriptions), "text" (prompts).
# - Filtering: audio between 2.0 and 20 seconds, prompts of at most 400 tokens.
# - Optimization: 10 epochs, batch size 2 per device with 18 gradient accumulation steps, Adam
#   (learning rate 0.00003, beta1 0.9, beta2 0.99), weight decay 0.01 for regularization, gradient
#   clipping at norm 0.8 for stability, constant learning rate after 50 warmup steps.
# - Memory: gradient checkpointing and float16 reduce GPU memory usage; the text encoder is frozen
#   to retain its pretrained representations.
# - Logging and checkpoints: metrics go to Weights & Biases every 500 steps; a checkpoint is saved
#   every 1000 steps and only 1 checkpoint is kept.
# - Evaluation: --do_eval with --predict_with_generate generates output to compute metrics;
#   --include_inputs_for_metrics includes the inputs when computing them.
# - Misc: --group_by_length batches samples of similar length; --seed 456 for reproducibility;
#   --temporary_save_to_disk and --save_to_disk store intermediate audio codes and the processed dataset.
# - To resume after an abrupt stop, add: --resume_from_checkpoint "./output_dir_training/checkpoint-1500-epoch-1"
!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "ai4bharat/indic-parler-tts" \
    --feature_extractor_name "ylacombe/dac_44khz" \
    --description_tokenizer_name "ai4bharat/indic-parler-tts" \
    --prompt_tokenizer_name "ai4bharat/indic-parler-tts" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "SPRINGLab/IndicTTS_Tamil" \
    --train_metadata_dataset_name "SrihariGKS/IndicTTS-Tamil-tagged" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "SPRINGLab/IndicTTS_Tamil" \
    --eval_metadata_dataset_name "SrihariGKS/IndicTTS-Tamil-tagged" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 2 \
    --per_device_eval_batch_size 2 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 20 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 10 \
    --gradient_accumulation_steps 18 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 2 \
    --learning_rate 0.00003 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --max_grad_norm 0.8 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 500 \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 2 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true
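The command above selects the "constant_with_warmup" schedule with 50 warmup steps and a learning rate of 0.00003. A minimal sketch of the resulting learning-rate curve follows, assuming the usual definition of this schedule (a linear ramp from zero during warmup, then a constant value). The function name is illustrative and the code is not taken from the training script.

```python
def constant_with_warmup_lr(step, base_lr=0.00003, warmup_steps=50):
    """Learning rate at a given optimizer step: linear warmup, then constant."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr

# Print the learning rate at a few steps to show the ramp and the plateau
for step in (0, 10, 25, 50, 1000):
    print(step, constant_with_warmup_lr(step))
```

The ramp keeps the first updates small, which helps when fine-tuning from a pretrained checkpoint.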


## 7. Model Inference

Test the model by generating speech samples from different prompts and descriptions.

In [None]:
# Loading the Model and the Tokenizer

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

tam_model = ParlerTTSForConditionalGeneration.from_pretrained("./output_dir_training", torch_dtype=torch.float16).to(device)
tam_tokenizer = AutoTokenizer.from_pretrained("./output_dir_training")
tam_description_tokenizer = AutoTokenizer.from_pretrained(tam_model.config.text_encoder._name_or_path)

In [None]:
tam_prompt = "இந்த இடத்தை நான் எங்கே காணலாம்? ஒரு மணி நேரமாக தேடிக் கொண்டிருக்கிறேன்"
tam_description = "A Female Speaker's high-pitched, engaging voice is captured in a clear, close-sounding recording. Her slightly slower delivery conveys a positive tone."

# Tokenize the description (voice characteristics) and the prompt (text to be spoken)
tam_description_input_ids = tam_description_tokenizer(tam_description, return_tensors="pt").to(device)
tam_prompt_input_ids = tam_tokenizer(tam_prompt, return_tensors="pt").to(device)

# Generate the waveform conditioned on the description and the prompt
tam_generation = tam_model.generate(
    input_ids=tam_description_input_ids.input_ids,
    attention_mask=tam_description_input_ids.attention_mask,
    prompt_input_ids=tam_prompt_input_ids.input_ids,
    prompt_attention_mask=tam_prompt_input_ids.attention_mask,
)

# Convert the output to a float32 NumPy array for playback
tam_audio_arr = tam_generation.cpu().numpy().squeeze().astype("float32")

from IPython.display import Audio
Audio(tam_audio_arr, rate=tam_model.config.sampling_rate)
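To keep a generated sample outside the notebook, the waveform can be written to a WAV file. The sketch below uses only Python's standard-library `wave` module to avoid an extra dependency. The helper name `save_wav`, the output path, and the synthetic test tone are assumptions for illustration, and the helper expects a one-dimensional sequence of floats in the range [-1, 1].

```python
import array
import math
import os
import sys
import tempfile
import wave

def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV files store samples in little-endian order
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes per sample (16-bit)
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())

# Stand-in for model output: one second of a 440 Hz tone at 8 kHz
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
out_path = os.path.join(tempfile.gettempdir(), "parler_tts_demo.wav")
save_wav(out_path, tone, 8000)
```

With the variables from the cell above, the call would be `save_wav("tamil_sample.wav", tam_audio_arr, tam_model.config.sampling_rate)`.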

## 8. Pushing the Model to Hugging Face Hub

In [None]:
# Pushes the fine-tuned model to the Hugging Face Hub

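# The model and tokenizer were loaded as tam_model and tam_tokenizer in the inference section
tam_model.push_to_hub("parler-tts-fine-tuned-tamil")
tam_tokenizer.push_to_hub("parler-tts-fine-tuned-tamil")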