
# Complete TTS Fine-Tuning Notebook

This notebook performs the following steps:

1. **Data Loading and Preprocessing:** Reads two TSV metadata files (each with 250 samples) corresponding to the audio directories `pleshy_1` and `pleshy_3`, creates a combined dataset, normalizes the text, and constructs the full audio file paths.
2. **Fine-Tuning:** Loads a baseline TTS model from Hugging Face (the English MMS-TTS checkpoint, published as `facebook/mms-tts-eng`), tokenizes the text data, and sets up fine-tuning with Hugging Face's Trainer API.
3. **Inference:** Uses the fine-tuned model to generate audio (WAV files) for a set of example sentences.
4. **Evaluation:** Outlines an evaluation strategy (objective and subjective measures).

Reference: Exercise on TTS.pdf

In [None]:
import os
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Specify the path to the folder you want to read
folder_path = '/content/drive/My Drive/tts_data'  # Replace with your actual folder path

try:
    # List all files and directories in the folder
    contents = os.listdir(folder_path)

    print(f"Contents of folder '{folder_path}':")
    for item in contents:
        full_path = os.path.join(folder_path, item) # creates the full path to the file or directory.
        if os.path.isfile(full_path):
            print(f"  File: {item}")
        elif os.path.isdir(full_path):
            print(f"  Directory: {item}")
        else:
            print(f"  Other: {item}")  # e.g. broken symlinks or special files
except FileNotFoundError:
    print(f"Error: Folder '{folder_path}' not found.")
except NotADirectoryError:
    print(f"Error: '{folder_path}' is not a directory.")
except Exception as e:
    print(f"An error occurred: {e}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Contents of folder '/content/drive/My Drive/tts_data':
  File: .DS_Store
  Directory: pleshy_3
  Directory: pleshy_1


In [None]:
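# Reinstall the stack from scratch so all packages resolve against one numpy version.
# Restart the runtime after this cell and before importing any of these packages;
# a running session can keep the previously loaded numpy in memory.
!pip uninstall -y numpy pandas torch torchaudio transformers datasets
!pip install pandas torch torchaudio transformers datasets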

Found existing installation: numpy 1.23.5
Uninstalling numpy-1.23.5:
  Successfully uninstalled numpy-1.23.5
Collecting pandas
  Using cached pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting torch
  Using cached torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchaudio
  Using cached torchaudio-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (6.6 kB)
Collecting transformers
  Using cached transformers-4.50.0-py3-none-any.whl.metadata (39 kB)
Collecting datasets
  Using cached datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting numpy>=1.23.2 (from pandas)
  Downloading numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014

In [None]:
# Environment setup: import the required packages.
# If the packages were just (re)installed, restart the runtime first
# (in Google Colab: 'Runtime' -> 'Restart runtime'), then run this cell.
# Earlier install attempts, kept for reference:
#!pip uninstall -y numpy
#!pip install --no-cache-dir numpy pandas torch torchaudio transformers datasets
#!pip install --upgrade numpy
#!pip install --upgrade pandas  # Upgrade pandas so it matches the new numpy
import os

import pandas as pd
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

print('Environment setup complete.')

RuntimeError: Failed to import transformers.models.auto.modeling_auto because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
cannot import name '_center' from 'numpy._core.umath' (/usr/local/lib/python3.11/dist-packages/numpy/_core/umath.py)
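
An import error like the one above often indicates that the session still holds an older numpy in memory after the reinstall. A minimal sketch for checking which versions are actually installed once the runtime has been restarted, using only the standard library; the helper name `installed_versions` is illustrative and not part of the original notebook:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print the versions of the packages this notebook relies on.
stack = ["numpy", "pandas", "torch", "torchaudio", "transformers", "datasets"]
for name, version in installed_versions(stack).items():
    print(f"{name}: {version or 'not installed'}")
```

If the numpy version printed here differs from the path shown in the traceback, the runtime has most likely not been restarted since the install.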

In [None]:
# Load the TSV metadata files for each folder
# Assume each TSV has columns like 'filename' and 'text'
metadata1 = pd.read_csv('recorder1.tsv', sep='\t')
metadata3 = pd.read_csv('recorder3.tsv', sep='\t')

print('Metadata from recorder1.tsv:')
print(metadata1.head())

print('Metadata from recorder3.tsv:')
print(metadata3.head())
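
The loading cell assumes each TSV has `filename` and `text` columns. A small sketch for verifying that assumption before the later cells depend on it; the helper `missing_columns` is hypothetical, and the column names are the ones assumed in the comment above:

```python
import csv


def missing_columns(tsv_path, required=("filename", "text")):
    """Return the required column names that are absent from the TSV header."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # Read only the first row, which is expected to be the header.
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]
```

For example, `missing_columns('recorder1.tsv')` would return an empty list when both columns are present.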

In [None]:
# Add the full audio file path for each sample
# For metadata from recorder1.tsv, files are in folder 'pleshy_1'; for recorder3.tsv, in folder 'pleshy_3'
metadata1['audio_filepath'] = metadata1['filename'].apply(lambda x: os.path.join('pleshy_1', x))
metadata3['audio_filepath'] = metadata3['filename'].apply(lambda x: os.path.join('pleshy_3', x))

# Combine the two metadata DataFrames
combined_metadata = pd.concat([metadata1, metadata3], ignore_index=True)

print('Combined metadata:')
print(combined_metadata.head())
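
Because the paths are built by string joining, a typo in a filename or a wrong working directory only surfaces later, when the audio is read. A minimal sketch of an existence check; `find_missing_files` and its `base_dir` parameter are illustrative names, not part of the notebook:

```python
import os


def find_missing_files(paths, base_dir="."):
    """Return the entries of `paths` that are not existing files under `base_dir`."""
    return [p for p in paths if not os.path.isfile(os.path.join(base_dir, p))]
```

In this notebook it could be called as `find_missing_files(combined_metadata['audio_filepath'], base_dir=folder_path)`, assuming the `pleshy_1` and `pleshy_3` directories sit directly under `folder_path`.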

In [None]:
# Text normalization function (example: lowercasing)
def normalize_text(text):
    return text.lower()

# Apply normalization
combined_metadata['text'] = combined_metadata['text'].apply(normalize_text)

# Reorder columns if needed (ensure 'audio_filepath' and 'text' exist)
combined_metadata = combined_metadata[['audio_filepath', 'text']]

# Save the combined metadata as a CSV file (using comma delimiter for HuggingFace dataset loading)
combined_metadata.to_csv('combined_metadata.csv', index=False)
print('Combined metadata saved as combined_metadata.csv')
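
Lowercasing alone leaves stray whitespace and mixed Unicode forms in the transcripts. A slightly fuller normalizer might look like the sketch below; the function name `normalize_text_extended` and the choice of NFKC normalization are assumptions for illustration, and the right rules depend on the tokenizer of the model being fine-tuned:

```python
import re
import unicodedata


def normalize_text_extended(text):
    """Lowercase, unify Unicode compatibility forms, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", str(text))
    text = text.lower()
    # Replace any run of whitespace (tabs, newlines, spaces) with one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

It can be dropped in place of `normalize_text` in the `apply` call above.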

In [None]:
# Load the combined metadata as a dataset using HuggingFace Datasets
dataset = load_dataset('csv', data_files={'train': 'combined_metadata.csv'})

print('Dataset loaded:')
print(dataset['train'][0])

In [None]:
## Fine-Tuning the Baseline TTS Model

# MMS-TTS checkpoints are VITS models, so they load with VitsModel rather than
# AutoModelForSeq2SeqLM. The English checkpoint is published as 'facebook/mms-tts-eng'.
from transformers import VitsModel

# Load the pre-trained TTS model and tokenizer
model_name = 'facebook/mms-tts-eng'  # Replace with the actual TTS model if different
model = VitsModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print('Loaded TTS model and tokenizer.')

# Define a preprocessing function to tokenize the text
def preprocess_function(examples):
    # Tokenize the text; adjust parameters as needed for your TTS model
    inputs = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)
    # Only the text is processed here. A real TTS fine-tuning run also needs audio
    # targets extracted from the files listed in 'audio_filepath'.
    return inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Set up training arguments
training_args = TrainingArguments(
    output_dir='./tts_finetuned',
    num_train_epochs=10,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100
)

# Initialize the Trainer
# Caveat: Trainer needs the model to return a loss. To our knowledge, VitsModel in
# transformers does not compute a training loss, so trainer.train() is expected to
# fail as written. Treat this cell as an outline and use a dedicated VITS training
# recipe for the actual fine-tuning.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train']
)

print('Starting fine-tuning...')
trainer.train()
print('Fine-tuning complete.')
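
To see what the training arguments imply for this dataset, the arithmetic can be sketched as follows. It assumes the 500 samples described in the introduction (two files of 250), a single device, and no gradient accumulation; `total_training_steps` is an illustrative helper, not a Trainer API:

```python
import math


def total_training_steps(num_samples, batch_size, epochs):
    """Optimizer steps for a full run, with the last partial batch kept."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


total = total_training_steps(num_samples=500, batch_size=4, epochs=10)
# Steps at which a step-based checkpoint would be due with save_steps=500.
checkpoint_steps = list(range(500, total + 1, 500))
print(total, checkpoint_steps)
```

With these numbers the run is short, so `logging_steps=100` yields only a handful of log entries; smaller values may be more informative on a dataset of this size.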

In [None]:
## Inference: Generating WAV Files

# List of example sentences to synthesize
sentences = [
    "Voluntary participation of citizens in social groups, networks and social transformation",
    "Later on black eye, Vikings, Mafia, Black Beret, daughters of jezebel",
    "Statutory instruments: These are known as ministerial orders or departmental orders",
    "It makes one to respect other people's views, culture and religion",
    "The Electorate can check the excesses of the government through elections"
]

def synthesize_text(text):
    # Assumes `model` is a VITS model (transformers.VitsModel), whose forward pass
    # returns the synthesized audio in `.waveform` with shape (batch, num_samples).
    inputs = tokenizer(text, return_tensors='pt').to(model.device)

    model.eval()
    with torch.no_grad():
        waveform = model(**inputs).waveform

    # The output sample rate is stored in the model config.
    sample_rate = model.config.sampling_rate
    # torchaudio.save expects a CPU tensor of shape (channels, num_samples).
    return waveform.cpu(), sample_rate

for idx, sentence in enumerate(sentences):
    print(f"Synthesizing audio for sentence {idx+1}...")
    waveform, sr = synthesize_text(sentence)
    output_filename = f"output_{idx+1}.wav"
    torchaudio.save(output_filename, waveform, sr)
    print(f"Saved {output_filename}")

## Evaluation

### Objective Evaluation

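- **Mel Cepstral Distortion (MCD):** Measure the distance between the mel-cepstral coefficients of the synthesized audio and those of time-aligned reference recordings; lower values indicate a closer spectral match.
- **Signal-to-Noise Ratio (SNR):** Estimate how much noise the generated waveform contains relative to the speech signal; this requires a clean reference signal or a separate noise estimate.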
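
Both metrics above reduce to short formulas once features are available. The sketch below uses a common MCD definition, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences) averaged over frames, and a reference-based SNR. It assumes the mel-cepstral frames have already been extracted and time-aligned (for example with dynamic time warping), which is not shown; the function names are illustrative:

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Each frame is a sequence of coefficients; the 0th (energy) coefficient
    is assumed to have been dropped beforehand.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("inputs must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        total += math.sqrt(sum((r - s) ** 2 for r, s in zip(ref, syn)))
    return scale * total / len(ref_frames)


def snr_db(reference, estimate):
    """SNR in dB, treating (estimate - reference) as the noise."""
    signal_power = sum(x * x for x in reference)
    noise_power = sum((y - x) ** 2 for x, y in zip(reference, estimate))
    if noise_power == 0:
        return math.inf
    if signal_power == 0:
        return -math.inf
    return 10.0 * math.log10(signal_power / noise_power)
```

Note that the reference-based SNR only makes sense when the two waveforms are sample-aligned, which is rarely the case for synthesized versus recorded speech; MCD after alignment is usually the more practical of the two.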

### Subjective Evaluation

- Conduct listening tests with native speakers to rate the naturalness, accent fidelity, and clarity.
- Use Mean Opinion Score (MOS) tests to quantify the perceptual quality of the speech.
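
A MOS is usually reported with a confidence interval, since listener panels are small. A minimal sketch, assuming ratings on a 1 to 5 scale and a normal approximation for the 95% interval; the helper name and the ratings in the usage note are made up for illustration:

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width
```

For instance, `mean_opinion_score([4, 5, 3, 4, 4])` gives a mean of 4 with a half-width of roughly 0.62, so the result would be reported as about 4.0 ± 0.6. With so few ratings a t-distribution quantile would be more appropriate than 1.96.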

### Error Analysis

Compare generated outputs with any available ground-truth samples to identify pronunciation or prosody issues.