<a href="https://colab.research.google.com/github/rapha18th/AWS-sagemaker-Project1/blob/master/rairo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise on TTS

This notebook outlines a workflow to fine-tune a baseline TTS model for Nigerian English. We will:

- Set up the environment
- Preprocess the dataset
- Fine-tune a baseline TTS model (using a pre-trained model from HuggingFace, e.g., `facebook/mms-tts-eng`)
- Generate WAV outputs for given text examples
- Discuss evaluation strategies

Deadline: Friday, Mar 21, 2025

Reference: [Exercise on TTS.pdf](file-service://file-R7VWcf23C2EkjC6TSfP8mB)

In [None]:
# Environment Setup
!pip install transformers datasets torchaudio TTS

import os
import pandas as pd
import torchaudio

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset

print('Setup complete.')

## 1. Data Preprocessing

Assume you have a CSV file (`metadata.csv`) containing 500 samples of Nigerian English in the following format:

- `audio_filepath`: Path to the audio file (if provided)
- `text`: The transcription for the audio

We perform a basic text normalization here (for example, lowercasing). If audio preprocessing is needed, ensure all audio files are converted to a common format (e.g., WAV with a consistent sample rate).
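
Before converting anything, it can help to inspect the WAV headers and see whether the files already share one format. The sketch below uses Python's standard `wave` module, which only reads uncompressed PCM WAV files. The function name `summarize_wav_formats` and the two throwaway demo files are illustrative assumptions, not part of the exercise data.

```python
import os
import tempfile
import wave
from collections import Counter


def summarize_wav_formats(paths):
    """Count how many files use each (sample_rate, channels, sample_width) combination."""
    formats = Counter()
    for path in paths:
        with wave.open(path, "rb") as wav_file:
            key = (wav_file.getframerate(), wav_file.getnchannels(), wav_file.getsampwidth())
            formats[key] += 1
    return formats


def _write_silence(path, sample_rate, seconds=0.1):
    """Demo helper: write a short mono 16-bit clip of silence."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))


# Demo with two hypothetical files recorded at different sample rates.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_paths = [os.path.join(tmp_dir, "a.wav"), os.path.join(tmp_dir, "b.wav")]
    _write_silence(demo_paths[0], 16000)
    _write_silence(demo_paths[1], 22050)
    demo_formats = summarize_wav_formats(demo_paths)

# More than one distinct format means the audio should be converted first.
needs_resampling = len(demo_formats) > 1
print(demo_formats)
print("Resampling needed:", needs_resampling)
```

If more than one format shows up, the audio should be resampled to a single rate before training.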

In [None]:
import os
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Specify the path to the folder you want to read
folder_path = '/content/drive/My Drive/tts_data'  # Replace with your actual folder path

try:
    # List all files and directories in the folder
    contents = os.listdir(folder_path)

    print(f"Contents of folder '{folder_path}':")
    for item in contents:
        full_path = os.path.join(folder_path, item) # creates the full path to the file or directory.
        if os.path.isfile(full_path):
            print(f"  File: {item}")
        elif os.path.isdir(full_path):
            print(f"  Directory: {item}")
        else:
            print(f"  Other: {item}")  # e.g. broken symlinks or special files
except FileNotFoundError:
    print(f"Error: Folder '{folder_path}' not found.")
except NotADirectoryError:
    print(f"Error: '{folder_path}' is not a directory.")
except Exception as e:
    print(f"An error occurred: {e}")

Mounted at /content/drive
Contents of folder '/content/drive/My Drive/tts_data':
  File: .DS_Store
  Directory: pleshy_3
  Directory: pleshy_1


In [None]:
folders_ls = ["pleshy_1", "pleshy_3"]
for i in folders_ls:
  folder_path = f'/content/drive/My Drive/tts_data/{i}'  # Replace with your path (Colab) or "./your_local_folder" (local)
  item_count = len(os.listdir(folder_path))
  print(f"{i}: {item_count}")


pleshy_1: 251
pleshy_3: 251


In [None]:
import pandas as pd
file_path = "/content/drive/My Drive/tts_data/pleshy_1/recorder.tsv"
df_pleshy1 = pd.read_csv(file_path, delimiter='\t', header='infer')

In [None]:
df_pleshy1.head()

| Unnamed: 0 | /Users/aremu/Desktop/audio-data/recorder_2024-04-11_13-38-52_009308.wav | Precious_nPpAHJbgMdj6eKqy | f_0001-0250_civic | Unnamed: 3 | DUTIES, FUNCTIONS AND POWER OF THE COUNCIL OF MINISTERS |
|---|---|---|---|---|---|
| 0 | /Users/aremu/Desktop/audio-data/recorder_2024-... | Precious_nPpAHJbgMdj6eKqy | f_0001-0250_civic | | It promote peace in the society It makes all h... |
| 1 | /Users/aremu/Desktop/audio-data/recorder_2024-... | Precious_nPpAHJbgMdj6eKqy | f_0001-0250_civic | | Promoting skill acquisition through its Nation... |
| 2 | /Users/aremu/Desktop/audio-data/recorder_2024-... | Precious_nPpAHJbgMdj6eKqy | f_0001-0250_civic | | Displaying patriotism: Patriotism means having... |
| 3 | /Users/aremu/Desktop/audio-data/recorder_2024-... | Precious_nPpAHJbgMdj6eKqy | f_0001-0250_civic | | It laid down procedures for the creation of re... |
| 4 | /Users/aremu/Desktop/audio-data/recorder_2024-... | Precious_nPpAHJbgMdj6eKqy | f_0001-0250_civic | | (three) write out the reasons why people get m... |
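
The column labels in this preview are themselves a data record (a file path, a speaker ID, a transcription). This suggests that `recorder.tsv` has no header row and that `header='infer'` consumed the first sample as column names. In pandas, passing `header=None` together with `names=[...]` avoids this. The sketch below shows the same idea with the standard `csv` module on a small in-memory example. The column names and the two toy rows are assumptions for illustration, not the real file contents.

```python
import csv
import io

# Hypothetical column names; the real file does not label its columns.
COLUMNS = ["audio_filepath", "speaker_id", "prompt_set", "unused", "text"]

# Toy stand-in for a headerless TSV (not the real data).
toy_tsv = (
    "clip_001.wav\tspeaker_a\tset_1\t\tFirst transcription\n"
    "clip_002.wav\tspeaker_a\tset_1\t\tSecond transcription\n"
)


def read_headerless_tsv(handle, columns):
    """Treat every line as data and attach the given column names to each row."""
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(columns, row)) for row in reader]


rows = read_headerless_tsv(io.StringIO(toy_tsv), COLUMNS)
print(len(rows), rows[0]["text"])
```

With the header handled explicitly, the first recording stays in the dataset instead of becoming column labels.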


In [None]:
# Data Preprocessing

# Load the CSV metadata file
df = pd.read_csv('metadata.csv')
print('First 5 entries of the dataset:')
print(df.head())

# Define a text normalization function
def normalize_text(text):
    # Example normalization: lowercasing; add additional rules as needed
    return text.lower()

# Apply normalization
df['text'] = df['text'].apply(normalize_text)

# Save the normalized metadata
df.to_csv('normalized_metadata.csv', index=False)
print('Data preprocessing complete. Normalized metadata saved as normalized_metadata.csv')

## 2. Fine-Tuning the Baseline TTS Model

In this section, we fine-tune a baseline TTS model using the preprocessed data. We use the MMS-TTS English checkpoint `facebook/mms-tts-eng` from HuggingFace as our starting point; MMS-TTS checkpoints are VITS models.

**Steps:**

1. Load the pre-trained model and tokenizer
2. Load the dataset (our normalized CSV file) and map it into the expected format
3. Set up training parameters
4. Run the fine-tuning process

Note: The exact data pipeline may vary depending on the chosen TTS model’s requirements. Adjust preprocessing and training steps accordingly.
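
For step 3, it helps to know how many optimizer steps a run will take, since checkpoint and logging intervals are expressed in steps. This is a rough sketch. It assumes the 500 samples mentioned in Section 1 are all used for training, on a single device, with no gradient accumulation, and with the batch size, epoch count, and intervals from this notebook's training cell.

```python
import math

num_samples = 500      # dataset size stated in Section 1 (assumed fully used for training)
batch_size = 4         # per_device_train_batch_size
num_epochs = 10        # num_train_epochs
save_steps = 500       # checkpoint interval, in optimizer steps
logging_steps = 100    # logging interval, in optimizer steps

# The last partial batch of an epoch still counts as one step.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which an intermediate checkpoint would be written.
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
num_log_events = total_steps // logging_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"checkpoints written at steps: {checkpoint_steps}")
print(f"log events: {num_log_events}")
```

Under these assumptions the run takes 1,250 steps, so `save_steps=500` yields two intermediate checkpoints.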

In [None]:
# Load the pre-trained TTS model and tokenizer.
# MMS-TTS checkpoints are VITS models, so they load with VitsModel rather than a seq2seq LM class.
from transformers import VitsModel

model_name = 'facebook/mms-tts-eng'  # English MMS-TTS checkpoint; replace if you use a different model
model = VitsModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print('Loaded baseline TTS model and tokenizer.')

# Load the dataset using the HuggingFace Datasets library
dataset = load_dataset('csv', data_files={'train': 'normalized_metadata.csv'})
print('Dataset loaded:')
print(dataset['train'][0])

# Define a preprocessing function to tokenize the text
def preprocess_function(examples):
    # Tokenize the text input; adjust max_length and other parameters as required
    inputs = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)
    # Additional steps might include mapping audio file paths to features if needed
    return inputs

# Apply the preprocessing function to the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

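# Set training parameters
training_args = TrainingArguments(
    output_dir='./tts_finetuned',
    num_train_epochs=10,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
)

# Create a Trainer instance and fine-tune the model.
# Note: Trainer needs the model's forward pass to return a loss. The dataset above holds only
# tokenized text, so audio targets and a TTS-specific training objective must be added
# before this call can succeed.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
)

print('Starting fine-tuning...')
trainer.train()
print('Fine-tuning complete.')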

## 3. Inference: Generating WAV Files

After fine-tuning, we use the model to generate audio for a set of predefined sentences. Below are the sentences provided in the exercise. The model’s inference API might differ, so replace the body of the `synthesize_text` function with the appropriate call for the chosen TTS framework.
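
Whatever inference call is used, the last step is writing the returned waveform to disk. The notebook uses `torchaudio.save` for this. As a dependency-free alternative, the sketch below writes float samples in [-1, 1] as 16-bit PCM with the standard `wave` module. The function name `write_wav_16bit` and the 440 Hz test tone are illustrative assumptions standing in for real model output.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_16bit(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    # Clip first so out-of-range values cannot overflow the 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack(f"<{len(clipped)}h", *(int(round(s * 32767)) for s in clipped))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)


# Demo: half a second of a 440 Hz tone stands in for synthesized speech.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate // 2)]

out_path = os.path.join(tempfile.mkdtemp(), "output_demo.wav")
write_wav_16bit(out_path, tone, sample_rate)

# Read the header back to confirm what was written.
with wave.open(out_path, "rb") as wav_file:
    written_rate = wav_file.getframerate()
    written_frames = wav_file.getnframes()

print(written_rate, written_frames)
```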

In [None]:
# List of sentences to generate audio
sentences = [
 "Voluntary participation of citizens in social groups, networks and social transformation",
 "Later on black eye, Vikings, Mafia, Black Beret, daughters of jezebel",
 "Statutory instruments: These are known as ministerial orders or departmental orders",
 "It makes one to respect other people's views, culture and religion",
 "The Electorate can check the excesses of the government through elections"
]

import torch

def synthesize_text(text):
    # Assumes `model` is a VITS-style model such as transformers.VitsModel, whose forward
    # pass returns a `waveform` tensor of shape (batch, samples). Other TTS frameworks
    # expose different inference calls, so adapt this function accordingly.
    model.eval()
    inputs = tokenizer(text, return_tensors='pt').to(model.device)
    with torch.no_grad():
        waveform = model(**inputs).waveform
    # torchaudio.save expects a CPU tensor shaped (channels, samples)
    return waveform.cpu(), model.config.sampling_rate

for idx, sentence in enumerate(sentences, start=1):
    print(f"Synthesizing audio for sentence {idx}...")
    waveform, sr = synthesize_text(sentence)
    output_filename = f"output_{idx}.wav"
    torchaudio.save(output_filename, waveform, sr)
    print(f"Saved {output_filename}")

## 4. Model Evaluation

### Objective Evaluation

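- **Mel Cepstral Distortion (MCD):** Measure the distance between the mel-cepstral coefficients of generated audio and time-aligned reference recordings; lower values indicate a closer spectral match.
- **Signal-to-Noise Ratio (SNR):** Measure how much noise the generated waveform contains relative to the signal, in dB; higher values indicate cleaner audio.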
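
Both metrics reduce to short formulas once the inputs are prepared. The sketch below assumes the mel-cepstral frames have already been extracted and time-aligned (for example with dynamic time warping) and that the energy coefficient has been excluded. For SNR it assumes the reference and generated waveforms have equal length. Feature extraction and alignment are out of scope here, and the function names and toy values are illustrative.

```python
import math


def mel_cepstral_distortion(ref_frames, gen_frames):
    """Mean MCD in dB over paired, time-aligned frames of mel-cepstral coefficients."""
    if len(ref_frames) != len(gen_frames) or not ref_frames:
        raise ValueError("Need the same non-zero number of aligned frames")
    scale = 10.0 / math.log(10.0)
    per_frame = [
        scale * math.sqrt(2.0 * sum((r - g) ** 2 for r, g in zip(ref, gen)))
        for ref, gen in zip(ref_frames, gen_frames)
    ]
    return sum(per_frame) / len(per_frame)


def snr_db(reference, generated):
    """SNR in dB, treating the sample-wise difference (reference - generated) as noise."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("Need two non-empty signals of equal length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - g) ** 2 for r, g in zip(reference, generated))
    if noise_power == 0:
        return math.inf
    if signal_power == 0:
        return -math.inf
    return 10.0 * math.log10(signal_power / noise_power)


# Toy values only, to show the call shapes.
mcd_same = mel_cepstral_distortion([[1.0, 2.0]], [[1.0, 2.0]])
mcd_unit = mel_cepstral_distortion([[0.0, 0.0]], [[1.0, 0.0]])
snr_toy = snr_db([1.0, 1.0, 1.0, 1.0], [1.1, 0.9, 1.1, 0.9])

print(f"MCD (identical frames): {mcd_same:.3f} dB")
print(f"MCD (unit difference):  {mcd_unit:.3f} dB")
print(f"SNR (toy signal):       {snr_toy:.2f} dB")
```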

### Subjective Evaluation

- Conduct listening tests with native speakers to rate naturalness, accent fidelity, and clarity.
- Use Mean Opinion Score (MOS) tests where evaluators score the audio on a scale (e.g., 1-5).
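
MOS results are usually reported as a mean with a confidence interval, so that differences between systems can be judged against rater variability. This is a minimal sketch, assuming ratings on a 1-5 scale and a normal-approximation 95% interval, which is only reasonable with enough ratings. The ratings below are made-up placeholders, not collected data.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    if len(ratings) < 2:
        raise ValueError("Need at least two ratings to estimate the spread")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Placeholder ratings from five hypothetical listeners for one system.
toy_ratings = [4, 5, 3, 4, 4]
mos, ci = mean_opinion_score(toy_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```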

### Error Analysis

Analyze any phonetic or prosodic issues by comparing generated outputs with ground truth samples (if available) to pinpoint areas for improvement.

## 5. Report and Submission

Include a brief summary of the approach in your final report covering:

- Data preprocessing steps
- Model selection and fine-tuning details
- Evaluation strategies
- Challenges encountered and lessons learned

Submit the source code and the generated WAV files via the provided Google form.
