## Text-to-Speech Synthesis with HuggingFace Transformers
This section covers speech synthesis with the HuggingFace Transformers library. Specifically, we use the SpeechT5 model, fine-tuned for speech synthesis on the LibriTTS dataset.

### Getting Started
To kick things off, make sure you have the necessary libraries installed. 

In [15]:
# Install required libraries (sentencepiece is needed by the SpeechT5 tokenizer)
!pip install transformers[torch] sentencepiece datasets soundfile

Import the essential components:

In [16]:
# Import necessary packages
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import random
import string
import soundfile as sf

In [17]:
# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"

### Loading Models and Data

- `text_processor` tokenizes the input text into the IDs the model expects.
- `tts_model` is the core model; it converts the tokenized text into a mel spectrogram (acoustic features).
- `vocoder_model` is the vocoder (voice coder), which turns those acoustic features into an audible waveform.
- A dataset of speaker embeddings (x-vectors) supplies the voice characteristics, which lets us synthesize speech with different speakers.
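
To make the division of labor described above concrete, the sketch below runs the stages one at a time instead of in a single call. It is a minimal illustration, not the notebook's own code: the helper name `synthesize_in_stages` is hypothetical, and it assumes that `generate_speech` returns a mel spectrogram when no vocoder is passed and that the vocoder can be called directly on that spectrogram.

```python
def synthesize_in_stages(processor, model, vocoder, text, speaker_embeddings, device="cpu"):
    """Run text-to-speech as three explicit stages (hypothetical helper).

    Assumes the arguments behave like SpeechT5Processor,
    SpeechT5ForTextToSpeech and SpeechT5HifiGan, and that the model,
    vocoder and speaker embeddings already live on `device`.
    """
    # Stage 1: the processor tokenizes the text into input IDs
    inputs = processor(text=text, return_tensors="pt").to(device)

    # Stage 2: without a vocoder, the model is assumed to return a
    # mel spectrogram (acoustic features) rather than audio samples
    spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)

    # Stage 3: the vocoder converts the spectrogram into a waveform;
    # detach() drops gradient tracking so the result can be converted to NumPy
    waveform = vocoder(spectrogram).detach()
    return spectrogram, waveform
```

Passing the vocoder to `generate_speech` directly, as the rest of this section does, is shorter; splitting the stages is mainly useful for inspecting the intermediate spectrogram.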

In [18]:
# Load the text processor
text_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

In [19]:
# Load the text-to-speech model
tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(device)

In [20]:
# Load the HiFi-GAN vocoder, which converts spectrograms into audio waveforms
vocoder_model = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

In [21]:
# Load the dataset to obtain speaker embeddings
speaker_embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

Available speakers and their corresponding IDs:

In [22]:
# Speaker IDs, used below as row indices into the embeddings dataset
# (verify that each index really belongs to the named speaker)
speakers = {
    'slt': 0,  # US female 
    'rms': 1,  # US male 
    'awb': 2,  # Scottish male
    'jmk': 3,  # Canadian male
    'ksp': 4  # Indian male
}
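
The integer values above are used later as row indices into `speaker_embeddings_dataset`, so it is worth checking which speaker each row actually belongs to. The sketch below is one hedged way to do that. It assumes each row carries a `filename` field whose value embeds the speaker code (for example `cmu_us_slt_arctic-...`); that layout and the helper `first_row_per_speaker` are assumptions for illustration, and the sample rows are made up rather than taken from the real dataset.

```python
import re

def first_row_per_speaker(filenames):
    """Map each speaker code to the first row index where it appears.

    Assumes filenames look like 'cmu_us_<speaker>_arctic-...', which is an
    assumption about the dataset layout; verify it against the real rows.
    """
    pattern = re.compile(r"cmu_us_([a-z]+)_arctic")
    first_rows = {}
    for index, name in enumerate(filenames):
        match = pattern.search(name)
        if match and match.group(1) not in first_rows:
            first_rows[match.group(1)] = index
    return first_rows

# Made-up stand-in for the dataset's filename column
sample_filenames = [
    "cmu_us_awb_arctic-wav-arctic_a0001",
    "cmu_us_awb_arctic-wav-arctic_a0002",
    "cmu_us_slt_arctic-wav-arctic_a0001",
]
print(first_row_per_speaker(sample_filenames))  # {'awb': 0, 'slt': 2}
```

If the real dataset exposes such a column, running the helper over it would give candidate indices to compare against the `speakers` dictionary.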

### Speech Synthesis Function

In [23]:
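def save_text_to_speech(text, speaker=None):
    """Synthesize `text` and save it to an audio file; return the filename.

    `speaker` is a row index into `speaker_embeddings_dataset`; if None,
    a random speaker embedding is used instead.
    """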
    # Preprocess the input text
    input_text = text_processor(text=text, return_tensors="pt").to(device)
    
    if speaker is not None:
        # Load speaker embeddings containing voice characteristics from the dataset
        speaker_embeddings = torch.tensor(speaker_embeddings_dataset[speaker]["xvector"]).unsqueeze(0).to(device)
    else:
        # Generate random speaker embeddings for a random voice
        speaker_embeddings = torch.randn((1, 512)).to(device)
    
    # Generate speech using the models
    generated_speech = tts_model.generate_speech(input_text["input_ids"], speaker_embeddings, vocoder=vocoder_model)
    
    if speaker is not None:
        # If a speaker is specified, use their ID in the filename
        output_filename = f"{speaker}-{'-'.join(text.split()[:3])}.mp3"
    else:
        # If no speaker is specified, use a random string in the filename
        random_str = ''.join(random.sample(string.ascii_letters+string.digits, k=5))
        output_filename = f"{random_str}-{'-'.join(text.split()[:3])}.mp3"
    
    # Save the generated speech to a file with a 16 kHz sampling rate
    # (writing .mp3 may require a libsndfile build with MP3 support; .wav is a safe fallback)
    sf.write(output_filename, generated_speech.cpu().numpy(), samplerate=16000)
    
    # Return the filename for reference
    return output_filename
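
The filename above is built directly from the first three words of the text, so punctuation or path separators in the input end up in the filename. As a hedged alternative, the sketch below builds a filename-safe slug instead. `make_output_filename` is a hypothetical helper, not part of the notebook, and the choice to keep only letters, digits, and hyphens is an assumption made here.

```python
import random
import re
import string

def make_output_filename(text, speaker=None, extension=".mp3", num_words=3):
    """Build '<prefix>-<slug><extension>' from the first words of the text."""
    # Keep only letters, digits and hyphens within each of the first words
    words = [re.sub(r"[^A-Za-z0-9-]", "", word) for word in text.split()[:num_words]]
    slug = "-".join(word for word in words if word)

    if speaker is not None:
        # Use the speaker ID as the prefix when one is given
        prefix = str(speaker)
    else:
        # Otherwise use a random 5-character prefix
        prefix = "".join(random.choices(string.ascii_letters + string.digits, k=5))
    return f"{prefix}-{slug}{extension}"

print(make_output_filename("Hello, world! How are you?", speaker=4))  # 4-Hello-world-How.mp3
```

The helper could replace the two f-string branches inside `save_text_to_speech` without changing how the audio itself is generated.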

Use the speech synthesis function to generate speech with different voices:

In [24]:
# Generate speech with an Indian male voice
save_text_to_speech("Text-to-Speech is a technology that converts digital text into audible speech.", speaker=speakers["ksp"])

'4-Text-to-Speech-is-a.mp3'

In [25]:
# Generate speech with a random voice
save_text_to_speech("Text-to-Speech is a technology that converts digital text into audible speech.")

'gC3Fc-Text-to-Speech-is-a.mp3'

In [26]:
# Text to convert using all available speaker voices
text_to_convert = """Text-to-Speech offers a range of applications, from accessibility tools for visually 
impaired individuals to voice assistants, e-learning platforms, audiobooks, and more. 
It can also provide a multisensory reading experience for specially abled people that 
combines seeing with hearing. With the latest techniques, one can generate a synthetic 
voice from only a few minutes of audio data, which is ideal for those who have lost their 
voice and only have limited recordings."""

for speaker_name, speaker_id in speakers.items():
    output_filename = save_text_to_speech(text_to_convert, speaker_id)
    print(f"Saved {output_filename}")

# Generate speech with a random speaker
output_filename = save_text_to_speech(text_to_convert)
print(f"Saved {output_filename}")

Saved 0-Text-to-Speech-offers-a.mp3
Saved 1-Text-to-Speech-offers-a.mp3
Saved 2-Text-to-Speech-offers-a.mp3
Saved 3-Text-to-Speech-offers-a.mp3
Saved 4-Text-to-Speech-offers-a.mp3
Saved gYUDI-Text-to-Speech-offers-a.mp3
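
The passage above is synthesized in a single call. For longer inputs, which may exceed the model's input length limit, one option is to split the text into sentences and synthesize them one at a time. The sketch below shows only the splitting step. `split_into_sentences` is a hypothetical helper, and its regular expression is a simple heuristic that will mis-split abbreviations such as "e.g.".

```python
import re

def split_into_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    # Collapse newlines and repeated spaces into single spaces first
    normalized = " ".join(text.split())
    if not normalized:
        return []
    return re.split(r"(?<=[.!?])\s+", normalized)

sample = """Text-to-Speech has many uses. It can read
documents aloud! Can it sing?"""
for sentence in split_into_sentences(sample):
    print(sentence)
```

Each sentence could then be passed to `save_text_to_speech`, or the resulting waveforms could be concatenated before saving.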
