# **Speech Synthesis for Response**
This notebook demonstrates a text-to-speech (TTS) pipeline built with the `TTS` Python package, the open-source library that began as Mozilla's TTS project. We convert text into speech that can be played back as audio. This is particularly useful in accessibility applications, such as helping low-literacy users interact with technology by voice.

The Tacotron 2 architecture, developed by Google and implemented in Mozilla's TTS library, is a neural text-to-speech synthesis system. It converts a sequence of text characters into a mel spectrogram, which a separate vocoder then turns into an audio waveform. Tacotron 2 stands out for producing very natural-sounding speech: in the original evaluation, listener ratings of its output came close to those of recorded human speech. The system combines several deep learning components:

**Sequence-to-Sequence Model**:

At its core, Tacotron 2 uses a sequence-to-sequence model with attention. This model maps a sequence of text characters to a sequence of mel spectrogram frames. A spectrogram is a time-frequency representation of sound: each frame describes how much energy the signal carries in each frequency band over a short window of time. Because the encoder and decoder run for independent numbers of steps, inputs and outputs can have different lengths, which suits speech synthesis, where a short text can correspond to many audio frames.
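
To make the idea of a spectrogram concrete, the following is a minimal sketch that computes a magnitude spectrogram of a pure 440 Hz tone with NumPy. The function name `magnitude_spectrogram` and the settings `fft_size=1024` and `hop_length=256` are illustrative assumptions, not part of the TTS library, and the sketch omits the mel filterbank that Tacotron 2 applies on top.

```python
import numpy as np

def magnitude_spectrogram(signal, fft_size=1024, hop_length=256):
    """Return an array of shape (n_frames, fft_size // 2 + 1).

    Hypothetical helper for illustration only: slice the signal into
    overlapping windows and take the magnitude of each window's FFT.
    """
    window = np.hanning(fft_size)
    n_frames = 1 + (len(signal) - fft_size) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + fft_size] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 22050                      # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate  # one second of time stamps
tone = np.sin(2 * np.pi * 440.0 * t)      # pure 440 Hz sine wave

spec = magnitude_spectrogram(tone)

# The frequency bin with the most energy should sit next to 440 Hz
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 1024
print(spec.shape, round(peak_hz, 1))
```

Each row of `spec` is one time frame and each column one frequency bin, so a sequence-to-sequence TTS model that predicts spectrogram frames is effectively predicting the rows of an array like this one.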

**Attention Mechanism**:

The attention mechanism in Tacotron 2 lets the decoder focus on different parts of the text input at each step of the synthesis process. This alignment between input characters and output frames is crucial for maintaining natural prosody and intonation, which vary across different parts of a sentence.
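
The weighting described above can be sketched with plain dot-product attention. This is a simplified stand-in, not Tacotron 2's own mechanism (which is location-sensitive attention), and the names `attention_step`, `encoder_outputs`, and `decoder_state` are assumptions chosen for illustration.

```python
import numpy as np

def attention_step(decoder_state, encoder_outputs):
    """One decoder step of dot-product attention (illustrative only).

    encoder_outputs: (n_chars, dim), one vector per input character
    decoder_state:   (dim,), the decoder's current query vector
    """
    scores = encoder_outputs @ decoder_state   # similarity per character
    scores = scores - scores.max()             # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ encoder_outputs        # weighted sum of encodings
    return weights, context

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(12, 8))     # 12 characters, 8 features
decoder_state = rng.normal(size=8)

weights, context = attention_step(decoder_state, encoder_outputs)
print(weights.round(3))   # one weight per input character, summing to 1
```

At every output frame the decoder recomputes these weights, so the character receiving the most weight moves through the sentence as synthesis proceeds.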

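**WaveNet Vocoder**:

Once Tacotron 2 generates a mel spectrogram, a vocoder converts it into a raw audio waveform. The original Tacotron 2 system uses WaveNet for this step, a neural network for generating raw audio that is built from dilated causal convolutions and is highly effective at capturing the nuances of human speech. The pre-trained model loaded later in this notebook is used without a separate neural vocoder; the `griffin_lim_iters` entry in its setup log suggests that the waveform is reconstructed with the Griffin-Lim algorithm instead.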

First, we need to install the necessary library to run the TTS model.

In [1]:
!pip install TTS




# **Model Selection and Setup**
We use the pre-trained `tts_models/en/ljspeech/tacotron2-DDC` model from Mozilla's TTS library, an English Tacotron 2 model trained on the LJSpeech dataset. Tacotron 2 models are a popular choice for speech synthesis because they produce high-quality speech at a moderate computational cost; the synthesis log later in this notebook reports a real-time factor of about 1.44 in this environment.

In [2]:
from TTS.utils.manage import ModelManager

# Initialize the model manager
model_manager = ModelManager()

# Download and load the pre-trained English TTS model
model_path, config_path, model_item = model_manager.download_model("tts_models/en/ljspeech/tacotron2-DDC")


 > Downloading model to /root/.local/share/tts/tts_models--en--ljspeech--tacotron2-DDC
 > Model's license - apache 2.0
 > Check https://choosealicense.com/licenses/apache-2.0/ for more info.


# **Text Preprocessing**
Text preprocessing is a crucial step to ensure the quality of the speech synthesis. It typically involves normalizing punctuation and may include converting numerals to words, correcting common spelling errors, and other language-specific adjustments.

In [3]:
# Preprocessing rules depend on the kind of text being synthesized; adjust them as needed
import re
import inflect

def preprocess_text(text):
    p = inflect.engine()

    # Expand common abbreviations first, while their periods are still present.
    # No trailing \b after the period: it would fail before a space or comma.
    abbreviations = {
        'Dr': 'Doctor', 'Mr': 'Mister', 'Mrs': 'Misses', 'Ms': 'Miss',
        'Co': 'Company', 'Inc': 'Incorporated', 'Ltd': 'Limited',
    }
    for abbr, expansion in abbreviations.items():
        text = re.sub(rf'\b{abbr}\.', expansion, text)

    # Convert numbers to words, treating comma-grouped digits (50,000) as one
    # number and reading a leading dollar sign as "dollars"
    def number_to_words(match):
        words = p.number_to_words(match.group(2).replace(',', ''))
        return f'{words} dollars' if match.group(1) else words

    text = re.sub(r'(\$?)\b(\d{1,3}(?:,\d{3})+|\d+)\b', number_to_words, text)

    # Normalize punctuation
    text = re.sub(r'[,;@#?!&$]+\s*', ' ', text)

    # Replace newlines and multiple spaces with a single space, then strip the ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Example usage
sample_text = "Dr. Smith owes $50,000. He works at XYZ Co., Inc."
preprocessed_text = preprocess_text(sample_text)
print(preprocessed_text)

Doctor Smith owes fifty thousand dollars. He works at XYZ Company Incorporated


In [5]:
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

# Initialize the model manager
model_manager = ModelManager()

# Download and load the pre-trained English TTS model
model_path, config_path, model_item = model_manager.download_model("tts_models/en/ljspeech/tacotron2-DDC")

# Load the synthesizer using the downloaded model
synthesizer = Synthesizer(model_path, config_path)


 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
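
The audio settings in this log determine the time resolution of the spectrogram the model predicts. As a rough sketch, the arithmetic below converts `hop_length` and `win_length` from samples to milliseconds using the logged `sample_rate`. The helper `approx_frames` is hypothetical, and the exact frame count in the library may differ slightly depending on padding.

```python
# Values copied from the Audio Processor log above
sample_rate = 22050   # samples per second
hop_length = 256      # samples between consecutive frames
win_length = 1024     # samples covered by each analysis window
num_mels = 80         # mel bands per frame

hop_ms = 1000 * hop_length / sample_rate       # about 11.61 ms per frame step
win_ms = 1000 * win_length / sample_rate       # about 46.44 ms per window
frames_per_second = sample_rate / hop_length   # about 86.13 frames per second

def approx_frames(duration_seconds):
    """Approximate number of spectrogram frames for a clip (hypothetical helper)."""
    return int(duration_seconds * sample_rate) // hop_length + 1

print(f"hop: {hop_ms:.2f} ms, window: {win_ms:.2f} ms")
print(f"a 3 s clip is roughly {approx_frames(3.0)} frames of {num_mels} mel values")
```

With these settings, each second of speech corresponds to roughly 86 decoder output frames, which is why even a short sentence requires several hundred decoding steps.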


In [8]:
# Example text for conversion
text = "Here is a simple sentence to convert to speech."
preprocessed_text = preprocess_text(text)  # preprocess_text is defined in the Text Preprocessing cell above

# Convert preprocessed text to speech
wav = synthesizer.tts(preprocessed_text)

 > Text splitted to sentences.
['Here is a simple sentence to convert to speech.']
 > Processing time: 4.988208055496216
 > Real-time factor: 1.4414707960748003
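
The two numbers in this log can be related with a short calculation. The sketch below assumes the reported real-time factor is the processing time divided by the duration of the generated audio; under that assumption, a value above 1 means synthesis took longer than the audio lasts. The function name `real_time_factor` is illustrative and not part of the library.

```python
def real_time_factor(processing_seconds, num_samples, sample_rate):
    """Processing time divided by audio duration (assumed definition)."""
    audio_seconds = num_samples / sample_rate
    return processing_seconds / audio_seconds

# Values copied from the log above
processing_time = 4.988208055496216
logged_rtf = 1.4414707960748003

# Under the assumed definition, the clip length follows from the two values
audio_seconds = processing_time / logged_rtf     # about 3.46 seconds
approx_samples = round(audio_seconds * 22050)    # at the logged sample rate

print(f"audio duration: {audio_seconds:.2f} s ({approx_samples} samples)")
print(f"recomputed RTF: {real_time_factor(processing_time, approx_samples, 22050):.3f}")
```

If the variable `wav` from the cell above is available, `len(wav) / 22050` gives the clip duration directly and can be compared against this estimate.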


# **Text-to-Speech Conversion**
The cell above converts the preprocessed text into speech using the selected TTS model. The output is a list of audio samples, which must be converted to a standard audio format before playback or storage.

In [9]:
!pip install pydub



Because the output is a list of floating-point audio samples, we need to convert it to a standard audio format that can be played back or saved. We scale the samples to 16-bit integers, pack them into bytes, and save the result as a WAV file.

In [10]:
from pydub import AudioSegment
import numpy as np

# synthesizer.tts returns a list of float samples; convert it to a numpy array
wav_array = np.array(wav, dtype=np.float32)

# Peak-normalize and convert to int16 (standard for PCM audio).
# Guard against an all-zero signal to avoid dividing by zero.
peak = np.max(np.abs(wav_array))
if peak > 0:
    wav_array = wav_array / peak
wav_array_int16 = np.int16(wav_array * 32767)

# Create an audio segment
audio_segment = AudioSegment(
    data=wav_array_int16.tobytes(),
    sample_width=2,  # 2 bytes (16 bits)
    frame_rate=22050,  # matches the sample_rate reported in the Audio Processor log
    channels=1
)

# Export to a file (this path assumes Google Drive is already mounted at /content/drive)
output_path = "/content/drive/MyDrive/Colab Notebooks/generated_speech.wav"
audio_segment.export(output_path, format="wav")

<_io.BufferedRandom name='/content/drive/MyDrive/Colab Notebooks/generated_speech.wav'>
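
The same float-to-PCM conversion can also be written without pydub. The following is a minimal sketch using NumPy and Python's standard `wave` module; the function name `float_to_wav_bytes` is an assumption for illustration, and a synthetic tone stands in for the synthesized speech so that the example runs on its own.

```python
import io
import wave

import numpy as np

def float_to_wav_bytes(samples, sample_rate=22050):
    """Encode float samples as a mono 16-bit PCM WAV file held in memory."""
    samples = np.asarray(samples, dtype=np.float32)
    peak = float(np.max(np.abs(samples))) if samples.size else 0.0
    if peak > 0:                       # avoid dividing by zero on silence
        samples = samples / peak
    pcm = (samples * 32767).astype("<i2")   # little-endian int16

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return buffer.getvalue()

# Stand-in signal: half a second of a 220 Hz tone instead of real speech
t = np.arange(11025) / 22050
wav_bytes = float_to_wav_bytes(0.3 * np.sin(2 * np.pi * 220.0 * t))

# Read the header back to confirm the format
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getnchannels(), check.getsampwidth(),
          check.getframerate(), check.getnframes())
```

Writing `wav_bytes` to disk with an ordinary `open(path, "wb")` call would produce a file equivalent in format to the one exported above.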

# **Playing Audio**
Finally, play the generated audio to verify the output. Listening to the result is the most direct way to catch mispronunciations, clipped words, or audible artifacts.

In [11]:
import IPython.display as ipd

# Play the generated audio
ipd.Audio(output_path)