# Hearing / Voice Models Rendering

Source 
- [Intro to text-to-speech models](https://huggingface.co/learn/audio-course/en/chapter6/pre-trained_models)

Dependencies 

```bash
pip install librosa soundfile speechbrain torchaudio transformers
```


## Quick Demo: How to download audio data from HF and play it in a notebook

Basics of handling audio in a Jupyter notebook.

In [None]:
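# Download the sample clip first (run in a terminal, or prefix the command with `!` in a notebook cell):
# curl -L -o "01-00.04.75_00.07.46.wav" "https://huggingface.co/datasets/mio/sukasuka-anime-vocal-dataset/resolve/main/Chtholly/01-00.04.75_00.07.46.wav"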

from IPython.display import Audio

Audio("01-00.04.75_00.07.46.wav")
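Before playing or processing a clip, it can help to check its channel count and sample rate, since both matter for the models used later. The following is a minimal sketch using only Python's standard `wave` module, which reads PCM WAV files only. The helper names `describe_wav` and `write_test_tone` and the generated test tone are illustrative and not part of the original notebook.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return basic header information for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "frames": frames,
            "duration_s": frames / rate,
        }


def write_test_tone(path, rate=44100, seconds=0.5, channels=2, freq=440.0):
    """Write a 16-bit PCM sine tone so the helper has a file to inspect."""
    n = int(rate * seconds)
    chunks = []
    for i in range(n):
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / rate))
        # Each frame holds one sample per channel; reuse the same value for all channels
        chunks.append(struct.pack("<" + "h" * channels, *([sample] * channels)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"".join(chunks))


# Hypothetical demo file; replace demo_path with your own PCM WAV file
demo_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_test_tone(demo_path)
info = describe_wav(demo_path)
print(info)
```

A stereo file or a sample rate other than 16000 Hz signals that the clip needs downmixing or resampling before it is passed to the models below.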

## How to run a text-to-speech model

### Setting up

In [None]:
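# Note: newer SpeechBrain releases (1.0+) moved this module to speechbrain.inference;
# the old import path below may emit a deprecation warning there.
from speechbrain.pretrained import EncoderClassifier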
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan


#### **1. Load the processor and the text-to-speech model**

In [None]:
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts") # tokenizes the input text (used like a tokenizer)
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts") # predicts a mel spectrogram from the token IDs and a speaker embedding

#### **2. Load the Speaker Embedding model (Optional)**

This model encodes WAV audio into x-vectors, a widely used speaker-embedding representation that captures the characteristics of a voice.
This step is optional: load the model only if your dataset does not already provide x-vectors and you need to compute them from audio files.

In [None]:
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb", savedir="pretrained_models/spkrec-xvect-voxceleb")
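x-vectors are commonly compared with cosine similarity: embeddings of the same voice tend to score closer to 1 than embeddings of different voices. The sketch below shows L2 normalization and cosine similarity on plain Python lists. The short vectors are made-up stand-ins for real 512-dimensional x-vectors, and `l2_normalize` and `cosine_similarity` are hypothetical helper names.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


# Made-up toy embeddings (real x-vectors from this model have 512 dimensions)
speaker_a = [0.2, -0.1, 0.4, 0.3]
speaker_a_again = [0.4, -0.2, 0.8, 0.6]  # same direction, different scale
speaker_b = [-0.3, 0.5, 0.1, -0.2]

same = cosine_similarity(speaker_a, speaker_a_again)
diff = cosine_similarity(speaker_a, speaker_b)
print(f"same voice: {same:.3f}, different voice: {diff:.3f}")
```

With tensors, the same comparison could be made with `torch.nn.functional.cosine_similarity` on two `(1, 512)` embeddings.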



#### **3. Load the vocoder (spectrogram-to-waveform model)**

The vocoder converts a spectrogram into an audio waveform. It works on 80-bin mel spectrograms.

In [None]:
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
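The mel scale spaces frequency bands roughly the way pitch is perceived: narrow at low frequencies and wide at high ones. As a rough sketch of what "80-bin" means, the code below places 80 band centers evenly on the mel scale using the common HTK-style formula m = 2595 · log10(1 + f / 700). The 0 to 8000 Hz range (the Nyquist limit for 16 kHz audio) is an assumption chosen for illustration; the exact filterbank settings used by SpeechT5 may differ.

```python
import math


def hz_to_mel(f):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_mels=80, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_mels triangular filters need n_mels + 2 edge points; the centers are the interior ones
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]


centers = mel_band_centers()  # f_min and f_max are assumed values, see above
low_gap = centers[1] - centers[0]
high_gap = centers[-1] - centers[-2]
print(f"{len(centers)} bands; spacing grows from {low_gap:.1f} Hz to {high_gap:.1f} Hz")
```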

### Example

In [None]:
# Import the packages

import torchaudio
import torchaudio.transforms as T
from IPython.display import Audio

In [135]:
# In this example, a reference WAV file is loaded; its x-vector conditions the TTS model on the speaker's voice

sound_path = "01-00.04.75_00.07.46.wav"

# Load the .wav file: signal has shape (channels, samples), fs is the sample rate in Hz
signal, fs = torchaudio.load(sound_path)

print(f"Shape of signal: {signal.shape}, sample rate: {fs} Hz.")

# Downmix multi-channel audio to mono by averaging the channels
if signal.size(0) > 1:
    print("Converting the signal to process as mono channel waveform")
    signal = signal.mean(dim=0, keepdim=True)

# The speaker embedding model expects 16 kHz audio
if fs != 16000:
    print(f"Resampling from {fs}Hz to 16000Hz")
    resampler = T.Resample(orig_freq=fs, new_freq=16000)
    signal = resampler(signal)
    fs = 16000  # Update fs to the new sample rate

# Extract the x-vector using the classifier (signal is mono at 16 kHz by this point)
embedding = classifier.encode_batch(signal)

# To get a NumPy vector
# xvector = embedding.squeeze().detach().cpu().numpy()
print(f"Embedding shape: {embedding.shape}")

Shape of signal: torch.Size([2, 119511]), sample rate: 44100 Hz.
Converting the signal to process as mono channel waveform
Resampling from 44100Hz to 16000Hz
Embedding shape: torch.Size([1, 1, 512])
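The numbers in this output can be checked by hand. Resampling keeps the duration and scales the sample count by the ratio of the two sample rates. Below is a small sketch of that arithmetic; rounding up with a ceiling is an assumption about how the resampler picks the output length, so for other inputs the real length may differ by a sample.

```python
def resampled_length(num_samples, orig_freq, new_freq):
    """Expected sample count after resampling, rounded up (integer ceiling division)."""
    return -(-num_samples * new_freq // orig_freq)


# Values taken from the printed output: 119511 samples per channel at 44100 Hz
num_samples, orig_freq, new_freq = 119511, 44100, 16000

duration_s = num_samples / orig_freq
new_len = resampled_length(num_samples, orig_freq, new_freq)

print(f"duration: {duration_s:.2f} s")              # duration: 2.71 s
print(f"samples at {new_freq} Hz: {new_len}")       # samples at 16000 Hz: 43360
```

The 2.71 s duration appears to match the time span encoded in the file name (00.04.75 to 00.07.46).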


In [133]:
signal

tensor([[-0.0032, -0.0051, -0.0047,  ...,  0.0053,  0.0047,  0.0045]])

In [140]:
# Text to synthesize
inputs = processor(text="The aroma of fresh coffee filled the room, making it the perfect start to the day. He paused at the edge of the lake, staring at the still water, reflecting the clear blue sky above.", return_tensors="pt")
# Run the model: it predicts a mel spectrogram conditioned on the x-vector, and the vocoder turns it into a waveform
speech = model.generate_speech(inputs["input_ids"], embedding.squeeze(0), vocoder=vocoder)

In [142]:
Audio(speech, rate=fs) # play the result; fs is 16000 here, which matches SpeechT5's 16 kHz output

In [143]:
torchaudio.save("output.wav", speech.unsqueeze(0), fs)
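`torchaudio.save` is given a floating-point tensor here. If a 16-bit PCM file is needed instead (some tools expect one), one option is to scale the samples to integers before writing. The following is a minimal sketch with the standard library, assuming the samples are floats in [-1, 1]. `float_to_int16` and `write_pcm16` are hypothetical helpers, and the short list of samples stands in for something like `speech.tolist()`.

```python
import os
import struct
import tempfile
import wave


def float_to_int16(x):
    """Clip a float sample to [-1, 1] and scale it to the 16-bit integer range."""
    x = max(-1.0, min(1.0, x))
    return int(round(x * 32767))


def write_pcm16(path, samples, sample_rate=16000):
    """Write mono float samples as a 16-bit PCM WAV file."""
    frames = struct.pack("<%dh" % len(samples), *(float_to_int16(s) for s in samples))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(frames)


# Stand-in samples; the last two fall outside [-1, 1] and get clipped
samples = [0.0, 0.5, -0.5, 1.2, -1.2]
out_path = os.path.join(tempfile.mkdtemp(), "output_pcm16.wav")
write_pcm16(out_path, samples)

# Read the file back to confirm what was stored
with wave.open(out_path, "rb") as wf:
    stored_rate = wf.getframerate()
    n = wf.getnframes()
    stored = struct.unpack("<%dh" % n, wf.readframes(n))
print(stored_rate, stored)
```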