<a href="https://colab.research.google.com/github/zachary-shah/Real-Time-Voice-Cloning/blob/ee269-voice-cloning/EE269_HW2_Voice_Cloning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EE269 HW2: Voice Cloning with neural models

In this homework question, we will explore voice cloning using state-of-the-art neural modeling approaches. This assignment is based on a voice cloning framework developed by Corentin Jemine and described [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning). Roughly, one neural network generates a speaker embedding from audio, and a neural Text-to-Speech (TTS) system conditions on this speaker embedding to synthesize audio of a given text input in that speaker's voice.

The goals for this assignment are:
* Introduce the Voice Cloning system and interface with the pre-trained models
* Visualize mel spectrograms and MFCCs from real vs. synthesized audio examples to compare them
* Explore how noise filtering affects the quality of model outputs
* Clone your own voice :)

**Note:** You will need to make a copy of this Colab notebook in your Google Drive before you can edit it.

# Setup

First, make sure you have a GPU enabled in Colab by selecting Runtime -> Change Runtime Type -> T4 GPU.

We made a branch of the original framework for this homework assignment. The two blocks below set this up and install all required packages. All necessary files for this assignment should be contained within the `Real-Time-Voice-Cloning` folder once the GitHub branch has been cloned.

*Note*: When installing the required packages, you may get an error that ```pip's dependency resolver does not currently take into account all the packages that are installed.``` This is most likely fine, as long as the code cells in subsequent sections are able to run. Otherwise, try installing the packages listed in ```requirements.txt``` one at a time.

In [None]:
!git clone --branch ee269-voice-cloning https://github.com/zachary-shah/Real-Time-Voice-Cloning.git
%cd Real-Time-Voice-Cloning
%ls

In [None]:
!apt install -q libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
!pip install -q -r requirements.txt

**For convenience, here is a helper function for plotting mel spectrograms:**

In [None]:
# packages for problem
import IPython.display as ipd
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf
import os

# helper function to plot a mel spectrogram
# arguments: (wave array, sampling rate, number of mel bins, max frequency of mel scale)
def plot_melspectrogram(wav, sr, n_mels=256, fmax=4096, title=None,
                        fig=None, ax=None, show_legend=True):

    if ax is None:
        fig, ax = plt.subplots(1,1,figsize=(20,5))
    elif fig is None:
        fig = ax.figure  # recover the parent figure so the colorbar call below works
    M = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels, fmax=fmax, n_fft=2048)
    M_db = librosa.power_to_db(M, ref=np.max)
    img = librosa.display.specshow(M_db, y_axis='mel', x_axis='time', ax=ax, fmax=fmax)
    if show_legend:
        ax.set(title='Mel spectrogram display')
        fig.colorbar(img, ax=ax, format="%+2.f dB")
    if title is not None:
        ax.set(title=title)

# seed for repeatability
seed = 269269269269

# Part 1: Learn how to voice clone. (5 pts)

We will first experiment with an example voice from the dataset used in the development of the pretrained vocoder and synthesizer models.


### Loading Data (1 point)

First, load the audio file at ```audio_path``` below and listen to it (you may find ```IPython.display.Audio``` useful for this). Use the helper function above to plot the mel spectrogram.
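
The helper's `n_mels` and `fmax` arguments control how the frequency axis is divided into mel bands. As a rough illustration only, the sketch below places band edges using the HTK-style formula mel = 2595 log10(1 + f/700). This is an assumption made for simplicity: librosa defaults to the Slaney variant of the mel scale, so the exact edges it uses will differ. The function names are hypothetical and not part of the assignment code.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style Hz -> mel conversion (assumption: not librosa's default Slaney scale)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(n_mels=256, fmax=4096.0, fmin=0.0):
    """Edges of n_mels overlapping bands: n_mels + 2 points evenly spaced in mel."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    return mel_to_hz(mels)

# A small example with 8 bands so the output is easy to read
edges = mel_band_edges(n_mels=8, fmax=4096.0)
print(np.round(edges, 1))           # band edges in Hz
print(np.round(np.diff(edges), 1))  # spacing between edges grows with frequency
```

The widening spacing is why a mel spectrogram devotes more resolution to low frequencies, where much of the speech energy lies.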


In [None]:
## TASK 1: Load the audio file and plot its mel spectrogram using the helper function provided

# example audio file
audio_path = "data/dataset_example.wav"

#########################
###  YOUR CODE HERE   ###
#########################


### Cloning a Voice (4 points)

Now, we will use the voice cloning models to transfer the style of the above voice into new text (also known as an utterance).

For simplicity, we have abstracted inference of the voice cloning models into a ```VoiceCloner()``` object, which will load the pre-trained encoder, synthesizer, and vocoder models upon initialization. It has two functions useful for generating cloned audio:

```VoiceCloner.embed_style()```: Generates the voice style embeddings
> Parameters:
>> ```in_fpath```: Path to audio file containing an utterance
>
>  Returns:
>> ```emb```: embedding vector for utterance (numpy.ndarray)

```VoiceCloner.voice_clone()```: Generates an utterance cloned in a style described by embeddings
> Parameters:
>> ```text```: Text for utterance to transfer style onto (str)
>
>> ```emb```: Voice embedding vector (numpy.ndarray)
>
>> ```out_fpath```: filepath to write cloned utterance to (str)
>
> Returns:
>> ```generated_wav```: array representation of cloned utterance (numpy.ndarray)


With this information, clone the voice in the file ```data/dataset_example.wav``` which was plotted above.

After cloning, listen to the cloned utterance and plot its mel spectrogram. In words, compare the quality of the clone to the audio and spectrogram of the true voice.

Feel free to try different prompts, but for the final submission, please use the default utterance for Parts 1 and 2: "Hello. I am a voice clone. Nice to meet you!"
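
The two methods described above chain together as embed first, then synthesize. The sketch below shows that flow under the assumption that both methods accept the documented parameters as keyword arguments. `_StubCloner` and the output path `data/clone_example.wav` are hypothetical stand-ins so the snippet runs without the pre-trained models; in the notebook you would pass the real `cloner` object instead.

```python
import numpy as np

def clone_from_file(cloner, in_fpath, text, out_fpath):
    """Embed the reference audio, then synthesize `text` in that voice."""
    emb = cloner.embed_style(in_fpath=in_fpath)
    return cloner.voice_clone(text=text, emb=emb, out_fpath=out_fpath)

class _StubCloner:
    """Hypothetical stand-in that mimics the documented interface."""
    def embed_style(self, in_fpath):
        return np.zeros(256)    # placeholder embedding; the size is arbitrary here
    def voice_clone(self, text, emb, out_fpath):
        return np.zeros(16000)  # placeholder waveform; nothing is written to disk

wav = clone_from_file(
    _StubCloner(),
    in_fpath="data/dataset_example.wav",
    text="Hello. I am a voice clone. Nice to meet you!",
    out_fpath="data/clone_example.wav",  # hypothetical output path
)
print(type(wav), wav.shape)
```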

In [None]:
# Set up the voice cloner
from voicecloner.sampler import VoiceCloner
cloner = VoiceCloner(seed=seed)

In [None]:
# TASK: Clone the dataset example into the below utterance using the cloner object.

text = "Hello. I am a voice clone. Nice to meet you!"

#########################
###  YOUR CODE HERE   ###
#########################



In [None]:
# TASK: plot cloned spectrogram and listen to audio

#########################
###  YOUR CODE HERE   ###
#########################



# Part 2: Cloning with Noisy Signals (10 pts)

In the previous section, the audio file we generated the style embedding from was relatively clean.

In this section, we will examine what happens when we try to clone from a noisy recording. During training, the model saw mostly clean audio as it learned the style embeddings, so it is interesting to examine what the model does at inference time on a noisy signal.
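
One optional way to quantify how much a change in the input shifts the style embedding is the cosine similarity between two embedding vectors. This is not required by the assignment; it is a minimal sketch in which random vectors stand in for real embeddings, and the 256-element length is an arbitrary choice for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb_clean = rng.random(256)                             # stand-in for a clean-audio embedding
emb_noisy = emb_clean + 0.3 * rng.standard_normal(256)  # stand-in for a perturbed embedding

print(round(cosine_similarity(emb_clean, emb_clean), 3))
print(round(cosine_similarity(emb_clean, emb_noisy), 3))
```

With real data, the two inputs would be the arrays returned by `embed_style()` for the clean and noisy recordings.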

We will test this out with a recording of Dr. Pilanci from a previous lecture. Unfortunately, this recording was corrupted by Additive White Gaussian Noise. Can we still clone his voice?
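
For intuition about this kind of corruption, the sketch below adds white Gaussian noise to a synthetic tone at a chosen signal-to-noise ratio (SNR) and then measures the resulting SNR. The tone, the 10 dB target, and the function names are illustrative assumptions; the noise level of the actual lecture clip is not stated in this assignment.

```python
import numpy as np

def add_awgn(clean, target_snr_db, rng=None):
    """Add zero-mean white Gaussian noise so that the result has the target SNR."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (target_snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise

def measure_snr_db(clean, noisy):
    """SNR in dB, given the clean reference and its noisy version."""
    noise = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

sr = 16000                                   # one second of audio at 16 kHz
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # synthetic 220 Hz tone
noisy = add_awgn(clean, target_snr_db=10.0, rng=np.random.default_rng(0))
print(round(measure_snr_db(clean, noisy), 2))
```

Because white noise has equal expected power at all frequencies, it tends to show up in a spectrogram as a raised noise floor across the whole frequency axis.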

TASK: Following the procedure in Part 1, read in the audio file ```data/dr_pilanci_lecture_clip.wav```, listen to the audio and plot the mel spectrogram.

Write out any visual differences you see between the spectrogram for a noisy audio clip in comparison to the mel spectrograms in the training distribution.

In [None]:
# TASK: Listen to and plot the mel spectrogram of Dr. Pilanci's noisy lecture clip.

noisy_lecture_path = "data/dr_pilanci_lecture_clip.wav"

#################
### CODE HERE ###
#################



> TASK: Write Observations here

## Voice Cloning on raw noisy signal (2 points)

Following the process of part 1, clone Dr. Pilanci's voice saying the text "Hello. I am a voice clone. Nice to meet you!"

Listen to the cloned audio and plot its spectrogram. How does the cloning quality compare to Part 1?

*Note: You should not need to create a new VoiceCloner() object.*

In [None]:
# TASK: Clone utterance with noisy lecture audio, and plot mel spectrogram.

#################
### CODE HERE ###
#################



> TASK: Describe quality here.

## Denoising (5 points)

Now, we would like to see if we can improve the model output quality by denoising the input signal.

To reduce the noise, we will use a method known as "spectral gating". The algorithm computes a spectrogram of the signal (and, optionally, of a noise-only clip) and estimates a noise threshold (or gate) for each frequency band. These thresholds are then used to build a mask that attenuates the time-frequency bins falling below the threshold for their band. This is implemented in the Python package [noisereduce](https://pypi.org/project/noisereduce/), which was already installed during setup.
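
To make the idea concrete, here is a minimal NumPy-only sketch of stationary spectral gating. It is a simplified illustration, not the `noisereduce` implementation: it assumes a separate noise-only clip is available, uses a hard binary mask (whereas `noisereduce` smooths its mask over time and frequency), and the function names and default values are hypothetical.

```python
import numpy as np

def _frame_fft(x, window, hop):
    """Windowed short-time FFT; returns shape (n_frames, n_fft // 2 + 1)."""
    n_fft = len(window)
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.fft.rfft(x[idx] * window, axis=1)

def spectral_gate(noisy, noise_clip, n_fft=512, hop=128, n_std=1.5):
    """Zero out time-frequency bins that fall below a per-frequency noise gate."""
    window = np.hanning(n_fft)
    to_db = lambda spec: 20.0 * np.log10(np.abs(spec) + 1e-10)

    # 1. Per-frequency threshold: mean + n_std * std of the noise level in dB
    noise_db = to_db(_frame_fft(noise_clip, window, hop))
    thresh_db = noise_db.mean(axis=0) + n_std * noise_db.std(axis=0)

    # 2. Spectrogram of the noisy signal (zero-padded so the edges are covered)
    padded = np.pad(noisy, n_fft)
    spec = _frame_fft(padded, window, hop)

    # 3. Binary mask: keep only bins louder than the gate for their frequency
    mask = to_db(spec) > thresh_db

    # 4. Back to the time domain with normalized overlap-add
    frames = np.fft.irfft(spec * mask, n=n_fft, axis=1)
    out = np.zeros(len(padded))
    norm = np.zeros(len(padded))
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window
    out /= np.maximum(norm, 1e-8)
    return out[n_fft:n_fft + len(noisy)]

# Demo on a synthetic tone corrupted by white Gaussian noise
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.1 * rng.standard_normal(sr)
noise_clip = 0.1 * rng.standard_normal(sr // 2)  # noise-only reference
denoised = spectral_gate(noisy, noise_clip)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print(mse_noisy, mse_denoised)
```

Raising `n_std` makes the gate more aggressive: more noise is removed, but quiet parts of the speech are more likely to be removed with it.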

In the code below, denoise the clip of Dr. Pilanci's lecture using spectral gating. This can be done with ```noisereduce.reduce_noise()```. You may need to play with some of the noise thresholding parameters (such as ```prop_decrease```, ```time_constant_s```, ```n_std_thresh_stationary```, ```sigmoid_slope_nonstationary```) to minimize signal loss and maintain a well-balanced signal-to-noise ratio in the denoised audio.

Plot the spectrogram of the original vs. denoised audio. The goal is to get the denoised spectrogram to look like a sample from the same distribution of audio that the model was trained on (we saw an example of this in Part 1).

Make sure to listen to the denoised audio as well.

In [None]:
# TASK: Filter the noisy lecture clip so that it looks more similar to the example in the training dataset.
# Plot the mel spectrogram for the noisy and denoised clips side by side.

import noisereduce

#################
### CODE HERE ###
#################


## Voice Cloning on cleaned signal (3 points)

With the signal you just cleaned, generate a voice clone of the same text: "Hello. I am a voice clone. Nice to meet you!"

Listen to the cloned audio and plot its spectrogram. How does the cloning quality compare to that of the clone from the noisy audio?

*Note: If the clone sounds worse than before, you may need to adjust the denoising parameters.*

In [None]:
# TASK: Clone utterance with the cleaned audio clip.
# Plot the mel spectrogram of the clone.

#################
### CODE HERE ###
#################


> TASK: Describe quality of clone

# Part 3: Clone **your** voice (5 points)

Now that you know how to voice clone, why not try it on yourself?

### Record your voice.

Below we have provided a function that will record audio from your computer microphone directly through Colab. Make sure to enable microphone access for your browser before running.

Record a 5-second clip speaking any text you would like.

In [None]:
from voicecloner.interface import record

audio_path = record(sec=5, out_fpath="data/my_voice.wav")
ipd.Audio(audio_path)

## Clone your voice (2 points)

As we have done before, clone your own voice. This time, use the voice cloner to generate the exact same text you recorded.

If the recording quality is low, try enhancing the audio SNR using ```noisereduce``` or any other method you think might help.

In [None]:
# TASK: Clone your voice saying the text you recorded in the block above.

###################
#### CODE HERE ####
###################



### Plot Spectrograms (1 point)

Plot spectrograms for your original speech and synthesized speech next to one another. Describe the differences you notice when listening to the audio, and how you think those differences show up in the spectrogram plots.

In [None]:
###################
#### CODE HERE ####
###################


> TASK: Write observations here

## Plot MFCCs (2 points)

Thus far, we have represented audio by its mel spectrogram. We can also represent an audio signal through its MFCCs (Mel-Frequency Cepstral Coefficients), which compactly describe the short-term power spectrum of a sound on a frequency scale that approximates the human ear's perception. Because they capture characteristics of an individual's voice, MFCCs are often used to differentiate and classify speakers.
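
To show how MFCCs relate to the mel spectrogram, the sketch below takes a log-mel spectrogram and applies an orthonormal DCT-II along the mel axis, keeping the first 20 coefficients. To the best of my knowledge this mirrors the default final step of `librosa.feature.mfcc`, but treat that as an assumption and check the linked documentation. The input here is random data standing in for a real log-mel spectrogram, and the function names are hypothetical.

```python
import numpy as np

def dct2_ortho(x):
    """Orthonormal DCT-II along axis 0 of an (n_bands, n_frames) array."""
    n = x.shape[0]
    k = np.arange(n)[:, None]   # output coefficient index
    m = np.arange(n)[None, :]   # input mel-band index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    basis[0, :] /= np.sqrt(2.0) # extra scaling that makes the basis orthonormal
    return basis @ x

def mfcc_from_log_mel(log_mel, n_mfcc=20):
    """Keep the first n_mfcc DCT coefficients of each frame."""
    return dct2_ortho(log_mel)[:n_mfcc]

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 100))  # stand-in: 40 mel bands x 100 frames, in dB
mfcc = mfcc_from_log_mel(log_mel, n_mfcc=20)
mean_mfcc = mfcc.mean(axis=1)             # average each coefficient over all frames
print(mfcc.shape, mean_mfcc.shape)
```

The low-order coefficients summarize the overall spectral envelope, which is why a small number of them is often enough to characterize a voice.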

Plot the first 20 MFCC coefficients for both your real and cloned utterances side by side. You can use [librosa.feature.mfcc](https://librosa.org/doc/main/generated/librosa.feature.mfcc.html) for this, which will compute MFCC coefficients across different frames.

For each instance, plot both the average MFCC coefficients (compute this average over all frames), as well as the log-normalized MFCC coefficients plotted for all frames as an image (an example is shown [here](https://haythamfayek.com/assets/posts/post1/mfcc_raw.jpg)).

In [None]:
# TASK: Compute MFCCs of real and cloned voice and plot MFCC matrix as an image

###################
#### CODE HERE ####
###################


In [None]:
# TASK: Plot the average MFCCs across the frame dimension for real and cloned voices

###################
#### CODE HERE ####
###################
