<a href="https://colab.research.google.com/github/tomer9080/DL-Speech-exercises/blob/main/ex4_part3_046747.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 3 - Using x-vectors as embeddings for classification tasks

In this section, you will learn how to use speech embedding models to perform speech classification tasks.

You will be working in a Google Colab environment.

A few important notes:

* Make sure to copy this notebook to your own drive so your progress is saved.
* Change the runtime type to GPU to achieve faster inference with our embedding model. (To do this, click on 'Runtime' → 'Change runtime type' in the toolbar.)
* You can add any necessary imports (e.g., when selecting classifiers) across different code cells.

## Imports

In [None]:
!pip install speechbrain

In [None]:
import os
import torch
import random
import librosa
import torchaudio
import numpy as np
import matplotlib.pyplot as plt

from sklearn.manifold import TSNE
from sklearn.model_selection import KFold
from speechbrain.inference.classifiers import EncoderClassifier

device = "cuda" if torch.cuda.is_available() else "cpu"

## 3.1 - SPEECHCOMMANDS Dataset

The SPEECHCOMMANDS dataset consists of approximately 2,000 speakers and around 100,000 utterances.

The code below downloads the dataset (if it is not already present) and loads its training split.

In [None]:
# Dataset loading - takes approximately 2-3 minutes, be patient.
dataset_train = torchaudio.datasets.SPEECHCOMMANDS(
    root="./",
    download=True,
    subset="training"
)

The code below loads the pretrained embedding model (the `speechbrain/spkrec-ecapa-voxceleb` checkpoint) onto the device and sets it to evaluation mode.

In [None]:
# Given code
xvector_model = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="./")
xvector_model.device = device
xvector_model.to(device)
xvector_model.eval()
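
As a rough sketch of how the loaded model can be used, the helper below turns a single waveform into a fixed-size embedding vector. It assumes the waveform is a mono tensor of shape `[1, time]` sampled at 16 kHz and that the model exposes SpeechBrain's `encode_batch` method; `extract_embedding` is a hypothetical helper name, not part of the given code.

```python
import torch


def extract_embedding(model, waveform, device="cpu"):
    """Return a 1-D NumPy embedding for one waveform of shape [1, time].

    Assumption: `model.encode_batch(wavs)` returns a tensor of shape
    [batch, 1, embedding_dim], as SpeechBrain's EncoderClassifier does.
    """
    with torch.no_grad():  # inference only, so no gradients are needed
        emb = model.encode_batch(waveform.to(device))
    # Drop the batch and singleton dimensions -> [embedding_dim]
    return emb.squeeze(0).squeeze(0).cpu().numpy()
```

For instance, `extract_embedding(xvector_model, dataset_train[0][0], device)` would embed the first training utterance, since the first element of each dataset item is its waveform.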

## 3.2 - Speaker Classification

Each sample in the SPEECHCOMMANDS file tree carries the ID of its speaker in its file name.

The code below collects all speaker IDs in the dataset into the `spkr_ids` list.

In [None]:
# Given code
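# Skip `_background_noise_`: its clips are noise recordings, not speaker utterances
wav_paths = [os.path.join(root, file) for root, dirs, files in os.walk('./SpeechCommands/speech_commands_v0.02') for file in files if file.endswith('.wav') and '_background_noise_' not in root]
# The speaker ID is the file-name prefix before the first underscore
spkr_ids = list({os.path.basename(wav_path).split('_')[0] for wav_path in wav_paths})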
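
To make the naming convention concrete, here is a small sketch that splits a path into its command, speaker ID, and utterance number. It assumes the layout `<root>/<command>/<speaker_id>_nohash_<n>.wav`; `parse_wav_path` is a hypothetical helper and the example path is illustrative only. Clips in the `_background_noise_` folder do not follow this pattern and would need to be skipped.

```python
import os


def parse_wav_path(wav_path):
    """Split a SPEECHCOMMANDS path into (command, speaker_id, utterance_number).

    Assumes the layout <root>/<command>/<speaker_id>_nohash_<n>.wav.
    """
    # The command is the name of the folder that contains the file
    command = os.path.basename(os.path.dirname(wav_path))
    # Strip the directory and the .wav extension
    stem = os.path.splitext(os.path.basename(wav_path))[0]
    speaker_id, _, utterance_number = stem.split("_")
    return command, speaker_id, int(utterance_number)


# Illustrative path, not necessarily a file that exists in the dataset
example = "./SpeechCommands/speech_commands_v0.02/go/0a7c2a8d_nohash_0.wav"
print(parse_wav_path(example))  # ('go', '0a7c2a8d', 0)
```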

### 3.2.1 - t-SNE Projection

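Like PCA, t-SNE reduces high-dimensional data to a few dimensions, but it is nonlinear and aims to preserve local neighborhoods rather than global variance. We will use it to visualize how our embedding model maps the speech files into a space relevant to our task.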

You can read more about t-SNE here:
https://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne

And about sklearn's TSNE implementation here:
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

There is no need to dive deeply into the implementation or technical details of t-SNE, but feel free to explore further and broaden your horizons. :)

In this section, you will gain an understanding of how x-vector embeddings map speech data into a latent space.

Follow these instructions:

1. Select 10 speakers from the `spkr_ids` list. Choose speakers with more than 100 utterances, and ensure all selected speakers have a similar number of samples.
1. For the chosen speakers, create a list of all wav files associated with them (across all different classes).
1. Ensure your list contains at least 1,000 wav files, with an even distribution of samples across speakers.
1. For each wav file in the list, extract its embedding and the corresponding spkr_id label.
1. Generate a scatter plot showing the t-SNE projection of the extracted embeddings. Assign a unique color to each speaker's data points on the scatter plot.
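
Steps 1 to 3 above could be sketched roughly as follows. This is one possible approach, not the required one: it groups files by the speaker ID prefix, keeps speakers with more than `min_utts` files, and draws the same number of files from each so the classes are balanced. The function name, its parameters, and the fixed seed are all assumptions made for illustration.

```python
import os
import random
from collections import defaultdict


def select_balanced_speakers(wav_paths, n_speakers=10, min_utts=100,
                             per_speaker=100, seed=0):
    """Pick speakers with more than `min_utts` files and sample
    `per_speaker` files from each of them."""
    by_speaker = defaultdict(list)
    for path in wav_paths:
        # Speaker ID is the file-name prefix before the first underscore
        by_speaker[os.path.basename(path).split("_")[0]].append(path)

    # Keep only speakers with enough utterances; sort for reproducibility
    eligible = sorted(s for s, p in by_speaker.items() if len(p) > min_utts)
    rng = random.Random(seed)
    chosen = rng.sample(eligible, n_speakers)

    # The same number of files per speaker gives an even distribution
    return {s: rng.sample(sorted(by_speaker[s]), per_speaker) for s in chosen}
```

With the defaults this yields 10 speakers with 100 files each, i.e. 1,000 files in total. `random.sample` raises a `ValueError` if fewer than `n_speakers` speakers qualify.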

**Hint**: Use two lists while generating the embeddings: one for the embeddings and another for the speaker IDs.

**Notes**:
* Use a 2-dimensional t-SNE projection.
* Include an appropriate title and labels for the axes in the plot.
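
A minimal sketch of the projection and plotting step is shown below. It uses random vectors as stand-ins for real embeddings, so the resulting plot carries no meaning; `project_tsne` and `plot_projection` are hypothetical helper names, and the perplexity value is an arbitrary choice that must stay below the number of samples.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def project_tsne(embeddings, perplexity=30.0, seed=0):
    """Project [n_samples, dim] embeddings to 2-D with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(np.asarray(embeddings))


def plot_projection(points, labels, title):
    """Scatter-plot 2-D points, drawing each distinct label separately
    so that it gets its own color and legend entry."""
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(8, 6))
    for label in np.unique(labels):
        mask = labels == label
        ax.scatter(points[mask, 0], points[mask, 1], s=10, label=str(label))
    ax.set_title(title)
    ax.set_xlabel("t-SNE dimension 1")
    ax.set_ylabel("t-SNE dimension 2")
    ax.legend(title="Label", fontsize="small")
    return fig, ax


# Stand-in data: random vectors in place of real embeddings
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(60, 192))
fake_labels = np.repeat(["spk_a", "spk_b", "spk_c"], 20)
points = project_tsne(fake_embeddings, perplexity=10.0)
```

With real data, the two lists from the hint (embeddings and speaker IDs) would take the place of `fake_embeddings` and `fake_labels`, followed by a call such as `plot_projection(points, fake_labels, "t-SNE of speaker embeddings")`. Note that Matplotlib's default color cycle has 10 colors, which is enough for 10 speakers but repeats beyond that.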

### 3.2.2 - Classify Speakers

In this section, you will classify different speakers using embeddings from our embedding model.

1. Collect wav files of **20** different speakers. (Follow the same guidelines as in step 1 of Section 3.2.1.)
1. Choose three different classifiers from the `sklearn` library (e.g., SVM, Decision Tree, Random Forest, LDA, etc.).
1. Add a table of the results, using 5-fold cross-validation (CV) as specified in the exercise PDF. Discuss the results.

**Note**: Present the mean and standard deviation of the classification results for each classifier.
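
The evaluation loop might look something like the sketch below. It runs on synthetic blobs from `make_blobs` rather than real embeddings, the three classifiers are just one possible choice, and the exercise PDF may prescribe a different CV protocol, so treat every setting here as an assumption.

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Stand-in data: replace X, y with the real embeddings and speaker labels
X, y = make_blobs(n_samples=200, centers=4, n_features=16, random_state=0)

classifiers = {
    "SVM (RBF)": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
}

# Shuffle before splitting so that each fold mixes all classes
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, clf in classifiers.items():
    # One accuracy value per fold
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Reusing the same `cv` object for every classifier means all three are scored on identical folds, which keeps the comparison fair.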


## 3.3 - Command Classification

The SPEECHCOMMANDS dataset was originally created to train speech models for classifying different commands.

In this section, we will evaluate our embedding model on a command classification task.

### 3.3.1 - t-SNE Projection



Repeat the process from Section 3.2.1, but this time use the commands as labels in the t-SNE projection.

1. Select only **three** commands from the following set: {left, happy, marvin, go, zero, right}.
1. Randomly select 1,000 samples for each chosen command.
1. This should result in a dataset of 3,000 samples, divided into three classes (the three commands you chose), with 1,000 samples per command.
1. Generate a scatter plot of the t-SNE projection and discuss the results in your report.
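
The sampling in steps 1 to 3 could be sketched as below. The command is taken from the name of the folder that contains each file, which assumes the `<root>/<command>/<file>.wav` layout; `sample_per_command`, its parameters, and the fixed seed are hypothetical choices for illustration.

```python
import os
import random
from collections import defaultdict


def sample_per_command(wav_paths, commands, n_per_command=1000, seed=0):
    """Randomly draw `n_per_command` files for each requested command.

    Returns two parallel lists: the sampled paths and their command labels.
    """
    by_command = defaultdict(list)
    for path in wav_paths:
        # The parent folder name is the spoken command
        command = os.path.basename(os.path.dirname(path))
        if command in commands:
            by_command[command].append(path)

    rng = random.Random(seed)
    paths, labels = [], []
    for command in commands:
        # Sort first so the draw is reproducible for a given seed
        picked = rng.sample(sorted(by_command[command]), n_per_command)
        paths.extend(picked)
        labels.extend([command] * n_per_command)
    return paths, labels
```

A call such as `sample_per_command(wav_paths, ["left", "go", "zero"])` would return 3,000 paths with matching labels. `random.sample` raises a `ValueError` if a command has fewer than `n_per_command` files.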

### 3.3.2 - Classify Commands

Repeat the process from Section 3.2.2, but this time use commands as labels for classification.

1. Use the embeddings from Section 3.3.1 to classify the different classes.
1. Display a table of the classification results and discuss your findings in the report.


**Note**: Present the classification results using 5-fold cross-validation (CV), reporting the mean and standard deviation for each classifier.
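
One way to lay out the results table is sketched below. The helper takes a dictionary of per-fold accuracies and returns a Markdown table string; `format_results_table` is a hypothetical name, and using the population standard deviation is an assumption made to match NumPy's default.

```python
import statistics


def format_results_table(fold_scores):
    """Build a Markdown table of mean and std accuracy per classifier.

    `fold_scores` maps a classifier name to its list of per-fold accuracies.
    """
    lines = ["| Classifier | Mean accuracy | Std |", "|---|---|---|"]
    for name, scores in fold_scores.items():
        mean = statistics.mean(scores)
        std = statistics.pstdev(scores)  # population std, like NumPy's default
        lines.append(f"| {name} | {mean:.3f} | {std:.3f} |")
    return "\n".join(lines)
```

Passing a dictionary that maps each classifier name to its five fold accuracies produces one table row per classifier, which can be pasted into a Markdown cell or the report.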

### Good Luck!