# Conversational AI: Data Science (UCS663)
**Kaggle Based Lab Evaluation - 1** <br>
Suvrat Arora <br>
101903331 <br>
3CO13 <br>

# Speaker Recognition 

In the Speaker Recognition problem, we are provided with a dataset comprising speeches of prominent personalities in the form of .wav files. A background noise audio file is also provided, which can be mixed into the speech audio to improve results. <br>
Here, our objective is to train a model that predicts the speaker of a given speech clip as accurately as possible; thus, it is an **Audio Classification** problem.
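
Mixing the background noise into a speech clip is not done later in this notebook, but the idea can be illustrated with a minimal sketch. The helper names `mean_power` and `mix_at_snr`, the synthetic sine "speech" and the random "noise" are all assumptions made for illustration; in practice the clips would be NumPy arrays loaded from the wav files.

```python
import itertools
import math
import random


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mix has the requested signal-to-noise ratio (dB)."""
    if not noise:
        raise ValueError("noise clip is empty")
    # Repeat (or cut) the noise clip so it is as long as the speech clip.
    noise = list(itertools.islice(itertools.cycle(noise), len(speech)))
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise clip is silent")
    # SNR_dB = 10 * log10(P_speech / P_scaled_noise), solved for the noise gain.
    gain = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins (hypothetical data, not taken from the dataset).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
print(len(mixed))
```

A lower `snr_db` makes the noise louder relative to the speech, so a range of values could be used to produce augmented copies of the same clip.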

We will deal with the problem statement in the following stepwise manner: <br>

**- Data Exploration and Visualization:** As with any Machine Learning/Data Science problem, we begin by understanding the data. Since the data is given as wav audio files organised in directories, we'll traverse the directories and plot a few audio signals to understand the nature of the data provided.

**- Data Pre-Processing:** After exploring the data, we'll pre-process the audio files. One of the most common ways to deal with audio data is to convert it into a visual representation such as a spectrogram, thereby converting the audio processing problem into an image processing problem. We'll also store the wav file names from the directories in local variables and then divide the data into training and testing sets so that they can be used for model training and evaluation.<br>

**- Model Building:** After pre-processing the data, the next step is to build a model. Since we are converting the audio files into graphical representations, we'll create a Convolutional Neural Network for this purpose. Thus, we will feed the model with the audio clips, convert them into Mel spectrograms, apply the convolution operation to these spectrograms, and perform the speaker recognition task.<br>

**- Model Evaluation:** Once the model is built on the training data set with an acceptable level of training accuracy, the model will be evaluated on the testing data, and the final accuracy will be determined. <br>

Subsequently, in this notebook, we'll attempt to perform the above-mentioned steps and get the best possible results.
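
The train/test division mentioned in the pre-processing step can be made class-balanced, so that every speaker contributes the same share of clips to both sets. The sketch below shows the idea with the standard library only; `stratified_split` and the file names are hypothetical. scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, test_size=0.2, seed=0):
    """Split (item, label) pairs so each label keeps the same test share."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test += [(item, label) for item in group[:n_test]]
        train += [(item, label) for item in group[n_test:]]
    # Shuffle again so the classes are interleaved.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical file names: 5 speakers (labels 0-4) with 10 clips each.
demo_labels = [label for label in range(5) for _ in range(10)]
demo_items = [f"speaker{label}_{i}.wav" for label in range(5) for i in range(10)]
demo_train, demo_test = stratified_split(demo_items, demo_labels, test_size=0.2)
print(len(demo_train), len(demo_test))
```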

In [1]:
# Installing spela for the Mel spectrogram layer
!pip install spela

In [5]:
# Importing necessary libraries 
import os
import tensorflow as tf
import matplotlib.pyplot as plt
import pathlib
import librosa  # Audio processing library for Python
import librosa.display  # Plotting helpers for spectrograms
import IPython.display as ipd
from tqdm import tqdm  # Progress bar for loading the dataset (the dataset is quite heavy, so it helps monitor progress)
import cupy  # GPU-accelerated alternative to numpy
import cudf  # RAPIDS-accelerated alternative to pandas
from cuml.metrics import confusion_matrix
from cuml.model_selection import train_test_split
import numpy as np
# Note: these scikit-learn imports override the cuml names imported above
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from spela.melspectrogram import Melspectrogram

# Data Exploration and Visualization
Now that we have imported all the necessary libraries, we'll load our audio data, list the contents of the data directory, and plot the waveform and Mel spectrogram of one audio sample in order to get to know our data better.
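
Before plotting, it can help to check how many clips each speaker folder holds. The following is a minimal sketch: `count_wavs_per_speaker` is a hypothetical helper, and the demo runs on a throwaway directory rather than the real dataset path.

```python
import os
import tempfile


def count_wavs_per_speaker(root):
    """Return {folder_name: number_of_wav_files} for each sub-folder of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if os.path.isdir(folder):
            counts[name] = sum(
                1 for f in os.listdir(folder) if f.lower().endswith(".wav")
            )
    return counts


# Demo on a temporary directory with made-up folder names.
with tempfile.TemporaryDirectory() as demo_root:
    for speaker, n_files in {"speaker_a": 3, "speaker_b": 2}.items():
        os.makedirs(os.path.join(demo_root, speaker))
        for i in range(n_files):
            open(os.path.join(demo_root, speaker, f"{i}.wav"), "wb").close()
    demo_counts = count_wavs_per_speaker(demo_root)

print(demo_counts)
```

In the notebook, the same function could be called on `data_path` to compare the class sizes.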

In [6]:
# Saving the input data path 
data_path = '../input/speaker-recognition-dataset/16000_pcm_speeches/'

In [7]:
# Checking the folder names present in the data folder
# The folder names represent the speaker names
os.listdir(data_path)

In [8]:
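# Loading the first audio file
# Note: librosa.load resamples to 22050 Hz by default; pass sr=None to keep the file's native rate
audio_path = '../input/speaker-recognition-dataset/16000_pcm_speeches/Benjamin_Netanyau/0.wav'
librosa_audio_data, librosa_sample_rate = librosa.load(audio_path)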

# Plotting the audio signal
plt.figure(figsize=(12, 4))
plt.plot(librosa_audio_data)

In [9]:
S=librosa.feature.melspectrogram(y=librosa_audio_data, sr=librosa_sample_rate)
plt.figure(figsize=(10, 4))
librosa.display.specshow(librosa.power_to_db(S, ref=np.max), y_axis='mel', fmax=8000, x_axis='time')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.tight_layout()

# Creating/Processing Training Dataset

In [10]:
# Defining a function to fetch the wav file names for a speaker
def get_wav_paths(speaker):
    speaker_path = data_path + speaker
    # os.listdir returns file names relative to the speaker folder
    return os.listdir(speaker_path)

In [11]:
# Retrieving the file names for the individual speakers
# (folder names are spelled exactly as they appear in the dataset)
nelson_mandela_paths = get_wav_paths("Nelson_Mandela")
margaret_thatcher_paths = get_wav_paths("Magaret_Tarcher")
benjamin_netanyau_paths = get_wav_paths("Benjamin_Netanyau")
jens_stoltenberg_paths = get_wav_paths("Jens_Stoltenberg")
julia_gillard_paths = get_wav_paths("Julia_Gillard")

In [12]:
# Loading one wav file as a (1, 16000) array for model training
def load_wav(wav_path, speaker):
    # A fresh graph and session are created for every file (TF1-style API)
    with tf.compat.v1.Session(graph=tf.compat.v1.Graph()) as sess:
        wav_path = data_path + speaker + "/" + wav_path
        wav_filename_placeholder = tf.compat.v1.placeholder(tf.compat.v1.string, [])
        wav_loader = tf.io.read_file(wav_filename_placeholder)
        wav_decoder = tf.audio.decode_wav(wav_loader, desired_channels=1)
        # The reshape assumes each clip holds exactly 16000 samples (1 second at 16 kHz)
        wav_data = sess.run(
            wav_decoder, feed_dict={
                wav_filename_placeholder: wav_path
            }).audio.flatten().reshape((1, 16000))
    # The with-block closes the session on exit
    return wav_data

In [13]:
# Create training data
def generate_training_data(speaker_paths, speaker, label):
    wavs, labels = [], []
    for i in tqdm(speaker_paths):
        wav = load_wav(i, speaker)
        wavs.append(wav)
        labels.append(label)
    return wavs, labels

In [14]:
nelson_mandela_wavs, nelson_mandela_labels = generate_training_data(nelson_mandela_paths, "Nelson_Mandela", 0) 
margaret_thatcher_wavs, margaret_thatcher_labels = generate_training_data(margaret_thatcher_paths, "Magaret_Tarcher", 1) 
benjamin_netanyau_wavs, benjamin_netanyau_labels = generate_training_data(benjamin_netanyau_paths, "Benjamin_Netanyau", 2) 
jens_stoltenberg_wavs, jens_stoltenberg_labels = generate_training_data(jens_stoltenberg_paths, "Jens_Stoltenberg", 3) 
julia_gillard_wavs, julia_gillard_labels = generate_training_data(julia_gillard_paths, "Julia_Gillard", 4) 

In [15]:
# Remove the extra wav for Julia Gillard by dropping the first loaded clip
julia_gillard_labels = julia_gillard_labels[1:]
julia_gillard_wavs = julia_gillard_wavs[1:]

In [16]:
all_wavs = nelson_mandela_wavs + margaret_thatcher_wavs + benjamin_netanyau_wavs + jens_stoltenberg_wavs + julia_gillard_wavs
all_labels = nelson_mandela_labels + margaret_thatcher_labels + benjamin_netanyau_labels + jens_stoltenberg_labels + julia_gillard_labels

In [17]:
# Split the dataset into training and testing sets (80% / 20%)
train_wavs, test_wavs, train_labels, test_labels = train_test_split(all_wavs, all_labels, test_size=0.2)

In [18]:
train_x, train_y = np.array(train_wavs), np.array(train_labels)
test_x, test_y = np.array(test_wavs), np.array(test_labels)

A **Mel spectrogram** is a spectrogram whose frequency axis uses the Mel scale, a perceptual scale that spaces frequencies the way humans hear pitch: finely at low frequencies and coarsely at high ones. It therefore emphasises the range that matters most for human speech. Thus, we are using Mel spectrograms to train our CNN model.
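
To make the Mel scale concrete, here is a minimal sketch of one commonly used Hz-to-Mel formula, $m = 2595 \log_{10}(1 + f/700)$. Libraries may use other variants, and which one the `Melspectrogram` layer applies internally is not confirmed here. The helper names and the 0 to 8000 Hz range are assumptions for illustration.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to Mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels, f_min, f_max):
    """Centre frequencies (Hz) of n_mels bands equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# 128 bands between 0 Hz and 8000 Hz (the Nyquist frequency at 16 kHz).
centers = mel_band_centers(128, 0.0, 8000.0)
print(round(centers[1] - centers[0], 1), round(centers[-1] - centers[-2], 1))
```

The two printed gaps show that neighbouring bands sit close together at low frequencies and far apart at high ones, which is the property the paragraph above describes.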

In [19]:
def create_model():
    model = tf.keras.Sequential()
    # Convert each raw 1-second waveform into a decibel-scaled Mel spectrogram inside the model
    model.add(Melspectrogram(sr=16000, n_mels=128, n_dft=512, n_hop=256,
                             input_shape=(1, 16000), return_decibel_melgram=True,
                             trainable_kernel=False, name='melgram'))

    # One convolution + pooling block over the spectrogram
    model.add(tf.keras.layers.Conv2D(64, (3, 3), activation="relu"))
    model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))

    # Classify into the five speakers
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(5, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

In [20]:
# melspectrogram model
model = create_model()

In [21]:
model.fit(x=train_x, y=train_y, epochs=10)

We can see that a training accuracy of **97.03%** has been achieved. Now, let us evaluate our model on the test data set and measure its accuracy.

In [22]:
model.evaluate(x=test_x, y=test_y)

Upon evaluating the model on the test data, we obtain a test accuracy of **93.20%**.
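
A single accuracy figure does not show which speakers get confused with one another. The notebook imports `confusion_matrix` but never calls it, so the sketch below illustrates the idea in plain Python. `count_confusions` and `per_class_recall` are hypothetical helpers, and the labels are made up for illustration; they are not results from this model. With the trained model, predicted labels could be obtained with `model.predict(test_x).argmax(axis=1)`.

```python
def count_confusions(true_labels, predicted_labels, n_classes):
    """Build a matrix where entry [t][p] counts clips of class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly (None if empty)."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else None)
    return recalls


# Made-up labels for five speakers (0-4), two clips each.
demo_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
demo_pred = [0, 0, 1, 2, 2, 2, 3, 3, 4, 0]
demo_matrix = count_confusions(demo_true, demo_pred, n_classes=5)
demo_recall = per_class_recall(demo_matrix)
for row in demo_matrix:
    print(row)
print(demo_recall)
```

Off-diagonal entries point to the speaker pairs the model mixes up, which the overall accuracy hides.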

# Conclusion

We have successfully fetched the audio files from the directories and converted them into Mel spectrograms, which turned the task into an image processing problem. Since CNNs are well suited to image data, we trained a CNN and achieved the following accuracy levels: <br>

**Training Accuracy:** 97.03% <br>
**Testing Accuracy:** 93.20% <br>

[**Loss Function used:** sparse_categorical_crossentropy <br>
**Evaluation Metric used:** accuracy]