# **Implementation by Yashaswi Galhotra**

The goal is to build a model that recognizes emotion from speech, using the librosa and sklearn libraries and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset.

In this proposed solution, I will use the libraries librosa, soundfile, and sklearn to build a model using an MLP Classifier. An MLP Classifier is an artificial neural network; here it will learn to recognize emotion from sound files. We will load the data, extract features from it, then split the dataset into training and testing sets. Then, we’ll initialize an MLP Classifier and train the model. Finally, we’ll calculate the accuracy, precision, recall, F1 score, and confusion matrix of our model.

I have used the **Librosa package** because:
*   Librosa is a Python package for music and audio analysis.
*   It provides the building blocks necessary to create music information retrieval systems.



In [None]:
# Importing the important libraries required to implement this problem
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix


I have defined a dictionary that maps each emotion code in the dataset to its emotion name, and a list called "required_emotions" that contains the emotions mentioned in the PDF file (Calm, Happy, Sad, Angry, Fearful, Disgust, Surprised, and Neutral).

In [None]:
# Emotions in the Ryerson Audio-Visual Database of Emotional Speech and Song dataset
emotions={
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}
# Emotions to observe
required_emotions=['calm','happy','fearful','disgust','sad','angry','surprised','neutral']

I defined a function **extract_feature** to extract the mfcc, chroma, and mel features from a sound file. This function takes 4 parameters: the file name and three Boolean parameters, one for each of the three features:

* **mfcc:** Mel Frequency Cepstral Coefficients, which represent the short-term power spectrum of a sound
* **chroma:** Pertains to the 12 different pitch classes
* **mel:** Mel spectrogram, a spectrogram whose frequencies are mapped to the mel scale

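The function opens the sound file with soundfile, reads it into X, and gets the sample rate. If chroma is True, it also computes the magnitude of the Short-Time Fourier Transform of X. Each feature is then averaged over time, so every file yields one fixed-length vector per feature.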

I then call the hstack() function from numpy, which stacks a sequence of input arrays horizontally (i.e. column-wise) to make a single array. Here it appends each feature vector to the result.
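
As a rough illustration of the stacking step, the sketch below concatenates three placeholder arrays. Their lengths are assumptions: 40 matches the `n_mfcc=40` argument used in this notebook, while 12 chroma bins and 128 mel bands are librosa's defaults. The zero-filled arrays are stand-ins, not real audio features.

```python
import numpy as np

# Placeholder vectors standing in for the per-file feature means (assumed sizes)
mfccs = np.zeros(40)    # n_mfcc=40, as requested in extract_feature
chroma = np.zeros(12)   # librosa's default number of chroma bins
mel = np.zeros(128)     # librosa's default number of mel bands

# Start from an empty array and append each block, as extract_feature does
result = np.array([])
for block in (mfccs, chroma, mel):
    result = np.hstack((result, block))

print(result.shape)  # (180,)
```

Under these assumptions, each audio file is reduced to a single vector of 180 values, whatever its duration.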

In [None]:
# Extract features (mfcc, chroma, mel) from an audio file
def extract_feature(file_name, mfcc, chroma, mel):
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate=sound_file.samplerate
        # The STFT magnitude is only needed for the chroma features
        if chroma:
            stft=np.abs(librosa.stft(X))
        result=np.array([])
        if mfcc:
            # Average each of the 40 MFCCs over time
            mfccs=np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result=np.hstack((result, mfccs))
        if chroma:
            chroma_mean=np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
            result=np.hstack((result, chroma_mean))
        if mel:
            # Pass the signal as y=...; recent librosa versions require a keyword here
            mel_mean=np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)
            result=np.hstack((result, mel_mean))
    return result

Next, I load the data with a function **load_data()**, which takes the relative size of the test set as a parameter. x and y start as empty lists. I used the **glob() function** from the glob module to get the pathnames of all the sound files in the Ryerson Audio-Visual Database of Emotional Speech and Song dataset.

For each file, the third hyphen-separated field of the base name is the emotion code. Using our emotions dictionary, this number is turned into an emotion. The function then calls extract_feature and stores what is returned in ‘feature’. Then, it appends the feature to x and the emotion to y. So, **the list x holds the features and y holds the emotions**. Finally, it calls train_test_split with these lists, the test size, and a random state value, and returns the result.
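
The label lookup can be tried on its own, without any audio. The sketch below repeats the emotion codes from this notebook and applies the same `split("-")[2]` step to a single file name. The helper name `emotion_from_path` and the example path are hypothetical, used only for illustration.

```python
import os

# Emotion codes, as in this notebook's `emotions` dictionary
emotions = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}

def emotion_from_path(path):
    """Return the emotion label encoded in the third field of the file name."""
    file_name = os.path.basename(path)
    code = file_name.split("-")[2]
    return emotions[code]

# Hypothetical path, used only for illustration
example = os.path.join("Actor_12", "03-01-06-01-02-01-12.wav")
print(emotion_from_path(example))  # fearful
```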

In [None]:
# Load the data and extract features for each audio file
def load_data(test_size=0.2):
    x,y=[],[]
    for file in glob.glob("D:\\audio-video\\Actor_*\\*.wav"):
        file_name=os.path.basename(file)
        emotion=emotions[file_name.split("-")[2]]
        feature=extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

After assigning the emotions to the dataset, I split the dataset into training and testing sets.
After splitting the dataset, I observe the shape of the training and testing datasets:

In [None]:
# Split the dataset
x_train,x_test,y_train,y_test=load_data(test_size=0.25)

# Get the shape of the training and testing datasets
print((x_train.shape[0], x_test.shape[0]))

# Get the number of features extracted
print(f'Features extracted: {x_train.shape[1]}')
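
To see how `test_size` translates into sample counts, here is a small sketch on made-up data. The 8 toy samples and their labels are invented for illustration; only `train_test_split` and the `random_state=9` value come from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 8 samples with 3 features each, and one label per sample
features = np.arange(24).reshape(8, 3)
labels = ['calm', 'happy', 'sad', 'angry'] * 2

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=9
)

print(x_tr.shape, x_te.shape)  # (6, 3) (2, 3)
```

With `test_size=0.25`, a quarter of the samples go to the test set and the rest are used for training.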

Next, I initialize an **MLP Classifier**. A **Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function $f: \mathbb{R}^m \rightarrow \mathbb{R}^o$ by training on a dataset, where $m$ is the number of dimensions for input and $o$ is the number of dimensions for output**. sklearn's MLPClassifier optimizes the log-loss function using LBFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) or stochastic gradient descent; because no solver is passed below, it uses its default, Adam, a stochastic gradient-based optimizer. Unlike SVM or Naive Bayes, the **MLP Classifier has an internal neural network for the purpose of classification. This is a feedforward ANN model.**
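
To make "feedforward" concrete, the sketch below runs one forward pass through a single hidden layer in plain NumPy. It is an illustration under stated assumptions, not sklearn's internal code: the weights are random rather than trained, the input size of 180 assumes 40 MFCC + 12 chroma + 128 mel values, and the ReLU hidden activation and softmax output mirror MLPClassifier's defaults for multi-class problems.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes assumed from this notebook: 180 features, 300 hidden units, 8 emotions
n_features, n_hidden, n_classes = 180, 300, 8

# Randomly initialized weights; a trained model would have learned these
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(x):
    """One feedforward pass: ReLU hidden layer, then softmax output."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

x = rng.normal(size=n_features)  # placeholder feature vector, not real audio
probs = forward(x)

# Log-loss for one sample: negative log-probability of the true class
true_class = 3  # arbitrary class index, for illustration
log_loss = -np.log(probs[true_class])

print(probs.shape, log_loss)
```

Training adjusts the weights so that this log-loss, averaged over the training set, goes down.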

In [None]:
# Initialize the Multi Layer Perceptron Classifier
model=MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500)

After initializing the MLP Classifier, I fit the model on the training dataset and use it to predict the labels of the test dataset.

In [None]:
# Train the model
model.fit(x_train,y_train)

# Predict for the test set
y_pred=model.predict(x_test)

After training the model and predicting on the test set, I compute the **accuracy, recall, F1 score, and confusion matrix.**

In [None]:
# Calculate the accuracy of our model
accuracy=accuracy_score(y_true=y_test, y_pred=y_pred)
# Print the accuracy
print("Accuracy: {:.2f}%".format(accuracy*100))
# Print the recall for each class
recall=recall_score(y_true=y_test, y_pred=y_pred, average=None)
print(f'Recall per class: {recall}')
# Print the f1 score
print('F1 score:', f1_score(y_test,y_pred,average='weighted'))
# Print the confusion matrix
print('\n Confusion matrix:\n',confusion_matrix(y_test, y_pred))
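
To show how these metrics relate to the confusion matrix, the sketch below derives accuracy, recall, and precision by hand from a small set of labels. The labels are invented for illustration and are not results from this model. Rows are true labels and columns are predicted labels, in sorted label order, which is also the layout sklearn's confusion_matrix uses.

```python
from collections import Counter

# Toy labels, invented for illustration only
y_true = ['calm', 'calm', 'happy', 'happy', 'sad', 'sad']
y_pred = ['calm', 'happy', 'happy', 'happy', 'sad', 'calm']

labels = sorted(set(y_true))
pairs = Counter(zip(y_true, y_pred))

# Confusion matrix: rows are true labels, columns are predicted labels
matrix = [[pairs[(t, p)] for p in labels] for t in labels]

# Accuracy: the diagonal (correct predictions) over all samples
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)

# Recall: correct predictions for a class divided by its row total
recall = {lab: matrix[i][i] / sum(matrix[i]) for i, lab in enumerate(labels)}

# Precision: correct predictions for a class divided by its column total
precision = {
    lab: matrix[i][i] / sum(row[i] for row in matrix)
    for i, lab in enumerate(labels)
}

print(matrix)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(recall)
print(precision)
```

Here 'happy' has perfect recall because both true 'happy' samples were found, but its precision is lower because one 'calm' sample was also predicted as 'happy'.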