# Speech Emotion Recognition with MLP Classifier



# Dataset
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) 

---
Audio-only files

Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):

* Speech file (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440.
* Song file (Audio_Song_Actors_01-24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012.

Total: 2452 files
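According to the RAVDESS documentation, each actor records two statements, each repeated twice, and every emotion except neutral is recorded at two intensities (normal and strong). Speech covers eight emotions and song six, which is where the per-actor trial counts above come from. A quick arithmetic check (the helper name is ours, for illustration only):

```python
STATEMENTS = 2   # two statements per actor
REPETITIONS = 2  # each statement is recorded twice

def trials_per_actor(n_emotions: int) -> int:
    """Files per actor: neutral has one intensity, every other emotion has two."""
    intensity_conditions = 1 + (n_emotions - 1) * 2
    return intensity_conditions * STATEMENTS * REPETITIONS

speech_files = trials_per_actor(8) * 24  # 8 speech emotions, actors 01-24
song_files = trials_per_actor(6) * 23    # 6 song emotions, 23 actors
total_files = speech_files + song_files

print(trials_per_actor(8), trials_per_actor(6))  # 60 44
print(speech_files, song_files, total_files)     # 1440 1012 2452
```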

---

---
Toronto emotional speech set (TESS)

---


A set of 200 target words was spoken in the carrier phrase "Say the word _" by two actresses (aged 26 and 64 years), and recordings were made of the set portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). There are 2800 data points (audio files) in total.

The dataset is organised so that each combination of actress and emotion has its own folder, which contains the audio files for all 200 target words. The audio files are in WAV format.
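The TESS count follows from the same kind of arithmetic: every target word is recorded once per emotion by each actress. Adding the RAVDESS total gives the size of the combined collection, which matches the 3939 + 1313 train/test split reported later in this notebook:

```python
TESS_WORDS = 200
TESS_EMOTIONS = 7
TESS_ACTRESSES = 2

tess_files = TESS_WORDS * TESS_EMOTIONS * TESS_ACTRESSES
ravdess_files = 1440 + 1012  # speech + song, from the RAVDESS section
combined_files = tess_files + ravdess_files

print(tess_files, combined_files)  # 2800 5252
```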


---



# Mount Google Drive



In [None]:
from google.colab import drive
drive.mount('/content/drive/')

# Install the following libraries

In [9]:
%pip install --upgrade pip
%pip install setuptools wheel



In [10]:
%pip install librosa soundfile numpy scikit-learn



In [11]:
%pip install soundfile



# Make the necessary imports

In [12]:
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

Define a function extract_feature to extract the MFCC, chroma, and mel features from a sound file. This function takes four parameters: the file name and three Boolean flags, one for each feature:

* mfcc: Mel Frequency Cepstral Coefficients, which represent the short-term power spectrum of a sound
* chroma: the energy in each of the 12 different pitch classes
* mel: Mel spectrogram, the power spectrogram mapped onto the mel frequency scale
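To make "mel" and "chroma" concrete: the mel scale compresses high frequencies roughly logarithmically, and chroma folds every frequency onto one of 12 semitone classes regardless of octave. The sketch below uses the common HTK mel formula, m = 2595 log10(1 + f/700); note that librosa defaults to the Slaney variant (htk=False), so its exact mel values differ. The helper names are hypothetical and are not librosa functions.

```python
import math

def hz_to_mel_htk(f_hz: float) -> float:
    """HTK-style mel scale: roughly linear below 700 Hz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz_htk(mel: float) -> float:
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(f_hz: float) -> str:
    """Map a frequency to the nearest of the 12 pitch classes, using A4 = 440 Hz."""
    semitones_from_a4 = round(12 * math.log2(f_hz / 440.0))
    return PITCH_CLASSES[(PITCH_CLASSES.index("A") + semitones_from_a4) % 12]

print(round(hz_to_mel_htk(1000.0)))             # 1000
print(pitch_class(261.63), pitch_class(880.0))  # C A
```

Octaves collapse onto the same class (440 Hz and 880 Hz are both A), which is exactly why a chroma vector has only 12 entries.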

In [13]:
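def extract_feature(file_name, mfcc, chroma, mel):
    # Load the clip; librosa resamples to 22,050 Hz by default.
    # res_type='kaiser_fast' relies on the optional resampy package in recent librosa versions.
    X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    if chroma:
        stft = np.abs(librosa.stft(X))  # magnitude spectrogram, reused by chroma_stft
    result = np.array([])
    if mfcc:
        # 40 MFCCs per frame, averaged over time -> 40 values
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        # 12 pitch-class energies per frame, averaged over time -> 12 values
        chroma_feat = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma_feat))
    if mel:
        # 128 mel bands (librosa default) per frame, averaged over time -> 128 values.
        # The audio must be passed as y=...: librosa 0.10+ rejects it as a positional argument.
        mel_feat = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel_feat))
    return result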
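Each librosa feature call returns a matrix with one row per coefficient and one column per frame. Transposing and averaging over axis 0 collapses the time axis, so every clip yields a fixed-length vector no matter how long it is. The sketch below mimics this with random arrays standing in for real librosa output, assuming 40 MFCCs, 12 chroma bins and librosa's default of 128 mel bands; it shows where the 180 features reported later come from.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 120  # hypothetical clip length in frames; the result size does not depend on it

# Stand-ins shaped like librosa output: (n_coefficients, n_frames)
mfcc_frames = rng.standard_normal((40, n_frames))
chroma_frames = rng.random((12, n_frames))
mel_frames = rng.random((128, n_frames))

def pool_over_time(frames: np.ndarray) -> np.ndarray:
    """Same pattern as extract_feature: transpose, then average each coefficient over frames."""
    return np.mean(frames.T, axis=0)

feature_vector = np.hstack((pool_over_time(mfcc_frames),
                            pool_over_time(chroma_frames),
                            pool_over_time(mel_frames)))
print(feature_vector.shape)  # (180,)
```

Averaging discards how the features evolve over time; that keeps the input small enough for a plain MLP, at the cost of temporal detail.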

Now, let's define a dictionary that maps the two-digit emotion codes used in the RAVDESS & TESS filenames to emotion names, and a list of the emotions to observe: all 8 of them (neutral, calm, happy, sad, angry, fearful, disgust, surprised).

In [14]:
# Emotions in the RAVDESS & TESS dataset
emotions={
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}
# Emotions to observe
observed_emotions=['neutral','calm','happy','sad','angry','fearful', 'disgust','surprised']
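The dictionary keys are the third field of a RAVDESS filename, which is what load_data reads with file_name.split("-")[2]; load_data therefore assumes every file it processes, TESS recordings included, follows this naming scheme. Per the RAVDESS documentation, a name such as 03-01-06-01-02-01-12.wav encodes seven fields: modality, vocal channel, emotion, emotional intensity, statement, repetition and actor (odd-numbered actors are male, even-numbered female). A minimal parser, with a hypothetical helper name:

```python
import os

EMOTIONS = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
            '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}

def parse_ravdess_name(path: str) -> dict:
    """Split a RAVDESS-style filename into its seven numeric fields."""
    stem = os.path.splitext(os.path.basename(path))[0]
    modality, channel, emotion, intensity, statement, repetition, actor = stem.split("-")
    actor_id = int(actor)
    return {
        "modality": modality,
        "emotion": EMOTIONS[emotion],
        "vocal_channel": "speech" if channel == "01" else "song",
        "intensity": "normal" if intensity == "01" else "strong",
        "statement": int(statement),
        "repetition": int(repetition),
        "actor": actor_id,
        "sex": "male" if actor_id % 2 == 1 else "female",
    }

info = parse_ravdess_name("Actor_12/03-01-06-01-02-01-12.wav")
print(info["emotion"], info["vocal_channel"], info["actor"], info["sex"])  # fearful speech 12 female
```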

# Load the data and extract features for each sound file

In [15]:

def load_data(test_size=0.2):
    x,y=[],[]
    for file in glob.glob('./dataset_features/Actor_*/*.wav'):
        file_name=os.path.basename(file)
        emotion=emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature=extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    # train_size is left unset so the training set is everything not in the test set
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

# Split the Dataset
Time to split the dataset into training and testing sets! We'll keep 25% of the data as the test set and use the load_data function for this. Since load_data also extracts the features, this cell reads and processes every audio file.

In [16]:
# Extract features and split the dataset
x_train,x_test,y_train,y_test=load_data(test_size=0.25)


# Observe the shape of the training and testing datasets

In [17]:
#Get the shape of the training and testing datasets
print((x_train.shape[0], x_test.shape[0]))

(3939, 1313)
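These sizes follow from how scikit-learn rounds fractional split sizes; this sketch assumes its rule of rounding the test count up and the training count down, with the training set being everything left over when train_size is omitted. split_sizes is a hypothetical helper, not a scikit-learn function. It reproduces (3939, 1313) for the 5252 files, and also shows that passing a fractional test_size and train_size that sum to less than 1 silently leaves some files out of both sets.

```python
import math

def split_sizes(n_samples, test_size, train_size=None):
    """Hypothetical helper mirroring scikit-learn's rounding for fractional split sizes."""
    n_test = math.ceil(test_size * n_samples)
    if train_size is None:
        n_train = n_samples - n_test  # training set is the complement of the test set
    else:
        n_train = math.floor(train_size * n_samples)
    return n_train, n_test

n_files = 5252
print(split_sizes(n_files, 0.25))  # (3939, 1313)
n_train, n_test = split_sizes(n_files, 0.2, train_size=0.75)
print(n_train, n_test, n_files - n_train - n_test)  # 3939 1051 262
```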


# Number of features extracted.

In [18]:
# Get the number of features extracted
print(f'Features extracted: {x_train.shape[1]}')

Features extracted: 180


# MLP Classifier

In [19]:
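# Initialize the Multi Layer Perceptron Classifier
# With the default solver ('adam'), learning_rate='adaptive' has no effect: scikit-learn only
# uses learning_rate when solver='sgd'. No random_state is set, so accuracy varies between runs.
model=MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500)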
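With hidden_layer_sizes=(300,), the network is fully connected: 180 input features, one hidden layer of 300 units, and 8 output units, one per emotion class. Counting one weight per connection plus one bias per non-input unit gives the number of trainable parameters. The helper below is a hypothetical illustration, not part of scikit-learn.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between consecutive layers
        total += n_out         # one bias per unit in the receiving layer
    return total

# 180 features -> 300 hidden units -> 8 emotion classes
print(mlp_parameter_count([180, 300, 8]))  # 56708
```

After fitting, the same total can be read from the model itself as sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_).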

# Fit (train) the model

In [20]:
# Train the model
model.fit(x_train,y_train)

# Measure the accuracy of our model

Let’s predict the values for the test set. This gives us y_pred (the predicted emotions for the features in the test set).

In [21]:
# Predict for the test set
y_pred=model.predict(x_test)

To calculate the accuracy of our model, we'll call the accuracy_score() function we imported from sklearn. Finally, we'll convert the accuracy to a percentage, round it to 2 decimal places, and print it out.

In [22]:
# Calculate the accuracy of our model
accuracy=accuracy_score(y_true=y_test, y_pred=y_pred)
# Print the accuracy
print("Accuracy: {:.2f}%".format(accuracy*100))

Accuracy: 79.97%
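accuracy_score returns the fraction of test samples whose predicted label equals the true label. A plain-Python equivalent, using a hypothetical function name and toy labels, with the same percentage formatting as above:

```python
def simple_accuracy(y_true, y_pred):
    """Fraction of positions where prediction and ground truth agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Toy labels for illustration only
y_true = ["angry", "calm", "sad", "happy"]
y_pred = ["angry", "neutral", "sad", "happy"]
acc = simple_accuracy(y_true, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))  # Accuracy: 75.00%
```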


# Classification Report

In [23]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))


              precision    recall  f1-score   support

       angry       0.97      0.79      0.87       188
        calm       0.80      0.48      0.60        82
     disgust       0.80      0.83      0.81       162
     fearful       0.79      0.84      0.81       203
       happy       0.89      0.75      0.81       192
     neutral       0.64      0.95      0.76       152
         sad       0.77      0.82      0.79       187
   surprised       0.85      0.80      0.82       147

    accuracy                           0.80      1313
   macro avg       0.81      0.78      0.78      1313
weighted avg       0.82      0.80      0.80      1313
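The two summary rows average the per-class scores differently: macro avg is the unweighted mean over the eight emotions, while weighted avg weights each class by its support. Calm has both the lowest recall and the smallest support, which is the main reason macro recall (0.78) sits below weighted recall (0.80). Recomputing from the rounded values in the table reproduces both rows to within rounding:

```python
# Per-class values copied from the classification report above (rounded to 2 decimals)
report = {
    #            precision, recall, support
    "angry":     (0.97, 0.79, 188),
    "calm":      (0.80, 0.48, 82),
    "disgust":   (0.80, 0.83, 162),
    "fearful":   (0.79, 0.84, 203),
    "happy":     (0.89, 0.75, 192),
    "neutral":   (0.64, 0.95, 152),
    "sad":       (0.77, 0.82, 187),
    "surprised": (0.85, 0.80, 147),
}

def macro_avg(column):
    """Unweighted mean of one column over all classes."""
    return sum(row[column] for row in report.values()) / len(report)

def weighted_avg(column):
    """Mean of one column weighted by each class's support."""
    total_support = sum(row[2] for row in report.values())
    return sum(row[column] * row[2] for row in report.values()) / total_support

print(f"macro    precision {macro_avg(0):.2f}  recall {macro_avg(1):.2f}")
print(f"weighted precision {weighted_avg(0):.2f}  recall {weighted_avg(1):.2f}")
```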



# Confusion Matrix

In [24]:
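from sklearn.metrics import confusion_matrix
# Rows are true labels and columns are predicted labels, both in sorted order:
# angry, calm, disgust, fearful, happy, neutral, sad, surprised
matrix = confusion_matrix(y_test,y_pred)
print(matrix)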

[[148   0   8  15   4   4   5   4]
 [  0  39   2   2   3  29   7   0]
 [  0   0 134   1   0  12   8   7]
 [  3   0   4 170   5   8  13   0]
 [  2   8   5   9 144  14   4   6]
 [  0   1   0   0   0 145   5   1]
 [  0   1   6  15   1   8 153   3]
 [  0   0   9   4   5   8   4 117]]
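The classification report can be rebuilt from this matrix. scikit-learn orders rows (true labels) and columns (predicted labels) by sorted label, so row 0 is angry and row 1 is calm. The diagonal holds correct predictions: recall divides it by the row sum, precision by the column sum, and accuracy is the diagonal total over all 1313 test files. The matrix also shows where the weak calm recall comes from: 29 of the 82 calm clips are predicted as neutral.

```python
labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
# Confusion matrix copied from the output above: rows = true label, columns = predicted label
matrix = [
    [148,  0,   8,  15,   4,   4,   5,   4],
    [  0, 39,   2,   2,   3,  29,   7,   0],
    [  0,  0, 134,   1,   0,  12,   8,   7],
    [  3,  0,   4, 170,   5,   8,  13,   0],
    [  2,  8,   5,   9, 144,  14,   4,   6],
    [  0,  1,   0,   0,   0, 145,   5,   1],
    [  0,  1,   6,  15,   1,   8, 153,   3],
    [  0,  0,   9,   4,   5,   8,   4, 117],
]

n_classes = len(labels)
correct = sum(matrix[i][i] for i in range(n_classes))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

# Recall: correct / row sum (all clips that truly have this emotion)
recall = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(n_classes)}
# Precision: correct / column sum (all clips predicted as this emotion)
precision = {labels[j]: matrix[j][j] / sum(row[j] for row in matrix) for j in range(n_classes)}

print(f"Accuracy: {accuracy * 100:.2f}%")  # Accuracy: 79.97%
for name in labels:
    print(f"{name:>10}  precision {precision[name]:.2f}  recall {recall[name]:.2f}")
```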


# Thank You