<h1 align='center'>Speech Emotion Recognition</h1>
<h3 align='center'>Machine Learning Project</h3>
<h4 align='center'>by Aryan Agarwal (11707334)</h4>
<br>
<br>
<p><b>Assignment Topic:</b> Recognizing the emotion of a customer from speech, for telephone customer care.</p>
<br>
<br>
This project builds a <b>Multi-Layer Perceptron (MLP) classifier</b>, trained on the speech dataset <b>RAVDESS: Ryerson Audio-Visual Database of Emotional Speech and Song</b>. The dataset contains audio files of speech by various actors, which can be used to train a model to detect the emotion expressed. The original dataset is about 24.5GB and can be downloaded from <a href="https://www.kaggle.com/uwrfkaggler/ravdess-emotional-speech-audio" target="_blank">Kaggle</a>. This project uses a reduced dataset containing only part of the original RAVDESS dataset, which can be downloaded from my <a href="https://drive.google.com/file/d/1v3XBuTWDNa4yCYbTcutro_CY7uQ2YJB6/view?usp=sharing">Google Drive</a>:
<br><br>
https://drive.google.com/file/d/1v3XBuTWDNa4yCYbTcutro_CY7uQ2YJB6/view?usp=sharing


### Required Library
<br>
<br>
<img src="./img/librosa.png" height="100" width = "200"/>

<b>LibROSA: </b>LibROSA is a Python package for music and audio analysis. It provides the building blocks necessary to create music information retrieval systems. Its design emphasizes a flat package layout, standardized interfaces and names, backwards compatibility, modular functions, and readable code.




In [1]:
!pip install librosa



<b>SoundFile:</b> SoundFile can read and write sound files. File reading/writing is supported through libsndfile, which is a free, cross-platform, open-source (LGPL) library for reading and writing many different sampled sound file formats that runs on many platforms including Windows, OS X, and Unix.

In [1]:
!pip install soundfile



<img src="./img/numpy.png" height="100" width = "200"/>

<b>NumPy: </b>NumPy is the fundamental package for numerical computing in Python. It provides an N-dimensional array type and vectorized operations on it. This project uses it to average the extracted audio features over time and to stack them into one feature vector per audio file.

In [3]:
!pip install numpy



<img src="./img/sklearn.png" height="100" width = "200"/>

<b>Scikit-learn: </b>Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

In [4]:
!pip install scikit-learn



## Working

### Dataset information
The RAVDESS dataset consists of audio clips (.wav files) whose file names are made of hyphen-separated two-digit fields. The third field encodes the emotion:

    '01':'neutral',
    '02':'calm',
    '03':'happy',
    '04':'sad',
    '05':'angry',
    '06':'fearful',
    '07':'disgust',
    '08':'surprised'
    
For example, in the file name 03-01-08-02-02-02-01.wav the third field is 08, so the clip is labelled surprised. The decoded emotion serves as the target value.
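
As a minimal sketch of this decoding step, the emotion can be read from the third hyphen-separated field of the file name, which is also the field the loading code in this project uses. The helper name `decode_emotion` is hypothetical and only serves as an illustration:

```python
import os

# Emotion codes as listed above
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def decode_emotion(path):
    """Return the emotion label encoded in a RAVDESS-style file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    fields = stem.split("-")
    # The third field (index 2) carries the emotion code
    return EMOTIONS[fields[2]]

print(decode_emotion("RAVDESS/Actor_01/03-01-08-02-02-02-01.wav"))  # surprised
```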

### Features Information
We extract MFCC, chroma and mel features from each audio file. Averaged over time and concatenated, they form the feature vector used for training.
<br>
<b>MFCC: </b>Mel Frequency Cepstral Coefficients (MFCCs) represent the short-term power spectrum of a sound and are widely used in automatic speech and speaker recognition. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to represent this envelope accurately.
<br>
<b>Chroma: </b>Pertains to the 12 different pitch classes. Chroma features are an interesting and powerful representation for music audio in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave.
<br>
<b>Mel: </b>Mel-scaled spectrogram. This is a spectrogram whose frequency axis is mapped onto the mel scale, which approximates how humans perceive pitch. It gives a numerical representation of the sound that is also easy to visualize.
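
To make the resulting feature vector concrete, here is a small sketch of how the three time-averaged blocks are laid out once concatenated. The sizes are assumptions: 40 MFCCs as requested in the code of this project, plus 12 chroma bins and 128 mel bands, which are librosa's defaults. The helper `feature_slices` is hypothetical.

```python
def feature_slices(sizes):
    """Map each feature name to the slice it occupies in the stacked vector."""
    slices = {}
    start = 0
    for name, size in sizes.items():
        slices[name] = slice(start, start + size)
        start += size
    # start now equals the total length of the concatenated vector
    return slices, start

# Assumed block sizes, in the order they are stacked
sizes = {"mfcc": 40, "chroma": 12, "mel": 128}
slices, total = feature_slices(sizes)

print(total)             # 180
print(slices["chroma"])  # slice(40, 52, None)
```

The total of 180 matches the number of features per file reported in the output of this notebook.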


### Opening dataset and extracting Features
We open each audio file with the <b>soundfile</b> library and extract its features with the <b>Librosa</b> library. The features are stored in a NumPy array, and the target values (decoded from the file names) are stored in a separate list. Finally, the dataset is split into training and test data.
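
The hold-out split can be illustrated with a plain-Python sketch. This is not scikit-learn's implementation, only an assumed equivalent in spirit: shuffle the indices with a fixed seed, then cut off a fraction for testing. The function name `holdout_split` and the dummy data are hypothetical.

```python
import math
import random

def holdout_split(samples, labels, test_size=0.25, seed=9):
    """Shuffle indices reproducibly and split them into train and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(samples) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Dummy data standing in for 768 feature vectors and their labels
samples = list(range(768))
labels = ["calm" if i % 2 == 0 else "happy" for i in samples]

x_train, x_test, y_train, y_test = holdout_split(samples, labels, test_size=0.25)
print(len(x_train), len(x_test))  # 576 192
```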

### Training Model and Predicting
Next we create a scikit-learn Multi-Layer Perceptron classifier, train it on the training data, and predict the emotions of the test data. Finally, we compute the accuracy with sklearn.metrics.
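
Accuracy here is simply the fraction of test clips whose predicted label equals the true label. A minimal sketch of that computation, using a hypothetical `accuracy` helper and made-up labels rather than the project's data:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Illustrative labels only
y_true = ["calm", "happy", "fearful", "disgust"]
y_pred = ["calm", "happy", "disgust", "disgust"]

print(accuracy(y_true, y_pred))  # 0.75
```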

### Code

In [3]:
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

In [4]:
# Extract MFCC, chroma and mel features from a sound file
def extract_feature(file_name, mfcc, chroma, mel):
    
    # Open the file with soundfile and read the samples and the sample rate
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        
    # Each requested feature is averaged over time and appended to result
    result = np.array([])
    if mfcc:
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        # Chroma is computed from the magnitude of the short-time Fourier transform
        stft = np.abs(librosa.stft(y=X))
        chroma_feat = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma_feat))
    if mel:
        # y must be passed by keyword in recent librosa versions
        mel_feat = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel_feat))
            
    # Return a 1-D ndarray with all the features of that file
    return result

In [5]:
# Emotion in the RAVDESS dataset

# dictionary to decode encoded target into verbal emotions
emotions = {
    '01':'neutral',
    '02':'calm',
    '03':'happy',
    '04':'sad',
    '05':'angry',
    '06':'fearful',
    '07':'disgust',
    '08':'surprised'
}

observed_emotions = ['calm', 'happy', 'fearful', 'disgust'] 

In [23]:
# Load the data and extract the features for each sound file
def load_data(test_size = 0.2):
    x = []
    y = []
    
    # Select every actor's .wav files with a platform-independent pattern
    for file in glob.glob(os.path.join("RAVDESS", "Actor_*", "*.wav")):
        
        file_name = os.path.basename(file)
        
        # The third field of the file name is the emotion code; decode it with the dictionary
        emotion = emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        
        # calling extract feature function created above to get feature of file
        feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
        
        # appending features and target for all the file
        x.append(feature)
        y.append(emotion)
    
    # Return the dataset split into train and test parts
    return train_test_split(np.array(x), y, test_size = test_size, random_state = 9)

In [24]:
# Split dataset into train and test
x_train, x_test, y_train, y_test = load_data(test_size = 0.25)

In [8]:
# Size of train and test data
print((x_train.shape[0], x_test.shape[0]))

(576, 192)


In [9]:
# total features extracted
print("Features extracted:", x_train.shape[1])

Features extracted: 180


In [10]:
# Initialize the Multi Layer Perceptron Classifier
model = MLPClassifier(alpha = 0.01, batch_size = 256, epsilon = 1e-08, hidden_layer_sizes = (300,), learning_rate = 'adaptive', max_iter = 500)

In [11]:
# Train the model
model.fit(x_train, y_train)

MLPClassifier(activation='relu', alpha=0.01, batch_size=256, beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(300,), learning_rate='adaptive',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [31]:
# predict the test set
y_pred = model.predict(x_test)

In [44]:
# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy = ", accuracy)

Accuracy =  0.85416666666666


In [43]:
# Predict the emotion for a single audio file
pred_file = r".\RAVDESS\Actor_01\03-01-08-02-02-02-01.wav"
file_feature = extract_feature(pred_file, mfcc=True, chroma=True, mel=True)
# The model expects a 2-D array of shape (n_samples, n_features)
file_feature = file_feature.reshape(1, -1)
print(model.predict(file_feature))

['happy']
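
A classifier trained only on `observed_emotions` can only return one of those labels, so a clip whose labelled emotion is outside that set cannot be predicted correctly. The following sketch, with a hypothetical helper `labelled_emotion_is_observed`, checks this before a prediction is compared with the label in the file name:

```python
import os

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
OBSERVED_EMOTIONS = ["calm", "happy", "fearful", "disgust"]

def labelled_emotion_is_observed(path):
    """Return (label, True/False) depending on whether the label was trained on."""
    # Normalise Windows-style separators so the sketch works on any platform
    file_name = os.path.basename(path.replace("\\", "/"))
    label = EMOTIONS[file_name.split("-")[2]]
    return label, label in OBSERVED_EMOTIONS

print(labelled_emotion_is_observed(r".\RAVDESS\Actor_01\03-01-08-02-02-02-01.wav"))
# ('surprised', False)
```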
