# <font color=green>ISAT 449: Emerging Topics in Applied Data Science</font>

## <font color=blue>Mini-Project: How to Make a Speech Emotion Recognizer Using Python and Scikit-learn</font>

<font color=orange>**Building a Speech Emotion Recognition system that detects emotion from a human speech tone using the Scikit-Learn library, Python, and Librosa**</font>

**What is Speech Emotion Recognition?**

Speech Emotion Recognition (SER) is the task of recognizing human emotion and affective states from speech. It relies on the fact that the voice often reflects underlying emotion through tone and pitch. Animals such as dogs and horses rely on the same cues to understand human emotion.

SER is tough because emotions are subjective and annotating audio is challenging.

**What is librosa?**

Librosa is a Python library for analyzing audio and music. It offers a flat package layout, standardized interfaces and names, backwards compatibility, modular functions, and readable code. Later in this Python mini-project, we show how to install it (and a few other packages) with pip.

**What is JupyterLab?**

As you have seen, I use JupyterLab, an open-source, web-based UI for Project Jupyter. It has all the basic functionality of the Jupyter Notebook, such as notebooks, terminals, text editors, file browsers, and rich outputs, and it also supports third-party extensions. It comes bundled with the Anaconda Data Science Framework if you want to try it out, but you can keep using your Jupyter Notebook and all will be fine.

**Speech Emotion Recognition - Objective**

To build a model to recognize emotion from speech using the librosa and sklearn libraries and the RAVDESS dataset.

**Speech Emotion Recognition - About the Python Mini Project**

In this Python mini project, we will use the libraries librosa, soundfile, and sklearn (among others) to build a model, based on an MLPClassifier, that can recognize emotion from sound files. We will load the data, extract features from it, then split the dataset into training and testing sets. Then, we'll initialize an MLPClassifier and train the model. Finally, we'll calculate the accuracy of our model.
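
The steps above can be tried end to end before touching any audio. The following is a minimal sketch, not the project's final code: it replaces the real audio features with synthetic data from `make_classification`, and the settings (`hidden_layer_sizes=(100,)`, `max_iter=500`, `test_size=0.25`) are illustrative assumptions rather than values taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the audio features (assumption: the real project
# extracts its features from the RAVDESS sound files instead).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=10, n_classes=4, random_state=0
)

# Split the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Initialize the classifier and train it (hyperparameters are illustrative).
model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
model.fit(X_train, y_train)

# Calculate the accuracy on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
```

The accuracy printed here says nothing about speech; it only confirms that the split, train, and evaluate steps fit together.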

**The Dataset**

For this Python mini project, we'll use the RAVDESS dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song, which is free to download. The dataset has 7356 files; each was rated 10 times on emotional validity, intensity, and genuineness, with ratings provided by 247 individuals. The full dataset, recorded by 24 actors, is 24.8 GB, but we've lowered the sample rate on all the files, and you can download the reduced version from Canvas.

**File Summary**

In total, the RAVDESS collection includes 7356 files (2880 + 2024 + 1440 + 1012 files).

**File naming convention**

Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

**Filename Identifiers**

 - Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
 - Vocal channel (01 = speech, 02 = song).
 - Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
 - Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
 - Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
 - Repetition (01 = 1st repetition, 02 = 2nd repetition).
 - Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
 
_Filename example: 02-01-06-01-02-01-12.mp4_

 - Video-only (02)
 - Speech (01)
 - Fearful (06)
 - Normal intensity (01)
 - Statement "dogs" (02)
 - 1st Repetition (01)
 - 12th Actor (12)
 - Female, as the actor ID number is even.
 
You can find more information on the file structure and filenames from Zenodo: Filename References
(https://zenodo.org/record/1188976#.X3KzGGhKhPa)
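
The naming convention above can be decoded mechanically. The sketch below is one possible way to do it; the helper name `parse_ravdess_filename` and the dictionary keys are hypothetical choices for illustration, while the code tables are copied from the identifier list above.

```python
from pathlib import Path

# Lookup tables copied from the filename identifier list.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {
    "01": "Kids are talking by the door",
    "02": "Dogs are sitting by the door",
}


def parse_ravdess_filename(path):
    """Decode a RAVDESS filename into readable fields (hypothetical helper)."""
    parts = Path(path).stem.split("-")  # drop the extension, split the 7 parts
    if len(parts) != 7:
        raise ValueError(f"expected 7 identifiers, got {len(parts)}: {path!r}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": int(actor),
        # Odd-numbered actors are male, even-numbered actors are female.
        "actor_sex": "female" if int(actor) % 2 == 0 else "male",
    }


info = parse_ravdess_filename("02-01-06-01-02-01-12.mp4")
print(info["emotion"], info["actor_sex"])  # prints: fearful female
```

For this project, the `emotion` field is the part that matters most, since it can serve as the label for each sound file.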

### **Prerequisites**

You'll need to install the following libraries with pip:

 - **pip install** _librosa soundfile numpy scikit-learn pyaudio_
 
If you run into issues installing librosa with **pip**, you can try it with **conda**.
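
After installing, a quick check like the one below can confirm that each package is visible to Python. This is a small sketch using only the standard library; the helper name `find_missing` is a hypothetical choice. Note that import names can differ from install names: the scikit-learn library is imported as `sklearn`.

```python
import importlib.util

# Import names (not install names) of the packages this project needs.
REQUIRED_MODULES = ["librosa", "soundfile", "numpy", "sklearn", "pyaudio"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```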

The whole pipeline is as follows (same as any machine learning pipeline):
 - Preparing the Dataset: Here, we download the dataset and convert it into a form suitable for feature extraction.
 - Loading the Dataset: Here, we load the dataset in Python and extract audio features (such as power, pitch, and vocal tract configuration) from the speech signal. We will use the _librosa_ library to do that.
 - Training the Model: After we prepare and load the dataset, we simply train a suitable sklearn model on it.
 - Testing the Model: Measuring how well our model performs.

### **NOTE on the dataset and file structure**

<font color=red>**BEFORE you begin coding, download the dataset from Canvas and _extract it to your project folder_. I have modified the folder name in the zipped file so that it is easier to parse (see the file structure I used below).**</font>

### Let's import the dependencies

 1. import files

In [5]:
import soundfile # to read audio file
import numpy as np
import matplotlib.pyplot as plt
import librosa # to extract speech features
import glob
import os
import pickle # to save model after training

from sklearn.model_selection import train_test_split # for splitting training and testing
from sklearn.neural_network import MLPClassifier # multi-layer perceptron model
from sklearn.metrics import accuracy_score # to measure how good we are