<font color="#483D8B">
<h1  align="center"> Recognizing Emotion in Speech With Neural Networks </h1>
<div align="center">
<font size=3><b>
<br>INET 4061 Project Third Draft
<br>Tony Zeng
<br>December 1, 2019
<br></b></font></div>

# Overview
Human emotions come through in everyday speech. An angry speaker may raise their voice; a sad speaker may fall into halting, abrupt speech patterns. This project uses audio recordings of male and female speakers to predict emotions such as happy, sad, and angry.
<br/>
<br/>

Speech emotion recognition faces several major obstacles:
* Emotions are subjective, and people interpret them differently. The notion of an emotion is hard to define.
* Annotating an audio recording is challenging. Should we label a single word, a sentence, or a whole conversation? How many emotions should we try to recognize?
* Collecting data is complex. Plenty of audio is available from films and news broadcasts, but both sources are biased: news reporting has to sound neutral, and actors' emotions are imitated. Natural recordings without any bias are hard to find.
* Labeling data is costly in time and labor. Unlike drawing a bounding box on an image, it requires trained personnel to listen to the whole recording, analyze it, and annotate it. Because the judgment is subjective, each annotation has to be evaluated by multiple individuals.

Definitions:
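* Mel Frequency Cepstral Coefficients (MFCCs): A compact set of features that describe the short-term spectral envelope of a sound, computed on the Mel scale. The Mel scale relates the perceived frequency, or pitch, of a pure tone to its actual measured frequency.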
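
To make the Mel scale concrete, here is a minimal sketch of the Hz-to-mel conversion. It assumes one commonly used form of the formula, m = 2595 · log10(1 + f / 700); other references use slightly different constants. The function names `hz_to_mel` and `mel_to_hz` are illustrative and not part of this notebook.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (assumed formula: 2595 * log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Print a few conversions to show how the scale compresses high frequencies
for f in (100, 1000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

Under this formula, a 1000 Hz step at low frequencies covers more mels than the same step at high frequencies, which reflects how pitch differences are harder to hear as frequency rises.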

References:
* https://towardsdatascience.com/speech-emotion-recognition-with-convolution-neural-network-1e6bb7130ce3
* http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
* https://github.com/MITESHPUTHRANNEU/Speech-Emotion-Analyzer
* https://github.com/marcogdepinto/Emotion-Classification-Ravdess/blob/master/EmotionsRecognition.ipynb

# Data

The data for this project comes from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The database contains recordings of 24 professional actors (12 female, 12 male) vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
<br/>

#### File naming convention
Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics: 
* Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
* Vocal channel (01 = speech, 02 = song).
* Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
* Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
* Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
* Repetition (01 = 1st repetition, 02 = 2nd repetition).
* Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
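
The naming convention above can be decoded mechanically. The sketch below is one possible way to do it; the function name `parse_ravdess_filename` and the dictionary names are hypothetical helpers introduced for illustration, while the code-to-label mappings are copied from the list above.

```python
from pathlib import Path

MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {"01": "Kids are talking by the door",
             "02": "Dogs are sitting by the door"}


def parse_ravdess_filename(filename):
    """Decode the 7-part identifier of a RAVDESS filename into a dict."""
    parts = Path(filename).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {filename!r}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    actor_id = int(actor)
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": actor_id,
        # Odd-numbered actors are male, even-numbered actors are female
        "gender": "male" if actor_id % 2 == 1 else "female",
    }


info = parse_ravdess_filename("02-01-06-01-02-01-12.mp4")
print(info)
```

Splitting on `-` avoids hard-coded character offsets, so the same helper works whatever the file extension is.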

#### Emotions
The database covers the following emotions:
* Neutral
* Calm
* Happy
* Sad
* Angry
* Fearful
* Disgust (Not in song version of data)
* Surprised (Not in song version of data)

#### Steps to reproduce the data for this notebook
We will be using the audio-only files, specifically the song files. There are 1012 of them: 44 trials per actor × 23 actors = 1012.
1. To get the data, go to : https://zenodo.org/record/1188976#.XcuWi1dKiUl
2. Go down and look for the file named: Audio_Song_Actors_01-24.zip. The size of the file is 225.5 MB.
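3. Download the archive and extract it into a folder named `ravdess` next to this notebook, so that the files sit at paths such as `ravdess/Actor_01/03-02-01-01-01-01-01.wav`.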

In [1]:
#Imports
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from matplotlib.pyplot import specgram
from tensorflow import keras
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Flatten, Dropout, Activation
from tensorflow.keras.layers import Conv1D, MaxPooling1D, AveragePooling1D
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.metrics import confusion_matrix

# Exploratory Data Analysis

In [2]:
#Take a look at one of the audio files
#Use a path relative to the notebook (same layout as the cells below) rather than a machine-specific absolute path
data, sampling_rate = librosa.load('ravdess/Actor_01/03-02-01-01-01-01-01.wav')

In [None]:
%matplotlib inline
import os
import pandas as pd
import glob

plt.figure(figsize=(12, 4))
# Note: librosa 0.10 and later replace waveplot with librosa.display.waveshow
librosa.display.waveplot(data, sr=sampling_rate)

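The plot shows the waveform of the recording: amplitude on the vertical axis against time in seconds on the horizontal axis.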

In [None]:
import matplotlib.pyplot as plt
import scipy.io.wavfile
import numpy as np
import sys

sr,x = scipy.io.wavfile.read('ravdess/Actor_01/03-02-01-01-01-01-01.wav')

## Parameters: 10ms step, 30ms window
nstep = int(sr * 0.01)
nwin  = int(sr * 0.03)
nfft = nwin

window = np.hamming(nwin)

## will take windows x[n1:n2].  generate
## and loop over n2 such that all frames
## fit within the waveform
nn = range(nwin, len(x), nstep)

X = np.zeros( (len(nn), nfft//2) )

for i,n in enumerate(nn):
    xseg = x[n-nwin:n]
    z = np.fft.fft(window * xseg, nfft)
    # The small offset avoids log(0) on frames that are completely silent
    X[i,:] = np.log(np.abs(z[:nfft//2]) + 1e-10)

plt.imshow(X.T, interpolation='nearest',
    origin='lower',
    aspect='auto')

plt.show()

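The image is a log-magnitude spectrogram of the recording. Each column is one 30 ms Hamming-windowed frame, taken every 10 ms, and each row is a frequency bin running from 0 Hz at the bottom up to half the sampling rate at the top.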
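
The framing used in the cell above can be checked on a signal whose spectrum is known in advance. The sketch below applies the same 10 ms step and 30 ms Hamming window to a synthetic 1 kHz tone instead of a RAVDESS file. The 16 kHz sampling rate is an assumption chosen for the example, and `np.fft.rfft` stands in for the explicit `fft` plus slicing, so it keeps one extra bin at the Nyquist frequency.

```python
import numpy as np

sr = 16000                          # assumed sampling rate for the synthetic signal
t = np.arange(sr) / sr              # one second of sample times
x = np.sin(2 * np.pi * 1000 * t)    # pure 1 kHz tone

nstep = int(sr * 0.01)              # 10 ms step  -> 160 samples
nwin = int(sr * 0.03)               # 30 ms window -> 480 samples
window = np.hamming(nwin)

# Frame end positions, chosen so that every frame fits inside the signal
ends = range(nwin, len(x), nstep)
frames = np.stack([window * x[n - nwin:n] for n in ends])

# Magnitude spectrum of each frame; rfft keeps only non-negative frequencies
spectrum = np.abs(np.fft.rfft(frames, axis=1))
log_spec = np.log(spectrum + 1e-10)

# Frequency in Hz of each bin, and the bin with the most energy on average
freqs = np.fft.rfftfreq(nwin, d=1 / sr)
peak_hz = freqs[spectrum.mean(axis=0).argmax()]
print(log_spec.shape, peak_hz)
```

If the framing is right, the strongest bin should sit at the tone's frequency, which gives a quick sanity check before reading the spectrogram of real speech.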

In [None]:
import time

path = 'ravdess/'
feeling_list = []

start_time = time.time()

# Map the RAVDESS emotion code (third filename field) to a label
emotion_names = {'02': 'calm', '03': 'happy', '04': 'sad', '05': 'angry', '06': 'fearful'}

for subdir, dirs, files in os.walk(path):
    for file in files:
        emotion_code = file[6:-16]
        if emotion_code in emotion_names:
            # The actor id is the last field: even = female, odd = male
            gender = 'female' if int(file[18:-4]) % 2 == 0 else 'male'
            feeling_list.append(gender + '_' + emotion_names[emotion_code])
        # Files named by an emotion prefix rather than the RAVDESS scheme
        elif file[:1] == 'a':
            feeling_list.append('male_angry')
        elif file[:1] == 'f':
            feeling_list.append('male_fearful')
        elif file[:1] == 'h':
            feeling_list.append('male_happy')
        elif file[:2] == 'sa':
            feeling_list.append('male_sad')
print("--- Data loaded. Loading time: %s seconds ---" % (time.time() - start_time))

In [None]:
#Verify that the labels were populated in the list
print(feeling_list)

In [None]:
# Name the column so it stays distinguishable from the numbered feature columns
labels = pd.DataFrame(feeling_list, columns=['label'])

In [None]:
labels.head()

In [None]:
df = pd.DataFrame(columns=['feature'])
bookmark=0

path = 'ravdess/'

start_time = time.time()

for subdir, dirs, files in os.walk(path):
    for file in files:
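        try:
            # Skip neutral (01), disgust (07) and surprised (08), plus files prefixed su/n/d
            if file[6:-16] not in ('01', '07', '08') and file[:2]!='su' and file[:1] not in ('n', 'd'):
                # Load 2.5 s of audio starting 0.5 s into the file, resampled to 44.1 kHz
                X, sample_rate = librosa.load(os.path.join(subdir,file), res_type='kaiser_fast',duration=2.5,sr=22050*2,offset=0.5)
                # axis=0 averages over the 13 coefficients, leaving one value per time frame
                mfccs = np.mean(librosa.feature.mfcc(y=X,
                                                    sr=sample_rate,
                                                    n_mfcc=13), axis=0)
                df.loc[bookmark] = [mfccs]
                bookmark=bookmark+1
        except ValueError:
            # Caution: a file skipped here still has a label in feeling_list,
            # so features and labels can fall out of step
            continue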
            
print("--- Data loaded. Loading time: %s seconds ---" % (time.time() - start_time))
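
The length of each feature vector follows from the loading parameters, and it depends on which axis the mean is taken over. The sketch below works this out with NumPy only, using a random matrix as a stand-in for the output of `librosa.feature.mfcc`. The hop length of 512 samples and the centered framing are assumptions based on librosa's defaults; the variable names are illustrative.

```python
import numpy as np

sr = 22050 * 2          # sampling rate passed to librosa.load in the cell above
duration = 2.5          # seconds of audio kept per file
hop_length = 512        # assumption: librosa's default hop length
n_mfcc = 13

n_samples = int(sr * duration)
n_frames = 1 + n_samples // hop_length   # frame count with centered framing

# Stand-in for the (n_mfcc, n_frames) matrix returned by librosa.feature.mfcc
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(n_mfcc, n_frames))

per_frame = mfcc.mean(axis=0)   # one value per time frame (what the cell above stores)
per_coeff = mfcc.mean(axis=1)   # one value per coefficient (a time-averaged alternative)

print(n_frames, per_frame.shape, per_coeff.shape)
```

Averaging with `axis=0` keeps the time structure but discards which coefficient contributed; averaging with `axis=1` does the opposite. Whichever is used, the model's input length has to match the resulting vector length.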

In [None]:
df[:5]

In [None]:
df3 = pd.DataFrame(df['feature'].values.tolist())

In [None]:
df3[:5]

In [None]:
newdf = pd.concat([df3, labels], axis=1)

In [None]:
rnewdf = newdf.rename(index=str, columns={"0":"label"})
rnewdf[:5]

In [None]:
from sklearn.utils import shuffle
rnewdf = shuffle(newdf)
rnewdf[:10]

In [None]:
rnewdf = rnewdf.fillna(0)

### Divide the data into test and train

In [None]:
newdf1 = np.random.rand(len(rnewdf)) < 0.8
train = rnewdf[newdf1]
test = rnewdf[~newdf1]
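
A random mask like the one above gives a split of roughly, not exactly, 80/20, and the split changes on every run. The sketch below shows the same idea with a seeded generator so the split is repeatable. The seed value and the row count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)      # fixed seed so the split is repeatable
n_rows = 1000                        # hypothetical number of rows
mask = rng.random(n_rows) < 0.8      # True -> train, False -> test

train_idx = np.flatnonzero(mask)
test_idx = np.flatnonzero(~mask)
print(len(train_idx), len(test_idx))
```

Because each row is assigned independently, class proportions in the two parts can differ slightly; a stratified split would be one way to control that.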

In [None]:
train[250:260]

In [None]:
trainfeatures = train.iloc[:, :-1]
trainlabel = train.iloc[:, -1:]
testfeatures = test.iloc[:, :-1]
testlabel = test.iloc[:, -1:]

In [None]:
from sklearn.preprocessing import LabelEncoder

X_train = np.array(trainfeatures)
y_train = np.array(trainlabel).ravel()
X_test = np.array(testfeatures)
y_test = np.array(testlabel).ravel()

lb = LabelEncoder()

# Keep integer class ids: the model below is compiled with sparse_categorical_crossentropy
# Fit the encoder on the training labels only, then reuse the same mapping for the test labels
y_train = lb.fit_transform(y_train)
y_test = lb.transform(y_test)
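
Labels can be handed to Keras either as integer class ids or as one-hot rows, and the loss has to match: `sparse_categorical_crossentropy` expects integer ids, while `categorical_crossentropy` expects one-hot rows. The sketch below illustrates both encodings with NumPy only and shows that the cross-entropy value is the same either way. The label strings and the probability values are made up for the example.

```python
import numpy as np

labels = np.array(["male_sad", "female_calm", "male_sad", "female_happy"])

# Integer encoding: classes are sorted, and each label becomes its index
classes, y_int = np.unique(labels, return_inverse=True)

# One-hot encoding of the same class ids
y_onehot = np.eye(len(classes))[y_int]

# Hypothetical softmax outputs for the four samples (each row sums to 1)
probs = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.3, 0.5, 0.2]])

# Cross-entropy computed from integer ids and from one-hot rows
loss_sparse = -np.log(probs[np.arange(len(y_int)), y_int]).mean()
loss_onehot = -(y_onehot * np.log(probs)).sum(axis=1).mean()
print(classes, y_int, loss_sparse, loss_onehot)
```

Integer ids are also the form that `classification_report` and `confusion_matrix` expect, so they need no conversion later on.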

In [None]:
y_train

In [None]:
X_train.shape

# Model

### Building our neural network

In [None]:
x_traincnn = np.expand_dims(X_train, axis=2)
x_testcnn = np.expand_dims(X_test, axis=2)

In [None]:
x_traincnn.shape, x_testcnn.shape

In [None]:
model = Sequential()

# Take the input length from the data so the first layer always matches the feature vectors
model.add(Conv1D(128, 5,padding='same',
                 input_shape=(x_traincnn.shape[1],1)))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(MaxPooling1D(pool_size=(8)))
model.add(Conv1D(128, 5,padding='same',))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Flatten())
# One output unit per class found by the label encoder
model.add(Dense(len(lb.classes_)))
model.add(Activation('softmax'))
opt = keras.optimizers.RMSprop(learning_rate=0.00005, rho=0.9)
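
The layer shapes and parameter counts that `model.summary()` reports can be worked out by hand. The sketch below does the arithmetic in plain Python under two stated assumptions: an input length of 216 (one value per MFCC frame) and 10 output classes (5 emotions × 2 genders). The helper `conv1d_params` is illustrative and not part of Keras.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


input_len = 216      # assumption: one feature per MFCC frame
n_classes = 10       # assumption: 5 emotions x 2 genders

len_after_conv1 = input_len             # padding='same' keeps the length
len_after_pool = len_after_conv1 // 8   # MaxPooling1D(pool_size=8), stride defaults to pool size
len_after_conv2 = len_after_pool        # padding='same' again
flat_size = len_after_conv2 * 128       # Flatten: length x number of filters

params = {
    "conv1": conv1d_params(5, 1, 128),
    "conv2": conv1d_params(5, 128, 128),
    "dense": flat_size * n_classes + n_classes,
}
print(len_after_pool, flat_size, params, sum(params.values()))
```

Activation, Dropout, MaxPooling1D and Flatten layers have no trainable parameters, so only the two convolutions and the dense layer contribute to the total.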

In [None]:
model.summary()

In [None]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

In [None]:

cnnhistory=model.fit(x_traincnn, y_train, batch_size=16, epochs=1000, validation_data=(x_testcnn, y_test))

In [None]:
plt.plot(cnnhistory.history['loss'])
plt.plot(cnnhistory.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
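# tf.keras 2.x records these under 'accuracy' / 'val_accuracy' ('acc' / 'val_acc' in 1.x)
plt.plot(cnnhistory.history['accuracy'])
plt.plot(cnnhistory.history['val_accuracy'])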
plt.title('model accuracy')
plt.ylabel('acc')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
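# predict_classes is not available in recent tf.keras; take the most probable class instead
predictions = np.argmax(model.predict(x_testcnn), axis=1)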

In [None]:
predictions

In [None]:
y_test

In [None]:
new_Ytest = y_test.astype(int)

In [None]:
new_Ytest

In [None]:
from sklearn.metrics import classification_report
report = classification_report(new_Ytest, predictions)
print(report)

In [None]:
# Row/column i corresponds to lb.classes_[i]; LabelEncoder sorts the label strings alphabetically (0 = female_angry, ...)

from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(new_Ytest, predictions)
print (matrix)
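
To read the matrix, recall that scikit-learn puts the true classes on the rows and the predicted classes on the columns. The sketch below rebuilds a small confusion matrix with NumPy and derives per-class recall and precision from it. The labels are made up, and `confusion_counts` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix


y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_counts(y_true, y_pred, 3)
recall = cm.diagonal() / cm.sum(axis=1)      # correct / all samples of that true class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / all samples predicted as that class
print(cm)
print(recall, precision)
```

Off-diagonal cells show which pairs of classes the model confuses, which is often more informative than the overall accuracy.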

### Save the Model

In [None]:
model_name = 'Emotion_Voice_Detection_Model.h5'
save_dir = '/content/drive/My Drive/Ravdess_model'
# Save model and weights
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)
model_path = os.path.join(save_dir, model_name)
model.save(model_path)
print('Saved trained model at %s ' % model_path)

### Reload the Saved Model

In [None]:
loaded_model = keras.models.load_model('/content/drive/My Drive/Ravdess_model/Emotion_Voice_Detection_Model.h5')
loaded_model.summary()

### Predicting Emotions on Test Data

### Checking Accuracy of the Loaded Model

In [None]:
loss, acc = loaded_model.evaluate(x_testcnn, y_test)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))

# Conclusion