# Speech Emotion Recognition

This program aims to recognise human emotion from speech using tone and pitch. It extracts numerical features from each sound file by taking the mean over time of the MFCC, chroma, and mel spectrogram. These features are then used to train an MLP classifier, which makes predictions on a held-out test set.

The program will make use of the package librosa (https://librosa.github.io/librosa/) 

### Dataset

The RAVDESS dataset consists of 7356 files, each rated 10 times on emotional validity, intensity, and genuineness. The ratings were provided by 247 individuals.

In [69]:
#package for audio analysis
import librosa
#soundfile can read and write sound files
import soundfile
# os and glob for locating files on disk; pickle for saving Python objects
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score


### Data Extraction

From the data files we want to extract certain pieces of information. These include:

- mfcc: Mel Frequency Cepstral Coefficients, which represent the short-term power spectrum of a sound

- chroma: the energy in each of the 12 pitch classes

- mel: Mel spectrogram, a spectrogram with its frequencies mapped onto the mel scale


Open the file using a `with` statement so that it is closed automatically when done. Read the file into X and, if chroma is true, compute the short-time Fourier transform (STFT) of X. The STFT splits the sound file into short sections and takes the Fourier transform of each one, giving the frequency content of each section and allowing us to discern how the pitch changes over time.
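
To make the short-time Fourier transform concrete, the sketch below implements a bare-bones version in NumPy rather than calling librosa. The helper name `simple_stft`, the frame length of 2048, the hop of 512, and the synthetic two-tone signal are all assumptions chosen for illustration. librosa's `stft` also pads and centres the frames, so its output will not match this one exactly.

```python
import numpy as np

def simple_stft(signal, frame_length=2048, hop_length=512):
    """Magnitude STFT with one column of frequency magnitudes per frame (hypothetical helper)."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    # Cut the signal into overlapping frames and taper each with the window
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # Fourier transform of each frame, transposed to (frequency bins, frames)
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Synthetic signal (assumption): 1 s of a 440 Hz tone followed by 1 s of an 880 Hz tone
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
signal = np.concatenate([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t)])

spectrum = simple_stft(signal)
freqs = np.fft.rfftfreq(2048, d=1 / sample_rate)

# Strongest frequency in the first and in the last frame
first_peak = freqs[np.argmax(spectrum[:, 0])]
last_peak = freqs[np.argmax(spectrum[:, -1])]
print(spectrum.shape, first_peak, last_peak)
```

The strongest frequency should sit close to 440 Hz in the first frame and close to 880 Hz in the last, which is the change over time that a single Fourier transform of the whole file would not show.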

In [66]:

def extract_feature(file_name, mfcc, chroma, mel):      #mfcc, chroma and mel are flags; all are set to True when called below
    with soundfile.SoundFile(file_name) as sound_file:
        #collect the sound file data and sample rate
        X = sound_file.read(dtype="float32")
        sample_rate=sound_file.samplerate

    result=np.array([])
    if mfcc:
        #mean over time of each of the 40 MFCCs
        mfccs=np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result=np.hstack((result, mfccs))
    if chroma:
        #magnitude of the short-time Fourier transform, used for the chroma features
        stft=np.abs(librosa.stft(X))
        chroma_mean=np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
        result=np.hstack((result, chroma_mean))
    if mel:
        #recent librosa versions require the audio to be passed by keyword (y=X)
        mel_mean=np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)
        result=np.hstack((result, mel_mean))
    return result
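
The time-averaging step in `extract_feature` can be seen in isolation with random matrices standing in for the librosa outputs. The shapes below are assumptions based on the settings in the function and on librosa's defaults: 40 MFCCs (from `n_mfcc=40`), 12 chroma bins, and 128 mel bands, each with one column per frame. The frame count of 100 is arbitrary and `time_average` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 100  # arbitrary number of time frames (assumption)

# Stand-ins for the (n_features, n_frames) matrices returned by librosa
mfcc_matrix = rng.normal(size=(40, n_frames))
chroma_matrix = rng.random(size=(12, n_frames))
mel_matrix = rng.random(size=(128, n_frames))

def time_average(feature_matrix):
    # Transpose to (n_frames, n_features), then average over the frame axis
    return np.mean(feature_matrix.T, axis=0)

# Concatenate the three averaged blocks into one vector per file
feature_vector = np.hstack([
    time_average(mfcc_matrix),
    time_average(chroma_matrix),
    time_average(mel_matrix),
])
print(feature_vector.shape)
```

Averaging removes the frame axis, so clips of different lengths all produce a vector of the same size (40 + 12 + 128 = 180 values under these assumptions), which is what the classifier needs.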


In [67]:
#Emotions in the dataset
emotions={
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}
#Emotions we wish to observe
observed_emotions=['calm', 'happy', 'fearful', 'disgust']

### Load data and assign

In [68]:
def load_data(test_size=0.2):
    x,y=[],[]
    
    for file in glob.glob("./Voice_files/Actor_*/*.wav"):    #uses glob.glob to search through files in area specified
        file_name=os.path.basename(file)     #finds basename of the file
        
        #the emotion code is the third field (x3) of the file name layout x1-x2-x3-x4-x5-x6-x7
        emotion=emotions[file_name.split("-")[2]]
        
        #skip files whose emotion is not in observed_emotions
        if emotion not in observed_emotions:
            continue
            
        #extract useful information from soundfile
        feature=extract_feature(file, mfcc=True, chroma=True, mel=True)
        
        x.append(feature) #store feature of file
        y.append(emotion) #store emotion of file
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)



x_train,x_test,y_train,y_test=load_data(test_size=0.25)
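
The emotion lookup in `load_data` relies on the third hyphen-separated field of each file name. A stand-alone sketch of that step follows. The two file names are made-up examples in the seven-field layout, and `emotion_from_path` is a hypothetical helper that is not part of the original code.

```python
import os

# Same code-to-label mapping as the emotions dictionary
emotions = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}
observed_emotions = ['calm', 'happy', 'fearful', 'disgust']

def emotion_from_path(path):
    """Return the emotion label encoded in the third field of the file name."""
    file_name = os.path.basename(path)
    return emotions[file_name.split("-")[2]]

# Hypothetical paths in the x1-x2-x3-x4-x5-x6-x7 layout
paths = [
    "./Voice_files/Actor_12/03-01-06-01-02-01-12.wav",  # third field is '06'
    "./Voice_files/Actor_12/03-01-05-01-02-01-12.wav",  # third field is '05'
]
labels = [emotion_from_path(p) for p in paths]

# Only labels listed in observed_emotions are kept for training
kept = [label for label in labels if label in observed_emotions]
print(labels, kept)
```

Here the first path maps to `fearful` and is kept, while the second maps to `angry` and is skipped because it is not in `observed_emotions`.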

In [48]:
print((x_train.shape[0], x_test.shape[0]))   #number of samples in training and testing data

(576, 192)


In [50]:
print(f'Features extracted: {x_train.shape[1]}')      #number of features in each sample to compare

Features extracted: 180


### Implementing the Model


In [51]:
model=MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500)

In [52]:
model.fit(x_train,y_train)

MLPClassifier(activation='relu', alpha=0.01, batch_size=256, beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(300,), learning_rate='adaptive',
              learning_rate_init=0.001, max_fun=15000, max_iter=500,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [53]:
y_pred=model.predict(x_test)

In [71]:
accuracy=accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))


Accuracy: 64.06%
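
Overall accuracy can hide differences between emotions. As a possible follow-up, the sketch below computes per-class recall from two label lists using only the standard library. The toy labels are invented for illustration and are not results from this model, and `per_class_recall` is a hypothetical helper.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Toy labels (invented, not output of the model above)
y_true_toy = ['calm', 'calm', 'happy', 'happy', 'fearful', 'disgust']
y_pred_toy = ['calm', 'happy', 'happy', 'happy', 'fearful', 'calm']

recall = per_class_recall(y_true_toy, y_pred_toy)
print(recall)
```

With the real data, the same function could be called as `per_class_recall(y_test, y_pred)` to see which of the four observed emotions the model confuses most often.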
