Speech-Emotion-Recognition-On-MELD-Dataset

This repository is for speech emotion recognition on MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation. The repository contains three Jupyter notebooks:

  • create_labels.ipynb
  • create_MFCC_dictionary(1).ipynb
  • MAIN.ipynb

The first two are for data preprocessing, while MAIN.ipynb is the notebook for training the models.

Usage:

  • Clone the repository onto your system.
  • Create a virtual environment.
  • Install the requirements using requirements.txt.
  • Download the weights file '2DCNN_1DCNN_GRU.pth' and store it in the model weights directory.
  • Run testing_function.py for evaluation.
  • The code takes command line input of the form:

python3 testing_function.py audio_files_directory model_weights_directory

  • The output file will be in the current directory and is named output.txt
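For reference, here is a hypothetical sketch of how a script with this interface might consume its two positional arguments and write output.txt; the actual testing_function.py in the repository may be structured differently.

```python
import os
import sys

def main():
    # Two positional arguments, matching the command shown above.
    audio_dir, weights_dir = sys.argv[1], sys.argv[2]
    weights_path = os.path.join(weights_dir, "2DCNN_1DCNN_GRU.pth")
    # ... load the model from weights_path and run inference on
    # every audio file found in audio_dir ...
    with open("output.txt", "w") as f:
        f.write("<one prediction per audio file>\n")

if __name__ == "__main__":
    main()
```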

Results:

Model                       Val Accuracy (%)
2D CNN                      43.253
1D CNN                      56.024
2D CNN & 1D CNN ensemble    44.337
2D CNN & 1D CNN with GRUs   51.927
Bi-LSTM                     40.0
Bi-GRU                      38.915

Features:

I have used LibROSA to compute the Mel-frequency cepstral coefficients (MFCCs) of each audio file. The MFCCs take the form of 2D NumPy arrays. I have also computed the mean values of the MFCC arrays over time to obtain additional features for the 1D CNNs.
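A minimal sketch of this extraction step using librosa and NumPy; the n_mfcc value and the sample-rate handling are illustrative choices, not values taken from the notebooks:

```python
import librosa
import numpy as np

def extract_features(path, n_mfcc=40):
    # Load the clip; sr=None keeps the file's native sample rate.
    y, sr = librosa.load(path, sr=None)
    # 2D array of shape (n_mfcc, n_frames): input for the 2D models.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Mean of each coefficient over time: a 1D vector for the 1D CNN.
    mfcc_mean = np.mean(mfcc, axis=1)
    return mfcc, mfcc_mean
```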

Dataset:

The dataset is a collection of audio files with over 8000 utterances/phrases/conversations from the TV sitcom "Friends".

Dataset          Disgust   Fear   Happy   Neutral   Sad   Total
Training set         232    216    1609      4592   705    7354
Validation set        28     25     181       517    79     830

Issues:

  • The dataset suffers from severe class imbalance: the 'Neutral' class has over 4.5K data points in the training set, while 'Fear' has only 216. The same imbalance appears in the validation set.
  • The data is noisy, with the actors' voices often drowned out by background laughter. I was unable to separate the laughter from the vocals, which makes classification extremely difficult.
  • The dataset is too small to properly learn more nuanced emotions such as 'Disgust' and 'Fear'.

Compensatory measures to deal with dataset issues:

  • Choosing a smaller training set. This discards much of the available data, but it reduces the imbalance; I capped the training set at a maximum of 1000 data points per class.
  • Using data augmentation to increase dataset size. Since the training set shrank while compensating for class imbalance, it was necessary to enlarge it again through augmentation. I used additive noise and random shifting, as sketched after this list.
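A sketch of both compensatory steps; the function names, noise factor, and shift range are illustrative, not taken from the notebooks:

```python
import numpy as np

MAX_PER_CLASS = 1000

def cap_per_class(samples, labels):
    # Keep at most MAX_PER_CLASS examples of each label.
    kept, counts = [], {}
    for x, y in zip(samples, labels):
        if counts.get(y, 0) < MAX_PER_CLASS:
            kept.append((x, y))
            counts[y] = counts.get(y, 0) + 1
    return kept

def add_noise(signal, noise_factor=0.005):
    # Additive white Gaussian noise.
    return signal + noise_factor * np.random.randn(len(signal))

def random_shift(signal, max_shift=1600):
    # Roll the waveform by a random number of samples.
    return np.roll(signal, np.random.randint(-max_shift, max_shift))
```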

Models Used:

2D CNN:

It is a basic model that takes the MFCCs, performs 2D convolutions on them, flattens the result, and maps it through a linear layer to the number of label categories. Such a structure is commonly used for audio classification tasks. It performs well on the validation set; however, it overwhelmingly predicts the 'Neutral' class.
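A minimal PyTorch sketch of this idea, assuming a fixed MFCC input shape and the five emotion classes shown in the table above; all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class CNN2D(nn.Module):
    def __init__(self, n_classes=5, n_mfcc=40, n_frames=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Two 2x poolings halve each spatial dimension twice.
        self.fc = nn.Linear(32 * (n_mfcc // 4) * (n_frames // 4), n_classes)

    def forward(self, x):  # x: (batch, 1, n_mfcc, n_frames)
        x = self.conv(x)
        return self.fc(x.flatten(1))
```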

1D CNN:

It is an even simpler model than the 2D CNN and relies on the mean values of the MFCC arrays as its features. Despite its simple structure, it performed the best of all the models I tested. However, this might change if more training data is used.
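A sketch of this variant, convolving over the per-coefficient MFCC means (a vector of length n_mfcc); layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class CNN1D(nn.Module):
    def __init__(self, n_classes=5, n_mfcc=40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.fc = nn.Linear(16 * (n_mfcc // 2), n_classes)

    def forward(self, x):  # x: (batch, 1, n_mfcc)
        return self.fc(self.conv(x).flatten(1))
```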

Bi-LSTM:

The Bi-LSTM model was also tested with MFCC features. The model performs poorly due to a lack of sufficient training data.

Bi-GRU:

The Bi-GRU model was also tested with MFCC features. The model performs poorly due to a lack of sufficient training data.
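A sketch covering both recurrent baselines: a bidirectional GRU read over the MFCC frames (swap nn.GRU for nn.LSTM to get the Bi-LSTM). The hidden size and the choice to classify from the last time step are assumptions:

```python
import torch
import torch.nn as nn

class BiGRU(nn.Module):
    def __init__(self, n_classes=5, n_mfcc=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):  # x: (batch, n_frames, n_mfcc)
        out, _ = self.rnn(x)
        return self.fc(out[:, -1])  # classify from the final time step
```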

2D CNN & 1D CNN Ensemble:

The model is a 2D CNN structure with a 1D CNN structure in parallel. It was tested in the hope of improving on the results of the simpler models. Its performance is worse than the 1D CNN's; however, that might change if more data were used for training.

2D CNN & 1D CNN GRU Ensemble:

The model is a 2D CNN structure with a 1D CNN structure in parallel, with each CNN branch followed by a GRU. It was tested in the hope of improving on the results of the simpler models. Its performance is worse than the 1D CNN's; however, that might change if more data were used for training.
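A sketch of this parallel architecture (drop the two GRUs and concatenate the flattened CNN outputs to get the plain ensemble above); all layer sizes and the concatenation scheme are assumptions:

```python
import torch
import torch.nn as nn

class CNNGRUEnsemble(nn.Module):
    def __init__(self, n_classes=5, n_mfcc=40, hidden=64):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.conv1d = nn.Sequential(
            nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2))
        self.gru2d = nn.GRU(16 * (n_mfcc // 2), hidden, batch_first=True)
        self.gru1d = nn.GRU(16, hidden, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, mfcc, mfcc_mean):
        # mfcc: (batch, 1, n_mfcc, n_frames); mfcc_mean: (batch, 1, n_mfcc)
        a = self.conv2d(mfcc)                       # (B, 16, n_mfcc/2, n_frames/2)
        a = a.flatten(1, 2).transpose(1, 2)         # (B, n_frames/2, 16*n_mfcc/2)
        a = self.gru2d(a)[0][:, -1]                 # last GRU state, 2D branch
        b = self.conv1d(mfcc_mean).transpose(1, 2)  # (B, n_mfcc/2, 16)
        b = self.gru1d(b)[0][:, -1]                 # last GRU state, 1D branch
        return self.fc(torch.cat([a, b], dim=1))
```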

Takeaways:

  • Despite having the highest validation accuracy, the 1D CNN fails to recognize emotions such as sadness, fear, and disgust. Its simple structure might be limiting its ability to model these complex emotions.
  • None of the models I tested were able to identify fear. This might be because fear is more complex than what the models can capture from the features extracted from the available dataset.
  • Proper preprocessing was not possible due to the presence of background laughter in the audio clips.
  • The models are extremely sensitive to hyperparameters and can show varying accuracy depending on the test dataset.

Citations:

S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, R. Mihalcea. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation. ACL 2019.

S.Y. Chen, C.C. Hsu, C.C. Kuo, L.W. Ku. EmotionLines: An Emotion Corpus of Multi-Party Conversations. arXiv preprint arXiv:1802.08379, 2018.
