<a href="https://colab.research.google.com/github/satvik-dixit/speech_emotion_recognition/blob/main/datasets_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparing results on multiple datasets

A demo for comparing results obtained using multiple datasets. For this notebook, we use 6 open-source speech emotion datasets:
- Dataset 1: CaFE (French)
- Dataset 2: EmoDB (German)
- Dataset 3: ShEMO (Persian)
- Dataset 4: RAVDESS (English)
- Dataset 5: CREMA-D (English)
- Dataset 6: SAVEE (British English)

We compare deep-learning-based features (hybrid BYOL-S) with DSP-based features (openSMILE ComParE and openSMILE eGeMAPS), each evaluated with three classifiers: logistic regression, SVM, and random forest.

Let's start by importing a few packages.

### Importing packages

In [None]:
!pip install -q speechbrain
!pip install -q  transformers
!git clone -q https://github.com/GasserElbanna/serab-byols.git
!python3 -m pip install -q -e ./serab-byols

!pip install -q tqdm==4.60.0
!pip install -q opensmile


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.13.1+cu113 requires torch==1.12.1, but you have torch 1.11.0 which is incompatible.
torchtext 0.13.1 requires torch==1.12.1, 

In [None]:
! pip install -q kaggle

from google.colab import files
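files.upload()  # first prompt: upload utilities.py
files.upload()  # second prompt: upload kaggle.json

# Create the Kaggle config directory and install the API token
! mkdir -p ~/.kaggle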
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Saving utilities.py to utilities.py


Saving kaggle.json to kaggle.json


In [None]:
import os
import copy
import numpy as np
from tqdm import tqdm
from glob import glob
from random import sample
from pathlib import Path
import pandas as pd

import librosa
import soundfile as sf

import torch
import opensmile
import serab_byols
from transformers import Wav2Vec2Model, HubertModel

from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV, train_test_split

import warnings
warnings.filterwarnings('ignore')

from utilities import load_audio_files, audio_embeddings_model, audio_embeddings, speaker_normalisation, split_train_test, get_hyperparams


### Defining a function for the pipeline



In [None]:
results = {'EmoDB': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'CaFE': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'ShEMO': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'CREMA-D': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'RAVDESS': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'SAVEE': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0}}
      
logistic_regression_results = copy.deepcopy(results)
support_vector_machine_results = copy.deepcopy(results)
random_forest_classifier_results = copy.deepcopy(results)
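
Once `pipeline` has filled these dictionaries, each one maps dataset to feature set to score, so a cross-dataset comparison only needs to read them back. The sketch below is a minimal illustration that uses the CaFE and EmoDB logistic-regression scores reported later in this notebook as sample values. `best_feature_per_dataset` and `logistic_regression_scores` are hypothetical names, not part of `utilities.py`.

```python
# Hypothetical helper: not part of utilities.py.
def best_feature_per_dataset(results):
    """Return {dataset: (feature_set, score)} for the top-scoring feature set."""
    best = {}
    for dataset, scores in results.items():
        feature_set = max(scores, key=scores.get)
        best[dataset] = (feature_set, scores[feature_set])
    return best


# Sample values copied from the CaFE and EmoDB logistic-regression results.
logistic_regression_scores = {
    'CaFE': {'hybrid_byols': 73.4, 'compare': 68.7, 'egemaps': 61.1},
    'EmoDB': {'hybrid_byols': 86.7, 'compare': 82.8, 'egemaps': 75.6},
}

best = best_feature_per_dataset(logistic_regression_scores)
for dataset, (feature_set, score) in best.items():
    print('{}: {} ({})'.format(dataset, feature_set, score))
```

The same helper could be applied to the SVM and random forest dictionaries, since all three share the same layout.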

In [None]:
# Defining a function for all steps 

def pipeline(audio_list, speakers, labels, model_names, dataset = None, summary_table = None):
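  '''
  Runs feature extraction, speaker normalisation, train/test splitting and
  classifier tuning for each feature set, and records the test-set macro recall

  Parameters
  ------------
  audio_list: list
      A list of torch tensors, one for each audio file
  speakers: list
      The speaker ID of each audio file
  labels: list
      The emotion label of each audio file
  model_names: list
      The feature sets to evaluate, e.g. 'hybrid_byols', 'compare', 'egemaps'
  dataset: string
      The dataset key used in the per-classifier results dictionaries
  summary_table: dict
      The per-dataset table (classifier -> feature set -> score), updated in place

  Returns
  ------------
  None
      The scores are written to summary_table and the results dictionaries

  '''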
  for model_name in model_names:
    print('MODEL: {}'.format(model_name))

    # Embeddings Extraction
    model = audio_embeddings_model(model_name = model_name)
    embeddings_array = audio_embeddings(audio_list, model_name=model_name, model=model)
    print('embeddings_array shape: {}'.format(embeddings_array.shape))

    # Speaker Normalisation
    normalised_embeddings = speaker_normalisation(embeddings_array, speakers)
    print('normalised_embeddings shape: {}'.format(normalised_embeddings.shape))
    columnwise_mean = torch.mean(normalised_embeddings, 0)
    if torch.all(torch.abs(columnwise_mean) < 10**(-6)):  # compare magnitudes so large negative means also fail
      print('PASSED: All means are less than 10**-6')
    else:
      print('FAILED: All means are NOT less than 10**-6')

    # Train Test Splitting
    X_train, X_test, y_train, y_test = split_train_test(normalised_embeddings, labels, speakers, test_size = 0.30)
    print('X_train shape: {}'.format(X_train.shape))
    print('X_test shape: {}'.format(X_test.shape))
    print('y_train len: {}'.format(len(y_train)))
    print('y_test len: {}'.format(len(y_test)))
    print()

    # Getting hyperparameters and checking max_recall
    print('Logistic Regression:')
    classifier = LogisticRegression()
    parameters = {'penalty' : ['l2'], 'C': np.logspace(-3,1,3), 'solver': ['lbfgs', 'sag']}  # lbfgs and sag do not support the l1 penalty
    summary_table['Logistic Regression'][model_name] = logistic_regression_results[dataset][model_name] = np.round(100*get_hyperparams(X_train, X_test, y_train, y_test, classifier, parameters),1)
    print('Support Vector Machine:')
    classifier = SVC()
    parameters = {'C': np.logspace(-2,4,4), 'gamma': np.logspace(-5,-3,5), 'kernel':['rbf']}
    summary_table['Support Vector Machine'][model_name] = support_vector_machine_results[dataset][model_name] = np.round(100*get_hyperparams(X_train, X_test, y_train, y_test, classifier, parameters),1)
    print('Random Forest Classifier:')
    classifier = RandomForestClassifier()
    parameters = {'n_estimators' : [50, 100, 200], 'max_features' : ['auto', 'log2', 'sqrt'], 'bootstrap' : [False]}
    summary_table['Random Forest Classification'][model_name] = random_forest_classifier_results[dataset][model_name] = np.round(100*get_hyperparams(X_train, X_test, y_train, y_test, classifier, parameters),1)
    print()
    print()
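
The `PASSED` check in `pipeline` expects every column mean of the normalised embeddings to be close to zero. One normalisation with that property is a per-speaker z-score, where each feature is standardised using only the rows of the same speaker. The sketch below illustrates the idea with NumPy. It is an assumption about what `speaker_normalisation` in `utilities.py` does, and `normalise_per_speaker` and `eps` are hypothetical names.

```python
import numpy as np


# Hypothetical re-implementation; utilities.speaker_normalisation may differ.
def normalise_per_speaker(embeddings, speakers, eps=1e-8):
    """Z-score each feature column separately within each speaker's rows."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    speakers = np.asarray(speakers)
    normalised = np.empty_like(embeddings)
    for speaker in np.unique(speakers):
        rows = speakers == speaker
        mean = embeddings[rows].mean(axis=0)
        std = embeddings[rows].std(axis=0)
        # eps guards against division by zero for constant features
        normalised[rows] = (embeddings[rows] - mean) / (std + eps)
    return normalised


# Toy data: 2 speakers, 3 utterances each, 4 features.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(loc=5.0, scale=2.0, size=(6, 4))
toy_speakers = ['01', '01', '01', '02', '02', '02']

toy_normalised = normalise_per_speaker(toy_embeddings, toy_speakers)
# Each speaker's rows are centred, so the overall column means are near zero too.
print(np.abs(toy_normalised.mean(axis=0)).max() < 1e-6)
```

Centring within each speaker removes speaker-specific offsets from the features, which is why the overall column means also end up near zero.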


# Dataset: Canadian French Emotion (CaFE)


### Details:
- French (Canadian)
- 936 recordings
- 12 speakers (6 females + 6 males)
- 7 emotions: happy, sad, angry, fearful, surprise, disgust and neutral

### References:
- Dataset: https://zenodo.org/record/1478765#.YvyXfexBy3I
- Paper: https://www.researchgate.net/publication/326022359_A_canadian_french_emotional_speech_dataset



### Loading CaFE dataset and extracting metadata

In [None]:
# Phase_1
# Load dataset
! wget -q https://zenodo.org/record/1478765/files/CaFE_48k.zip?download=1
! unzip -q CaFE_48k.zip?download=1 -d /content/cafe

# Select all the audio files
audios = []
for file in Path('/content/cafe').glob("**/*.wav"):
    if not file.is_file(): 
        continue
    audios.append(str(file))

# Load and resample audio files
audio_list = load_audio_files(audios, resampling_frequency=16000)

# Making speakers list and labels list 
speakers = []
labels = []
for audio_file in audios:
  file_name = audio_file.split('/')[-1]
  speakers.append(file_name.split('-')[0])
  labels.append(file_name.split('-')[1])


# Verify phase_1
print('Number of audio files: {}'.format(len(audio_list)))
print('Number of speaker classes: {}'.format(len(set(speakers))))
print('Speaker classes: {}'.format(set(speakers)))
print('Number of speakers: {}'.format(len(speakers)))
print('Number of label classes: {}'.format(len(set(labels))))
print('Label classes: {}'.format(set(labels)))
print('Number of labels: {}'.format(len(labels)))


Number of audio files: 936
Number of speaker classes: 12
Speaker classes: {'03', '04', '09', '02', '12', '08', '01', '07', '11', '06', '10', '05'}
Number of speakers: 936
Number of label classes: 7
Label classes: {'J', 'T', 'P', 'C', 'S', 'D', 'N'}
Number of labels: 936


### Metadata:
Speakers: (12 speakers) 
- 01 to 12

Labels: (7 labels)

- P: fearful
- C: angry
- N: neutral
- D: disgust
- J: happy
- S: surprise
- T: sad
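
The loading cell above takes the first two dash-separated fields of each file name as the speaker and the label code. The sketch below shows the same parsing with the codes mapped to the emotion names listed here. `parse_cafe_name` and `CAFE_EMOTIONS` are hypothetical names, and the example path is made up to show the field layout only.

```python
from pathlib import Path

# Code-to-emotion mapping taken from the metadata list above.
CAFE_EMOTIONS = {
    'P': 'fearful', 'C': 'angry', 'N': 'neutral', 'D': 'disgust',
    'J': 'happy', 'S': 'surprise', 'T': 'sad',
}


def parse_cafe_name(path):
    """Return (speaker, emotion) from a '<speaker>-<code>-...' file name."""
    fields = Path(path).name.split('-')
    return fields[0], CAFE_EMOTIONS[fields[1]]


# Hypothetical path, used only to show the field layout.
example = parse_cafe_name('/content/cafe/example/01-C-2-1.wav')
print(example)
```

Using `Path(...).name` instead of a fixed index into the split path keeps the parsing independent of how deep the files sit in the directory tree.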

### Getting max_recall of all the models on CaFE

In [None]:
summary_table = {'Logistic Regression': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Support Vector Machine': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Random Forest Classification': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0}}
        

In [None]:
model_names = ['hybrid_byols', 'compare', 'egemaps']
pipeline(audio_list, speakers, labels, model_names, 'CaFE', summary_table)


MODEL: hybrid_byols


Generating Embeddings...: 100%|██████████| 936/936 [01:12<00:00, 12.89it/s]


embeddings_array shape: torch.Size([936, 2048])
normalised_embeddings shape: torch.Size([936, 2048])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([702, 2048])
X_test shape: torch.Size([234, 2048])
y_train len: 702
y_test len: 234

Logistic Regression:
recall_macro : 0.727099567099567
Best Parameters: {'C': 10.0, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.7341269841269841
Support Vector Machine:
recall_macro : 0.7105009276437848
Best Parameters: {'C': 100.0, 'gamma': 0.00031622776601683794, 'kernel': 'rbf'}
recall_macro on test_set: 0.75
Random Forest Classifier:
recall_macro : 0.6494248608534321
Best Parameters: {'bootstrap': False, 'max_features': 'auto', 'n_estimators': 200}
recall_macro on test_set: 0.6746031746031745


MODEL: compare


100%|██████████| 936/936 [02:08<00:00,  7.26it/s]


embeddings_array shape: torch.Size([936, 6373])
normalised_embeddings shape: torch.Size([936, 6373])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([702, 6373])
X_test shape: torch.Size([234, 6373])
y_train len: 702
y_test len: 234

Logistic Regression:
recall_macro : 0.6374644403215832
Best Parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'sag'}
recall_macro on test_set: 0.6865079365079365
Support Vector Machine:
recall_macro : 0.6320964749536178
Best Parameters: {'C': 100.0, 'gamma': 1e-05, 'kernel': 'rbf'}
recall_macro on test_set: 0.6785714285714286
Random Forest Classifier:
recall_macro : 0.5961162646876932
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.6230158730158729


MODEL: egemaps


100%|██████████| 936/936 [02:31<00:00,  6.19it/s]


embeddings_array shape: torch.Size([936, 88])
normalised_embeddings shape: torch.Size([936, 88])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([702, 88])
X_test shape: torch.Size([234, 88])
y_train len: 702
y_test len: 234

Logistic Regression:
recall_macro : 0.5955720470006185
Best Parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.611111111111111
Support Vector Machine:
recall_macro : 0.5840940012368584
Best Parameters: {'C': 100.0, 'gamma': 0.0001, 'kernel': 'rbf'}
recall_macro on test_set: 0.5674603174603174
Random Forest Classifier:
recall_macro : 0.6008781694495979
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.5714285714285714
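
`recall_macro` is scikit-learn's name for macro-averaged recall: recall is computed for each class separately and the per-class values are averaged without weighting by class size. The sketch below computes it by hand on toy labels to show how it differs from accuracy when classes are imbalanced. `macro_recall` is a hypothetical helper, and the labels are invented for illustration.

```python
from collections import Counter


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)


# Toy labels (CaFE codes): class 'C' is three times as frequent as 'T' and 'N'.
y_true = ['C'] * 6 + ['T'] * 2 + ['N'] * 2
y_pred = ['C'] * 6 + ['C', 'T'] + ['C', 'N']

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print('accuracy     : {:.3f}'.format(accuracy))
print('recall_macro : {:.3f}'.format(macro_recall(y_true, y_pred)))
```

Here the majority class is always right, so accuracy is 0.8, while the two minority classes are right only half the time and pull the macro recall down to about 0.667.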




### CaFE Results

In [None]:
df = pd.DataFrame(summary_table)
df


| Feature set | Logistic Regression | Support Vector Machine | Random Forest Classification |
|---|---|---|---|
| hybrid_byols | 73.4 | 75.0 | 67.5 |
| compare | 68.7 | 67.9 | 62.3 |
| egemaps | 61.1 | 56.7 | 57.1 |


# Dataset: Persian Speech Emotion Detection Dataset (ShEMO)

### Details:
- Persian
- 3000 semi-natural utterances
- 87 speakers
- 6 emotions: anger, fear, happiness, sadness, neutral and surprise

### References:
- Dataset: https://github.com/mansourehk/ShEMO
- Paper: https://link.springer.com/article/10.1007/s10579-018-9427-x



### Loading ShEMO dataset and extracting metadata

In [None]:
# Phase_1
# Load dataset
! kaggle datasets download -q -d mansourehk/shemo-persian-speech-emotion-detection-database
! unzip -q shemo-persian-speech-emotion-detection-database.zip -d shemo;

# Select all the audio files
audios = []
for file in Path('/content/shemo').glob("**/*.wav"):
    if not file.is_file(): 
        continue
    audios.append(str(file))

# Load and resample audio files
audio_list = load_audio_files(audios, resampling_frequency=16000)

# Making speakers list and labels list 
speakers = []
labels = []
for audio_file in audios:
  file_name = audio_file.split('/')[4]
  speakers.append(file_name[4:6])
  labels.append(file_name[3])


# Verify phase_1
print('Number of audio files: {}'.format(len(audio_list)))
print('Number of speaker classes: {}'.format(len(set(speakers))))
print('Speaker classes: {}'.format(set(speakers)))
print('Number of speakers: {}'.format(len(speakers)))
print('Number of label classes: {}'.format(len(set(labels))))
print('Label classes: {}'.format(set(labels)))
print('Number of labels: {}'.format(len(labels)))

Number of audio files: 3000
Number of speaker classes: 87
Speaker classes: {'85', '08', '83', '39', '79', '70', '28', '17', '51', '71', '35', '48', '33', '86', '15', '52', '19', '76', '63', '77', '16', '31', '36', '69', '22', '09', '55', '47', '43', '67', '23', '03', '56', '21', '38', '06', '11', '18', '29', '50', '59', '72', '13', '42', '20', '81', '75', '10', '73', '58', '30', '14', '37', '60', '04', '24', '25', '05', '12', '44', '80', '45', '57', '62', '46', '78', '27', '68', '53', '65', '26', '84', '49', '02', '61', '41', '01', '07', '87', '34', '64', '74', '54', '66', '82', '40', '32'}
Number of speakers: 3000
Number of label classes: 6
Label classes: {'F', 'W', 'S', 'N', 'A', 'H'}
Number of labels: 3000



### Metadata:
Speakers: (87 speakers) 
- 01 to 87 

Labels: (6 labels)
- A: anger 
- F: fear
- H: happiness
- N: neutral
- S: sadness
- W: surprise

### Getting max_recall of all the models on ShEMO





In [None]:
summary_table = {'Logistic Regression': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Support Vector Machine': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Random Forest Classification': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0}}
        

In [None]:
model_names = ['hybrid_byols', 'compare', 'egemaps']
pipeline(audio_list, speakers, labels, model_names, 'ShEMO', summary_table)


MODEL: hybrid_byols


Generating Embeddings...: 100%|██████████| 3000/3000 [04:01<00:00, 12.41it/s]


embeddings_array shape: torch.Size([3000, 2048])
normalised_embeddings shape: torch.Size([3000, 2048])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([1975, 2048])
X_test shape: torch.Size([1025, 2048])
y_train len: 1975
y_test len: 1025

Logistic Regression:
recall_macro : 0.5923591517600233
Best Parameters: {'C': 10.0, 'penalty': 'l2', 'solver': 'sag'}
recall_macro on test_set: 0.5591476916343564
Support Vector Machine:
recall_macro : 0.5854103433373448
Best Parameters: {'C': 100.0, 'gamma': 1e-05, 'kernel': 'rbf'}
recall_macro on test_set: 0.5474865785406321
Random Forest Classifier:
recall_macro : 0.4478009617998346
Best Parameters: {'bootstrap': False, 'max_features': 'auto', 'n_estimators': 200}
recall_macro on test_set: 0.42086211193159045


MODEL: compare


100%|██████████| 3000/3000 [06:52<00:00,  7.28it/s]


embeddings_array shape: torch.Size([3000, 6373])
normalised_embeddings shape: torch.Size([3000, 6373])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([1942, 6373])
X_test shape: torch.Size([1058, 6373])
y_train len: 1942
y_test len: 1058

Logistic Regression:
recall_macro : 0.5659432304836564
Best Parameters: {'C': 10.0, 'penalty': 'l2', 'solver': 'sag'}
recall_macro on test_set: 0.5200994465037481
Support Vector Machine:
recall_macro : 0.5295765668136451
Best Parameters: {'C': 100.0, 'gamma': 3.1622776601683795e-05, 'kernel': 'rbf'}
recall_macro on test_set: 0.4922373479821831
Random Forest Classifier:
recall_macro : 0.42344982038646817
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.40061124914876095


MODEL: egemaps


100%|██████████| 3000/3000 [08:03<00:00,  6.20it/s]


embeddings_array shape: torch.Size([3000, 88])
normalised_embeddings shape: torch.Size([3000, 88])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([1967, 88])
X_test shape: torch.Size([1033, 88])
y_train len: 1967
y_test len: 1033

Logistic Regression:
recall_macro : 0.4613138812155073
Best Parameters: {'C': 10.0, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.5022154836465113
Support Vector Machine:
recall_macro : 0.4474338675953378
Best Parameters: {'C': 100.0, 'gamma': 0.001, 'kernel': 'rbf'}
recall_macro on test_set: 0.49504624588942175
Random Forest Classifier:
recall_macro : 0.41735865007082057
Best Parameters: {'bootstrap': False, 'max_features': 'auto', 'n_estimators': 200}
recall_macro on test_set: 0.45268220280022553




### ShEMO Results

In [None]:
df = pd.DataFrame(summary_table)
df


| Feature set | Logistic Regression | Support Vector Machine | Random Forest Classification |
|---|---|---|---|
| hybrid_byols | 55.9 | 54.7 | 42.1 |
| compare | 52.0 | 49.2 | 40.1 |
| egemaps | 50.2 | 49.5 | 45.3 |


# Dataset: EmoDB 

### Details:
- German
- 535 utterances in the released corpus (selected from about 800 recordings)
- 10 actors (5 males and 5 females)
- 7 emotions: anger, neutral, fear, boredom, happiness, sadness, disgust

### References:
- Dataset: http://emodb.bilderbar.info/index-1280.html
- Paper: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.8506&rep=rep1&type=pdf




### Loading EmoDB dataset and extracting metadata

In [None]:
# Phase_1
# Load dataset
! kaggle datasets download -q -d piyushagni5/berlin-database-of-emotional-speech-emodb
! unzip -q berlin-database-of-emotional-speech-emodb.zip

# Load and resample audio files
audio_files = glob(os.path.join('/content/wav','*.wav'))
audio_list= load_audio_files(audio_files, resampling_frequency=16000)

# Making speakers list and labels list 
speakers = []
labels = []
for audio_file in audio_files:
  file_name = audio_file.split('/')[3]
  speakers.append(int(file_name[:2]))
  labels.append(file_name[5:6])


# Verify phase_1
print('Number of audio files: {}'.format(len(audio_list)))
print('Number of speaker classes: {}'.format(len(set(speakers))))
print('Speaker classes: {}'.format(set(speakers)))
print('Number of speakers: {}'.format(len(speakers)))
print('Number of label classes: {}'.format(len(set(labels))))
print('Label classes: {}'.format(set(labels)))
print('Number of labels: {}'.format(len(labels)))

Number of audio files: 535
Number of speaker classes: 10
Speaker classes: {3, 8, 9, 10, 11, 12, 13, 14, 15, 16}
Number of speakers: 535
Number of label classes: 7
Label classes: {'F', 'T', 'W', 'E', 'L', 'N', 'A'}
Number of labels: 535


### Metadata:
Speakers: (10 speakers) 
- 03 - male, 31 years
- 08 - female, 34 years
- 09 - female, 21 years
- 10 - male, 32 years
- 11 - male, 26 years
- 12 - male, 30 years
- 13 - female, 32 years
- 14 - female, 35 years
- 15 - male, 25 years
- 16 - female, 31 years

Labels: (7 labels)
- T: sadness 
- F: happiness
- E: disgust
- W: anger
- A: fear
- L: boredom
- N: neutral

### Getting max_recall of all the models on EmoDB

In [None]:
summary_table = {'Logistic Regression': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Support Vector Machine': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Random Forest Classification': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0}}
        

In [None]:
model_names = ['hybrid_byols', 'compare', 'egemaps']
pipeline(audio_list, speakers, labels, model_names, 'EmoDB', summary_table)


MODEL: hybrid_byols


Generating Embeddings...: 100%|██████████| 535/535 [00:27<00:00, 19.34it/s]


embeddings_array shape: torch.Size([535, 2048])
normalised_embeddings shape: torch.Size([535, 2048])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([371, 2048])
X_test shape: torch.Size([164, 2048])
y_train len: 371
y_test len: 164

Logistic Regression:
recall_macro : 0.8920336555674903
Best Parameters: {'C': 10.0, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.8673666568076506
Support Vector Machine:
recall_macro : 0.9039384174722521
Best Parameters: {'C': 100.0, 'gamma': 0.0001, 'kernel': 'rbf'}
recall_macro on test_set: 0.8485024154589372
Random Forest Classifier:
recall_macro : 0.8046640410550185
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.7469683525584146


MODEL: compare


100%|██████████| 535/535 [00:58<00:00,  9.08it/s]


embeddings_array shape: torch.Size([535, 6373])
normalised_embeddings shape: torch.Size([535, 6373])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([351, 6373])
X_test shape: torch.Size([184, 6373])
y_train len: 351
y_test len: 184

Logistic Regression:
recall_macro : 0.8468769325912182
Best Parameters: {'C': 10.0, 'penalty': 'l2', 'solver': 'sag'}
recall_macro on test_set: 0.8284103381929467
Support Vector Machine:
recall_macro : 0.8472758194186767
Best Parameters: {'C': 100.0, 'gamma': 3.1622776601683795e-05, 'kernel': 'rbf'}
recall_macro on test_set: 0.858094441790094
Random Forest Classifier:
recall_macro : 0.8141104926819211
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.7439184185379837


MODEL: egemaps


100%|██████████| 535/535 [01:04<00:00,  8.35it/s]


embeddings_array shape: torch.Size([535, 88])
normalised_embeddings shape: torch.Size([535, 88])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([360, 88])
X_test shape: torch.Size([175, 88])
y_train len: 360
y_test len: 175

Logistic Regression:
recall_macro : 0.7751672184025125
Best Parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'sag'}
recall_macro on test_set: 0.756169519461445
Support Vector Machine:
recall_macro : 0.7708522196757491
Best Parameters: {'C': 100.0, 'gamma': 0.00031622776601683794, 'kernel': 'rbf'}
recall_macro on test_set: 0.7624359100756616
Random Forest Classifier:
recall_macro : 0.7486227824463119
Best Parameters: {'bootstrap': False, 'max_features': 'log2', 'n_estimators': 200}
recall_macro on test_set: 0.6993356287145108




### EmoDB Results

In [None]:
df = pd.DataFrame(summary_table)
df


| Feature set | Logistic Regression | Support Vector Machine | Random Forest Classification |
|---|---|---|---|
| hybrid_byols | 86.7 | 84.9 | 74.7 |
| compare | 82.8 | 85.8 | 74.4 |
| egemaps | 75.6 | 76.2 | 69.9 |


# Dataset: Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

### Details:
- English
- 7356 recordings in the full database; the speech-only audio subset used here has 1440 files
- 24 actors (12 female, 12 male)
- 8 emotions: neutral, calm, happy, sad, angry, fearful, surprise, and disgust

### References:
- Dataset: https://zenodo.org/record/1188976#.YvyPHexBy3K
- Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0196391



### Loading RAVDESS dataset and extracting metadata

In [None]:
# Phase_1
# Load dataset
! kaggle datasets download -q -d uwrfkaggler/ravdess-emotional-speech-audio
! unzip -q ravdess-emotional-speech-audio.zip -d '/content/ravdess'

# Select all the audio files
audios = []
for file in Path('/content/ravdess/audio_speech_actors_01-24').glob("**/*.wav"):
    if not file.is_file(): 
        continue
    audios.append(str(file))

# Load and resample audio files
audio_list = load_audio_files(audios, resampling_frequency=16000)

# Making speakers list and labels list 
speakers = []
labels = []
for audio_file in audios:
  file_name = audio_file.split('/')[5]
  speakers.append(file_name[18:20])
  labels.append(file_name[6:8])


# Verify phase_1
print('Number of audio files: {}'.format(len(audio_list)))
print('Number of speaker classes: {}'.format(len(set(speakers))))
print('Speaker classes: {}'.format(set(speakers)))
print('Number of speakers: {}'.format(len(speakers)))
print('Number of label classes: {}'.format(len(set(labels))))
print('Label classes: {}'.format(set(labels)))
print('Number of labels: {}'.format(len(labels)))

Number of audio files: 1440
Number of speaker classes: 24
Speaker classes: {'16', '08', '22', '09', '02', '17', '01', '07', '14', '23', '03', '04', '24', '15', '21', '11', '06', '18', '05', '12', '19', '13', '20', '10'}
Number of speakers: 1440
Number of label classes: 8
Label classes: {'03', '04', '02', '08', '01', '07', '06', '05'}
Number of labels: 1440


### Metadata:
Speakers: (24 speakers) 
- Odd numbered actors are male 
- Even numbered actors are female

Labels: (8 labels)
- 01 = neutral
- 02 = calm
- 03 = happy
- 04 = sad
- 05 = angry
- 06 = fearful
- 07 = disgust
- 08 = surprised
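
RAVDESS file names consist of seven dash-separated two-digit fields, of which the third is the emotion code and the seventh is the actor. The loading cell above reads them by character position. The sketch below does the same by splitting on dashes, and also derives gender from the odd/even rule listed here. `parse_ravdess_name` and `RAVDESS_EMOTIONS` are hypothetical names, and the example path is illustrative.

```python
from pathlib import Path

# Code-to-emotion mapping taken from the metadata list above.
RAVDESS_EMOTIONS = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}


def parse_ravdess_name(path):
    """Return (actor, gender, emotion) from a RAVDESS file name."""
    fields = Path(path).stem.split('-')
    emotion_code, actor = fields[2], fields[6]
    # Odd-numbered actors are male, even-numbered actors are female.
    gender = 'male' if int(actor) % 2 else 'female'
    return actor, gender, RAVDESS_EMOTIONS[emotion_code]


# Illustrative path following the directory layout used in the loading cell.
example = parse_ravdess_name(
    '/content/ravdess/audio_speech_actors_01-24/Actor_12/03-01-06-01-02-01-12.wav')
print(example)
```

Splitting on the separator yields the same actor and emotion fields as the slices `file_name[18:20]` and `file_name[6:8]`, without depending on exact character offsets.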

### Getting max_recall of all the models on RAVDESS

In [None]:
summary_table = {'Logistic Regression': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Support Vector Machine': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Random Forest Classification': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0}}
        

In [None]:
model_names = ['hybrid_byols', 'compare', 'egemaps']
pipeline(audio_list, speakers, labels, model_names, 'RAVDESS', summary_table)


MODEL: hybrid_byols


Generating Embeddings...: 100%|██████████| 1440/1440 [01:36<00:00, 14.89it/s]


embeddings_array shape: torch.Size([1440, 2048])
normalised_embeddings shape: torch.Size([1440, 2048])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([1020, 2048])
X_test shape: torch.Size([420, 2048])
y_train len: 1020
y_test len: 420

Logistic Regression:
recall_macro : 0.7641458078958079
Best Parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.7901785714285714
Support Vector Machine:
recall_macro : 0.7702253764753765
Best Parameters: {'C': 100.0, 'gamma': 3.1622776601683795e-05, 'kernel': 'rbf'}
recall_macro on test_set: 0.7991071428571428
Random Forest Classifier:
recall_macro : 0.6453220390720391
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.7098214285714286


MODEL: compare


100%|██████████| 1440/1440 [03:01<00:00,  7.92it/s]


embeddings_array shape: torch.Size([1440, 6373])
normalised_embeddings shape: torch.Size([1440, 6373])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([1020, 6373])
X_test shape: torch.Size([420, 6373])
y_train len: 1020
y_test len: 420

Logistic Regression:
recall_macro : 0.6615181115181115
Best Parameters: {'C': 0.001, 'penalty': 'l2', 'solver': 'sag'}
recall_macro on test_set: 0.7008928571428571
Support Vector Machine:
recall_macro : 0.643574481074481
Best Parameters: {'C': 100.0, 'gamma': 1e-05, 'kernel': 'rbf'}
recall_macro on test_set: 0.6875
Random Forest Classifier:
recall_macro : 0.5842236467236468
Best Parameters: {'bootstrap': False, 'max_features': 'auto', 'n_estimators': 200}
recall_macro on test_set: 0.5870535714285714


MODEL: egemaps


100%|██████████| 1440/1440 [03:21<00:00,  7.15it/s]


embeddings_array shape: torch.Size([1440, 88])
normalised_embeddings shape: torch.Size([1440, 88])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([1020, 88])
X_test shape: torch.Size([420, 88])
y_train len: 1020
y_test len: 420

Logistic Regression:
recall_macro : 0.5803291615791617
Best Parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.6316964285714285
Support Vector Machine:
recall_macro : 0.5926078551078551
Best Parameters: {'C': 100.0, 'gamma': 0.00031622776601683794, 'kernel': 'rbf'}
recall_macro on test_set: 0.6316964285714286
Random Forest Classifier:
recall_macro : 0.5597018722018722
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.6026785714285714




### RAVDESS Results

In [None]:
df = pd.DataFrame(summary_table)
df


| Feature set | Logistic Regression | Support Vector Machine | Random Forest Classification |
|---|---|---|---|
| hybrid_byols | 79.0 | 79.9 | 71.0 |
| compare | 70.1 | 68.8 | 58.7 |
| egemaps | 63.2 | 63.2 | 60.3 |


# Dataset: Crowd Sourced Emotional Multimodal Actors Dataset (CREMA-D)

### Details:
- English
- 7442 clips
- 91 actors (48 males and 43 females)
- 6 emotions: angry, disgusted, fearful, happy, neutral, and sad

### References:
- Dataset: https://github.com/CheyneyComputerScience/CREMA-D
- Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4313618/




### Loading CREMA-D dataset and extracting metadata

In [None]:
# Phase_1
# Load dataset
! kaggle datasets download -q -d ejlok1/cremad
! unzip -q cremad.zip

# Load and resample audio files
audio_files = glob(os.path.join('/content/AudioWAV','*.wav'))
audio_list = load_audio_files(audio_files, resampling_frequency=16000)

# Make speakers list and labels list 
speakers = []
labels = []
for audio_file in audio_files:
  file_name = audio_file.split('/')[3]
  speakers.append(int(file_name[:4]))
  labels.append(file_name[9:12])


# Verify phase_1
print('Number of audio files: {}'.format(len(audio_list)))
print('Number of speaker classes: {}'.format(len(set(speakers))))
print('Speaker classes: {}'.format(set(speakers)))
print('Number of speakers: {}'.format(len(speakers)))
print('Number of label classes: {}'.format(len(set(labels))))
print('Label classes: {}'.format(set(labels)))
print('Number of labels: {}'.format(len(labels)))

Number of audio files: 7442
Number of speaker classes: 91
Speaker classes: {1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023}
Number of speakers: 7442
Number of label classes: 6
Label classes: {'DIS', 'HAP', 'SAD', 'ANG', 'NEU', 'FEA'}
Number of labels: 7442


### Metadata:
Speakers: (91 speakers) 
- 1001 - 1091

Labels: (6 labels)
- HAP: happy
- ANG: angry
- SAD: sad
- NEU: neutral
- FEA: fearful
- DIS: disgusted

### Getting max_recall of all the models on CREMA-D

In [None]:
summary_table = {'Logistic Regression': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Support Vector Machine': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Random Forest Classification': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0}}
        

In [None]:
model_names = ['hybrid_byols']
pipeline(audio_list, speakers, labels, model_names, 'CREMA-D', summary_table)


MODEL: hybrid_byols


Generating Embeddings...: 100%|██████████| 7442/7442 [05:51<00:00, 21.19it/s]


embeddings_array shape: torch.Size([7442, 2048])
normalised_embeddings shape: torch.Size([7442, 2048])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([5240, 2048])
X_test shape: torch.Size([2202, 2048])
y_train len: 5240
y_test len: 2202

Logistic Regression:
recall_macro : 0.7386984092209199
Best Parameters: {'C': 0.001, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.7387394828421656
Support Vector Machine:
recall_macro : 0.7223840995119339
Best Parameters: {'C': 100.0, 'gamma': 1e-05, 'kernel': 'rbf'}
recall_macro on test_set: 0.7212237346372405
Random Forest Classifier:
recall_macro : 0.636578181862441
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.6407647240209683




In [None]:
model_names = ['egemaps']
pipeline(audio_list, speakers, labels, model_names, 'CREMA-D', summary_table)


MODEL: egemaps


100%|██████████| 7442/7442 [12:25<00:00,  9.99it/s]


embeddings_array shape: torch.Size([7442, 88])
normalised_embeddings shape: torch.Size([7442, 88])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([5228, 88])
X_test shape: torch.Size([2214, 88])
y_train len: 5228
y_test len: 2214

Logistic Regression:
recall_macro : 0.6005416807731594
Best Parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.6100823045267489
Support Vector Machine:
recall_macro : 0.6106499692433653
Best Parameters: {'C': 100.0, 'gamma': 0.001, 'kernel': 'rbf'}
recall_macro on test_set: 0.6300705467372133
Random Forest Classifier:
recall_macro : 0.5857512573519398
Best Parameters: {'bootstrap': False, 'max_features': 'log2', 'n_estimators': 200}
recall_macro on test_set: 0.5861258083480305




In [None]:
model_names = ['compare']
pipeline(audio_list, speakers, labels, model_names, 'CREMA-D', summary_table)


MODEL: compare


100%|██████████| 7442/7442 [12:06<00:00, 10.24it/s]


embeddings_array shape: torch.Size([7442, 6373])
normalised_embeddings shape: torch.Size([7442, 6373])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([5229, 6373])
X_test shape: torch.Size([2213, 6373])
y_train len: 5229
y_test len: 2213

Logistic Regression:
recall_macro : 0.6831417705091309
Best Parameters: {'C': 0.001, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.7049677570833403
Support Vector Machine:
recall_macro : 0.6536295909494628
Best Parameters: {'C': 100.0, 'gamma': 1e-05, 'kernel': 'rbf'}
recall_macro on test_set: 0.6837846249610955
Random Forest Classifier:
recall_macro : 0.6135556688609493
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.6435669784482995




### CREMA-D Results

In [None]:
df = pd.DataFrame(summary_table)
df


| Features | Logistic Regression | Support Vector Machine | Random Forest Classification |
|---|---|---|---|
| hybrid_byols | 73.9 | 72.1 | 64.1 |
| compare | 70.5 | 68.4 | 64.4 |
| egemaps | 61.0 | 63.0 | 58.6 |
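
The percentages in this table agree with the `recall_macro on test_set` values logged above, multiplied by 100 and rounded to one decimal. How `pipeline` fills `summary_table` is not shown in this section, so the sketch below only reproduces that arithmetic; `test_recall`, `to_percent_table`, and `best_features` are hypothetical names.

```python
# Test-set recall_macro values copied from the CREMA-D logs above.
test_recall = {
    'Logistic Regression': {
        'hybrid_byols': 0.7387394828421656,
        'compare': 0.7049677570833403,
        'egemaps': 0.6100823045267489,
    },
    'Support Vector Machine': {
        'hybrid_byols': 0.7212237346372405,
        'compare': 0.6837846249610955,
        'egemaps': 0.6300705467372133,
    },
    'Random Forest Classification': {
        'hybrid_byols': 0.6407647240209683,
        'compare': 0.6435669784482995,
        'egemaps': 0.5861258083480305,
    },
}


def to_percent_table(recalls):
    """Convert recall fractions to percentages rounded to one decimal."""
    return {
        classifier: {feat: round(100 * value, 1) for feat, value in scores.items()}
        for classifier, scores in recalls.items()
    }


def best_features(table):
    """Return the highest-scoring feature set for each classifier."""
    return {clf: max(scores, key=scores.get) for clf, scores in table.items()}


percent_table = to_percent_table(test_recall)
for classifier, scores in percent_table.items():
    print(classifier, scores)
print(best_features(percent_table))
```

Passing `percent_table` to `pd.DataFrame` would give the same layout as the notebook's results table, with classifiers as columns and feature sets as rows.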


# Dataset: Surrey Audio-Visual Expressed Emotion (SAVEE)

### Details:
- English (British)
- 480 utterances 
- 4 male actors
- 7 emotions: anger, disgust, fear, happiness, sadness, surprise and neutral

### References:
- Dataset: http://kahlan.eps.surrey.ac.uk/savee/Database.html
- Paper: http://personal.ee.surrey.ac.uk/Personal/P.Jackson/pub/ma10/HaqJackson_MachineAudition10_approved.pdf



### Loading SAVEE dataset and extracting metadata

In [None]:
# Phase_1
# Load dataset
! kaggle datasets download -q -d barelydedicated/savee-database
! unzip -q savee-database.zip 

# Select all the audio files
audios = []
for file in Path('/content/AudioData').glob("**/*.wav"):
    if not file.is_file(): 
        continue
    audios.append(str(file))

# Load and resample audio files
audio_list = load_audio_files(audios, resampling_frequency=16000)

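# Build the speakers and labels lists from each path: .../AudioData/<speaker>/<file>.wav
speakers = []
labels = []
for audio_file in audios:
    path = Path(audio_file)
    file_name = path.name
    # The speaker ID is the parent directory name (DC, JE, JK, KL)
    speakers.append(path.parent.name)
    # Labels are the filename prefix: one letter, except 'sa' (sadness) and 'su' (surprise)
    if file_name[0] != 's':
        labels.append(file_name[0])
    else:
        labels.append(file_name[0:2])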


# Verify phase_1
print('Number of audio files: {}'.format(len(audio_list)))
print('Number of speaker classes: {}'.format(len(set(speakers))))
print('Speaker classes: {}'.format(set(speakers)))
print('Number of speakers: {}'.format(len(speakers)))
print('Number of label classes: {}'.format(len(set(labels))))
print('Label classes: {}'.format(set(labels)))
print('Number of labels: {}'.format(len(labels)))

Number of audio files: 480
Number of speaker classes: 4
Speaker classes: {'JK', 'DC', 'KL', 'JE'}
Number of speakers: 480
Number of label classes: 7
Label classes: {'d', 'f', 'h', 'n', 'a', 'sa', 'su'}
Number of labels: 480


### Metadata:
Speakers: (4 speakers) 
- DC: male speaker 1
- JE: male speaker 2
- JK: male speaker 3
- KL: male speaker 4
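
With only four speakers, it matters whether any speaker appears in both the training and test sets. The internals of `pipeline` are not shown in this section, so the following is only a sketch of one way to hold out whole speakers, under the assumption that the split is done per speaker. `split_by_speaker` and its parameters are hypothetical names; scikit-learn's `GroupShuffleSplit` offers a library version of the same idea.

```python
import random


def split_by_speaker(speakers, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) such that no speaker appears in both."""
    unique = sorted(set(speakers))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out at least one speaker, rounding the requested fraction.
    n_test = max(1, round(test_fraction * len(unique)))
    test_speakers = set(unique[:n_test])
    train_idx = [i for i, s in enumerate(speakers) if s not in test_speakers]
    test_idx = [i for i, s in enumerate(speakers) if s in test_speakers]
    return train_idx, test_idx


# Illustrative input: the 4 SAVEE speakers with 120 utterances each (480 in total).
speakers = [s for s in ['DC', 'JE', 'JK', 'KL'] for _ in range(120)]
train_idx, test_idx = split_by_speaker(speakers)
print(len(train_idx), len(test_idx))  # 360 120
```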

Labels: (7 labels)

- h: happiness
- a: anger
- d: disgust
- f: fear
- sa: sadness
- su: surprise
- n: neutral
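
The SAVEE loading code derives the speaker from the directory name and the label from the filename prefix. Below is an alternative sketch of the same parsing with a regular expression, assuming files are laid out as `<speaker>/<prefix><number>.wav`. The example paths and the names `SAVEE_EMOTIONS` and `parse_savee_path` are illustrative.

```python
import re
from pathlib import PurePosixPath

# Label prefixes and their meanings, as listed in the metadata above.
SAVEE_EMOTIONS = {
    'a': 'anger',
    'd': 'disgust',
    'f': 'fear',
    'h': 'happiness',
    'n': 'neutral',
    'sa': 'sadness',
    'su': 'surprise',
}


def parse_savee_path(path):
    """Return (speaker, label) for a path such as '.../DC/sa01.wav'."""
    p = PurePosixPath(path)
    # The stem is a letter prefix followed by digits, e.g. 'sa01' -> 'sa'.
    match = re.match(r'([a-z]+)\d+$', p.stem)
    if match is None or match.group(1) not in SAVEE_EMOTIONS:
        raise ValueError('Unexpected SAVEE file name: {}'.format(path))
    return p.parent.name, match.group(1)


# Hypothetical example paths following the layout assumed above.
for example in ['/content/AudioData/DC/sa01.wav', '/content/AudioData/KL/a12.wav']:
    speaker, label = parse_savee_path(example)
    print(speaker, label, SAVEE_EMOTIONS[label])
```

Unlike indexing into `split('/')`, this does not depend on how deep the dataset directory sits, and it raises an error on file names that do not match the expected pattern.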

### Getting max_recall of all the models on SAVEE

In [None]:
summary_table = {'Logistic Regression': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Support Vector Machine': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0},
        'Random Forest Classification': {'hybrid_byols': 0, 'compare': 0, 'egemaps': 0}}
        

In [None]:
model_names = ['hybrid_byols', 'compare', 'egemaps']
pipeline(audio_list, speakers, labels, model_names, 'SAVEE', summary_table)


MODEL: hybrid_byols


Generating Embeddings...: 100%|██████████| 480/480 [00:37<00:00, 12.70it/s]


embeddings_array shape: torch.Size([480, 2048])
normalised_embeddings shape: torch.Size([480, 2048])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([360, 2048])
X_test shape: torch.Size([120, 2048])
y_train len: 360
y_test len: 120

Logistic Regression:
recall_macro : 0.8507936507936508
Best Parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.5285714285714286
Support Vector Machine:
recall_macro : 0.8857142857142856
Best Parameters: {'C': 100.0, 'gamma': 0.0001, 'kernel': 'rbf'}
recall_macro on test_set: 0.5238095238095238
Random Forest Classifier:
recall_macro : 0.7984126984126985
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.4666666666666667


MODEL: compare


100%|██████████| 480/480 [01:02<00:00,  7.70it/s]


embeddings_array shape: torch.Size([480, 6373])
normalised_embeddings shape: torch.Size([480, 6373])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([360, 6373])
X_test shape: torch.Size([120, 6373])
y_train len: 360
y_test len: 120

Logistic Regression:
recall_macro : 0.680952380952381
Best Parameters: {'C': 10.0, 'penalty': 'l2', 'solver': 'lbfgs'}
recall_macro on test_set: 0.6380952380952382
Support Vector Machine:
recall_macro : 0.6555555555555557
Best Parameters: {'C': 100.0, 'gamma': 1e-05, 'kernel': 'rbf'}
recall_macro on test_set: 0.5952380952380951
Random Forest Classifier:
recall_macro : 0.6428571428571429
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 100}
recall_macro on test_set: 0.5333333333333333


MODEL: egemaps


100%|██████████| 480/480 [01:10<00:00,  6.84it/s]


embeddings_array shape: torch.Size([480, 88])
normalised_embeddings shape: torch.Size([480, 88])
PASSED: All means are less than 10**-6
X_train shape: torch.Size([360, 88])
X_test shape: torch.Size([120, 88])
y_train len: 360
y_test len: 120

Logistic Regression:
recall_macro : 0.6444444444444445
Best Parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'sag'}
recall_macro on test_set: 0.6523809523809525
Support Vector Machine:
recall_macro : 0.673015873015873
Best Parameters: {'C': 100.0, 'gamma': 0.001, 'kernel': 'rbf'}
recall_macro on test_set: 0.619047619047619
Random Forest Classifier:
recall_macro : 0.684126984126984
Best Parameters: {'bootstrap': False, 'max_features': 'sqrt', 'n_estimators': 200}
recall_macro on test_set: 0.6476190476190476




### SAVEE Results

In [None]:
df = pd.DataFrame(summary_table)
df


| Features | Logistic Regression | Support Vector Machine | Random Forest Classification |
|---|---|---|---|
| hybrid_byols | 52.9 | 52.4 | 46.7 |
| compare | 63.8 | 59.5 | 53.3 |
| egemaps | 65.2 | 61.9 | 64.8 |


# Results

### Logistic Regression Results

Test-set recall_macro (%) of logistic regression on each dataset, by feature set
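
`recall_macro` is the unweighted mean of the per-class recall, so every emotion counts equally regardless of how many utterances it has. A minimal pure-Python sketch follows; `macro_recall` is an illustrative helper and the toy labels are made up. scikit-learn's `recall_score(y_true, y_pred, average='macro')` computes the same quantity when every predicted class also occurs in `y_true`.

```python
from collections import defaultdict


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall (correct predictions / class support)."""
    support = defaultdict(int)
    hits = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        support[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / support[c] for c in support) / len(support)


# Toy labels for illustration only (not taken from any dataset in this notebook).
y_true = ['h', 'h', 'h', 'h', 'sa', 'sa']
y_pred = ['h', 'h', 'h', 'sa', 'sa', 'h']

# recall(h) = 3/4 and recall(sa) = 1/2, so the macro average is 0.625,
# while plain accuracy would be 4/6.
print(macro_recall(y_true, y_pred))
```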

In [None]:
lr_data = pd.DataFrame(logistic_regression_results)
lr_data


| Features | EmoDB | CaFE | ShEMO | CREMA-D | RAVDESS | SAVEE |
|---|---|---|---|---|---|---|
| hybrid_byols | 86.7 | 73.4 | 55.9 | 73.9 | 79.0 | 52.9 |
| compare | 82.8 | 68.7 | 52.0 | 70.5 | 70.1 | 63.8 |
| egemaps | 75.6 | 61.1 | 50.2 | 61.0 | 63.2 | 65.2 |


### Support Vector Machine Results

Test-set recall_macro (%) of the SVM on each dataset, by feature set

In [None]:
svm_data = pd.DataFrame(support_vector_machine_results)
svm_data


| Features | EmoDB | CaFE | ShEMO | CREMA-D | RAVDESS | SAVEE |
|---|---|---|---|---|---|---|
| hybrid_byols | 84.9 | 75.0 | 54.7 | 72.1 | 79.9 | 52.4 |
| compare | 85.8 | 67.9 | 49.2 | 68.4 | 68.8 | 59.5 |
| egemaps | 76.2 | 56.7 | 49.5 | 63.0 | 63.2 | 61.9 |


### Random Forest Classification Results

Test-set recall_macro (%) of random forest classification on each dataset, by feature set

In [None]:
rfc_data = pd.DataFrame(random_forest_classifier_results)
rfc_data


| Features | EmoDB | CaFE | ShEMO | CREMA-D | RAVDESS | SAVEE |
|---|---|---|---|---|---|---|
| hybrid_byols | 74.7 | 67.5 | 42.1 | 64.1 | 71.0 | 46.7 |
| compare | 74.4 | 62.3 | 40.1 | 64.4 | 58.7 | 53.3 |
| egemaps | 69.9 | 57.1 | 45.3 | 58.6 | 60.3 | 64.8 |
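
To compare the three tables side by side, the scores can be scanned for the best classifier and feature set on each dataset. The sketch below copies the recall_macro (%) values from the tables above; `DATASETS`, `RESULTS`, and `best_per_dataset` are hypothetical names, not objects defined by the notebook.

```python
DATASETS = ['EmoDB', 'CaFE', 'ShEMO', 'CREMA-D', 'RAVDESS', 'SAVEE']

# recall_macro (%) copied from the three results tables, in DATASETS order.
RESULTS = {
    'Logistic Regression': {
        'hybrid_byols': [86.7, 73.4, 55.9, 73.9, 79.0, 52.9],
        'compare': [82.8, 68.7, 52.0, 70.5, 70.1, 63.8],
        'egemaps': [75.6, 61.1, 50.2, 61.0, 63.2, 65.2],
    },
    'Support Vector Machine': {
        'hybrid_byols': [84.9, 75.0, 54.7, 72.1, 79.9, 52.4],
        'compare': [85.8, 67.9, 49.2, 68.4, 68.8, 59.5],
        'egemaps': [76.2, 56.7, 49.5, 63.0, 63.2, 61.9],
    },
    'Random Forest Classification': {
        'hybrid_byols': [74.7, 67.5, 42.1, 64.1, 71.0, 46.7],
        'compare': [74.4, 62.3, 40.1, 64.4, 58.7, 53.3],
        'egemaps': [69.9, 57.1, 45.3, 58.6, 60.3, 64.8],
    },
}


def best_per_dataset(results, datasets):
    """Return {dataset: (score, classifier, features)} for the top score on each dataset."""
    best = {}
    for classifier, by_features in results.items():
        for features, scores in by_features.items():
            for dataset, score in zip(datasets, scores):
                if dataset not in best or score > best[dataset][0]:
                    best[dataset] = (score, classifier, features)
    return best


for dataset, (score, classifier, features) in best_per_dataset(RESULTS, DATASETS).items():
    print('{:8s} {:5.1f}  {} + {}'.format(dataset, score, classifier, features))
```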
