<a href="https://colab.research.google.com/github/satvik-dixit/speech_emotion_recognition/blob/main/speech_emotion_recogniser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Speech Emotion Recogniser

A notebook that identifies the emotion of an English utterance, using a classifier trained on RAVDESS. The demo is divided into 4 phases:
- Phase 1: Loading RAVDESS audio files and extracting metadata
- Phase 2: Defining functions for Speech Emotion Recognition
- Phase 3: Training the model
- Phase 4: Recording an audio file



### About RAVDESS:
- English
- 7356 recordings in total; this notebook uses the 1440 speech-only audio files
- 24 actors (12 female, 12 male)
- 8 emotions: neutral, calm, happy, sad, angry, fearful, surprise, and disgust

### References:
- Dataset: https://zenodo.org/record/1188976#.YvyPHexBy3K
- Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0196391


Let's start by installing and importing the required packages.


### Importing packages

In [1]:
!git clone -q https://github.com/GasserElbanna/serab-byols.git
!python3 -m pip install -q -e ./serab-byols
!pip install -q omegaconf torchaudio pydub
!pip install -q tqdm==4.60.0
!pip install ffmpeg-python

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Installing collected packages: ffmpeg-python
Successfully installed ffmpeg-python-0.2.0


In [None]:
# Restart the runtime so that the newly installed packages can be imported
import os
os.kill(os.getpid(), 9)

In [1]:
import os
import numpy as np
from tqdm import tqdm
from glob import glob
from random import sample
from pathlib import Path 
import pickle

import librosa
import soundfile as sf

import torch
import serab_byols

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')


# Phase 1: Loading RAVDESS audio files and extracting metadata

This phase downloads the dataset, loads and resamples the audio files, and extracts the metadata.


### Defining a function for loading and resampling audio files

In [2]:
# Defining a function for loading and resampling audio files

def load_audio_files(audio_files, resampling_frequency=16000, audio_list=None):
  '''
  Loads and resamples audio files 
  
  Parameters
  ------------
  audio_files: list of strings
      The paths of the wav files 
  resampling_frequency: integer
      The frequency to which all audio files will be resampled
  audio_list: list 
      An existing list of audio tensors to append to; a new list is created if None (default)

  Returns
  ------------
  audio_list: list
      A list of torch tensors, one tensor for each audio file

  '''
  # Making audio_list
  if audio_list is None:
    audio_list = []

  # Resampling
  for audio in audio_files:
    signal, fs = librosa.load(audio, sr=resampling_frequency)
    audio_list.append(torch.from_numpy(signal))
      
  return audio_list
        

### Metadata:
Speakers: (24 speakers) 
- Odd numbered actors are male 
- Even numbered actors are female

Labels: (8 labels)
- 01 = neutral
- 02 = calm
- 03 = happy
- 04 = sad
- 05 = angry
- 06 = fearful
- 07 = disgust
- 08 = surprised
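
The label codes above come from the RAVDESS file naming scheme, in which each file name consists of seven two-digit fields separated by hyphens (modality, vocal channel, emotion, intensity, statement, repetition, actor). A minimal sketch of decoding a file name by splitting on hyphens, rather than by character position, is shown below. The helper name `parse_ravdess_name` and the example path are illustrative and not part of the notebook.

```python
from pathlib import Path

EMOTIONS = {'01': 'NEUTRAL', '02': 'CALM', '03': 'HAPPY', '04': 'SAD',
            '05': 'ANGRY', '06': 'FEARFUL', '07': 'DISGUST', '08': 'SURPRISE'}

def parse_ravdess_name(path):
    """Hypothetical helper: decode emotion, actor and gender from a RAVDESS file name."""
    fields = Path(path).stem.split('-')
    if len(fields) != 7:
        raise ValueError('Unexpected RAVDESS file name: {}'.format(path))
    actor = int(fields[6])
    return {
        'emotion': EMOTIONS[fields[2]],                     # 3rd field: emotion code
        'actor': fields[6],                                 # 7th field: actor ID
        'gender': 'male' if actor % 2 == 1 else 'female',   # odd = male, even = female
    }

info = parse_ravdess_name('/content/ravdess/Actor_12/03-01-06-01-02-01-12.wav')
print(info)
```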

### Loading and resampling audio files and collecting metadata on the RAVDESS dataset

In [3]:
# Phase_1
# Load dataset
! wget -O ravdess-emotional-speech-audio.zip -q https://zenodo.org/record/1188976/files/Audio_Speech_Actors_01-24.zip?download=1
! unzip -q ravdess-emotional-speech-audio.zip -d '/content/ravdess'

# Select all the audio files
audios = []
for file in Path('/content/ravdess').glob("**/*.wav"):
    if not file.is_file(): 
        continue
    audios.append(str(file))

# Load and resample audio files
audio_list = load_audio_files(audios, resampling_frequency=16000)

# Making speakers list and labels list 
speakers = []
old_labels = []
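for audio_file in audios:
  # RAVDESS file names look like '03-01-06-01-02-01-12.wav'
  file_name = Path(audio_file).name
  speakers.append(file_name[18:20])   # 7th field: actor ID
  old_labels.append(file_name[6:8])   # 3rd field: emotion code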

label_dict = {'01':'NEUTRAL', '02':'CALM', '03':'HAPPY', '04':'SAD', '05':'ANGRY', '06':'FEARFUL', '07':'DISGUST', '08':'SURPRISE'}
labels = []
for old_label in old_labels:
  labels.append(label_dict[old_label])

# Verify phase_1
print('Number of audio files: {}'.format(len(audio_list)))
print('Number of speaker classes: {}'.format(len(set(speakers))))
print('Speaker classes: {}'.format(set(speakers)))
print('Number of speakers: {}'.format(len(speakers)))
print('Number of label classes: {}'.format(len(set(labels))))
print('Label classes: {}'.format(set(labels)))
print('Number of labels: {}'.format(len(labels)))


Number of audio files: 1440
Number of speaker classes: 24
Speaker classes: {'08', '14', '24', '05', '12', '09', '21', '04', '06', '16', '17', '10', '03', '15', '13', '20', '11', '01', '02', '22', '23', '07', '18', '19'}
Number of speakers: 1440
Number of label classes: 8
Label classes: {'DISGUST', 'ANGRY', 'NEUTRAL', 'HAPPY', 'SAD', 'FEARFUL', 'CALM', 'SURPRISE'}
Number of labels: 1440


# Phase 2: Defining functions for Speech Emotion Recognition



### Audio embeddings extraction functions

In [4]:
# Defining a function for generating audio embedding extraction models

def audio_embeddings_model(model_name):
  '''
  Generates model for embedding extraction 
  
  Parameters
  ------------
  model_name: string
      The model to use; currently only 'hybrid_byols' is supported

  Returns
  ------------
  model: object
      The embedding extraction model
  '''
  if model_name=='hybrid_byols':
    model_name = 'cvt'
    checkpoint_path = "serab-byols/checkpoints/cvt_s1-d1-e64_s2-d1-e256_s3-d1-e512_BYOLAs64x96-osandbyolaloss6373-e100-bs256-lr0003-rs42.pth"
    model = serab_byols.load_model(checkpoint_path, model_name)
  else:
    # Fail loudly instead of returning an undefined model
    raise ValueError('Unsupported model_name: {}'.format(model_name))
  return model


# Defining a function for embedding extraction from the audio list
def audio_embeddings(audio_list, model_name, model, sampling_rate=16000):
  '''
  Extracts one embedding per audio file 
  
  Parameters
  ------------
  audio_list: list
      A list of torch tensors, one tensor for each audio file
  model_name: string
      The model to use; currently only 'hybrid_byols' is supported
  model: object
      The embedding extraction model generated by audio_embeddings_model function
  sampling_rate: int
      The sampling rate, 16 kHz by default (currently unused)

  Returns
  ------------
  embeddings_array: array
      The array containing embeddings of all audio_files, dimension (number of audio files × n_feats)
      
  '''
  if model_name=='hybrid_byols':
    embeddings_array = serab_byols.get_scene_embeddings(audio_list, model)
  else:
    # Fail loudly instead of returning an undefined array
    raise ValueError('Unsupported model_name: {}'.format(model_name))
  return embeddings_array


### Speaker normalisation functions

In [5]:
# Defining a function for speaker normalisation using standard scaler

def speaker_normalisation(embeddings_array, speakers):
  '''
  Normalises embeddings_array for each speaker
  
  Parameters
  ------------
  embeddings_array: torch tensor
      The tensor of embeddings, one row for each audio file (modified in place)
  speakers: list 
      The list of speakers, one entry for each audio file

  Returns
  ------------
  embeddings_array: torch tensor
      The tensor containing normalised embeddings of all audio_files, dimension (number of audio files × n_feats)
      
  '''
  speaker_ids = set(speakers)
  for speaker_id in speaker_ids:
    speaker_embeddings_indices = np.where(np.array(speakers)==speaker_id)[0]
    speaker_embeddings = embeddings_array[speaker_embeddings_indices,:]
    scaler = StandardScaler()
    normalised_speaker_embeddings = scaler.fit_transform(speaker_embeddings)
    embeddings_array[speaker_embeddings_indices] = torch.tensor(normalised_speaker_embeddings).float()
  return embeddings_array
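
To make the effect of this function concrete, the following dependency-free sketch applies the same idea to a toy one-dimensional feature: each speaker's values are shifted and scaled by that speaker's own mean and population standard deviation (the estimator `StandardScaler` uses). The function name `zscore_per_speaker` and the toy values are illustrative assumptions. Unlike the notebook function, this sketch returns a new list instead of modifying its input.

```python
from statistics import fmean, pstdev

def zscore_per_speaker(values, speakers):
    """Hypothetical helper: standardise a 1-D feature separately for each speaker."""
    normalised = [0.0] * len(values)
    for speaker_id in set(speakers):
        # Indices of all clips that belong to this speaker
        idx = [i for i, s in enumerate(speakers) if s == speaker_id]
        group = [values[i] for i in idx]
        mean = fmean(group)
        std = pstdev(group) or 1.0  # constant features are left unscaled
        for i in idx:
            normalised[i] = (values[i] - mean) / std
    return normalised

values = [1.0, 3.0, 10.0, 30.0]
speakers = ['01', '01', '02', '02']
result = zscore_per_speaker(values, speakers)
print(result)
```

Although the second speaker's raw values are ten times larger, both speakers end up on the same scale, which is the point of normalising per speaker before training a single classifier.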


### Hyperparameter tuning functions

In [6]:
# Defining a function for hyperparameter tuning with cross-validated grid search

def get_hyperparams(X_train, y_train, classifier, parameters):
  '''
  Tunes the hyperparameters of the classifier with 5-fold cross-validated grid search,
  scored by macro-averaged recall

  Parameters
  ------------
  X_train: torch tensor
    The normalised embeddings that will be used for training
  y_train: list
    The labels that will be used for training
  classifier: object
    The instance of the classification model 
  parameters: dictionary
    The dictionary of parameters for GridSearchCV 

  Returns
  ------------
  trained_model: object
    The fitted GridSearchCV object, refitted on all of X_train with the best hyperparameters
  
  '''
  grid = GridSearchCV(classifier, param_grid = parameters, cv=5, scoring='recall_macro')                     
  trained_model = grid.fit(X_train,y_train)
  print('recall_macro :',grid.best_score_)
  print('Best Parameters: {}'.format(grid.best_params_))
  # prediction = grid.predict(X_test)
  # print('PREDICTION: {}'.format(prediction))
  return trained_model
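
The `recall_macro` score used above is the unweighted mean of the per-class recalls, so each of the eight emotions counts equally regardless of how many clips it has. A small pure-Python sketch of that definition follows. The function name `macro_recall` and the toy labels are illustrative and not part of the notebook.

```python
from collections import Counter

def macro_recall(y_true, y_pred):
    """Hypothetical helper: unweighted mean of per-class recall."""
    support = Counter(y_true)                                    # clips per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    recalls = [hits[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)

y_true = ['SAD', 'SAD', 'SAD', 'HAPPY']
y_pred = ['SAD', 'SAD', 'HAPPY', 'HAPPY']
score = macro_recall(y_true, y_pred)
print(round(score, 4))
```

In this toy case plain accuracy would be 3/4, while macro recall is (2/3 + 1)/2, because the single HAPPY clip carries the same weight as the three SAD clips.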


### Function for recording audio

In [7]:
# specify record_seconds
record_seconds = 3

from os.path import exists
if not exists('silero-models'):
  !git clone -q --depth 1 https://github.com/snakers4/silero-models
%cd silero-models

# silero imports
from omegaconf import OmegaConf
from src.silero.utils import (init_jit_model, 
                       split_into_batches,
                       read_audio,
                       read_batch,
                       prepare_model_input)
from colab_utils import (record_audio,
                         audio_bytes_to_np,
                         upload_audio)

device = torch.device('cpu')   # you can use any pytorch device
models = OmegaConf.load('models.yml')

# imports for uploading/recording
import ipywidgets as widgets
from scipy.io import wavfile
from IPython.display import Audio, display, clear_output
from torchaudio.functional import vad

model, decoder = init_jit_model(models.stt_models.en.latest.jit, device=device)
language = "English"
use_VAD = "No"
record_or_upload = "Record" 
sample_rate = 16000


def _record_audio(b):
  clear_output()
  audio = record_audio(record_seconds)
  wavfile.write('recorded.wav', sample_rate, (32767*audio).numpy().astype(np.int16))
%cd ..

/content/silero-models


  0%|          | 0.00/112M [00:00<?, ?B/s]

/content
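
The recording callback above writes the clip as 16-bit PCM by scaling the floating-point samples by 32767 before casting to `int16`. The sketch below illustrates that conversion with the standard library only, and additionally clips out-of-range samples, which the notebook code does not do. The function name `float_to_pcm16` and the file name `tone.wav` are illustrative assumptions.

```python
import math
import struct
import wave

def float_to_pcm16(samples):
    """Hypothetical helper: map floats in [-1, 1] to 16-bit integers, clipping overshoot."""
    return [int(32767 * max(-1.0, min(1.0, s))) for s in samples]

sample_rate = 16000
# One second of a 440 Hz tone at half amplitude
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
pcm = float_to_pcm16(tone)

with wave.open('tone.wav', 'wb') as wav_file:
    wav_file.setnchannels(1)            # mono
    wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
    wav_file.setframerate(sample_rate)
    # Pack the integers as little-endian signed 16-bit values
    wav_file.writeframes(struct.pack('<{}h'.format(len(pcm)), *pcm))
```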


### Pipeline function

In [18]:
# Defining a function for all steps 

def training_pipeline(audio_list, speakers, labels):
  '''
  Extracts embeddings, normalises them per speaker and tunes an SVM classifier
  
  Parameters
  ------------
  audio_list: list
      A list of torch tensors, one tensor for each audio file
  speakers: list
      The speaker ID of each audio file, in the same order as audio_list
  labels: list
      The emotion label of each audio file, in the same order as audio_list

  Returns
  ------------
  trained_model: object
      The fitted GridSearchCV object returned by get_hyperparams

  '''

  # Embeddings Extraction
  model = audio_embeddings_model(model_name = 'hybrid_byols')
  embeddings_array = audio_embeddings(audio_list, model_name = 'hybrid_byols', model=model)
  # test_embeddings_array = audio_embeddings(test_list, model_name = 'hybrid_byols', model=model)
  print('embeddings_array shape: {}'.format(embeddings_array.shape))
  # print('test_embeddings_array shape: {}'.format(test_embeddings_array.shape))

  # Speaker Normalisation
  normalised_embeddings = speaker_normalisation(embeddings_array, speakers)
  print('normalised_embeddings shape: {}'.format(normalised_embeddings.shape))
  columnwise_mean = torch.mean(normalised_embeddings, 0)
  # Compare absolute values so that large negative means do not pass the check
  if torch.all(torch.abs(columnwise_mean) < 10**(-6)):
    print('PASSED: All means are less than 10**-6')
  else:
    print('FAILED: All means are NOT less than 10**-6')

  X_train = normalised_embeddings
  y_train = labels
  # X_test = test_embeddings_array

  # Getting hyperparameters and checking max_recall
  print('Support Vector Machine:')
  classifier = SVC()
  parameters = {'C': np.logspace(-2,4,5), 'gamma': np.logspace(-5,-3,5), 'kernel':['linear', 'poly', 'rbf']}
  trained_model = get_hyperparams(X_train, y_train, classifier, parameters)
  return trained_model
  

# Phase 3: Training the model

In [19]:
trained_model = training_pipeline(audio_list, speakers, labels)

Generating Embeddings...: 100%|██████████| 1440/1440 [01:26<00:00, 16.67it/s]


embeddings_array shape: torch.Size([1440, 2048])
normalised_embeddings shape: torch.Size([1440, 2048])
PASSED: All means are less than 10**-6
Support Vector Machine:
recall_macro : 0.7888950742240215
Best Parameters: {'C': 10.0, 'gamma': 3.1622776601683795e-05, 'kernel': 'rbf'}
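
The grid searched above has 5 values of `C`, 5 values of `gamma` and 3 kernels, so 75 candidate settings are each evaluated with 5-fold cross-validation. The sketch below reproduces the grid without NumPy to make that count explicit. The `logspace` helper here is a pure-Python stand-in for `np.logspace`, written for illustration only.

```python
from itertools import product

def logspace(start, stop, num):
    """Pure-Python stand-in for np.logspace: num powers of 10 from 10**start to 10**stop."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

grid = {
    'C': logspace(-2, 4, 5),
    'gamma': logspace(-5, -3, 5),
    'kernel': ['linear', 'poly', 'rbf'],
}

candidates = list(product(*grid.values()))
n_fits = len(candidates) * 5  # cv=5 folds per candidate
print(len(candidates), n_fits)
```

The best parameters reported above, `C = 10.0` and `gamma ≈ 3.16e-05`, are the third and second points of their respective grids. Since scikit-learn's `SVC` ignores `gamma` for the linear kernel, some of the 75 candidates are effectively duplicates.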


# Phase 4: Recording an Audio File

In [25]:
button = widgets.Button(description="Record Speech")
button.on_click(_record_audio)
display(button)

Starting recording for 3 seconds...



Finished recording!


In [26]:
display(Audio('recorded.wav', rate=sample_rate, autoplay=False))
audio = read_audio('recorded.wav', sample_rate)
test_list = [audio]
print(test_list)
model = audio_embeddings_model(model_name = 'hybrid_byols')
test_embeddings_array = audio_embeddings(test_list, model_name = 'hybrid_byols', model=model)

Generating Embeddings...: 100%|██████████| 1/1 [00:00<00:00, 21.17it/s]

[tensor([0.0000, 0.0000, 0.0000,  ..., 0.0007, 0.0007, 0.0009])]





# Results

Predicting the emotion of the recorded audio with the SVM trained on Hybrid BYOL-S embeddings of RAVDESS.

In [27]:
emotion = trained_model.predict(test_embeddings_array)
print(emotion)

['DISGUST']
