# Audio Speech Emotion Recognition (SER) Tutorial

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/audio/audio_speech_emotion_recognition.ipynb)

In this tutorial we use Senselab for an audio classification task: Speech Emotion Recognition (SER). Note: we recommend running this tutorial on a GPU, since some of the models are slow on CPU.

## Background
SER is the task of recognizing one or more emotions in an audio clip. Transformers have been shown to perform well on SER, and the task is useful in settings such as therapeutics and call center technologies. SER typically comes in two main formats: continuous and discrete emotion recognition. In the continuous setting, models predict continuous values (e.g. valence, arousal, dominance) for speech segments; these values are highly correlated with different perceived emotions. In the discrete setting, a model takes an audio clip and classifies it with one (multi-class) or more (multi-label) emotions.

# Installing Requirements

The only requirement for this tutorial is the Senselab package, since many of the other requirements (transformers, huggingface, datasets, etc.) are installed along with it.

In [None]:
%pip install senselab

In [1]:
from datasets import load_dataset, Audio
import torchaudio.transforms as T
import torch
import time

from transformers import AutoConfig, pipeline
from transformers.pipelines.pt_utils import KeyDataset

from senselab.utils.data_structures.device import DeviceType
from senselab.utils.data_structures.model import HFModel
from senselab.utils.data_structures.dataset import SenselabDataset
from senselab.audio.tasks.classification.speech_emotion_recognition import speech_emotion_recognition_with_hf_models
from senselab.audio.tasks.preprocessing import resample_audios

# Discrete Speech Emotion Recognition

For discrete SER, we will be focusing on the RAVDESS dataset. The authors describe this dataset as "gender balanced consisting of 24 professional actors, vocalizing lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions." While the dataset is multi-modal, we will only be focusing on the speech portions of the dataset. For more information about RAVDESS, see their [paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0196391).

In [2]:
ravdess = load_dataset('xbgoose/ravdess', split='train')


We can print the dataset to get some basic information about it. The `audio` feature uses HuggingFace's Audio implementation, which stores the path, the audio data array, and the sampling rate. The label we want to classify is the `emotion` feature, but the dataset also carries other information that could be useful depending on the model, including an actor ID, the actor's gender, and the statement the actor is speaking.

In [3]:
print(ravdess)
print(ravdess[0])

Dataset({
    features: ['audio', 'modality', 'vocal_channel', 'emotion', 'emotional_intensity', 'statement', 'repetition', 'actor', 'gender'],
    num_rows: 1440
})
{'audio': {'path': '03-01-05-01-02-01-16.wav', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -3.05175781e-05, -6.10351562e-05, -3.05175781e-05]), 'sampling_rate': 48000}, 'modality': 'audio-only', 'vocal_channel': 'speech', 'emotion': 'angry', 'emotional_intensity': 'normal', 'statement': 'Dogs are sitting by the door', 'repetition': '1st repetition', 'actor': 16, 'gender': 'female'}
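The `path` in the sample above (`03-01-05-01-02-01-16.wav`) is not arbitrary: RAVDESS filenames encode the same metadata that appears in the other columns. The following is a minimal sketch of a decoder, assuming the seven-field naming convention published with the dataset; `parse_ravdess_filename` is a hypothetical helper written for this tutorial, not part of Senselab or the HuggingFace dataset.

```python
# Lookup tables for the seven dash-separated fields of a RAVDESS filename.
# Assumption: these follow the naming convention documented by the dataset authors.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {
    "01": "Kids are talking by the door",
    "02": "Dogs are sitting by the door",
}


def parse_ravdess_filename(filename):
    """Decode a RAVDESS filename such as '03-01-05-01-02-01-16.wav'."""
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"Expected 7 dash-separated fields, got {len(parts)}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    actor_id = int(actor)
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "emotional_intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": actor_id,
        # Odd-numbered actors are male, even-numbered actors are female.
        "gender": "male" if actor_id % 2 == 1 else "female",
    }


decoded = parse_ravdess_filename("03-01-05-01-02-01-16.wav")
print(decoded)
```

For the sample printed above, the decoded fields should agree with the dataset columns (audio-only, speech, angry, normal intensity, actor 16, female), which is a quick way to confirm that labels and audio files are aligned.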


The model we will use is a fine-tuned DistilHuBERT. We chose it because it performs well on this dataset (notably, it was trained on it) and because its model card contains the descriptive information that Senselab uses to determine whether it can run the model directly in our classification pipeline. More details of the model can be found [here](https://huggingface.co/pollner/distilhubert-finetuned-ravdess).

In [4]:
model_to_use = 'pollner/distilhubert-finetuned-ravdess'

We will use Senselab's DeviceType to store the device we want to run inference on, since it works directly with Senselab and can also be passed to HuggingFace pipelines via `device.value`. As mentioned in the introduction, it is best to run these models with a GPU due to their size and the time it takes to run the computations.

In [5]:
device = DeviceType.CUDA if torch.cuda.is_available() else DeviceType.CPU
print(device)

DeviceType.CUDA


## Running with HuggingFace vs. Senselab

In this section we will show some of the differences between using Senselab and using HuggingFace directly, including the amount of code that needs to be written and a timing comparison. In most cases we shouldn't expect huge differences in time, since under the hood Senselab uses some of HuggingFace's APIs, but in doing so Senselab can optimize its calls while hiding this complexity from the user.

Like many models, the one we are going to use requires audio at a specific sampling rate, in this case 16 kHz, whereas our audio starts at 48 kHz, so we need to resample first.

We will then run the classification pipeline and report the results as accuracy (the fraction of clips classified correctly). Since the classes are nearly balanced within the dataset, the unweighted and weighted variants of accuracy come out almost the same here. Since both solutions run the same model on the same dataset, we should expect to get the same results.
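To make the two accuracy variants mentioned above concrete, here is a small sketch with toy labels (not RAVDESS data). Naming varies between papers, so the hypothetical helper names below describe what is computed instead of using "weighted" and "unweighted": one is the fraction of all clips classified correctly, the other averages the per-class recall so that every class counts equally.

```python
from collections import defaultdict


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of all clips whose predicted label matches the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def mean_per_class_recall(true_labels, predicted_labels):
    """Average of per-class recall, so each class contributes equally."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, deliberately imbalanced example: 4 "angry" clips and 2 "neutral" clips,
# with one "neutral" clip misclassified as "angry".
true_labels = ["angry"] * 4 + ["neutral"] * 2
predicted_labels = ["angry"] * 4 + ["neutral", "angry"]

acc = overall_accuracy(true_labels, predicted_labels)
recall = mean_per_class_recall(true_labels, predicted_labels)
print(f"Overall accuracy: {acc:.4f}")
print(f"Mean per-class recall: {recall:.4f}")
```

In the toy example 5 of 6 clips are correct overall, while the per-class average is 0.75 because the smaller class has a recall of only 0.5. The more balanced the classes, the closer the two numbers get.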

### Running with HuggingFace

Here we resample using functionality from torchaudio. Note that we do this with a dataset mapping, because we need to preserve the structure of HuggingFace's Audio objects in order to use their pipeline.

The output of the HuggingFace pipeline is a mapping of labels to their probabilities, where the highest probability will be taken as the model's recognized emotion.

In [6]:
start_time = time.perf_counter()
# Resampling the dataset using torchaudio

resampler = T.Resample(48000, 16000)

resampled_ravdess = ravdess.map(lambda sample: {'audio':{
    'array': resampler(torch.Tensor(sample['audio']['array'])).numpy(),
    'path': sample['audio']['path'],
    'sampling_rate': 16000
}})

resample_time = time.perf_counter()

classification_pipeline = pipeline(
        task="audio-classification",
        model=model_to_use,
        device=device.value,
    )

emotion_classification = []
for out in classification_pipeline(KeyDataset(resampled_ravdess,"audio")):
  highest_emotion_score_label = max(out, key=lambda x: x['score'])
  emotion_classification.append(highest_emotion_score_label['label'])

classification_time = time.perf_counter()

accuracy = 0
for i in range(len(emotion_classification)):
  accuracy += int(emotion_classification[i]==resampled_ravdess[i]['emotion'])

end_time = time.perf_counter()
total_time = end_time - start_time
resampling_time = resample_time - start_time
classifying_time = classification_time - resample_time

print('Accuracy (%):', accuracy/len(emotion_classification)*100)
print(f"Time for resampling: {resampling_time} seconds")
print(f"Time for classifying, including classification pipeline setup: {classifying_time} seconds")
print(f"Total time for HuggingFace classification: {total_time} seconds")

Accuracy (%): 99.02777777777779
Time for resampling: 139.90873038099997 seconds
Time for classifying, including classification pipeline setup: 29.229274503 seconds
Total time for HuggingFace classification: 170.780722365 seconds
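The label-selection step in the cell above can be isolated as a small helper. This is a sketch under the assumption, stated in the text, that each pipeline output is a list of label/score entries; the scores below are made up for illustration, and `top_label` is a hypothetical name.

```python
def top_label(scores):
    """Return the label with the highest score.

    `scores` is assumed to be a list of {'label': str, 'score': float} dicts,
    matching the structure indexed in the cell above.
    """
    return max(scores, key=lambda item: item["score"])["label"]


# Illustrative values only; these are not real model outputs.
example_output = [
    {"label": "happy", "score": 0.05},
    {"label": "angry", "score": 0.91},
    {"label": "neutral", "score": 0.04},
]

predicted = top_label(example_output)
print(predicted)
```

Keeping this step in a named function makes it easy to reuse when computing accuracy or when comparing against the tuples that Senselab returns.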


### Running with Senselab

As above, we will run discrete classification with the same model and dataset, but now perform every step with the `senselab` package and compare the time each step takes.

The first step is to convert the HuggingFace dataset to a SenselabDataset. To give the fairest performance comparison, we include the time for this conversion in our final timing. After the conversion we again need to resample, but here resampling is a single line. Note that most senselab functions take lists of Audio objects, which we get from the SenselabDataset class through its `audios` attribute.

The output of the audio classification methods (including the SER method we use here) is one tuple per Audio object: the first entry is the classification label with the highest probability, and the second is a dictionary mapping labels to their corresponding probabilities.

In [8]:
start_time = time.perf_counter()

# Create a HFModel object and convert the HuggingFace style dataset to SenselabDataset
model = HFModel(path_or_uri=model_to_use)
senselab_ravdess =  SenselabDataset.convert_hf_dataset_to_senselab_dataset({'audios': ravdess})

converted_time = time.perf_counter()

# Need to resample the audios
senselab_ravdess.audios = resample_audios(senselab_ravdess.audios, resample_rate=16000)
resampled_time = time.perf_counter()

# Run SER classification
senselab_result = speech_emotion_recognition_with_hf_models(senselab_ravdess.audios, model, device=device)
classified_time = time.perf_counter()

# show what a result from Senselab looks like
print(senselab_result[0])

# Compute accuracy
accuracy = 0
for i in range(len(senselab_result)):
  accuracy += int(senselab_result[i][0]==ravdess[i]['emotion'])

end_time = time.perf_counter()
total_time = end_time - start_time
conversion_time = converted_time - start_time
resampling_time = resampled_time - converted_time
classifying_time = classified_time - resampled_time

print('Accuracy (%):', accuracy/len(senselab_result)*100)
print(f"Time for converting to a SenselabDataset: {conversion_time} seconds")
print(f"Time for resampling: {resampling_time} seconds")
print(f"Time for classifying: {classifying_time} seconds")
print(f"Total time for Senselab classification: {total_time} seconds")

('angry', {'angry': 0.9938356280326843, 'disgust': 0.002489604288712144, 'surprised': 0.0016016672598198056, 'happy': 0.000815633567981422, 'neutral': 0.0006909818621352315})
Accuracy (%): 98.95833333333334
Time for converting to a SenselabDataset: 4.235388228999909 seconds
Time for resampling: 12.933568482000055 seconds
Time for classifying: 25.657619488000023 seconds
Total time for Senselab classification: 45.39131917299994 seconds
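The result tuple printed above can be unpacked directly. The sketch below uses a hand-copied, rounded version of that tuple rather than calling Senselab, and `accuracy_from_results` is a hypothetical helper that mirrors the accuracy loop in the cell.

```python
# Rounded copy of the (label, scores) tuple printed above.
result = (
    "angry",
    {"angry": 0.9938, "disgust": 0.0025, "surprised": 0.0016,
     "happy": 0.0008, "neutral": 0.0007},
)

# First entry: the top label. Second entry: label -> probability mapping.
label, scores = result
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
print(label, ranked[:2])


def accuracy_from_results(results, true_labels):
    """Percentage of results whose top label equals the ground-truth label."""
    correct = sum(pred == truth for (pred, _), truth in zip(results, true_labels))
    return 100 * correct / len(results)


# Toy check with two results, one of them wrong.
toy_results = [result, ("happy", {"happy": 0.6, "sad": 0.4})]
toy_accuracy = accuracy_from_results(toy_results, ["angry", "sad"])
print(toy_accuracy)
```

Because the top label is already the first tuple entry, no `max` over the scores is needed, unlike with the raw pipeline output.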


### Comparing the results

You'll notice that overall Senselab ran much faster than the usual HuggingFace approach: 45.4 seconds versus 170.8 seconds, a speedup of roughly 3.8x. Most of this speedup does not come from the classification itself (the two are roughly equivalent, at 25.7 and 29.2 seconds) but from resampling the audios faster (12.9 versus 139.9 seconds).
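The speedup figures can be recomputed from the timings printed in the two cells above. This is plain arithmetic on the reported numbers; timings will differ from run to run and between machines.

```python
# Timings (seconds) copied from the cell outputs above.
huggingface = {
    "resampling": 139.90873038099997,
    "classifying": 29.229274503,
    "total": 170.780722365,
}
senselab = {
    "converting": 4.235388228999909,
    "resampling": 12.933568482000055,
    "classifying": 25.657619488000023,
    "total": 45.39131917299994,
}

total_speedup = huggingface["total"] / senselab["total"]
resampling_speedup = huggingface["resampling"] / senselab["resampling"]
classifying_ratio = huggingface["classifying"] / senselab["classifying"]

print(f"Total speedup: {total_speedup:.2f}x")
print(f"Resampling speedup: {resampling_speedup:.2f}x")
print(f"Classification ratio: {classifying_ratio:.2f}x")
```

The resampling ratio is an order of magnitude, while the classification ratio stays close to 1, which is why the overall gain is attributed to resampling.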


You might wonder why the accuracy with Senselab (98.96%) is slightly lower than with the traditional HuggingFace approach (99.03%), a difference of a single clip out of 1440. This is an artifact of the resampling process in Senselab, which does some filtering before downsampling in order to prevent or reduce aliasing. To show that we get the same result, we can reuse the resampled audios that were generated by torchaudio. We will ignore the timing differences here since most of the delay previously came from resampling.

In [10]:
start_time = time.perf_counter()

# Create a HFModel object and convert the HuggingFace style dataset to SenselabDataset
model = HFModel(path_or_uri=model_to_use)
senselab_ravdess =  SenselabDataset.convert_hf_dataset_to_senselab_dataset({'audios': resampled_ravdess})

converted_time = time.perf_counter()

# Run SER classification
senselab_result = speech_emotion_recognition_with_hf_models(senselab_ravdess.audios, model, device=device)
classified_time = time.perf_counter()

# Compute accuracy
accuracy = 0
for i in range(len(senselab_result)):
  accuracy += int(senselab_result[i][0]==ravdess[i]['emotion'])

end_time = time.perf_counter()
total_time = end_time - start_time
conversion_time = converted_time - start_time
classifying_time = classified_time - converted_time

print('Accuracy (%):', accuracy/len(senselab_result)*100)
print(f"Time for converting to a SenselabDataset: {conversion_time} seconds")
print(f"Time for classifying: {classifying_time} seconds")
print(f"Total time for Senselab classification: {total_time} seconds")

Accuracy (%): 99.02777777777779
Time for converting to a SenselabDataset: 1.3866998060000242 seconds
Time for classifying: 25.773058340000034 seconds
Total time for Senselab classification: 29.622843560999854 seconds
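The aliasing that the filtering step mentioned earlier is meant to prevent can be illustrated without any audio library. The sketch below is a generic signal-processing demonstration and does not reproduce Senselab's or torchaudio's resamplers: it takes a 10 kHz tone sampled at 48 kHz, keeps every third sample to reach 16 kHz with no filtering at all, and locates the strongest frequency with a naive DFT. All names are hypothetical.

```python
import math


def dft_peak_hz(samples, sample_rate):
    """Return the frequency (Hz) of the strongest DFT bin up to Nyquist."""
    n = len(samples)
    best_bin, best_magnitude = 0, 0.0
    for k in range(1, n // 2 + 1):
        real = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(samples))
        imag = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(samples))
        magnitude = math.hypot(real, imag)
        if magnitude > best_magnitude:
            best_bin, best_magnitude = k, magnitude
    return best_bin * sample_rate / n


SOURCE_RATE = 48000
TARGET_RATE = 16000
TONE_HZ = 10000  # above the 8 kHz Nyquist limit of the 16 kHz target rate

# 10 ms of a pure tone at the source rate (480 samples).
source = [math.sin(2 * math.pi * TONE_HZ * i / SOURCE_RATE) for i in range(480)]

# Naive downsampling: keep every third sample, with no low-pass filter first.
naive = source[::3]

peak_source = dft_peak_hz(source, SOURCE_RATE)
peak_naive = dft_peak_hz(naive, TARGET_RATE)
print(f"Peak before downsampling: {peak_source} Hz")
print(f"Peak after naive downsampling: {peak_naive} Hz")
```

A tone above the new Nyquist limit does not disappear; it folds back to 16000 - 10000 = 6000 Hz and shows up as content that was never in the recording. A low-pass filter applied before discarding samples removes such components, and differences in how that filter is designed are one plausible reason two resamplers can produce slightly different model inputs.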


# Dimensional Attributes (Continuous Speech Emotion Recognition)

As mentioned before, the SER task typically comes in two main types: discrete and continuous (dimensional) SER. In the dimensional case, we do not directly classify an emotion but instead estimate continuous features of the voice that correlate highly with specific emotions. We typically focus on three dimensions (though the literature discusses using anywhere from 2 to 29): valence, dominance, and arousal. In the following section we use Senselab to run a model that estimates the dimensional attributes on an SER dataset known as IEMOCAP (Interactive Emotional Dyadic Motion Capture).

IEMOCAP is a rich dataset including audio, video, and facial motion capture data. The authors describe IEMOCAP as "consist[ing] of dyadic sessions where actors perform improvisations or scripted scenarios, specifically selected to elicit emotional expressions. IEMOCAP database is annotated by multiple annotators into categorical labels, such as anger, happiness, sadness, neutrality, as well as dimensional labels such as valence, activation and dominance." For more information on IEMOCAP see their [official website](https://sail.usc.edu/iemocap/index.html) or read their [2008 paper](https://sail.usc.edu/iemocap/Busso_2008_iemocap.pdf).

The model we will be using is a fine-tuned Wav2Vec2 model. Notably, this model was fine-tuned on the MSP-Podcast corpus, a real-world dataset of podcast clips that have been annotated for both discrete and continuous emotion recognition. For more details on MSP-Podcast see their [official website](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html) and for further model details refer to the [HuggingFace model card](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim).

In [11]:
iemocap = load_dataset('AbstractTTS/IEMOCAP', split='train')
print(iemocap)
print(iemocap[0])

Dataset({
    features: ['file', 'audio', 'frustrated', 'angry', 'sad', 'disgust', 'excited', 'fear', 'neutral', 'surprise', 'happy', 'EmoAct', 'EmoVal', 'EmoDom', 'gender', 'transcription', 'major_emotion', 'speaking_rate', 'pitch_mean', 'pitch_std', 'rms', 'relative_db'],
    num_rows: 10039
})
{'file': 'Ses01F_impro01_F000.wav', 'audio': {'path': 'Ses01F_impro01_F000.wav', 'array': array([-0.0050354 , -0.00497437, -0.0038147 , ..., -0.00265503,
       -0.00317383, -0.00418091]), 'sampling_rate': 16000}, 'frustrated': 0.0062500000931322575, 'angry': 0.0062500000931322575, 'sad': 0.0062500000931322575, 'disgust': 0.0062500000931322575, 'excited': 0.0062500000931322575, 'fear': 0.0062500000931322575, 'neutral': 0.949999988079071, 'surprise': 0.0062500000931322575, 'happy': 0.0062500000931322575, 'EmoAct': 2.3333330154418945, 'EmoVal': 2.6666669845581055, 'EmoDom': 2.0, 'gender': 'Female', 'transcription': ' Excuse me.', 'major_emotion': 'neutral', 'speaking_rate': 5.139999866485596, 'p

In [12]:
model_to_use = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
model = HFModel(path_or_uri=model_to_use)
senselab_iemocap =  SenselabDataset.convert_hf_dataset_to_senselab_dataset({'audios': iemocap})
senselab_iemocap.audios = resample_audios(senselab_iemocap.audios, resample_rate=16000)
senselab_result = speech_emotion_recognition_with_hf_models(senselab_iemocap.audios, model, device=device)
print(senselab_result[0])

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



('dominance', {'dominance': 0.33526355028152466, 'valence': 0.3346194326877594, 'arousal': 0.33011701703071594})
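One thing worth checking in the output above is how the three scores relate to each other. The sketch below uses only the printed values and is an observation about this one result, not a confirmed diagnosis of the model or of Senselab.

```python
# Scores copied from the result printed above.
scores = {
    "dominance": 0.33526355028152466,
    "valence": 0.3346194326877594,
    "arousal": 0.33011701703071594,
}

total = sum(scores.values())
spread = max(scores.values()) - min(scores.values())

print(f"Sum of the three scores: {total:.6f}")
print(f"Largest minus smallest score: {spread:.6f}")
```

The scores sum to approximately 1 and are nearly uniform. Independent estimates of valence, arousal, and dominance would not normally be constrained to sum to 1, so this pattern, together with the log message above reporting that the `classifier` and `projector` weights were newly initialized, suggests verifying a few more results (and how the checkpoint's head was loaded) before reading too much into the error values computed below.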


Note that we are running a model trained on a real-world dataset and evaluating it on an acted dataset, so our results might not be as good as in the previous section, where we evaluated a model on the data it was trained on. We evaluate on IEMOCAP partly because it is easily accessible and publicly available, whereas MSP-Podcast is not currently available through HuggingFace.

To evaluate the model's performance, we will use the mean squared error (MSE) since it is simple to compute, though the literature describes many other metrics. Importantly, IEMOCAP dimensions are given on a scale of 1.0 to 5.0, whereas our model outputs dimensions from 0.0 to 1.0, so we need to convert both to the same scale for a fair evaluation.

In [14]:
valence_sum = 0
dominance_sum = 0
arousal_sum = 0

for i in range(len(senselab_result)):
  # Convert model outputs from [0,1] to [1,5]
  model_valence = 4*senselab_result[i][1]['valence']+1
  model_dominance = 4*senselab_result[i][1]['dominance']+1
  model_arousal = 4*senselab_result[i][1]['arousal']+1

  ground_truth_valence = iemocap[i]['EmoVal']
  ground_truth_dominance = iemocap[i]['EmoDom']
  ground_truth_arousal = iemocap[i]['EmoAct'] # known as activation in some of the literature

  valence_sum += (model_valence-ground_truth_valence)**2
  dominance_sum += (model_dominance-ground_truth_dominance)**2
  arousal_sum += (model_arousal-ground_truth_arousal)**2

print(f"Mean squared error (MSE) of valence: {valence_sum/len(senselab_result)}")
print(f"Mean squared error (MSE) of dominance: {dominance_sum/len(senselab_result)}")
print(f"Mean squared error (MSE) of arousal: {arousal_sum/len(senselab_result)}")

Mean squared error (MSE) of valence: 1.014115344351324
Mean squared error (MSE) of dominance: 1.338170928954536
Mean squared error (MSE) of arousal: 1.0744655428555485
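As noted earlier, MSE is only one of several possible metrics. One that is commonly reported for dimensional emotion recognition is the concordance correlation coefficient (CCC), which rewards predictions that both correlate with the ground truth and match its mean and variance. The following is a minimal pure-Python sketch with toy numbers; the function names are hypothetical and the code is not part of Senselab.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


def concordance_correlation_coefficient(predictions, targets):
    """CCC = 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)."""
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets)) / n
    denominator = var_p + var_t + (mean_p - mean_t) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov / denominator


# Toy values on the 1-5 scale: predictions follow the targets with a constant offset.
targets = [1.0, 2.0, 3.0]
predictions = [2.0, 3.0, 4.0]

mse = mean_squared_error(predictions, targets)
ccc = concordance_correlation_coefficient(predictions, targets)
print(f"MSE: {mse}")
print(f"CCC: {ccc:.4f}")
```

In the toy example the predictions are perfectly correlated with the targets but shifted by one point, so the CCC falls to 4/7 instead of 1. Unlike plain correlation, CCC penalizes such systematic offsets, which matters when model outputs and annotations live on different scales.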
