# Speech Emotion Recognition (SER) Tutorial

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/audio/speech_emotion_recognition.ipynb)

In this tutorial, we use Senselab for an audio classification task: Speech Emotion Recognition (SER). Note: we recommend running this tutorial on a GPU, since some of the models are slow on CPU.

## Background
SER is the task of recognizing one or more emotions in an audio clip. Transformers have been shown to perform well on SER, and the task can be useful in therapeutics as well as call center technologies. SER typically comes in two main formats: continuous and discrete emotion recognition. In the continuous setting, SER models predict continuous values (e.g. valence, arousal, dominance) for speech segments; these values are highly correlated with different perceived emotions. The discrete SER task, on the other hand, takes an audio clip and classifies it with one (multi-class) or more (multi-label) emotions. This tutorial focuses on the discrete case, since predicting the dimensional attributes in the continuous setting is a form of regression, a related but different task from classification.
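To make the multi-class versus multi-label distinction concrete, here is a minimal sketch. The function names, the threshold of 0.5, and the scores are illustrative assumptions, not part of Senselab or of any model used in this tutorial.

```python
from typing import Dict, List


def multi_class_decision(scores: Dict[str, float]) -> str:
    """Multi-class: return the single emotion with the highest score."""
    return max(scores, key=scores.get)


def multi_label_decision(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label: return every emotion whose score reaches the threshold."""
    return [label for label, score in scores.items() if score >= threshold]


# Hypothetical, independent per-emotion scores for one clip (illustrative values only).
example_scores = {"angry": 0.72, "disgust": 0.55, "happy": 0.05, "sad": 0.10}

print(multi_class_decision(example_scores))  # angry
print(multi_label_decision(example_scores))  # ['angry', 'disgust']
```

In the multi-class case exactly one label is returned per clip, which is the setting used in the rest of this tutorial.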

# Installing Requirements

The only requirement for this tutorial is the Senselab package, since many of the other requirements (transformers, huggingface, datasets, etc.) are included with the package.

In [None]:
def is_colab() -> bool:
    """Check if running on Google Colab."""
    try:
        import google.colab
        return True
    except ImportError:
        return False

if is_colab():
    %pip install senselab
else:
    print("Not running on Colab. Skipping installation.")

Not running on Colab. Skipping installation.


In [2]:
import time

import torch
import torchaudio.transforms as T
from datasets import load_dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

from senselab.audio.tasks.classification.speech_emotion_recognition import classify_emotions_from_speech
from senselab.audio.tasks.preprocessing import resample_audios
from senselab.utils.data_structures.dataset import SenselabDataset
from senselab.utils.data_structures.device import DeviceType
from senselab.utils.data_structures.model import HFModel

device = DeviceType.CUDA if torch.cuda.is_available() else DeviceType.CPU

INFO:speechbrain.utils.quirks:Applied quirks (see `speechbrain.utils.quirks`): [disable_jit_profiling, allow_tf32]
INFO:speechbrain.utils.quirks:Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []


# Discrete Speech Emotion Recognition

For discrete SER, we will be focusing on the RAVDESS dataset. The authors describe this dataset as "gender balanced consisting of 24 professional actors, vocalizing lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions." While the dataset is multi-modal, we will only be focusing on the speech portions of the dataset. For more information about RAVDESS, see their [paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0196391).

In [3]:
ravdess = load_dataset('xbgoose/ravdess', split='train')

We can print the dataset below to get some basic information about it. The Audio feature is based on HuggingFace's Audio implementation for storing the dataset, which includes the path, the actual audio data array, and the sampling rate. The label we want to classify is the Emotion feature, but the dataset is rich in other information that could be useful depending on the model, including an actor ID, the actor's gender, and the statement the actor is speaking.

In [4]:
print(ravdess)
print(ravdess[0])

Dataset({
    features: ['audio', 'modality', 'vocal_channel', 'emotion', 'emotional_intensity', 'statement', 'repetition', 'actor', 'gender'],
    num_rows: 1440
})
{'audio': {'path': '03-01-05-01-02-01-16.wav', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -3.05175781e-05, -6.10351562e-05, -3.05175781e-05]), 'sampling_rate': 48000}, 'modality': 'audio-only', 'vocal_channel': 'speech', 'emotion': 'angry', 'emotional_intensity': 'normal', 'statement': 'Dogs are sitting by the door', 'repetition': '1st repetition', 'actor': 16, 'gender': 'female'}


The model we will use is a fine-tuned DistilHuBERT, chosen because it shows good performance on this dataset (notably, it has been trained on it) and because its model card contains the descriptive information that Senselab uses to determine whether it can immediately run the model in our classification pipeline. More details on the model can be found [here](https://huggingface.co/pollner/distilhubert-finetuned-ravdess).

In [5]:
model_to_use = 'pollner/distilhubert-finetuned-ravdess'

We will use Senselab's DeviceType to store the device we want to run inference on, since it is compatible with Senselab and can also be given to HuggingFace pipelines by using `device.value`. As mentioned in the introduction, it is best to run these models with a GPU due to their size and the time it takes to run the computations.
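The idea of an enum whose `.value` can be handed to another library can be sketched as follows. `DeviceTypeSketch` and `pick_device` are hypothetical stand-ins written for illustration; they are not Senselab's actual `DeviceType` implementation, and the string values are an assumption.

```python
from enum import Enum


class DeviceTypeSketch(Enum):
    """Hypothetical stand-in for an enum-based device type (not Senselab's code)."""

    CPU = "cpu"
    CUDA = "cuda"


def pick_device(cuda_available: bool) -> DeviceTypeSketch:
    """Prefer CUDA when it is available, otherwise fall back to CPU."""
    return DeviceTypeSketch.CUDA if cuda_available else DeviceTypeSketch.CPU


device_sketch = pick_device(cuda_available=False)
print(device_sketch)        # DeviceTypeSketch.CPU
print(device_sketch.value)  # cpu
```

The enum member is what gets passed around in typed code, while `.value` is the plain string that a library expecting a device name can consume.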

In [6]:
print(device)

DeviceType.CPU


## Running with HuggingFace vs. Senselab

In this section we show some of the differences between using Senselab and using HuggingFace directly, including the amount of code that needs to be written and a timing comparison. In most cases we shouldn't expect huge differences in time since, under the hood, Senselab uses some of HuggingFace's APIs, but in doing so Senselab can optimize its calls while hiding this complexity from the user.
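The cells in this section time each stage by taking `time.perf_counter()` readings by hand. An equivalent pattern, sketched here as an optional alternative, is a small context manager; the helper name `timed` and the placeholder workloads are assumptions made for illustration.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("resampling", timings):
    _ = [x * 0.5 for x in range(100_000)]  # placeholder for the real work
with timed("classifying", timings):
    _ = sum(range(100_000))  # placeholder for the real work

for stage, seconds in timings.items():
    print(f"Time for {stage}: {seconds:.4f} seconds")
```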

Like many models, the one we are going to use requires the audio to be at a specific sampling rate, in this case 16 kHz, whereas our audio starts at 48 kHz.

We will then run the classification pipeline and report accuracy as the fraction of clips classified correctly. In SER terms this is the weighted accuracy; here it is close to the unweighted (per-class average) accuracy because the classes are nearly balanced within the dataset. Since both solutions run the same model on the same dataset, we should expect to get the same results.
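The relationship between weighted and unweighted accuracy can be checked with a short sketch. The two helper functions and the toy labels below are illustrative assumptions; weighted accuracy is taken to be the overall fraction correct and unweighted accuracy the mean of per-class recalls, as is common in the SER literature.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Overall fraction of correctly classified samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Mean of per-class recalls, so every class counts equally."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Balanced toy labels: both metrics agree.
balanced_true = ["angry", "angry", "sad", "sad"]
balanced_pred = ["angry", "sad", "sad", "sad"]
print(weighted_accuracy(balanced_true, balanced_pred))    # 0.75
print(unweighted_accuracy(balanced_true, balanced_pred))  # 0.75

# Imbalanced toy labels: the metrics diverge.
imbalanced_true = ["angry", "angry", "angry", "sad"]
imbalanced_pred = ["angry", "angry", "angry", "angry"]
print(weighted_accuracy(imbalanced_true, imbalanced_pred))    # 0.75
print(unweighted_accuracy(imbalanced_true, imbalanced_pred))  # 0.5
```

When every class has the same number of samples, the two metrics are equal; the more imbalanced the classes, the more they can differ.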

### Running with HuggingFace

Here we resample using functionality from torchaudio. Note that we do this with a dataset mapping, since we need to maintain the structure of the Audio objects from HuggingFace in order to use their pipeline.

The output of the HuggingFace pipeline is a mapping of labels to their probabilities, where the highest probability will be taken as the model's recognized emotion.

In [7]:
start_time = time.perf_counter()
# Resampling the dataset using torchaudio

resampler = T.Resample(48000, 16000)

resampled_ravdess = ravdess.map(lambda sample: {'audio':{
    'array': resampler(torch.Tensor(sample['audio']['array'])).numpy(),
    'path': sample['audio']['path'],
    'sampling_rate': 16000
}})

resample_time = time.perf_counter()

classification_pipeline = pipeline(
        task="audio-classification",
        model=model_to_use,
        device=device.value,
    )

emotion_classification = []
for out in classification_pipeline(KeyDataset(resampled_ravdess,"audio")):
  highest_emotion_score_label = max(out, key=lambda x: x['score'])
  emotion_classification.append(highest_emotion_score_label['label'])

classification_time = time.perf_counter()

accuracy = 0
for i in range(len(emotion_classification)):
  accuracy += int(emotion_classification[i]==resampled_ravdess[i]['emotion'])

end_time = time.perf_counter()
total_time = end_time - start_time
resampling_time = resample_time - start_time
classifying_time = classification_time - resample_time

print('Accuracy (%):', accuracy/len(emotion_classification)*100)
print(f"Time for resampling: {resampling_time} seconds")
print(f"Time for classifying, including classification pipeline setup: {classifying_time} seconds")
print(f"Total time for HuggingFace classification: {total_time} seconds")

Device set to use cpu


Accuracy (%): 99.02777777777779
Time for resampling: 0.007214999990537763 seconds
Time for classifying, including classification pipeline setup: 76.64158837508876 seconds
Total time for HuggingFace classification: 76.9297172910301 seconds


### Running with Senselab

As above, we test discrete classification using the same model and dataset, but now perform all of the same steps with the `senselab` package and compare the time they take. The first step is to convert the HuggingFace dataset to a SenselabDataset. To give the fairest performance comparison, we include the time for this conversion in our final timing. After the conversion, we again need to resample, but here resampling is a simple one-line statement. Note that most senselab functions take lists of Audio objects, which we get from the SenselabDataset class through its `audios` attribute.

The output of the audio classification methods (including the SER method we use here) is a list of `AudioClassificationResult` objects, which combine the output labels and their scores and by default list them from highest to lowest score, so the first label is normally the most probable one. The `AudioClassificationResult` has a variety of utility functions like `top_label`, which we use here, but the internal lists that the classification pipeline returned can also be accessed with the `labels` and `scores` attributes.
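To illustrate the shape of such a result object, here is a minimal sketch of a container that sorts labels by score and exposes a `top_label` method. `ClassificationResultSketch` is a hypothetical stand-in written for this example; it is not Senselab's `AudioClassificationResult`, whose real implementation may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClassificationResultSketch:
    """Hypothetical stand-in for a classification result (not Senselab's class)."""

    labels: List[str]
    scores: List[float]

    def __post_init__(self) -> None:
        # Reorder both lists so that they run from highest to lowest score.
        order = sorted(range(len(self.scores)), key=lambda i: self.scores[i], reverse=True)
        self.labels = [self.labels[i] for i in order]
        self.scores = [self.scores[i] for i in order]

    def top_label(self) -> str:
        """Return the label with the highest score."""
        return self.labels[0]


# Illustrative values only.
result = ClassificationResultSketch(
    labels=["happy", "angry", "neutral"],
    scores=[0.10, 0.85, 0.05],
)
print(result.top_label())  # angry
print(result.labels)       # ['angry', 'happy', 'neutral']
print(result.scores)       # [0.85, 0.1, 0.05]
```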

In [8]:
start_time = time.perf_counter()

# Create an HFModel object and convert the HuggingFace-style dataset to a SenselabDataset
model = HFModel(path_or_uri=model_to_use)
senselab_ravdess = SenselabDataset.convert_hf_dataset_to_senselab_dataset({'audios': ravdess})

converted_time = time.perf_counter()

# Need to resample the audios
senselab_ravdess.audios = resample_audios(senselab_ravdess.audios, resample_rate=16000)
resampled_time = time.perf_counter()

# Run SER classification
senselab_result = classify_emotions_from_speech(senselab_ravdess.audios, model, device=device)
classified_time = time.perf_counter()

# show what a result from Senselab looks like
print(senselab_result[0])

# Compute accuracy
accuracy = 0
for i in range(len(senselab_result)):
  accuracy += int(senselab_result[i].top_label()==ravdess[i]['emotion'])

end_time = time.perf_counter()
total_time = end_time - start_time
conversion_time = converted_time - start_time
resampling_time = resampled_time - converted_time
classifying_time = classified_time - resampled_time

print('Accuracy (%):', accuracy/len(senselab_result)*100)
print(f"Time for converting to a SenselabDataset: {conversion_time} seconds")
print(f"Time for resampling: {resampling_time} seconds")
print(f"Time for classifying: {classifying_time} seconds")
print(f"Total time for Senselab classification: {total_time} seconds")

Device set to use cpu
2025-01-16 13:32:11,552 - senselab - INFO - Time taken to initialize the hugging face audio classification             pipeline: 0.16 seconds
INFO:senselab:Time taken to initialize the hugging face audio classification             pipeline: 0.16 seconds
2025-01-16 13:33:30,880 - senselab - INFO - Time taken for classifying the audios: 79.33 seconds
INFO:senselab:Time taken for classifying the audios: 79.33 seconds


labels=['angry', 'disgust', 'surprised', 'happy', 'neutral'] scores=[0.9938356280326843, 0.0024896019604057074, 0.0016016679583117366, 0.0008156367111951113, 0.0006909818621352315]
Accuracy (%): 98.95833333333334
Time for converting to a SenselabDataset: 0.9162732081022114 seconds
Time for resampling: 5.8923929169541225 seconds
Time for classifying: 79.60864354204386 seconds
Total time for Senselab classification: 87.08913145808037 seconds


### Comparing the results

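In the run recorded above, classification time is roughly equivalent between the two approaches (about 77 seconds with HuggingFace and about 80 seconds with Senselab), which is expected since Senselab calls the same HuggingFace pipeline under the hood. The remaining difference in total time (about 77 versus 87 seconds) comes from dataset conversion and resampling. Note that the HuggingFace resampling step reports only about 0.007 seconds here, which most likely means that `map` reused a cached result rather than resampling again, so these totals understate the cost of resampling with torchaudio. When resampling is actually recomputed, it is the step where Senselab can be noticeably faster.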


You might wonder why the accuracy with Senselab is slightly lower than with the traditional HuggingFace approach. This is an artifact of the resampling process in Senselab, which applies some filtering before downsampling in order to prevent or reduce aliasing. To reproduce the same result, we can use the resampled audios that were generated by torchaudio. We will ignore timing differences here since this run skips the resampling step.
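Why filtering before downsampling matters can be seen in a small, self-contained sketch. It compares naive decimation (keeping every third sample) with decimation after a low-pass filter, using a 12 kHz test tone that lies above the 8 kHz Nyquist limit of a 16 kHz signal. The windowed-sinc filter, its 101 taps, and the test tone are assumptions chosen for illustration; this is not the filter used by Senselab or torchaudio.

```python
import math
from typing import List


def lowpass_taps(cutoff_hz: float, sample_rate: int, num_taps: int = 101) -> List[float]:
    """Hamming-windowed sinc low-pass filter, normalized to unit gain at DC."""
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        sinc = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(signal: List[float], taps: List[float]) -> List[float]:
    """Apply a symmetric FIR filter, keeping only fully overlapping outputs."""
    m = len(taps)
    return [
        sum(taps[j] * signal[i + j] for j in range(m))
        for i in range(len(signal) - m + 1)
    ]


def rms(signal: List[float]) -> float:
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


SRC_RATE, DST_RATE = 48_000, 16_000
FACTOR = SRC_RATE // DST_RATE  # keep one sample out of every 3
TONE_HZ = 12_000  # cannot be represented at 16 kHz (Nyquist limit is 8 kHz)

tone = [math.sin(2 * math.pi * TONE_HZ * n / SRC_RATE) for n in range(4800)]

# Naive decimation: the 12 kHz tone folds down to 4 kHz at full strength.
naive = tone[::FACTOR]

# Filtered decimation: remove content above 8 kHz first, then decimate.
filtered = fir_filter(tone, lowpass_taps(DST_RATE / 2, SRC_RATE))[::FACTOR]

print(f"RMS after naive decimation:    {rms(naive):.3f}")  # about 0.707
print(f"RMS after filtered decimation: {rms(filtered):.3f}")  # close to zero
```

Because different resamplers use different filters, they produce slightly different waveforms from the same input, and a model can occasionally flip its prediction on a borderline clip.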

In [9]:
start_time = time.perf_counter()

# Create an HFModel object and convert the HuggingFace-style dataset to a SenselabDataset
model = HFModel(path_or_uri=model_to_use)
senselab_ravdess = SenselabDataset.convert_hf_dataset_to_senselab_dataset({'audios': resampled_ravdess})

converted_time = time.perf_counter()

# Run SER classification
senselab_result = classify_emotions_from_speech(senselab_ravdess.audios, model, device=device)
classified_time = time.perf_counter()

# Compute accuracy
accuracy = 0
for i in range(len(senselab_result)):
  accuracy += int(senselab_result[i].top_label()==ravdess[i]['emotion'])

end_time = time.perf_counter()
total_time = end_time - start_time
conversion_time = converted_time - start_time
classifying_time = classified_time - converted_time

print('Accuracy (%):', accuracy/len(senselab_result)*100)
print(f"Time for converting to a SenselabDataset: {conversion_time} seconds")
print(f"Time for classifying: {classifying_time} seconds")
print(f"Total time for Senselab classification: {total_time} seconds")

2025-01-16 13:33:32,157 - senselab - INFO - Time taken to initialize the hugging face audio classification             pipeline: 0.00 seconds
INFO:senselab:Time taken to initialize the hugging face audio classification             pipeline: 0.00 seconds
2025-01-16 13:34:47,775 - senselab - INFO - Time taken for classifying the audios: 75.61 seconds
INFO:senselab:Time taken for classifying the audios: 75.61 seconds


Accuracy (%): 99.02777777777779
Time for converting to a SenselabDataset: 0.4978597089648247 seconds
Time for classifying: 75.7225547080161 seconds
Total time for Senselab classification: 76.83492187492084 seconds
