# Speech Emotion Recognition

Speech Emotion Recognition (SER) is the area of speech processing concerned with detecting and classifying human emotions from spoken language. Its applications include customer service automation, mental health monitoring, and human-computer interaction. A system that interprets a speaker's emotional state correctly can adapt its responses to that state, which improves the user experience.

# 1. Task Explanation & Importance

SER identifies human emotions (e.g., happiness, anger, sadness) from vocal features such as pitch, tone, and intensity. It is critical for:

- **Mental health monitoring**: Detecting depression or anxiety from speech patterns (e.g., the Vietnamese Dynamic Attention-GRU model for depression diagnosis).
- **Human-computer interaction**: Enabling empathetic AI in chatbots, virtual assistants, and robotics.
- **Customer service**: Analyzing call center interactions to improve user experience.
- **Healthcare**: Assisting in autism therapy or PTSD treatment by tracking emotional states.
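
Two of the vocal features named above, pitch and intensity, can be estimated directly from a waveform. The sketch below is a minimal illustration, not a production feature extractor: `autocorr_pitch` and `rms_intensity` are hypothetical helper names, the 50-500 Hz search range is an assumption, and the input is a synthetic 200 Hz tone rather than real speech.

```python
import numpy as np


def rms_intensity(frame):
    """Root-mean-square energy of one frame, a simple proxy for intensity."""
    return float(np.sqrt(np.mean(frame ** 2)))


def autocorr_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    The strongest autocorrelation peak inside the allowed lag range is
    taken as the pitch period. The fmin/fmax defaults are assumptions.
    """
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep the non-negative lags only
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sr / best_lag


# Synthetic 40 ms frame: a 200 Hz sine with amplitude 0.5
sr = 16000
t = np.arange(int(0.04 * sr)) / sr
frame = 0.5 * np.sin(2 * np.pi * 200.0 * t)

pitch_hz = autocorr_pitch(frame, sr)
intensity = rms_intensity(frame)
```

On real speech these values would be computed frame by frame, and unvoiced frames would need separate handling, which this sketch omits.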

# 2. Importance in the Real World

SER plays a crucial role in various domains:

- **Healthcare**: Detecting depression, stress, or other emotional disorders.
- **Customer Support**: Identifying frustrated or dissatisfied customers to improve service quality.
- **Human-Robot Interaction**: Enhancing AI and robotic systems to respond empathetically.
- **Security & Surveillance**: Identifying distress in emergency call systems.

# 3. State-of-the-Art (SOTA) Models

Several advanced models and techniques have been developed for SER:

## a. Deep Learning-Based Models

- **Convolutional Neural Networks (CNNs)**: Extract local time-frequency features from speech spectrograms.
- **Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks**: Capture temporal dependencies in speech signals.
- **Transformer-based Models**: Models such as Wav2Vec 2.0 and Speech2Text use self-attention to model long-range context and often outperform CNN and RNN baselines.
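
The core operation behind the Transformer-based models listed above is self-attention, in which every frame is re-expressed as a weighted mix of all frames in the utterance. The following is a minimal single-head sketch in NumPy. The random projection matrices stand in for learned weights, and the frame count and dimensions are arbitrary assumptions. Real models add multiple heads, positional information, and stacked layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (time, dim) input."""
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (time, time) similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 40))  # 50 synthetic frames, 40 features each
d_k = 16                                # projection size (assumed)
w_q, w_k, w_v = (rng.standard_normal((40, d_k)) * 0.1 for _ in range(3))

context, weights = self_attention(frames, w_q, w_k, w_v)
```

Because `weights` has one row per frame and one column per frame, the last frame can attend directly to the first, which is the long-range behavior that CNNs and LSTMs struggle to provide.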

## b. Hybrid Approaches

- **CNN-LSTM**: Combines a CNN for feature extraction with an LSTM for temporal modeling.
- **Pretrained Models**: Self-supervised, BERT-like architectures (e.g., Wav2Vec 2.0) that supply feature-rich embeddings to a downstream emotion classifier.


# 4. Strengths and Limitations of SOTA Models

| Model                  | Strengths                              | Limitations                                |
|------------------------|----------------------------------------|--------------------------------------------|
| **CNN**                | Efficient for feature extraction       | Struggles with long-term dependencies      |
| **RNN/LSTM**           | Good for sequential data processing    | Computationally expensive                  |
| **Transformers**       | Superior accuracy, less manual feature engineering | Requires large-scale data and high computational power |
| **CNN-LSTM**           | Effective for dynamic speech patterns  | Complexity increases training time         |
| **Dynamic Attention-GRU** | High accuracy for specific tasks    | Limited cross-lingual generalization       |
| **MDRE**               | Resolves class bias with multimodal fusion | Heavy computational requirements          |
| **emotion2vec**        | Generalizes across languages and tasks | High pre-training data and computational cost |

## 4.1 Comparative Analysis of Recent Models

| Title                          | Year | Dataset(s)                                    | Models                                      | Performance Metrics         |
|--------------------------------|------|-----------------------------------------------|---------------------------------------------|-----------------------------|
| Ensemble 1D                    | 2021 | TESS, EMO-DB, RAVDESS, SAVEE, CREMA-D         | DNN, CNN, Ensemble                          | Precision, Recall, Accuracy |
| Novel approach                 | 2022 | TESS, SAVEE, RAVDESS (Survey)                 | -                                           | -                           |
| Wavelet Packet                 | 2020 | RAVDESS, SUSAS, ESD                           | Gradient Boosting, Random Forest            | Accuracy                    |
| Selective Interpolation        | 2020 | SAVEE, EMO-DB                                 | SVM, Decision Tree, KNN, LR, Naive Bayes    | Precision, Recall, Accuracy |
| Two way feature                | 2022 | RAVDESS, TESS                                 | Decision Tree, Random Forest, MLP, Proposed | Accuracy                    |
| A real time emotion            | 2019 | RAVDESS, SAVEE                                | SVM, KNN, Gradient Boosting                 | Accuracy                    |
| 3D CNN                         | 2019 | SAVEE, RML                                    | SVM, Random Forest, 2D CNN, Proposed        | Accuracy                    |
| SER using clustering           | 2021 | RAVDESS, SAVEE                                | SVM                                         | Accuracy                    |
| Clustering based Deep BiLSTM   | 2020 | RAVDESS, EMO-DB                               | CNN, RCNN, Proposed                         | Precision, Recall, F1 Score |
| Feature extraction algorithm   | 2020 | RAVDESS, EMO-DB                               | SVM, MLP, DT, KNN                           | Accuracy                    |
| SER using ML                   | 2020 | SAVEE                                         | CNN                                         | Accuracy                    |
| SER using CNN                  | 2020 | RAVDESS, TESS                                 | CNN                                         | Accuracy                    |
| Using speech signal            | 2019 | RML, SUSAS                                    | Random Forest, SVM                          | Precision, Accuracy         |
| Audio-textual emotion recognition | 2019 | TESS                                         | CNN, LSTM                                   | Accuracy, Recall            |
| Speech Recognition using MFCC  | 2020 | SAVEE, EMO-DB                                 | SVM                                         | Accuracy                    |
| Automatic SER                  | 2019 | CREMA-D, TESS                                 | MLR, SVM                                    | Accuracy                    |

# 5. Evaluation Metrics in SER

SER models are evaluated using different performance metrics:

- **Accuracy**: The fraction of utterances classified correctly.
- **Precision, Recall, and F1-score**: Assess class-wise performance.
- **Confusion Matrix**: Visualizes misclassification patterns.
- **Mean Squared Error (MSE) & Root Mean Square Error (RMSE)**: Evaluate regression-based emotion models.
- **Weighted Accuracy (WA)**: The overall accuracy across all utterances; it is biased toward majority classes.
- **Unweighted Accuracy (UA)**: The mean of the per-class recalls; every class counts equally, which makes it critical for imbalanced datasets.
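
The difference between WA and UA is easiest to see on a small imbalanced example. The sketch below assumes the usual SER convention that WA is plain accuracy and UA is the mean of the per-class recalls. The function names and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy imbalanced test set: 8 neutral and 2 angry utterances.
# The classifier predicts "neutral" for everything.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

wa = weighted_accuracy(y_true, y_pred)    # 0.8
ua = unweighted_accuracy(y_true, y_pred)  # 0.5
```

A classifier that never predicts the minority class still reaches a WA of 0.8 here, while its UA of 0.5 exposes the failure.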

## Strengths and Limitations

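- **Accuracy** gives a single overall figure but can be misleading on imbalanced datasets, where a model can score well by favoring the majority class.
- **Precision & Recall** expose per-class behavior in multi-class SER, showing which emotions are over-predicted and which are missed.
- **F1-score** balances precision and recall; its macro-averaged form is well suited to imbalanced datasets.
- **Confusion Matrix** allows deeper insight into misclassification tendencies but does not reduce to a single number for comparing models.
- **WA** is easy to interpret but is biased toward majority classes.
- **UA** weights every class equally, so it is the more informative headline metric on imbalanced datasets.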
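
A confusion matrix and a macro-averaged F1-score can be computed together, since the per-class true positives, false positives, and false negatives all come from the matrix. This is a minimal pure-Python sketch with hypothetical function names and made-up labels; rows are true classes and columns are predicted classes.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with true classes as rows and predictions as columns."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    f1_scores = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                        # rest of row c
        fp = sum(matrix[r][c] for r in range(n)) - tp   # rest of column c
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n


labels = ["angry", "happy", "sad"]
y_true = ["angry", "angry", "happy", "happy", "sad", "sad"]
y_pred = ["angry", "happy", "happy", "happy", "sad", "angry"]

cm = confusion_matrix(y_true, y_pred, labels)
score = macro_f1(cm)
```

Reading across the first row of `cm` shows that one "angry" utterance was mistaken for "happy", which is the kind of pattern a single accuracy figure hides.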

# 6. Open Challenges and Future Opportunities

- **Data Scarcity**: High-quality labeled emotional speech datasets are small and expensive to collect. Self-supervised models (e.g., emotion2vec) offer an opportunity to reduce reliance on labeled data.
- **Cross-Language and Cross-Cultural Generalization**: SER models often fail to generalize across languages. Models trained on Western datasets (e.g., IEMOCAP) underperform on non-English languages (e.g., Bangla in SUBESCO).
- **Noise Robustness**: Many models perform poorly in real-world settings with background noise (e.g., call centers).
- **Multimodal Emotion Recognition**: Combining text, facial or video cues, or physiological signals with speech could improve accuracy (e.g., MDRE’s audio-text fusion).
- **Lightweight Models for Edge Devices**: Optimizing models for deployment on low-resource hardware.
- **Ethical Concerns**: Bias in emotion labeling (e.g., cultural stereotypes) and privacy risks.
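
One common way to work on the noise-robustness challenge listed above is to augment clean training audio with noise at a controlled signal-to-noise ratio (SNR). The sketch below is an illustration under stated assumptions: `add_noise_at_snr` is a hypothetical helper, the "speech" is a synthetic sine tone, and the noise is white Gaussian noise rather than recorded call-center audio.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db):
    """Mix noise into a clean signal so the result has the requested SNR (dB)."""
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / scaled_noise_power
    # equals 10 ** (snr_db / 10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = 0.3 * np.sin(2 * np.pi * 220.0 * t)  # synthetic stand-in for speech
noise = rng.standard_normal(sr)              # white noise (assumed noise type)

noisy = add_noise_at_snr(clean, noise, snr_db=10.0)

# Check the achieved SNR from the signals themselves
residual = noisy - clean
measured_snr_db = 10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))
```

Training on copies of each utterance at several SNR levels is one option; whether it helps on a given dataset has to be verified empirically.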


# 7. Data Preprocessing

```python
import librosa
import numpy as np


def extract_mfcc(file_path, n_mfcc=40):
    """
    Extract Mel-frequency cepstral coefficients (MFCCs) from an audio file.

    Parameters:
        file_path (str): Path to the audio file.
        n_mfcc (int): Number of MFCCs to return. Default is 40.

    Returns:
        numpy.ndarray: A 1D array of length n_mfcc containing the mean of
        each MFCC over time.

    MFCCs are commonly used in speech recognition, speaker identification,
    and audio classification. librosa loads the audio and extracts the
    MFCCs; numpy computes their mean.
    """
    # Resample to 22,050 Hz so every file shares one sampling rate
    y, sr = librosa.load(file_path, sr=22050)
    # mfcc has shape (n_mfcc, n_frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Average over frames to get one fixed-length vector per file
    return np.mean(mfcc.T, axis=0)
```
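
`extract_mfcc` averages over time and returns one 40-value vector per file, whereas a convolutional or recurrent model consumes frame-level features of shape `(n_mfcc, n_frames)`. If frame-level MFCCs are kept instead, clips of different lengths must be brought to a common length before batching. The helper below is a hypothetical sketch of that step; the name `pad_or_truncate` and the default of 200 frames are assumptions, and random arrays stand in for real MFCC matrices.

```python
import numpy as np


def pad_or_truncate(mfcc, max_frames=200):
    """Return an (n_mfcc, max_frames) array.

    Short inputs are zero-padded on the right; long inputs are cropped.
    """
    n_mfcc, n_frames = mfcc.shape
    if n_frames >= max_frames:
        return mfcc[:, :max_frames]
    padded = np.zeros((n_mfcc, max_frames), dtype=mfcc.dtype)
    padded[:, :n_frames] = mfcc
    return padded


rng = np.random.default_rng(0)
short_clip = rng.standard_normal((40, 120))  # stand-in for a short utterance
long_clip = rng.standard_normal((40, 350))   # stand-in for a long utterance

# Stack into a batch of shape (batch, n_mfcc, max_frames)
batch = np.stack([pad_or_truncate(short_clip), pad_or_truncate(long_clip)])
```

The resulting `(batch, 40, 200)` layout matches what a `Conv1d` layer with `in_channels=40` expects.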

# 8. Model Training

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Define the CNNLSTM model class
class CNNLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        # 1D convolution over time; the 40 MFCCs are the input channels
        self.conv1 = nn.Conv1d(in_channels=40, out_channels=64, kernel_size=3, stride=1)
        # ReLU activation function
        self.relu = nn.ReLU()
        # LSTM over the convolved frames
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        # Fully connected output layer
        self.fc = nn.Linear(128, 4)  # Assuming 4 emotion classes

    def forward(self, x):
        # x: (batch, 40, time), i.e. frame-level MFCCs rather than a
        # time-averaged feature vector
        x = self.conv1(x)         # (batch, 64, time - 2)
        x = self.relu(x)
        # The LSTM (batch_first=True) expects (batch, time, features)
        x = x.permute(0, 2, 1)    # (batch, time - 2, 64)
        x, _ = self.lstm(x)       # (batch, time - 2, 128)
        # Apply the fully connected layer to the last output of the LSTM
        x = self.fc(x[:, -1, :])  # (batch, 4)
        return x
```

```python
model = CNNLSTM()

# Define the loss function (CrossEntropyLoss is suitable for classification tasks)
criterion = nn.CrossEntropyLoss()

# Define the optimizer (Adam optimizer with learning rate 0.001)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Function to train the model
def train_model(model, train_loader, criterion, optimizer, epochs=10):
    model.train()  # Enable training mode
    for epoch in range(epochs):
        running_loss = 0.0
        # Loop over each batch in the training data
        for batch_X, batch_y in train_loader:
            # Zero the gradients for the optimizer
            optimizer.zero_grad()
            # Forward pass: compute the model output
            outputs = model(batch_X)
            # Compute the loss
            loss = criterion(outputs, batch_y)
            # Backward pass: compute the gradients
            loss.backward()
            # Update the model parameters
            optimizer.step()
            running_loss += loss.item()
        # Print the mean batch loss for the current epoch
        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader):.4f}")
```
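
The training loop needs a `DataLoader` that yields `(features, labels)` batches. The sketch below shows one way to build it with `TensorDataset` and run a few epochs end to end. Everything in it is a stand-in: the data are random tensors, the tensor sizes are assumptions, and `toy_model` is a deliberately tiny hypothetical classifier used so that the block runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 32 clips, 40 MFCCs, 100 frames, 4 emotion classes
X = torch.randn(32, 40, 100)
y = torch.randint(0, 4, (32,))
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)

# Tiny stand-in classifier (hypothetical); any model that accepts
# (batch, 40, 100) input and returns (batch, 4) logits would fit here
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(40 * 100, 4))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.001)

epoch_losses = []
for epoch in range(3):
    total = 0.0
    for batch_X, batch_y in loader:
        optimizer.zero_grad()
        loss = criterion(toy_model(batch_X), batch_y)
        loss.backward()
        optimizer.step()
        total += loss.item()
    # Mean batch loss for this epoch
    epoch_losses.append(total / len(loader))
```

With real data, `X` would hold frame-level features and `y` integer class indices of dtype `torch.long`, which is what `CrossEntropyLoss` expects.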

# 9. PyTorch CRNN Model for SER

```python
import torch
import torch.nn as nn

# Define the CRNN model class for Speech Emotion Recognition
class SER_CRNN(nn.Module):
    def __init__(self, n_mels=128, num_classes=5):
        super().__init__()
        # CNN layers; each MaxPool2d(2) halves both the mel and time axes
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Bidirectional GRU; batch_first=True matches the
        # (batch, time, features) layout produced in forward()
        self.rnn = nn.GRU(64 * (n_mels // 4), 128, bidirectional=True, batch_first=True)
        # Fully connected layer (256 = 2 directions x 128 hidden units)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):
        # x: (batch, n_mels, time) mel spectrogram
        x = x.unsqueeze(1)  # Add channel dimension: (batch, 1, n_mels, time)
        x = self.cnn(x)     # (batch, 64, n_mels // 4, time // 4)
        # Move time to dim 1 and merge channels with mel bins
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, time // 4, 64 * (n_mels // 4))
        x, _ = self.rnn(x)  # (batch, time // 4, 256)
        x = x[:, -1, :]     # Take the output from the last time step
        return self.fc(x)   # (batch, num_classes)

model = SER_CRNN()  # Instantiate the model
criterion = nn.CrossEntropyLoss()  # Define the loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Define the optimizer
```
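
The GRU input size of `64 * (n_mels // 4)` follows from the two pooling layers, and tracing the shapes by hand makes that explicit. The helper below is a hypothetical, framework-free sketch that mirrors the arithmetic only. The batch size of 8 and the 300 input frames are arbitrary assumptions.

```python
def crnn_shapes(n_mels=128, n_frames=300, batch=8, conv_channels=64, hidden=128):
    """Trace tensor shapes through two 2x2 poolings and a bidirectional GRU."""
    mels_out = n_mels // 2 // 2      # mel axis halved by each MaxPool2d(2)
    frames_out = n_frames // 2 // 2  # time axis halved by each MaxPool2d(2)
    return {
        "input": (batch, n_mels, n_frames),
        "after_cnn": (batch, conv_channels, mels_out, frames_out),
        # Channels and mel bins are merged into one feature axis per time step
        "gru_input": (batch, frames_out, conv_channels * mels_out),
        # A bidirectional GRU concatenates forward and backward states
        "gru_output": (batch, frames_out, 2 * hidden),
    }


shapes = crnn_shapes()
```

With the defaults, each of the 75 remaining time steps carries 64 × 32 = 2048 features into the GRU, and the classifier sees 256 values per utterance.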