# Speech Emotion Recognition Model Training

## Overview

In this project, we aim to classify emotions from audio data using machine learning models. The RAVDESS dataset serves as the foundation, with extracted audio features (e.g., MFCCs) used to train and evaluate three distinct deep learning architectures. Each architecture is assigned to one team member, and its main strengths and limitations are summarized below.

---

## Models and Responsibilities

1. **CNN (Sydney)**:
   - **Purpose**: Convolutional Neural Networks are used to capture spatial patterns in the extracted audio features.
   - **Strengths**: Effective for static feature representation (e.g., MFCCs or spectrograms).
   - **Limitation**: CNNs struggle to capture temporal dependencies in sequential data.

2. **RNN (Ryan)**:
   - **Purpose**: Recurrent Neural Networks are designed to model temporal dependencies and sequential data, making them a natural choice for audio analysis.
   - **Strengths**: Effective for capturing patterns over time (e.g., pitch or tone variations).
   - **Limitation**: Can suffer from vanishing gradients in long sequences.

3. **Transformer (Edgar)**:
   - **Purpose**: Transformers use attention mechanisms to relate every position in an audio sequence to every other position directly, rather than step by step as in traditional RNNs.
   - **Strengths**: High performance on sequential data with parallelized computations.
   - **Limitation**: Computationally intensive and requires larger datasets for training.

---

## Objective

We aim to evaluate the performance of these models on Speech Emotion Recognition and determine which architecture is best suited for this task. Each team member will focus on the following:

- **Sydney**: Optimize and evaluate the CNN model.
- **Ryan**: Implement and train the RNN model.
- **Edgar**: Explore the capabilities of the Transformer model.

---

## Dataset

- **RAVDESS**: A dataset of emotional speech audio files, each labeled with one of eight emotions (e.g., happy, sad, angry).

---

## Evaluation Metrics

The models will be evaluated using:
- **Accuracy**: Percentage of correctly classified samples.
- **Confidence Scores**: The maximum softmax probability assigned to the predicted class, used as a measure of model certainty.
- **Qualitative Analysis**: Playback of predicted samples for subjective evaluation.
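
The first two metrics can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the model emits one raw score (logit) per class; the helper names `softmax` and `accuracy_and_confidences` and the example logits are hypothetical and do not come from the project code.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def accuracy_and_confidences(batch_logits, true_labels):
    """Return (accuracy, per-sample confidence) for a batch of logit vectors."""
    predictions, confidences = [], []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
        predictions.append(best)
        confidences.append(probs[best])  # confidence = max probability
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels), confidences

# Hypothetical logits for three samples over three classes
logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5], [0.0, 0.0, 0.0]]
labels = [0, 1, 0]
acc, conf = accuracy_and_confidences(logits, labels)
print(f"accuracy={acc:.2f}, confidences={[round(c, 2) for c in conf]}")
```

When reading confidence values, it helps to compare them with the uniform baseline: with eight emotion classes, a model that spreads probability evenly assigns 0.125 to each class.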

---

## Conclusion

By comparing these architectures, we aim to gain insights into their strengths and limitations for Speech Emotion Recognition, ultimately guiding future work in audio-based emotion analysis.

---


In [1]:
# pip install librosa matplotlib scikit-learn torch torchaudio tensorflow kagglehub


# Import libraries
import os
import librosa
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch
import torch.nn as nn
import torch.optim as optim


# Data Preparation for Speech Emotion Recognition

## Overview

This script prepares the RAVDESS dataset for training and evaluating a Speech Emotion Recognition (SER) model. It loads audio data, extracts features (MFCCs), and splits the data into training and testing sets for further use in a machine learning pipeline.
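
To make the feature representation concrete, the sketch below shows what averaging MFCCs over frames does to the data shape. It uses random matrices as a stand-in for the output of `librosa.feature.mfcc`, which has shape `(n_mfcc, n_frames)` for mono audio; the clip lengths are made up for illustration.

```python
import numpy as np

# Stand-ins for MFCC matrices of two clips with different durations.
# Shape convention: (n_mfcc, n_frames).
rng = np.random.default_rng(0)
mfcc_short = rng.normal(size=(13, 80))
mfcc_long = rng.normal(size=(13, 200))

# Averaging over the frame axis gives one fixed-length vector per clip,
# regardless of how long the clip is.
feat_short = mfcc_short.mean(axis=1)
feat_long = mfcc_long.mean(axis=1)
print(feat_short.shape, feat_long.shape)  # (13,) (13,)

# The mean ignores frame order: shuffling the frames leaves it unchanged.
shuffled = mfcc_long[:, rng.permutation(mfcc_long.shape[1])]
order_is_ignored = np.allclose(shuffled.mean(axis=1), feat_long)
print(order_is_ignored)
```

The fixed length is what lets every clip be stacked into a single array, but it also means any model trained on these vectors receives no information about how the signal evolves over time.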

---

## Key Steps

1. **Dataset Download**:
   - The RAVDESS dataset is downloaded using `kagglehub`.

2. **Emotion Mapping**:
   - Each audio file is labeled with one of the following emotions:
     - Neutral, Calm, Happy, Sad, Angry, Fearful, Disgust, Surprised.

3. **Feature Extraction**:
   - **MFCCs (Mel-Frequency Cepstral Coefficients)** are extracted from each audio file using `librosa`.
   - The mean of the MFCC coefficients across frames is used as the feature representation.

4. **Data Loading**:
   - Audio files are processed, and their corresponding MFCC features and labels are stored in arrays.

5. **Label Encoding**:
   - Emotion labels are encoded into numerical values using `LabelEncoder`.

6. **Train/Test Split**:
   - The data is split into 80% training and 20% testing using `train_test_split`.

7. **Tensor Conversion**:
   - The features (`X`) and labels (`y`) are converted into PyTorch tensors for use in deep learning models.
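
The emotion mapping in step 2 relies on the RAVDESS filename convention, in which each name consists of seven two-digit fields separated by hyphens and the third field is the emotion code. The following is a small sketch of that lookup; the helper name `emotion_from_filename` is hypothetical, and unlike a bare `file.split("-")[2]` it returns `None` for names that do not follow the convention instead of raising `IndexError`.

```python
import os

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(path):
    """Return the emotion encoded in a RAVDESS-style filename, or None."""
    stem = os.path.splitext(os.path.basename(path))[0]
    fields = stem.split("-")
    if len(fields) != 7:  # not a RAVDESS-style name
        return None
    return EMOTIONS.get(fields[2])  # third field holds the emotion code

print(emotion_from_filename("Actor_12/03-01-06-01-02-01-12.wav"))  # fearful
print(emotion_from_filename("notes.wav"))  # None
```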

---

## Example Output

After running the script, you will see:
- The local path to the downloaded dataset files.
- The number of training and testing samples.


In [68]:
import os
import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch

# Download the RAVDESS dataset using kagglehub
import kagglehub

# Download latest version
dataset_path = kagglehub.dataset_download("uwrfkaggler/ravdess-emotional-speech-audio")

# Check the path to the dataset files
print("Path to dataset files:", dataset_path)

# Emotion mapping
emotions = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad', 
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'
}

# Function to load audio data and extract MFCCs
def load_data(dataset_path):
    data = []
    labels = []
    for root, _, files in os.walk(dataset_path):
        for file in files:
            if file.endswith('.wav'):
                file_path = os.path.join(root, file)
                # Extract emotion from the filename
                emotion = emotions.get(file.split("-")[2])
                if emotion:  # Only process files with valid emotion codes
                    audio, sr = librosa.load(file_path, sr=22050)
                    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).mean(axis=1)
                    data.append(mfccs)
                    labels.append(emotion)
    return np.array(data), np.array(labels)

# Load and preprocess dataset
X, y = load_data(dataset_path)

# Encode labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)

print(f"Data loaded successfully! Training samples: {len(X_train)}, Test samples: {len(X_test)}")


Path to dataset files: /Users/sydneyani/.cache/kagglehub/datasets/uwrfkaggler/ravdess-emotional-speech-audio/versions/1
Data loaded successfully! Training samples: 2304, Test samples: 576


# ImprovedCNNModel for Speech Emotion Recognition

## Overview

This is a Convolutional Neural Network (CNN) designed for classifying emotions in speech data. The model takes one feature vector per clip (in this notebook, 13 time-averaged MFCCs) and outputs one score per emotion class.

---

## Architecture

1. **Convolutional Layers**:
   - Three convolutional layers with increasing filters (32, 64, 128) extract spatial features from the input.
   - Each layer uses:
     - **Kernel Size**: 3
     - **Stride**: 1
     - **Padding**: 1
   - **ReLU Activation** is applied after each convolution.

2. **Pooling Layers**:
   - Max pooling layers reduce the spatial dimensions after each convolutional layer.

3. **Dropout Layer**:
   - A dropout rate of 0.5 prevents overfitting by randomly dropping units during training.

4. **Fully Connected Layers**:
   - First layer: Maps the flattened convolutional features to 256 units.
   - Second layer: Outputs predictions for the number of emotion classes.
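
The input size of the first fully connected layer follows from the layer settings above. The sketch below works through the arithmetic in plain Python, assuming the standard output-length formulas for 1D convolution and pooling without dilation, and a pooling kernel of 2 whose stride equals the kernel size; the function names are hypothetical.

```python
def conv1d_out_len(length, kernel_size=3, stride=1, padding=1):
    """Output length of a 1D convolution (no dilation)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def maxpool1d_out_len(length, kernel_size=2):
    """Output length of 1D max pooling with stride equal to the kernel size."""
    return (length - kernel_size) // kernel_size + 1

def flattened_size(input_size, channels=(32, 64, 128)):
    """Number of features after three conv + pool stages, flattened."""
    length = input_size
    for _ in channels:
        length = conv1d_out_len(length)     # kernel 3, padding 1: length unchanged
        length = maxpool1d_out_len(length)  # halves the length, rounding down
    return length * channels[-1]

# 13 MFCCs: 13 -> 6 -> 3 -> 1 positions, times 128 channels
print(flattened_size(13))  # 128
```

Because three floor-halvings equal one floor division by 8, this matches an expression of the form `(input_size // 8) * 128`. With 13 input features only a single position per channel survives the pooling stages.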

---

## Key Features

- **Input Flexibility**: Processes audio features with one input channel (e.g., MFCCs or spectrograms).
- **Regularization**: Includes dropout to improve generalization.
- **Classification**: The final fully connected layer outputs one raw score (logit) per emotion class. The model itself applies no softmax; the logits are converted to probabilities at evaluation time.

---

## Example Usage

```python
# Instantiate the model
improved_cnn_model = ImprovedCNNModel(input_size=X_train.shape[1], num_classes=len(emotions))
```


In [91]:
class ImprovedCNNModel(nn.Module):
    def __init__(self, input_size, num_classes):
        super(ImprovedCNNModel, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2)
        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear((input_size // 8) * 128, 256)  # Adjust size based on pooling
        self.fc2 = nn.Linear(256, num_classes)
        
    def forward(self, x):
        x = x.unsqueeze(1)  # Add channel dimension
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.conv3(x)
        x = self.relu(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.dropout(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Instantiate the model
improved_cnn_model = ImprovedCNNModel(input_size=X_train.shape[1], num_classes=len(emotions))



# Evaluation Function for Speech Emotion Recognition

## Overview

This function evaluates the performance of a trained CNN model on Speech Emotion Recognition (SER) by:

1. Predicting the emotions for a specified number of test samples (`runs`).
2. Displaying the **true emotion**, **predicted emotion**, and the **confidence level** of each prediction.
3. Playing an audio sample that carries the same true emotion label, for qualitative analysis.
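
The evaluation function expects a model whose weights have already been fitted, and no training loop appears in this section. The following is a minimal sketch of one, assuming cross-entropy loss and the Adam optimizer; the function name `train_model`, the hyperparameters, and the synthetic demo data are all assumptions for illustration, not the project's actual training setup.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_model(model, X, y, epochs=20, lr=1e-3, batch_size=32):
    """Train `model` on (X, y) and return the mean loss of each epoch."""
    criterion = nn.CrossEntropyLoss()  # expects raw logits and integer labels
    optimizer = optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        model.train()  # enable dropout during training
        permutation = torch.randperm(X.size(0))  # reshuffle every epoch
        running = 0.0
        for start in range(0, X.size(0), batch_size):
            idx = permutation[start:start + batch_size]
            optimizer.zero_grad()
            loss = criterion(model(X[idx]), y[idx])
            loss.backward()
            optimizer.step()
            running += loss.item() * idx.numel()
        history.append(running / X.size(0))
    return history

# Demo on synthetic data with a stand-in linear classifier
torch.manual_seed(0)
X_demo = torch.randn(64, 13)          # 64 fake clips, 13 features each
y_demo = torch.randint(0, 8, (64,))   # 8 fake emotion classes
demo_model = nn.Linear(13, 8)
losses = train_model(demo_model, X_demo, y_demo, epochs=5)
print([round(value, 3) for value in losses])
```

Under these assumptions, the same function could be called on the notebook's objects as `train_model(improved_cnn_model, X_train, y_train)` before running the evaluation.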

---

## Key Features

1. **Batch Processing**:
   - Processes `runs` samples in a single forward pass through the model for efficiency.

2. **Prediction Confidence**:
   - Calculates the confidence level as the maximum probability for each predicted class.

3. **Audio Playback**:
   - Locates and plays an audio file whose emotion matches the true label of each test sample. Because features are not linked to file paths, the file played is not necessarily the clip behind that sample.
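
For playback to use the exact clip behind each test sample, the file paths have to travel through the train/test split together with the features. One possible approach, sketched here with small made-up arrays, is to pass a `paths` array to `train_test_split`, which splits any number of equal-length arrays consistently. The `paths` array is an assumption: the data loading code in this notebook does not currently collect it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the arrays built during data loading
X = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 clips, 2 features each
y = np.array([0, 1] * 5)
paths = np.array([f"clip_{i:02d}.wav" for i in range(10)])

# All arrays are shuffled and split with the same indices
X_train, X_test, y_train, y_test, paths_train, paths_test = train_test_split(
    X, y, paths, test_size=0.2, random_state=42
)

# paths_test[i] is the source file of X_test[i]
aligned = all(
    np.array_equal(features, X[int(path[5:7])])
    for features, path in zip(X_test, paths_test)
)
print(len(paths_test), aligned)
```

With the paths aligned this way, an evaluation function could play `paths_test[i]` directly instead of searching the dataset folder for any file with a matching emotion.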

---

## Example Usage

```python
# Evaluate the first 10 samples
evaluate_audio_sample(improved_cnn_model, X_test, y_test, encoder, dataset_path, runs=10)
```


In [92]:
from IPython.display import Audio, display
import torch.nn.functional as F
import numpy as np
import os

# Updated evaluation function
def evaluate_audio_sample(model, X_test, y_test, encoder, dataset_path, runs=10):
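    """
    Evaluate the model on the first `runs` samples from the test set.
    Display true emotion, predicted emotion, and confidence level, then play
    an audio file whose emotion matches the true label.
    """
    runs = min(runs, len(X_test))  # Avoid indexing past the end of the test set
    model.eval()  # Set the model to evaluation mode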

    # Run the model on the first `runs` test samples
    with torch.no_grad():
        output = model(X_test[:runs])  # Forward pass for `runs` samples
        probabilities = F.softmax(output, dim=1).cpu().numpy()  # Convert logits to probabilities
        predicted_labels = np.argmax(probabilities, axis=1)  # Get predicted class indices
        confidences = np.max(probabilities, axis=1)  # Confidence is the max probability for each prediction

    # Decode true and predicted labels for `runs` samples
    true_emotions = encoder.inverse_transform(y_test[:runs].numpy())
    predicted_emotions = encoder.inverse_transform(predicted_labels)

    for i in range(runs):
        # Print the evaluation results
        print(f"\nSample {i + 1}:")
        print(f"  True Emotion: {true_emotions[i]}")
        print(f"  Predicted Emotion: {predicted_emotions[i]} (Confidence: {confidences[i]:.2f})")

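        # Locate an audio file for playback. Features are not linked to file
        # paths, so this picks the first file with the same emotion label,
        # which is not necessarily the clip behind this test sample.
        file_name = None
        for root, _, files in os.walk(dataset_path):
            for file in files:
                # Match on the emotion code in the filename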
                if file.endswith('.wav') and emotions.get(file.split("-")[2]) == true_emotions[i]:
                    file_name = os.path.join(root, file)
                    break
            if file_name:
                break

        # Play the audio sample if file is found
        if file_name:
            print("Playing the audio sample...")
            display(Audio(file_name, autoplay=True))
        else:
            print("Error: Corresponding audio file not found.")

# Example usage: Evaluate the first 10 samples
evaluate_audio_sample(improved_cnn_model, X_test, y_test, encoder, dataset_path, runs=10)



Sample 1:
  True Emotion: disgust
  Predicted Emotion: disgust (Confidence: 0.35)
Playing the audio sample...



Sample 2:
  True Emotion: fearful
  Predicted Emotion: surprised (Confidence: 0.38)
Playing the audio sample...



Sample 3:
  True Emotion: fearful
  Predicted Emotion: surprised (Confidence: 0.37)
Playing the audio sample...



Sample 4:
  True Emotion: sad
  Predicted Emotion: disgust (Confidence: 0.38)
Playing the audio sample...



Sample 5:
  True Emotion: surprised
  Predicted Emotion: disgust (Confidence: 0.35)
Playing the audio sample...



Sample 6:
  True Emotion: angry
  Predicted Emotion: surprised (Confidence: 0.33)
Playing the audio sample...



Sample 7:
  True Emotion: happy
  Predicted Emotion: disgust (Confidence: 0.35)
Playing the audio sample...



Sample 8:
  True Emotion: calm
  Predicted Emotion: disgust (Confidence: 0.47)
Playing the audio sample...



Sample 9:
  True Emotion: surprised
  Predicted Emotion: surprised (Confidence: 0.36)
Playing the audio sample...



Sample 10:
  True Emotion: happy
  Predicted Emotion: surprised (Confidence: 0.38)
Playing the audio sample...


# Evaluation of CNN Model for Speech Emotion Recognition

## Overview

This evaluation assesses a Convolutional Neural Network (CNN) for Speech Emotion Recognition (SER). The dataset contains audio samples labeled with emotions like **happy**, **angry**, and **sad**, and each sample is represented by 13 time-averaged MFCCs.

---

## Key Findings

1. **Model Behavior**:
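   - The CNN correctly predicted the emotion for **2 out of 10 samples** (20% accuracy), against a chance level of 12.5% for eight classes. Ten samples are too few to estimate accuracy reliably.
   - The true emotions were correctly retrieved from the labels, but the predictions lacked variation: every prediction was either **disgust** or **surprised**.
   - Confidence for the eight incorrect predictions ranged from 0.33 to 0.47, about the same as for the two correct ones (0.35 and 0.36), so confidence did not separate right answers from wrong ones.
   - No training step appears between the model's instantiation and this evaluation, so these results may reflect untrained weights rather than the architecture's potential.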

2. **Limitations of CNN**:
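   - CNNs excel at processing spatial features but struggle with sequential dependencies in audio.
   - Speech emotion recognition relies heavily on temporal context, better captured by sequential models like RNNs, LSTMs, or Transformers. In this pipeline the MFCCs are averaged over frames before they reach the model, so temporal order is already removed at feature extraction; a sequential model would need frame-level features to benefit.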

3. **Dataset Issues**:
   - The dataset may be imbalanced, leading the model to favor specific dominant classes.
   - Audio complexity (e.g., overlapping pitch and tone variations) requires richer feature representations beyond MFCCs or spectrograms.

---

## Conclusion

In this evaluation the CNN performed poorly on SER: it classified 2 of 10 samples correctly, and its predictions collapsed onto two classes. Because the input features are time-averaged, the model saw spatial but not temporal structure in the audio data, and overfitting cannot be assessed from ten test samples alone. Future models should incorporate sequential learning and advanced feature extraction to better handle speech data.

---

## Next Steps

1. **Implement Sequential Models**: Explore RNNs and Transformers for better temporal modeling.
2. **Enhance Feature Representation**: Experiment with log-mel spectrograms or pretrained embeddings like Wav2Vec.
3. **Address Class Imbalance**: Apply data augmentation or weighted loss functions to mitigate bias.
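
As an illustration of the weighted-loss idea in step 3, the sketch below computes inverse-frequency class weights using the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The function name and the label counts are hypothetical; real counts would come from the training labels.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label counts in which one class is half as frequent
labels = ["neutral"] * 96 + ["happy"] * 192 + ["sad"] * 192
weights = inverse_frequency_weights(labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Rarer classes receive larger weights. In PyTorch, such weights could be passed as a tensor to the `weight` argument of `nn.CrossEntropyLoss`, ordered to match the integer codes produced by the label encoder.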

---
