# A. Speech Emotion Recognition (SER) 

Feature Extraction: 
- GeMAPS (OpenSMILE)
- Full acoustic feature set (librosa)

Feature Selection: 
- Algorithm 1
- Algorithm 2
- Algorithm 1 + 2

Modeling Approach: 
**Traditional ML:** 
- Support Vector Machine (SVM) with Radial Basis Function (RBF) kernel 
- Random Forest 
- XGBoost (Extreme Gradient Boosting) 
- Gaussian Naive Bayes 
- k-Nearest Neighbors (k-NN) 

**Traditional DL:** 
- Convolutional Neural Network (CNN) with 2-3 convolutional layers, max pooling, and fully connected layers 
- Recurrent Neural Network (RNN) with LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) layers 
- Hybrid CNN-RNN 


Evaluation Strategy: 10-fold cross validation?? (GridSearchCV??) - less on more expensive tasks (FER/MER)




Total combinations for SER: 2 * 2 * 2 * 2 = 16 experiments 


In [1]:
import sys
import os

module_path = os.path.abspath(os.path.join('..', '..')) # or the path to your source code
sys.path.insert(0, module_path)

from src.utils import load_cmu_mosi

cmu_mosei = load_cmu_mosi()




Downloading from https://www.kaggle.com/api/v1/datasets/download/samarwarsi/cmu-mosei?dataset_version_number=1...


 96%|█████████▌| 27.9G/29.1G [14:14<00:34, 35.1MB/s]  


OSError: [Errno 28] No space left on device

It sounds like you're diving into a fascinating and complex area! Let's break down the input choices for your Deep Learning experiments in Speech Emotion Recognition (SER). You're right to pause and consider this carefully, as the input features significantly impact the performance of your neural networks.

## Understanding the Input Options: GeMAPS vs. Full Acoustic Feature Set (including MFCCs)

To make the right decision, let's understand what each feature set offers:

**1. GeMAPS (Geneva Minimalistic Acoustic Parameter Set):**

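* **What it is:** GeMAPS is a carefully selected, minimalistic set of acoustic parameters designed to capture essential paralinguistic information related to emotion. The minimalistic set has 62 parameters; the extended set (eGeMAPS) has 88, which is the number usually quoted. It's based on extensive research in affective computing and aims for a compact yet informative representation of the speech signal.
* **Characteristics:**
    * **Low dimensionality:** The 88 eGeMAPS functionals (statistics such as means and percentiles) summarize a whole utterance in a single vector; the underlying frame-level low-level descriptors are fewer still (25 per frame in openSMILE's eGeMAPSv02). This can lead to faster training and potentially require less data for the models to generalize.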
    * **Emotionally relevant:** The features are specifically chosen for their known correlation with emotional states.
    * **Interpretability:** Some of the features have relatively clear psychoacoustic interpretations (e.g., fundamental frequency, jitter, shimmer, energy-related features, spectral slope).
* **When to consider it for DL:**
    * **Smaller datasets:** The lower dimensionality might be advantageous when you have a limited amount of training data, as it reduces the risk of overfitting.
    * **Focus on paralinguistic cues:** If you believe that the emotional content is primarily conveyed through these higher-level acoustic characteristics.
    * **Faster experimentation:** The smaller input size can lead to quicker training times, allowing for more rapid iteration.

**2. Full Acoustic Feature Set (librosa - often including MFCCs):**

* **What it is:** This typically involves extracting a wide range of acoustic features using a library like `librosa`. A very common and powerful component of this set is Mel-Frequency Cepstral Coefficients (MFCCs).
* **Mel-Frequency Cepstral Coefficients (MFCCs):**
    * **What they are:** MFCCs are a compact representation of the spectral envelope of a speech signal, mimicking the human auditory system's perception of frequencies. They are derived by taking the Fourier transform of short segments of the audio, mapping the powers of these frequencies onto the mel scale (a perceptual scale of pitches), taking the logarithm of these mel-scaled powers, and then taking the discrete cosine transform (DCT) of the list of mel log powers.
    * **Characteristics:** MFCCs capture the short-term power spectrum of the sound, which is highly relevant for phonetic content and also carries emotional information. The number of MFCCs typically ranges from 13 to 40.
* **Other features in a "full" set (from librosa):** Besides MFCCs, `librosa` can extract a plethora of other features, including:
    * **Spectral features:** Spectral centroid, spectral bandwidth, spectral contrast, spectral flatness, spectral roll-off.
    * **Temporal features:** Zero-crossing rate, root-mean-square (RMS) energy.
    * **Chromagram:** Representation of the spectral energy distribution over 12 pitch classes.
    * **And many more.**
* **Characteristics of the full set:**
    * **High dimensionality:** Combining MFCCs (e.g., 20-40 coefficients per frame) with other spectral and temporal features can result in a significantly higher-dimensional input compared to GeMAPS.
    * **Rich information:** This set captures a broader range of acoustic characteristics, potentially including subtle cues that GeMAPS might miss.
    * **Potentially better performance (with enough data):** With sufficient training data, the richer input can allow deep learning models to learn more complex patterns and achieve higher accuracy.
* **When to consider it for DL:**
    * **Larger datasets:** Deep learning models thrive on large amounts of data to learn complex relationships in high-dimensional spaces.
    * **Potential for higher accuracy:** The comprehensive feature set might capture more nuanced emotional information.
    * **Flexibility for the model:** The neural network can learn to weigh the importance of different features within the set.

## Input for Your Deep Learning Architectures: Detailed Explanation

Now, let's specifically address how you would use these features as input for your CNN, RNN, and Hybrid models:

**1. Convolutional Neural Networks (CNNs):**

* **Input format:** CNNs are typically designed to process grid-like data. For audio, this often translates to a 2D representation of the features over time.
* **How to prepare the input:**
    * **Frame-based extraction:** Extract your chosen feature set (GeMAPS or the full set) for short, overlapping frames of the audio signal. This will give you a sequence of feature vectors over time.
    * **Creating the 2D input:** Stack these feature vectors as rows to form a 2D matrix where one dimension represents time frames and the other dimension represents the acoustic features.
    * **Example:** If you use MFCCs with 20 coefficients and your audio segment is processed into 100 time frames, your input for one audio segment would be a matrix of shape (100, 20).
    * **Channel dimension:** For CNNs, you often need to add a channel dimension. So, your input shape would become (100, 20, 1). If you were to use something like a spectrogram (which is inherently 2D - frequency x time), you might have multiple channels (e.g., magnitude and phase). However, with MFCCs or GeMAPS, you'll likely start with a single channel.
* **Considerations:**
    * **Input shape consistency:** Ensure all your input sequences have the same length or use techniques like padding or truncation to handle variable lengths.
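    * **Normalization/Standardization:** It's crucial to normalize or standardize your features (e.g., using StandardScaler from scikit-learn) before feeding them into the neural network to improve training stability and performance. Fit the scaler on the training data only and reuse it for validation and test data, so that no information leaks across the split.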
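
The CNN input preparation described above can be sketched with NumPy alone. The random arrays stand in for real MFCC matrices, and `fix_length` is a hypothetical helper; in practice the mean and standard deviation would come from the training clips only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real features: 3 clips with different numbers of frames,
# each frame holding 20 MFCC coefficients.
clips = [rng.normal(size=(n_frames, 20)) for n_frames in (80, 100, 130)]

def fix_length(features, n_frames):
    """Zero-pad or truncate a (frames, features) matrix along the time axis."""
    out = np.zeros((n_frames, features.shape[1]), dtype=features.dtype)
    n = min(n_frames, features.shape[0])
    out[:n] = features[:n]
    return out

# Per-feature statistics over all frames (training clips only in practice)
stacked = np.concatenate(clips, axis=0)
mean = stacked.mean(axis=0)
std = stacked.std(axis=0) + 1e-8

# Standardize first so that zero padding corresponds to the feature mean
batch = np.stack([fix_length((c - mean) / std, 100) for c in clips])
batch = batch[..., np.newaxis]  # add the channel dimension
print(batch.shape)  # (3, 100, 20, 1)
```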

**2. Recurrent Neural Networks (RNNs - LSTMs/GRUs):**

* **Input format:** RNNs are designed to process sequential data.
* **How to prepare the input:**
    * **Frame-based extraction:** Similar to CNNs, extract your chosen feature set for each time frame.
    * **Sequential input:** The input for each audio segment will be a sequence of feature vectors.
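    * **Example:** If you use frame-level GeMAPS descriptors (e.g., the 25 low-level descriptors per frame of openSMILE's eGeMAPSv02; the 88 functionals are a single vector per utterance, not a sequence) and your audio segment has 150 time frames, your input shape for one segment would be (150, 25).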
* **Considerations:**
    * **Variable sequence lengths:** RNNs can naturally handle variable-length sequences. However, for batch processing, you'll typically need to pad shorter sequences to the length of the longest sequence in the batch. Masking layers can be used to ignore the padded parts.
    * **Normalization/Standardization:** Again, normalizing or standardizing your features is essential.
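
A minimal sketch of the padding-plus-mask idea for batching variable-length sequences, using NumPy only. The helper `pad_batch` and the random sequences are illustrative assumptions; a framework's masking layer would consume the padded batch (or the mask) in a comparable way.

```python
import numpy as np

def pad_batch(sequences):
    """Pad (frames, features) arrays to the longest one; return batch and mask."""
    max_len = max(seq.shape[0] for seq in sequences)
    n_features = sequences[0].shape[1]
    batch = np.zeros((len(sequences), max_len, n_features), dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        batch[i, :seq.shape[0]] = seq   # real frames
        mask[i, :seq.shape[0]] = True   # True marks frames that are not padding
    return batch, mask

rng = np.random.default_rng(0)
# Three hypothetical utterances with 150, 90 and 120 frames of 20 features
sequences = [rng.normal(size=(n, 20)) for n in (150, 90, 120)]
batch, mask = pad_batch(sequences)
print(batch.shape)        # (3, 150, 20)
print(mask.sum(axis=1))   # number of real frames per utterance
```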

**3. Hybrid CNN-RNN:**

* **Input format:** These models typically leverage the strengths of both CNNs for feature extraction and RNNs for sequential modeling.
* **How to prepare the input:**
    * **CNN part:** The initial input to the CNN part would be similar to the CNN case – a 2D representation of your chosen features over time (e.g., (time frames, features, channels)).
    * **Feature maps to sequence:** The CNN layers would learn to extract higher-level spatio-temporal features. The output of the CNN would then be reshaped into a sequential form to be fed into the RNN layers: keep the (downsampled) time axis and flatten or pool only the feature and channel axes, so that each time step becomes one feature vector summarizing a local time-frequency region. Pooling over the time axis itself would collapse the sequence and leave nothing for the RNN to model.
    * **RNN processing:** The RNN (LSTM or GRU) would then process this sequence of CNN-extracted features to model the temporal dependencies for emotion recognition.
* **Considerations:**
    * **Careful design of the CNN output to sequence mapping:** How you transition from the CNN's feature maps to the RNN's input sequence is a crucial design choice.
    * **Normalization/Standardization:** Apply normalization/standardization to the initial input features.
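
One common way to make the CNN-to-RNN transition is to keep the time axis and merge the remaining axes into one feature vector per time step. The sketch below shows only that reshape with NumPy; the feature-map dimensions are made-up values, not the output of a real network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CNN output: 4 clips, 25 time steps left after pooling,
# 5 feature bins, 64 channels.
feature_maps = rng.normal(size=(4, 25, 5, 64)).astype(np.float32)

batch, time_steps, feat_bins, channels = feature_maps.shape
# Keep (batch, time) and flatten (feature bins, channels) into one vector
sequence = feature_maps.reshape(batch, time_steps, feat_bins * channels)
print(sequence.shape)  # (4, 25, 320)
```

Each of the 25 time steps now carries a 320-dimensional vector that an LSTM or GRU layer could consume.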

## Which Input Should You Choose for DL Experiments?

There's no single "best" answer, and it often depends on your dataset size, the complexity of the emotional cues in your data, and the specific architecture you're using. However, here's a general guideline and some suggestions for your experiments:

* **Start with MFCCs (as part of the full acoustic set):** MFCCs have a strong track record in speech-related tasks, including emotion recognition. They capture crucial information about the spectral envelope that is relevant to both phonetic content and emotional expression.
* **Experiment with the full acoustic feature set from `librosa`:** This allows the model to potentially learn from a wider range of acoustic cues. Be mindful of the increased dimensionality and the potential need for more data.
* **Consider GeMAPS as a baseline or for smaller datasets:** If your dataset is relatively small, GeMAPS' lower dimensionality might help prevent overfitting and provide a good starting point.
* **Don't be afraid to combine:** You could even explore combining GeMAPS with other features (though this wasn't explicitly in your initial plan).

**Recommendations for your experiments:**

Given your setup, I would recommend the following initial DL experiments:

1.  **CNN with MFCCs:**
    * Input: 2D representation of MFCCs (time frames x MFCC coefficients x 1 channel).
    * Rationale: MFCCs are a strong baseline, and CNNs excel at learning local patterns in grid-like data, which can be effective for analyzing the spectral evolution over time.

2.  **RNN (LSTM or GRU) with MFCCs:**
    * Input: Sequence of MFCC vectors (time frames x MFCC coefficients).
    * Rationale: RNNs are designed to capture temporal dependencies in sequential data, which is crucial for understanding the evolution of emotional expression in speech.

3.  **Hybrid CNN-RNN with MFCCs:**
    * Input: 2D representation of MFCCs.
    * Rationale: This architecture can leverage the CNN to extract meaningful local features from the MFCCs and the RNN to model the temporal relationships between these features.

After exploring these with MFCCs, you can then venture into using the full acoustic feature set from `librosa` with these same architectures to see if the richer input leads to improved performance. Finally, you could try GeMAPS, especially if you find that your dataset is limited.

## Evaluation Strategy: 10-Fold Cross-Validation and GridSearchCV

Your initial thought of using 10-fold cross-validation is a standard and good practice for evaluating the generalization performance of your models.

**GridSearchCV:**

* **Purpose:** GridSearchCV is a technique for hyperparameter tuning. It systematically searches through a predefined set of hyperparameter combinations for your chosen model and evaluates the performance of each combination using cross-validation on your training data.
* **Integration with 10-fold CV:** You would typically use 10-fold cross-validation *within* each hyperparameter combination being evaluated by GridSearchCV. This ensures a robust estimate of the model's performance for each set of hyperparameters.
* **Considerations for DL:** GridSearchCV can be computationally expensive, especially for deep learning models with a large number of hyperparameters and long training times. You might consider:
    * **RandomizedSearchCV:** This is a less exhaustive but often more efficient alternative that samples a fixed number of hyperparameter combinations.
    * **More focused hyperparameter ranges:** Based on literature or initial pilot experiments, narrow down the ranges of hyperparameters you want to search.
    * **Early stopping:** Implement early stopping during the training within each cross-validation fold to save time.
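
For the traditional ML models, the combination of 10-fold cross-validation and GridSearchCV could look roughly like this. The synthetic data from `make_classification`, the tiny parameter grid, and the macro-F1 scoring are assumptions chosen to keep the sketch small; the scaler sits inside the pipeline so it is refit on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for utterance-level features and 4 emotion labels
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),   # fitted per training fold, no leakage
    ("svm", SVC(kernel="rbf")),
])
param_grid = {"svm__C": [1, 10], "svm__gamma": ["scale", 0.01]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Four parameter combinations times ten folds means 40 fits plus one final refit, which is cheap for an SVM but adds up quickly for neural networks.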

**Less on More Expensive Tasks (FER/MER):**

Your intuition about reducing the cross-validation folds or the granularity of the hyperparameter search for more computationally expensive tasks like Facial Emotion Recognition (FER) or Multimodal Emotion Recognition (MER) is correct. These tasks often involve larger input dimensions (e.g., images, video frames, or several modalities at once) and more complex models, leading to significantly longer training times. In such cases, you might consider:

* **Fewer cross-validation folds (e.g., 3 or 5).**
* **A coarser grid of hyperparameters in GridSearchCV or using RandomizedSearchCV.**
* **Focusing on tuning the most critical hyperparameters.**
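
To see how quickly the cost grows, it helps to count the model fits an exhaustive search performs: the product of the grid sizes times the number of folds. The search space below is a hypothetical example, and `count_fits` is a helper written for this illustration.

```python
from math import prod

def count_fits(param_grid, n_folds):
    """Fits performed by an exhaustive grid search (final refit not counted)."""
    return prod(len(values) for values in param_grid.values()) * n_folds

# Hypothetical search space for a small CNN
param_grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32, 64],
    "dropout": [0.2, 0.5],
}
print(count_fits(param_grid, n_folds=10))  # 180
print(count_fits(param_grid, n_folds=3))   # 54
```

Dropping from 10 to 3 folds cuts the number of training runs by more than two thirds for the same grid.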

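For your SER experiments, starting with 10-fold cross-validation combined with GridSearchCV (if computationally feasible) is a good approach to thoroughly evaluate your models and find good hyperparameters. Keep in mind that the best cross-validation score found by the search is an optimistic estimate, because the same folds were used to pick the hyperparameters; report final performance on a held-out test set or with nested cross-validation.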

In summary, for your deep learning SER experiments, begin by focusing on **MFCCs** as input for your CNN, RNN, and hybrid models. Experiment with the 2D representation for CNNs and the sequential representation for RNNs. Remember to normalize your features and consider the impact of input dimensionality on your model complexity and data requirements. Good luck with your research!

1.  **Data Loading and Preprocessing:**
    * **Load Audio Files:** Use libraries like `librosa` to load your audio files. This will typically give you a 1-dimensional NumPy array representing the audio waveform and the sampling rate.

In [None]:
import librosa
import numpy as np

def load_audio(file_path, target_sr=16000):
    # librosa.load converts to mono float32 and resamples to target_sr
    y, sr = librosa.load(file_path, sr=target_sr)
    return y, sr

audio_path = 'path/to/your/audio.wav'
audio_signal, sampling_rate = load_audio(audio_path)
print(f"Audio shape: {audio_signal.shape}")
print(f"Sampling rate: {sampling_rate}")

Feeding the raw waveform to the network lets it learn the relevant features directly from the signal, potentially bypassing the need for handcrafted feature extraction like MFCCs or GeMAPS.

* **Resampling (Optional but Recommended):** Standardize the sampling rate across your dataset (passing `sr=target_sr` to `librosa.load` already resamples on load). This ensures consistent input dimensions for your network. Choose a common sampling rate (e.g., 16000 Hz).
* **Normalization:** Normalize the audio signal to a consistent range (e.g., between -1 and 1). This helps with training stability.

In [None]:
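def normalize_audio(audio):
    # Peak-normalize to [-1, 1]; leave an all-zero (silent) signal unchanged
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio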

normalized_signal = normalize_audio(audio_signal)
print(f"Normalized audio range: {np.min(normalized_signal):.4f} to {np.max(normalized_signal):.4f}")

* **Padding or Truncation:** Audio files will likely have variable lengths. Neural networks typically require fixed-size input. You'll need to either:
    * **Pad:** Add zeros to the end of shorter audio signals to match the length of the longest signal (or a predefined maximum length).
    * **Truncate:** Cut off longer audio signals to a predefined maximum length.
    * **Segmentation:** Split longer audio into fixed-length segments. This can also increase your training data.

* **Creating Batches:** When training, you'll feed data to the network in batches. Organize your processed audio signals and their corresponding labels into batches.

In [None]:
def pad_or_truncate(audio, target_length):
    current_length = len(audio)
    if current_length < target_length:
        # Match the audio dtype so the result is not upcast to float64
        padding = np.zeros(target_length - current_length, dtype=audio.dtype)
        return np.concatenate((audio, padding))
    elif current_length > target_length:
        return audio[:target_length]
    else:
        return audio

target_length_samples = int(2 * sampling_rate) # Example: 2 seconds at 16kHz
processed_signal = pad_or_truncate(normalized_signal, target_length_samples)
print(f"Processed audio shape: {processed_signal.shape}")
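
The segmentation option mentioned above can be sketched as follows. This version uses non-overlapping windows and zero-pads the last one; `segment_audio` and the random test signal are illustrative, and letting every segment inherit the label of its source clip is an assumption that may not hold if the emotion changes within a recording.

```python
import numpy as np

def segment_audio(audio, segment_length):
    """Split a 1D signal into non-overlapping fixed-length segments.

    The final segment is zero-padded if the signal length is not a
    multiple of segment_length.
    """
    n_segments = int(np.ceil(len(audio) / segment_length))
    pad = n_segments * segment_length - len(audio)
    padded = np.pad(audio, (0, pad))
    return padded.reshape(n_segments, segment_length)

sr = 16000
rng = np.random.default_rng(0)
audio = rng.uniform(-1, 1, size=5 * sr).astype(np.float32)  # 5 seconds of noise

segments = segment_audio(audio, 2 * sr)  # 2-second segments
print(segments.shape)  # (3, 32000)
```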

2.  **Neural Network Architecture:**
    * **1D Convolutional Neural Networks (1D CNNs):** These are a common choice for processing raw audio. 1D convolutional layers can learn temporal patterns directly from the waveform.
    * **Input Layer:** The input layer of your network will have a shape corresponding to the fixed length of your processed audio signals (e.g., `(target_length_samples, 1)` if you consider it a single channel).
    * **1D Convolutional Layers:** These layers will slide 1D filters across the time dimension of the audio, learning local temporal features. You'll typically have multiple convolutional layers with increasing numbers of filters to capture increasingly complex patterns.
    * **Activation Functions:** Use non-linear activation functions (e.g., ReLU) after each convolutional layer.
    * **Pooling Layers (e.g., MaxPooling1D):** Downsample the temporal dimension, reducing the number of parameters and increasing the receptive field of subsequent layers.
    * **Flatten Layer:** Flatten the output of the convolutional/pooling layers into a 1D vector before feeding it into fully connected layers.
    * **Fully Connected (Dense) Layers:** These layers will learn the final mapping from the learned features to the emotion classes.
    * **Output Layer:** A dense layer with the number of units equal to the number of emotion classes and a softmax activation function for classification.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models

def create_raw_audio_cnn_model(input_shape, num_classes):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(32, kernel_size=5, activation='relu', padding='same'),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, kernel_size=3, activation='relu', padding='same'),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

target_length = int(2 * 16000)
num_emotions = 8 # Example
input_shape = (target_length, 1) # Single channel for raw audio

raw_audio_model = create_raw_audio_cnn_model(input_shape, num_emotions)
raw_audio_model.summary()

raw_audio_model.compile(optimizer='adam',
                        loss='sparse_categorical_crossentropy',
                        metrics=['accuracy'])

# Assuming you have your training data (X_train, y_train) prepared
# X_train should have shape (num_samples, target_length) and needs a channel dimension
# X_train = np.expand_dims(X_train, axis=-1)
# raw_audio_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
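
It is worth checking by hand how large the model above becomes for a 2-second input at 16 kHz. The arithmetic below mirrors the layer stack (with `padding='same'` the convolutions keep the length, and each pooling layer halves it); the helper functions are written for this illustration only.

```python
def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus biases of a Conv1D layer."""
    return kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    """Weights plus biases of a Dense layer."""
    return in_features * units + units

length, channels = 2 * 16000, 1          # input shape (32000, 1)
total = conv1d_params(channels, 32, 5)   # Conv1D(32, kernel_size=5)
length, channels = length // 2, 32       # MaxPooling1D(2) -> 16000 steps
total += conv1d_params(channels, 64, 3)  # Conv1D(64, kernel_size=3)
length, channels = length // 2, 64       # MaxPooling1D(2) -> 8000 steps
flat = length * channels                 # Flatten
total += dense_params(flat, 128)         # Dense(128)
total += dense_params(128, 8)            # Dense(num_classes=8)
print(flat, total)  # 512000 65543560
```

Almost all of the roughly 65.5 million parameters sit in the first dense layer. Adding more pooling stages, using strided convolutions, or replacing `Flatten` with a global pooling layer such as `GlobalAveragePooling1D` are possible ways to shrink it.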


3.  **Training and Evaluation:**
    * Train your model using your prepared data and labels.
    * Use an appropriate loss function (e.g., `sparse_categorical_crossentropy` for integer labels or `categorical_crossentropy` for one-hot encoded labels) and an optimizer (e.g., Adam).
    * Evaluate the model's performance on your test set using appropriate metrics (e.g., accuracy, F1-score).
    * Use your 10-fold cross-validation strategy (potentially with GridSearchCV for hyperparameter tuning) as you planned.
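
Because emotion classes are often imbalanced, it can help to report macro-averaged F1 next to accuracy. The sketch below computes both with NumPy on a made-up set of predictions; `macro_f1` is a small helper written here, and a class with no true or predicted samples is scored as 0 by assumption.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Made-up labels and predictions for three emotion classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
f1 = macro_f1(y_true, y_pred, num_classes=3)
print(f"accuracy={accuracy:.3f}, macro F1={f1:.3f}")
```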

**Advantages of Using Raw Audio Input:**

* **End-to-end learning:** The network learns features directly tailored to the task, potentially capturing subtle nuances that handcrafted features might miss.
* **Reduced reliance on domain expertise:** You don't need to manually design features based on acoustic knowledge.
* **Potential for capturing non-linear relationships:** Deep neural networks can learn complex, non-linear relationships in the raw waveform.

**Challenges and Considerations:**

* **High dimensionality:** Raw audio signals have a very high dimensionality in the time domain. For example, a 1-second audio clip at 16 kHz has 16,000 data points. This can lead to:
    * **Increased computational cost:** Training can be much slower and require more memory.
    * **Larger number of parameters:** The network might need more parameters to learn from the high-dimensional input, potentially requiring more training data to avoid overfitting.
* **Sensitivity to irrelevant variations:** Raw audio can contain variations (e.g., background noise, speaker characteristics) that are not directly related to emotion. The network might learn to focus on these irrelevant details if not trained carefully with sufficient data and regularization.
* **Long-range dependencies:** Capturing long-range temporal dependencies in the raw waveform can be challenging for simple CNNs. You might need very deep networks or combine them with recurrent layers.
* **Need for substantial data:** Training deep networks on raw audio effectively often requires a significantly larger dataset compared to using well-engineered features.

**When might raw audio input be beneficial?**

* **Very large datasets:** If you have a massive amount of training data, the network has more opportunities to learn meaningful features.
* **Tasks where handcrafted features might be limiting:** In cases where the emotional cues are very subtle or complex and not well-captured by traditional features.
* **Research settings:** For exploring the capabilities of deep learning to automatically learn representations from raw sensory data.

**In your case:**

Given that you are also exploring traditional ML with handcrafted features, it would be a valuable experiment to try a 1D CNN with raw audio input as a comparison. Start with a relatively shallow CNN architecture and see how it performs. Be prepared for potentially longer training times and the need for careful data preprocessing (especially handling variable lengths).

**Conclusion:**

Yes, you can definitely input raw audio into a neural network, primarily using 1D CNNs. However, be aware of the challenges related to high dimensionality and the potential need for large datasets and careful model design. It's a different paradigm compared to using handcrafted features, and the results might vary depending on your specific dataset and the complexity of the emotional cues. Experimentation is key!