# Exploring speech emotion recognition with the wav2vec2 model

The goal is to build the following architecture:

![Model architecture](images/SERmodelArchitecture.png)

- The input is a raw audio waveform loaded from a .wav file

In [1]:
import torch
import torchaudio
from IPython.display import Audio, display
import torch.nn as nn
import torch.optim as optim
from speechbrain.lobes.models.huggingface_transformers.wav2vec2 import Wav2Vec2

## Function to preprocess and load the dataset

### Explanation of how stereo audio is converted to mono audio.  
1. The Condition: if wav.shape[0] > 1:  

By default, `torchaudio.load` returns audio tensors in the shape [channels, samples].

- wav.shape[0] refers to the number of channels.
- If the value is 1, the audio is already Mono.
- If the value is 2, the audio is Stereo (Left and Right channels).
- The if statement checks if there is more than one channel present.

2. The Operation: torch.mean(wav, dim=0, keepdim=True)  

If the audio has multiple channels, this line merges them:

- torch.mean(..., dim=0): It calculates the average value across the channel dimension. For a stereo file, it adds the Left and Right samples together and divides by 2. This is the standard way to downmix stereo to mono without clipping the audio.

- keepdim=True: This ensures the tensor retains its 2D shape.
    - Without it, a shape of [2, 44100] would become [44100].
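    - With it, the shape becomes [1, 44100]. Mono and downmixed stereo files then share the same [1, samples] layout, which matters because many models expect the channel dimension to be present, and because the later `squeeze(0)` in `load_audio` can then remove it in one uniform step.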
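The downmix described above can be checked on a tiny tensor. This is a minimal sketch with made-up sample values (not taken from any dataset), showing how `keepdim` changes the resulting shape:

```python
import torch

# A tiny fake stereo clip: 2 channels, 4 samples each (made-up values).
stereo = torch.tensor([[0.2, 0.4, -0.6, 1.0],
                       [0.0, 0.2, -0.2, 0.0]])

# Average the left and right channels sample by sample.
mono_keep = torch.mean(stereo, dim=0, keepdim=True)  # channel dim kept
mono_drop = torch.mean(stereo, dim=0)                # channel dim removed

print(mono_keep.shape)  # torch.Size([1, 4])
print(mono_drop.shape)  # torch.Size([4])
print(mono_keep)
```

Each output sample is the average of the two channel values at that position, so the result stays within the range of the original signal.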

In [2]:
def load_audio(file_path):
    """
    Loads an audio file and prepares it for wav2vec2
    Returns a tensor of shape [time_samples]
    """
    # Load the audio
    wav, sr = torchaudio.load(file_path)

    # Resample the audio if it is not already at 16 kHz
    if sr != 16000:
        resampler = torchaudio.transforms.Resample(
            orig_freq=sr,
            new_freq=16000
        )
        wav = resampler(wav)

    # If the audio is stereo, convert it to mono
    if wav.shape[0] > 1:
        wav = torch.mean(wav, dim=0, keepdim=True)

    # Remove the channel dimension (index 0): [1, time_samples] -> [time_samples]
    wav = wav.squeeze(0)

    return wav

## Now we will define the function that initializes the wav2vec2 model

In [3]:
def initialize_models(num_emotions=2, freeze_wav2vec=True):
    """
    Creates wav2vec2 encoder and classifier
    Returns: wav2vec2_model, classifier
    """
    # Load wav2vec2 from HuggingFace via SpeechBrain
    wav2vec2 = Wav2Vec2(
        source="facebook/wav2vec2-base-960h",
        save_path="./wav2vec2_checkpoints"
    )
    
    # Freeze if needed
    if freeze_wav2vec:
        for param in wav2vec2.parameters():
            param.requires_grad = False
    
    # Dense layers on top
    classifier = nn.Sequential(
        nn.Linear(768, 256),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(128, num_emotions)
    )
    
    return wav2vec2, classifier
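To see what freezing means in practice, it can help to count trainable parameters. The sketch below rebuilds the same classifier head as above and uses a small `nn.Linear` as a hypothetical stand-in for the encoder, so nothing has to be downloaded; `count_trainable` is a helper name introduced here for illustration only.

```python
import torch.nn as nn

def count_trainable(module):
    """Number of parameters that will receive gradients."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Same layer sizes as the classifier defined above, with num_emotions=2.
classifier = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 2),
)
print(count_trainable(classifier))  # 230018

# Hypothetical stand-in for the encoder (NOT wav2vec2), just to show freezing.
encoder = nn.Linear(4, 4)
print(count_trainable(encoder))     # 20

for param in encoder.parameters():
    param.requires_grad = False     # same pattern as in initialize_models
print(count_trainable(encoder))     # 0
```

With the encoder frozen, only the classifier's weights are updated during training, which keeps each training step much lighter than fine-tuning the whole model.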

## Now we define the function which gives the prediction

### What unsqueeze() and squeeze() do and why we need them

Let us take a concrete example:  

1. Start with a simple 1D audio tensor: wav = [0.1, 0.2, 0.3]  
    - Shape: (3,) (3 samples)

2. Apply unsqueeze(0): wav = [[0.1, 0.2, 0.3]]
    - Shape: (1, 3) (1 row, 3 columns)

Intuition: You added a new set of outer brackets. You now have a **"List of Lists,"** even though the outer list only contains one thing.

3. Apply unsqueeze(1) instead: wav = [[0.1], [0.2], [0.3]]
    - Shape: (3, 1) (3 rows, 1 column)

Intuition: You put a box around every individual sample. This is like turning a horizontal line into a vertical column.  

A neural network layer expects its input as a batch of rows, where each row is one example and each column is one feature. That is why `predict_emotion` calls `unsqueeze(0)` on the waveform: it turns the 1D tensor of shape [time_samples] into a batch of one, [1, time_samples], which is what wav2vec2 expects. After mean pooling, the classifier receives a matrix with 1 row and 768 columns, where 768 is the size of the embedding vector that wav2vec2 gives us.
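The shapes in the example above can be reproduced directly. A minimal sketch, using the same three sample values:

```python
import torch

wav = torch.tensor([0.1, 0.2, 0.3])
print(wav.shape)          # torch.Size([3])

row = wav.unsqueeze(0)    # new axis in front: a batch of one
print(row.shape)          # torch.Size([1, 3])

col = wav.unsqueeze(1)    # new axis at the end: one column
print(col.shape)          # torch.Size([3, 1])

back = row.squeeze(0)     # squeeze(0) undoes unsqueeze(0)
print(back.shape)         # torch.Size([3])
```

Neither call changes the values; only the way they are indexed changes.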

In [4]:
def predict_emotion(wav, wav2vec2, classifier):
    '''
    Gets the emotion prediction for a single audio file
    wav : the raw audio waveform, shape [time_samples]
    return : logits, i.e. one unnormalized score per emotion, shape [num_emotions]
    '''
    #Add batch dimension : [time_samples]->[1,time_samples]
    wav = wav.unsqueeze(0)

    #Get the wav2vec2 Embeddings
    with torch.no_grad():
        feats = wav2vec2(wav) # Shape: [1, time_frames, 768]
    
    # Mean pooling across time
    embeddings = torch.mean(feats, dim=1) # Shape: [1, 768]

    logits = classifier(embeddings) # Shape: [1, num_emotions]

    return logits.squeeze(0) # Remove batch dim: [num_emotions]
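The logits returned by `predict_emotion` are raw scores, not a class index. A minimal sketch of turning them into a prediction is shown below; the logit values are made up, and the `label_names` list simply mirrors the 0=sad, 1=happy convention used for the labels in this notebook.

```python
import torch

# Made-up logits for a two-emotion classifier (index 0 = sad, index 1 = happy).
logits = torch.tensor([0.3, 1.2])
label_names = ["sad", "happy"]

probs = torch.softmax(logits, dim=0)   # scores -> probabilities that sum to 1
pred = int(torch.argmax(probs))        # index of the most likely emotion

print(probs)
print(label_names[pred])               # happy
```

When running real inference, calling `classifier.eval()` beforehand disables the dropout layers so that the same input always gives the same logits.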

Let us look more closely at the output of wav2vec2.

- Why is the shape [1, time_frames, 768]?  
    This shape represents a 3D Tensor, which is standard for sequence modeling. Each dimension has a specific meaning:
    - `1` (Batch Size): This represents the number of audio samples you processed at once. Since you passed a single waveform, the batch size is 1.
    - `time_frames` (Sequence Length): Audio is a continuous signal. Wav2Vec2 divides the audio into small, overlapping windows (frames).
        - For Wav2Vec2, the CNN encoder typically produces one feature vector for every 20ms of audio.
        - If your audio is 1 second long, you will have roughly 50 time frames.

    - `768` (Embedding Dimension/Hidden Size): This is the "depth" of the model. For the Wav2Vec2-Base architecture, every single time frame is represented by a vector of 768 numbers. These numbers capture the phonetic and acoustic characteristics of that specific slice of time.

- Why does Mean Pooling "remove" the time_frames?  
    Mean pooling is a mathematical reduction. When you call torch.mean(feats, dim=1), you are telling PyTorch to collapse the second dimension (the time axis) by calculating the average.  
    
    The Mathematics  
    
    If your `feats` tensor is visualized as a matrix for each batch:  
    
    $X = [f_1, f_2, f_3, \ldots, f_T]$  

    Where each $f_t$ is a vector of size 768, mean pooling performs:  

    $\text{Embedding} = \frac{1}{T} \sum_{t=1}^{T} f_t$  

    The Result:  
    
    - Input: $T$ different vectors, for example 100 (one for each moment in time).
    - Output: 1 single vector (the "average" of all those moments).

    The dimension dim=1 disappears because it has been aggregated. You move from a "sequence of features" to a "single global feature" that represents the entire audio clip.
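Both points can be checked numerically. The sketch below first estimates the number of time frames for a 1-second clip, assuming the kernel sizes and strides of the seven convolution layers in the published wav2vec2-base configuration (`CONV_LAYERS` and `num_frames` are names introduced here for illustration). It then mean-pools a random tensor that stands in for the real wav2vec2 output.

```python
import torch

# (kernel, stride) per conv layer; assumed from the wav2vec2-base configuration.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples):
    """Time frames produced for a clip with `num_samples` audio samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length

T = num_frames(16000)                  # 1 second of audio at 16 kHz
print(T)                               # 49, i.e. roughly 50 frames

feats = torch.randn(1, T, 768)         # random stand-in for the encoder output
embedding = torch.mean(feats, dim=1)   # average over the time axis
print(embedding.shape)                 # torch.Size([1, 768])

# Same result as the formula: sum over t, divided by T.
manual = feats.sum(dim=1) / T
print(torch.allclose(embedding, manual, atol=1e-5))
```

The total stride of the seven layers is 5 × 2⁶ = 320 samples, which at 16 kHz corresponds to the 20 ms step mentioned above.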

## Training function

In [5]:
def train_one_batch(wavs, labels, wav2vec2, classifier, optimizer, criterion):
    '''
    Trains one batch of data
    wavs : list of waveform tensors, each of shape [time_samples]
    labels : tensor of integer emotion labels, shape [batch_size]
    '''
    # We need to pad waveforms to same length in batch
    max_len = max([w.shape[0] for w in wavs])
    padded_wavs = []
    for w in wavs:
        if w.shape[0]<max_len:
            padding = torch.zeros(max_len - w.shape[0])
            w = torch.cat([w,padding])
        padded_wavs.append(w)

    # Stack into batch : [batch_size, time_samples]
    batch_wavs = torch.stack(padded_wavs)

    #forward pass
    feats = wav2vec2(batch_wavs) # [batch_size, time_frames, 768]
    embeddings = torch.mean(feats, dim=1)  # [batch_size, 768]
    logits = classifier(embeddings)  # [batch_size, num_emotions]

    # Compute loss
    loss = criterion(logits, labels)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    return loss.item()

## Why do we pad the waveforms to the same length?
- Most machine learning and deep learning models expect rectangular data
- A jagged matrix will not be accepted as an input (`torch.stack` requires every tensor to have the same shape)

That is why we pad each waveform with zeros until its length equals that of the longest waveform in the batch.
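As an alternative to the manual loop, PyTorch ships a helper that does the same zero-padding. The sketch below uses `torch.nn.utils.rnn.pad_sequence` on three made-up "waveforms" of different lengths; it is shown only for comparison and is not what the training functions in this notebook use.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three fake waveforms of lengths 5, 3 and 4 (all ones, so padding is easy to spot).
wavs = [torch.ones(5), torch.ones(3), torch.ones(4)]

# batch_first=True gives [batch_size, time_samples]; padding is added on the right.
batch = pad_sequence(wavs, batch_first=True, padding_value=0.0)

print(batch.shape)  # torch.Size([3, 5])
print(batch[1])     # tensor([1., 1., 1., 0., 0.])
```

The shorter waveforms are filled with zeros up to the length of the longest one, which gives the rectangular tensor the model needs.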


In [6]:
def prepare_dataset(audio_files, emotion_labels):
    """
    Loads all audio files and prepares them for training
    audio_files: list of file paths ['audio1.wav', 'audio2.wav', ...]
    emotion_labels: list of integers [0, 1, 0, 1, ...] where 0=sad, 1=happy, etc.
    Returns: list of waveforms, tensor of labels
    """
    waveforms = []
    for audio_path in audio_files:
        waveform = load_audio(audio_path)
        waveforms.append(waveform)
    
    # Convert labels to tensor
    labels = torch.tensor(emotion_labels, dtype=torch.long)
    
    return waveforms, labels

In [7]:
def train_model(waveforms, labels, wav2vec2, classifier, num_epochs=10, batch_size=8, learning_rate=0.001):
    optimizer = optim.Adam(classifier.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(num_epochs):
        total_loss = 0
        num_batches = 0
    
        for i in range(0, len(waveforms), batch_size):
            batch_waveforms = waveforms[i:i+batch_size]
            batch_labels = labels[i:i+batch_size]
            
            # Pad waveforms to same length
            max_len = max([w.shape[0] for w in batch_waveforms])
            padded = []
            for w in batch_waveforms:
                if w.shape[0] < max_len:
                    padding = torch.zeros(max_len - w.shape[0])
                    w = torch.cat([w, padding])
                padded.append(w)
            
            # Stack into batch tensor
            batch_tensor = torch.stack(padded)
            
            # Forward pass through wav2vec2
            with torch.no_grad():
                feats = wav2vec2(batch_tensor)
            
            # Mean pooling
            embeddings = torch.mean(feats, dim=1)
            
            # Forward through classifier
            logits = classifier(embeddings)
            
            # Compute loss
            loss = criterion(logits, batch_labels)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            num_batches += 1
        
        avg_loss = total_loss / num_batches
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
    
    return classifier

In [None]:
# Example usage
if __name__ == "__main__":
    # Your training data
    audio_files = [
        'D:/work/Aiml/conversational AI/speechbrain demo/archive/Actor_02/03-01-01-01-01-01-02.wav',
        'D:/work/Aiml/conversational AI/speechbrain demo/archive/Actor_02/03-01-01-01-01-02-02.wav',
        'D:/work/Aiml/conversational AI/speechbrain demo/archive/Actor_02/03-01-01-01-02-01-02.wav',
        'D:/work/Aiml/conversational AI/speechbrain demo/archive/Actor_02/03-01-01-01-02-02-02.wav',
    ]
    # These are the labels for each audio that I provide
    emotion_labels = [1, 0, 1, 0]  # 1=happy, 0=sad
    
    # Initialize models
    wav2vec2, classifier = initialize_models(num_emotions=2, freeze_wav2vec=True)
    
    # Load dataset
    waveforms, labels = prepare_dataset(audio_files, emotion_labels)
    
    # Train
    trained_classifier = train_model(
        waveforms, 
        labels, 
        wav2vec2, 
        classifier,
        num_epochs=20,
        batch_size=4,
        learning_rate=0.001
    )
    
    # Save the trained classifier
    torch.save(trained_classifier.state_dict(), 'emotion_classifier.pth')
    print("Training complete! Model saved.")

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2Model: ['lm_head.weight', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'lm_head.bias']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2ve

Epoch 1/20, Loss: 0.7053
Epoch 2/20, Loss: 0.6984
Epoch 3/20, Loss: 0.6911
Epoch 4/20, Loss: 0.6792
Epoch 5/20, Loss: 0.6809
Epoch 6/20, Loss: 0.6611
Epoch 7/20, Loss: 0.6903
Epoch 8/20, Loss: 0.7361
Epoch 9/20, Loss: 0.7025
Epoch 10/20, Loss: 0.6504
Epoch 11/20, Loss: 0.7235
Epoch 12/20, Loss: 0.6563
Epoch 13/20, Loss: 0.6902
Epoch 14/20, Loss: 0.7031
Epoch 15/20, Loss: 0.7111
Epoch 16/20, Loss: 0.6548
Epoch 17/20, Loss: 0.7161
Epoch 18/20, Loss: 0.6728
Epoch 19/20, Loss: 0.6488
Epoch 20/20, Loss: 0.6930
Training complete! Model saved.


Some useful terms:
- Batch size - the number of training examples processed before the model's weights are updated. One update = one batch (one iteration).
- Epoch - one full pass over the dataset.
- Iterations per epoch = ceil(dataset_size / batch_size)
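The last formula can be checked against the way `train_model` slices the data. A minimal sketch, where `iterations_per_epoch` is a helper name introduced here for illustration:

```python
import math

def iterations_per_epoch(dataset_size, batch_size):
    """Number of weight updates in one full pass over the dataset."""
    return math.ceil(dataset_size / batch_size)

# The run above: 4 audio files with batch_size=4 gives one update per epoch.
print(iterations_per_epoch(4, 4))    # 1

# With 10 files and batch_size=4, the last batch holds only 2 files.
print(iterations_per_epoch(10, 4))   # 3

# Same count as the batch start indices used by range(0, len(waveforms), batch_size).
starts = list(range(0, 10, 4))
print(starts)                        # [0, 4, 8]
```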