<a href="https://colab.research.google.com/github/tomer9080/DL-Speech-exercises/blob/main/PyTorch_hands_on_046747_ex1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <img src="https://pytorch.org/assets/images/pytorch-logo.png" alt="PyTorch" width="40"/> PyTorch Hands On

In the second part of our course, we'll focus on hands-on, advanced techniques for tackling different speech recognition tasks. Most modern systems are built around **deep neural networks (DNNs)**.

In this tutorial, you'll go through a **step-by-step process** to build a neural network that can classify different types of speech events. By the end, you'll know how to **import your data**, **prepare it for training**, and **feed it into a convolutional neural network (CNN)** to perform **speech classification** efficiently.

* Note: Before running your code, make sure your runtime uses a GPU rather than a CPU, for faster execution.
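
A quick way to confirm which device PyTorch will use is sketched below. This is only a minimal check; the variable name `device` is our own choice here.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```

If this prints `cpu` on Colab, the runtime type is probably still set to CPU.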

## Importing Python Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

In [None]:
import torch
import torchaudio
import torch.nn as nn

## Import the data

Before proceeding, we first need to load our data.
Upload the data archive `speech_classification_ex1.tar.gz` from Moodle, then extract it using the command below:

In [None]:
# Extract the uploaded archive
!tar -xzf speech_classification_ex1.tar.gz

## Preprocessing with the Torch `Dataset` Class

Before feeding our data into the classifier, we first need to **preprocess** it.  
In speech processing, we usually don’t pass the raw waveform directly into a neural network. Instead, we extract **acoustic features** such as **Spectrograms**, **MFCCs**, or **Log-Mel Spectrograms** that better represent the signal for learning.

Don’t worry if these terms sound new — we’ll cover each of them later in the course.  
For this exercise, we’ll use a **Log-Mel Spectrogram** as our input feature map.

We’ll also use PyTorch’s `Dataset` class to conveniently load both the feature set (the sample **`X`**) and its corresponding label (**`y`**).
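
To illustrate the `Dataset` pattern on its own, here is a minimal sketch with random tensors instead of audio. `ToyDataset` and its sizes are hypothetical and are not part of the exercise; the point is only that `__len__` reports the number of samples and `__getitem__` returns one `(X, y)` pair.

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """Hypothetical dataset of random 'feature maps' with integer labels."""

    def __init__(self, num_samples=6, n_mels=4, n_frames=5):
        generator = torch.Generator().manual_seed(0)
        # One (C=1, M, T) feature map per sample.
        self.features = torch.randn(num_samples, 1, n_mels, n_frames, generator=generator)
        # Alternating labels 0, 1, 0, 1, ...
        self.labels = torch.arange(num_samples) % 2

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        return self.features[idx], int(self.labels[idx])

toy = ToyDataset()
x, y = toy[3]
print(len(toy), x.shape, y)
```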

Fill in the missing parts `( ... )` in the code cell below:


In [None]:
import os
import pandas as pd

class SpeechDataset(torch.utils.data.Dataset):
    def __init__(self, data_path, split=None, n_mels=80, n_fft=400, hop_length=160, padded_length_seconds=5):

        self.data_path = data_path
        self.sample_rate = 16000
        self.data_frame = pd.read_csv(os.path.join(data_path, 'metadata.csv')) # Assuming a metadata.csv file

        if split is not None:
            self.data_frame = self.data_frame[self.data_frame['split'] == split].reset_index(drop=True) # Filter by split column

        self.mel_spectrogram = torchaudio.transforms.MelSpectrogram(
            n_mels=...,
            n_fft=...,
            hop_length=...,
            sample_rate=...
        )

        self.padded_length_seconds = padded_length_seconds

    def __len__(self):
        return len(self.data_frame)

    def __getitem__(self, idx):
        audio_file_path = os.path.join(self.data_frame.loc[idx, 'wav_path'])
        label = self.data_frame.loc[idx, 'class']

        waveform, sample_rate = ... # Load audio file using torchaudio

        # Pad waveform to fixed length
        num_frames = ...
        if waveform.shape[-1] < num_frames:
            padding = ...
            waveform = ...
        else:
            waveform = ...

        # Do not change below !!!
        mel_features = self.mel_spectrogram(waveform)[..., :-1] # Apply transformation to waveform

        return mel_features, label
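
As a sanity check on the time dimension, the arithmetic below estimates how many frames the feature map should have. It assumes every waveform ends up exactly `padded_length_seconds` long and that `MelSpectrogram` keeps its default `center=True` framing; treat it as a sketch, not as part of the solution.

```python
sample_rate = 16000
hop_length = 160
padded_length_seconds = 5

# Number of audio samples in a fixed-length clip.
num_samples = padded_length_seconds * sample_rate   # 80000

# With centered framing, the STFT yields 1 + num_samples // hop_length frames.
frames_centered = 1 + num_samples // hop_length      # 501

# The [..., :-1] slice in __getitem__ drops the last frame.
frames_after_trim = frames_centered - 1              # 500

print(num_samples, frames_centered, frames_after_trim)
```

Under these assumptions each sample has shape `(1, 80, 500)` with the default `n_mels=80`.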

## Dataloaders

To train our model efficiently, we'll load the data in **mini-batches** using a **DataLoader**.  
The DataLoader handles batching for us and lets us control important parameters like the **batch size**. With `num_workers > 0`, it also loads and preprocesses samples in parallel worker processes, which becomes especially important when working with larger datasets.
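
To see what batching looks like, the sketch below wraps ten dummy samples in a `TensorDataset` (a stand-in for `SpeechDataset`, with made-up sizes) and iterates over them with a batch size of 4.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten dummy (C=1, M=4, T=5) feature maps with alternating labels.
features = torch.zeros(10, 1, 4, 5)
labels = torch.arange(10) % 2

loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=False, num_workers=0)

# Collect the shape of every batch the loader yields.
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)
```

The last batch is smaller because 10 is not a multiple of 4; passing `drop_last=True` would discard it.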


In [None]:
BATCH_SIZE = 32
data_path = '...' # Path to the extracted data directory

train_dataset = SpeechDataset(data_path=data_path, split="...")

trainloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2
)

val_dataset = SpeechDataset(data_path=data_path, split="...")

valloader = torch.utils.data.DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=2
)

## Defining Our Model

Next, we'll define a class called `MyClassifier`.  
This class will contain two main parts:
- The **network architecture**, defined inside the `__init__` method.
- The **forward pass**, defined inside the `forward` method.

Your task is to fill in the missing parts `( ... )` so that the network can handle inputs of shape **(B, C, M, T)**, where:
- **B** — batch size  
- **C** — number of channels (in our case, `C = 1`)  
- **M** — number of bins in the log-Mel spectrogram  
- **T** — time dimension


In [None]:
class MyClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super(MyClassifier, self).__init__()
        self.relu = nn.ReLU()

        self.conv1 = nn.Conv2d(..., 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1)

        self.fc1 = nn.Linear(..., 128)
        self.fc2 = nn.Linear(128, num_classes)


    def forward(self, x):
        x = ...  # Apply conv 1 and activation
        x = ...  # Apply conv 2 and activation

        x = x.view(x.size(0), -1)  # Flatten the tensor

        x = ...  # Apply FC1 and activation
        x = ...  # Apply FC2
        return x
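
When working out the input size of the first fully connected layer, it helps to track how each convolution changes the spatial dimensions. The helper below applies the standard output-size formula for a convolution; `conv_out_size` and the toy values of `M` and `T` are our own illustration, not part of the exercise.

```python
def conv_out_size(size, kernel_size=3, stride=1, padding=1):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel_size) // stride + 1

m, t = 8, 10          # hypothetical toy values for M and T
out_channels = 16     # channels produced by the last convolution

# Two stacked 3x3 convolutions with stride 1 and padding 1.
m_out = conv_out_size(conv_out_size(m))
t_out = conv_out_size(conv_out_size(t))

# Size of each sample after x.view(x.size(0), -1).
flat_features = out_channels * m_out * t_out
print(m_out, t_out, flat_features)
```

With these settings the convolutions preserve `M` and `T`, so only the channel count changes before flattening.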

Let's try forwarding a single batch from our DataLoader through the model.  
We'll take a look at the **input** and **output** shapes to make sure everything is working as expected.


In [None]:
model = MyClassifier()
for mel_features, labels in trainloader:
    print(f"{mel_features.shape=}")
    output = model(mel_features)
    print(f"{output.shape=}")
    break

Make sure the **output** has the shape **(B, N)**, where **N** is the number of classes (in our case, `N = 2`).  
The output (after applying normalization, such as the `softmax` function) represents the **probability distribution** for each sample, indicating the likelihood of belonging to each of the **N different classes**.
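
As a small illustration with hand-picked logits (not real model outputs), `softmax` turns each row into a probability distribution, and `argmax` picks the most likely class:

```python
import torch

# Two samples, two classes: raw scores (logits) as the model would output them.
logits = torch.tensor([[2.0, 0.0], [-1.0, 1.0]])

probs = torch.softmax(logits, dim=-1)   # each row now sums to 1
predicted = probs.argmax(dim=-1)        # index of the most likely class per sample

print(probs.sum(dim=-1), predicted)
```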


## Training

Before we start training, we need to set up a **loss function** and an **optimizer**.  
We'll also define a small helper function to **evaluate our model’s accuracy** during training.

Since this is a **classification task**, we’ll use `nn.CrossEntropyLoss`.  
Technically, since we’re dealing with **two classes**, we could also use `nn.BCELoss` (Binary Cross-Entropy Loss), but that would require changing the network to produce a single sigmoid output per sample instead of two logits.

For now, we’ll keep things simple and stick with `nn.CrossEntropyLoss`.
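
Note that `nn.CrossEntropyLoss` expects raw logits and integer class indices, because it applies `log_softmax` internally. The sketch below checks this on hand-picked values by comparing it with an explicit `log_softmax` followed by `nll_loss`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Three samples, two classes; targets are class indices, not one-hot vectors.
logits = torch.tensor([[2.0, 0.0], [-1.0, 1.0], [0.5, 0.5]])
targets = torch.tensor([0, 1, 1])

loss_ce = nn.CrossEntropyLoss()(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

print(loss_ce.item(), loss_manual.item())
```

Both values agree, which is why the model's `forward` can return logits without a final softmax.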


In [None]:
learning_rate = 0.001
num_epochs = 5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = MyClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
def calc_accuracy(y_pred, y_true):
    y_pred_softmax = torch.log_softmax(y_pred, dim=-1)
    # Your code here
    ...
    return accuracy
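
One detail worth knowing about the `log_softmax` call above: it subtracts the same constant from every entry in a row, so the position of the largest value does not change. A quick check on hand-picked logits:

```python
import torch

logits = torch.tensor([[2.0, 0.0], [-1.0, 1.0], [0.3, 0.1]])

# The predicted class is the same before and after log_softmax.
pred_from_logits = logits.argmax(dim=-1)
pred_from_log_probs = torch.log_softmax(logits, dim=-1).argmax(dim=-1)

same = torch.equal(pred_from_logits, pred_from_log_probs)
print(pred_from_logits, same)
```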

We now proceed to train our network:

In [None]:
for epoch in range(num_epochs):
    model.train()
    running_loss = ...

    for i, (inputs, labels) in enumerate(...):
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = ... # Model forward pass
        loss = ... # Calculate loss

        loss.backward()
        optimizer.step()

        running_loss += ... # aggregate loss (hint - use .item())

        if (i + 1) % 3 == 0:  # Print every 3 mini-batches
            print(f"Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{len(trainloader)}], Loss: {running_loss / 3:.4f}")
            running_loss = 0.0 # Reset running loss

    # Validation loop
    model.eval()
    val_loss = ...
    val_accuracy = ...
    with torch.no_grad():
        for inputs, labels in valloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = ...
            loss = ...

            val_loss += ...
            val_accuracy += ...

    print(f"Epoch [{epoch + 1}/{num_epochs}], Validation Loss: {val_loss / len(valloader):.4f}, Validation Accuracy: {val_accuracy / len(valloader):.4f}")

print("Finished Training")
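
The validation loop relies on two switches: `model.eval()` puts layers into inference mode, and `torch.no_grad()` stops PyTorch from recording operations for backpropagation. The sketch below shows both on a tiny stand-in layer (a hypothetical `nn.Linear`, not the classifier above).

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
x = torch.randn(3, 4)

# Normal forward pass: the output is attached to the autograd graph.
out_train = layer(x)

# Inference: switch to eval mode and disable gradient tracking.
layer.eval()
with torch.no_grad():
    out_eval = layer(x)

print(out_train.requires_grad, out_eval.requires_grad, layer.training)
```

Skipping gradient tracking during validation saves memory and time, since no backward pass is needed there.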

## Evaluating Our Model

Now that our model is trained, let's check how well it performs.  

Complete the cell below so that it **builds a test DataLoader** (in the same way as the train and validation loaders above), **runs predictions on it**, and then **reports the accuracy** on the test set.


In [None]:
# Your evaluation code here

## Conclusions

Great work! You’ve successfully implemented a **CNN-based classifier** to distinguish between different speech events.  

Keep in mind that this is a **basic version** of a classifier. You can always make the network more powerful by **adding more convolutional layers**, incorporating **pooling techniques**, or experimenting with other architectural improvements.
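
As one possible direction, the sketch below adds a max-pooling layer after a convolution. The block and the toy input size are hypothetical; the point is that `nn.MaxPool2d(2)` halves both the mel and time dimensions, which shrinks the input to the fully connected layers.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),  # keeps M and T unchanged
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                           # halves M and T
)

x = torch.zeros(2, 1, 8, 10)   # toy batch: (B=2, C=1, M=8, T=10)
y = block(x)
print(y.shape)
```

If you add pooling to `MyClassifier`, remember to update the input size of `fc1` accordingly.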
