<a href="https://colab.research.google.com/github/samarjeet22/102115053-SESS_LE1/blob/main/102115053_SamarjeetSingh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name: **Samarjeet Singh**  
Email: `ssingh10_be21@thapar.edu`  
Roll No: **102115053**  
Group: **4NC2**  
Start Timestamp: 20240911-1000

## Question
 Consider the paper: <https://arxiv.org/abs/1804.03209>

  1. Read and summarise the paper in about 50 words.
  2. Download the dataset in the paper, statistically analyse and
     describe it, so that it may be useful for posterity. (Include code
     snippets in your .ipynb file to evidence your analysis.)
  3. Train a classifier so that you are able to distinguish the commands
     in the dataset.
  4. Report the performance results using standard benchmarks.
  5. Record about 30 samples of each command in your voice and create a
     new dataset (including a new user id for yourself).  You may use a
     timer on your computer to synchronise.
  6. Fine tune your classifier to perform on your voice.
  7. Report the results.

## SUMMARY
The paper "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition" introduces an openly licensed dataset for training and evaluating keyword-spotting systems. It describes how the recordings were collected, which words are included, and the challenge of detecting specific words against background noise with models small enough to run on-device. It also defines an evaluation procedure and reports baseline results for comparing future models.

In [1]:
!pip install torchaudio torch sounddevice


Collecting sounddevice
  Downloading sounddevice-0.5.0-py3-none-any.whl.metadata (1.4 kB)
Downloading sounddevice-0.5.0-py3-none-any.whl (32 kB)
Installing collected packages: sounddevice
Successfully installed sounddevice-0.5.0


In [3]:
import os
import torchaudio
from collections import Counter
from torch.utils.data import DataLoader, Subset

# Create the data directory if it doesn't exist
data_dir = './data'
os.makedirs(data_dir, exist_ok=True)

# Download the Speech Commands dataset
dataset = torchaudio.datasets.SPEECHCOMMANDS(root=data_dir, download=True)


100%|██████████| 2.26G/2.26G [01:45<00:00, 23.1MB/s]
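
Question 2 asks for a statistical description of the dataset. A minimal sketch of one way to start is below. It assumes the extracted archive has one folder per word, with clips named `<speaker-id>_nohash_<n>.wav` as described in the paper. The function name `summarise_speech_commands` and the extraction path are assumptions, not part of the original notebook.

```python
from collections import Counter
from pathlib import Path


def summarise_speech_commands(root):
    """Return {word: (number of clips, number of distinct speakers)}."""
    clips = Counter()
    speakers = {}
    for wav in Path(root).glob("*/*.wav"):
        word = wav.parent.name
        if word.startswith("_"):  # skip folders such as _background_noise_
            continue
        clips[word] += 1
        # The speaker id is the part of the file name before "_nohash_".
        speaker_id = wav.name.split("_nohash_")[0]
        speakers.setdefault(word, set()).add(speaker_id)
    return {word: (clips[word], len(speakers[word])) for word in sorted(clips)}


# Assumed location of the extracted archive; adjust if it differs.
root = Path("./data/SpeechCommands/speech_commands_v0.02")
if root.exists():
    for word, (n_clips, n_speakers) in summarise_speech_commands(root).items():
        print(f"{word:>10}: {n_clips:6d} clips, {n_speakers:5d} speakers")
```

Working from file names avoids decoding any audio, so the counts are quick to produce even for the full archive.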


In [9]:
# Select 10 commands to work with
selected_commands = ['dog', 'cat', 'up', 'down', 'wow', 'yes', 'go', 'stop', 'on', 'off']

# Limit the number of samples per command (e.g., 100 samples per command)
samples_per_command = 100


In [10]:
# Create a subset of the dataset by filtering for the selected commands
subset_indices = []
command_counter = Counter()

for idx, sample in enumerate(dataset):
    label = sample[2]
    if label in selected_commands and command_counter[label] < samples_per_command:
        subset_indices.append(idx)
        command_counter.update([label])

    # Stop when we have enough samples for each command
    if all(command_counter[cmd] >= samples_per_command for cmd in selected_commands):
        break

# Create a subset of the dataset
subset_dataset = Subset(dataset, subset_indices)

# Check the sample count for each command in the subset
print(f"Sample counts in subset: {command_counter}")
print(f"Total subset size: {len(subset_dataset)}")

Sample counts in subset: Counter({'cat': 100, 'dog': 100, 'down': 100, 'go': 100, 'off': 100, 'on': 100, 'stop': 100, 'up': 100, 'wow': 100, 'yes': 100})
Total subset size: 1000
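
The subset is balanced at 100 clips per command, so part of it could be set aside for evaluation before training. The sketch below shows one way to do a per-label split. `stratified_split` is a hypothetical helper, and it assumes a list of labels collected in the same order as `subset_indices`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split positions 0..len(labels)-1 so every label keeps the same ratio."""
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        positions = by_label[label]
        rng.shuffle(positions)
        n_test = max(1, round(len(positions) * test_fraction))
        test.extend(positions[:n_test])
        train.extend(positions[n_test:])
    return sorted(train), sorted(test)


# Toy example: two labels with ten clips each.
train_pos, test_pos = stratified_split(["up"] * 10 + ["down"] * 10)
print(len(train_pos), len(test_pos))
```

The returned positions index into `subset_indices`, so a training subset could be built with `Subset(dataset, [subset_indices[i] for i in train_pos])`. This split ignores speaker identity, whereas the paper assigns clips to sets by speaker so that one voice does not appear in both training and test data.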


## DATA PREPROCESSING

In [11]:
# Data preprocessing (padding and truncating)
import torch


In [12]:
# Define a fixed length for all audio samples (1 second = 16000 samples at 16kHz)
fixed_length = 16000

# Custom collate function to pad and truncate audio data
def collate_fn(batch):
    waveforms = []
    labels = []

    for item in batch:
        waveform = item[0]
        label = item[2]

        if waveform.shape[1] > fixed_length:
            waveform = waveform[:, :fixed_length]
        elif waveform.shape[1] < fixed_length:
            pad_amount = fixed_length - waveform.shape[1]
            waveform = torch.nn.functional.pad(waveform, (0, pad_amount))

        waveforms.append(waveform)
        labels.append(label)

    waveforms = torch.stack(waveforms)
    return waveforms, labels

In [13]:
# DataLoader for the subset dataset
loader = DataLoader(subset_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)

## CNN CLASSIFIER

In [14]:
#Define and Train a CNN Classifier (with correct fully connected layer input size)
import torch.nn as nn
import torch.optim as optim
import torchaudio.transforms as transforms

# Define the MelSpectrogram transform to convert audio waveforms into spectrograms
mel_spectrogram = transforms.MelSpectrogram(
    sample_rate=16000, n_mels=128, n_fft=400, hop_length=160
)

# Define a simple CNN model for speech command classification
class SimpleCNN(nn.Module):
    def __init__(self, num_classes):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        self.fc1 = nn.Linear(32 * 25 * 32, 128)  # 32 channels x 32 mel bins x 25 frames = 25600 inputs
        self.fc2 = nn.Linear(128, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # Flatten the output for the fully connected layers
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Get the number of classes (commands)
num_classes = len(selected_commands)

# Create a dictionary to map commands (labels) to numerical values
label_to_index = {label: idx for idx, label in enumerate(selected_commands)}

# Function to convert string labels to numerical indices
def label_to_tensor(label):
    return torch.tensor(label_to_index[label])

# Instantiate the model
model = SimpleCNN(num_classes=num_classes).to('cuda')

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Training loop for 10 epochs
num_epochs = 10

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0

    for waveforms, labels in loader:
        # Convert waveforms to spectrograms
        waveforms = mel_spectrogram(waveforms)
        waveforms = waveforms.to('cuda')  # Already (batch, 1, n_mels, frames); no reshape needed
        labels = torch.tensor([label_to_tensor(label) for label in labels]).to('cuda')

        optimizer.zero_grad()
        outputs = model(waveforms)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {running_loss / len(loader)}')

print('Training completed!')



Epoch 1/10, Loss: 3.4963080808520317
Epoch 2/10, Loss: 2.012493845075369
Epoch 3/10, Loss: 1.5896470323204994
Epoch 4/10, Loss: 1.1217845231294632
Epoch 5/10, Loss: 0.8777167685329914
Epoch 6/10, Loss: 0.6619481900706887
Epoch 7/10, Loss: 0.4038319722749293
Epoch 8/10, Loss: 0.33765738760121167
Epoch 9/10, Loss: 0.37012161873281
Epoch 10/10, Loss: 0.17110937379766256
Training completed!
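
The input size of `fc1` in `SimpleCNN` follows from the spectrogram shape. The sketch below works through the arithmetic, assuming the default centred STFT in `MelSpectrogram` (one frame per hop, plus one) and that each 2x2 max-pool halves both dimensions, rounding down. The helper names are hypothetical.

```python
def mel_frames(num_samples, hop_length):
    """Number of frames from a centred STFT over `num_samples` samples."""
    return num_samples // hop_length + 1


def flattened_size(n_mels, num_samples, hop_length, channels=32, num_pools=2):
    """Size of the flattened feature map after `num_pools` 2x2 max-pools."""
    height, width = n_mels, mel_frames(num_samples, hop_length)
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return channels * height * width


# 16000 samples, hop 160 -> 101 frames; 128 x 101 -> 64 x 50 -> 32 x 25
print(mel_frames(16000, 160))           # 101
print(flattened_size(128, 16000, 160))  # 32 * 32 * 25 = 25600
```

Changing `fixed_length`, `hop_length` or `n_mels` changes this product, so `fc1` would need to be resized accordingly.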


## MODEL EVALUATION

In [15]:
#Evaluate the Model
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0

with torch.no_grad():  # Disable gradient calculation for faster evaluation
    for waveforms, labels in loader:  # NOTE: this is the training loader, so the result is training accuracy, not held-out accuracy
        waveforms = mel_spectrogram(waveforms)
        waveforms = waveforms.to('cuda')  # Already (batch, 1, n_mels, frames)
        labels = torch.tensor([label_to_tensor(label) for label in labels]).to('cuda')

        outputs = model(waveforms)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f'Accuracy: {accuracy}%')

Accuracy: 98.7%
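
A single accuracy figure hides which commands are confused with each other. The sketch below computes per-class precision and recall from two label lists. `confusion_counts` and `per_class_report` are hypothetical helpers, and the inputs are assumed to be predictions and true labels gathered during an evaluation loop on held-out data.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def per_class_report(true_labels, predicted_labels):
    """Return {class: (precision, recall)} for every class seen."""
    counts = confusion_counts(true_labels, predicted_labels)
    classes = sorted(set(true_labels) | set(predicted_labels))
    report = {}
    for cls in classes:
        true_positive = counts[(cls, cls)]
        actual = sum(n for (t, _), n in counts.items() if t == cls)
        predicted = sum(n for (_, p), n in counts.items() if p == cls)
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        report[cls] = (precision, recall)
    return report


# Toy example with two commands.
truth = ["yes", "yes", "stop", "stop"]
guess = ["yes", "stop", "stop", "stop"]
for cls, (precision, recall) in per_class_report(truth, guess).items():
    print(f"{cls}: precision={precision:.2f} recall={recall:.2f}")
```

In the evaluation cell, the two lists could be filled with `labels.tolist()` and `predicted.tolist()` for each batch, then mapped back to command names through `selected_commands`.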
