# Fine-tuning Whisper on Speech Pathology Dataset

## Goal

The goal of the Cleft Palate project (name TBD) at Vanderbilt DSI is to classify audio clips of patients' voices as containing hypernasality (a speech impediment) or not. The patients with hypernasality can then be recommended for speech pathology intervention. This is currently evaluated by human speech pathologists, which requires access to these medical providers. Our hope is to train a model that can classify this speech impediment for expedited patient access to a speech pathologist.

Tutorial created with guidance from ["Fine Tuning OpenAI Whisper Model for Audio Classifcation in PyTorch"](https://www.daniweb.com/programming/computer-science/tutorials/540802/fine-tuning-openai-whisper-model-for-audio-classification-in-pytorch)

## Model

We plan to use the Whisper embedings from OpenAI and train a classification model, either using Whisper with a sequence classification head or another classification LLM.

## Data

The data in this notebook is publicly available voice recordings featuring hypernasality and control groups. In the future we hope to train our model on private patient data from Vanderbilt University Medical Center (VUMC).

### Split Data

We need to split our data into train and test sets, then save those for further experiments.

In [1]:
!pip install torch
!pip install datasets
!pip install librosa
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython -m pip install --upgrade pip[0m
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting huggingface-hub>=0.19.4 (from datasets)
  Downloading huggingface_hub-0.22.2-py3-none-any.whl.metadata (12 kB)
Downloading datasets-2.18.0-p

In [70]:
# import libraries
import datasets
from datasets import load_dataset, DatasetDict,  Audio
import pandas as pd
import os
import glob
import librosa
import io
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, accuracy_score
from transformers import WhisperModel, WhisperFeatureExtractor, AdamW
import torch
import torch.nn as nn
import torch.utils.data
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
from sklearn.metrics import f1_score, classification_report, accuracy_score

In [4]:
pwd

'/workspace/cleft_palate_choja'

In [71]:
data_path = "/workspace/cleft_palate_choja/WAV_PUBLIC_SAMPLES/NOISE"

train_catalog = "/workspace/cleft_palate_choja/train_noise.csv"
test_catalog = "/workspace/cleft_palate_choja/test_noise.csv"

model_checkpoint = "openai/whisper-base"

In [72]:
train_metadata = pd.read_csv(train_catalog)
train_metadata

Unnamed: 0,File_Name,Sampling_Rate_(Hz),Channels,Duration_(seconds),folder,hypernasality,original_text,OPENAI_Whisper_text,WAV_filename,WAV_folder
0,ACPA ted had a dog with white feet-3.mp3,44100.0,1.0,4.13,CASES,1.0,ted had a dog with white feet,Ted and a dog with white feet.,ACPA ted had a dog with white feet-3.wav,CASES_WAV
1,cdc 4 (and then go to school).mp3,44100.0,2.0,1.41,CONTROLS,0.0,and then go to school,and then go to school.,cdc 4 (and then go to school).wav,CONTROLS_WAV
2,Video 1_4 (and can I have some more material).mp3,44100.0,2.0,3.60,CONTROLS,0.0,and can I have some more material,And can I have some more material?,Video 1_4 (and can I have some more material).wav,CONTROLS_WAV
3,NEW - video 2 (three times).mp3,44100.0,2.0,1.28,CONTROLS,0.0,three times,Three times.,NEW - video 2 (three times).wav,CONTROLS_WAV
4,cdc 4 (and then he brushed his teeth).mp3,44100.0,2.0,1.52,CONTROLS,0.0,and then he brushed his teeth,And then he brushed his teeth.,cdc 4 (and then he brushed his teeth).wav,CONTROLS_WAV
...,...,...,...,...,...,...,...,...,...,...
289,video 1 (pizza bundt).mp3,44100.0,2.0,1.80,CONTROLS,0.0,pizza bundt,Pizza Funt!,NOISE-video 1 (pizza bundt).wav,CONTROLS_WAV
290,ACPA most boys like to play football-3.mp3,48000.0,1.0,3.31,CASES,1.0,most boys like to play football,Most boys like to play football.,NOISE-ACPA most boys like to play football-3.wav,CASES_WAV
291,Facebook (take a tire).mp3,44100.0,1.0,1.75,CASES,1.0,take a tire,See you next time!,NOISE-Facebook (take a tire).wav,CASES_WAV
292,Video 5_1 (feet).mp3,44100.0,2.0,1.04,CASES,1.0,feet,Peace.,NOISE-Video 5_1 (feet).wav,CASES_WAV


In [73]:
train_df, val_df = train_test_split(train_metadata, test_size = 0.3, random_state = 42)

In [74]:
train_files = train_df["WAV_filename"].tolist()

train_folder = train_df["WAV_folder"].tolist()

train_full_paths = [os.path.join(data_path,train_folder[i], train_files[i]) for i in range(0,len(train_files))]

#train_full_paths

In [75]:
train_labels = train_df["hypernasality"].tolist()

train_labels[0:10]

[0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

In [76]:
# val set
val_files = val_df["WAV_filename"].tolist()

val_folder = val_df["WAV_folder"].tolist()

val_full_paths = [os.path.join(data_path,val_folder[i], val_files[i]) for i in range(0,len(val_files))]

val_labels = val_df["hypernasality"].tolist()

In [77]:
len(val_labels)

89

In [78]:
test_metadata = pd.read_csv(test_catalog)

In [79]:
# add cols for wav data

# Replace ".mp3" with ".wav" in the "Filename" column
test_metadata['WAV_filename'] = test_metadata['File_Name'].str.replace('.mp3', '.wav')

# Create "WAV_folder" column by concatenating "_WAV" to the "folder" column
test_metadata['WAV_folder'] = test_metadata['folder'] + "_WAV"

  test_metadata['WAV_filename'] = test_metadata['File_Name'].str.replace('.mp3', '.wav')


In [80]:
test_files = test_metadata["WAV_filename"].tolist()

test_folder = test_metadata["WAV_folder"].tolist()

test_full_paths = [os.path.join(data_path,test_folder[i], test_files[i]) for i in range(0,len(test_files))]

#test_full_paths

In [81]:
test_labels = test_metadata["hypernasality"].tolist()

### Create PyTorch datasets

In [93]:


train_audio_dataset = datasets.Dataset.from_dict({"audio": train_full_paths,
                                                  "labels":train_labels}
                                                 ).cast_column("audio", Audio(sampling_rate=16_000))

test_audio_dataset = datasets.Dataset.from_dict({"audio": test_full_paths,
                                                  "labels": test_labels}
                                                 ).cast_column("audio", Audio(sampling_rate=16_000))

val_audio_dataset = datasets.Dataset.from_dict({"audio": val_full_paths,
                                                 "labels": val_labels }
                                             ).cast_column("audio", Audio(sampling_rate=16_000))

In [94]:
#model_checkpoint = "openai/whisper-base"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_checkpoint)
encoder = WhisperModel.from_pretrained(model_checkpoint)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


In [95]:

class SpeechClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, audio_data,  text_processor):
        self.audio_data = audio_data
        self.text_processor = text_processor

    def __len__(self):
        return len(self.audio_data)

    def __getitem__(self, index):

      inputs = self.text_processor(self.audio_data[index]["audio"]["array"],
                                   return_tensors="pt",
                                   sampling_rate=self.audio_data[index]["audio"]["sampling_rate"])
      input_features = inputs.input_features
      decoder_input_ids = torch.tensor([[1, 1]]) * encoder.config.decoder_start_token_id

      labels = np.array(self.audio_data[index]['labels'])

      return input_features, decoder_input_ids, torch.tensor(labels)


In [97]:
train_dataset = SpeechClassificationDataset(train_audio_dataset,  feature_extractor)
test_dataset = SpeechClassificationDataset(test_audio_dataset,  feature_extractor)
val_dataset = SpeechClassificationDataset(val_audio_dataset,  feature_extractor)

batch_size = 5

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

## Fine Tune Whisper Model

Whisper model from HuggingFace.

In [98]:

class SpeechClassifier(nn.Module):
    def __init__(self, num_labels, encoder):
        super(SpeechClassifier, self).__init__()
        self.encoder = encoder
        self.classifier = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, 4096),
            nn.ReLU(),
            nn.Linear(4096, 2048),
            nn.ReLU(),
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, num_labels)
        )

    def forward(self, input_features, decoder_input_ids):
        outputs = self.encoder(input_features, decoder_input_ids=decoder_input_ids)
        pooled_output = outputs['last_hidden_state'][:, 0, :]
        logits = self.classifier(pooled_output)
        return logits



In [99]:
num_labels = 2

model = SpeechClassifier(num_labels, encoder).to(device)
optimizer = AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.999), eps=1e-08)
criterion = nn.CrossEntropyLoss()

In [100]:
# Define the training function NO VAL
def train(model, train_loader, optimizer, criterion, device, num_epochs):

  for epoch in range(num_epochs):

    model.train()

    for i, batch in enumerate(train_loader):

          input_features, decoder_input_ids, labels = batch

          input_features = input_features.squeeze()
          input_features = input_features.to(device)

          decoder_input_ids = decoder_input_ids.squeeze()
          decoder_input_ids = decoder_input_ids.to(device)

          labels = labels.view(-1)
          labels = labels.type(torch.LongTensor)
          labels = labels.to(device)

          optimizer.zero_grad()

          logits = model(input_features, decoder_input_ids)

          loss = criterion(logits, labels)
          loss.backward()

          optimizer.step()

          if (i+1) % 8 == 0:
              print(f'Epoch {epoch+1}/{num_epochs}, Batch {i+1}/{len(train_loader)}, Train Loss: {loss.item():.4f}')

    torch.save(model.state_dict(), 'best_model.pt')

In [101]:
# Define the training function
def train(model, train_loader, val_loader, optimizer,  criterion, device, num_epochs):
    best_accuracy = 0.0
    for epoch in range(num_epochs):
        model.train()
        for i, batch in enumerate(train_loader):
            input_features, decoder_input_ids, labels = batch
            input_features = input_features.squeeze()
            input_features = input_features.to(device)
            decoder_input_ids = decoder_input_ids.squeeze()
            decoder_input_ids = decoder_input_ids.to(device)
            labels = labels.view(-1)
            labels = labels.type(torch.LongTensor)
            labels = labels.to(device)
            optimizer.zero_grad()
            logits = model(input_features, decoder_input_ids)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            if (i+1) % 8 == 0:
                print(f'Epoch {epoch+1}/{num_epochs}, Batch {i+1}/{len(train_loader)}, Train Loss: {loss.item() :.4f}')
                train_loss = 0.0
        val_loss, val_accuracy, val_f1, _ , _ = evaluate(model, val_loader, device)
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            torch.save(model.state_dict(), 'best_model.pt')
        print("========================================================================================")
        print(f'Epoch {epoch+1}/{num_epochs}, Val Loss: {val_loss:.4f}, Val Accuracy: {val_accuracy:.4f}, Val F1: {val_f1:.4f}, Best Accuracy: {best_accuracy:.4f}')
        print("========================================================================================")

In [102]:
def evaluate(model, data_loader,  device):
    all_labels = []
    all_preds = []
    total_loss = 0.0
    with torch.no_grad():
        for i, batch in enumerate(data_loader):
          input_features, decoder_input_ids, labels = batch
          input_features = input_features.squeeze()
          input_features = input_features.to(device)
          decoder_input_ids = decoder_input_ids.squeeze()
          decoder_input_ids = decoder_input_ids.to(device)
          labels = labels.view(-1)
          labels = labels.type(torch.LongTensor)
          labels = labels.to(device)
          optimizer.zero_grad()
          logits = model(input_features, decoder_input_ids)
          loss = criterion(logits, labels)
          total_loss += loss.item()
          _, preds = torch.max(logits, 1)
          all_labels.append(labels.cpu().numpy())
          all_preds.append(preds.cpu().numpy())
    all_labels = np.concatenate(all_labels, axis=0)
    all_preds = np.concatenate(all_preds, axis=0)
    loss = total_loss / len(data_loader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='macro')
    return loss, accuracy, f1, all_labels, all_preds

In [103]:
import librosa
num_epochs = 5
train(model, train_loader, val_loader, optimizer, criterion, device, num_epochs)

Epoch 1/5, Batch 8/41, Train Loss: 0.6397
Epoch 1/5, Batch 16/41, Train Loss: 0.4986
Epoch 1/5, Batch 24/41, Train Loss: 0.2008
Epoch 1/5, Batch 32/41, Train Loss: 0.2401
Epoch 1/5, Batch 40/41, Train Loss: 0.2498
Epoch 1/5, Val Loss: 0.1442, Val Accuracy: 0.9551, Val F1: 0.9549, Best Accuracy: 0.9551
Epoch 2/5, Batch 8/41, Train Loss: 0.0223
Epoch 2/5, Batch 16/41, Train Loss: 0.4859
Epoch 2/5, Batch 24/41, Train Loss: 0.0414
Epoch 2/5, Batch 32/41, Train Loss: 0.0206
Epoch 2/5, Batch 40/41, Train Loss: 0.0325
Epoch 2/5, Val Loss: 0.1103, Val Accuracy: 0.9551, Val F1: 0.9549, Best Accuracy: 0.9551
Epoch 3/5, Batch 8/41, Train Loss: 0.0023
Epoch 3/5, Batch 16/41, Train Loss: 0.0025
Epoch 3/5, Batch 24/41, Train Loss: 0.0088
Epoch 3/5, Batch 32/41, Train Loss: 0.0349
Epoch 3/5, Batch 40/41, Train Loss: 0.0020
Epoch 3/5, Val Loss: 0.1116, Val Accuracy: 0.9775, Val F1: 0.9775, Best Accuracy: 0.9775
Epoch 4/5, Batch 8/41, Train Loss: 0.0039
Epoch 4/5, Batch 16/41, Train Loss: 0.0025
Epoch 

### Validation

Before running the model on the test set, let's examine the validation set and see how our model is doing.

In [104]:
#VALIDATION
state_dict = torch.load('best_model.pt')

# Create a new instance of the model and load the state dictionary
num_labels = 2
model = SpeechClassifier(num_labels, encoder).to(device)
model.load_state_dict(state_dict)

_, _, _, all_labels, all_preds = evaluate(model, val_loader, device)

In [105]:
#VALIDATION
print(classification_report(all_labels, all_preds))
print(accuracy_score(all_labels, all_preds))

              precision    recall  f1-score   support

           0       0.95      1.00      0.98        41
           1       1.00      0.96      0.98        48

    accuracy                           0.98        89
   macro avg       0.98      0.98      0.98        89
weighted avg       0.98      0.98      0.98        89

0.9775280898876404


This is too good to be true. Checking the contents of labels, preds, and data balance.

In [106]:
all_labels

array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       0])

In [107]:
all_preds

array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       0])

In [108]:
sum(train_labels)/len(train_labels)

0.5170731707317073

In [109]:
sum(val_labels)/len(val_labels)

0.5393258426966292

In [110]:
# TESTING ONLY
state_dict = torch.load('best_model.pt')

# Create a new instance of the model and load the state dictionary
num_labels = 2
model = SpeechClassifier(num_labels, encoder).to(device)
model.load_state_dict(state_dict)

_, _, _, all_labels, all_preds = evaluate(model, test_loader, device)


print(classification_report(all_labels, all_preds))
print(accuracy_score(all_labels, all_preds))

              precision    recall  f1-score   support

           0       0.72      1.00      0.84        36
           1       1.00      0.63      0.77        38

    accuracy                           0.81        74
   macro avg       0.86      0.82      0.81        74
weighted avg       0.86      0.81      0.80        74

0.8108108108108109


I don't want to run testing yet as we want to explore more models.

### Model Troubleshooting

So far our results look too good to be true (98% validation accuracy). In the cells below I run through some troubleshooting methods to ensure our model is not overfit or learning the wrong representations.

Ensure that the labels are correct.

In [111]:
train_df[train_df["WAV_folder"] == "CONTROLS_WAV"]["hypernasality"]

118    0.0
216    0.0
18     0.0
193    0.0
176    0.0
      ... 
121    0.0
293    0.0
20     0.0
188    0.0
270    0.0
Name: hypernasality, Length: 99, dtype: float64

In [112]:
train_df

Unnamed: 0,File_Name,Sampling_Rate_(Hz),Channels,Duration_(seconds),folder,hypernasality,original_text,OPENAI_Whisper_text,WAV_filename,WAV_folder
118,video 1 (four).mp3,44100.0,2.0,1.04,CONTROLS,0.0,four,Voila!,video 1 (four).wav,CONTROLS_WAV
216,ACPA sue roasted a duck for supper.mp3,44100.0,1.0,2.59,CONTROLS,0.0,sue roasted a duck for supper,Sue roasted a duck for supper.,NOISE-ACPA sue roasted a duck for supper.wav,CONTROLS_WAV
97,NEW - video 7 (puppy).mp3,44100.0,2.0,0.91,CASES,1.0,puppy,Fuck me!,NEW - video 7 (puppy).wav,CASES_WAV
18,cdc 4 (and then he was a boy).mp3,44100.0,2.0,1.57,CONTROLS,0.0,and then he was a boy,And then he was a boy.,cdc 4 (and then he was a boy).wav,CONTROLS_WAV
170,ACPA Tom had ham and eggs for breakfast-2.mp3,48000.0,1.0,3.65,CASES,1.0,Tom had ham and eggs for breakfast,Tom has ham and eggs for breakfast.,NOISE-ACPA Tom had ham and eggs for breakfast-...,CASES_WAV
...,...,...,...,...,...,...,...,...,...,...
188,Video 1_3 (its all of our birthdays).mp3,44100.0,2.0,1.78,CONTROLS,0.0,its all of our birthdays,It's Alma's birthday.,NOISE-Video 1_3 (its all of our birthdays).wav,CONTROLS_WAV
71,Video 4_4 (well it will help me).mp3,44100.0,2.0,2.32,CASES,1.0,well it will help me,"Wow, em vừa học đĩa",Video 4_4 (well it will help me).wav,CASES_WAV
106,ACPA buy baby a bib.mp3,48000.0,1.0,1.92,CASES,1.0,buy baby a bib,"Hi, I'm Hayley Mim.",ACPA buy baby a bib.wav,CASES_WAV
270,cdc 3 (he won_t fly away).mp3,44100.0,2.0,2.14,CONTROLS,0.0,he won't fly away,He won't fly away!,NOISE-cdc 3 (he won_t fly away).wav,CONTROLS_WAV


Making a dummy label set to make sure that my model isn't taking random guesses.

In [113]:
# dummy data
import random

# Define the length of the list you want
length = len(train_labels)  # Change this to your desired length

# Generate a list of random 1s and 0s of the specified length
dummy_list = [random.choice([0, 1]) for _ in range(length)]



In [114]:
dummy_df = train_df
dummy_df["DUMMY"] = dummy_list

In [115]:
dummy_audio_dataset = datasets.Dataset.from_dict({"audio": train_full_paths,
                                                  "labels":dummy_list}
                                                 ).cast_column("audio", Audio(sampling_rate=16_000))

dummy_dataset = SpeechClassificationDataset(dummy_audio_dataset,  feature_extractor)

batch_size = 8

dummy_loader = DataLoader(dummy_dataset, batch_size=batch_size, shuffle=True)


In [116]:
model_checkpoint = "openai/whisper-base"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_checkpoint)
encoder = WhisperModel.from_pretrained(model_checkpoint)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [117]:
num_labels = 2

model = SpeechClassifier(num_labels, encoder).to(device)
optimizer = AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.999), eps=1e-08)
criterion = nn.CrossEntropyLoss()



In [118]:
num_epochs = 5
train(model, dummy_loader, val_loader, optimizer, criterion, device, num_epochs)

Epoch 1/5, Batch 8/26, Train Loss: 0.7364
Epoch 1/5, Batch 16/26, Train Loss: 0.6733
Epoch 1/5, Batch 24/26, Train Loss: 0.6919
Epoch 1/5, Val Loss: 0.7062, Val Accuracy: 0.4719, Val F1: 0.4652, Best Accuracy: 0.4719
Epoch 2/5, Batch 8/26, Train Loss: 0.7694
Epoch 2/5, Batch 16/26, Train Loss: 0.6981
Epoch 2/5, Batch 24/26, Train Loss: 0.5655
Epoch 2/5, Val Loss: 0.8443, Val Accuracy: 0.4045, Val F1: 0.3791, Best Accuracy: 0.4719
Epoch 3/5, Batch 8/26, Train Loss: 0.2857
Epoch 3/5, Batch 16/26, Train Loss: 0.4488
Epoch 3/5, Batch 24/26, Train Loss: 0.6563
Epoch 3/5, Val Loss: 1.1812, Val Accuracy: 0.5506, Val F1: 0.4137, Best Accuracy: 0.5506
Epoch 4/5, Batch 8/26, Train Loss: 0.1893
Epoch 4/5, Batch 16/26, Train Loss: 0.2174
Epoch 4/5, Batch 24/26, Train Loss: 0.5401
Epoch 4/5, Val Loss: 1.0358, Val Accuracy: 0.4719, Val F1: 0.4621, Best Accuracy: 0.5506
Epoch 5/5, Batch 8/26, Train Loss: 0.3944
Epoch 5/5, Batch 16/26, Train Loss: 0.0689
Epoch 5/5, Batch 24/26, Train Loss: 0.0494
Epoc

Model is not learning with the dummy data....

## Simpler Model

Let's train a simpler model to see how our model does compared to a simpler one such as SVM or Random Forrest. Generated with help from ChatGPT4

### SVM

Support Vector Machine

In [119]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler


# Define a function to extract MFCCs from an audio file
def extract_mfcc_features(file_path, n_mfcc=13):
    audio, sample_rate = librosa.load(file_path, sr=None)
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
    mfccs_scaled = np.mean(mfccs.T, axis=0)  # Taking the average across time
    return mfccs_scaled

# Paths to your audio files (replace these with your actual file paths)
audio_files = train_full_paths + test_full_paths  # Add more paths as needed
labels = train_labels + test_labels  # Corresponding labels for your audio files

# Extract features from each audio file
features = [extract_mfcc_features(file) for file in audio_files]

# Split the dataset into training and testing sets
X_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)

# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(x_train)
X_test = scaler.transform(x_test)

# Initialize and train the SVM classifier
svm_model = SVC(kernel='linear')  # You can experiment with different kernels
svm_model.fit(x_train, y_train)

# Predictions
y_pred = svm_model.predict(x_val)

# Evaluate the model
print("Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:", classification_report(y_val, y_pred))


Accuracy: 0.7796610169491526
Classification Report:               precision    recall  f1-score   support

         0.0       0.74      0.87      0.80        30
         1.0       0.83      0.69      0.75        29

    accuracy                           0.78        59
   macro avg       0.79      0.78      0.78        59
weighted avg       0.79      0.78      0.78        59



### Random Forest


In [120]:
from sklearn.ensemble import RandomForestClassifier
# Initialize and train the Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100)  # You can adjust the number of trees
rf_model.fit(x_train, y_train)

# Make predictions - VAL
y_pred = rf_model.predict(x_val)

# Evaluate the classifier
print("Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:", classification_report(y_val, y_pred))

Accuracy: 0.847457627118644
Classification Report:               precision    recall  f1-score   support

         0.0       0.82      0.90      0.86        30
         1.0       0.88      0.79      0.84        29

    accuracy                           0.85        59
   macro avg       0.85      0.85      0.85        59
weighted avg       0.85      0.85      0.85        59

