## Imports

- os: builds file paths and lists the files in a folder
- cv2: resizes spectrogram images to the dimensions the CNN expects
- math: general-purpose computations
- numpy: array operations
- librosa: audio loading and processing
- matplotlib: data/spectrogram visualization
- sklearn: splits the data into training and test sets
- torch: PyTorch; provides the neural network layers, optimizer, learning rate scheduler, and dataset tools

In [30]:
import os
import cv2
import math
import numpy as np
import librosa
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn 
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import Dataset, DataLoader

#### Path definitions

These variables store the paths to the folders that contain the audio files, plus one sample file per class, which makes it easier to test specific parts of the code. A potential optimization is to define only the animals folder and derive the other paths from it.

In [31]:
animals_folder = '/Users/shrutikmk/Documents/Coding/dog-bird-project/animals'

dog_folder = '/Users/shrutikmk/Documents/Coding/dog-bird-project/animals/dog'
bird_folder = '/Users/shrutikmk/Documents/Coding/dog-bird-project/animals/bird'
other_folder = '/Users/shrutikmk/Documents/Coding/dog-bird-project/animals/other'

dog_single = '/Users/shrutikmk/Documents/Coding/dog-bird-project/animals/dog/dog_1.wav'
bird_single = '/Users/shrutikmk/Documents/Coding/dog-bird-project/animals/bird/Kus_1.wav'
other_single = '/Users/shrutikmk/Documents/Coding/dog-bird-project/animals/other/aslan_1.wav'
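
The optimization mentioned above could look roughly like this sketch, which derives every path from a single root. The helper name `build_paths` and the dictionary layout are hypothetical; only the folder and file names come from the cell above.

```python
import os

def build_paths(animals_folder):
    """Derive class folders and one sample file per class from a single root (hypothetical helper)."""
    folders = {name: os.path.join(animals_folder, name) for name in ("dog", "bird", "other")}
    samples = {
        "dog": os.path.join(folders["dog"], "dog_1.wav"),
        "bird": os.path.join(folders["bird"], "Kus_1.wav"),
        "other": os.path.join(folders["other"], "aslan_1.wav"),
    }
    return folders, samples

folders, samples = build_paths("/Users/shrutikmk/Documents/Coding/dog-bird-project/animals")
print(folders["dog"])
print(samples["bird"])
```

With this layout, moving the project only requires changing the one root string.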

## Device Definition

Selects whether PyTorch runs on an NVIDIA GPU through CUDA, which is faster, or falls back to the CPU.

- The code checks whether CUDA is available, which allows computations to be offloaded to the GPU
- A device object is constructed, and it determines where tensors (data arrays) are allocated

In [32]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Audio Dataset Generation

Defines a PyTorch Dataset class that collects the audio files from the previously defined folders and processes them. This is essentially the preprocessing step that turns each audio file into input for the CNN.

`__init__()`:
stores the folder to read from, the label to assign, and a boolean that decides whether files should be augmented, then builds the list of audio files (`.wav` or `.flac`) in that folder

`__len__()`:
returns the number of files in the file list

`__getitem__()`:
returns the item at a specified index. A spectrogram is made from the file at that index and normalized to zero mean and unit variance, then repeated across three channels. The spectrogram tensor and a tensor holding the label are returned.
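
As a rough illustration of the normalization and three-channel step in `__getitem__()`, the sketch below runs on a made-up array instead of a real spectrogram. The function name `to_three_channel` and the `eps` guard are assumptions added for the example, not part of the notebook.

```python
import numpy as np

def to_three_channel(spectrogram, eps=1e-8):
    """Standardize a 2-D spectrogram and repeat it across 3 channels (hypothetical helper)."""
    # eps is an assumed safeguard against division by zero for a constant input
    normalized = (spectrogram - spectrogram.mean()) / (spectrogram.std() + eps)
    # Add a leading channel axis, then copy the single channel three times
    return np.repeat(normalized[np.newaxis, ...], 3, axis=0)

rng = np.random.default_rng(0)
fake_spec = rng.normal(-40.0, 10.0, size=(128, 224)).astype(np.float32)  # stand-in data
image = to_three_channel(fake_spec)
print(image.shape)  # (3, 128, 224)
```

The three channels are identical copies; the repetition only makes the array match the 3-channel input that the first convolution expects.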

`make_spectrogram()`:
loads an audio file using librosa, augments the audio (if specified), computes a mel spectrogram in decibels, crops or pads it to 216 time frames, resizes it to 128x224 to match the CNN input, and returns the processed spectrogram
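
The fixed-width step in `make_spectrogram()` can be sketched in isolation as follows. The helper name `fix_width` and its `pad_value` parameter are hypothetical; padding with the array's minimum is one possible choice for a decibel-scaled spectrogram, where the quietest value stands in for silence.

```python
import numpy as np

def fix_width(spec, target_frames=216, pad_value=None):
    """Crop or right-pad a (mels, frames) array to exactly target_frames columns."""
    spec = spec[:, :target_frames]              # crop clips that are too long
    missing = target_frames - spec.shape[1]
    if missing > 0:                             # pad clips that are too short
        if pad_value is None:
            pad_value = spec.min()              # assumed choice: quietest value in the clip
        spec = np.pad(spec, ((0, 0), (0, missing)), constant_values=pad_value)
    return spec

short_clip = np.full((128, 100), -30.0)
long_clip = np.zeros((128, 300))
print(fix_width(short_clip).shape, fix_width(long_clip).shape)  # (128, 216) (128, 216)
```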

`augment_audio()`:
augments the audio in one of three ways: time_stretch speeds up or slows down the audio, add_noise adds random noise, and pitch_shift raises or lowers the pitch. Bounds are kept conservative so that the audio stays recognizable. This helps expand the training data.
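
One way to keep additive noise within conservative bounds is to scale it to a target signal-to-noise ratio. The sketch below is an illustration under that assumption, not the notebook's method; the name `add_noise_at_snr` and the 20 dB default are hypothetical.

```python
import numpy as np

def add_noise_at_snr(y, snr_db=20.0, rng=None):
    """Add Gaussian noise so the result has roughly the requested SNR in dB."""
    if rng is None:
        rng = np.random.default_rng()
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))   # SNR = 10 * log10(Ps / Pn)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=len(y))
    return y + noise

# Stand-in signal: one second of a 440 Hz tone at 22050 Hz
t = np.arange(22050) / 22050
y = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = add_noise_at_snr(y, snr_db=20.0, rng=np.random.default_rng(0))
measured = 10 * np.log10(np.mean(y ** 2) / np.mean((noisy - y) ** 2))
print(round(measured, 1))
```

A higher `snr_db` gives a cleaner signal, so the augmentation strength can be tuned with a single number.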

In [33]:
class AudioDataset(Dataset):
    def __init__(self, folder_name, label, augment=False):
        self.folder_name = folder_name
        self.label = label
        self.augment = augment
        self.file_list = [os.path.join(folder_name, fname) for fname in os.listdir(folder_name) if fname.endswith('.wav') or fname.endswith('.flac')]

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        spectrogram = self.make_spectrogram(self.file_list[idx], self.augment)
        # Small epsilon guards against division by zero for a constant spectrogram
        spectrogram = (spectrogram - np.mean(spectrogram)) / (np.std(spectrogram) + 1e-8)
        spectrogram = np.repeat(spectrogram[np.newaxis, ...], 3, axis=0) 
        return torch.from_numpy(spectrogram).float(), torch.tensor(self.label, dtype=torch.long)

    def make_spectrogram(self, file_name, augment):
        y, sr = librosa.load(file_name, sr=None, mono=True, res_type='kaiser_best')
        if augment:
            augment_type = np.random.choice(['time_stretch', 'pitch_shift', 'add_noise'], p=[0.3, 0.3, 0.4])
            y = self.augment_audio(y, sr, augment_type)
        spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)
        spectrogram = librosa.power_to_db(spectrogram, ref=np.max)
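        spec_data = spectrogram[:128, :216]
        if spec_data.shape[1] < 216:
            # Pad with the quietest dB value; zeros would be the loudest level because ref=np.max puts the peak at 0 dB
            spec_data = np.pad(spec_data, ((0, 0), (0, 216 - spec_data.shape[1])), constant_values=spec_data.min())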
        spec_data = cv2.resize(spec_data, (224, 128))
        return spec_data

    def augment_audio(self, y, sr, augment_type):
        if augment_type == 'time_stretch':
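            speed_factor = np.random.uniform(0.9, 1.1)
            new_length = int(len(y) / speed_factor)
            # Resample by linear interpolation; this changes the duration and shifts the pitch along with it
            y_augmented = np.interp(np.linspace(0, len(y) - 1, new_length), np.arange(len(y)), y)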
        elif augment_type == 'pitch_shift':
            steps = np.random.randint(-2, 3)
            if steps == 0:
                steps = 1
            y_augmented = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
        elif augment_type == 'add_noise':
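            # Scale the noise well below the signal: librosa loads audio in [-1, 1], so unit-variance noise would drown it out
            # (0.005 is an assumed amplitude; tune as needed)
            noise = 0.005 * np.random.randn(len(y))
            y_augmented = y + noise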
        else:
            return y
        return y_augmented

## Audio Classifier

Defines the CNN that processes the audio spectrograms and classifies each one with a label.

`__init__()`:
- two convolutional layers scan the spectrogram image generated earlier and extract patterns from it
- the resulting feature maps are flattened and passed to a fully connected layer with 128 features (number decided on based on image dimensions), and a final layer maps those features to the 3 classes

`forward()`:
- applies the layers in order and prints the tensor shape after each stage for debugging purposes
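
The input size of the first fully connected layer follows from the image dimensions. The sketch below traces a 128x224 input through two rounds of 5x5 convolution (no padding, stride 1) and 2x2 max pooling using the standard output-size formulas; the helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel):
    """Output length of non-overlapping max pooling along one dimension."""
    return size // kernel

height, width = 128, 224
for _ in range(2):  # two conv(5x5) + max-pool(2) stages
    height = pool_out(conv_out(height, 5), 2)
    width = pool_out(conv_out(width, 5), 2)

flattened = 64 * height * width  # 64 output channels from the second convolution
print(height, width, flattened)  # 29 53 98368
```

The result, 98368, equals `64*53*29`, the input size given to the first `nn.Linear` layer.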

In [34]:
class AudioClassifier(nn.Module):
    def __init__(self):
        super(AudioClassifier, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.fc1 = nn.Linear(64*53*29, 128)
        self.fc2 = nn.Linear(128, 3)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        print(f"Shape after conv1: {x.shape}")
        x = F.max_pool2d(x, 2)
        print(f"Shape after pool1: {x.shape}")
        
        x = F.relu(self.conv2(x))
        print(f"Shape after conv2: {x.shape}")
        x = F.max_pool2d(x, 2)
        print(f"Shape after pool2: {x.shape}")
        
        x = x.view(x.size(0), -1)
        print(f"Shape after flattening: {x.shape}")
        
        x = F.relu(self.fc1(x))
        print(f"Shape after fc1: {x.shape}")
        
        x = F.log_softmax(self.fc2(x), dim=1)
        print(f"Shape after fc2: {x.shape}")
        
        return x

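## Training and Evaluating Performance of Model

- model set to training mode; for each batch, gradients are zeroed, the loss is computed and backpropagated, and the optimizer takes a step
- scheduler updates the learning rate once per epoch
- model set to evaluation mode; accuracy on the test set is computed without tracking gradients and printed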

In [35]:
def train_and_evaluate(model, loss_fn, optimizer, scheduler, epochs, train_loader, test_loader):
    for epoch in range(epochs):
      
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()
            optimizer.step()
        scheduler.step()

        
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for inputs, targets in test_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                predicted = outputs.argmax(dim=1)  # class with the highest log-probability per sample
                total += targets.size(0)
                correct += (predicted == targets).sum().item()
        print(f'Epoch {epoch+1}/{epochs}, Accuracy: {100 * correct / total}%')
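
The accuracy bookkeeping in the evaluation half of the loop can be illustrated without a model. The sketch below uses NumPy and made-up scores for two small batches; the helper name `count_correct` and all values are illustrative only.

```python
import numpy as np

def count_correct(log_probs, targets):
    """Return (number correct, batch size) for one batch of class scores."""
    predicted = np.argmax(log_probs, axis=1)  # highest-scoring class per row
    return int(np.sum(predicted == targets)), len(targets)

# Two made-up batches of scores for 3 classes
batches = [
    (np.log(np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])), np.array([0, 1])),
    (np.log(np.array([[0.3, 0.3, 0.4], [0.6, 0.3, 0.1]])), np.array([2, 1])),
]

correct = total = 0
for log_probs, targets in batches:
    batch_correct, batch_size = count_correct(log_probs, targets)
    correct += batch_correct
    total += batch_size

accuracy = 100 * correct / total
print(f"Accuracy: {accuracy}%")  # Accuracy: 75.0%
```

Accumulating counts across batches and dividing once at the end gives the correct overall accuracy even when the last batch is smaller than the others.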


The run took 116m59.3s and reached 97% accuracy.

In [36]:
def main():
    # Load data
    dog_dataset = AudioDataset(dog_folder, 0, augment=True)
    bird_dataset = AudioDataset(bird_folder, 1, augment=True)
    other_dataset = AudioDataset(other_folder, 2, augment=True)
    dataset = torch.utils.data.ConcatDataset([dog_dataset, bird_dataset, other_dataset])
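    # Note: augment=True is set on all three datasets, so samples that land in the test split are augmented too
    train_dataset, test_dataset = train_test_split(dataset, test_size=0.2, random_state=42)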
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

    # Initialize model, loss function, and optimizer
    model = AudioClassifier().to(device)   
    # The model already returns log-probabilities (log_softmax), so NLLLoss is the matching loss
    loss_fn = nn.NLLLoss()
    optimizer = Adam(model.parameters(), lr=0.001)
    scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 0.1 if epoch > 4 else 1)

    # Train and evaluate
    train_and_evaluate(model, loss_fn, optimizer, scheduler, epochs=20, train_loader=train_loader, test_loader=test_loader)
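
`LambdaLR` multiplies the optimizer's initial learning rate by the value the lambda returns for the current epoch count, which starts at 0 and increases by one on each `scheduler.step()`. The plain-Python sketch below reproduces the schedule used in `main()` without PyTorch; the function name `lr_at_epoch` is hypothetical.

```python
def lr_at_epoch(epoch, base_lr=0.001):
    """Learning rate in effect during a given 0-based epoch under the lambda from main()."""
    factor = 0.1 if epoch > 4 else 1
    return base_lr * factor

schedule = [lr_at_epoch(epoch) for epoch in range(20)]
for epoch in (0, 4, 5, 19):
    print(epoch, schedule[epoch])
```

The first five epochs (0 through 4) train at the base rate, and the remaining fifteen train at one tenth of it.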

In [37]:
if __name__ == "__main__":
    main()

Note: Illegal Audio-MPEG-Header 0x4c495354 at offset 40390.
Note: Trying to resync...
Note: Hit end of (available) data during resync.
Note: Illegal Audio-MPEG-Header 0x66616374 at offset 9082.
Note: Trying to resync...
Note: Hit end of (available) data during resync.
Note: Illegal Audio-MPEG-Header 0x66616374 at offset 2266.
Note: Trying to resync...
Note: Hit end of (available) data during resync.
Note: Illegal Audio-MPEG-Header 0x66616374 at offset 12730.
Note: Trying to resync...
Note: Hit end of (available) data during resync.


Shape after conv1: torch.Size([64, 32, 124, 220])
Shape after pool1: torch.Size([64, 32, 62, 110])
Shape after conv2: torch.Size([64, 64, 58, 106])
Shape after pool2: torch.Size([64, 64, 29, 53])
Shape after flattening: torch.Size([64, 98368])
Shape after fc1: torch.Size([64, 128])
Shape after fc2: torch.Size([64, 3])