# MLPs for Frame-Level Speech Recognition

### Author: Lakshay Sethi
#### Custom Architecture (Designed by Lakshay Sethi through Ablations)
##### Ablations Link: https://wandb.ai/highcutoff/Lakshay?workspace=user-sethilakshay13

The model uses MFCC data, with 27 features per time step (frame), to identify the phoneme spoken in each frame.

# Libraries

In [None]:
!nvidia-smi

Thu Feb 16 09:53:04 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    26W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install torchsummaryX wandb --quiet
!pip install speechpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import torch
import numpy as np
from torchsummaryX import summary
import sklearn
import speechpy
import gc
import zipfile
import pandas as pd
from tqdm.auto import tqdm
import os
import datetime
import wandb
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

Device:  cuda


In [None]:
### If you are using colab, you can import google drive to save model checkpoints in a folder
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
### PHONEME LIST
PHONEMES = [
            '[SIL]',   'AA',    'AE',    'AH',    'AO',    'AW',    'AY',  
            'B',     'CH',    'D',     'DH',    'EH',    'ER',    'EY',
            'F',     'G',     'HH',    'IH',    'IY',    'JH',    'K',
            'L',     'M',     'N',     'NG',    'OW',    'OY',    'P',
            'R',     'S',     'SH',    'T',     'TH',    'UH',    'UW',
            'V',     'W',     'Y',     'Z',     'ZH']

PHONEMES_DICT = {
            '[SIL]':0,  'AA':1,     'AE':2,     'AH':3,     'AO':4,     'AW':5,     'AY':6,  
            'B':7,      'CH':8,     'D':9,      'DH':10,    'EH':11,    'ER':12,    'EY':13,
            'F':14,     'G':15,     'HH':16,    'IH':17,    'IY':18,    'JH':19,    'K':20,
            'L':21,     'M':22,     'N':23,     'NG':24,    'OW':25,    'OY':26,    'P':27,
            'R':28,     'S':29,     'SH':30,    'T':31,     'TH':32,    'UH':33,    'UW':34,
            'V':35,     'W':36,     'Y':37,     'Z':38,     'ZH':39}
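
The dictionary above repeats the list by hand, so the two can drift apart if either one is edited. As a minimal sketch (the names `phoneme_to_index` and `encode` are illustrative and not part of the notebook), the same mapping can be derived from the list itself:

```python
import numpy as np

PHONEMES = [
    '[SIL]', 'AA', 'AE', 'AH', 'AO', 'AW', 'AY',
    'B', 'CH', 'D', 'DH', 'EH', 'ER', 'EY',
    'F', 'G', 'HH', 'IH', 'IY', 'JH', 'K',
    'L', 'M', 'N', 'NG', 'OW', 'OY', 'P',
    'R', 'S', 'SH', 'T', 'TH', 'UH', 'UW',
    'V', 'W', 'Y', 'Z', 'ZH']

# Derive the index mapping from the list so the two cannot go out of sync
phoneme_to_index = {p: i for i, p in enumerate(PHONEMES)}

def encode(transcript):
    """Map a sequence of phoneme strings to an array of integer labels."""
    return np.array([phoneme_to_index[p] for p in transcript], dtype=np.uint8)

print(encode(['[SIL]', 'AA', 'ZH']))
```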

# Kaggle

This section installs Kaggle's API and creates kaggle.json with your username and API key. Enter your own credentials in the code below so that you can download the competition data.
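
One way to avoid typing credentials into the notebook is to read them from the environment. The sketch below is only an illustration: the helper `write_kaggle_json` is hypothetical, the environment variable names `KAGGLE_USERNAME` and `KAGGLE_KEY` are assumed to be set by you beforehand, and the demo writes to a temporary directory rather than `/root/.kaggle`.

```python
import json
import os
import tempfile

def write_kaggle_json(directory, username, key):
    """Write kaggle.json into `directory` with owner-only permissions."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "kaggle.json")
    with open(path, "w") as f:
        json.dump({"username": username, "key": key}, f)
    os.chmod(path, 0o600)  # same effect as `chmod 600`
    return path

# Demo with placeholder values in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_json(
    demo_dir,
    os.environ.get("KAGGLE_USERNAME", "<your-kaggle-username>"),
    os.environ.get("KAGGLE_KEY", "<your-kaggle-api-key>"),
)
print(demo_path)
```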

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir /root/.kaggle

with open("/root/.kaggle/kaggle.json", "w+") as f:
    # Put your kaggle username & key here; do not share or commit real credentials
    f.write('{"username":"<your-kaggle-username>","key":"<your-kaggle-api-key>"}')

!chmod 600 /root/.kaggle/kaggle.json

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kaggle==1.5.8
  Using cached kaggle-1.5.8-py3-none-any.whl
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.8
    Uninstalling kaggle-1.5.8:
      Successfully uninstalled kaggle-1.5.8
Successfully installed kaggle-1.5.8
mkdir: cannot create directory ‘/root/.kaggle’: File exists


In [None]:
# commands to download data from kaggle

# !kaggle competitions download -c 11-785-s23-hw1p2
# !mkdir '/content/data'

# !unzip -qo '11-785-s23-hw1p2.zip' -d '/content/data'
# !rm -rf '11-785-s23-hw1p2.zip'

# Dataset

This section covers the dataset/dataloader class for speech data. You will have to spend time writing code to create this class successfully. We have given you a lot of comments guiding you on what code to write at each stage, from top to bottom of the class. Please try and take your time figuring this out, as it will immensely help in creating dataset/dataloader classes for future homeworks.

Before running the following cells, please take some time to analyze the structure of the data. Try loading a single MFCC and its transcript, then print their shapes and values. Do the transcripts look like phonemes?
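
A minimal sketch of such an inspection is shown here. It uses synthetic stand-in files written to a temporary directory, so the file names and values are assumptions; with the real data you would point `np.load` at one file from an `mfcc/` directory and the matching file from `transcript/`.

```python
import os
import tempfile
import numpy as np

# Synthetic stand-ins for one utterance with T = 5 frames (assumed layout)
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, "mfcc.npy"), np.random.randn(5, 27).astype(np.float32))
np.save(os.path.join(tmp, "transcript.npy"),
        np.array(["[SOS]", "[SIL]", "AH", "AH", "B", "[SIL]", "[EOS]"]))

mfcc = np.load(os.path.join(tmp, "mfcc.npy"))
transcript = np.load(os.path.join(tmp, "transcript.npy"))

print("mfcc shape       :", mfcc.shape)        # (frames, features)
print("transcript shape :", transcript.shape)  # one label per frame, plus [SOS] and [EOS]
print("labels           :", transcript[1:-1])  # labels with [SOS] and [EOS] removed
```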

In [None]:
# Dataset class to load train and validation data

class AudioDataset(torch.utils.data.Dataset):

    def __init__(self, root, phonemes = PHONEMES, context=0, partition= ['train-clean-100'], ablations = False, cutOff = 1.0): # Feel free to add more arguments

        self.context    = context
        self.phonemes   = phonemes

        self.mfccs, self.transcripts = [], []
        
        for idx in range(len(partition)):
            
            self.mfcc_dir       = root + partition[idx] + '/mfcc/'   # Accessing the mfcc directory
            self.transcript_dir = root + partition[idx] + '/transcript/'  # Accessing the transcript directory

            mfcc_names          = os.listdir(self.mfcc_dir) #List files in self.mfcc_dir using os.listdir 
            mfcc_names.sort()

            transcript_names    = os.listdir(self.transcript_dir)   #List files in self.transcript_dir using os.listdir 
            transcript_names.sort()

            assert len(mfcc_names) == len(transcript_names) # Making sure that we have the same no. of mfcc and transcripts

            for i in range(len(mfcc_names)):    #Iterating through mfccs and transcripts
            
                mfcc        = np.load(self.mfcc_dir + mfcc_names[i], allow_pickle=True) #   Load a single mfcc
                mfcc        = (mfcc - np.mean(mfcc, axis = 0))/np.std(mfcc, axis = 0)   #   Do Cepstral Normalization of mfcc

                transcript  = np.load(self.transcript_dir + transcript_names[i], allow_pickle=True) #   Load the transcript
                transcript = transcript[1:-1]   # Remove [SOS] and [EOS] from the transcript. SOS will always be in the starting and EOS at end
                
                # Append each mfcc to self.mfccs and each transcript to self.transcripts
                self.mfccs.append(mfcc)
                self.transcripts.append(transcript)
                
                # For ablations, stop once roughly a cutOff fraction of the files in this partition has been loaded (e.g. cutOff = 0.2 for 20%)
                if((ablations) and (i >= cutOff*len(mfcc_names))):
                    break

        # NOTE:
        # Each mfcc is of shape T1 x 27, T2 x 27, ...
        # Each transcript is of shape (T1+2,), (T2+2,), ... before removing [SOS] and [EOS]

        # Concatenating self.mfcc final shape is T x 27 (Where T = T1 + T2 + ...) 
        self.mfccs          = np.concatenate(self.mfccs, axis = 0)

        # Concatenating self.transcripts the final shape is (T,) meaning, each time step has one phoneme output
        self.transcripts    = np.concatenate(self.transcripts, axis = 0)

        # Length of the dataset is now the length of concatenated mfccs/transcripts
        self.length = len(self.mfccs)

        # Context Padding
        self.mfccs = np.pad(self.mfccs, ((self.context, self.context), (0, 0)), mode = 'constant', constant_values = 0)

        # The available phonemes in the transcript are of string data type
        # But the neural network cannot predict strings as such. 
        # Hence, we map these phonemes to integers

        # Map the phonemes to their corresponding integer indices using PHONEMES_DICT
        self.transcripts = np.array([PHONEMES_DICT[trans] for trans in self.transcripts])
        
    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # TODO: Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind: ind+2*self.context+1]
        # After slicing, you get an array of shape 2*context+1 x 27. But our MLP needs 1d data and not 2d.
        frames = frames.flatten()

        # Converting to Tensors
        frames      = torch.FloatTensor(frames)
        phonemes    = torch.tensor(self.transcripts[ind])       

        return frames, phonemes
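
The padding and slicing in `__getitem__` can be checked on a toy array. This is a small sketch with made-up sizes (4 frames, 3 features, context 2) rather than the notebook's 27 features; the helper `window` is illustrative only.

```python
import numpy as np

context = 2
frames = np.arange(1, 13, dtype=np.float32).reshape(4, 3)  # 4 frames, 3 features each

# Zero-pad `context` frames at both ends, as the dataset classes do
padded = np.pad(frames, ((context, context), (0, 0)), mode="constant", constant_values=0)

def window(ind):
    """Return frame `ind` with `context` frames on each side, flattened to 1-D."""
    return padded[ind: ind + 2 * context + 1].flatten()

first = window(0)
last = window(len(frames) - 1)
print(first.shape)  # (2 * context + 1) * 3 values
print(first)
```

Because of the padding, index `ind` in the unpadded data is the centre of the slice `padded[ind : ind + 2*context + 1]`, so no offset arithmetic is needed in `__getitem__`.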

In [None]:
# Dataset class to only load train Data

class AudioDatasetEfficientTrain(torch.utils.data.Dataset):

    def __init__(self, root, phonemes = PHONEMES, phonemes_dict = PHONEMES_DICT, context = 0, partition = ['train-clean-100']):

        self.context    = context
        self.phonemes   = phonemes

        self.masked     = []
    
        ############ Efficient Data Loader ###############
        ### Need to know the shape first
        T = 0
        for idx in range(len(partition)):
        
            self.mfcc_dir       = root + partition[idx] + '/mfcc/'   # Accessing the mfcc directory
            self.transcript_dir = root + partition[idx] + '/transcript/'  # Accessing the transcript directory

            mfcc_names          = os.listdir(self.mfcc_dir) #List files in self.mfcc_dir using os.listdir 
            mfcc_names.sort()

            transcript_names    = os.listdir(self.transcript_dir)   #List files in self.transcript_dir using os.listdir 
            transcript_names.sort()

            assert len(mfcc_names) == len(transcript_names) # Making sure that we have the same no. of mfcc and transcripts

            for i in range(len(mfcc_names)):
                mfcc = np.load(self.mfcc_dir + mfcc_names[i], allow_pickle=True)

                # Augmenting the dataset: roughly 20% of the utterances are added a second time and masked later (time or frequency)
                if np.random.rand() <= 0.2:
                    self.masked.append(self.mfcc_dir + mfcc_names[i])   # Copying the full Path name here
                    T += mfcc.shape[0]  # Updating T

                T += mfcc.shape[0]

        partition = list(partition) + ['masked']   # Copy, so the caller's list and the default argument are not mutated

        self.mfccs = np.zeros(shape = (T + 2*context, mfcc.shape[1]), dtype=np.float32)
        self.transcripts = np.zeros(shape = (T, ), dtype=np.uint8)
        cx, cy = context, 0


        ############## Loading the files ###############
        for idx in range(len(partition)):

            self.mfcc_dir       = root + partition[idx] + '/mfcc/'   # Accessing the mfcc directory
            self.transcript_dir = root + partition[idx] + '/transcript/'  # Accessing the transcript directory

            if partition[idx] == 'masked':  # Use the full mfcc paths selected for masking in the first pass
                mfcc_names          = self.masked
                mfcc_names.sort()

                transcript_names    = self.masked   # Transcript paths are derived later by replacing 'mfcc' with 'transcript'
                transcript_names.sort()

            else:
                mfcc_names          = os.listdir(self.mfcc_dir) #List files in self.mfcc_dir using os.listdir 
                mfcc_names.sort()

                transcript_names    = os.listdir(self.transcript_dir)   #List files in self.transcript_dir using os.listdir 
                transcript_names.sort()

            assert len(mfcc_names) == len(transcript_names) # Making sure that we have the same no. of mfcc and transcripts


            for i in range(len(mfcc_names)):    #Iterating through mfccs and transcripts

                if partition[idx] == 'masked':  # Apply frequency or time masking to the selected utterances

                    mfcc        = np.load(mfcc_names[i], allow_pickle=True) #   Load a single mfcc
                    mfcc        = (mfcc - np.mean(mfcc, axis = 0))/np.std(mfcc, axis = 0)   #   Do Cepstral Normalization of mfcc

                    transcript  = np.load(transcript_names[i].replace('mfcc', 'transcript'), allow_pickle=True)   #   Load the corresponding transcript
                    transcript = transcript[1:-1]   # Remove [SOS] and [EOS] from the transcript. Note that SOS will always be in the starting and EOS at end, as the name suggests.

                    if np.random.rand() <= 0.5: # Frequency Masking
                        num_freq_mask = np.random.randint(low = int(mfcc.shape[1] * 0.2), high = int(mfcc.shape[1] * 0.3)+1)   # Select a frequency mask width between 20% and 30% of the features
                        start_band = np.random.randint(low = 0, high = mfcc.shape[1] - num_freq_mask)
                        mfcc[:, start_band : start_band+num_freq_mask] = 0      # Setting it to 0

                    else:   # Time Masking
                        mask_duration = np.random.randint(low = int(mfcc.shape[0] * 0.2), high = int(mfcc.shape[0] * 0.3)+1) # Select a time mask duration between 20% and 30% of the frames
                        mask_start = np.random.randint(low = 0, high = mfcc.shape[0] - mask_duration)
                        mfcc[mask_start : mask_start+mask_duration, :] = 0      # Setting it to 0


                else:   
                    mfcc        = np.load(self.mfcc_dir + mfcc_names[i], allow_pickle=True) #   Load a single mfcc
                    mfcc        = (mfcc - np.mean(mfcc, axis = 0))/np.std(mfcc, axis = 0)   #   Do Cepstral Normalization of mfcc

                    transcript  = np.load(self.transcript_dir + transcript_names[i], allow_pickle=True)   #   Load the corresponding transcript
                    transcript = transcript[1:-1]   # Remove [SOS] and [EOS] from the transcript. Note that SOS will always be in the starting and EOS at end, as the name suggests.
                
                # Loading the data more efficiently
                T_i = mfcc.shape[0] 
                self.mfccs[cx:cx+T_i] = mfcc
                self.transcripts[cy:cy+T_i] = np.array([phonemes_dict[trans] for trans in transcript])

                cx += T_i
                cy += T_i

            ### No need to concatenate since self.mfccs and self.transcripts are in desired shape


        # Length of the dataset is now the length of concatenated mfccs/transcripts
        self.length = T

        # No Need for Padding which is already taken care of

        ### No need to map since self.transcripts already contains indices
        #self.transcripts = np.array([self.phonemes.index(i) for i in self.transcripts])

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # TODO: Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind: ind+2*self.context+1]
        # After slicing, you get an array of shape 2*context+1 x 27. But our MLP needs 1d data and not 2d.
        frames = frames.flatten()

        # Converting to Tensors
        frames      = torch.FloatTensor(frames)
        phonemes    = torch.tensor(self.transcripts[ind])       

        return frames, phonemes
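
The masking step inside the class can be isolated into two small functions. The sketch below is an illustration under assumptions: the function names are invented, a seeded `np.random.default_rng` replaces the global NumPy random state, and the input is copied rather than modified in place as the notebook does.

```python
import numpy as np

def frequency_mask(mfcc, rng, low=0.2, high=0.3):
    """Zero a random contiguous band covering 20-30% of the feature columns."""
    out = mfcc.copy()
    num_features = out.shape[1]
    width = int(rng.integers(int(num_features * low), int(num_features * high) + 1))
    start = int(rng.integers(0, num_features - width + 1))
    out[:, start:start + width] = 0
    return out

def time_mask(mfcc, rng, low=0.2, high=0.3):
    """Zero a random contiguous run covering 20-30% of the frames."""
    out = mfcc.copy()
    num_frames = out.shape[0]
    duration = int(rng.integers(int(num_frames * low), int(num_frames * high) + 1))
    start = int(rng.integers(0, num_frames - duration + 1))
    out[start:start + duration, :] = 0
    return out

rng = np.random.default_rng(0)
mfcc = np.ones((100, 27), dtype=np.float32)  # all ones, so masked entries are easy to count
freq_masked = frequency_mask(mfcc, rng)
time_masked = time_mask(mfcc, rng)

zero_columns = int((freq_masked == 0).all(axis=0).sum())
zero_rows = int((time_masked == 0).all(axis=1).sum())
print(zero_columns, zero_rows)
```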

In [None]:
# Dataset class to only load validation Data

class AudioDatasetEfficientVal(torch.utils.data.Dataset):

    def __init__(self, root, phonemes = PHONEMES, phonemes_dict = PHONEMES_DICT, context=0, partition= ['dev-clean']): # Feel free to add more arguments

        self.context    = context
        self.phonemes   = phonemes
    
        ############ Efficient Data Loader ###############
        ### Need to know the shape first
        T = 0
        for idx in range(len(partition)):
        
            self.mfcc_dir       = root + partition[idx] + '/mfcc/'   # Accessing the mfcc directory
            self.transcript_dir = root + partition[idx] + '/transcript/'  # Accessing the transcript directory

            mfcc_names          = os.listdir(self.mfcc_dir) #List files in self.mfcc_dir using os.listdir 
            mfcc_names.sort()

            transcript_names    = os.listdir(self.transcript_dir)   #List files in self.transcript_dir using os.listdir 
            transcript_names.sort()

            assert len(mfcc_names) == len(transcript_names) # Making sure that we have the same no. of mfcc and transcripts

            for i in range(len(mfcc_names)):
                mfcc = np.load(self.mfcc_dir + mfcc_names[i], allow_pickle=True)
                T += mfcc.shape[0]


        self.mfccs = np.zeros(shape = (T + 2*context, mfcc.shape[1]), dtype=np.float32)
        self.transcripts = np.zeros(shape = (T, ), dtype=np.uint8)
        cx, cy = context, 0


        ############## Loading the files ###############
        for idx in range(len(partition)):
        
            self.mfcc_dir       = root + partition[idx] + '/mfcc/'   # Accessing the mfcc directory
            self.transcript_dir = root + partition[idx] + '/transcript/'  # Accessing the transcript directory

            mfcc_names          = os.listdir(self.mfcc_dir) #List files in self.mfcc_dir using os.listdir 
            mfcc_names.sort()

            transcript_names    = os.listdir(self.transcript_dir)   #List files in self.transcript_dir using os.listdir 
            transcript_names.sort()

            assert len(mfcc_names) == len(transcript_names) # Making sure that we have the same no. of mfcc and transcripts

            for i in range(len(mfcc_names)):    #Iterating through mfccs and transcripts
        
                mfcc        = np.load(self.mfcc_dir + mfcc_names[i], allow_pickle=True) #   Load a single mfcc
                mfcc        = (mfcc - np.mean(mfcc, axis = 0))/np.std(mfcc, axis = 0)   #   Do Cepstral Normalization of mfcc

                transcript  = np.load(self.transcript_dir + transcript_names[i], allow_pickle=True)   #   Load the corresponding transcript
                transcript = transcript[1:-1]   # Remove [SOS] and [EOS] from the transcript. Note that SOS will always be in the starting and EOS at end, as the name suggests.
            
                # Loading the data more efficiently
                T_i = mfcc.shape[0] 
                self.mfccs[cx:cx+T_i] = mfcc
                self.transcripts[cy:cy+T_i] = np.array([phonemes_dict[trans] for trans in transcript])

                cx += T_i
                cy += T_i

                    
            ### No need to concatenate since self.mfccs and self.transcripts are in desired shape


        # Length of the dataset is now the length of concatenated mfccs/transcripts
        self.length = T

        # No Need for Padding which is already taken care of

        ### No need to map since self.transcripts already contains indices
        #self.transcripts = np.array([self.phonemes.index(i) for i in self.transcripts])

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # TODO: Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind: ind+2*self.context+1]
        # After slicing, you get an array of shape 2*context+1 x 27. But our MLP needs 1d data and not 2d.
        frames = frames.flatten()

        # Converting to Tensors
        frames      = torch.FloatTensor(frames)
        phonemes    = torch.tensor(self.transcripts[ind])       

        return frames, phonemes
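
The two "efficient" dataset classes avoid `np.concatenate` by counting frames first and then writing into one preallocated buffer. A toy sketch of that two-pass pattern, using small in-memory arrays in place of files (sizes are assumptions for illustration), shows that it produces the same result as concatenating and padding:

```python
import numpy as np

context = 2
# Three toy "utterances" with 4, 2 and 5 frames of 3 features each
utterances = [np.full((t, 3), fill_value=k + 1, dtype=np.float32)
              for k, t in enumerate([4, 2, 5])]

# Pass 1: only count the frames
total = sum(u.shape[0] for u in utterances)

# Pass 2: allocate once and copy each utterance into place.
# The zero rows left at both ends serve as the context padding.
buffer = np.zeros((total + 2 * context, utterances[0].shape[1]), dtype=np.float32)
cursor = context
for u in utterances:
    buffer[cursor:cursor + u.shape[0]] = u
    cursor += u.shape[0]

# Reference result built the way AudioDataset does it
reference = np.pad(np.concatenate(utterances, axis=0), ((context, context), (0, 0)))
print(np.array_equal(buffer, reference))
```

The preallocated version never holds both the list of per-file arrays and the concatenated copy in memory at the same time, which is the likely motivation for it on the full training set.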

In [None]:
class AudioTestDataset(torch.utils.data.Dataset):
    
    def __init__(self, root, context=0, partition= 'test-clean'):

        self.context    = context
        
        # MFCC directory - use partition to access the test directory from kaggle data using root
        self.mfcc_dir       = root + partition + '/mfcc/'

        # TODO: List files in self.mfcc_dir using os.listdir in sorted order
        mfcc_names          = os.listdir(self.mfcc_dir)
        mfcc_names.sort()

        self.mfccs          = []

        # Iterate through the mfccs (there are no transcripts for the test set)
        for i in range(len(mfcc_names)):
    
            mfcc        = np.load(self.mfcc_dir + mfcc_names[i], allow_pickle=True) # Load a single mfcc
            mfcc        = (mfcc - np.mean(mfcc, axis = 0))/np.std(mfcc, axis = 0)   #   Do Cepstral Normalization of mfcc

            self.mfccs.append(mfcc)        

        # NOTE:
        # Each mfcc is of shape T1 x 27, T2 x 27, ...
        # The test partition has no transcripts

        # TODO: Concatenate all mfccs in self.mfccs such that 
        # the final shape is T x 27 (Where T = T1 + T2 + ...) 
        self.mfccs      = np.concatenate(self.mfccs, axis = 0)

        # Length of the dataset is now the length of concatenated mfccs/transcripts
        self.length = len(self.mfccs)

        # Context Padding
        self.mfccs = np.pad(self.mfccs, ((self.context, self.context), (0, 0)), mode = 'constant', constant_values = 0)

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # TODO: Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind: ind+2*self.context+1]
        # After slicing, you get an array of shape 2*context+1 x 27. But our MLP needs 1d data and not 2d.
        frames = frames.flatten()
        # Converting to Tensors
        frames      = torch.FloatTensor(frames)   

        return frames
    # NOTE: This class mirrors the datasets above but has no transcripts.
    # Imp: The mfccs are read in sorted order; do NOT shuffle the data here or in your dataloader.

# Parameters

Storing your parameters and hyperparameters in a single configuration dictionary makes it easier to keep track of them during each experiment. The same dictionary can be passed to Weights & Biases to log the parameters of each run and compare them across experiments.

In [None]:
config = {
    'epochs'        : 38,
    'batch_size'    : 2048,
    'context'       : 20,
    'learning_rate' : 1e-3,
    'architecture'  : 'HighCutoff',
    'best_accuracy' : 0,
    'dropout'       : 0.20,
    'convergence'   : 0
    # Add more as you need them - e.g dropout values, weight decay, scheduler parameters
}

In [None]:
# train_data_efficient = AudioDatasetEfficientVal(root = '/content/data/11-785-s23-hw1p2/', phonemes = PHONEMES, phonemes_dict = PHONEMES_DICT, 
#                         context = config['context'], partition = ['dev-clean']) 

# train_data = AudioDataset(root = '/content/data/11-785-s23-hw1p2/', phonemes = PHONEMES,
#                         context = config['context'], partition = ['dev-clean']) 

# print(train_data_efficient.__len__())
# print(train_data.__len__())

## Create Datasets and Dataloaders

In [None]:
# Create a dataset object using the AudioDatasetEfficientTrain class for the training data
train_data = AudioDatasetEfficientTrain(root = '/content/data/11-785-s23-hw1p2/', phonemes = PHONEMES, phonemes_dict = PHONEMES_DICT,
                        context = config['context'], partition = ['train-clean-360', 'train-clean-100']) 

# Create a dataset object using the AudioDatasetEfficientVal class for the validation data
val_data = AudioDatasetEfficientVal(root = '/content/data/11-785-s23-hw1p2/', phonemes = PHONEMES, phonemes_dict = PHONEMES_DICT,
                        context = config['context'], partition = ['dev-clean'])  

# Create a dataset object using the AudioTestDataset class for the test data 
test_data = AudioTestDataset(root = '/content/data/11-785-s23-hw1p2/', context = config['context'], 
                             partition = 'test-clean')

In [None]:
# Define dataloaders for train, val and test datasets
# Dataloaders will yield a batch of frames and phonemes of given batch_size at every iteration
# We shuffle train dataloader but not val & test dataloader. Why?

train_loader = torch.utils.data.DataLoader(
    dataset     = train_data, 
    num_workers = 6,
    batch_size  = config['batch_size'], 
    pin_memory  = True,
    shuffle     = True
)

val_loader = torch.utils.data.DataLoader(
    dataset     = val_data, 
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False
)

test_loader = torch.utils.data.DataLoader(
    dataset     = test_data, 
    num_workers = 2, 
    batch_size  = config['batch_size'], 
    pin_memory  = True, 
    shuffle     = False
)


print("Batch size     : ", config['batch_size'])
print("Context        : ", config['context'])
print("Input size     : ", (2*config['context']+1)*27)
print("Output symbols : ", len(PHONEMES))

print("Train dataset samples = {}, batches = {}".format(train_data.__len__(), len(train_loader)))
print("Validation dataset samples = {}, batches = {}".format(val_data.__len__(), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(test_data.__len__(), len(test_loader)))

# Testing code to check if your data loaders are working
for i, data in enumerate(train_loader):
    frames, phoneme = data
    print(frames.shape, phoneme.shape)
    break

Batch size     :  2048
Context        :  20
Input size     :  1107
Output symbols :  40
Train dataset samples = 199550088, batches = 97437
Validation dataset samples = 1928204, batches = 942
Test dataset samples = 1934138, batches = 945
torch.Size([2048, 1107]) torch.Size([2048])
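
The numbers printed above follow directly from the configuration. As a quick check (the helper `num_batches` is illustrative, and it assumes the DataLoader default of keeping the last partial batch):

```python
import math

context, num_features, batch_size = 20, 27, 2048

# Each sample is the centre frame plus `context` frames on each side, flattened
input_size = (2 * context + 1) * num_features

def num_batches(num_samples, batch_size):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

print("Input size        :", input_size)
print("Train batches     :", num_batches(199550088, batch_size))
print("Validation batches:", num_batches(1928204, batch_size))
print("Test batches      :", num_batches(1934138, batch_size))
```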


# Network Architecture


This section defines your network architecture for the homework. We have given you a sample architecture that can easily clear the very low cutoff for the early submission deadline.

In [None]:
def init_weights(m):
    if isinstance(m, torch.nn.Linear):
        torch.nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        m.bias.data.fill_(0.01)

In [None]:
# Custom architecture: eight hidden blocks of Linear -> BatchNorm1d -> GELU (Dropout after all but the last),
# followed by a linear output layer over the phoneme classes
class Network(torch.nn.Module):

    def __init__(self, input_size, output_size, dropout):

        super(Network, self).__init__()

        self.model = torch.nn.Sequential(
            torch.nn.Linear(input_size, input_size*2),
            torch.nn.BatchNorm1d(input_size*2),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout),
             
            torch.nn.Linear(input_size*2, 2048), 
            torch.nn.BatchNorm1d(2048),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout),

            torch.nn.Linear(2048, 2048),
            torch.nn.BatchNorm1d(2048),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout),

            torch.nn.Linear(2048, 2048),
            torch.nn.BatchNorm1d(2048),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout),

            torch.nn.Linear(2048, 1024),
            torch.nn.BatchNorm1d(1024),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout),
            
            torch.nn.Linear(1024, 1024),
            torch.nn.BatchNorm1d(1024),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout),

            torch.nn.Linear(1024, 512),
            torch.nn.BatchNorm1d(512),
            torch.nn.GELU(),
            torch.nn.Dropout(dropout/2),

            torch.nn.Linear(512, 512),
            torch.nn.BatchNorm1d(512),
            torch.nn.GELU(),
        
            torch.nn.Linear(512, output_size)
        )      

    def forward(self, x):
        out = self.model(x)

        return out
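
The parameter count of this network can be estimated by hand before building it. The sketch below assumes the configuration used in this notebook (context 20, so an input size of 1107, and 40 output classes) and counts weights plus biases for each `Linear` layer and the scale and shift of each `BatchNorm1d` layer; the helper names are illustrative.

```python
# Layer widths: input, then the eight hidden layers
layer_sizes = [1107, 2214, 2048, 2048, 2048, 1024, 1024, 512, 512]
num_classes = 40

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n):
    """Learnable scale and shift of BatchNorm1d (running statistics are not parameters)."""
    return 2 * n

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    total += linear_params(n_in, n_out) + batchnorm_params(n_out)
total += linear_params(layer_sizes[-1], num_classes)  # output layer, no BatchNorm

print(f"First Linear layer: {linear_params(1107, 2214):,}")
print(f"Total parameters  : {total:,}")
```

GELU and Dropout add no parameters, so they do not appear in the count.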

In [None]:
INPUT_SIZE  = (2*config['context'] + 1) * 27 # Why is this the case?
model       = Network(INPUT_SIZE, len(train_data.phonemes), config['dropout']).to(device)
summary(model, frames.to(device))
# Check number of parameters of your network
# Remember, you are limited to 20 million parameters for HW1 (including ensembles)

                         Kernel Shape  Output Shape     Params  Mult-Adds
Layer                                                                    
0_model.Linear_0         [1107, 2214]  [2048, 2214]  2.453112M  2.450898M
1_model.BatchNorm1d_1          [2214]  [2048, 2214]     4.428k     2.214k
2_model.GELU_2                      -  [2048, 2214]          -          -
3_model.Dropout_3                   -  [2048, 2214]          -          -
4_model.Linear_4         [2214, 2048]  [2048, 2048]   4.53632M  4.534272M
5_model.BatchNorm1d_5          [2048]  [2048, 2048]     4.096k     2.048k
6_model.GELU_6                      -  [2048, 2048]          -          -
7_model.Dropout_7                   -  [2048, 2048]          -          -
8_model.Linear_8         [2048, 2048]  [2048, 2048]  4.196352M  4.194304M
9_model.BatchNorm1d_9          [2048]  [2048, 2048]     4.096k     2.048k
10_model.GELU_10                    -  [2048, 2048]          -          -
11_model.Dropout_11                 - 



In [None]:
# Initializing weights
model.apply(init_weights)

Network(
  (model): Sequential(
    (0): Linear(in_features=1107, out_features=2214, bias=True)
    (1): BatchNorm1d(2214, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): GELU(approximate='none')
    (3): Dropout(p=0.2, inplace=False)
    (4): Linear(in_features=2214, out_features=2048, bias=True)
    (5): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): GELU(approximate='none')
    (7): Dropout(p=0.2, inplace=False)
    (8): Linear(in_features=2048, out_features=2048, bias=True)
    (9): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): GELU(approximate='none')
    (11): Dropout(p=0.2, inplace=False)
    (12): Linear(in_features=2048, out_features=2048, bias=True)
    (13): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (14): GELU(approximate='none')
    (15): Dropout(p=0.2, inplace=False)
    (16): Linear(in_features=2048, out_features=10

# Define Model, Loss Function and Optimizer

Here we define the model, loss function, optimizer and optionally a learning rate scheduler. 

In [None]:
############################ Loss Criterion ############################
criterion = torch.nn.CrossEntropyLoss() # We use CE because the task is multi-class classification 
############################ Loss Criterion ############################


############################### Optimizer ##############################
optimizer = torch.optim.AdamW(model.parameters(), lr = config['learning_rate'], weight_decay = config['learning_rate']*10)
############################### Optimizer ##############################


############################### Scheduler ##############################
#scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda = lambda epoch: 0.98**epoch, verbose=True)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor = 0.5, patience = 2, threshold=0.25, verbose = True, min_lr = 1e-8)
############################### Scheduler ##############################

# Is your training time very high? 
# Look into mixed precision training if your GPU (Tesla T4, V100, etc) can make use of it 
# Refer - https://pytorch.org/docs/stable/notes/amp_examples.html
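
To see how the `ReduceLROnPlateau` settings above interact, the sketch below re-implements the scheduler's comparison rule in plain Python. It is a simplified illustration, not PyTorch's code: it assumes the default relative threshold mode, ignores cooldown and the `eps` check, and uses made-up validation accuracies expressed as fractions. Which metric is actually passed to `scheduler.step` is not shown in this part of the notebook.

```python
def simulate_plateau(metrics, lr=1e-3, factor=0.5, patience=2, threshold=0.25, min_lr=1e-8):
    """Return the learning rate after each epoch for mode='max' with a relative threshold."""
    best = float("-inf")
    bad_epochs = 0
    history = []
    for metric in metrics:
        # In 'max' mode an epoch counts as an improvement only if it beats best * (1 + threshold)
        if metric > best * (1.0 + threshold):
            best = metric
            bad_epochs = 0
        else:
            bad_epochs += 1
        # The rate is reduced once more than `patience` epochs pass without improvement
        if bad_epochs > patience:
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history

accuracies = [0.50, 0.60, 0.70, 0.74, 0.76, 0.78, 0.79]  # hypothetical values
lrs = simulate_plateau(accuracies)
print(lrs)
```

Under these assumptions, a relative threshold of 0.25 means that an accuracy of 0.70 must rise above 0.875 to count as an improvement, so the learning rate is halved on a fairly regular schedule even while accuracy is still climbing slowly.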

# Training and Validation Functions

This section covers the training and validation functions run at each epoch of an experiment with a given model architecture. The code has been provided to you, but we recommend going through the comments to understand the workflow so that you can write these loops for future HWs.

In [None]:
torch.cuda.empty_cache()
gc.collect()

### Defining scaler
scaler = torch.cuda.amp.GradScaler()

In [None]:
def train(model, dataloader, optimizer, criterion):

    model.train()
    tloss, tacc = 0, 0 # Monitoring loss and accuracy
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train')
    
    for i, (frames, phonemes) in enumerate(dataloader):
        
        ### Initialize Gradients
        optimizer.zero_grad()

        ### Move Data to Device (Ideally GPU)
        frames      = frames.to(device)
        phonemes    = phonemes.to(device)

        with torch.autocast(device_type='cuda', dtype = torch.float16):
            ### Forward Propagation
            logits  = model(frames)
            ### Loss Calculation
            loss    = criterion(logits, phonemes)

        ### Backward Propagation
        #loss.backward() 
        scaler.scale(loss).backward()
        
        ### Gradient Descent
        scaler.step(optimizer)       

        tloss   += loss.item()
        tacc    += torch.sum(torch.argmax(logits, dim= 1) == phonemes).item()/logits.shape[0]

        batch_bar.set_postfix(loss="{:.04f}".format(float(tloss / (i + 1))), 
                              acc="{:.04f}%".format(float(tacc*100 / (i + 1))))
        batch_bar.update()
        scaler.update()

        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()
  
    batch_bar.close()
    tloss   /= len(dataloader)
    tacc    /= len(dataloader)

    return tloss, tacc
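
The mixed-precision pattern inside `train` can be isolated as a single step. The sketch below assumes a toy `Linear` model with 27 input features and an arbitrary 4 output classes. The names `amp_train_step` and `toy_*` are hypothetical. Autocasting and gradient scaling are switched off when no GPU is present, so the same code also runs on CPU.

```python
import torch

def amp_train_step(model, optimizer, criterion, scaler, frames, labels, use_amp):
    """Run one optimizer step, autocasting the forward pass when use_amp is True."""
    optimizer.zero_grad()
    device_type = 'cuda' if frames.is_cuda else 'cpu'
    dtype = torch.float16 if device_type == 'cuda' else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=dtype, enabled=use_amp):
        logits = model(frames)
        loss = criterion(logits, labels)
    # Scale the loss so small float16 gradients do not underflow.
    scaler.scale(loss).backward()
    # step() unscales the gradients and skips the update if they contain inf/NaN.
    scaler.step(optimizer)
    # update() adjusts the scale factor for the next iteration.
    scaler.update()
    return loss.item()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
use_amp = device == 'cuda'  # assumption: only use float16 autocast on a GPU

torch.manual_seed(0)
toy_model = torch.nn.Linear(27, 4).to(device)  # 4 classes is an arbitrary choice
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-3)
toy_criterion = torch.nn.CrossEntropyLoss()
toy_scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

toy_frames = torch.randn(8, 27, device=device)
toy_labels = torch.randint(0, 4, (8,), device=device)
weights_before = toy_model.weight.detach().clone()

step_loss = amp_train_step(toy_model, toy_optimizer, toy_criterion,
                           toy_scaler, toy_frames, toy_labels, use_amp)
print(f"loss after one step: {step_loss:.4f}")
```

When the scaler is disabled, `scaler.scale(loss)` returns the loss unchanged and `scaler.step(optimizer)` simply calls `optimizer.step()`, which keeps the loop body identical for both precisions.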

In [None]:
def eval(model, dataloader):

    model.eval() # set model in evaluation mode
    vloss, vacc = 0, 0 # Monitoring loss and accuracy
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, position=0, leave=False, desc='Val')

    for i, (frames, phonemes) in enumerate(dataloader):

        ### Move data to device (ideally GPU)
        frames      = frames.to(device)
        phonemes    = phonemes.to(device)

        # makes sure that there are no gradients computed as we are not training the model now
        with torch.inference_mode(): 
            ### Forward Propagation
            logits  = model(frames)
            ### Loss Calculation
            loss    = criterion(logits, phonemes)

        vloss   += loss.item()
        vacc    += torch.sum(torch.argmax(logits, dim= 1) == phonemes).item()/logits.shape[0]
        
        # Do you think we need loss.backward() and optimizer.step() here?

        batch_bar.set_postfix(loss="{:.04f}".format(float(vloss / (i + 1))), 
                              acc="{:.04f}%".format(float(vacc*100 / (i + 1))))
        batch_bar.update()
    
        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()

    batch_bar.close()
    vloss   /= len(dataloader)
    vacc    /= len(dataloader)

    return vloss, vacc
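
Both loops report accuracy as the mean of per-batch accuracies. That equals the per-sample accuracy only when every batch has the same size. A smaller final batch gets the same weight as a full one. The sketch below contrasts the two on made-up logits. The helper names and the data are hypothetical.

```python
import torch

def mean_of_batch_accuracies(batches):
    """Average the accuracy of each batch, as the train/eval loops above do."""
    total = 0.0
    for logits, labels in batches:
        correct = (torch.argmax(logits, dim=1) == labels).sum().item()
        total += correct / logits.shape[0]
    return total / len(batches)

def per_sample_accuracy(batches):
    """Count correct predictions over all samples, then divide once."""
    correct, seen = 0, 0
    for logits, labels in batches:
        correct += (torch.argmax(logits, dim=1) == labels).sum().item()
        seen += labels.shape[0]
    return correct / seen

# Batch 1: four samples, all predicted correctly (logits are one-hot labels).
labels_a = torch.tensor([0, 1, 2, 1])
logits_a = torch.nn.functional.one_hot(labels_a, num_classes=3).float()

# Batch 2: a single sample that is predicted incorrectly (argmax is class 0).
labels_b = torch.tensor([2])
logits_b = torch.tensor([[1.0, 0.0, 0.0]])

toy_batches = [(logits_a, labels_a), (logits_b, labels_b)]
print(mean_of_batch_accuracies(toy_batches))  # 0.5
print(per_sample_accuracy(toy_batches))       # 0.8
```

With tens of thousands of batches per epoch the gap is negligible, so the simpler per-batch average is a reasonable choice for monitoring.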

# Weights and Biases Setup

This section enables logging metrics and files with Weights and Biases. Please refer to the wandb documentation and recitation 0, which cover the use of Weights and Biases for logging, hyperparameter tuning, and monitoring your runs for your homeworks. Using this tool makes it very easy to show results when submitting your code and models for homeworks, and it is also extremely useful for study groups to organize and run ablations under a single team in wandb.

We have written code for you to make use of it out of the box, so that you start using wandb for all your HWs from the beginning.

In [None]:
wandb.login(key="<your-api-key>") # Your API key is in your wandb account, under settings (wandb.ai/settings); avoid committing it to a shared notebook

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: sethilakshay13 (highcutoff). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
# Create your wandb run
run = wandb.init(
    name    = "Wunderwaffe_Model", ### Wandb creates random run names if you skip this field, we recommend you give useful names
    reinit  = True, ### Allows reinitializing runs when you re-run this cell
    #id     = "y28t31uz", ### Insert specific run id here if you want to resume a previous run
    #resume = "must", ### You need this to resume previous runs, but comment out reinit = True when using this
    project = "Lakshay", ### Project should be created in your wandb account 
    config  = config ### Wandb Config for your run
)

In [None]:
### Save your model architecture as a string with str(model)
model_arch  = str(model)

### Save it in a text file; the context manager closes the file automatically
with open("Wunderwaffe_Model", "w") as arch_file:
    arch_file.write(model_arch)

### log it in your wandb run with wandb.save()
wandb.save('Wunderwaffe_Model')

['/content/wandb/run-20230216_100433-8dfxovgb/files/Wunderwaffe_Model']

# Experiment

Now, it is time to finally run your ablations! Have fun!

In [None]:
# Iterate over number of epochs to train and evaluate your model
torch.cuda.empty_cache()
gc.collect()
wandb.watch(model, log="all")

#val_acc_prev = 0

# Training till convergence
for epoch in range(config['epochs']):

    print("\nEpoch {}/{}".format(epoch+1, config['epochs']))

    curr_lr                 = float(optimizer.param_groups[0]['lr'])
    train_loss, train_acc   = train(model, train_loader, optimizer, criterion)
    val_loss, val_acc       = eval(model, val_loader)
    scheduler.step(val_acc)

    print("\tTrain Acc {:.04f}%\tTrain Loss {:.04f}\t Learning Rate {:.07f}".format(train_acc*100, train_loss, curr_lr))
    print("\tVal Acc {:.04f}%\tVal Loss {:.04f}".format(val_acc*100, val_loss))

    ### Log metrics at each epoch in your run 
    # Optionally, you can log at each batch inside train/eval functions 
    # (explore wandb documentation/wandb recitation)
    wandb.log({'train_acc': train_acc*100, 'train_loss': train_loss, 
               'val_acc': val_acc*100, 'val_loss': val_loss, 'lr': curr_lr})
    
    ### Highly Recommended: Save checkpoint in drive and/or wandb if accuracy is better than your current best
    torch.save({'epoch': (epoch+12), 
                'model_state_dict': model.state_dict(), 
                'optimizer_state_dict': optimizer.state_dict(),
                'train_loss': train_loss,
                'train_acc': train_acc,
                'val_loss': val_loss,
                'val_acc': val_acc},
               ('/content/Checkpoints/LakshayModel_CheckPoint_Epoch_'))
    
    ### Convergence check: break if model has converged
    # if (abs(val_acc - val_acc_prev)) < config['convergence']:
    #     print("\n")
    #     print("################################################")
    #     print("Breaking due to convergence at epoch {}".format(epoch))
    #     print("################################################")
    #     print("\n")
    #     break

    # val_acc_prev = val_acc


Epoch 1/38


Train:   0%|          | 0/97437 [00:00<?, ?it/s]
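
The loop above writes a checkpoint at every epoch. To follow the recommendation of saving only when validation accuracy improves, one option is a small helper like the sketch below. `save_if_best`, the toy model, the temporary path, and the accuracy values are all hypothetical and are not part of the original run.

```python
import os
import tempfile
import torch

def save_if_best(model, optimizer, epoch, val_acc, best_acc, path):
    """Save a checkpoint only when val_acc beats best_acc; return the new best."""
    if val_acc <= best_acc:
        return best_acc  # no improvement, keep the existing checkpoint
    torch.save({'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'val_acc': val_acc},
               path)
    return val_acc

# Toy stand-ins so the sketch runs on its own.
ckpt_model = torch.nn.Linear(27, 4)
ckpt_optimizer = torch.optim.AdamW(ckpt_model.parameters(), lr=1e-3)
ckpt_path = os.path.join(tempfile.mkdtemp(), 'best_checkpoint.pth')

best_acc = 0.0
for epoch, val_acc in enumerate([0.61, 0.70, 0.68]):  # made-up accuracies
    best_acc = save_if_best(ckpt_model, ckpt_optimizer, epoch,
                            val_acc, best_acc, ckpt_path)

# Only the checkpoint from the best epoch (index 1) remains on disk.
checkpoint = torch.load(ckpt_path)
print(checkpoint['epoch'], checkpoint['val_acc'])
```

Inside the training loop, `best_acc` would be initialized once before the first epoch and updated with the returned value after each call.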

# Testing and submission to Kaggle

Before we get to the following code, make sure to look at the submission format given in *sample_submission.csv*. Once you have done so, it is time to fill in the following function to complete your inference on the test data. Refer to the eval function from the previous cells for an idea of how to complete this function.

In [None]:
def test(model, test_loader):
    ### What do you call on the model to prepare it for inference?
    model.eval() # TODO train or eval?

    ### List to store predicted phonemes of test data
    test_predictions = []

    ### Which mode do you need to avoid gradients?
    with torch.inference_mode(): # TODO

        for i, mfccs in enumerate(tqdm(test_loader)):

            mfccs   = mfccs.to(device)             
            
            logits  = model(mfccs)

            ### Get most likely predicted phoneme with argmax
            predicted_phonemes = (torch.argmax(logits, dim = 1)).tolist()

            ### How do you store predicted_phonemes with test_predictions? Hint, look at eval 
            test_predictions += (PHONEMES[val] for val in predicted_phonemes)

    return test_predictions

In [None]:
predictions = test(model, test_loader)

  0%|          | 0/945 [00:00<?, ?it/s]

In [None]:
### Create CSV file with predictions
with open("./submission.csv", "w+") as f:
    f.write("id,label\n")
    for i in range(len(predictions)):
        f.write("{},{}\n".format(i, predictions[i]))
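
The mapping from predicted class indices to an `id,label` file can be checked on a tiny example before running the full test set. The sketch below writes to an in-memory buffer with the standard `csv` module. `PHONEME_SUBSET`, `write_submission`, and the predicted indices are hypothetical stand-ins for the notebook's `PHONEMES` list and real model output.

```python
import csv
import io

def write_submission(predictions, handle):
    """Write predictions as 'id,label' rows, one row per frame."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "label"])
    for idx, label in enumerate(predictions):
        writer.writerow([idx, label])

# Hypothetical three-entry label list standing in for the full PHONEMES list.
PHONEME_SUBSET = ['[SIL]', 'AA', 'AE']
predicted_ids = [1, 1, 0, 2]  # made-up argmax outputs

# Map class indices to phoneme strings, as the test function does.
labels = [PHONEME_SUBSET[i] for i in predicted_ids]

buffer = io.StringIO()
write_submission(labels, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `StringIO` buffer produces the same rows on disk.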

In [None]:
### Submit to the Kaggle competition using the Kaggle API

!kaggle competitions submit -c 11-785-s23-hw1p2 -f ./submission.csv -m "Test Submission"

### However, it's always safer to download the csv file and then upload it to Kaggle

100% 19.3M/19.3M [00:02<00:00, 9.16MB/s]
Successfully submitted to Frame-Level Speech Recognition

In [None]:
### Finish your wandb run
run.finish()

Run history:
lr          ▁
train_acc   ▁
train_loss  ▁
val_acc     ▁
valid_loss  ▁

Run summary:
lr          0.001
train_acc   70.63945
train_loss  0.90753
val_acc     69.72801
valid_loss  0.93224
