# HW3P2 - Automatic Speech Recognition

# Instructions to Run the Code

To run the final model corresponding to the highest Kaggle submission, first make sure the **Global Variables** (next section) are set for your purpose, then go to **Runtime** and click **Restart and run all**. This will:
1. pip install, import, and download all required packages and data
2. run all functions and classes for loading data and creating the model, optimizer, and scheduler
3. train the model for 30 epochs using the parameters saved in the `config_king` variable defined under **Parameter Configuration**
4. finetune the model for another 35 epochs, with the learning rate reset to its initial value

**Note** that by default, the notebook is expected to finish running in one click. If you pause the run and want to reload the model from a saved path for finetuning, set **RELOAD** to True in the Global Variables section and set the reload path in the "training setup" section.

**Note** that by default, the notebook trains on the train-clean-360 data. To train on train-clean-100 only, set **USE100** to True in the Global Variables section. Combining the two datasets is not supported in this notebook (and is not necessary for this homework).


**Note** that by default, the notebook runs the trained model on the test dataset and saves the predictions to a CSV file, but does not submit to Kaggle. To submit to Kaggle as well, set **SUBMIT** to True in the Global Variables section.

# Global Variables

In [None]:
REINSTALL = True # whether to reinstall packages and download datasets
CONNECT_DRIVE = False # whether to connect to google drive
USE_BASIC = False # whether to use basic network, only for early submission
SUBMIT = False # whether to submit to kaggle
RELOAD = False # whether to reload a saved model; if True, also set the reload path in the training setup section
USE100 = False # whether to use train-100 data

# README

## Best Score Hyperparameters:
* **init learning rate** = 2e-3
* **finetune init learning rate** = 2e-3
* **beam width** = 30
* **test beam width** = 150
* **encoder embedding batchnorm**: after every Conv1D layer
* **encoder embedding activation**: GELU
* **encoder embedding dropout** = 0.25
* **encoder LSTM dropout** = 0.25
* **encoder pLSTM locked dropout** = 0.2
* **encoder hidden size** = 256
* **decoder batchnorm**: after every linear layer except for the last layer
* **decoder activation**: GELU
* **decoder dropout** = 0.35
* **decoder hidden size** = 2048
* **ASR embed size** = 256x2 (512)
* **weight decay** = 5e-5
* **batch size** = 128
* **epoch** = 30
* **finetune epoch** = 35
* **weight init**: kaiming normal on conv1d and linear layer, constant on batchnorm layer
* **scheduler**: ReduceLROnPlateau <patience = 1, factor = 0.5, threshold = 0.01, mode = min>
* **optimizer**: AdamW




## Data Loading Scheme:
The highest Kaggle score was obtained by training on the train-clean-360 dataset. The class **AudioDataset** can load the 360, 100, or validation partition: all MFCC files are read in sorted order and stored in a list in memory (padding into batches happens later in `collate_fn`). No special memory handling is used.

All MFCC data are normalized using cepstral normalization, and the SOS and EOS tokens are removed from all transcripts. No other data transform is used in the loader.

Within the ASR model, input data from the loader go through frequency masking and time masking, both with mask parameter = 10.
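The two steps above can be sketched as follows. This is a minimal illustration, assuming an utterance is a NumPy array of shape (time, 27). The helper names `cepstral_normalize` and `mask_band` are hypothetical, the `eps` term is an added safeguard that the notebook's loader does not have, and the masking is a plain NumPy stand-in rather than the torchaudio transforms used inside the model.

```python
import numpy as np

def cepstral_normalize(mfcc, eps=1e-8):
    """Per-utterance mean/variance normalization over the time axis."""
    mean = mfcc.mean(axis=0, keepdims=True)
    std = mfcc.std(axis=0, keepdims=True)
    return (mfcc - mean) / (std + eps)  # eps is an assumption, not in the notebook

def mask_band(x, axis, max_width, rng):
    """Zero one random contiguous band of width < max_width along `axis`."""
    width = int(rng.integers(0, max_width))
    start = int(rng.integers(0, x.shape[axis] - width + 1))
    out = x.copy()
    index = [slice(None)] * x.ndim
    index[axis] = slice(start, start + width)
    out[tuple(index)] = 0.0
    return out

rng = np.random.default_rng(0)
mfcc = rng.normal(loc=3.0, scale=2.0, size=(200, 27))  # fake utterance
normed = cepstral_normalize(mfcc)
# frequency masking (axis 1) followed by time masking (axis 0)
augmented = mask_band(normed, axis=1, max_width=10, rng=rng)
augmented = mask_band(augmented, axis=0, max_width=10, rng=rng)
print(augmented.shape)
```

Masking is an augmentation, so it would normally be active only during training and skipped for validation and test.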


## Architectures:
The highest Kaggle score was reached using an ASR model with an encoder and a decoder.

For the encoder, the embedding consists of three blocks, with a 0.25 dropout layer between consecutive blocks. The 1st block contains a conv1d, a batchnorm, and a GELU activation layer, with kernel size 5. The 2nd block is a ResNet block with 256 input channels and 512 output channels; it uses kernel size 3 for its conv1d layers and GELU activation. The 3rd block is a conv1d layer with kernel size 3, 512 input channels, and 512 output channels, followed by a batchnorm layer with no activation. In total, the embedding contains three convolutional blocks and 2 dropout layers.
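Because every convolution in the embedding uses stride 1 and padding `(kernel - 1) // 2`, the time dimension is unchanged, so the utterance lengths from the data loader can still be used for packing afterwards. A small sketch of that arithmetic, using the standard Conv1d output-length formula; the function name `conv1d_out_len` and the frame count are illustrative only.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (standard Conv1d formula)."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

T = 1702  # example number of frames in a padded batch
for kernel in (5, 3, 3):          # kernel sizes used along the embedding
    padding = (kernel - 1) // 2   # same rule as ConvBlock in this notebook
    T = conv1d_out_len(T, kernel, stride=1, padding=padding)
print(T)  # 1702
```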

After the embedding, data are passed into a one-layer biLSTM with 0.25 dropout. The input size is the embed size (512) and the hidden size is the encoder hidden size (256).

Then comes a pBLSTM block consisting of two pBLSTM layers, each followed by a LockedDropout layer with a 0.2 dropout rate. The encoder hidden size (256) is used as the pBLSTM hidden size.
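The `trunc_reshape` method in this notebook halves the time axis by averaging each pair of adjacent frames, which keeps the feature size at twice the hidden size (a concatenating variant would double it instead). A minimal NumPy sketch of that downsampling step, with the hypothetical name `trunc_reshape_mean` and toy shapes:

```python
import numpy as np

def trunc_reshape_mean(x, lens):
    """Halve the time axis of a (batch, time, feat) array by averaging frame pairs."""
    if x.shape[1] % 2 != 0:
        x = x[:, :-1, :]              # drop the last frame when time is odd
    b, t, f = x.shape
    pairs = x.reshape(b, t // 2, 2, f)
    return pairs.mean(axis=2), lens // 2

x = np.arange(2 * 5 * 3, dtype=float).reshape(2, 5, 3)  # toy batch, 5 frames
lens = np.array([5, 4])
y, new_lens = trunc_reshape_mean(x, lens)
print(y.shape, new_lens)  # (2, 2, 3) [2 2]
```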

For the decoder, the architecture is a 2-layer MLP with hidden size 2048, followed by the final linear layer for output. Each non-final linear layer is followed by a batchnorm, a GELU activation, and a 0.35 dropout layer. At the end, LogSoftmax is applied to the final linear output of the decoder.

Other architectures were tested as well, but performed worse:
1. For the encoder, a 3-layer pBLSTM stack was attempted, but the sequences became too short to perform well.
2. For the encoder, a 2-layer biLSTM + 2-layer pBLSTM stack was attempted, but the performance did not improve.
3. For the encoder, simpler embeddings without ResNet were attempted, and the performance was not as good as with the complex embedding.
4. For the decoder, a 1-layer MLP was attempted.
5. For the decoder, a 4-layer MLP with hidden size 1024 per layer was attempted.
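To see how quickly the pBLSTM stack shortens sequences, the sketch below computes the encoder output length for 2 and 3 layers and the minimum number of frames CTC needs for a given label sequence (one frame per label, plus a blank between repeated labels). The function names and the frame count are illustrative. Whether this constraint explains the weaker 3-layer result is an assumption, since the notebook does not confirm it.

```python
def encoder_out_len(num_frames, num_pblstm_layers):
    """Each pBLSTM layer halves the time axis (odd lengths are truncated)."""
    for _ in range(num_pblstm_layers):
        num_frames //= 2
    return num_frames

def ctc_min_frames(target):
    """Minimum number of frames CTC needs to emit the label sequence `target`."""
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats

frames = 1702  # example padded input length
for layers in (2, 3):
    print(layers, encoder_out_len(frames, layers))
# 2 425
# 3 212
print(ctc_min_frames([1, 1, 2]))  # 4
```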


## Epochs
I trained the model for 30 epochs and finetuned for another 35 epochs to reach my highest Kaggle submission. The model was close to converging at epoch 30, so I reset the learning rate and finetuned. At about epoch 30 of finetuning, the model hit the high cutoff. I trained for another 5 epochs, but the model had not fully converged. I believe it has the potential to reach an even lower distance.


## Hyperparameters
* **beam width**: 30. I increased it from 5 to 20 to 30, and found 30 a better width that does not sacrifice too much runtime.

* **test beam width**: 150. I increased it from 50 to 100 to 150, and found that a larger test width gives more accurate results, which is not too surprising.

* **Batch Size**: 128. Given the compute capacity of my GCP instance, 128 is the maximum batch size I could use for this model.

* **Weight Decay**: 5e-5. I started with a higher weight decay based on previous homework, but in an ablation with a smaller value the model converged faster. Hence I lowered the weight decay for this model.

* **Dropout**: on the encoder, I tried locked dropout rates from 0.1 to 0.5, and found 0.2 the best in terms of performance. On the decoder, I used 0.35 based on HW1, and the results were good enough that I did not run an ablation on the rate.
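For intuition about what the beam width controls, it helps to look at the simplest CTC decoding rule first: take the best label per frame, merge consecutive repeats, then drop blanks. Beam search instead keeps several candidate prefixes per step (30 or 150 here). The sketch below is that greedy rule only, with a hypothetical 3-symbol alphabet; it is not the `CTCBeamDecoder` used in the notebook.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(labels[idx])
        prev = idx
    return "".join(out)

labels = ["", "a", "b"]  # hypothetical alphabet, index 0 is the CTC blank
frame_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.7, 0.2],    # a again, merged with the previous frame
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.7, 0.1],    # a, kept because a blank separates it
    [0.1, 0.2, 0.7],    # b
]
print(ctc_greedy_decode(frame_scores, labels))  # aab
```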




## Other Experiments
### Activation Function
I initially selected SiLU (for the decoder) and ReLU6 (for the encoder) as activation functions, based on HW1 and HW2, but GELU seems to work slightly better in this homework. Hence I use GELU in both the encoder and the decoder.

### Batchnorm
I used batchnorm on every layer for encoder embedding and decoder MLP, except for the last layer. I did not attempt removing batchnorm in any layer.

### Weight Initialization
I used Kaiming normal initialization for conv1d and linear layer weights, and constant initialization (weight 1, bias 0) for 1d batchnorm layers. I also tried Kaiming uniform initialization, and the performance was slightly worse.
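As a rough illustration of the scale Kaiming normal initialization produces: with `mode='fan_out'` and the ReLU gain of sqrt(2) (the settings passed in `init_weights` for conv1d), the weights are drawn with standard deviation `gain / sqrt(fan_out)`, where `fan_out = out_channels * kernel_size` for a Conv1d. The helper name below is hypothetical.

```python
import math

def kaiming_normal_std(out_channels, kernel_size, gain=math.sqrt(2.0)):
    """Std of Kaiming-normal init with mode='fan_out' for a Conv1d weight."""
    fan_out = out_channels * kernel_size
    return gain / math.sqrt(fan_out)

# e.g. a conv1d with 512 output channels and kernel size 3
print(round(kaiming_normal_std(512, 3), 4))  # 0.0361
```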

### Scheduler and Learning Rate
I selected ReduceLROnPlateau with patience 1 and factor 0.5, with threshold 0.01 and min mode. I started with patience = 3 and the default threshold, but the model converged very slowly, and the learning rate stayed unchanged even when the model was barely improving. I then increased the threshold from the default 1e-4 to 0.01 and reduced the patience, and the model converged much faster. I also tried the CosineAnnealingWarmRestarts scheduler based on HW2, but it did not perform well in this task.
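The effect of the threshold and patience can be seen in a small simulation. The class below is a simplified re-implementation of the plateau rule (min mode, relative threshold) written for illustration; it omits cooldown and minimum learning rate, and the validation distances are made up.

```python
class PlateauSketch:
    """Simplified plateau rule: mode='min', relative threshold, no cooldown."""
    def __init__(self, lr, patience=1, factor=0.5, threshold=0.01):
        self.lr, self.patience = lr, patience
        self.factor, self.threshold = factor, threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        # an epoch counts as an improvement only if it beats best by the threshold
        if metric < self.best * (1.0 - self.threshold):
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

val_dist = [10.0, 9.0, 8.95, 8.94, 8.0, 7.99, 7.98]  # made-up validation distances
for name, kwargs in [("tuned", dict(patience=1, threshold=0.01)),
                     ("initial", dict(patience=3, threshold=1e-4))]:
    sched = PlateauSketch(lr=2e-3, **kwargs)
    print(name, [sched.step(d) for d in val_dist])
```

With the tuned settings, the tiny gains at 8.95 and 8.94 count as stalled epochs and the rate is halved; with the initial settings, every tiny gain resets the counter and the rate never changes on this sequence.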


### Optimizer
I only switched between AdamW and Adam optimizer, and found the former better with weight decay.



# Installs

In [None]:
if REINSTALL:
  !pip install wandb -q

### Levenshtein

This may take a while

In [None]:
if REINSTALL:
  !pip install wandb --quiet
  !pip install python-Levenshtein -q
  !git clone --recursive https://github.com/parlance/ctcdecode.git
  !pip install wget -q
  %cd ctcdecode
  !pip install . -q
  %cd ..

  !pip install torchsummaryX -q
  #!pip3 install torch==1.13.1 torchvision
  #!pip3 install torchaudio==2.0.1

## Imports

In [None]:
import torch
import random
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torchsummaryX import summary
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from sklearn.decomposition import PCA
from torch.autograd import Variable

import torchaudio.transforms as tat

from sklearn.metrics import accuracy_score
import gc

import zipfile
import pandas as pd
from tqdm import tqdm
import os
import datetime

# imports for decoding and distance calculation
import ctcdecode
import Levenshtein
from ctcdecode import CTCBeamDecoder

import warnings
warnings.filterwarnings('ignore')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

Device:  cuda


# Kaggle Setup

In [None]:
if REINSTALL:
    !pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
    !mkdir /root/.kaggle

    with open("/root/.kaggle/kaggle.json", "w+") as f:
        f.write('{"username":"sharonxin1207","key":"xxx"}') # TODO: Put your kaggle username & key here

    !chmod 600 /root/.kaggle/kaggle.json

In [None]:
if REINSTALL:
    !kaggle competitions download -c 11-785-s23-hw3p2

In [None]:
'''
This will take a couple minutes, but you should see at least the following:
11-785-s23-hw3p2.zip  ctcdecode  hw3p2
'''
if REINSTALL:
    !unzip -q 11-785-s23-hw3p2.zip
    !ls

# Google Drive

In [None]:
if CONNECT_DRIVE:
  from google.colab import drive
  drive.mount('/content/gdrive')

# Dataset and Dataloader

In [None]:
# ARPABET PHONEME MAPPING
# DO NOT CHANGE
# This overwrites the phonetics.py file.

CMUdict_ARPAbet = {
    "" : " ",
    "[SIL]": "-", "NG": "G", "F" : "f", "M" : "m", "AE": "@",
    "R"    : "r", "UW": "u", "N" : "n", "IY": "i", "AW": "W",
    "V"    : "v", "UH": "U", "OW": "o", "AA": "a", "ER": "R",
    "HH"   : "h", "Z" : "z", "K" : "k", "CH": "C", "W" : "w",
    "EY"   : "e", "ZH": "Z", "T" : "t", "EH": "E", "Y" : "y",
    "AH"   : "A", "B" : "b", "P" : "p", "TH": "T", "DH": "D",
    "AO"   : "c", "G" : "g", "L" : "l", "JH": "j", "OY": "O",
    "SH"   : "S", "D" : "d", "AY": "Y", "S" : "s", "IH": "I",
    "[SOS]": "[SOS]", "[EOS]": "[EOS]"
}

CMUdict = list(CMUdict_ARPAbet.keys())
ARPAbet = list(CMUdict_ARPAbet.values())


PHONEMES = CMUdict[:-2]
LABELS = ARPAbet[:-2]

In [None]:
# You might want to play around with the mapping as a sanity check here
"""
for i in ['B', 'IH', 'K', 'SH', 'AA']:
  print(CMUdict_ARPAbet[i])
print(len(LABELS), len(PHONEMES))
"""


### Train Data

In [None]:
class AudioDataset(torch.utils.data.Dataset):

    # For this homework, we give you full flexibility to design your data set class.
    # Hint: The data from HW1 is very similar to this HW

    #TODO
    def __init__(self, root, partition='train-clean-100', transform=["norm"]):
        # Load the directory and all files in them

        self.mfcc_dir = "{}/{}/mfcc".format(root, partition)
        self.transcript_dir = "{}/{}/transcript".format(root, partition)

        self.mfcc_files = sorted(os.listdir(self.mfcc_dir))
        self.transcript_files = sorted(os.listdir(self.transcript_dir))

        self.transform = transform

        #TODO
        # HOW CAN WE REPRESENT PHONEMES? CAN WE CREATE A MAPPING FOR THEM?
        # HINT: TENSORS CANNOT STORE NON-NUMERICAL VALUES OR STRINGS
        self.PHONEMES = PHONEMES
        self.mapping = {}
        for p in range(len(self.PHONEMES)):
          self.mapping[self.PHONEMES[p]] = p

        #TODO
        # CREATE AN ARRAY OF ALL FEATURES AND LABELS
        # WHAT NORMALIZATION TECHNIQUE DID YOU USE IN HW1? CAN WE USE IT HERE?
        self.mfccs, self.transcripts = [], []

        for i in range(len(self.mfcc_files)):
        #   Load a single mfcc
            mfcc        = np.load(self.mfcc_dir+"/"+self.mfcc_files[i])
        #   Do Cepstral Normalization of mfcc (explained in writeup)
            if "norm" in self.transform:
                mfcc        = (mfcc - mfcc.mean(axis=0))/mfcc.std(axis=0)
        #   Load the corresponding transcript
            transcript  = np.load(self.transcript_dir+"/"+self.transcript_files[i])
            start_idx = np.where(transcript=='[SOS]')[0][-1]
            end_idx = np.where(transcript=='[EOS]')[0][0]
            transcript = transcript[start_idx+1:end_idx]
            transcript = np.array([self.mapping[t] for t in transcript])
        #   Append each mfcc to self.mfcc, transcript to self.transcript
            self.mfccs.append(mfcc)
            self.transcripts.append(transcript)

        #TODO
        # WHAT SHOULD THE LENGTH OF THE DATASET BE?
        self.length = len(self.mfccs)

        '''
        You may decide to do this in __getitem__ if you wish.
        However, doing this here will make the __init__ function take the load of
        loading the data, and shift it away from training.
        '''


    def __len__(self):

        return self.length

    def __getitem__(self, ind):
        mfcc = torch.FloatTensor(self.mfccs[ind])
        transcript = torch.tensor(self.transcripts[ind])
        return mfcc, transcript


    def collate_fn(self, batch):
        '''
        TODO:
        1.  Extract the features and labels from 'batch'
        2.  We will additionally need to pad both features and labels,
            look at pytorch's docs for pad_sequence
        3.  This is a good place to perform transforms, if you so wish.
            Performing them on batches will speed the process up a bit.
        4.  Return batch of features, labels, lengths of features,
            and lengths of labels.
        '''
        # batch of input mfcc coefficients
        batch_mfcc = [b[0] for b in batch]
        batch_transcript = [b[1] for b in batch]

        # HINT: CHECK OUT -> pad_sequence (imported above)
        # Also be sure to check the input format (batch_first)
        batch_mfcc_pad = pad_sequence(batch_mfcc, batch_first=True, padding_value=0) # TODO
        lengths_mfcc = [len(mfcc) for mfcc in batch_mfcc]
        batch_transcript_pad = pad_sequence(batch_transcript, batch_first=True, padding_value=0) # TODO
        lengths_transcript = [len(trans) for trans in batch_transcript]

        # You may apply some transformation, Time and Frequency masking, here in the collate function;
        # Food for thought -> Why are we applying the transformation here and not in the __getitem__?
        #                  -> Would we apply transformation on the validation set as well?
        #                  -> Is the order of axes / dimensions as expected for the transform functions?

        # Return the following values: padded features, padded labels, actual length of features, actual length of the labels
        return batch_mfcc_pad, batch_transcript_pad, torch.tensor(lengths_mfcc), torch.tensor(lengths_transcript)



### Test Data

In [None]:
# Test Dataloader
class AudioDatasetTest(torch.utils.data.Dataset):

    # TODO: Create a test dataset class similar to the previous class but you dont have transcripts for this
    # Imp: Read the mfccs in sorted order, do NOT shuffle the data here or in your dataloader.
    def __init__(self, root, partition= "test-clean", transform=["norm"]): # Feel free to add more arguments

        # TODO: MFCC directory - use partition to access train/dev directories from kaggle data using root
        self.mfcc_dir       = "{}/{}/mfcc".format(root, partition)

        # TODO: List files in self.mfcc_dir using os.listdir in sorted order
        mfcc_names          = sorted(os.listdir(self.mfcc_dir))

        self.mfccs = []
        self.transform = transform

        # TODO: Iterate through mfccs and transcripts
        for i in range(len(mfcc_names)):
        #   Load a single mfcc
            mfcc        = np.load(self.mfcc_dir+"/"+mfcc_names[i])
        #   Do Cepstral Normalization of mfcc (explained in writeup)
            if "norm" in self.transform:
                mfcc        = (mfcc - mfcc.mean(axis=0))/mfcc.std(axis=0)
            self.mfccs.append(mfcc)
        #self.mfccs          = np.concatenate(self.mfccs)

        # Length of the dataset is the number of utterances
        self.length = len(self.mfccs)

        # The available phonemes in the transcript are of string data type
        # But the neural network cannot predict strings as such.
        # Hence, we map these phonemes to integers

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # Return the full MFCC sequence for utterance `ind` as a float tensor
        mfcc = torch.FloatTensor(self.mfccs[ind]) # Convert to tensors

        return mfcc

    def collate_fn(self, batch):
        '''
        Pad the MFCC sequences in 'batch' to a common length and return the
        padded batch together with the original (unpadded) lengths.
        There are no labels in the test set.
        '''
        batch_mfcc_pad = pad_sequence(batch, batch_first=True, padding_value=0) # TODO
        lengths_mfcc = [len(mfcc) for mfcc in batch]

        return batch_mfcc_pad, torch.tensor(lengths_mfcc)


### Data - Hyperparameters

In [None]:
BATCH_SIZE = 128 # Increase if your device can handle it

transforms = ["norm"] # set of transformations
# You may pass this as a parameter to the dataset class above
# This will help modularize your implementation

root = "/content/11-785-s23-hw3p2"

### Data loaders

In [None]:
# get me RAMMM!!!!
import gc
gc.collect()


0

In [None]:
# Create objects for the dataset class
if USE100:
  train_data = AudioDataset(root) #TODO
else:
  train_data = AudioDataset(root, 'train-clean-360') #TODO

val_data = AudioDataset(root, 'dev-clean') # TODO : You can either use the same class with some modifications or make a new one :)
test_data = AudioDatasetTest(root) #TODO

# Do NOT forget to pass in the collate function as parameter while creating the dataloader
train_loader = torch.utils.data.DataLoader(
    dataset     = train_data,
    num_workers = 4,
    batch_size  = BATCH_SIZE,
    collate_fn  = train_data.collate_fn,
    pin_memory  = True,
    shuffle     = True
)
val_loader = torch.utils.data.DataLoader(
    dataset     = val_data,
    num_workers = 2,
    batch_size  = BATCH_SIZE,
    collate_fn  = val_data.collate_fn,
    pin_memory  = True,
    shuffle     = False

)
test_loader = torch.utils.data.DataLoader(
    dataset     = test_data,
    num_workers = 2,
    batch_size  = BATCH_SIZE,
    collate_fn  = test_data.collate_fn,
    pin_memory  = True,
    shuffle     = False
)

print("Batch size: ", BATCH_SIZE)
print("Train dataset samples = {}, batches = {}".format(train_data.__len__(), len(train_loader)))
print("Val dataset samples = {}, batches = {}".format(val_data.__len__(), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(test_data.__len__(), len(test_loader)))

Batch size:  128
Train dataset samples = 104013, batches = 813
Val dataset samples = 2703, batches = 22
Test dataset samples = 2620, batches = 21


In [None]:
# sanity check
for data in train_loader:
    x, y, lx, ly = data
    print(x.shape, y.shape, lx.shape, ly.shape)
    break

torch.Size([128, 1702, 27]) torch.Size([128, 208]) torch.Size([128]) torch.Size([128])


# NETWORK

## Basic

This is a basic block for understanding, you can skip this and move to pBLSTM one

In [None]:
OUT_SIZE = len(LABELS)
class Network(nn.Module):

    def __init__(self, input_size, output_size, embed_size=256):

        super(Network, self).__init__()

        # Adding some sort of embedding layer or feature extractor might help performance.
        #self.embedding = torch.nn.Embedding(input_size, embed_size) # 27
        self.conv1 = torch.nn.Conv1d(input_size, 128, kernel_size=3, stride=1, padding=1,bias=False)
        self.batch1 = torch.nn.BatchNorm1d(128)
        self.conv2 = torch.nn.Conv1d(128, 256, kernel_size=3, stride=1, padding=1,bias=False)
        self.batch2 = torch.nn.BatchNorm1d(256)
        self.conv3 = torch.nn.Conv1d(256, 512, kernel_size=3, stride=1, padding=1,bias=False)
        self.batch3 = torch.nn.BatchNorm1d(512)
        self.relu = nn.ReLU(inplace=True)
        # TODO : look up the documentation. You might need to pass some additional parameters.
        hidden_size = 256
        # input size must match the 512 channels produced by conv3/batch3
        self.lstm = nn.LSTM(512, hidden_size = hidden_size, num_layers = 1, batch_first=True, bidirectional=True)

        self.classification = nn.Sequential(
            torch.nn.Linear(hidden_size*2, hidden_size*2),
            torch.nn.ReLU(inplace=True),
            torch.nn.Linear(hidden_size*2, OUT_SIZE)
            #TODO: Linear layer with in_features from the lstm module above and out_features = OUT_SIZE
        )


        self.logSoftmax = torch.nn.LogSoftmax(dim=2)
        #TODO: Apply a log softmax here. Which dimension would apply it on ?

    def forward(self, x, lx):
        #x = self.embedding(x)
        embed_x = self.conv1(x.transpose(1,2))
        embed_x = self.relu(self.batch1(embed_x))
        embed_x = self.conv2(embed_x)
        embed_x = self.relu(self.batch2(embed_x))
        embed_x = self.conv3(embed_x)
        embed_x = self.batch3(embed_x)
        embed_x = embed_x.transpose(1,2)  # back to (batch, time, features) for the LSTM
        x_combined = pack_padded_sequence(embed_x, lx, batch_first=True, enforce_sorted=False)
        out_combined, _ = self.lstm(x_combined)
        out_x, out_lens = pad_packed_sequence(out_combined, batch_first=True)
        out = self.logSoftmax(self.classification(out_x))
        # The forward function takes 2 parameter inputs here. Why?
        # Refer to the handout for hints
        return out, out_lens

## Pyramid Bi-LSTM (pBLSTM)

In [None]:
# Utils Classes and Functions

class PermuteBlock(torch.nn.Module):
    def forward(self, x):
        return x.transpose(1, 2)

# referring https://github.com/salesforce/awd-lstm-lm/blob/dfd3cb0235d2caf2847a4d53e1cbd495b781b5d2/locked_dropout.py#L5
class LockedDropout(nn.Module):
    def __init__(self, dropout=0.5):
        super().__init__()
        self.dropout = dropout

    def forward(self, xy):
        x, y = xy
        if not self.training or not self.dropout:
            return pack_padded_sequence(x, y, batch_first=True, enforce_sorted=False)
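        # NOTE: x is batch-first (B, T, F) here, so this (1, T, F) mask is shared across the batch;
        # the referenced implementation expects (T, B, F) and shares the mask across time steps.
        m = x.data.new(1, x.size(1), x.size(2)).bernoulli_(1 - self.dropout)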
        mask = Variable(m, requires_grad=False) / (1 - self.dropout)
        #mask = mask.to(device)
        mask = mask.expand_as(x)
        masked_x = pack_padded_sequence(mask * x, y, batch_first=True, enforce_sorted=False)
        masked_x = masked_x.to(device)
        return masked_x


def init_weights(m):
    if isinstance(m, torch.nn.Conv1d):
        torch.nn.init.kaiming_normal_(m.weight.data, mode='fan_out', nonlinearity='relu')
    elif isinstance(m, torch.nn.BatchNorm1d):
        torch.nn.init.constant_(m.weight.data, 1)
        torch.nn.init.constant_(m.bias.data, 0)
    elif isinstance(m, torch.nn.Linear):
        torch.nn.init.normal_(m.weight.data, 0, 0.01)
        torch.nn.init.constant_(m.bias.data, 0)

In [None]:
class pBLSTM(torch.nn.Module):

    '''
    Pyramidal BiLSTM
    Read the write up/paper and understand the concepts and then write your implementation here.

    At each step,
    1. Pad your input if it is packed (Unpack it)
    2. Reduce the input length dimension by merging adjacent frames
        (this implementation averages each pair in trunc_reshape rather than concatenating features)
        (Tip: Write down the shapes and understand)
        (i) How should  you deal with odd/even length input?
        (ii) How should you deal with input length array (x_lens) after truncating the input?
    3. Pack your input
    4. Pass it into LSTM layer

    To make our implementation modular, we pass 1 layer at a time.
    '''

    def __init__(self, input_size, hidden_size, batch_first=False):
        super(pBLSTM, self).__init__()

        self.batch_first = batch_first

        self.blstm = torch.nn.LSTM(input_size, hidden_size, num_layers=1,
                                   batch_first=self.batch_first, bidirectional=True)
        self.permute = PermuteBlock()
        # TODO: Initialize a single layer bidirectional LSTM with the given input_size and hidden_size

    def forward(self, x_packed): # x_packed is a PackedSequence
        # TODO: Pad Packed Sequence
        # Call self.trunc_reshape() which downsamples the time steps of x and increases the feature dimensions as mentioned above
        # self.trunc_reshape will return 2 outputs. What are they? Think about what quantites are changing.
        # TODO: Pack Padded Sequence. What output(s) would you get?
        # TODO: Pass the sequence through bLSTM
        x_pad, x_pad_lens = pad_packed_sequence(x_packed, batch_first=self.batch_first)
        x_trunc, x_trunc_lens = self.trunc_reshape(x_pad, x_pad_lens)
        x_packed_pad_trunc = pack_padded_sequence(x_trunc, x_trunc_lens, batch_first=self.batch_first, enforce_sorted=False)
        output, _ = self.blstm(x_packed_pad_trunc)  # second return value is the (h_n, c_n) state tuple, not lengths
        #output = output.to(device)
        # What do you return?
        output_pad, output_pad_lens = pad_packed_sequence(output, batch_first=self.batch_first)
        # print("finishing padding")
        # output_pad = output_pad.to(device)
        return output_pad, output_pad_lens

    def trunc_reshape(self, x, x_lens):
        # x = batch, seq_len, features
        # TODO: If you have odd number of timesteps, how can you handle it? (Hint: You can exclude them)
        if x.shape[1] % 2!=0:
            x = x[:, :-1, :]
        x_down = x.contiguous().view(x.shape[0], x.shape[1] // 2, 2, x.shape[2])
        x_down = torch.mean(x_down, 2)
        # TODO: Reduce lengths by the same downsampling factor
        x_lens = x_lens // 2
        return x_down, x_lens

# Encoder

### Building Blocks

In [None]:

class ConvBlock(torch.nn.Sequential):
    def __init__(self, in_chan, out_chan, kernel, stride, padding=-1, groups=1, bias=False):
        if padding < 0:
          padding = (kernel-1)//2
        super(ConvBlock, self).__init__(
            torch.nn.Conv1d(in_channels=in_chan, out_channels=out_chan,
                            kernel_size=kernel, stride=stride, padding=padding,
                            groups=groups, bias=bias),
            torch.nn.BatchNorm1d(out_chan),
            #torch.nn.ReLU6(inplace=True)
            torch.nn.GELU()
        )

class ResNetBlock(torch.nn.Module):
    def __init__(self, in_chan, out_chan, stride=1):
        super(ResNetBlock, self).__init__()
        self.stride = stride
        layer1 = ConvBlock(in_chan, out_chan, 3, self.stride)
        layer2 = torch.nn.Sequential(
            torch.nn.Conv1d(out_chan, out_chan, kernel_size=3, stride=1, padding=1, bias=False),
            torch.nn.BatchNorm1d(out_chan)
        )
        self.layerflat = torch.nn.Sequential(
            # shortcut uses the same stride as layer1 so both branches keep matching lengths
            torch.nn.Conv1d(in_chan, out_chan, kernel_size=1, stride=self.stride, bias=False),
            torch.nn.BatchNorm1d(out_chan)
        )


        self.basicnet = torch.nn.Sequential(*[layer1, layer2])
        #self.activate = torch.nn.ReLU6(inplace=True)
        self.activate = torch.nn.GELU()

    def forward(self, x):
        out = self.basicnet(x)
        out += self.layerflat(x)
        # the shortcut is always a 1x1 conv projection, since in_chan and out_chan may differ
        out = self.activate(out)
        return out



### Encoder using ResNet

In [None]:
class ComplexEncoder(torch.nn.Module):
    '''
    The Encoder takes utterances as inputs and returns latent feature representations
    '''
    def __init__(self, input_size, embed_size, encoder_hidden_size=256):
        super(ComplexEncoder, self).__init__()
        # construct layer for CNN embedding

        layer1 = ConvBlock(input_size, 256, 5, 1)
        layer2 = ResNetBlock(256, 512, 1)
        layer3 = torch.nn.Sequential(
            torch.nn.Conv1d(512, embed_size, kernel_size=3, stride=1, padding=1, bias=False),
            torch.nn.BatchNorm1d(embed_size)
        )
        layer_dropout = torch.nn.Dropout(0.25)

        #TODO: You can use CNNs as Embedding layer to extract features. Keep in mind the Input dimensions and expected dimension of Pytorch CNN.
        self.embedding = torch.nn.Sequential(
            *[layer1, layer_dropout, layer2, layer_dropout, layer3])


        self.permute = PermuteBlock()

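        # NOTE: nn.LSTM applies dropout only between stacked layers, so dropout=0.25 has no effect with num_layers=1
        self.lstm = torch.nn.LSTM(embed_size, encoder_hidden_size, num_layers=1, dropout=0.25, batch_first=True, bidirectional=True)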
        self.pBLSTMs = torch.nn.Sequential(
            #pBLSTM(encoder_hidden_size*2, encoder_hidden_size, True, True),
            #LockedDropout(0.25),
            pBLSTM(encoder_hidden_size*2, encoder_hidden_size, True),
            LockedDropout(0.2),
            pBLSTM(encoder_hidden_size*2, encoder_hidden_size, True),
            LockedDropout(0.2)
            # How many pBLSTMs are required?
            # TODO: Fill this up with pBLSTMs - What should the input_size be?
            # Hint: You are downsampling timesteps by a factor of 2, upsampling features by a factor of 2 and the LSTM is bidirectional)
            # Optional: Dropout/Locked Dropout after each pBLSTM (Not needed for early submission) ...
            # ...
        )


    def forward(self, x, x_lens):
        # note: the input x should already be permuted to (batch, features, time) for Conv1d
        embed_x_t = self.embedding(x)
        embed_x = self.permute(embed_x_t)
        x_combined = pack_padded_sequence(embed_x, x_lens, batch_first=True, enforce_sorted=False)

        lstm_output, _ = self.lstm(x_combined)
        lstm_x, lstm_lens = pad_packed_sequence(lstm_output, batch_first=True)
        x_combined = pack_padded_sequence(lstm_x, lstm_lens, batch_first=True, enforce_sorted=False)

        encoder_outputs = self.pBLSTMs(x_combined)
        encoder_outputs, encoder_lens = pad_packed_sequence(encoder_outputs, batch_first=True)
        # Where are x and x_lens coming from? The dataloader
        # TODO: Call the embedding layer
        # TODO: Pack Padded Sequence
        # TODO: Pass Sequence through the pyramidal Bi-LSTM layer
        # TODO: Pad Packed Sequence

        # Remember the number of output(s) each function returns

        return encoder_outputs, encoder_lens

### Basic Encoder

In [None]:
class Encoder(torch.nn.Module):
    '''
    The Encoder takes utterances as inputs and returns latent feature representations
    '''
    def __init__(self, input_size, encoder_hidden_size):
        super(Encoder, self).__init__()
        # construct layer for CNN embedding
        layer1 = ConvBlock(input_size, 128, 3, 1)
        layer2 = ConvBlock(128, 256, 3, 1)
        layer3 = ConvBlock(256, 256, 3, 1)
        layer4 = torch.nn.Sequential(
            torch.nn.Conv1d(256, 512, kernel_size=3, stride=1, padding=1, bias=False),
            torch.nn.BatchNorm1d(512)
        )
        #TODO: You can use CNNs as Embedding layer to extract features. Keep in mind the Input dimensions and expected dimension of Pytorch CNN.
        self.embedding = torch.nn.Sequential(*[layer1, layer2, layer3, layer4])
        self.embedding_resnet = ResNet18(input_size)

        self.permute = PermuteBlock()

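        # forward() calls self.lstm, so it has to be defined; the bidirectional output size
        # (2 * encoder_hidden_size) matches the input size of the first pBLSTM below
        self.lstm = torch.nn.LSTM(512, encoder_hidden_size, num_layers=2, dropout=0.25, batch_first=True, bidirectional=True)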
        self.pBLSTMs = torch.nn.Sequential(
            pBLSTM(encoder_hidden_size*2, encoder_hidden_size, True),
            LockedDropout(0.25),
            pBLSTM(encoder_hidden_size*2, encoder_hidden_size, True),
            LockedDropout(0.25)
            # How many pBLSTMs are required?
            # TODO: Fill this up with pBLSTMs - What should the input_size be?
            # Hint: You are downsampling timesteps by a factor of 2, upsampling features by a factor of 2 and the LSTM is bidirectional)
            # Optional: Dropout/Locked Dropout after each pBLSTM (Not needed for early submission) ...
            # ...
        )


    def forward(self, x, x_lens):
        # note: x should already be permuted to (batch, features, time), as Conv1d expects
        embed_x_t = self.embedding(x)
        embed_x = self.permute(embed_x_t)
        x_combined = pack_padded_sequence(embed_x, x_lens, batch_first=True, enforce_sorted=False)


        lstm_output, _ = self.lstm(x_combined)
        # unpack and re-pack so the pBLSTM stack receives a fresh PackedSequence
        lstm_x, lstm_lens = pad_packed_sequence(lstm_output, batch_first=True)
        lstm_combined = pack_padded_sequence(lstm_x, lstm_lens, batch_first=True, enforce_sorted=False)

        encoder_outputs = self.pBLSTMs(lstm_combined)
        encoder_outputs, encoder_lens = pad_packed_sequence(encoder_outputs, batch_first=True)


        return encoder_outputs, encoder_lens
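
The encoder returns `encoder_lens` rather than the input lengths because the time axis shrinks inside the pBLSTM stack. The sketch below assumes that each pBLSTM halves the number of time steps and drops a trailing odd frame (the usual pyramidal Bi-LSTM behaviour; the actual `pBLSTM` class is defined earlier in the notebook). `downsampled_length` is a hypothetical helper, not part of the notebook.

```python
def downsampled_length(length: int, num_pblstm: int = 2) -> int:
    """Time steps left after `num_pblstm` pyramidal layers (assumed to halve the length each)."""
    for _ in range(num_pblstm):
        length //= 2  # adjacent frame pairs are merged; an odd last frame is dropped
    return length


# Example: a 1702-frame padded batch (the length shown in the model summary) through two pBLSTMs
print(downsampled_length(1702))  # 1702 -> 851 -> 425
```

This is why the training and validation functions pass the returned lengths (`lh`), not `lx`, to the CTC loss and to the decoder.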

# Decoder

In [None]:
class Decoder(torch.nn.Module):

    def __init__(self, embed_size, output_size= 41):
        super().__init__()

        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(embed_size, 2048),
            PermuteBlock(), torch.nn.BatchNorm1d(2048), PermuteBlock(),
            torch.nn.GELU(),
            torch.nn.Dropout(0.35),

            torch.nn.Linear(2048, 2048),
            PermuteBlock(), torch.nn.BatchNorm1d(2048), PermuteBlock(),
            torch.nn.GELU(),
            torch.nn.Dropout(0.35),
            torch.nn.Linear(2048, output_size),
            #TODO define your MLP arch. Refer HW1P2
            #Use Permute Block before and after BatchNorm1d() to match the size
        )

        self.softmax = torch.nn.LogSoftmax(dim=2)

    def forward(self, encoder_out):
        #TODO call your MLP
        decode_out = self.mlp(encoder_out)
        #TODO Think what should be the final output of the decoder for the classification
        out = self.softmax(decode_out)
        return out
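
The decoder ends in `LogSoftmax` because CTC loss and the beam decoder (configured with `log_probs_input=True`) both consume log-probabilities. As an illustration only, here is a pure-Python sketch of what log-softmax does to one frame of logits; `log_softmax` is a hypothetical stand-in for `torch.nn.LogSoftmax`, and the logits are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


frame = [2.0, 0.5, -1.0]  # made-up logits for three classes
log_probs = log_softmax(frame)
print(log_probs)
print(sum(math.exp(lp) for lp in log_probs))  # probabilities sum to 1, up to rounding
```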

# ASR Model

In [None]:
class ASRModel(torch.nn.Module):

    def __init__(self, input_size, embed_size= 192, output_size= len(PHONEMES)):
        super().__init__()

        self.augmentations  = torch.nn.Sequential(
          tat.FrequencyMasking(freq_mask_param=10),
          tat.TimeMasking(time_mask_param=10),
          PermuteBlock()
            #TODO Add Time Masking/ Frequency Masking

            #Hint: See how to use PermuteBlock() function defined above
        )
        self.encoder        = ComplexEncoder(input_size, embed_size*2)# TODO: Initialize Encoder
        self.decoder        = Decoder(embed_size*2, output_size)# TODO: Initialize Decoder


    def forward(self, x, lengths_x):

        if self.training:
            x = self.augmentations(x)
        else:
            x = x.transpose(1, 2)

        encoder_out, encoder_len   = self.encoder(x, lengths_x)
        decoder_out                 = self.decoder(encoder_out)

        return decoder_out, encoder_len

## INIT Basic
(If trying out the basic Network)

In [None]:
if USE_BASIC:
  model = Network(27, 41).to(device)
  summary(model, x.to(device), lx) # x and lx come from the sanity check above :)

## INIT ASR

In [None]:
#torch.cuda.empty_cache()
model = ASRModel(
    input_size  = 27,
    embed_size  = 256,
    output_size = len(PHONEMES)
).to(device)
print(model)
model.apply(init_weights)
summary(model, x.to(device), lx)

ASRModel(
  (augmentations): Sequential(
    (0): FrequencyMasking()
    (1): TimeMasking()
    (2): PermuteBlock()
  )
  (encoder): ComplexEncoder(
    (embedding): Sequential(
      (0): ConvBlock(
        (0): Conv1d(27, 256, kernel_size=(5,), stride=(1,), padding=(2,), bias=False)
        (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU6(inplace=True)
      )
      (1): Dropout(p=0.25, inplace=False)
      (2): ResNetBlock(
        (layerflat): Sequential(
          (0): Conv1d(256, 512, kernel_size=(1,), stride=(1,), bias=False)
          (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (basicnet): Sequential(
          (0): ConvBlock(
            (0): Conv1d(256, 512, kernel_size=(3,), stride=(1,), padding=(1,), bias=False)
            (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (2): ReLU6(inplace=True)
          )
   

Layer,Kernel Shape,Output Shape,Params,Mult-Adds
0_augmentations.FrequencyMasking_0,-,"[128, 1702, 27]",,
1_augmentations.TimeMasking_1,-,"[128, 1702, 27]",,
2_augmentations.PermuteBlock_2,-,"[128, 27, 1702]",,
3_encoder.embedding.0.Conv1d_0,"[27, 256, 5]","[128, 256, 1702]",34560.0,58821120.0
4_encoder.embedding.0.BatchNorm1d_1,[256],"[128, 256, 1702]",512.0,256.0
5_encoder.embedding.0.ReLU6_2,-,"[128, 256, 1702]",,
6_encoder.embedding.Dropout_1,-,"[128, 256, 1702]",,
7_encoder.embedding.2.basicnet.0.Conv1d_0,"[256, 512, 3]","[128, 512, 1702]",393216.0,669253600.0
8_encoder.embedding.2.basicnet.0.BatchNorm1d_1,[512],"[128, 512, 1702]",1024.0,512.0
9_encoder.embedding.2.basicnet.0.ReLU6_2,-,"[128, 512, 1702]",,


# Training Config

In [None]:
config = {
    "beam_width" : 30,
    "lr" : 2e-3,
    "epochs" : 30,
    "finetune_epochs": 35,
    "w_decay": 5e-5,
    } # Feel free to add more items here

In [None]:
#TODO
criterion = torch.nn.CTCLoss() # Define CTC loss as the criterion. How would the losses be reduced?
# CTC Loss: https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html
# Refer to the handout for hints

optimizer =  torch.optim.AdamW(model.parameters(), lr=config['lr'], weight_decay=config['w_decay']) # What goes in here?

# Declare the decoder. Use the CTC Beam Decoder to decode phonemes
# CTC Beam Decoder Doc: https://github.com/parlance/ctcdecode
decoder = CTCBeamDecoder(labels=LABELS, blank_id=0, beam_width=config['beam_width'], log_probs_input=True)

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=1, threshold=0.01, verbose=True)
#scheduler_cos = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, 20, 1) #(optimizer, gamma=0.6, step_size=1)
# Mixed Precision, if you need it
scaler = torch.cuda.amp.GradScaler()
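
`CTCBeamDecoder` searches over many alignments, but the underlying CTC rule is easiest to see in a greedy version: take the best class per frame, merge consecutive repeats, then drop the blank (index 0, matching `blank_id=0` above). The sketch below is a simplified stand-in, not the beam search the notebook uses; `greedy_ctc_collapse` and the toy label list are hypothetical.

```python
def greedy_ctc_collapse(frame_ids, labels, blank_id=0):
    """Collapse a per-frame argmax path into a string using the CTC rule."""
    out = []
    prev = None
    for idx in frame_ids:
        # emit a symbol only when it differs from the previous frame and is not blank
        if idx != prev and idx != blank_id:
            out.append(labels[idx])
        prev = idx
    return "".join(out)


toy_labels = ["", "a", "b", "c"]    # index 0 is the blank
path = [0, 1, 1, 0, 1, 2, 2, 0, 3]  # made-up per-frame argmax indices
print(greedy_ctc_collapse(path, toy_labels))  # aabc
```

The blank between the two runs of `1` is what lets the output contain a repeated symbol.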

# Decode Prediction

In [None]:
def decode_prediction(output, output_lens, decoder, PHONEME_MAP=LABELS):

    # TODO: look at docs for CTC.decoder and find out what is returned here. Check the shape of output and expected shape in decode.
    (de_output, _, de_times, de_seq_lens) = decoder.decode(output, seq_lens=output_lens) #lengths - list of lengths

    pred_strings                    = []

    for i in range(output_lens.shape[0]):
      if de_seq_lens[i, 0] != 0:
        code = "".join(PHONEME_MAP[p] for p in de_output[i, 0, :de_seq_lens[i, 0]])
      else:
        code = ""
      pred_strings.append(code)
      #TODO: Create the prediction from the output of decoder.decode. Don't forget to map it using PHONEMES_MAP.

    return pred_strings

def calculate_levenshtein(output, label, output_lens, label_lens, decoder, PHONEME_MAP= LABELS): # y - sequence of integers

    dist            = 0
    batch_size      = label.shape[0]

    pred_strings    = decode_prediction(output, output_lens, decoder, PHONEME_MAP)
    #label_strings   = ["".join(PHONEME_MAP[p] for p in l) for l in label]

    for i in range(batch_size):
        # TODO: Get predicted string and label string for each element in the batch
        pred_string = pred_strings[i]
        label_i = list(label[i][:label_lens[i]])
        label_strings = [PHONEME_MAP[label_i[l]] for l in range(len(label_i))]
        label_string = "".join(label_strings)
        dist += Levenshtein.distance(pred_string, label_string)

    dist /= batch_size # TODO: Uncomment this, but think about why we are doing this
    # raise NotImplemented
    return dist
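
`calculate_levenshtein` relies on the `Levenshtein` package and divides by the batch size so that batches of different sizes are comparable. For reference, here is a minimal pure-Python sketch of the same metric, assuming each entry of `LABELS` is a single character so that character edits correspond to phoneme edits. `levenshtein` and `mean_batch_distance` are hypothetical names, and the strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_batch_distance(preds, targets):
    """Average edit distance over a batch of (prediction, target) pairs."""
    return sum(levenshtein(p, t) for p, t in zip(preds, targets)) / len(targets)


print(levenshtein("kitten", "sitting"))                      # 3
print(mean_batch_distance(["abc", "abd"], ["abc", "abcd"]))  # 0.5
```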

# wandb

You will need to fetch your api key from wandb.ai

In [None]:
import wandb
wandb.login(key="<YOUR_WANDB_API_KEY>") # paste your own key here; never commit a real key

[34m[1mwandb[0m: Currently logged in as: [33mwenxinz3[0m ([33msharonxin1207[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
LOG_RUN = True
if LOG_RUN:
  run = wandb.init(
      name = "run-resnet+plstm+lstm", ## Wandb creates random run names if you skip this field
      reinit = True, ### Allows reinitializing runs when you re-run this cell
      # run_id = ### Insert specific run id here if you want to resume a previous run
      # resume = "must" ### You need this to resume previous runs, but comment out reinit = True when using this
      project = "hw3p2-ablations", ### Project should be created in your wandb account
      config = config ### Wandb Config for your run
  )

# Train Functions

In [None]:
from tqdm import tqdm
#torch.backends.cudnn.enabled = True
def train_model(model, train_loader, criterion, optimizer):

    model.train()
    batch_bar = tqdm(total=len(train_loader), dynamic_ncols=True, leave=False, position=0, desc='Train')

    total_loss = 0

    for i, data in enumerate(train_loader):
        optimizer.zero_grad()

        x, y, lx, ly = data
        x, y = x.to(device), y.to(device)

        #with torch.autocast(device_type='cpu'):
        with torch.cuda.amp.autocast():

            h, lh = model(x, lx)
            #print("running model done")
            h = torch.permute(h, (1, 0, 2))#.to(device)
            loss = criterion(h, y, lh, ly)

        total_loss += loss.item()

        batch_bar.set_postfix(
            loss="{:.04f}".format(float(total_loss / (i + 1))),
            lr="{:.06f}".format(float(optimizer.param_groups[0]['lr'])))

        batch_bar.update() # Update tqdm bar

        if device == 'cpu':
           loss.backward()
           optimizer.step()
        # Another couple things you need for FP16.
        else:
          scaler.scale(loss).backward() # This is a replacement for loss.backward()
          scaler.step(optimizer) # This is a replacement for optimizer.step()
          scaler.update() # This is something added just for FP16

        del x, y, lx, ly, h, lh, loss
        torch.cuda.empty_cache()
    print("Finishing Train")
    batch_bar.close() # You need this to close the tqdm bar
    return total_loss / len(train_loader)


def validate_model(model, val_loader, decoder, phoneme_map= LABELS):

    model.eval()
    batch_bar = tqdm(total=len(val_loader), dynamic_ncols=True, position=0, leave=False, desc='Val')

    total_loss = 0
    vdist = 0

    for i, data in enumerate(val_loader):

        x, y, lx, ly = data
        x, y = x.to(device), y.to(device)

        with torch.inference_mode():
            h, lh = model(x, lx)
            h = torch.permute(h, (1, 0, 2))
            loss = criterion(h, y, lh, ly)

        total_loss += float(loss)
        vdist += calculate_levenshtein(torch.permute(h, (1, 0, 2)), y, lh, ly, decoder, phoneme_map)

        batch_bar.set_postfix(loss="{:.04f}".format(float(total_loss / (i + 1))), dist="{:.04f}".format(float(vdist / (i + 1))))

        batch_bar.update()

        del x, y, lx, ly, h, lh, loss
        torch.cuda.empty_cache()
        #torch.cpu.empty_cache()

    batch_bar.close()
    total_loss = total_loss/len(val_loader)
    val_dist = vdist/len(val_loader)
    return total_loss, val_dist

In [None]:
# test code to check shapes
TEST_MODEL = False
if TEST_MODEL:
  model.eval()
  for i, data in enumerate(val_loader, 0):
      x, y, lx, ly = data
      x, y = x.to(device), y.to(device)
      #x = x.transpose(1, 2) # use this step if not in basic mode
      h, lh = model(x, lx)
      print(h.shape)
      print(calculate_levenshtein(h, y, lh, ly, decoder, LABELS)) # use lh: the encoder shortens the time axis
      h = torch.permute(h, (1, 0, 2))
      print(h.shape, y.shape)
      loss = criterion(h, y, lh, ly)
      print(loss)

      break

In [None]:
def save_model(model, optimizer, scheduler, metric, epoch, path):
    torch.save(
        {'model_state_dict'         : model.state_dict(),
         'optimizer_state_dict'     : optimizer.state_dict(),
         'scheduler_state_dict'     : scheduler.state_dict(),
         metric[0]                  : metric[1],
         'epoch'                    : epoch},
         path
    )

def load_model(path, model, metric= 'valid_acc', optimizer= None, scheduler= None):

    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])

    if optimizer is not None:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    if scheduler is not None:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

    epoch   = checkpoint['epoch']
    metric  = checkpoint[metric]

    return [model, optimizer, scheduler, epoch, metric]

### Training Setup

In [None]:
# This is for checkpointing, if you're doing it over multiple sessions
last_epoch_completed = 0
START_EPO = last_epoch_completed
END_EPO = config["epochs"]
best_lev_dist = float("inf") # if you're restarting from some checkpoint, use what you saw there.
epoch_model_path = "epoch_model.pth" #TODO set the model path( Optional, you can just store best one. Make sure to make the changes below )
best_model_path = "best_model.pth" #TODO set best model path



In [None]:
if RELOAD:
  reload_path = "ENTER RELOAD PATH HERE"
  # load_model returns [model, optimizer, scheduler, epoch, metric]
  reload_items = load_model(reload_path, model, 'valid_dist', optimizer, scheduler)
  model = reload_items[0]
  optimizer = reload_items[1]
  scheduler = reload_items[2]

  # updating above values if reloading from earlier model
  best_lev_dist = reload_items[4]
  START_EPO = reload_items[3]

  print(best_lev_dist, optimizer.param_groups[0]['lr'])

### For Init Training

In [None]:
#TODO: Please complete the training loop
torch.cuda.empty_cache()
gc.collect()

for epoch in range(START_EPO, END_EPO):

    print("\nEpoch: {}/{}".format(epoch+1, config['epochs']))

    curr_lr = float(optimizer.param_groups[0]['lr'])

    train_loss              = train_model(model, train_loader, criterion, optimizer) #TODO
    valid_loss, valid_dist  = validate_model(model, val_loader, decoder) #TODO
    scheduler.step(valid_dist)
    """
    if epoch < 39:
      scheduler_cos.step()
    elif epoch > 39:
      scheduler.step(valid_dist)
    else:
      pass
    """
    print("\tTrain Loss {:.04f}\t Learning Rate {:.07f}".format(train_loss, curr_lr))
    print("\tVal Dist {:.04f}\t Val Loss {:.04f}".format(valid_dist, valid_loss))


    wandb.log({
        'train_loss': train_loss,
        'valid_dist': valid_dist,
        'valid_loss': valid_loss,
        'lr'        : curr_lr
    })

    save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, epoch_model_path)
    wandb.save(epoch_model_path)
    print("Saved epoch model")

    if valid_dist <= best_lev_dist:
        best_lev_dist = valid_dist
        save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, best_model_path)
        wandb.save(best_model_path)
        print("Saved best model")
    """
    if epoch == 39:
      optimizer.param_groups[0]['lr'] = 0.0002
      scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=1, threshold=0.01, verbose=True)
    """
    # You may find it interesting to explore Wandb Artifacts to version your models
    torch.cuda.empty_cache()
    gc.collect()

run.finish()

### For Sequential Finetuning

In [None]:
# reset learning rate
optimizer.param_groups[0]['lr'] = config['lr']
finetune_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.8, patience=3, verbose=True)
best_model_path = "best_finetune_model.pth"
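
Both schedulers in this notebook are `ReduceLROnPlateau`. A simplified pure-Python sketch of its documented behaviour (`mode='min'`, relative threshold, no cooldown, no minimum learning rate) may help when reading the learning-rate logs. `PlateauSketch` is a hypothetical stand-in, not the PyTorch class, and the metric values are made up.

```python
class PlateauSketch:
    """Simplified model of ReduceLROnPlateau in 'min' mode with a relative threshold."""

    def __init__(self, lr, factor, patience, threshold=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.num_bad_epochs = 0

    def step(self, metric):
        # an epoch counts as an improvement only if it beats the best value by the relative threshold
        if metric < self.best * (1.0 - self.threshold):
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        # reduce the learning rate once more than `patience` epochs pass without improvement
        if self.num_bad_epochs > self.patience:
            self.lr *= self.factor
            self.num_bad_epochs = 0
        return self.lr


# Settings of the initial-training scheduler: factor=0.5, patience=1, threshold=0.01
sched = PlateauSketch(lr=2e-3, factor=0.5, patience=1, threshold=0.01)
lrs = [sched.step(d) for d in [10.0, 9.0, 8.95, 8.94, 8.93]]
print(lrs)  # the learning rate halves after two epochs without a 1% improvement
```

With `factor=0.8` and `patience=3`, as in the finetuning scheduler, the decay is slower and gentler.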

In [None]:
if LOG_RUN:
  run = wandb.init(
      name = "run-resnet+plstm+lstm+finetune", ## Wandb creates random run names if you skip this field
      reinit = True, ### Allows reinitializing runs when you re-run this cell
      # run_id = ### Insert specific run id here if you want to resume a previous run
      # resume = "must" ### You need this to resume previous runs, but comment out reinit = True when using this
      project = "hw3p2-ablations", ### Project should be created in your wandb account
      config = config ### Wandb Config for your run
  )

In [None]:
torch.cuda.empty_cache()
gc.collect()

for epoch in range(config['finetune_epochs']):

    print("\nEpoch: {}/{}".format(epoch+1, config['finetune_epochs']))

    curr_lr = float(optimizer.param_groups[0]['lr'])

    train_loss              = train_model(model, train_loader, criterion, optimizer) #TODO
    valid_loss, valid_dist  = validate_model(model, val_loader, decoder) #TODO
    finetune_scheduler.step(valid_dist)
    print("\tTrain Loss {:.04f}\t Learning Rate {:.07f}".format(train_loss, curr_lr))
    print("\tVal Dist {:.04f}\t Val Loss {:.04f}".format(valid_dist, valid_loss))


    wandb.log({
        'train_loss': train_loss,
        'valid_dist': valid_dist,
        'valid_loss': valid_loss,
        'lr'        : curr_lr
    })

    save_model(model, optimizer, finetune_scheduler, ['valid_dist', valid_dist], epoch, epoch_model_path)
    wandb.save(epoch_model_path)
    print("Saved epoch model")

    if valid_dist <= best_lev_dist:
        best_lev_dist = valid_dist
        save_model(model, optimizer, finetune_scheduler, ['valid_dist', valid_dist], epoch, best_model_path)
        wandb.save(best_model_path)
        print("Saved best model")

    # You may find it interesting to explore Wandb Artifacts to version your models
    torch.cuda.empty_cache()
    gc.collect()

run.finish()

# Generate Predictions and Submit to Kaggle

In [None]:
#TODO: Make predictions

# Follow the steps below:
# 1. Create a new object for CTCBeamDecoder with larger (why?) number of beams
# 2. Get prediction string by decoding the results of the beam decoder

TEST_BEAM_WIDTH = 150

test_decoder    = CTCBeamDecoder(labels=LABELS, blank_id=0, beam_width=TEST_BEAM_WIDTH, log_probs_input=True)

results = []

model.eval()
print("Testing")
for data in tqdm(test_loader):

    x, lx   = data
    x       = x.to(device)

    with torch.no_grad():
        h, lh = model(x, lx)

    prediction_string = decode_prediction(h, lh, test_decoder, LABELS)
    #TODO save the output in results array.
    results.append(prediction_string)

    del x, lx, h, lh
    torch.cuda.empty_cache()

Testing




In [None]:
if SUBMIT:
  data_dir = root + "/test-clean/random_submission.csv"
  df = pd.read_csv(data_dir)
  concat_results = sum(results, [])
  df.label = concat_results
  df.to_csv('submission.csv', index = False)

  !kaggle competitions submit -c 11-785-s23-hw3p2 -f submission.csv -m "I made it!"


100% 210k/210k [00:01<00:00, 190kB/s]
Successfully submitted to Automatic Speech Recognition (ASR)
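
`results` is a list of per-batch lists, which the submission cell flattens with `sum(results, [])`. That works; `itertools.chain.from_iterable` is an equivalent alternative that avoids repeatedly copying the growing list. A small sketch with made-up predictions:

```python
from itertools import chain

# made-up stand-in for the per-batch prediction lists collected in the test loop
results = [["abc", "de"], ["f"], ["gh", "ij"]]

flat_sum = sum(results, [])                      # what the notebook does
flat_chain = list(chain.from_iterable(results))  # same result, single pass

print(flat_chain)
print(flat_sum == flat_chain)
```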