# 11785 HW3P2: Automatic Speech Recognition

Welcome to HW3P2. In this homework, you will use the same data as in HW1 but will incorporate sequence models. We recommend that you familiarize yourself with sequential data and with how RNNs, LSTMs, and GRUs work so that this part of the homework goes smoothly.

Disclaimer: This starter notebook is not as elaborate as those of HW1P2 or HW2P2. You will need to do most of the implementation yourself because, after two homeworks, you are expected to be able to write a notebook from scratch. You are welcome to reuse code from the previous starter notebooks, but you may need to adapt it for this homework. <br>
We have also given you 3 log files for the Very Low Cutoff (Levenshtein Distance = 30) so that you can observe how the loss decreases.

Common errors you may face:

*   Shape errors: About half of the errors in this homework fall into this category. Print the shapes between intermediate steps to debug them.
*   CUDA out of memory: This can happen when your architecture has a lot of parameters. The usual remedies are (1) reducing batch_size, (2) calling *torch.cuda.empty_cache* often, even inside your training loop, (3) calling *gc.collect* if it helps, and (4) restarting the runtime if nothing else works.
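
The two debugging habits above can be sketched as follows. This is a minimal illustration, not part of the starter code: `free_memory` is a hypothetical helper name, and the tensor sizes are toy values.

```python
import gc
import torch

def free_memory():
    """Collect unreferenced Python objects, then release cached CUDA memory if a GPU is present."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Toy tensor standing in for a batch: (batch, time, features)
x = torch.randn(4, 100, 13)
print("x:", x.shape)  # print shapes between intermediate steps when debugging

del x          # drop the reference first, otherwise nothing can be freed
free_memory()
```

Note that `torch.cuda.empty_cache` only returns memory that is no longer referenced, so deleting large intermediate tensors before calling it matters.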







# Preliminaries

You will need to install packages for decoding and for calculating the Levenshtein distance.

In [None]:
!pip install python-Levenshtein
!git clone --recursive https://github.com/parlance/ctcdecode.git
!pip install wget
%cd ctcdecode
!pip install .
%cd ..

!pip install torchsummaryX # We also install a summary package to check our model's forward before training

# Libraries

In [None]:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torchsummaryX import summary
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

from sklearn.metrics import accuracy_score
import gc
import zipfile
import pandas as pd
from tqdm import tqdm
import os
import datetime

# imports for decoding and distance calculation
import ctcdecode
import Levenshtein
from ctcdecode import CTCBeamDecoder

import warnings
warnings.filterwarnings('ignore')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

Device:  cuda


# Kaggle (TODO)

You need to set up your Kaggle and download the data

# Dataset and dataloading (TODO)

In [None]:
# PHONEME_MAP is the list that maps each phoneme to a single character.
# The dataset contains a list of phonemes, but you need to map them to their corresponding characters to calculate the Levenshtein distance
# Your final submission should contain the mapped string, not the phonemes
# No TODOs in this cell

PHONEME_MAP = [
    " ",
    ".", #SIL
    "a", #AA
    "A", #AE
    "h", #AH
    "o", #AO
    "w", #AW
    "y", #AY
    "b", #B
    "c", #CH
    "d", #D
    "D", #DH
    "e", #EH
    "r", #ER
    "E", #EY
    "f", #F
    "g", #G
    "H", #H
    "i", #IH 
    "I", #IY
    "j", #JH
    "k", #K
    "l", #L
    "m", #M
    "n", #N
    "N", #NG
    "O", #OW
    "Y", #OY
    "p", #P 
    "R", #R
    "s", #S
    "S", #SH
    "t", #T
    "T", #TH
    "u", #UH
    "U", #UW
    "v", #V
    "W", #W
    "?", #Y
    "z", #Z
    "Z" #ZH
]

In [None]:
# This cell is where your actual TODOs start
# You will need to implement the Dataset class on your own. You may also implement it similarly to HW1P2 (context is not required)
# The implementation steps given below follow how we implemented it.
# However, you are welcome to do it your own way if that is more comfortable or efficient.

class LibriSamples(torch.utils.data.Dataset):

    def __init__(self, data_path, partition= "train"): # You can use partition to specify train or dev

        self.X_dir = # TODO: get mfcc directory path
        self.Y_dir = # TODO: get transcript path

        self.X_files = # TODO: list files in the mfcc directory
        self.Y_files = # TODO: list files in the transcript directory

        # TODO: store PHONEMES from phonemes.py inside the class. phonemes.py will be downloaded from kaggle.
        # You may wish to store PHONEMES as a class attribute or a global variable as well.
        self.PHONEMES = NotImplemented

        assert(len(self.X_files) == len(self.Y_files))

        pass

    def __len__(self):
        return len(self.X_files)

    def __getitem__(self, ind):
    
        X = # TODO: Load the mfcc npy file at the specified index ind in the directory
        Y = # TODO: Load the corresponding transcripts

        # Remember, the transcripts are a sequence of phonemes. Eg. np.array(['<sos>', 'B', 'IH', 'K', 'SH', 'AA', '<eos>'])
        # You need to convert these into a sequence of Long tensors
        # Tip: You may need to use self.PHONEMES
        # Remember, PHONEMES and PHONEME_MAP do not contain '<sos>' or '<eos>', but the transcripts do.
        # You need to remove '<sos>' and '<eos>' from the transcripts.
        # An inefficient way is to use a for loop. An efficient way uses the fact that '<sos>' occurs at the start and '<eos>' occurs at the end.
        
        Yy = # TODO: Convert sequence of phonemes into sequence of Long tensors

        return X, Yy
    
    def collate_fn(batch):

        batch_x = [x for x,y in batch]
        batch_y = [y for x,y in batch]

        batch_x_pad = # TODO: pad the sequence with pad_sequence (already imported)
        lengths_x = # TODO: Get original lengths of the sequence before padding

        batch_y_pad = # TODO: pad the sequence with pad_sequence (already imported)
        lengths_y = # TODO: Get original lengths of the sequence before padding

        return batch_x_pad, batch_y_pad, torch.tensor(lengths_x), torch.tensor(lengths_y)


# You can either try to combine test data in the previous class or write a new Dataset class for test data
class LibriSamplesTest(torch.utils.data.Dataset):

    def __init__(self, data_path, test_order): # test_order is the csv similar to what you used in hw1

        test_order_list = # TODO: open test_order.csv as a list
        self.X = # TODO: Load the npy files from test_order.csv and append into a list
        # You can load the files here or save the paths here and load inside __getitem__ like the previous class
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, ind):
        # TODOs: Need to return only X because this is the test dataset
        return NotImplemented
    
    def collate_fn(batch):
        batch_x = [x for x in batch]
        batch_x_pad = # TODO: pad the sequence with pad_sequence (already imported)
        lengths_x = # TODO: Get original lengths of the sequence before padding

        return batch_x_pad, torch.tensor(lengths_x)
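
The transcript conversion described in `__getitem__` can be sketched as below. This is only an illustration under stated assumptions: `PHONEMES_EXCERPT` is a hypothetical, shortened stand-in for the list in phonemes.py (whose actual contents and order are not shown here), and `phoneme_to_index` is an illustrative helper name.

```python
import numpy as np
import torch

# Hypothetical excerpt; the real list comes from phonemes.py
PHONEMES_EXCERPT = ["", "SIL", "AA", "B", "IH", "K", "SH"]
phoneme_to_index = {p: i for i, p in enumerate(PHONEMES_EXCERPT)}

# Example transcript in the format described in the comments above
transcript = np.array(["<sos>", "B", "IH", "K", "SH", "AA", "<eos>"])

# '<sos>' is always first and '<eos>' always last, so one slice removes both
inner = transcript[1:-1]

# Look up each phoneme's index and build a Long tensor
y = torch.tensor([phoneme_to_index[p] for p in inner], dtype=torch.long)
print(y)
```

Slicing avoids checking every element for the two special tokens, which is the efficiency hint given in the comments.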

In [None]:
batch_size = 128

root = # TODO: Where your hw3p2_student_data folder is

train_data = LibriSamples(root, 'train')
val_data = LibriSamples(root, 'dev')
test_data = LibriSamplesTest(root, 'test_order.csv')

train_loader = # TODO: Define the train loader. Remember to pass in a parameter (function) for the collate_fn argument 
val_loader = # TODO: Define the val loader. Remember to pass in a parameter (function) for the collate_fn argument 
test_loader = # TODO: Define the test loader. Remember to pass in a parameter (function) for the collate_fn argument 

print("Batch size: ", batch_size)
print("Train dataset samples = {}, batches = {}".format(train_data.__len__(), len(train_loader)))
print("Val dataset samples = {}, batches = {}".format(val_data.__len__(), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(test_data.__len__(), len(test_loader)))

In [None]:
# Optional
# Test code for checking shapes and return arguments of the train and val loaders
for data in val_loader:
    x, y, lx, ly = data # if you face an error saying "Cannot unpack", then you are not passing the collate_fn argument
    print(x.shape, y.shape, lx.shape, ly.shape)
    break
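
For reference, here is a self-contained sketch of how a `collate_fn` interacts with a `DataLoader`, using random toy data instead of the homework dataset. `ToySequences` and `toy_collate` are hypothetical names, and the 13 features and label range are assumptions chosen only for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class ToySequences(Dataset):
    """Hypothetical stand-in dataset: random features with variable-length labels."""
    def __init__(self):
        self.sizes = [(7, 3), (5, 2), (9, 4), (6, 3)]  # (frames, labels) per item

    def __len__(self):
        return len(self.sizes)

    def __getitem__(self, ind):
        t, s = self.sizes[ind]
        return torch.randn(t, 13), torch.randint(1, 41, (s,))

def toy_collate(batch):
    xs = [x for x, _ in batch]
    ys = [y for _, y in batch]
    lx = torch.tensor([len(x) for x in xs])  # lengths before padding
    ly = torch.tensor([len(y) for y in ys])
    # Pad every sequence in the batch to the longest one
    return pad_sequence(xs, batch_first=True), pad_sequence(ys, batch_first=True), lx, ly

loader = DataLoader(ToySequences(), batch_size=2, shuffle=False, collate_fn=toy_collate)
x, y, lx, ly = next(iter(loader))
print(x.shape, y.shape, lx.shape, ly.shape)
```

Without the `collate_fn` argument, the default collation tries to stack sequences of different lengths and fails, which is why the lengths are recorded before padding.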

# Model Configuration (TODO)

In [None]:
class Network(nn.Module):

    def __init__(self): # You can add any extra arguments as you wish

        super(Network, self).__init__()

        # Embedding layer converts the raw input into features which may (or may not) help the LSTM to learn better
        # For the very low cutoff you don't require an embedding layer. You can pass the input directly to the LSTM
        # self.embedding = 
        
        self.lstm = # TODO: Create a single-layer, unidirectional LSTM with hidden_size = 256
        # Use nn.LSTM(). Make sure that you pass the proper arguments as given in https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html

        self.classification = # TODO: Create a single classification layer using nn.Linear()

    def forward(self, x): # TODO: You need to pass at least 1 more parameter apart from self and x

        # x is returned from the dataloader. So it is assumed to be padded with the help of the collate_fn
        packed_input = # TODO: Pack the input with pack_padded_sequence. Look at the parameters it requires

        out1, (out2, out3) = # TODO: Pass packed input to self.lstm
        # As the LSTM docs show, nn.LSTM returns an output and a tuple of two states. Which one do you need to pass to the next function?
        out, lengths  = # TODO: Need to 'unpack' the LSTM output using pad_packed_sequence

        out = # TODO: Pass unpacked LSTM output to the classification layer
        out = # Optional: Do log softmax on the output. Which dimension?

        return NotImplemented # TODO: Need to return 2 variables

model = Network().to(device)
print(model)
summary(model, x.to(device), lx) # x and lx are from the previous cell
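
The pack, run, unpack round trip used in `forward` can be checked in isolation. The sketch below uses toy sizes (13 input features and hidden_size 32 are assumptions for illustration, not the homework settings) and is not the intended model.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

x = torch.randn(3, 10, 13)        # padded batch: (batch, max_time, features)
lx = torch.tensor([10, 6, 8])     # true lengths; must be a CPU tensor or a list

lstm = nn.LSTM(input_size=13, hidden_size=32, batch_first=True)

# enforce_sorted=False lets the batch arrive in any length order
packed = pack_padded_sequence(x, lx, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor plus the per-sequence lengths
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape, out_lengths)
```

After unpacking, positions beyond each sequence's true length are zero-filled, and `out_lengths` matches `lx`, which is what the loss and the decoder need later.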

# Training Configuration (TODO)

In [None]:
criterion = # TODO: What loss do you need for sequence to sequence models? 
# Do you need to transpose or permute the model output to find out the loss? Read its documentation
optimizer = # TODO: Adam works well with LSTM (use lr = 2e-3)
decoder = # TODO: Initialize the CTC beam decoder
# Check out https://github.com/parlance/ctcdecode for the details on how to implement decoding
# Do you need to give log_probs_input = True or False?
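
Since the decoder above is a CTC decoder, a shape check with `nn.CTCLoss` may help. This is a sketch on random data, assuming that index 0 serves as the CTC blank and that the model emits batch-first outputs; neither assumption is confirmed by this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T, N, C, S = 12, 2, 41, 4   # time steps, batch size, classes, max target length

logits = torch.randn(N, T, C)  # stand-in for a batch-first model output
# nn.CTCLoss expects log-probabilities shaped (T, N, C)
log_probs = logits.log_softmax(dim=2).permute(1, 0, 2)

targets = torch.randint(1, C, (N, S))    # labels avoid the assumed blank index 0
input_lengths = torch.tensor([12, 9])    # unpadded output lengths
target_lengths = torch.tensor([4, 3])    # unpadded target lengths

criterion = nn.CTCLoss(blank=0)  # blank=0 is an assumption
loss = criterion(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

The permute is the step that the transpose hint refers to: the loss wants time first, while a batch-first model produces batch first.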

In [None]:
# this function calculates the Levenshtein distance 

def calculate_levenshtein(h, y, lh, ly, decoder, PHONEME_MAP):

    # h - output from the model. Probability distributions at each time step 
    # y - target output sequence - sequence of Long tensors
    # lh, ly - Lengths of output and target
    # decoder - decoder object which was initialized in the previous cell
    # PHONEME_MAP - maps output to a character to find the Levenshtein distance

    # TODO: You may need to transpose or permute h based on how you passed it to the criterion
    # Print out the shapes often to debug

    # TODO: call the decoder's decode method and get beam_results and out_len (Read the docs about the decode method's outputs)
    # Input to the decode method will be h and its lengths lh 
    # You need to pass lh for the 'seq_lens' parameter. This is not explicitly mentioned in the git repo of ctcdecode.

    batch_size = # TODO

    dist = 0

    for i in range(batch_size): # Loop through each element in the batch

        h_sliced = # TODO: Get the output as a sequence of numbers from beam_results
        # Remember that h is padded to the max sequence length and lh contains lengths of individual sequences
        # Same goes for beam_results and out_lens
        # You do not require the padded portion of beam_results - you need to slice it with out_lens 
        # If it is confusing, print out the shapes of all the variables and try to understand

        h_string = # TODO: MAP the sequence of numbers to its corresponding characters with PHONEME_MAP and merge everything as a single string

        y_sliced = # TODO: Do the same for y - slice off the padding with ly
        y_string = # TODO: MAP the sequence of numbers to its corresponding characters with PHONEME_MAP and merge everything as a single string
        
        dist += Levenshtein.distance(h_string, y_string)

    dist/=batch_size

    return dist
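
To see the slicing, mapping, and distance steps without installing anything, the sketch below swaps in two simplifications: a greedy CTC collapse instead of the ctcdecode beam search, and a pure-Python edit distance instead of the Levenshtein package. `CHARS` is a short excerpt of PHONEME_MAP, and treating index 0 as the blank is an assumption.

```python
def greedy_ctc_collapse(frame_ids, blank=0):
    """Merge consecutive repeated labels, then drop blanks (greedy CTC decoding)."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

def edit_distance(a, b):
    """Levenshtein distance between two strings via row-by-row dynamic programming."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (ca != cb)))  # substitution or match
        prev_row = row
    return prev_row[-1]

CHARS = [" ", ".", "a", "A", "h"]             # excerpt of PHONEME_MAP
frames = [0, 1, 1, 0, 2, 2, 2, 0, 4, 4, 1]    # per-frame argmax labels (toy data)

h_ids = greedy_ctc_collapse(frames)
h_string = "".join(CHARS[i] for i in h_ids)   # map indices to one string
y_string = ".Ah."                             # toy reference string
dist = edit_distance(h_string, y_string)
print(h_string, dist)
```

Here the hypothesis differs from the reference in one character, so the distance is 1. A beam search usually gives better hypotheses than this greedy collapse, at the cost of decoding time.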

In [None]:
# Optional but recommended

for i, data in enumerate(train_loader, 0):
    
    # Write test code to perform a single forward pass and also compute the Levenshtein distance
    # Make sure that you are able to get this right before going on to the actual training
    # You may encounter a lot of shape errors
    # Printing out the shapes will help in debugging
    # Keep in mind that the loss you will use and the decoder expect their inputs in different formats
    # Make sure to read the corresponding docs about it
    pass

    break # one iteration is enough

In [None]:
torch.cuda.empty_cache() # Use this often

# TODO: Write the model evaluation function if you want to validate after every epoch

# You are free to write your own code for model evaluation or you can use the code from previous homeworks' starter notebooks
# However, you will have to make modifications because of the following.
# (1) The dataloader returns 4 items unlike 2 for hw2p2
# (2) The model forward returns 2 outputs
# (3) The loss may require transpose or permuting

# Note that when you give a higher beam width, decoding will take a longer time to get executed
# Therefore, it is recommended that you calculate only the val dataset's Levenshtein distance (train not recommended) with a small beam width
# When you are evaluating on your test set, you may have a higher beam width

In [None]:
torch.cuda.empty_cache()

# TODO: Write the model training code 

# You are free to write your own code for training or you can use the code from previous homeworks' starter notebooks
# However, you will have to make modifications because of the following.
# (1) The dataloader returns 4 items unlike 2 for hw2p2
# (2) The model forward returns 2 outputs
# (3) The loss may require transpose or permuting

# Tip: Implement mixed precision training
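
One training step with mixed precision might look like the sketch below. It uses a toy linear model and cross-entropy as stand-ins (assumptions for illustration only), and it falls back to ordinary full-precision training when no GPU is available.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"   # only enable mixed precision on a GPU

model = nn.Linear(13, 41).to(device)          # toy stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 13, device=device)
target = torch.randint(0, 41, (8,), device=device)

optimizer.zero_grad()
# Run the forward pass and the loss under autocast
with torch.autocast(device_type=device, enabled=use_amp):
    loss = nn.functional.cross_entropy(model(x), target)

scaler.scale(loss).backward()   # scale the loss to avoid float16 gradient underflow
scaler.step(optimizer)          # unscales gradients, then calls optimizer.step()
scaler.update()                 # adjust the scale factor for the next step
print(loss.item())
```

The same three scaler calls replace the usual `loss.backward()` and `optimizer.step()` pair in a full training loop.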

# Submit to kaggle (TODO)

In [None]:
# TODO: Write your model evaluation code for the test dataset
# You can write your own code or reuse code from the previous homeworks' starter notebooks
# You can't calculate loss here. Why?

In [None]:
# TODO: Generate the csv file
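
Writing predictions to a csv file with pandas can be sketched as follows. The column names `id` and `predictions`, the file name, and the example strings are all hypothetical; check the sample submission from Kaggle for the required format.

```python
import os
import tempfile
import pandas as pd

# Toy decoded strings standing in for the test-set predictions
predictions = [".ah.", ".bik."]

# Hypothetical column names; replace them with those in the sample submission
df = pd.DataFrame({"id": range(len(predictions)), "predictions": predictions})

out_path = os.path.join(tempfile.gettempdir(), "submission_sketch.csv")
df.to_csv(out_path, index=False)   # index=False avoids an extra unnamed column
print(out_path)
```

Keeping the rows in the same order as test_order.csv matters, so the test loader should not shuffle.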