# HW4P2: Attention-based Speech Recognition

<img src="https://cdn.shopify.com/s/files/1/0272/2080/3722/products/SmileBumperSticker_5400x.jpg" alt="A cute cat" width="600">


Welcome to the final assignment in 11785. In this HW, you will work on building a speech recognition system with <i>attention</i>. <br> <br>

<center>
<img src="https://popmn.org/wp-content/uploads/2020/03/pay-attention.jpg" alt="A cute cat" height="100">
</center>

HW Writeup: On Piazza/Course Website <br>
Kaggle Competition Link: https://www.kaggle.com/competitions/11-785-s23-hw4p2/ <br>
Kaggle Dataset Link: https://www.kaggle.com/datasets/varunjain3/11-785-s23-hw4p2-dataset
<br>
LAS Paper: https://arxiv.org/pdf/1508.01211.pdf <br>
Attention is all you need: https://arxiv.org/pdf/1706.03762.pdf

# Instruction to Run the Code

## To run the final model corresponding to the highest Kaggle submission, first make sure the **Global Variables** (next section) are set appropriately, then go to **Runtime** and click **Restart and run all**. This will
1. pip install, import, and download all required packages and data
2. run all functions and classes for loading data and creating the model/optimizer/scheduler
3. train the model for 100 epochs based on the parameters saved in the config variable defined under **LibriSpeech**
4. sequentially finetune the model for 50 epochs with the learning rate reset slightly (0.005)

**Note** that by default, the notebook is expected to finish running in one click. If you pause the run and want to reload the model from a saved path for finetuning, set **RELOAD** to True under the Training section and enter the reload path.

**Note** that by default, the notebook reinstalls all packages, which might take a long time. If you are using GCP / AWS and rerunning it in an environment where it has already been run, set **REINSTALL** to False.

**Note** that by default, the notebook runs the trained model on the test dataset and saves the predicted result in a csv file, but it does not make the submission to Kaggle. To run the notebook with Kaggle submission, set **SUBMIT** (under **Global Variables**) to True.

# Global Variables

In [None]:
USE_TOY = False
CONNECT_DRIVE = False
REINSTALL = True
SUBMIT = False

# README

## Best Score Hyperparameters:
* **init learning rate** = 2e-3
* **finetune init learning rate** = 5e-4
* **teacher forcing ratio** = init 1.0 and drop 0.1 every 20 epochs until hitting 0.5
* **encoder embedding batchnorm**: after every Conv1D layer
* **encoder embedding activation**: Not Applicable
* **encoder LSTM dim** = (256, 256)
* **encoder LSTM dropout** = 0.5
* **encoder LSTM layers** = 1
* **encoder pLSTM locked dropout** = 0.5 before each of the three blocks and 0.25 after the last one
* **encoder pLSTM layers** = 3
* **encoder pLSTM dim** = (1024, 256) > (1024, 256) > (1024, 256)
* **encoder hidden size** = 256
* **encoder embed size** = 256
* **attention projection size** = 128
* **attention mask value** = -1e9
* **decoder LSTM Cell layers**: 3
* **decoder LSTM Cell dim**: (256+128, 512) > (512, 512) > (512, 128)
* **decoder LSTM Cell dropout** = 0.25
* **decoder CDN linear layers**: 2
* **decoder CDN activation**: Tanh
* **decoder CDN dim**: (256, 512) > (512, 256) > (256, 31)
* **decoder hidden size** = 512
* **decoder embed size** = 256
* **weight decay** = None
* **batch size** = 192
* **epoch** = 100
* **finetune epoch** = 50
* **weight init**: None
* **scheduler**: ReduceLROnPlateau <patience = 3, factor = 0.8, mode = min>
* **optimizer**: Adam
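
The teacher forcing schedule listed above can be sketched as a small function. This is a minimal illustration rather than the notebook's actual code: the function name `teacher_forcing_ratio`, its parameter names, and the choice to apply each drop at 0-indexed epoch boundaries are all assumptions.

```python
def teacher_forcing_ratio(epoch, start=1.0, drop=0.1, every=20, floor=0.5):
    """Hypothetical helper: step-wise decay of the teacher forcing ratio.

    Starts at `start`, subtracts `drop` every `every` epochs,
    and never goes below `floor`.
    """
    n_drops = epoch // every
    ratio = start - drop * n_drops
    # round() removes floating-point noise before clamping at the floor
    return max(floor, round(ratio, 10))


# Example: ratio used at a few (0-indexed) epochs
for epoch in (0, 20, 60, 100):
    print(epoch, teacher_forcing_ratio(epoch))
```

Under these assumptions the ratio stays at 1.0 for the first 20 epochs and reaches the 0.5 floor once five drops have been applied.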



## Data Loading Scheme:
The highest Kaggle score was obtained by loading and training on the train-clean-100 dataset using **SpeechDataset**, where all MFCC files are read in sorted order and stored in a list (sequences are padded per batch in the collate function). No memory-efficient loading is used.

All MFCC data are normalized using cepstral normalization. Note that the transcripts keep their SOS and EOS tokens in this assignment. No other data transform is used.


## Architectures:
The highest Kaggle score was reached using a LAS model with encoder, attention, and decoder.

For the encoder, the embedding contains only a Conv1D with kernel size 3 followed by a batchnorm. No activation is used. After the embedding, data are passed into a one-layer biLSTM with 0.5 dropout. The input size is the embed size (256) and the output size is the encoder hidden size (256). Note that the dropout is not added directly to this LSTM layer. Instead, it is applied inside the pBiLSTM layer, before the input data is passed into the pBiLSTM network. This is due to the design of the pBiLSTM class.

Then, the architecture includes three pBiLSTM blocks. In all three blocks, a LockedDropout layer with rate 0.5 is applied before the pBiLSTM. After the last pBiLSTM, a LockedDropout with rate 0.25 is applied. The encoder hidden size (256) is used as the pBiLSTM output size.
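
What makes the dropout "locked" is that one mask is sampled per sequence and feature and then reused at every time step. The sketch below illustrates that idea on a padded tensor. The function name `locked_dropout` is hypothetical, and this is a simplified stand-in for the `PLSTMLockedDropout` class defined later in the notebook, not a copy of it.

```python
import torch


def locked_dropout(x, p=0.5):
    """Hypothetical helper: apply one dropout mask across all time steps.

    x: padded tensor of shape (batch, time, feature)
    """
    if p == 0:
        return x
    keep = 1.0 - p
    # One Bernoulli sample per (sequence, feature); the time dimension is 1
    # so that broadcasting reuses the same mask at every time step.
    mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(keep) / keep
    return x * mask


x = torch.ones(2, 6, 8)
y = locked_dropout(x, p=0.5)
# Every time step of a sequence is masked identically
print(torch.equal(y, y[:, :1, :].expand_as(y)))
```

Sharing the mask over time means a dropped feature stays dropped for the whole utterance, unlike standard dropout, which would resample at every frame.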

For attention, the input encoder hidden size is 512, double the hidden size of the listener/encoder because it is bidirectional. It applies linear transformations of dimension (512, 128) to the encoder output to generate the key and value. An attention mask with fill value -1e9 is used. The mask has shape (batch-size, time-steps) and is applied to the raw attention weights before they are passed into the softmax.
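
The masking step can be illustrated with a minimal sketch of dot-product attention over already-projected keys and values. The function name, argument names, and the use of plain (unscaled) dot products are assumptions made for illustration, so this is not the notebook's attention class.

```python
import torch


def masked_dot_product_attention(query, key, value, encoder_lens, mask_value=-1e9):
    """Hypothetical helper, not the notebook's attention module.

    query: (batch, proj)
    key, value: (batch, time, proj)
    encoder_lens: (batch,) true lengths of the encoder outputs
    """
    time_steps = key.size(1)
    # Padding mask of shape (batch, time): True where t is past the true length.
    positions = torch.arange(time_steps, device=key.device).unsqueeze(0)
    padding_mask = positions >= encoder_lens.to(key.device).unsqueeze(1)

    # Raw weights: dot product between each key and the query -> (batch, time)
    raw_weights = torch.bmm(key, query.unsqueeze(2)).squeeze(2)
    # Fill padded positions with a large negative value before the softmax.
    raw_weights = raw_weights.masked_fill(padding_mask, mask_value)
    attention_weights = torch.softmax(raw_weights, dim=1)

    # Context: attention-weighted sum of the values -> (batch, proj)
    context = torch.bmm(attention_weights.unsqueeze(1), value).squeeze(1)
    return context, attention_weights


q = torch.randn(2, 128)
k = torch.randn(2, 5, 128)
v = torch.randn(2, 5, 128)
context, weights = masked_dot_product_attention(q, k, v, torch.tensor([5, 3]))
print(context.shape, weights[1])  # the last two weights of sequence 1 are padding
```

Because the fill value is so negative, the softmax assigns essentially zero weight to padded frames, so they do not contribute to the context.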

For the decoder, the architecture is a one-layer embedding, a 3-layer LSTMCell stack, a 2-layer MLP, and a final linear output layer. Each LSTM cell is followed by a LockedDropout with rate 0.25, except for the last one. At each time step, the embedding of the current input token is first concatenated with the attention context from the previous time step and passed into the attention module, which computes a query from it by linear transformation and uses that query to compute a new attention context. The new context goes into the LSTM cells. The LSTM output is then passed into the attention module again, this time serving directly as the query (no new query is computed), to compute another attention context. That context is concatenated with the LSTM output and passed into the CDN block. The CDN block projects the 256-dimensional concatenated input to 512, applies Tanh, projects 512 back to 256 with another linear layer, and applies another Tanh before the final output layer.

Note that the final version calls the attention module twice per time step with different inputs: once computing a new query for the attention context, and once using the LSTM cell output as the query.
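
The CDN block described above can be written as a short `torch.nn.Sequential`. This is a sketch built only from the dimensions listed in the hyperparameters, (256, 512) > (512, 256) > (256, 31). The variable names are assumptions, and the random tensors merely stand in for the last LSTM cell output and the attention context.

```python
import torch

VOCAB_SIZE = 31  # length of the LibriSpeech VOCAB used in this notebook

# Two Linear + Tanh layers followed by the final output projection
cdn = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.Tanh(),
    torch.nn.Linear(512, 256),
    torch.nn.Tanh(),
    torch.nn.Linear(256, VOCAB_SIZE),
)

batch_size = 4
lstm_out = torch.randn(batch_size, 128)  # stand-in for the last LSTM cell output
context = torch.randn(batch_size, 128)   # stand-in for the attention context

# Concatenating the two 128-dimensional vectors gives the 256-dimensional CDN input
logits = cdn(torch.cat([lstm_out, context], dim=1))
print(logits.shape)  # one score per vocabulary entry for each batch item
```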

Other architectures were tested as well, with worse performance:
1. For the encoder, a 2-layer pBiLSTM was attempted, and the model could not converge.
2. For the encoder, a 2-layer biLSTM + 2-layer pBiLSTM was attempted, but the performance did not improve.
3. For the encoder, a more complicated embedding with 2 convolution blocks and activation was attempted, and the model could not converge.
4. For the decoder, a 1-layer MLP was attempted.
5. For the decoder, a 2-layer LSTMCell was attempted, using the LSTM cell output directly as the query. The model converged, but more slowly.
6. For the decoder, a 2-layer LSTMCell + 2-layer CDN was attempted, computing the query only inside attention rather than using the LSTM output as the query.
7. For attention, a version without masking was attempted. The model converged more slowly.
8. For LAS, a model with all hidden, embed, and projection sizes doubled was attempted, and the model converged much more slowly.

## Epochs
I trained the model for 100 epochs and finetuned for another 50 epochs to reach my highest Kaggle submission. The distance started to decrease at around epoch 8 and converged at about epoch 80, although the value fluctuated a bit. After 100 epochs, I reset the learning rate for finetuning. At about epoch 11 of finetuning, the model hit the lowest distance I could get. I continued finetuning until all 50 epochs were finished, but no further improvement was made.



## Hyperparameters

* **Batch Size**: 192. Given the computation capacity of my GCP instance, the maximum batch size I could use for the given model is 192.

* **Weight Decay**: None. I started with a weight decay of 5e-5 based on the previous homework, but during the ablation on the toy dataset, the model converged faster without decay.

* **Dropout**: On the encoder, I tried locked dropout rates of 0.25 and 0.5 and found 0.5 to perform best. On the decoder, I tried 0.35 (based on HW1) and 0.25, and 0.25 led to faster convergence on the toy dataset.

* **Hidden size**: I started with the basic parameters given by the handout and tried doubling the hidden size of the encoder only, and of both the encoder and decoder. A larger hidden size made the model fail to converge on the LibriSpeech dataset.



## Other Experiments
### Activation Function on Decoder CDN
I selected LeakyReLU at first based on recommendations online, but Tanh seems to work much better in this homework.

### Encoder Embedding
I tried more complicated encoder embeddings, from ResNet to 2- and 3-layer CNNs with batchnorm and activation. However, my model with only 1 Conv layer and 1 batchnorm layer performed the best.

### Scheduler and Learning Rate
I selected ReduceLROnPlateau with patience 3 and factor 0.8 in min mode. I started with patience = 1 based on HW3, but the distance of the attention model tends to fluctuate, so a very low patience is not ideal. I increased the patience to let the model train a bit longer before dropping to a lower learning rate. In addition, since the distance does not start to decrease until epoch 8 - 10, I only use the scheduler after epoch 15, to prevent the learning rate from decreasing too early.
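
A minimal sketch of this delayed scheduling is shown below. The stand-in model, the helper name `maybe_step_scheduler`, and the 0-indexed epoch convention are assumptions for illustration. Only the scheduler settings (min mode, factor 0.8, patience 3), the initial learning rate, and the 15-epoch delay come from the description above.

```python
import torch

SCHEDULER_START_EPOCH = 15  # epochs to wait before the scheduler sees any metric

model = torch.nn.Linear(4, 2)  # stand-in for the LAS model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.8, patience=3
)


def maybe_step_scheduler(epoch, val_distance):
    """Hypothetical helper: step the scheduler only after the warm-up epochs."""
    if epoch >= SCHEDULER_START_EPOCH:
        scheduler.step(val_distance)
    return optimizer.param_groups[0]["lr"]


# Usage inside a training loop (val_distance would come from validation):
# for epoch in range(config["epochs"]):
#     ...train and validate...
#     current_lr = maybe_step_scheduler(epoch, val_distance)
```

Skipping the scheduler during the early epochs keeps the plateau counter from running while the distance is still flat, which is the behaviour the paragraph above motivates.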

### Optimizer
I only compared the AdamW and Adam optimizers, and found that Adam without weight decay worked better.



# Read this section carefully!

1. By now, we believe that you are already a great deep learning practitioner. Congratulations! 🎉

2. You are allowed to use code from your previous homeworks for this homework. We will only provide the aspects that are necessary and new in this homework.

3. There are a lot of resources provided in this notebook that will help you check whether your implementations are running correctly.

In [None]:
!nvidia-smi

Fri Apr 21 04:58:05 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install some required libraries
# Feel free to add more if you want
!pip install -q python-levenshtein torchsummaryX wandb kaggle pytorch-nlp

# Imports

In [None]:
# Import Necessary Modules you require for this HW here
import torch
import random
import numpy as np
import torch.nn as nn
from torchsummaryX import summary
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from sklearn.decomposition import PCA
from torch.autograd import Variable

import torchaudio.transforms as tat

from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
import seaborn as sns
import gc

import zipfile
import pandas as pd
from tqdm import tqdm
import os
import datetime

# imports for decoding and distance calculation
import Levenshtein


DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", DEVICE)

In [None]:
if CONNECT_DRIVE:
  from google.colab import drive
  drive.mount('/content/drive')

Mounted at /content/drive


# Toy Dataset Download

In [None]:
if USE_TOY and REINSTALL:
  !wget -q https://cmu.box.com/shared/static/om4qpzd4tf1xo4h7230k4v1pbdyueghe --content-disposition --show-progress
  !unzip -q hw4p2_toy.zip -d ./

^C
unzip:  cannot find or open hw4p2_toy.zip, hw4p2_toy.zip.zip or hw4p2_toy.zip.ZIP.


## Toy Dataset

The toy dataset is a dataset of fixed-length speech sequences that have phonetic transcripts. We made it with phonetic transcripts to help you understand how attention can work with the phonetic transcription that you did in HW3P2.

In [None]:
if USE_TOY:
  # Load the toy dataset
  import numpy as np
  import torch
  X_train = np.load("hw4p2_toy/f0176_mfccs_train_new.npy")
  X_valid = np.load("hw4p2_toy/f0176_mfccs_dev_new.npy")
  Y_train = np.load("hw4p2_toy/f0176_hw3p2_train.npy")
  Y_valid = np.load("hw4p2_toy/f0176_hw3p2_dev.npy")

  # This is how you actually need to find out the different transcripts in a dataset.
  # Can you think what's going on here? Why are we using np.unique?
  VOCAB_MAP_TOY           = dict(zip(np.unique(Y_valid), range(len(np.unique(Y_valid)))))
  VOCAB_MAP_TOY["[PAD]"]  = len(VOCAB_MAP_TOY)
  VOCAB_TOY               = list(VOCAB_MAP_TOY.keys())

  SOS_TOKEN_TOY = VOCAB_MAP_TOY["[SOS]"]
  EOS_TOKEN_TOY = VOCAB_MAP_TOY["[EOS]"]
  PAD_TOKEN_TOY = VOCAB_MAP_TOY["[PAD]"]

  Y_train = [np.array([VOCAB_MAP_TOY[p] for p in seq]) for seq in Y_train]
  Y_valid = [np.array([VOCAB_MAP_TOY[p] for p in seq]) for seq in Y_valid]

In [None]:
if USE_TOY:
  VOCAB = VOCAB_TOY

  VOCAB_MAP = VOCAB_MAP_TOY

  PAD_TOKEN = PAD_TOKEN_TOY
  SOS_TOKEN = SOS_TOKEN_TOY
  EOS_TOKEN = EOS_TOKEN_TOY

  print(f"Length of vocab: {len(VOCAB)}")
  print(f"Vocab: {VOCAB}")
  print(f"PAD_TOKEN: {PAD_TOKEN}")
  print(f"SOS_TOKEN: {SOS_TOKEN}")
  print(f"EOS_TOKEN: {EOS_TOKEN}")

Length of vocab: 43
Vocab: ['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'B', 'CH', 'D', 'DH', 'EH', 'ER', 'EY', 'F', 'G', 'HH', 'IH', 'IY', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OY', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UW', 'V', 'W', 'Y', 'Z', 'ZH', '[EOS]', '[SIL]', '[SOS]', '[PAD]']
PAD_TOKEN: 42
SOS_TOKEN: 41
EOS_TOKEN: 39


In [None]:
if USE_TOY:
  class ToyDataset(torch.utils.data.Dataset):

      def __init__(self, partition):

          if partition == "train":
              self.mfccs = X_train
              self.transcripts = Y_train

          elif partition == "valid":
              self.mfccs = X_valid
              self.transcripts = Y_valid

          assert len(self.mfccs) == len(self.transcripts)

          self.length = len(self.mfccs)

      def __len__(self):

          return self.length

      def __getitem__(self, i):

          x = torch.FloatTensor(self.mfccs[i])
          y = torch.tensor(self.transcripts[i])

          return x, y

      def collate_fn(self, batch):

          x_batch, y_batch = list(zip(*batch))

          x_lens      = [x.shape[0] for x in x_batch]
          y_lens      = [y.shape[0] for y in y_batch]

          x_batch_pad = torch.nn.utils.rnn.pad_sequence(x_batch, batch_first=True, padding_value= EOS_TOKEN_TOY)
          y_batch_pad = torch.nn.utils.rnn.pad_sequence(y_batch, batch_first=True, padding_value= EOS_TOKEN_TOY)

          return x_batch_pad, y_batch_pad, torch.tensor(x_lens), torch.tensor(y_lens)

In [None]:
if USE_TOY:
    config = {'init_lr': 0.001}
    config['batch_size'] = 64
    train_toy_dataset   = ToyDataset(partition= 'train')
    valid_toy_dataset   = ToyDataset(partition= 'valid')

    train_toy_loader    = torch.utils.data.DataLoader(
        dataset     = train_toy_dataset,
        batch_size  = config['batch_size'],
        shuffle     = True,
        num_workers = 4,
        pin_memory  = True,
        collate_fn  = train_toy_dataset.collate_fn
    )

    valid_toy_loader    = torch.utils.data.DataLoader(
        dataset     = valid_toy_dataset,
        batch_size  = config['batch_size'],
        shuffle     = False,
        num_workers = 2,
        pin_memory  = True,
        collate_fn  = valid_toy_dataset.collate_fn
    )

    train_loader = train_toy_loader
    dev_loader = valid_toy_loader
    print("No. of train mfccs   : ", train_toy_dataset.__len__())
    print("Batch size           : ", config['batch_size'])
    print("Train batches        : ", train_toy_loader.__len__())
    print("Valid batches        : ", valid_toy_loader.__len__())

No. of train mfccs   :  16000
Batch size           :  128
Train batches        :  125
Valid batches        :  13


In [None]:
if USE_TOY:
  for batch in valid_toy_loader:
      x, y, x_len, y_len = batch
      print(x.shape, y.shape, x_len.shape, y_len.shape)
      break

torch.Size([128, 176, 27]) torch.Size([128, 23]) torch.Size([128]) torch.Size([128])


# Kaggle Dataset Download

In [None]:
if REINSTALL:
    # Put your kaggle username & key here; never leave real credentials in a shared notebook
    api_token = '{"username":"<your-kaggle-username>","key":"<your-kaggle-api-key>"}'

    # set up kaggle.json
    # TODO: Use the same Kaggle code from HW1P2, HW2P2, HW3P2
    # -p avoids an error if the directory already exists
    !mkdir -p /root/.kaggle/

    with open("/root/.kaggle/kaggle.json", "w+") as f:
        f.write(api_token)

    !chmod 600 /root/.kaggle/kaggle.json

In [None]:
if REINSTALL:
  # To download the dataset
  !kaggle datasets download -d varunjain3/11-785-s23-hw4p2-dataset

Downloading 11-785-s23-hw4p2-dataset.zip to /content
100% 3.73G/3.74G [00:38<00:00, 117MB/s]
100% 3.74G/3.74G [00:38<00:00, 103MB/s]


In [None]:
if REINSTALL:
  # To unzip data quickly and quietly
  !unzip -q 11-785-s23-hw4p2-dataset.zip -d ./data

# Dataset and Dataloaders

We have given you 2 datasets. One is a toy dataset, and the other is the standard LibriSpeech dataset. The toy dataset is there to help you get your code implemented, tested, and debugged easily, and to verify that your attention diagonal is produced correctly. Note, however, that its task (phonetic transcription) is drawn from HW3P2; it is meant to be familiar and to help you understand how to transition from phonetic transcription to alphabet transcription with a working attention module.

Please make sure you use the right constants (SOS_TOKEN vs SOS_TOKEN_TOY) in your code for future modules when working with either dataset. We have defined the constants accordingly below. Before you come to OH or post on Piazza, make sure you aren't misusing the constants for either dataset in your code.

## LibriSpeech

In terms of structure, the HW3P2 and HW4P2 datasets are very similar. Can you spot the differences? What will be required?

Hints:

- Check how big the dataset is (do you require memory-efficient loading techniques?)
- How do we load MFCCs? Do we need to normalise them?
- Does the data have \<SOS> and \<EOS> tokens in each sequence? Do we remove them or not? (Read the writeup)
- Would we want a collate function? Ask yourself: why did we need a collate function last time?
- Observe the VOCAB. Is the dataset the same as in HW3P2?
- Should you add augmentations? If yes, which ones, and when should you add them? (Check the bootcamp for the answer)


In [None]:
config = {
  'batch_size': 192, #128,
  'lr':2e-3,
  'epochs': 100,
  'finetune_epochs': 50
}

root = "/content/data"

VOCAB = ['<pad>', '<sos>', '<eos>',
         'A',   'B',    'C',    'D',
         'E',   'F',    'G',    'H',
         'I',   'J',    'K',    'L',
         'M',   'N',    'O',    'P',
         'Q',   'R',    'S',    'T',
         'U',   'V',    'W',    'X',
         'Y',   'Z',    "'",    ' ',
         ]

VOCAB_MAP = {VOCAB[i]:i for i in range(0, len(VOCAB))}

PAD_TOKEN = VOCAB_MAP["<pad>"]
SOS_TOKEN = VOCAB_MAP["<sos>"]
EOS_TOKEN = VOCAB_MAP["<eos>"]

print(f"Length of vocab: {len(VOCAB)}")
print(f"Vocab: {VOCAB}")
print(f"PAD_TOKEN: {PAD_TOKEN}")
print(f"SOS_TOKEN: {SOS_TOKEN}")
print(f"EOS_TOKEN: {EOS_TOKEN}")

Length of vocab: 31
Vocab: ['<pad>', '<sos>', '<eos>', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', "'", ' ']
PAD_TOKEN: 0
SOS_TOKEN: 1
EOS_TOKEN: 2


In [None]:
class SpeechDataset(torch.utils.data.Dataset):

    def __init__(self, root, partition='train-clean-100', transform=["norm"]):
        # Load the directory and all files in them

        self.mfcc_dir = "{}/{}/mfcc".format(root, partition)
        self.transcript_dir = "{}/{}/transcripts".format(root, partition)

        self.mfcc_files = sorted(os.listdir(self.mfcc_dir))
        self.transcript_files = sorted(os.listdir(self.transcript_dir))

        self.transform = transform

        #TODO
        # HOW CAN WE REPRESENT PHONEMES? CAN WE CREATE A MAPPING FOR THEM?
        # HINT: TENSORS CANNOT STORE NON-NUMERICAL VALUES OR STRINGS

        self.mapping = VOCAB_MAP

        #TODO
        # CREATE AN ARRAY OF ALL FEATURES AND LABELS
        # WHAT NORMALIZATION TECHNIQUE DID YOU USE IN HW1? CAN WE USE IT HERE?
        self.mfccs, self.transcripts = [], []

        for i in range(len(self.mfcc_files)):
        #   Load a single mfcc
            mfcc        = np.load(self.mfcc_dir+"/"+self.mfcc_files[i])
        #   Do Cepstral Normalization of mfcc (explained in writeup)
            if "norm" in self.transform:
                mfcc        = (mfcc - mfcc.mean(axis=0))/mfcc.std(axis=0)
        #   Load the corresponding transcript
            transcript  = np.load(self.transcript_dir+"/"+self.transcript_files[i])
            transcript = np.array([self.mapping[t] for t in transcript])
        #   Append each mfcc to self.mfcc, transcript to self.transcript
            self.mfccs.append(mfcc)
            self.transcripts.append(transcript)

        self.length = len(self.mfccs)


    def __len__(self):

        return self.length

    def __getitem__(self, ind):
        mfcc = torch.FloatTensor(self.mfccs[ind])
        transcript = torch.tensor(self.transcripts[ind])
        return mfcc, transcript


    def collate_fn(self, batch):

        # batch of input mfcc coefficients
        batch_mfcc = [b[0] for b in batch]
        batch_transcript = [b[1] for b in batch]

        # HINT: CHECK OUT -> pad_sequence (imported above)
        # Also be sure to check the input format (batch_first)
        batch_mfcc_pad = pad_sequence(batch_mfcc, batch_first=True, padding_value=PAD_TOKEN) # TODO
        lengths_mfcc = [len(mfcc) for mfcc in batch_mfcc]
        batch_transcript_pad = pad_sequence(batch_transcript, batch_first=True, padding_value=PAD_TOKEN) # TODO
        lengths_transcript = [len(trans) for trans in batch_transcript]
        # Return the following values: padded features, padded labels, actual length of features, actual length of the labels
        return batch_mfcc_pad, batch_transcript_pad, torch.tensor(lengths_mfcc), torch.tensor(lengths_transcript)



In [None]:
# Test Dataloader
class SpeechTestDataset(torch.utils.data.Dataset):

    # TODO: Create a test dataset class similar to the previous class but you dont have transcripts for this
    # Imp: Read the mfccs in sorted order, do NOT shuffle the data here or in your dataloader.
    def __init__(self, root, partition= "test-clean", transform=["norm"]): # Feel free to add more arguments

        # MFCC directory - use partition to access the test directory from the kaggle data using root
        self.mfcc_dir       = "{}/{}/mfcc".format(root, partition)

        # List files in self.mfcc_dir using os.listdir in sorted order
        mfcc_names          = sorted(os.listdir(self.mfcc_dir))

        self.mfccs = []
        self.transform = transform

        # Iterate through the mfccs (the test set has no transcripts)
        for i in range(len(mfcc_names)):
        #   Load a single mfcc
            mfcc        = np.load(self.mfcc_dir+"/"+mfcc_names[i])
            if "norm" in self.transform:
                mfcc        = (mfcc - mfcc.mean(axis=0))/mfcc.std(axis=0)
            self.mfccs.append(mfcc)
        self.length = len(self.mfccs)

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # Return the full MFCC sequence at this index (no context window is needed here)
        mfcc = torch.FloatTensor(self.mfccs[ind]) # Convert to tensors

        return mfcc

    def collate_fn(self, batch):

        batch_mfcc_pad = pad_sequence(batch, batch_first=True, padding_value=PAD_TOKEN) # TODO
        lengths_mfcc = [len(mfcc) for mfcc in batch]

        return batch_mfcc_pad, torch.tensor(lengths_mfcc)


In [None]:
dev_dataset = SpeechDataset(root, 'dev-clean')
train_dataset = SpeechDataset(root, 'train-clean-100')
test_dataset = SpeechTestDataset(root)

dev_loader = torch.utils.data.DataLoader(
    dataset     = dev_dataset,
    num_workers = 2,
    batch_size  = config['batch_size'],
    collate_fn  = dev_dataset.collate_fn,
    pin_memory  = True,
    shuffle     = False

)

train_loader = torch.utils.data.DataLoader(
    dataset     = train_dataset,
    num_workers = 4,
    batch_size  = config['batch_size'],
    collate_fn  = train_dataset.collate_fn,
    pin_memory  = True,
    shuffle     = True
)

test_loader = torch.utils.data.DataLoader(
    dataset     = test_dataset,
    num_workers = 2,
    batch_size  = config['batch_size'],
    collate_fn  = test_dataset.collate_fn,
    pin_memory  = True,
    shuffle     = False
)


print("\nChecking the shapes of the data...")
for batch in dev_loader:
    x, y, x_len, y_len = batch
    print(x.shape, y.shape, x_len.shape, y_len.shape)
    break


Checking the shapes of the data...
torch.Size([192, 2936, 27]) torch.Size([192, 364]) torch.Size([192]) torch.Size([192])


Check if you are loading the data correctly with the following:

(Note: These are outputs from loading your data in the dataset class, not your dataloader which will have padded sequences)

- Train Dataset
```
Partition loaded:  train-clean-100
Max mfcc length:  2448
Average mfcc length:  1264.6258453344547
Max transcript:  400
Average transcript length:  186.65321139493324
```

- Dev Dataset
```
Partition loaded:  dev-clean
Max mfcc length:  3260
Average mfcc length:  713.3570107288198
Max transcript:  518
Average transcript length:  108.71698113207547
```

- Test Dataset
```
Partition loaded:  test-clean
Max mfcc length:  3491
Average mfcc length:  738.2206106870229
```

If your values do not match, read the hints and think about what could have gone wrong. Then approach the TAs.
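
One way to produce the statistics listed above is a small helper that only looks at sequence lengths. This is a sketch: the function name `summarize_lengths` and the returned dictionary keys are hypothetical, and it assumes the dataset object keeps its sequences in list attributes such as `mfccs` and `transcripts`, as the `SpeechDataset` class above does.

```python
def summarize_lengths(partition, mfccs, transcripts=None):
    """Hypothetical helper: max and average sequence lengths for one partition."""
    mfcc_lens = [len(m) for m in mfccs]
    stats = {
        "partition": partition,
        "max_mfcc_len": max(mfcc_lens),
        "avg_mfcc_len": sum(mfcc_lens) / len(mfcc_lens),
    }
    # The test partition has no transcripts, so these entries are optional.
    if transcripts is not None:
        transcript_lens = [len(t) for t in transcripts]
        stats["max_transcript_len"] = max(transcript_lens)
        stats["avg_transcript_len"] = sum(transcript_lens) / len(transcript_lens)
    return stats


# Usage with the datasets created earlier in this notebook:
# print(summarize_lengths("dev-clean", dev_dataset.mfccs, dev_dataset.transcripts))
# print(summarize_lengths("test-clean", test_dataset.mfccs))
```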

# THE MODEL

### Listen, Attend and Spell
Listen, Attend and Spell (LAS) is a neural network model for speech recognition.

- LAS is designed to handle long input sequences and is robust to noisy speech signals.
- LAS is known for its high accuracy and ability to improve over time with additional training data.
- It consists of a <b>listener, an attender and a speller</b>, which work together to convert an input speech signal into a corresponding output text.

#### The Dataflow:
<center>
<img src="https://github.com/varunjain3/11785_s23_h4p2/raw/main/DataFlow.png" alt="data flow" height="100">
</center>

#### The Listener:
- converts the input speech signal into a sequence of hidden states.

#### The Attender:
- Decides how the sequence of encoder hidden states is propagated to the decoder.

#### The Speller:
- A language model that incorporates the "context of attender" (output of the attender) to predict the output text sequence.






## Building Blocks

In [None]:
def init_weights(m):
    if isinstance(m, torch.nn.Conv1d):
        torch.nn.init.kaiming_normal_(m.weight.data, mode='fan_out', nonlinearity='relu')
    elif isinstance(m, torch.nn.BatchNorm1d):
        torch.nn.init.constant_(m.weight.data, 1)
        torch.nn.init.constant_(m.bias.data, 0)
    elif isinstance(m, torch.nn.Linear):
        torch.nn.init.normal_(m.weight.data, 0, 0.01)
        torch.nn.init.constant_(m.bias.data, 0)

In [None]:
class ConvBlock(torch.nn.Sequential):
    def __init__(self, in_chan, out_chan, kernel, stride, padding=-1, groups=1, bias=False):
        if padding < 0:
          padding = (kernel-1)//2
        super(ConvBlock, self).__init__(
            torch.nn.Conv1d(in_channels=in_chan, out_channels=out_chan,
                            kernel_size=kernel, stride=stride, padding=padding,
                            groups=groups, bias=bias),
            torch.nn.BatchNorm1d(out_chan),
            #torch.nn.ReLU6(inplace=True)
            torch.nn.GELU()
        )

In [None]:
class PermuteBlock(torch.nn.Module):
    def forward(self, x):
        return x.transpose(1, 2)

class PLSTMLockedDropout(torch.nn.Module):
  def __init__(self, dropout=0.5, lockafter=False):
    super().__init__()
    self.dropout = dropout
    self.lockafter = lockafter

  def forward(self, x):
    # x: batch, seq_len, feature
    if not self.training or not self.dropout:
      return x
    if self.lockafter:
      x, x_lens = pad_packed_sequence(x, batch_first=True)
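    # Sample one mask per (sequence, feature) and share it across all time steps.
    # new_empty allocates on the same device and dtype as x, without gradients.
    m = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.dropout)
    mask = (m / (1 - self.dropout)).expand_as(x)
    mask_x = mask * x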
    if self.lockafter:
      mask_x = pack_padded_sequence(mask_x, x_lens, batch_first=True, enforce_sorted=False)

    return mask_x


class pBLSTM(torch.nn.Module):

    def __init__(self, input_size, hidden_size, droprate=0.5):
        super(pBLSTM, self).__init__()
        self.blstm = torch.nn.LSTM(input_size*2, hidden_size, num_layers=1,
                                   batch_first=True, bidirectional=True)
        if droprate > 0:
          self.dropout = PLSTMLockedDropout(droprate)
        else:
          self.dropout = None

    def forward(self, x_packed):
        x_pad, x_pad_lens = pad_packed_sequence(x_packed, batch_first=True)
        x_trunc, x_trunc_lens = self.trunc_reshape(x_pad, x_pad_lens)
        if self.dropout:
          x_trunc = self.dropout(x_trunc) # lock dropout layer here works better
        x_packed = pack_padded_sequence(x_trunc, x_trunc_lens, batch_first=True, enforce_sorted=False)
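        # The LSTM returns a PackedSequence plus the (h_n, c_n) state, which is unused here
        output_packed, _ = self.blstm(x_packed)
        """
        Disabled alternative: apply locked dropout after the LSTM instead of before it.
        if self.dropout:
          output_pad, output_lens = pad_packed_sequence(output_packed, batch_first=True)
          output_pad = self.dropout(output_pad)
          output_packed = pack_padded_sequence(output_pad, output_lens, batch_first=True, enforce_sorted=False)
        """
        return output_packed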

    def trunc_reshape(self, x, x_lens): # x: [b, seq, dim]
        # TODO: If you have odd number of timesteps, how can you handle it? (Hint: You can exclude them)
        # TODO: Reshape x. When reshaping x, you have to reduce number of timesteps by a downsampling factor while increasing number of features by the same factor
        # TODO: Reduce lengths by the same downsampling factor
        if x.shape[1] % 2 != 0:
          x = x[:, :-1, :]
        x_down = x.reshape(x.size(0), x.size(1) // 2, x.size(2)*2) # [b, seq/2, 2*dim]
        x_lens = x_lens // 2

        return x_down, x_lens

## The Listener:

Pseudocode:
```python
class Listener:
  def init():
    feature_embedder = #Few layers of 1DConv-batchnorm-activation (Don't overdo)
    pblstm_encoder = #Cascaded pblstm layers (Take pblstm from #HW3P2),
    #can add more sequential lstms
    dropout = #As per your liking

  def forward(x,lx):
    embedding = feature_embedder(x) #optional
    encoding, encoding_len = pblstm_encoder(embedding/x,lx)
    #Regularization if needed
    return encoding, encoding_len
```
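
The time-halving step inside each cascaded pBLSTM layer can be checked on a small tensor. The sketch below mirrors the idea of `trunc_reshape` above as a standalone function. The name `pyramid_reshape` and the toy shapes are assumptions for illustration.

```python
import torch


def pyramid_reshape(x, x_lens):
    """Hypothetical helper: halve the time axis and double the feature axis.

    x: padded tensor of shape (batch, time, feature)
    x_lens: (batch,) tensor of true lengths
    """
    # Drop the last frame when the number of time steps is odd.
    if x.size(1) % 2 != 0:
        x = x[:, :-1, :]
    # Concatenate each pair of adjacent frames into one wider frame.
    x = x.reshape(x.size(0), x.size(1) // 2, x.size(2) * 2)
    x_lens = torch.div(x_lens, 2, rounding_mode="floor")
    return x, x_lens


x = torch.arange(30, dtype=torch.float32).reshape(2, 5, 3)
out, out_lens = pyramid_reshape(x, torch.tensor([5, 4]))
print(out.shape, out_lens)  # 5 frames of 3 features become 2 frames of 6 features
```

Stacking three such layers, as the `Listener` below does, shortens the time axis by a factor of 8, which leaves the attention module with far fewer encoder steps to attend over.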



In [None]:
class Listener(torch.nn.Module):

    def __init__(self, input_size, embed_size, listener_hidden_size):
        super(Listener, self).__init__()


        self.embedding = torch.nn.Sequential(
            torch.nn.Conv1d(input_size, embed_size, kernel_size=3, stride=1, padding=1, bias=False),
            torch.nn.BatchNorm1d(embed_size)
        )
        """
        last_embed = torch.nn.Sequential(
            torch.nn.Conv1d(embed_size, embed_size, kernel_size=3, stride=1, padding=1, bias=False),
            torch.nn.BatchNorm1d(embed_size)
        )
        torch.nn.Sequential(*[
            ConvBlock(input_size, 256, 3, 1),
            ConvBlock(256, embed_size, 3, 1),
            last_embed]
        )
        """
        self.lstm = torch.nn.LSTM(embed_size, embed_size, num_layers=1, #dropout = 0.25,
                                  batch_first=True, bidirectional=True)

        self.pBLSTMs = torch.nn.Sequential(
            pBLSTM(embed_size*2, listener_hidden_size),
            pBLSTM(listener_hidden_size*2, listener_hidden_size),
            pBLSTM(listener_hidden_size*2, listener_hidden_size),
            #pBLSTM(listener_hidden_size*2, listener_hidden_size),
            PLSTMLockedDropout(0.25, True)
        )



    def forward(self, x, x_lens):
        embed_x_t = self.embedding(x.transpose(1, 2))
        embed_x = embed_x_t.transpose(1, 2)

        # Pass through the first LSTM at the very bottom
        x_combined = pack_padded_sequence(embed_x, x_lens, batch_first=True, enforce_sorted=False)
        lstm_output, _ = self.lstm(x_combined)

        # TODO: Pass through the pBLSTM blocks
        pblstm_output = self.pBLSTMs(lstm_output)

        # Unpack the sequence
        encoder_outputs, encoder_lens = pad_packed_sequence(pblstm_output, batch_first=True)

        return encoder_outputs, encoder_lens

## Attention

### Different ways to compute Attention

1. Dot-product attention
    * raw_weights = bmm(key, query)
    * Optional: Scaled dot-product by normalizing with sqrt key dimension
    * Check "Attention is All You Need" Section 3.2.1
    * 1st way is what most TAs are comfortable with, but if you want to explore, check out other methods below


2. Cosine attention
    * raw_weights = cosine(query, key) # almost the same as dot-product xD

3. Bi-linear attention
    * W = Linear transformation (learnable parameter): d_k -> d_q
    * raw_weights = bmm(key @ W, query)

4. Multi-layer perceptron
    * Check "Neural Machine Translation and Sequence-to-sequence Models: A Tutorial" Section 8.4

5. Multi-Head Attention
    * Check "Attention is All You Need" Section 3.2.2
    * h = Number of heads
    * W_Q, W_K, W_V: Weight matrix for Q, K, V (h of them in total)
    * W_O: d_v -> d_v
    * Reshape K: (B, T, d_k) to (B, T, h, d_k // h) and transpose to (B, h, T, d_k // h)
    * Reshape V: (B, T, d_v) to (B, T, h, d_v // h) and transpose to (B, h, T, d_v // h)
    * Reshape Q: (B, d_q) to (B, h, d_q // h)
    * raw_weights = Q @ K^T
    * masked_raw_weights = mask(raw_weights)
    * attention = softmax(masked_raw_weights)
    * multi_head = attention @ V
    * multi_head = multi_head reshaped to (B, d_v)
    * context = multi_head @ W_O
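
As a concrete reference for option 1, the following is a minimal plain-Python sketch of masked, scaled dot-product attention for a single query, written with lists so the arithmetic is visible without a framework. The function name `dot_product_attention` and the `valid_len` argument are assumptions made for this illustration; the sketch also assumes `valid_len >= 1`, since otherwise every position would be masked and the softmax would be undefined.

```python
import math


def dot_product_attention(query, keys, values, valid_len, scale=True):
    """Single-query attention over T timesteps (illustrative sketch).

    query:  list of d floats
    keys:   T lists of d floats
    values: T lists of d_v floats
    valid_len: timesteps at index >= valid_len are padding and get zero weight
    """
    d = len(query)
    # Raw weights: one dot product per timestep
    scores = [sum(k_i * q_i for k_i, q_i in zip(k, query)) for k in keys]
    if scale:
        scores = [s / math.sqrt(d) for s in scores]
    # Mask padded timesteps with -inf so they vanish after the softmax
    scores = [s if t < valid_len else float("-inf") for t, s in enumerate(scores)]
    # Numerically stable softmax
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context: attention-weighted sum of the values
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return context, weights


keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
context, weights = dot_product_attention([1.0, 0.0], keys, values, valid_len=2)
print(weights)
print(context)
```

The third timestep has the largest raw score, yet it receives zero weight because it lies beyond `valid_len`; this is the effect the padding mask is meant to have.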

Pseudocode:

```python
class Attention:
    '''
    Attention is calculated using the key, value (from encoder embeddings) and query from decoder.

    After obtaining the raw weights, compute and return attention weights and context as follows:

    attention_weights   = softmax(raw_weights)
    attention_context   = einsum("thinkwhatwouldbetheequationhere",attention, value) #take hint from raw_weights calculation

    At the end, you can pass context through a linear layer too.
    '''

    def init(listener_hidden_size,
              speller_hidden_size,
              projection_size):

        VW = Linear(listener_hidden_size,projection_size)
        KW = Linear(listener_hidden_size,projection_size)
        QW = Linear(speller_hidden_size,projection_size)

    def set_key_value(encoder_outputs):
        '''
        In this function we take the encoder embeddings and make key and values from it.
        key.shape   = (batch_size, timesteps, projection_size)
        value.shape = (batch_size, timesteps, projection_size)
        '''
        key = KW(encoder_outputs)
        value = VW(encoder_outputs)
      
    def compute_context(decoder_context):
        '''
        In this function, we make the query from the decoder context, then
         multiply the query with the keys to find the attention logits.
         Finally, we take a softmax to get the attention weights, which are
         multiplied with the generated values and then summed.

        key.shape   = (batch_size, timesteps, projection_size)
        value.shape = (batch_size, timesteps, projection_size)
        query.shape = (batch_size, projection_size)

        You are also recommended to check out Abu's Lecture 19 to understand Attention better.
        '''
        query = QW(decoder_context) #(batch_size, projection_size)

        raw_weights = #using bmm or einsum. We need to perform batch matrix multiplication. It is important you do this step correctly.
        #What will be the shape of raw_weights?

        attention_weights = #What makes raw_weights -> attention_weights

        attention_context = #Multiply attention weights to values

        return attention_context, attention_weights
```

In [None]:

class Attention(torch.nn.Module):
    def __init__(self, listener_hidden_size, speller_hidden_size, projection_size=128):
        # 512, 512, 128
        super().__init__()
        self.batch_size = None # init
        self.projection_size = projection_size

        self.VW = torch.nn.Linear(listener_hidden_size, projection_size) # could have diff proj size than k and q
        self.KW = torch.nn.Linear(listener_hidden_size, projection_size)
        self.leakyrelu = torch.nn.LeakyReLU(negative_slope=0.25)
        #self.QW = torch.nn.Linear(speller_hidden_size, projection_size)

    def set_key_value(self, encoder_outputs):
        '''
        In this function we take the encoder embeddings and make key and values from it.
        key.shape   = (batch_size, timesteps, projection_size)
        value.shape = (batch_size, timesteps, projection_size)
        '''
        self.key = self.KW(encoder_outputs)
        self.value = self.VW(encoder_outputs)
        self.batch_size = encoder_outputs.shape[0]

        return self.key, self.value

    def forward(self, decoder_context, attn_mask, compute_query=False, add_activate=False):

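        if compute_query:
          # NOTE: this creates a new, randomly initialized Linear layer on every call.
          # It is not among the parameters the optimizer was built with, so its weights are never trained.
          self.QW = torch.nn.Linear(decoder_context.shape[1], self.projection_size).to(DEVICE)
          self.query = self.QW(decoder_context) #(batch_size, projection_size)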
        else:
          # directly use result from lstm as query
          self.query = decoder_context

        if add_activate:
          self.key = self.leakyrelu(self.key)
          self.value = self.leakyrelu(self.value)
          self.query = self.leakyrelu(self.query)

        self.key = self.key.to(DEVICE)
        raw_weights = torch.bmm(self.key, self.query.unsqueeze(2)).squeeze(2).to(DEVICE)
        # key shape: batch, timesteps, project
        # query unsqueeze: batch, project, 1 >>> batch, timesteps, 1
        if attn_mask is not None:
           # a mask with shape (batch, timesteps)
          attn_mask = attn_mask.to(DEVICE)
          to_fill = -1e9 if raw_weights.dtype == torch.float32 else -1e+4
          raw_weights.masked_fill_(attn_mask, to_fill)
        else:
          # if no mask, use a factor to smooth the weights
          raw_weights = 1/np.sqrt(self.query.shape[1]) * raw_weights

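        # softmax over the timestep dimension turns raw_weights into attention_weights
        attention_weights = torch.nn.functional.softmax(raw_weights, dim=1)
        # (batch, 1, timesteps) @ (batch, timesteps, projection) -> (batch, 1, projection) -> (batch, projection)
        attention_context = torch.bmm(attention_weights.unsqueeze(1), self.value).squeeze(1)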

        return attention_context, attention_weights

## The Speller

Similar to the language model you coded up for HW4P1, you have to code a language model for HW4P2 as well. This time, the decoder also calls the attention step at every timestep to get the attended encoder embeddings (the context).


What you have coded till now:

<center>
<img src="https://github.com/varunjain3/11785_s23_h4p2/raw/main/EncoderAttention.png" alt="data flow" height="400">
</center>

For the Speller, what we have to code:


<center>
<img src="https://github.com/varunjain3/11785_s23_h4p2/raw/main/Decoder.png" alt="data flow" height="400">
</center>

In [None]:
class Speller(torch.nn.Module):

    def __init__(self, attender, input_size, embed_size, speller_hidden_size, direct_query=True):
        super(Speller, self).__init__()
        self.attender = attender
        self.projection_size = self.attender.projection_size
        self.attention = attender
        self.max_timesteps = 550
        self.direct_query = direct_query

        # Embedding layer to convert token to latent space
        self.embedding =  torch.nn.Embedding(input_size, embed_size, padding_idx=PAD_TOKEN)
        # indicate if use LSTM output as query or not
        connect_size = self.projection_size if direct_query else speller_hidden_size
        # Create a sequence of LSTM Cells
        self.lstm_cells =  torch.nn.Sequential(
            torch.nn.LSTMCell(embed_size+self.projection_size, speller_hidden_size),
            torch.nn.LSTMCell(speller_hidden_size, speller_hidden_size),
            torch.nn.LSTMCell(speller_hidden_size, connect_size)
        )
        self.dropout = PLSTMLockedDropout(0.25)

        # For CDN (Feel free to change)
        self.output_to_cdn = torch.nn.Linear(connect_size+self.projection_size, 2*embed_size)
        self.cdn_activate = torch.nn.Tanh()
        self.cdn_to_char = torch.nn.Linear(2*embed_size, embed_size)
        self.char_activate = torch.nn.Tanh()
        self.char_prob = torch.nn.Linear(embed_size, input_size)

        # weight tying
        self.char_prob.weight = self.embedding.weight

    def lstm_step(self, input_word, hidden_state):

      for i in range(len(self.lstm_cells)):
            hidden_state[i] = self.lstm_cells[i](input_word, hidden_state[i])
            input_word = hidden_state[i][0]
            if i == 0:
              input_word = self.dropout(input_word.unsqueeze(1)).squeeze(1)

      return input_word, hidden_state


    def CDN(self, cdn_input):
      # Make the CDN here, you can add the output-to-char
      cdn_output = self.output_to_cdn(cdn_input)
      cdn_output = self.cdn_activate(cdn_output)
      cdn_output = self.cdn_to_char(cdn_output)
      cdn_output = self.char_activate(cdn_output)
      cdn_prob = self.char_prob(cdn_output)

      return cdn_prob


    def forward(self, encoder_lens, y=None, teacher_forcing_ratio=0.9):


        if y is not None: # targets are available (training)
            batch_size, timesteps = y.shape
        else:
            batch_size = self.attender.key.shape[0]
            timesteps = self.max_timesteps
            teacher_forcing_ratio = 0

        # initial context tensor for time t = 0
        # attn_context = torch.zeros((batch_size, self.projection_size))
        attn_context = self.attender.value[:, 0, :].squeeze(1)#.to(DEVICE)
        # initial prediction output
        output_symbol = torch.zeros(batch_size, 1).to(DEVICE)
        output_symbol[:] = SOS_TOKEN

        # universal attention mask
        mask_len = self.attender.key.shape[1]
        mask = torch.arange(mask_len).unsqueeze(0) >= encoder_lens.unsqueeze(1)
        # (1, timesteps) vs (batch_size, 1) -> (N, timesteps max)

        raw_outputs = []
        attention_plot = []

        # Initialize your hidden_states list here similar to HW4P1
        hidden_states_list = [None]*len(self.lstm_cells) #if hidden_states_list == None else hidden_states_list


        for t in range(timesteps):
            p = np.random.random() # generate a probability p between 0 and 1
            if self.training and p < teacher_forcing_ratio and t > 0: # t = 0, cannot use t-1 / previous value, so directly use SOS embedding
                char_embed = self.embedding(y.long())[:, t-1]  # Take from y, else draw from probability distribution
            else:
                char_embed = self.embedding(output_symbol.argmax(dim=-1))
            # print(t, char_embed.shape, attn_context.shape)
            # Concatenate the character embedding and context from attention, as shown in the diagram
            lstm_input = torch.cat([char_embed, attn_context], dim=1)

            lstm_output, hidden_states_list = self.lstm_step(lstm_input, hidden_states_list) # Feed the input through LSTM Cells and attention.
            # print(lstm_output.shape)
            # What should we retrieve from forward_step to prepare for the next timestep?
            attn_context, attn_weights = self.attender(lstm_output, mask, add_activate=False)
            # print(attn_context.shape)
            cdn_input = torch.cat([lstm_output, attn_context], dim=1)
            output_symbol = self.CDN(cdn_input)  # call CDN with cdn_input

            # Generate a prediction for this timestep and collect it in output_symbols
            raw_outputs.append(output_symbol.unsqueeze(1))
            attention_plot.append(attn_weights)

        attention_plot = torch.stack(attention_plot, dim=1)
        raw_outputs = torch.cat(raw_outputs, dim=1)

        return raw_outputs, attention_plot

In [None]:
class SpellerFancy(torch.nn.Module):

    def __init__(self, attender, input_size, embed_size, speller_hidden_size, direct_query=True):
        super(SpellerFancy, self).__init__()
        self.attender = attender
        self.projection_size = self.attender.projection_size
        self.attention = attender
        self.max_timesteps = 550
        self.direct_query = direct_query

        # Embedding layer to convert token to latent space
        self.embedding =  torch.nn.Embedding(input_size, embed_size, padding_idx=PAD_TOKEN)
        # indicate if use LSTM output as query or not
        connect_size = self.projection_size if direct_query else speller_hidden_size
        # Create a sequence of LSTM Cells
        self.lstm_cells =  torch.nn.Sequential(
            torch.nn.LSTMCell(embed_size+self.projection_size, speller_hidden_size),
            torch.nn.LSTMCell(speller_hidden_size, speller_hidden_size),
            torch.nn.LSTMCell(speller_hidden_size, connect_size)
        )
        self.dropout = PLSTMLockedDropout(0.25)

        # For CDN (Feel free to change)
        self.output_to_cdn = torch.nn.Linear(connect_size+self.projection_size, 2*embed_size)
        self.cdn_activate = torch.nn.Tanh()
        self.cdn_to_char = torch.nn.Linear(2*embed_size, embed_size)
        self.char_activate = torch.nn.Tanh()
        self.char_prob = torch.nn.Linear(embed_size, input_size)

        # weight tying
        self.char_prob.weight = self.embedding.weight

    def lstm_step(self, input_word, hidden_state):

      for i in range(len(self.lstm_cells)):
            hidden_state[i] = self.lstm_cells[i](input_word, hidden_state[i])
            input_word = hidden_state[i][0]
            if i < (len(self.lstm_cells)-1):
              input_word = self.dropout(input_word.unsqueeze(1)).squeeze(1)

      return input_word, hidden_state


    def CDN(self, cdn_input):
      # Make the CDN here, you can add the output-to-char
      cdn_output = self.output_to_cdn(cdn_input)
      cdn_output = self.cdn_activate(cdn_output)
      cdn_output = self.cdn_to_char(cdn_output)
      cdn_output = self.char_activate(cdn_output)
      cdn_prob = self.char_prob(cdn_output)

      return cdn_prob


    def forward(self, encoder_lens, y=None, teacher_forcing_ratio=0.9):

        if y is not None: # targets are available (training)
            batch_size, timesteps = y.shape
        else:
            batch_size = self.attender.key.shape[0]
            timesteps = self.max_timesteps
            teacher_forcing_ratio = 0

        # initial context tensor for time t = 0
        # attn_context = torch.zeros((batch_size, self.projection_size))
        attn_context = self.attender.value[:, 0, :].squeeze(1)#.to(DEVICE)
        # initial prediction output
        output_symbol = torch.zeros(batch_size, 1).to(DEVICE)
        output_symbol[:] = SOS_TOKEN

        # universal attention mask
        mask_len = self.attender.key.shape[1]
        mask = torch.arange(mask_len).unsqueeze(0) >= encoder_lens.unsqueeze(1)
        # (1, timesteps) vs (batch_size, 1) -> (N, timesteps max)

        raw_outputs = []
        attention_plot = []

        # Initialize your hidden_states list here similar to HW4P1
        hidden_states_list = [None]*len(self.lstm_cells) #if hidden_states_list == None else hidden_states_list


        for t in range(timesteps):
            p = np.random.random() # generate a probability p between 0 and 1
            if self.training and p < teacher_forcing_ratio and t > 0: # t = 0, cannot use t-1 / previous value, so directly use SOS embedding
                char_embed = self.embedding(y.long())[:, t-1]  # Take from y, else draw from probability distribution
            else:
                char_embed = self.embedding(output_symbol.argmax(dim=-1))
            # print(t, char_embed.shape, attn_context.shape)

            attn_context, _ = self.attender(char_embed.to(DEVICE), mask.to(DEVICE), compute_query=True)
            # Concatenate the character embedding and context from attention, as shown in the diagram
            lstm_input = torch.cat([char_embed, attn_context], dim=1)

            lstm_output, hidden_states_list = self.lstm_step(lstm_input, hidden_states_list) # Feed the input through LSTM Cells and attention.
            # What should we retrieve from forward_step to prepare for the next timestep?
            attn_context, attn_weights = self.attender(lstm_output, mask, add_activate=False)

            cdn_input = torch.cat([lstm_output, attn_context], dim=1)
            output_symbol = self.CDN(cdn_input)  # call CDN with cdn_input

            # Generate a prediction for this timestep and collect it in output_symbols
            raw_outputs.append(output_symbol.unsqueeze(1))
            attention_plot.append(attn_weights)

        attention_plot = torch.stack(attention_plot, dim=1)
        raw_outputs = torch.cat(raw_outputs, dim=1)

        return raw_outputs, attention_plot
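
The "universal attention mask" built in `forward` compares a row of timestep indices against a column of encoder lengths. A minimal list-based sketch of that comparison, with a hypothetical helper name `padding_mask`, shows what the resulting (batch, timesteps) mask looks like:

```python
def padding_mask(lengths, max_len=None):
    """True marks padded positions, one row per utterance (hypothetical helper)."""
    if max_len is None:
        max_len = max(lengths)
    # Position t of an utterance with n valid frames is padding when t >= n
    return [[t >= n for t in range(max_len)] for n in lengths]


mask = padding_mask([4, 2, 3])
for row in mask:
    print(["pad" if is_pad else "ok" for is_pad in row])
```

Positions marked `True` are the ones the attention fills with a large negative value before the softmax.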

## LAS

Here we finally build the LAS model by combining the listener, attender and speller. We have given a template, but you are free to read the paper and implement it yourself.

In [None]:
class LAS(torch.nn.Module):
  def __init__(self, input_size, encoder_embed_size, encoder_hidden_size, decoder_embed_size, decoder_hidden_size, projection_size=128):
      super(LAS, self).__init__()

      self.listener = Listener(input_size, encoder_embed_size, encoder_hidden_size)
      self.attender = Attention(encoder_hidden_size*2, decoder_hidden_size, projection_size)
      self.speller = SpellerFancy(self.attender, len(VOCAB), decoder_embed_size, decoder_hidden_size)

  def forward(self, x, lx, y=None, teacher_forcing_ratio=0.9):
      # Encode speech features
      encoder_outputs, encoder_lens = self.listener(x, lx)
      # We want to compute keys and values ahead of the decoding step, as they are constant for all timesteps
      # Set keys and values using the encoder outputs
      key, value = self.attender.set_key_value(encoder_outputs)
      # Decode text with the speller using context from the attention
      raw_outputs, attention_plots = self.speller(encoder_lens, y=y, teacher_forcing_ratio=teacher_forcing_ratio)

      return raw_outputs, attention_plots


# Model Setup

In [None]:
# Baseline LAS has the following configuration:
# Encoder bLSTM/pbLSTM Hidden Dimension of 512 (256 per direction)
# Decoder Embedding Layer Dimension of 256
# Decoder Hidden Dimension of 512
# Attention Projection Size of 128
# Feel Free to Experiment with this

model = LAS(
    input_size=27,
    encoder_embed_size= 256, #384,
    encoder_hidden_size= 256, #384,
    decoder_embed_size= 256, #384, # 512
    decoder_hidden_size=512,
    projection_size= 128) #256)
    # Initialize your model
    # Read the paper and think about what dimensions should be used
    # You can experiment on these as well, but they are not required for the early submission
    # Remember that if you are using weight tying, some sizes need to be the same

model = model.to(DEVICE)
#model.apply(init_weights)
print(model)

summary(model, x.to(DEVICE), x_len, y.to(DEVICE))

LAS(
  (listener): Listener(
    (embedding): Sequential(
      (0): Conv1d(27, 256, kernel_size=(3,), stride=(1,), padding=(1,), bias=False)
      (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (lstm): LSTM(256, 256, batch_first=True, bidirectional=True)
    (pBLSTMs): Sequential(
      (0): pBLSTM(
        (blstm): LSTM(1024, 256, batch_first=True, bidirectional=True)
        (dropout): PLSTMLockedDropout()
      )
      (1): pBLSTM(
        (blstm): LSTM(1024, 256, batch_first=True, bidirectional=True)
        (dropout): PLSTMLockedDropout()
      )
      (2): pBLSTM(
        (blstm): LSTM(1024, 256, batch_first=True, bidirectional=True)
        (dropout): PLSTMLockedDropout()
      )
      (3): PLSTMLockedDropout()
    )
  )
  (attender): Attention(
    (VW): Linear(in_features=512, out_features=128, bias=True)
    (KW): Linear(in_features=512, out_features=128, bias=True)
    (leakyrelu): LeakyReLU(negative_slope=0.25)
  )
  (spelle

| Layer | Kernel Shape | Output Shape | Params | Mult-Adds |
|---|---|---|---|---|
| 0_listener.embedding.Conv1d_0 | [27, 256, 3] | [192, 256, 2936] | 20736 | 60880896 |
| 1_listener.embedding.BatchNorm1d_1 | [256] | [192, 256, 2936] | 512 | 256 |
| 2_listener.LSTM_lstm | - | [127902, 512] | 1052672 | 1048576 |
| 3_listener.pBLSTMs.0.PLSTMLockedDropout_dropout | - | [192, 1468, 1024] | | |
| 4_listener.pBLSTMs.0.LSTM_blstm | - | [63903, 512] | 2625536 | 2621440 |
| ... | ... | ... | ... | ... |
| 4013_speller.Linear_output_to_cdn | [256, 512] | [192, 512] | | 131072 |
| 4014_speller.Tanh_cdn_activate | - | [192, 512] | | |
| 4015_speller.Linear_cdn_to_char | [512, 256] | [192, 256] | | 131072 |
| 4016_speller.Tanh_char_activate | - | [192, 256] | | |
| 4017_speller.Linear_char_prob | [256, 31] | (output truncated) | (output truncated) | (output truncated) |


# Loss Function, Optimizers, Scheduler

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=config['lr']) #, weight_decay=5e-5) # Adam/AdamW usually work well for this HW
criterion = torch.nn.CrossEntropyLoss(reduction='none').to(DEVICE)
# Check how you would fill these values: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
scaler      = torch.cuda.amp.GradScaler()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.8, patience=3, verbose=True)

# Optional (but Recommended): Create a custom class for a Teacher Force Schedule
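
One possible shape for such a class is sketched below. This is a hypothetical step-decay schedule, not the schedule used in this notebook (the training loop adjusts `tf_rate` by hand); the class name and all default values are assumptions chosen for illustration.

```python
class TeacherForcingSchedule:
    """Step decay: lower the rate by `step` every `every` epochs, never below `floor`."""

    def __init__(self, start=1.0, step=0.1, every=20, floor=0.4):
        self.start = start
        self.step = step
        self.every = every
        self.floor = floor

    def rate(self, epochs_done):
        """Teacher forcing rate to use after `epochs_done` completed epochs."""
        drops = epochs_done // self.every
        # round() keeps float noise such as 0.7000000000000001 out of the logs
        return max(self.floor, round(self.start - self.step * drops, 10))


schedule = TeacherForcingSchedule()
for epochs_done in (0, 19, 20, 40, 200):
    print(epochs_done, schedule.rate(epochs_done))
```

Keeping the schedule in one object makes it easy to log the current rate and to restore it when resuming from a checkpoint, since the rate depends only on the epoch count.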

# Levenshtein Distance

In [None]:
# We have given you this utility function which takes a sequence of indices and converts them to a list of characters
def indices_to_chars(indices, vocab):
    tokens = []
    for i in indices: # This loops through all the indices
        if int(i) == SOS_TOKEN: # If SOS is encountered, don't add it to the final list
            continue
        elif int(i) == EOS_TOKEN: # If EOS is encountered, stop the decoding process
            break
        else:
            tokens.append(vocab[int(i)])
    return tokens

# To make your life easier, we have given the Levenshtein distance / edit distance calculation code
def calc_edit_distance(predictions, y, ly, vocab=VOCAB, print_example= False):

    dist                = 0
    batch_size, seq_len = predictions.shape

    for batch_idx in range(batch_size):

        y_sliced    = indices_to_chars(y[batch_idx,0:ly[batch_idx]], vocab)
        pred_sliced = indices_to_chars(predictions[batch_idx], vocab)

        # Strings - When you are using characters from the AudioDataset
        y_string    = ''.join(y_sliced)
        pred_string = ''.join(pred_sliced)

        dist        += Levenshtein.distance(pred_string, y_string)
        # Comment the above and uncomment below for toy dataset, as the toy dataset has a list of phonemes to compare
        #dist      += Levenshtein.distance(y_sliced, pred_sliced)

    if print_example:
        # Print y_sliced and pred_sliced if you are using the toy dataset
        print("Ground Truth : ", y_string)
        print("Prediction   : ", pred_string)

    dist/=batch_size
    return dist
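
For reference, the metric itself can be written as a short dynamic program. The sketch below is a plain-Python illustration of what an edit distance computes (unit cost for insertion, deletion and substitution); it is not the implementation used by the `Levenshtein` package, and the function name `levenshtein` is chosen only for this example.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b with unit-cost edits."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i items of a
        for j, item_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # delete item_a
                curr[j - 1] + 1,                    # insert item_b
                prev[j - 1] + (item_a != item_b),   # substitute or match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))       # 3
print(levenshtein(["AH", "B"], ["AH", "K"]))  # 1
```

Because it only compares items for equality, the same function works on strings of characters and on lists of phoneme labels.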

# Train and Validation functions


In [None]:

def train(model, dataloader, criterion, optimizer, teacher_forcing_rate, augment=False):
    model.train()
    batch_bar = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train')

    running_loss        = 0.0
    running_perplexity  = 0.0
    """
    augmentations  = torch.nn.Sequential(
          tat.FrequencyMasking(freq_mask_param=10),
          tat.TimeMasking(time_mask_param=10)
        )
    """

    for i, (x, y, lx, ly) in enumerate(dataloader):
        #if augment:
          #x = augmentations(x)

        optimizer.zero_grad()

        x, y, lx, ly = x.to(DEVICE), y.to(DEVICE), lx, ly

        with torch.cuda.amp.autocast():
          # Predictions are of shape (batch_size, timesteps, vocab_size).
          # Transcripts are of shape (batch_size, timesteps), i.e. batch_size sequences with timesteps tokens each.
          # So in total, you have batch_size*timesteps characters.
          # Similarly, in predictions, you have batch_size*timesteps probability distributions.
          raw_predictions, attention_plot = model(x, lx, y = y, teacher_forcing_ratio=teacher_forcing_rate)

          # How do you need to modify transcripts and predictions so that you can calculate the CrossEntropyLoss? Hint: Use Reshape/View and read the docs
          # Also we recommend you plot the attention weights, you should get convergence in around 10 epochs, if not, there could be something wrong with
          # your implementation
          loss = criterion(raw_predictions.view(-1, raw_predictions.shape[2]), y.view(-1))

          # Create a length mask so padded positions are excluded from the loss (criterion uses reduction='none').
          # The mask is (batch, timesteps) so that flatten() matches the row-major order of y.view(-1).
          len_mask = torch.zeros(y.shape[0], y.shape[1])
          len_mask = len_mask.to(DEVICE)
          for j in range(len(ly)):
            len_mask[j, :ly[j]] = 1

          m = len_mask.flatten()

          adj_loss = torch.sum(loss*m) / torch.sum(len_mask)
          perplexity = torch.exp(adj_loss).item() # exponential of the loss

          running_loss += adj_loss.item()
          running_perplexity += perplexity

        if DEVICE == 'cuda':
          scaler.scale(adj_loss).backward()

          # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
          scaler.unscale_(optimizer)
          torch.nn.utils.clip_grad_norm_(model.parameters(), 2)

          # Optional: Use torch.nn.utils.clip_grad_norm to clip gradients to prevent them from exploding, if necessary
          # If using with mixed precision, unscale the Optimizer First before doing gradient clipping
          scaler.step(optimizer)
          scaler.update()
        else:
          adj_loss.backward()
          torch.nn.utils.clip_grad_norm_(model.parameters(), 2)
          optimizer.step()


        batch_bar.set_postfix(
            loss="{:.04f}".format(running_loss/(i+1)),
            perplexity="{:.04f}".format(running_perplexity/(i+1)),
            lr="{:.04f}".format(float(optimizer.param_groups[0]['lr'])),
            tf_rate='{:.02f}'.format(teacher_forcing_rate))
        batch_bar.update()

        del x, y, lx, ly
        torch.cuda.empty_cache()

    running_loss /= len(dataloader)
    running_perplexity /= len(dataloader)
    batch_bar.close()

    return running_loss, running_perplexity, attention_plot
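
The loss bookkeeping in `train` boils down to a masked mean over per-token losses, followed by an exponential to get perplexity. The sketch below reproduces that arithmetic on nested lists; the helper name `masked_mean_loss` and the toy numbers are made up for illustration.

```python
import math


def masked_mean_loss(token_losses, lengths):
    """Average per-token loss over valid positions only.

    token_losses: nested list of shape (batch, timesteps)
    lengths: valid target length of each sequence in the batch
    """
    total, count = 0.0, 0
    for row, n in zip(token_losses, lengths):
        total += sum(row[:n])  # positions >= n are padding and are ignored
        count += n
    return total / count


# Toy per-token losses; the 9.0 entries sit on padded positions
losses = [[0.5, 1.5, 9.0, 9.0],
          [1.0, 1.0, 1.0, 9.0]]
lengths = [2, 3]

loss = masked_mean_loss(losses, lengths)
perplexity = math.exp(loss)
print(loss, perplexity)
```

Without the mask, the padded positions would dominate the average, which is why the criterion is created with `reduction='none'` and the reduction is done by hand.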

In [None]:
def validate(model, dataloader):

    model.eval()

    batch_bar = tqdm(total=len(dataloader), dynamic_ncols=True, position=0, leave=False, desc="Val")

    running_lev_dist = 0.0

    for i, (x, y, lx, ly) in enumerate(dataloader):

        x, y, lx, ly = x.to(DEVICE), y.to(DEVICE), lx, ly

        with torch.inference_mode():
            raw_predictions, attentions = model(x, lx, y = None)

        # Greedy Decoding
        greedy_predictions   =  torch.argmax(raw_predictions, dim=2) #greedy_predict(raw_predictions)
        # TODO: How do you get the most likely character from each distribution in the batch?

        # Calculate Levenshtein Distance
        running_lev_dist    += calc_edit_distance(greedy_predictions, y, ly, VOCAB, print_example = False) # You can use print_example = True for one specific index i in your batches if you want

        batch_bar.set_postfix(
            dist="{:.04f}".format(running_lev_dist/(i+1)))
        batch_bar.update()

        del x, y, lx, ly
        torch.cuda.empty_cache()

    batch_bar.close()
    running_lev_dist /= len(dataloader)

    return running_lev_dist
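
Greedy decoding picks the highest-scoring token at each timestep and stops at the first EOS. A minimal sketch on plain lists follows; the toy vocabulary, the scores and the helper name `greedy_decode` are hypothetical and only illustrate the argmax-then-truncate logic that `torch.argmax` and `indices_to_chars` perform together.

```python
def greedy_decode(step_scores, vocab, eos_index):
    """Pick the argmax token per timestep and stop at the first EOS.

    step_scores: list of per-timestep score lists (one score per vocab entry)
    """
    chars = []
    for scores in step_scores:
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if best == eos_index:
            break  # everything after EOS is ignored
        chars.append(vocab[best])
    return "".join(chars)


toy_vocab = ["<eos>", "a", "b"]  # hypothetical toy vocabulary
toy_scores = [
    [0.1, 2.0, 0.3],  # -> "a"
    [0.0, 0.2, 1.5],  # -> "b"
    [3.0, 0.1, 0.1],  # -> "<eos>", decoding stops here
    [0.0, 9.0, 0.0],  # never reached
]
print(greedy_decode(toy_scores, toy_vocab, eos_index=0))  # ab
```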

# Experiment

In [None]:
# Login to Wandb
# Initialize your Wandb Run Here
# Save your model architecture in a txt file, and save the file to Wandb
import wandb
wandb.login(key="<YOUR_WANDB_API_KEY>") # replace with your own API key; avoid committing real keys

[34m[1mwandb[0m: Currently logged in as: [33mwenxinz3[0m ([33msharonxin1207[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
LOG_RUN = True
if LOG_RUN:
  run = wandb.init(
      name = "run-hw4-complete", ## Wandb creates random run names if you skip this field
      reinit = True, ### Allows reinitializing runs when you re-run this cell
      # run_id = ### Insert specific run id here if you want to resume a previous run
      #resume = "allow", ### You need this to resume previous runs, but comment out reinit = True when using this
      project = "hw4p2-ablations", ### Project should be created in your wandb account
      config = config ### Wandb Config for your run
  )

In [None]:
def plot_attention(attention):
    # Function for plotting attention
    # You need to get a diagonal plot
    plt.clf()
    sns.heatmap(attention, cmap='GnBu')
    plt.show()

def save_model(model, optimizer, scheduler, metric, epoch, path):
    torch.save(
        {'model_state_dict'         : model.state_dict(),
         'optimizer_state_dict'     : optimizer.state_dict(),
         'scheduler_state_dict'     : scheduler.state_dict(),
         metric[0]                  : metric[1],
         'epoch'                    : epoch},
         path
    )

def load_model(path, model, metric= 'valid_acc', optimizer= None, scheduler= None):

    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])

    if optimizer is not None:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    if scheduler is not None:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

    epoch   = checkpoint['epoch']
    metric  = checkpoint[metric]

    return [model, optimizer, scheduler, epoch, metric]

In [None]:
# model path for saving
epoch_model_path = "hw4_epoch.pt" #TODO set the model path( Optional, you can just store best one. Make sure to make the changes below )
best_model_path = "hw4_best.pt" #TODO set best model path
# "best_fancy_speller.pt" >>> best 11.6
#"best_lstm_query_model_adjtf.pt" #TODO set best model path
reload_path =   "hw4_models/1+3+3+2_best.pt"#"best_fancy_speller.pt"

In [None]:
RELOAD = False
if RELOAD:
  reload_items =  load_model(reload_path, model, 'valid_dist', optimizer, scheduler)
  model = reload_items[0]
  optimizer = reload_items[1]
  scheduler = reload_items[2]
  print(reload_items[3], reload_items[4])
  print(optimizer.param_groups[0]['lr'])

113 7.5995138888888905
0.0001099511627776001


In [None]:
torch.cuda.empty_cache()
gc.collect()
best_lev_dist = float("inf") # reload_items[4]
tf_rate = 1
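
`best_lev_dist` tracks the lowest validation Levenshtein distance seen so far, so a lower value is better. As a rough illustration of the metric, here is a minimal pure-Python sketch. The names `levenshtein` and `mean_distance` are hypothetical, and the notebook's own `validate` function (defined earlier) may compute the distance differently, for example with a library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def mean_distance(predictions, references):
    """Average character-level distance over paired transcripts."""
    pairs = list(zip(predictions, references))
    return sum(levenshtein(p, r) for p, r in pairs) / len(pairs)


print(levenshtein("kitten", "sitting"))
print(mean_distance(["abc", "hello"], ["abd", "hello"]))
```

Averaging over utterances gives a single number per epoch that can be compared against `best_lev_dist`.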


## Training for 100 Epochs

In [None]:

for epoch in range(config['epochs']):

    print("\nEpoch: {}/{}".format(epoch+1, config['epochs']))

    # Call train and validate, get attention weights from training
    curr_lr = float(optimizer.param_groups[0]['lr'])

    train_loss, train_perplexity, attention_plot    = train(model, train_loader, criterion, optimizer, tf_rate) #TODO
    valid_dist              = validate(model, dev_loader) #TODO
    if epoch > 14:
      scheduler.step(valid_dist)
    # Print your metrics
    print("\tTrain Loss {:.04f}\t Train Perplexity  {:.04f}\t Learning Rate {:.07f}".format(
        train_loss, train_perplexity, curr_lr))
    print("\tVal Dist {:.04f}%\t".format(valid_dist))

    # Plot Attention for a single item in the batch
    plot_attention(attention_plot[0].cpu().detach().numpy())
    # Teacher-forcing schedule: reduce tf_rate by 0.1 after epochs 20, 40 and 60,
    # and once more after epoch 75, as long as tf_rate is still above 0.4
    if ((epoch+1)%20 == 0) and tf_rate > 0.4 and epoch < 60:
      tf_rate -= 0.1
      print("Reducing TF rate by 0.1")
    elif ((epoch+1) == 75) and tf_rate > 0.4:
      tf_rate -= 0.1
      print("Reducing TF rate by 0.1")
    """
    elif epoch+1 == 50 and tf_rate > 0.1:
      tf_rate -= 0.05
      print("Reducing TF rate by 0.05")
    elif epoch+1 == 60 and tf_rate > 0.1:
      tf_rate -= 0.05
      print("Reducing TF rate by 0.05")
    elif epoch+1 == 70 and tf_rate > 0.1:
      tf_rate -= 0.05
      print("Reducing TF rate by 0.05")"""

    wandb.log({
        'train_loss': train_loss,
        'train_perplex': train_perplexity,
        'valid_dist': valid_dist,
        'tf_rate': tf_rate,
        'lr'        : curr_lr
    })
    # Write the checkpoint first so that wandb.save has a file to upload
    save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, epoch_model_path)
    print("Saved epoch model")
    wandb.save(epoch_model_path)

    if valid_dist <= best_lev_dist:
        best_lev_dist = valid_dist
        save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, best_model_path)
        #wandb.save(best_model_path)
        print("Saved best model")

    # You may find it interesting to explore Wandb Artifacts to version your models
    torch.cuda.empty_cache()
    gc.collect()

run.finish()
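
The teacher-forcing schedule inside the training loop can be hard to read off the conditions. The sketch below replays the same two conditions without training anything, assuming 100 epochs (as in the section heading) and a starting rate of 1. The function name `tf_rate_schedule` and its parameters are hypothetical and not part of the notebook.

```python
def tf_rate_schedule(num_epochs, start=1.0, step=0.1, floor=0.4):
    """Return the teacher-forcing rate used in each epoch (index 0 = epoch 1)."""
    tf_rate = start
    rates = []
    for epoch in range(num_epochs):
        rates.append(tf_rate)  # rate passed to train() in this epoch
        # Same conditions as the training loop: the update applies to later epochs
        if (epoch + 1) % 20 == 0 and tf_rate > floor and epoch < 60:
            tf_rate -= step
        elif (epoch + 1) == 75 and tf_rate > floor:
            tf_rate -= step
    return rates


rates = tf_rate_schedule(100)
for e in (1, 21, 41, 61, 76, 100):
    print("epoch", e, "tf_rate", round(rates[e - 1], 2))
```

Under these assumptions the rate steps down after epochs 20, 40, 60 and 75, and the run ends at roughly 0.6, which is also the rate carried into finetuning.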


## Finetuning for 50 epochs

In [None]:
if LOG_RUN:
  run = wandb.init(
      name = "run-hw4-complete-finetune", ## Wandb creates random run names if you skip this field
      reinit = True, ### Allows reinitalizing runs when you re-run this cell
      # run_id = ### Insert specific run id here if you want to resume a previous run
      #resume = "allow", ### You need this to resume previous runs, but comment out reinit = True when using this
      project = "hw4p2-ablations", ### Project should be created in your wandb account
      config = config ### Wandb Config for your run
  )

In [None]:
epoch_model_path = "hw4_ft_epoch.pt" #TODO set the model path( Optional, you can just store best one. Make sure to make the changes below )
best_model_path = "hw4_ft_best.pt" #TODO set best model path

In [None]:
# Reset LR and scheduler for finetune
optimizer.param_groups[0]['lr'] = 0.0005
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.8, patience=3, verbose=True)


In [None]:

for epoch in range(config['finetune_epochs']):

    print("\nEpoch: {}/{}".format(epoch+1, config['finetune_epochs']))

    # Call train and validate, get attention weights from training
    curr_lr = float(optimizer.param_groups[0]['lr'])

    train_loss, train_perplexity, attention_plot    = train(model, train_loader, criterion, optimizer, tf_rate) #TODO
    valid_dist              = validate(model, dev_loader) #TODO
    if epoch > 14:
      scheduler.step(valid_dist)
    # Print your metrics
    print("\tTrain Loss {:.04f}\t Train Perplexity  {:.04f}\t Learning Rate {:.07f}".format(
        train_loss, train_perplexity, curr_lr))
    print("\tVal Dist {:.04f}%\t".format(valid_dist))

    # Plot Attention for a single item in the batch
    plot_attention(attention_plot[0].cpu().detach().numpy())
    # The teacher-forcing rate is kept fixed during finetuning

    wandb.log({
        'train_loss': train_loss,
        'train_perplex': train_perplexity,
        'valid_dist': valid_dist,
        'tf_rate': tf_rate,
        'lr'        : curr_lr
    })
    # Write the checkpoint first so that wandb.save has a file to upload
    save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, epoch_model_path)
    print("Saved epoch model")
    wandb.save(epoch_model_path)

    if valid_dist <= best_lev_dist:
        best_lev_dist = valid_dist
        save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, best_model_path)
        #wandb.save(best_model_path)
        print("Saved best model")

    # You may find it interesting to explore Wandb Artifacts to version your models
    torch.cuda.empty_cache()
    gc.collect()

run.finish()


# Testing

In [None]:
# Optional: Load your best model Checkpoint here

# TODO: Create a testing function similar to validation
# TODO: Create a file with all predictions
# TODO: Submit to Kaggle

def prediction_decode(to_decode):
    # to_decode: (batch_size, seq_len) token indices, i.e. the argmax over the vocab dimension
    decode = []
    for seq in to_decode:
        seq_str = ""
        for char in seq:
            if char == EOS_TOKEN:
              break
            elif char == SOS_TOKEN:
              pass
            else:
              seq_str += VOCAB[char]
        decode.append(seq_str)
    return decode


def test(model, dataloader):
    model.eval()

    #batch_bar = tqdm(total=len(dataloader), dynamic_ncols=True, position=0, leave=False, desc="Val")

    running_lev_dist = 0.0

    pred_list = []

    for (x, lx) in tqdm(dataloader):
        x, lx = x.to(DEVICE), lx
        with torch.inference_mode():
            raw_predictions, attentions = model(x, lx, y = None)
        greedy_predictions = prediction_decode(torch.argmax(raw_predictions, dim=2))
        pred_list.extend(greedy_predictions)

        del x, lx
        torch.cuda.empty_cache()
    return pred_list
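
To make the decoding rule concrete, here is a small standalone sketch of the same logic on plain Python lists. The toy `VOCAB`, `SOS_TOKEN` and `EOS_TOKEN` values are made up for illustration; the notebook's real ones are defined earlier. `decode_indices` is a hypothetical name. Decoding skips the start token and stops at the first end token, so anything predicted after it is discarded.

```python
# Hypothetical toy vocabulary, for illustration only
VOCAB = ["<sos>", "<eos>", "a", "b", "c", " "]
SOS_TOKEN = VOCAB.index("<sos>")
EOS_TOKEN = VOCAB.index("<eos>")


def decode_indices(batch):
    """Convert a batch of token-index sequences to strings."""
    decoded = []
    for seq in batch:
        chars = []
        for idx in seq:
            if idx == EOS_TOKEN:   # stop at the first end-of-sequence token
                break
            if idx == SOS_TOKEN:   # skip start-of-sequence tokens
                continue
            chars.append(VOCAB[idx])
        decoded.append("".join(chars))
    return decoded


batch = [[0, 2, 3, 1, 4, 4], [4, 5, 2, 1, 1, 1]]
print(decode_indices(batch))  # ['ab', 'c a']
```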


In [None]:
test_prediction = test(model, test_loader)
with open("submission.csv", 'w') as fh:
  fh.write('index,label\n')
  for i in range(len(test_prediction)):
    fh.write(str(i)+ ',' + test_prediction[i] + "\n")

df = pd.read_csv("submission.csv")
df.to_csv('submission.csv', index = False)



100%|██████████| 14/14 [00:41<00:00,  2.99s/it]
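
The submission file above is written by string concatenation, which works as long as no prediction contains a comma or a quote. As an alternative sketch, not the notebook's method, Python's standard `csv` module applies quoting automatically. The helper name `write_submission` is hypothetical; the `index,label` header matches the one used above.

```python
import csv
import io


def write_submission(predictions, fh):
    """Write predictions as 'index,label' rows, quoting labels when needed."""
    writer = csv.writer(fh, lineterminator="\n")
    writer.writerow(["index", "label"])
    for i, label in enumerate(predictions):
        writer.writerow([i, label])


# Demonstration on an in-memory buffer instead of a real file
buf = io.StringIO()
write_submission(["hello world", "yes, it works"], buf)
print(buf.getvalue())
```

With a real file, opening it as `open("submission.csv", "w", newline="")` and passing the handle to the helper would produce the same layout.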


In [None]:
if SUBMIT:
  !kaggle competitions submit -c 11-785-s23-hw4p2 -f submission.csv -m "I made it!"


100% 288k/288k [00:00<00:00, 601kB/s]
Successfully submitted to Attention-Based Speech Recognition