# HW1: Frame-Level Speech Recognition

In this homework, you will be working with MFCC data consisting of 27 features at each time step/frame. Your model should be able to recognize the phoneme that occurred in that frame.

# Instructions to Run the Code

To run the final model corresponding to the highest Kaggle submission, first make sure the **Global Variables** (next section) are set as intended, then go to **Runtime** and click **Restart and run all**. This will:
1. pip install, import and download all required packages and data
2. run all functions and classes for loading data and creating model/optimizer/scheduler
3. train the model based on the parameters saved in the config_king variable defined under **Parameter Configuration**.

**Note** that by default, training loads the combined train-clean-360 and train-clean-100 data. To train on only one of them, set **ONLY360** or **ONLY100** to True in the **Global Variables** section. If both are set to True, train-clean-360 takes precedence.

**Note** that the combined 460 training data was run on GCP, which allowed 102.5 GB of RAM.

By default, the notebook would run the trained model on the test dataset and save the predicted result in csv file, but it would not make the submission to Kaggle. To run the notebook with kaggle submission, set **SUBMIT_KAGGLE** to True.

# Global Variables

In [None]:
ONLY360 = False
ONLY100 = False
CONNECT_DRIVE = False
SUBMIT_KAGGLE = False
MODEL_SAVED_NAME = "460Attempt.pt" # please change the name based on run schemes

# README

## Best Score Hyperparameters:
* **context** = 30
* **init learning rate** = 1e-3
* **layers** = 2048, 2048, 2048, 2048, 1280, 1024
* **activation** = SiLU
* **dropout** = 0.35 (on 2048 layer), 0.15 (on 1024 layer)
* **weight decay** = 0.005
* **batch size** = 16384
* **epoch** = 40
* **batchnorm**: on every layer
* **weight init**: kaiming normal on linear layer, constant on batchnorm layer
* **scheduler**: ReduceLROnPlateau
  * patience = 3
  * factor = 0.5
  * threshold = 0.0025
  * mode = max
* **optimizer**: AdamW




## Data Loading Scheme:
The highest Kaggle score was obtained by loading train-clean-360 and train-clean-100, combining them into one dataset, and training on the result. The MFCC arrays are kept as float32 to save memory. The code is in **AudioDatasetBoth**, where the MFCCs from 360 and 100 are loaded one file at a time, 360 first and then 100, and collected in a list before the final concatenation.
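To make the memory argument concrete, here is a back-of-the-envelope sketch of the size of the concatenated MFCC array at two precisions. The frame count is the train sample count printed by the dataloader cell later in this notebook, and `mfcc_gib` is a hypothetical helper written only for this illustration.

```python
# Rough memory estimate for the concatenated MFCC array (frames x 27 features).
# NUM_FRAMES is the train sample count printed by the dataloader cell;
# mfcc_gib is a hypothetical helper used only for this illustration.
NUM_FRAMES = 36_091_157
NUM_FEATURES = 27

def mfcc_gib(num_frames, num_features, bytes_per_value):
    """Return the size in GiB of a dense (frames x features) array."""
    return num_frames * num_features * bytes_per_value / 2**30

size_f32 = mfcc_gib(NUM_FRAMES, NUM_FEATURES, 4)  # float32: 4 bytes per value
size_f64 = mfcc_gib(NUM_FRAMES, NUM_FEATURES, 8)  # float64: 8 bytes per value
print(f"float32: {size_f32:.2f} GiB, float64: {size_f64:.2f} GiB")
```

The estimate ignores the context padding rows, which are negligible at this scale. Keep in mind that `ndarray.astype` returns a new array rather than converting in place, so the result has to be assigned back for the saving to take effect.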


The regular loader for a single partition (360, 100, or validation) is the **AudioDataset** class, where all MFCC files are read in sorted order and collected in a list for the final concatenation. No special memory handling is used.


## Architectures:
The highest Kaggle score is reached using a **6 layer Pyramid** architecture: **2048-2048-2048-2048-1280-1024**. The 1280 comes from 1.25 x 1024, and its purpose is to make the best use of the 20M allowed parameters. The model has 19.9M parameters, which is within the limit. Each linear layer (except for the output layer) is followed by a 1d batchnorm layer and then a SiLU activation. All layers with 2048 nodes have a dropout layer after the activation with a dropout rate of 0.35. After the last hidden layer (1024 nodes), before the output layer, a dropout layer with a 0.15 dropout rate is added.
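The parameter budget can be checked by hand. The sketch below counts trainable parameters for a stack of Linear + BatchNorm1d blocks followed by an output Linear layer, assuming a context of 30, 27 features per frame, and 40 output phonemes as used elsewhere in this notebook. `count_params` is a hypothetical helper, not part of the notebook's code.

```python
# Trainable parameter count for an MLP built from Linear -> BatchNorm1d blocks
# plus a final Linear output layer. count_params is a hypothetical helper.
def count_params(input_size, hidden_sizes, output_size):
    sizes = [input_size] + list(hidden_sizes)
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out  # Linear weight + bias
        total += 2 * fan_out                 # BatchNorm1d scale + shift
    total += sizes[-1] * output_size + output_size  # output Linear layer
    return total

input_size = (2 * 30 + 1) * 27  # 1647 flattened features per sample
total = count_params(input_size, [2048, 2048, 2048, 2048, 1280, 1024], 40)
print(f"{total:,} parameters")  # 19,960,616 parameters
```

Activations and dropout layers add no parameters, and the BatchNorm running statistics are buffers rather than trainable parameters, so they are left out of the count.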


Other architectures are tested as well, but with less ideal performance:
1. 8 layer Pyramid architecture, 2048 x 2 + 1024 x 6. Each layer has 1d batchnorm and uses SiLU activation, and on every other layer a dropout layer with rate = 0.35 is added after the activation.
2. 8 layer Pyramid architecture, 2048 x 2 + 1024 x 4 + 512 x 2. Everything else is the same.
3. 8 layer cylinder architecture, 1024 x 8. Everything else the same.
4. 6 layer Pyramid architecture, 2048 x 4 + 1024 x 2. Batchnorm and SiLU are the same as above, and 5 out of 6 layers have the dropout layer. For a layer with 2048 nodes, the dropout rate is 0.35. For the last layer before output, the dropout rate is 0.15.
5. 5 layer cylinder architecture, 2048 x 5. Batchnorm and SiLU are the same as above, and every layer has a dropout layer using 0.35 rate.


## Epochs
I trained the model for 40 epochs to reach my highest Kaggle submission. The model was close to converging at epoch 40, and I believe it could hit the high cutoff with 10 more epochs. However, due to time and compute-unit restrictions, I stopped at 40 epochs.


## Hyperparameters
* **Context Size**: 30. I increased it from 20 to 25 to 30, and 30 gave the best performance.

* **Batch Size**: 16384. On train 100, I used 1024 for the various ablation tests. After switching to 360, I tested 4096 and 8192 to improve the runtime, and the accuracy with 8192 was very satisfying. When I moved to the larger combined 460 training set, I doubled the batch size to 16384.

* **Weight Decay**: 0.005. In my ablation study on the toy dataset (see excel), 0.005 gave better performance, although for the first 5 epochs it was not as good as using no weight decay at all. On train 360, weight decay gave better performance in later epochs (10+) than no weight decay, so I kept 0.005 in my final optimizer setting.

* **Dropout**: 0.35 on layers with 2048 nodes, and 0.15 on the last layer with 1024 nodes before the output layer. I tested 0.25, 0.3 and 0.35 on different layers, and 0.35 on the 2048-node layers gave the best result. However, applying 0.35 dropout on all 6 layers left the model very underfit. I therefore lowered the dropout on the last and smallest layer and removed the dropout layer from the second-to-last layer, which gave the best performance. Overall, the model is still underfit (train accuracy is around 2% below validation accuracy).




## Other Experiments
### Activation Function
I selected the SiLU activation function (in theory it seems to work better with batchnorm), but I also tested LeakyReLU and GELU. In addition, I tried using different activation functions on different layers, and the result was not as good as simply using SiLU.
### Batchnorm
I used batchnorm on every layer, but also tested batchnorm on alternating layers only. Using it on every layer performs better.
### Weight Initialization
I used Kaiming normal initialization for the linear layer weights, and constant initialization for the 1d batchnorm layers (weight = 1, bias = 0). I also tried Kaiming uniform initialization, and the performance was slightly worse.
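For reference, the two Kaiming variants differ only in the shape of the distribution, not in its variance. The sketch below computes the scale each one would use for the first linear layer, assuming fan-in mode and a gain of sqrt(2), which to my understanding are what `torch.nn.init.kaiming_normal_` and `kaiming_uniform_` use when called with only the weight tensor. The helper names are hypothetical.

```python
import math

# Scale of Kaiming (He) initialization for a Linear layer.
# Assumes fan-in mode and gain = sqrt(2); both helpers are hypothetical
# and only reproduce the textbook formulas.
def kaiming_normal_std(fan_in, gain=math.sqrt(2.0)):
    # Weights are drawn from N(0, std^2)
    return gain / math.sqrt(fan_in)

def kaiming_uniform_bound(fan_in, gain=math.sqrt(2.0)):
    # Weights are drawn from U(-bound, bound)
    return gain * math.sqrt(3.0 / fan_in)

fan_in = (2 * 30 + 1) * 27  # 1647 inputs to the first Linear layer
std = kaiming_normal_std(fan_in)
bound = kaiming_uniform_bound(fan_in)
print(f"normal std = {std:.4f}, uniform bound = {bound:.4f}")
```

A uniform distribution on (-bound, bound) has standard deviation bound / sqrt(3), which equals the normal variant's std, so any performance difference between the two comes from the distribution shape rather than the weight scale.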
### Scheduler and Learning Rate
I selected ReduceLROnPlateau with patience 3, factor 0.5, threshold 0.0025 and max mode. I started with patience = 2 and threshold = 0.01, but the learning rate started to decrease before the model converged. I then used the default threshold of 1e-4, and it took many epochs (~50) before the learning rate decreased. Hence I picked a middle point: a threshold of 0.0025 with the patience increased by 1. With these settings the learning rate was first halved at around epoch 18. I used max mode so that the learning rate is adjusted based on validation accuracy. Overall, the scheduler helps improve the validation accuracy, but since the threshold is fairly large, the learning rate still decreases before the model fully converges at the initial learning rate. This could be why my model had not yet reached the high cutoff at epoch 40.
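The interaction between patience and threshold is easier to see in a small simulation. The following is a simplified, framework-free sketch of a reduce-on-plateau rule in max mode with a relative threshold. `PlateauLR` is a hypothetical class written for illustration; it leaves out cooldown, minimum learning rate, and the other options of the real PyTorch scheduler, and the validation accuracies are made-up numbers.

```python
# Simplified sketch of a reduce-on-plateau rule in "max" mode with a
# relative threshold. PlateauLR is hypothetical and omits cooldown,
# min_lr and other options of the real scheduler.
class PlateauLR:
    def __init__(self, lr, factor=0.5, patience=3, threshold=0.0025):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        # An epoch counts as an improvement only if it beats the best
        # value so far by more than the relative threshold.
        if metric > self.best * (1.0 + self.threshold):
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Cut the learning rate once more than `patience` epochs in a row
        # have failed to improve.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

sched = PlateauLR(lr=1e-3)
# Made-up validation accuracies, for illustration only
val_accs = [0.80, 0.82, 0.821, 0.8215, 0.8218, 0.8219]
history = [sched.step(acc) for acc in val_accs]
print(history)
```

In this toy run the accuracy keeps creeping upward after the second epoch, but never by more than 0.25% relative to the best value, so four consecutive epochs count as non-improving and the learning rate is halved on the last one. A larger threshold therefore triggers decay earlier while the model is still improving slowly.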


### Optimizer
I only switched between AdamW and Adam optimizer, and found the former better.




# Libraries

In [None]:
!pip install torchsummaryX wandb --quiet


In [None]:
import torch
import numpy as np
from torchsummaryX import summary
import sklearn
import gc
import zipfile
import pandas as pd
from tqdm.auto import tqdm
import os
import datetime
import wandb
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

Device:  cuda


In [None]:
### If you are using colab, you can import google drive to save model checkpoints in a folder
if CONNECT_DRIVE:
  from google.colab import drive
  drive.mount('/content/gdrive')
  # !ls /content/drive/MyDrive

In [None]:
### PHONEME LIST
PHONEMES = [
            '[SIL]',   'AA',    'AE',    'AH',    'AO',    'AW',    'AY',
            'B',     'CH',    'D',     'DH',    'EH',    'ER',    'EY',
            'F',     'G',     'HH',    'IH',    'IY',    'JH',    'K',
            'L',     'M',     'N',     'NG',    'OW',    'OY',    'P',
            'R',     'S',     'SH',    'T',     'TH',    'UH',    'UW',
            'V',     'W',     'Y',     'Z',     'ZH',    '[SOS]', '[EOS]']

# Kaggle

This section contains code that installs Kaggle's API and creates kaggle.json with your username and API key. Make sure to enter those in the given code so that you can download data from the competition successfully.

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir /root/.kaggle

with open("/root/.kaggle/kaggle.json", "w+") as f:
    f.write('{"username":"sharonxin1207","key":"xxx"}')
    # Put your kaggle username & key here

!chmod 600 /root/.kaggle/kaggle.json

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kaggle==1.5.8
  Downloading kaggle-1.5.8.tar.gz (59 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.8-py3-none-any.whl size=73274 sha256=7f783382b6f0d1420350aa39431cc6dcbd6943c67b657bc47d6dfecdfb59ed20
  Stored in directory: /root/.cache/pip/wheels/f3/67/7b/a6d668747974998471d29b230e7221dd01330ac34faebe4af4
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.12
    Uninstalling kaggle-1.5.12:
      Successfully uninstalled kaggle-1.5.12
Successfully installed kaggle-1.5.8


In [None]:
# commands to download data from kaggle

!kaggle competitions download -c 11-785-s23-hw1p2
!mkdir '/content/data'

!unzip -qo '11-785-s23-hw1p2.zip' -d '/content/data'
#!cat ../root/.kaggle/kaggle.json
#!cp -r "/content/data" '/content/drive/MyDrive'

Downloading 11-785-s23-hw1p2.zip to /content
100% 16.0G/16.0G [12:33<00:00, 29.9MB/s]
100% 16.0G/16.0G [12:33<00:00, 22.8MB/s]


For QUIZ MCQ Exploration

In [None]:
from collections import Counter
def transcript_explore():
  # Code for MCQ QUIZ
  transcript_dir = "/content/drive/MyDrive/DLdata/11-785-s23-hw1p2/dev-clean/transcript"
  transcript_names  = sorted(os.listdir(transcript_dir))
  transcript_list = []
  for i in range(len(transcript_names)):
    try:
      transcript  = np.load(transcript_dir+"/"+transcript_names[i])
      transcript_list.append(transcript)
    except ValueError:
      # report files that fail to load
      print(transcript_names[i])
  # concatenate once, after all transcripts are loaded
  transcripts = np.concatenate(transcript_list)

  sil_count = np.count_nonzero(transcripts == '[SIL]')
  counter = Counter(transcripts)
  print(sil_count)
  print(counter.most_common()[-1][0])

# transcript_explore()

# Dataset

This section covers the dataset/dataloader class for speech data. You will have to spend time writing code to create this class successfully. We have given you a lot of comments guiding you on what code to write at each stage, from top to bottom of the class. Please try and take your time figuring this out, as it will immensely help in creating dataset/dataloader classes for future homeworks.

Before running the following cells, please take some time to analyse the structure of the data. Try loading a single MFCC and its transcript, then print out the shapes and the values. Do the transcripts look like phonemes?
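As a warm-up for the dataset class, the toy sketch below shows how zero padding and slicing produce a context window around a frame. It uses plain Python lists and 2 features per frame instead of 27 so the output stays readable; `window` is a hypothetical helper that mirrors the indexing idea, not the notebook's implementation.

```python
# Toy illustration of context windows with zero padding, using plain lists.
# Each "frame" has 2 features instead of 27 to keep the output readable.
context = 1
frames = [[1, 1], [2, 2], [3, 3]]   # 3 frames x 2 features
pad = [[0, 0]] * context
padded = pad + frames + pad         # (3 + 2*context) frames

def window(ind):
    # Frame `ind` of the original data sits at padded[ind + context], so this
    # slice is centered on it with `context` frames on each side.
    rows = padded[ind:ind + 2 * context + 1]
    return [value for row in rows for value in row]  # flatten to 1-D

print(window(0))  # [0, 0, 1, 1, 2, 2]
print(window(2))  # [2, 2, 3, 3, 0, 0]
```

Because padding is applied once to the whole array, only the very first and last frames see zeros. If several utterances are concatenated before padding, windows near an utterance boundary will include frames from the neighboring utterance.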

In [None]:
# Dataset class to load train and validation data
class AudioDataset(torch.utils.data.Dataset):

    def __init__(self, root, phonemes = PHONEMES, context=0, partition= "train-clean-100"): # Feel free to add more arguments

        self.context    = context
        self.phonemes   = phonemes

        # TODO: MFCC directory - use partition to access train/dev directories from kaggle data using root
        self.mfcc_dir       = "{}/{}/mfcc".format(root, partition)
        # TODO: Transcripts directory - use partition to access train/dev directories from kaggle data using root
        self.transcript_dir = "{}/{}/transcript".format(root, partition)


        # TODO: List files in self.mfcc_dir using os.listdir in sorted order
        mfcc_names          = sorted(os.listdir(self.mfcc_dir))
        # TODO: List files in self.transcript_dir using os.listdir in sorted order
        transcript_names    = sorted(os.listdir(self.transcript_dir))

        # Making sure that we have the same no. of mfcc and transcripts
        assert len(mfcc_names) == len(transcript_names)

        self.mfccs, self.transcripts = [], []

        # TODO: Iterate through mfccs and transcripts
        for i in range(len(mfcc_names)):
        #   Load a single mfcc
            mfcc        = np.load(self.mfcc_dir+"/"+mfcc_names[i])
        #   Do Cepstral Normalization of mfcc (explained in writeup)
            mfcc        = (mfcc - mfcc.mean(axis=0))/mfcc.std(axis=0)
        #   Load the corresponding transcript
            transcript  = np.load(self.transcript_dir+"/"+transcript_names[i])
            # Remove [SOS] and [EOS] from the transcript
            # (Is there an efficient way to do this without traversing through the transcript?)
            # Note that SOS will always be in the starting and EOS at end, as the name suggests.
            start_idx = np.where(transcript=='[SOS]')[0][-1]
            end_idx = np.where(transcript=='[EOS]')[0][0]
            transcript = transcript[start_idx+1:end_idx]
        #   Append each mfcc to self.mfcc, transcript to self.transcript
            self.mfccs.append(mfcc)
            self.transcripts.append(transcript)

        # TODO: Concatenate all mfccs in self.mfccs such that
        # the final shape is T x 27 (Where T = T1 + T2 + ...)
        self.mfccs          = np.concatenate(self.mfccs)
        #self.mfccs.astype(np.float32)

        # TODO: Concatenate all transcripts in self.transcripts such that
        # the final shape is (T,) meaning, each time step has one phoneme output
        self.transcripts    = np.concatenate(self.transcripts)
        #self.transcripts.astype(np.uint8)
        # Hint: Use numpy to concatenate

        # Length of the dataset is now the length of concatenated mfccs/transcripts
        self.length = len(self.mfccs)

        # Take some time to think about what we have done.
        # self.mfcc is an array of the format (Frames x Features).
        # Our goal is to recognize phonemes of each frame
        # From hw0, you will be knowing what context is.
        # We can introduce context by padding zeros on top and bottom of self.mfcc
        self.mfccs = np.pad(self.mfccs, ((self.context, self.context), (0, 0)), 'constant', constant_values=(0, 0))

        # The available phonemes in the transcript are of string data type
        # But the neural network cannot predict strings as such.
        # Hence, we map these phonemes to integers

        # TODO: Map the phonemes to their corresponding list indexes in self.phonemes
        self.transcripts = np.array([phonemes.index(p) for p in self.transcripts]).reshape(self.transcripts.shape)
        # Now, if an element in self.transcript is 0, it means that it is 'SIL' (as per the above example)

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # TODO: Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind:(ind+2*self.context+1)]
        # After slicing, you get an array of shape 2*context+1 x 27. But our MLP needs 1d data and not 2d.
        frames = frames.flatten() # TODO: Flatten to get 1d data

        frames      = torch.FloatTensor(frames) # Convert to tensors
        phonemes    = torch.tensor(self.transcripts[ind])

        return frames, phonemes

In [None]:
class AudioTestDataset(torch.utils.data.Dataset):

    # TODO: Create a test dataset class similar to the previous class but you dont have transcripts for this
    # Imp: Read the mfccs in sorted order, do NOT shuffle the data here or in your dataloader.
    def __init__(self, root, phonemes = PHONEMES, context=0, partition= "test-clean-100"): # Feel free to add more arguments

        self.context    = context
        self.phonemes   = phonemes

        # TODO: MFCC directory - use partition to access the test directory from kaggle data using root
        self.mfcc_dir       = "{}/{}/mfcc".format(root, partition)

        # TODO: List files in self.mfcc_dir using os.listdir in sorted order
        mfcc_names          = sorted(os.listdir(self.mfcc_dir))

        self.mfccs = []

        # TODO: Iterate through mfccs and transcripts
        for i in range(len(mfcc_names)):
        #   Load a single mfcc
            mfcc        = np.load(self.mfcc_dir+"/"+mfcc_names[i])
        #   Do Cepstral Normalization of mfcc (explained in writeup)

            mfcc        = (mfcc - mfcc.mean(axis=0))/mfcc.std(axis=0)
            self.mfccs.append(mfcc)

        # NOTE:
        # Each mfcc is of shape T1 x 27, T2 x 27, ...
        # (In the train/dev data, each transcript is of shape (T1+2,), (T2+2,), ... before removing [SOS] and [EOS])

        # TODO: Concatenate all mfccs in self.mfccs such that
        # the final shape is T x 27 (Where T = T1 + T2 + ...)
        self.mfccs          = np.concatenate(self.mfccs)

        # Length of the dataset is now the length of concatenated mfccs/transcripts
        self.length = len(self.mfccs)

        # Take some time to think about what we have done.
        # self.mfcc is an array of the format (Frames x Features).
        # Our goal is to recognize phonemes of each frame
        # From hw0, you will be knowing what context is.
        # We can introduce context by padding zeros on top and bottom of self.mfcc
        self.mfccs = np.pad(self.mfccs, ((self.context, self.context), (0, 0)), 'constant', constant_values=(0, 0))

        # The available phonemes in the transcript are of string data type
        # But the neural network cannot predict strings as such.
        # Hence, we map these phonemes to integers

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # TODO: Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind:(ind+2*self.context+1)]
        # After slicing, you get an array of shape 2*context+1 x 27. But our MLP needs 1d data and not 2d.
        frames = frames.flatten() # TODO: Flatten to get 1d data

        frames      = torch.FloatTensor(frames) # Convert to tensors

        return frames

In [None]:
# Dataset class to load train 100 and 360 data
class AudioDatasetBoth(torch.utils.data.Dataset):

    def __init__(self, root, phonemes = PHONEMES, context=0): # Feel free to add more arguments

        self.context    = context
        self.phonemes   = phonemes

        # TODO: MFCC directory - use partition to access train/dev directories from kaggle data using root
        self.mfcc_dir       = "{}/{}/mfcc".format(root, "train-clean-360")
        # TODO: Transcripts directory - use partition to access train/dev directories from kaggle data using root
        self.transcript_dir = "{}/{}/transcript".format(root, "train-clean-360")

        # Same directories for train-clean-100
        self.mfcc_dir100       = "{}/{}/mfcc".format(root, "train-clean-100")
        self.transcript_dir100 = "{}/{}/transcript".format(root, "train-clean-100")


        # TODO: List files in self.mfcc_dir100 using os.listdir in sorted order
        mfcc_names100          = sorted(os.listdir(self.mfcc_dir100))
        # TODO: List files in self.transcript_dir100 using os.listdir in sorted order
        transcript_names100    = sorted(os.listdir(self.transcript_dir100))

        # TODO: List files in self.mfcc_dir using os.listdir in sorted order
        mfcc_names          = sorted(os.listdir(self.mfcc_dir))
        # TODO: List files in self.transcript_dir using os.listdir in sorted order
        transcript_names    = sorted(os.listdir(self.transcript_dir))

        # Making sure that we have the same no. of mfcc and transcripts
        assert len(mfcc_names) == len(transcript_names)
        assert len(mfcc_names100) == len(transcript_names100)
        self.mfccs, self.transcripts = [], []

        # TODO: Iterate through mfccs and transcripts (all of 360 first, then all of 100)
        # The range covers every file in both partitions
        for i in range(len(mfcc_names)+len(mfcc_names100)):
            #   Load a single mfcc
            #   Load the corresponding transcript
            if i < len(mfcc_names):
              mfcc        = np.load(self.mfcc_dir+"/"+mfcc_names[i])
              transcript  = np.load(self.transcript_dir+"/"+transcript_names[i])
            else:
              j = i-len(mfcc_names) # index into the train-clean-100 file lists
              mfcc        = np.load(self.mfcc_dir100+"/"+mfcc_names100[j])
              transcript  = np.load(self.transcript_dir100+"/"+transcript_names100[j])

        #   Do Cepstral Normalization of mfcc (explained in writeup)
            mfcc        = (mfcc - mfcc.mean(axis=0))/mfcc.std(axis=0)

            # Remove [SOS] and [EOS] from the transcript
            # (Is there an efficient way to do this without traversing through the transcript?)
            # Note that SOS will always be in the starting and EOS at end, as the name suggests.
            start_idx = np.where(transcript=='[SOS]')[0][-1]
            end_idx = np.where(transcript=='[EOS]')[0][0]
            transcript = transcript[start_idx+1:end_idx]
        #   Append each mfcc to self.mfcc, transcript to self.transcript
            self.mfccs.append(mfcc)
            self.transcripts.append(transcript)

        # TODO: Concatenate all mfccs in self.mfccs such that
        # the final shape is T x 27 (Where T = T1 + T2 + ...)
        self.mfccs          = np.concatenate(self.mfccs)
        self.mfccs          = self.mfccs.astype(np.float32, copy=False) # astype returns a new array, so assign it back

        # TODO: Concatenate all transcripts in self.transcripts such that
        # the final shape is (T,) meaning, each time step has one phoneme output
        self.transcripts    = np.concatenate(self.transcripts)
        #self.transcripts.astype(np.uint8)
        # Hint: Use numpy to concatenate

        # Length of the dataset is now the length of concatenated mfccs/transcripts
        self.length = len(self.mfccs)

        # Take some time to think about what we have done.
        # self.mfcc is an array of the format (Frames x Features).
        # Our goal is to recognize phonemes of each frame
        # From hw0, you will be knowing what context is.
        # We can introduce context by padding zeros on top and bottom of self.mfcc
        self.mfccs = np.pad(self.mfccs, ((self.context, self.context), (0, 0)), 'constant', constant_values=(0, 0))

        # The available phonemes in the transcript are of string data type
        # But the neural network cannot predict strings as such.
        # Hence, we map these phonemes to integers

        # TODO: Map the phonemes to their corresponding list indexes in self.phonemes
        #mapping_dic = {p:phonemes.index(p) for p in phonemes}
        #unique_ph, seq = np.unique(self.transcripts, return_inverse = True)
        self.transcripts = np.array([phonemes.index(p) for p in self.transcripts]).reshape(self.transcripts.shape)
        #self.transcripts.astype(np.uint8) # has no effect without assignment; labels stay in numpy's default integer dtype
        print(self.mfccs.shape, self.transcripts.shape)
        # Now, if an element in self.transcript is 0, it means that it is 'SIL' (as per the above example)

    def __len__(self):
        return self.length

    def __getitem__(self, ind):

        # TODO: Based on context and offset, return a frame at given index with context frames to the left, and right.
        frames = self.mfccs[ind:(ind+2*self.context+1)]
        # After slicing, you get an array of shape 2*context+1 x 27. But our MLP needs 1d data and not 2d.
        frames = frames.flatten() # TODO: Flatten to get 1d data

        frames      = torch.FloatTensor(frames) # Convert to tensors
        phonemes    = torch.tensor(self.transcripts[ind])

        return frames, phonemes

# Parameters Configuration

Storing your parameters and hyperparameters in a single configuration dictionary makes it easier to keep track of them during each experiment. It can also be used with weights and biases to log your parameters for each experiment and keep track of them across multiple experiments.

In [None]:
config_baseline = {
    'root'          :"/content/data/11-785-s23-hw1p2",
    'epochs'        : 5,
    'batch_size'    : 1024,
    'context'       : 20,
    'init_lr'       : 1e-3,
    'architecture'  : 'very-low-cutoff'
    # Add more as you need them - e.g dropout values, weight decay, scheduler parameters
}


config_queen = {
    'root'          :"/content/data/11-785-s23-hw1p2",
    'epochs'        : 30,
    'batch_size'    : 8192,
    'context'       : 25,
    'init_lr'       : 1e-3,
    'architecture'  : '8_layer_pyramid',
    'dropout'       : 0.35,
    'wdecay'        : 0.005,
    'hidden_size'   : [2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024],
    'num_col'       : 27
    # Add more as you need them - e.g dropout values, weight decay, scheduler parameters
}

config_king = {
    'root'          :"/content/data/11-785-s23-hw1p2",
    'epochs'        : 30,
    'batch_size'    : 16384, #  16384
    'context'       : 30,
    'init_lr'       : 1e-3,
    'architecture'  : '6_layer_pyramid',
    'dropout'       : 0.35,
    'wdecay'        : 0.005,
    'hidden_size'   : [2048, 2048, 2048, 2048, 1280, 1024],
    'num_col'       : 27,
    'scheduler_factor': 0.5,
    'scheduler_patience': 3,
    'scheduler_threshold': 0.0025,
    # Add more as you need them - e.g dropout values, weight decay, scheduler parameters
}

config = config_king

# Create Datasets

In [None]:
#TODO: Create a dataset object using the AudioDataset class for the training data
if ONLY360:
  train_data = AudioDataset(config['root'], phonemes = PHONEMES, context=config['context'], partition= "train-clean-360")
  print("loading train 360")
elif ONLY100:
  train_data = AudioDataset(config['root'], phonemes = PHONEMES, context=config['context'], partition= "train-clean-100")
  print("loading train 100")
else:
  train_data = AudioDatasetBoth(config['root'], phonemes = PHONEMES, context=config['context'])
  print("loading train 360+100")
# TODO: Create a dataset object using the AudioDataset class for the validation data
val_data = AudioDataset(config['root'], phonemes = PHONEMES, context=config['context'], partition= "dev-clean")

# TODO: Create a dataset object using the AudioTestDataset class for the test data
test_data = AudioTestDataset(config['root'], phonemes = PHONEMES, context=config['context'], partition= "test-clean")

loading train 100


In [None]:
# Define dataloaders for train, val and test datasets
# Dataloaders will yield a batch of frames and phonemes of given batch_size at every iteration
# We shuffle train dataloader but not val & test dataloader. Why?

train_loader = torch.utils.data.DataLoader(
    dataset     = train_data,
    num_workers = 4,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = True
)

val_loader = torch.utils.data.DataLoader(
    dataset     = val_data,
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False
)

test_loader = torch.utils.data.DataLoader(
    dataset     = test_data,
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False
)


print("Batch size     : ", config['batch_size'])
print("Context        : ", config['context'])
print("Input size     : ", (2*config['context']+1)*27)
print("Output symbols : ", len(PHONEMES)-2)
if 'hidden_size' in config.keys():
  print("Hidden Layer Sizes : ", config['hidden_size'])

print("Train dataset samples = {}, batches = {}".format(train_data.__len__(), len(train_loader)))
print("Validation dataset samples = {}, batches = {}".format(val_data.__len__(), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(test_data.__len__(), len(test_loader)))

Batch size     :  16384
Context        :  30
Input size     :  1647
Output symbols :  40
Hidden Layer Sizes :  [2048, 2048, 2048, 2048, 1280, 1024]
Train dataset samples = 36091157, batches = 2203
Validation dataset samples = 1928204, batches = 118
Test dataset samples = 1934138, batches = 119
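The batch counts in the printout follow directly from the sample counts and the batch size. A minimal sketch, assuming the last partial batch is kept (the loaders above do not set `drop_last`, which I take to default to False); `num_batches` is a hypothetical helper.

```python
# Number of batches per epoch when the last partial batch is kept.
# num_batches is a hypothetical helper; sample counts are from the printout.
def num_batches(num_samples, batch_size):
    # Integer ceiling division
    return (num_samples + batch_size - 1) // batch_size

counts = {"train": 36_091_157, "val": 1_928_204, "test": 1_934_138}
batches = {name: num_batches(n, 16_384) for name, n in counts.items()}
print(batches)  # {'train': 2203, 'val': 118, 'test': 119}
```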


In [None]:
# Testing code to check if your data loaders are working
for i, data in enumerate(train_loader):
    frames, phoneme = data
    print(frames.shape, phoneme.shape)
    break

torch.Size([16384, 1647]) torch.Size([16384])


# Network Architecture


This section defines your network architecture for the homework. We have given you a sample architecture that can easily clear the very low cutoff for the early submission deadline.

In [None]:
# This architecture will make you cross the very low cutoff
# However, you need to run a lot of experiments to cross the medium or high cutoff
class NetworkBaseline(torch.nn.Module):

    def __init__(self, input_size, output_size):

        super(NetworkBaseline, self).__init__()

        self.model = torch.nn.Sequential(
            torch.nn.Linear(input_size, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, output_size)
        )

    def forward(self, x):
        out = self.model(x)

        return out

In [None]:
# This architecture is used for the 8 layer 2048x2-1024x6 architecture, alternatively adding dropout layer

class NetworkQueen(torch.nn.Module):

    def __init__(self, input_size, hidden_sizes, output_size, droprate):

        super(NetworkQueen, self).__init__()
        # create hidden layers, note the first size is input size, so layer number is one less
        layer_num = len(hidden_sizes)
        hidden_sizes = [input_size]+hidden_sizes # add input layer
        hidden_layers = []
        for l in range(layer_num):

          hidden_layers.append(torch.nn.Linear(hidden_sizes[l], hidden_sizes[l+1]))
          hidden_layers.append(torch.nn.BatchNorm1d(hidden_sizes[l+1]))
          hidden_layers.append(torch.nn.SiLU())
          if l % 2 == 0:
            hidden_layers.append(torch.nn.Dropout(droprate))
        # for the last layer to output, no activation need in this model
        hidden_layers.append(torch.nn.Linear(hidden_sizes[-1], output_size))

        self.model = torch.nn.Sequential(
            *hidden_layers)


    def forward(self, x):
        out = self.model(x)

        return out

In [None]:
# This architecture is used for the 2048x4-1280-1024 architecture, adding dropout for every layer with 2048 neurons

class NetworkKing(torch.nn.Module):

    def __init__(self, input_size, hidden_sizes, output_size, droprate):

        super(NetworkKing, self).__init__()
        # Count the hidden layers before prepending input_size, so that
        # hidden_sizes[i] -> hidden_sizes[i+1] describes hidden layer i
        layer_num = len(hidden_sizes)
        hidden_sizes = [input_size] + hidden_sizes # add input layer
        hidden_layers = []
        for i in range(layer_num):
            hidden_layers.append(torch.nn.Linear(hidden_sizes[i], hidden_sizes[i+1]))
            hidden_layers.append(torch.nn.BatchNorm1d(hidden_sizes[i+1]))
            hidden_layers.append(torch.nn.SiLU())
            if hidden_sizes[i+1] == 2048: # dropout only after the 2048-wide layers
                hidden_layers.append(torch.nn.Dropout(droprate))
        # Lighter dropout (droprate - 0.2) on the last hidden layer's output
        hidden_layers.append(torch.nn.Dropout(droprate-0.2))
        # Output layer: no activation is needed, since CrossEntropyLoss expects raw logits
        hidden_layers.append(torch.nn.Linear(hidden_sizes[-1], output_size))

        self.model = torch.nn.Sequential(
            *hidden_layers)


    def forward(self, x):
        out = self.model(x)

        return out
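
The two dropout placement rules are easy to mix up, so here is a minimal, torch-free sketch that only lists the layer names each constructor would build. `layer_plan` and its `rule` argument are hypothetical helpers for illustration. The hidden sizes come from the comments above the two classes, and the dropout rate of 0.35 is taken from the README (applying the same rate to the Queen layout is an assumption).

```python
def layer_plan(hidden_sizes, droprate, rule):
    """List the layers built by NetworkQueen ('queen') or NetworkKing ('king').

    Hypothetical helper: it mirrors the constructor loops without creating tensors.
    """
    plan = []
    for i, width in enumerate(hidden_sizes):
        plan += [f"Linear->{width}", f"BatchNorm1d({width})", "SiLU"]
        if rule == "queen" and i % 2 == 0:
            # Queen: dropout after hidden layers 0, 2, 4, ...
            plan.append(f"Dropout({droprate:.2f})")
        elif rule == "king" and width == 2048:
            # King: dropout after every 2048-wide layer
            plan.append(f"Dropout({droprate:.2f})")
    if rule == "king":
        # King adds a lighter dropout right before the output layer
        plan.append(f"Dropout({droprate - 0.2:.2f})")
    plan.append("Linear->output")
    return plan


king = layer_plan([2048, 2048, 2048, 2048, 1280, 1024], 0.35, "king")
queen = layer_plan([2048, 2048] + [1024] * 6, 0.35, "queen")
print(sum(name.startswith("Dropout") for name in king))   # dropout layers in King
print(sum(name.startswith("Dropout") for name in queen))  # dropout layers in Queen
```

Printing `king` in full gives a quick way to compare the intended layout against the `summary(...)` output of the real model.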

In [None]:
# weights initialization
def initialize_weights(m):
    if isinstance(m, torch.nn.Linear):
        torch.nn.init.kaiming_normal_(m.weight.data)
#       torch.nn.init.kaiming_uniform_(m.weight.data)
    elif isinstance(m, torch.nn.BatchNorm1d):
        torch.nn.init.constant_(m.weight.data, 1)
        torch.nn.init.constant_(m.bias.data, 0)


# Define Model, Loss Function and Optimizer

Here we define the model, loss function, optimizer and optionally a learning rate scheduler.
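
Before instantiating the model, it can help to check the input width and the parameter count by hand, since the homework caps models at 20 million parameters. The sketch below is plain Python and makes two assumptions: the hidden sizes and context are the README values, and the output layer has 40 classes (the value hard-coded in `reload_prev_model`). `mlp_param_count` is a hypothetical helper, not part of the notebook.

```python
def mlp_param_count(input_size, hidden_sizes, output_size):
    """Count trainable parameters of a Linear + BatchNorm1d stack with a Linear head."""
    total = 0
    prev = input_size
    for width in hidden_sizes:
        total += prev * width + width  # Linear weights + biases
        total += 2 * width             # BatchNorm1d scale + shift
        prev = width
    total += prev * output_size + output_size  # output layer
    return total


context, num_features = 30, 27                  # values from the README section
input_size = (2 * context + 1) * num_features   # 61 stacked frames x 27 MFCC features
hidden = [2048, 2048, 2048, 2048, 1280, 1024]
n_params = mlp_param_count(input_size, hidden, output_size=40)  # 40 classes is an assumption
print(input_size, n_params, n_params < 20_000_000)
```

Dropout and SiLU layers carry no parameters, so they are left out of the count.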

In [None]:
# INPUT_SIZE  = (2*config['context'] + 1) * 27 # Why is this the case?
# Each sample is a flattened window of 2*context+1 frames with 27 MFCC features per frame
# model_baseline       = NetworkBaseline(INPUT_SIZE, len(train_data.phonemes)).to(device)
model_queen = NetworkKing(
    (2*config['context']+1)*config['num_col'],
    config['hidden_size'],
    len(train_data.phonemes)-2,
    config['dropout']).to(device)
model_queen.apply(initialize_weights)
summary(model_queen, frames.to(device))
# Check number of parameters of your network
# Remember, you are limited to 20 million parameters for HW1 (including ensembles)

                         Kernel Shape   Output Shape     Params  Mult-Adds
Layer                                                                     
0_model.Linear_0         [1647, 2048]  [16384, 2048]  3.375104M  3.373056M
1_model.BatchNorm1d_1          [2048]  [16384, 2048]     4.096k     2.048k
2_model.SiLU_2                      -  [16384, 2048]          -          -
3_model.Dropout_3                   -  [16384, 2048]          -          -
4_model.Linear_4         [2048, 2048]  [16384, 2048]  4.196352M  4.194304M
5_model.BatchNorm1d_5          [2048]  [16384, 2048]     4.096k     2.048k
6_model.SiLU_6                      -  [16384, 2048]          -          -
7_model.Dropout_7                   -  [16384, 2048]          -          -
8_model.Linear_8         [2048, 2048]  [16384, 2048]  4.196352M  4.194304M
9_model.BatchNorm1d_9          [2048]  [16384, 2048]     4.096k     2.048k
10_model.SiLU_10                    -  [16384, 2048]          -          -
11_model.Dropout_11      



In [None]:
criterion = torch.nn.CrossEntropyLoss() # Defining Loss function.
# We use CE because the task is multi-class classification

#optimizer = torch.optim.Adam(model09.parameters(), lr= config['init_lr']) #Defining Optimizer
# Recommended : Define Scheduler for Learning Rate,
# including but not limited to StepLR, MultiStepLR, CosineAnnealingLR, ReduceLROnPlateau, etc.
# You can refer to Pytorch documentation for more information on how to use them.
#scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2,4], gamma=0.1)
# scheduler2 = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=2)

# Is your training time very high?
# Look into mixed precision training if your GPU (Tesla T4, V100, etc) can make use of it
# Refer - https://pytorch.org/docs/stable/notes/amp_examples.html

optimizer = torch.optim.AdamW(model_queen.parameters(), lr= config['init_lr'], weight_decay=config['wdecay']) # AdamW - w decay

#scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2, 4, 6, 8, 10, 12, 14, 16, 18], gamma=0.75)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max',
    factor=config['scheduler_factor'],
    patience=config['scheduler_patience'],
    threshold=config['scheduler_threshold'],
    verbose=True)
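
To make the scheduler's behaviour concrete, here is a simplified, pure-Python sketch of plateau scheduling in `'max'` mode with a relative threshold. It is an approximation written for illustration, not PyTorch's implementation (it ignores cooldown and `min_lr`). The defaults for `factor`, `patience` and `threshold` are the values hard-coded in `reload_prev_model`, and the accuracy sequence is made up.

```python
def simulate_plateau(metrics, lr=1e-3, factor=0.5, patience=3, threshold=0.0025):
    """Return the learning rate in effect after each epoch's metric is reported."""
    best = float("-inf")
    bad_epochs = 0
    history = []
    for value in metrics:
        if value > best * (1.0 + threshold):
            # Improvement larger than the relative threshold: reset the counter
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # More than `patience` epochs without improvement: decay the rate
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


val_acc = [0.80, 0.82, 0.8201, 0.8202, 0.8203, 0.8204, 0.85]  # made-up accuracies
print(simulate_plateau(val_acc))
```

In this made-up run, the rate is halved once the fourth consecutive epoch fails to beat the best accuracy by more than 0.25%.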

# Training and Validation Functions

This section covers the training and validation functions that run once per epoch for a given model architecture. The code has been provided to you, but we recommend reading through the comments to understand the workflow so that you can write these loops yourself in future HWs.

In [None]:
torch.cuda.empty_cache()
gc.collect()

273

In [None]:
def train(model, dataloader, optimizer, criterion):

    model.train()
    tloss, tacc = 0, 0 # Monitoring loss and accuracy
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train')

    for i, (frames, phonemes) in enumerate(dataloader):

        ### Initialize Gradients
        optimizer.zero_grad()

        ### Move Data to Device (Ideally GPU)
        frames      = frames.to(device)
        phonemes    = phonemes.to(device)

        ### Forward Propagation
        logits  = model(frames)

        ### Loss Calculation
        loss    = criterion(logits, phonemes)

        ### Backward Propagation
        loss.backward()

        ### Gradient Descent
        optimizer.step()

        tloss   += loss.item()
        tacc    += torch.sum(torch.argmax(logits, dim= 1) == phonemes).item()/logits.shape[0]

        batch_bar.set_postfix(loss="{:.04f}".format(float(tloss / (i + 1))),
                              acc="{:.04f}%".format(float(tacc*100 / (i + 1))))
        batch_bar.update()

        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()

    batch_bar.close()
    tloss   /= len(dataloader)
    tacc    /= len(dataloader)

    return tloss, tacc

In [None]:
# Using mix precision
def train_mp(model, dataloader, optimizer, criterion):

    model.train()
    tloss, tacc = 0, 0 # Monitoring loss and accuracy
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train')
    scaler = torch.cuda.amp.GradScaler() # scales the loss so that float16 gradients do not underflow

    for i, (frames, phonemes) in enumerate(dataloader):
        ### Initialize Gradients
        optimizer.zero_grad()

        ### Move Data to Device (Ideally GPU)
        frames      = frames.to(device)
        phonemes    = phonemes.to(device)

        with torch.autocast(device_type='cuda', dtype=torch.float16):
            logits = model(frames)
            loss = criterion(logits, phonemes)

        scaler.scale(loss).backward()

        scaler.step(optimizer)

        scaler.update()

        tloss   += loss.item()
        tacc    += torch.sum(torch.argmax(logits, dim= 1) == phonemes).item()/logits.shape[0]

        batch_bar.set_postfix(loss="{:.04f}".format(float(tloss / (i + 1))),
                              acc="{:.04f}%".format(float(tacc*100 / (i + 1))))
        batch_bar.update()

        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()

    batch_bar.close()
    tloss   /= len(dataloader)
    tacc    /= len(dataloader)
    #wandb.log({'train_acc': tacc*100, 'train_loss': tloss})
    return tloss, tacc
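
Why does `train_mp` wrap the backward pass in a `GradScaler`? The standard-library sketch below illustrates the underlying issue under stated assumptions: a very small value underflows to zero when stored in half precision, but survives if it is scaled up first and unscaled afterwards in full precision. The gradient magnitude is made up, `to_half` is a hypothetical helper, and 65536 (2**16) is used because it is `GradScaler`'s default initial scale.

```python
import struct


def to_half(value):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8   # hypothetical gradient magnitude
scale = 65536.0    # 2**16, GradScaler's default initial scale

unscaled = to_half(tiny_grad)                   # stored directly in float16
recovered = to_half(tiny_grad * scale) / scale  # scaled, stored in float16, then unscaled
print(unscaled, recovered)
```

This only demonstrates the number format; the real scaler also skips optimizer steps when it detects inf/NaN gradients and adjusts the scale over time.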

In [None]:
def eval(model, dataloader):

    model.eval() # set model in evaluation mode
    vloss, vacc = 0, 0 # Monitoring loss and accuracy
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, position=0, leave=False, desc='Val')

    for i, (frames, phonemes) in enumerate(dataloader):

        ### Move data to device (ideally GPU)
        frames      = frames.to(device)
        phonemes    = phonemes.to(device)

        # makes sure that there are no gradients computed as we are not training the model now
        with torch.inference_mode():
            ### Forward Propagation
            logits  = model(frames)
            ### Loss Calculation
            loss    = criterion(logits, phonemes)

        vloss   += loss.item()
        vacc    += torch.sum(torch.argmax(logits, dim= 1) == phonemes).item()/logits.shape[0]

        # Do you think we need loss.backward() and optimizer.step() here?

        batch_bar.set_postfix(loss="{:.04f}".format(float(vloss / (i + 1))),
                              acc="{:.04f}%".format(float(vacc*100 / (i + 1))))
        batch_bar.update()

        ### Release memory
        del frames, phonemes, logits
        torch.cuda.empty_cache()

    batch_bar.close()
    vloss   /= len(dataloader)
    vacc    /= len(dataloader)

    return vloss, vacc
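
One detail worth knowing about the functions above: they average the per-batch accuracies, which is not exactly the same as the fraction of correctly classified frames when the last batch is smaller than the others. The sketch below uses made-up counts to show the difference. Both helper names are hypothetical.

```python
def batch_mean_accuracy(correct_per_batch, batch_sizes):
    """Average of per-batch accuracies, as accumulated in train/eval above."""
    accs = [c / n for c, n in zip(correct_per_batch, batch_sizes)]
    return sum(accs) / len(accs)


def sample_weighted_accuracy(correct_per_batch, batch_sizes):
    """Fraction of all samples that were classified correctly."""
    return sum(correct_per_batch) / sum(batch_sizes)


correct = [3, 3, 2]  # made-up number of correct predictions per batch
sizes = [4, 4, 2]    # the last batch is smaller
print(batch_mean_accuracy(correct, sizes))
print(sample_weighted_accuracy(correct, sizes))
```

With a batch size of 16384 and many batches per epoch, the gap between the two numbers is likely negligible, so this is a caveat rather than a bug.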

# Weights and Biases Setup

This section enables logging of metrics and files with Weights and Biases. Please refer to the wandb documentation and to recitation 0, which covers the use of Weights and Biases for logging, hyperparameter tuning, and monitoring the runs for your homeworks. Using this tool makes it very easy to show results when submitting your code and models for homeworks, and it is also extremely useful for study groups that want to organize and run ablations under a single team in wandb.

We have written code for you to make use of it out of the box, so that you start using wandb for all your HWs from the beginning.

In [None]:
wandb.login(key="<YOUR_WANDB_API_KEY>") # API key is in your wandb account, under settings (wandb.ai/settings); never commit a real key

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
# Create your wandb run
run = wandb.init(
    name    = "official-run", ### Wandb creates random run names if you skip this field, we recommend you give useful names
    reinit  = True, ### Allows reinitalizing runs when you re-run this cell
    #id     = "y28t31uz", ### Insert specific run id here if you want to resume a previous run
    #resume = "must", ### You need this to resume previous runs, but comment out reinit = True when using this
    project = "hw1p2", ### Project should be created in your wandb account
    config  = config ### Wandb Config for your run
)

wandb: Currently logged in as: wenxinz3 (sharonxin1207). Use `wandb login --relogin` to force relogin


In [None]:
### Save your model architecture as a string with str(model)
model_arch  = str(model_queen)

### Save it in a txt file
with open("model_arch_official.txt", "w") as arch_file:
    arch_file.write(model_arch)

### log it in your wandb run with wandb.save()
wandb.save('model_arch_official.txt')

['/content/wandb/run-20230224_150919-fo4quypo/files/model_arch_official.txt']

## If reloading previous model

In [None]:
RELOAD_PATH = None # or model file to reload, such as 'Attempt460_finetune.pt'

In [None]:
def reload_prev_model(model_name):
    """Rebuild the model, optimizer and scheduler, then restore their state from /content/<model_name>."""
    model_queen = NetworkKing(
        (2*config['context']+1)*config['num_col'],
        config['hidden_size'],
        40, # number of output classes; must match the checkpointed model
        config['dropout']).to(device)
    model_queen.apply(initialize_weights) # overwritten by load_state_dict below
    criterion = torch.nn.CrossEntropyLoss()
    # Note: optimizer.load_state_dict below also restores the checkpointed learning rate
    optimizer = torch.optim.AdamW(model_queen.parameters(), lr= config['init_lr']*0.05, weight_decay=config['wdecay']) # AdamW - w decay

    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=3, threshold=0.0025, verbose=True)
    model_save_name = model_name #'Attempt460_finetune.pt'
    path = F"/content/{model_save_name}"

    # map_location lets a checkpoint saved on GPU be loaded on the current device
    checkpoint = torch.load(path, map_location=device)
    model_queen.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

    scheduler.load_state_dict(checkpoint['scheduler'])

    return model_queen, optimizer, scheduler


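def save_model_state(model_path, model, optimizer, scheduler, train_loss, val_loss, val_acc):
    """Save a checkpoint with model, optimizer and scheduler state plus the latest metrics."""
    model_save_name = model_path # 'Attempt460.pt'
    path = F"/content/drive/MyDrive/{model_save_name}" # Drive location (built but not written to)
    gcp_path = F"/content/{model_save_name}" # local path that is actually written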
    torch.save({
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler': scheduler.state_dict(),
                'train_loss': train_loss,
                'val_loss':val_loss,
                'val_acc': val_acc
                }, gcp_path)

In [None]:
if RELOAD_PATH is not None:
  model_queen, optimizer, scheduler = reload_prev_model(RELOAD_PATH)

# Experiment

Now, it is time to finally run your ablations! Have fun! --- not fun :(

In [None]:
# Iterate over number of epochs to train and evaluate your model
torch.cuda.empty_cache()
gc.collect()
wandb.watch(model_queen, log="all")

for epoch in range(config['epochs']):

    print("\nEpoch {}/{}".format(epoch+1, config['epochs']))

    curr_lr                 = float(optimizer.param_groups[0]['lr'])
    train_loss, train_acc   = train_mp(model_queen, train_loader, optimizer, criterion)
    val_loss, val_acc       = eval(model_queen, val_loader)
    scheduler.step(val_acc)
    # scheduler.step()

    print("\tTrain Acc {:.04f}%\tTrain Loss {:.04f}\t Learning Rate {:.07f}".format(train_acc*100, train_loss, curr_lr))
    print("\tVal Acc {:.04f}%\tVal Loss {:.04f}".format(val_acc*100, val_loss))
    if (epoch+1) % 5 == 0:
      model_save_name = "Tuning"+MODEL_SAVED_NAME
      save_model_state(model_save_name, model_queen, optimizer, scheduler, train_loss, val_loss, val_acc)

    ### Log metrics at each epoch in your run
    # Optionally, you can log at each batch inside train/eval functions
    # (explore wandb documentation/wandb recitation)
    wandb.log({'train_acc': train_acc*100, 'train_loss': train_loss,
               'val_acc': val_acc*100, 'valid_loss': val_loss, 'lr': curr_lr})

    ### Highly Recommended: Save checkpoint in drive and/or wandb if accuracy is better than your current best
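
The recommendation above (save a checkpoint whenever validation accuracy improves) can be sketched without any framework code. `BestCheckpoint` is a hypothetical helper, the accuracies are made up, and the `save_fn` callback stands in for whatever actually writes the file.

```python
class BestCheckpoint:
    """Track the best validation accuracy and call a save function on improvement."""

    def __init__(self, save_fn):
        self.save_fn = save_fn
        self.best_acc = float("-inf")
        self.best_epoch = None

    def update(self, epoch, val_acc):
        """Return True (and trigger a save) if val_acc beats the best seen so far."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.best_epoch = epoch
            self.save_fn(epoch, val_acc)
            return True
        return False


saved = []  # records (epoch, accuracy) instead of writing files
tracker = BestCheckpoint(save_fn=lambda epoch, acc: saved.append((epoch, acc)))
for epoch, val_acc in enumerate([0.80, 0.84, 0.83, 0.86]):  # made-up accuracies
    tracker.update(epoch, val_acc)
print(saved)
```

In this notebook, `save_fn` could be a small wrapper around `save_model_state` with a fixed file name such as `"Best" + MODEL_SAVED_NAME` (a hypothetical name).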


In [None]:
### Finish your wandb run
run.finish()
model_save_name = MODEL_SAVED_NAME
save_model_state(model_save_name, model_queen, optimizer, scheduler, train_loss, val_loss, val_acc)



Run history:
lr          █████▃▃▃▃▁
train_acc   ▁▅▆▇▇▇████
train_loss  █▄▃▂▂▂▁▁▁▁
val_acc     ▁▄▅▆▆▇▇███
valid_loss  █▅▄▃▃▂▂▁▁▁

Run summary:
lr          0.00025
train_acc   84.85594
train_loss  0.44112
val_acc     87.23075
valid_loss  0.36695


# Testing and submission to Kaggle

Before we get to the following code, make sure to check the submission format given in *sample_submission.csv*. Once you have done so, fill in the following function to run inference on the test data. Refer to the eval function from the previous cells for an idea of how to complete it.

In [None]:
def test(model, test_loader):
    ### Put the model in evaluation mode (disables dropout, uses running batchnorm statistics)
    model.eval()

    ### List to store predicted phonemes of test data
    test_predictions = []

    ### inference_mode disables gradient tracking during the forward passes
    with torch.inference_mode():

        for i, mfccs in enumerate(tqdm(test_loader)):

            mfccs   = mfccs.to(device)

            logits  = model(mfccs)

            ### Get the most likely phoneme index for each frame with argmax
            predicted_phonemes = torch.argmax(logits, dim = 1)
            ### Map the numeric indices back to phoneme strings
            predicted_phonemes = [PHONEMES[p] for p in predicted_phonemes.tolist()]

            ### Append this batch's predictions to the running list
            test_predictions.extend(predicted_phonemes)

    return test_predictions

In [None]:
os.environ['WANDB_CONSOLE'] = 'off'
wandb.init()
predictions = test(model_queen, test_loader)

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: wenxinz3 (sharonxin1207). Use `wandb login --relogin` to force relogin


  0%|          | 0/119 [00:00<?, ?it/s]

In [None]:
### Create CSV file with predictions
savepath = "/content/drive/MyDrive/submission.csv" # Drive location; not used below
with open("./submission.csv", "w") as f:
    f.write("id,label\n")
    for i in range(len(predictions)):
        f.write("{},{}\n".format(i, predictions[i]))
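
As an alternative to formatting each row by hand, the same `id,label` file can be produced with the standard `csv` module, which also takes care of quoting if a label ever contains a comma. This is a minimal sketch: `write_submission` is a hypothetical helper and the labels are made up. It writes to an in-memory buffer so that it runs without touching the file system.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write one `id,label` row per prediction to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "label"])
    for i, label in enumerate(predictions):
        writer.writerow([i, label])


buffer = io.StringIO()
write_submission(["[SIL]", "AA", "AE"], buffer)  # made-up labels
print(buffer.getvalue())
```

To write a real file, pass a handle from `open("./submission.csv", "w", newline="")` instead of the buffer.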

In [None]:
if SUBMIT_KAGGLE:
  ### Submit to the Kaggle competition using the Kaggle API
  !kaggle competitions submit -c 11-785-s23-hw1p2 -f ./submission.csv -m "Test Submission"

  ### However, it's always safer to download the csv file and then upload it to Kaggle manually

100% 19.3M/19.3M [00:00<00:00, 45.7MB/s]
Successfully submitted to Frame-Level Speech Recognition