# Read me
1. To successfully run this code, one has to put all the **npy** files under the folder where this code is being run.
2. Create a folder called **test** where all the pth files will be stored.
3. Run cells one after another to train and test the model
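
Before running the cells, it may help to confirm that the data files from step 1 are in place. The sketch below is a minimal check, assuming the file names used by the `np.load` calls later in this notebook; the helper name `missing_files` is hypothetical and not part of the original code.

```python
from pathlib import Path

# File names taken from the np.load calls later in this notebook.
REQUIRED_FILES = [
    "train.npy",
    "train_labels.npy",
    "dev.npy",
    "dev_labels.npy",
    "test.npy",
]


def missing_files(folder=".", required=REQUIRED_FILES):
    """Return the required file names that are absent from `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


missing = missing_files()
if missing:
    print("Missing data files:", ", ".join(missing))
else:
    print("All data files found.")
```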

## Model Description
This model is composed of 10 fully connected layers with *batch normalization* performed before each *LeakyReLU* activation layer. The loss function is *CrossEntropyLoss*, and the optimizer for gradient descent is Adam.



Details about the model structure are written in the text cell above the corresponding code cell.


In [None]:
import os
import numpy as np
import torch
from torch import nn
from tqdm import tqdm
import pandas as pd
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))

There are three reasons why I set context to 25:
1. Piazza post [@411](https://colab.research.google.com/drive/1KAcru1O0asbOrY12v2fBl345_Eynrtbd?authuser=2#scrollTo=8c35946a-1e2e-4096-892b-628f05e99776) explicitly states that with context = 20, the model can easily reach the B cutoff.
2. The writeup for p2 recommends a context value between 5 and 30. Because context determines how much speech preceding and succeeding one sound is included, a higher value gives each input more information for classification, which can lead to better performance.
3. Since context = 20 is the recommended value for reaching the B cutoff, I expected that reaching the A cutoff would require a larger context. Because the maximal recommended value is 30, I tried the midpoint between 20 and 30, which is 25, and it did improve my model's classification performance.
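
To make the effect of context concrete, here is a minimal sketch of how a zero-padded utterance yields one window of `2 * context + 1` frames per original frame. The toy utterance length and the helper name `window` are assumptions for illustration; the feature dimension of 40 is taken from the `input_size` formula used later in this notebook.

```python
import numpy as np

context = 25      # frames taken on each side of the centre frame
n_features = 40   # feature dimension, as in input_size later in this notebook

# Toy utterance: 100 frames, each a 40-dimensional feature vector.
utterance = np.random.rand(100, n_features)

# Zero-pad `context` frames at both ends so every frame has a full window.
padded = np.pad(utterance, ((context, context), (0, 0)),
                mode="constant", constant_values=0)


def window(padded, j, context):
    """Return the (2 * context + 1)-frame window centred on original frame j."""
    return padded[j:j + 2 * context + 1, :]


first = window(padded, 0, context)
print(first.shape)   # (51, 40)
print(first.size)    # 2040 values fed to the network per example
```

For the first frame, the leading `context` rows of the window are all zeros, and row `context` is the frame itself.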

In [None]:
#%%setting offset and context
context = 25
offset = context

In [None]:
#%%Training data_set
class Train_Dataset(torch.utils.data.Dataset):    
    def __init__(self, X, Y, offset = offset, context = context):
        
        # Add data and label to self 
        self.X = X
        self.Y = Y
        
        #data index mapping 
        index_map_X = []
        for i, x in enumerate(X):
            for j, xx in enumerate(x):
                index_pair_X = (i, j)
                index_map_X.append(index_pair_X)
        
        #Assign data index mapping to self 
        self.index_map = index_map_X
        
        #Add length to self 
        self.length = len(index_map_X)
        
        #Add context and offset to self 
        self.context = context
        self.offset = offset
        
        #Zero pad data as-needed for context 
        for i, x in enumerate(self.X):
            self.X[i] = np.pad(x, ((context, context), (0, 0)), 'constant', constant_values=0)
        
    def __len__(self):
        
        #Return length 
        return self.length
    
    def __getitem__(self, index):
        
        #Get index pair from index map 
        i, j = self.index_map[index]
        
        #Calculate starting timestep using offset and context 
        start_j = j + self.offset - self.context
        
        #Calculate ending timestep using offset and context 
        end_j = j + self.offset + self.context + 1
        
        #Get data at index pair with context 
        features = self.X[i][start_j:end_j, :]
        
        #Get label at index pair
        labels = self.Y[i][j]
        
        ### Return data at index pair with context and label at index pair 
        return features, labels
    
    @staticmethod
    def collate_fn(batch):
        
        ### Select all data from batch 
        batch_x = [x for x, y in batch]
        
        ### Select all labels from batch
        batch_y = [y for x, y in batch]
        
        ### Stack into single arrays first (much faster than converting a list of arrays),
        ### then convert batched data and labels to tensors 
        batch_x = torch.as_tensor(np.stack(batch_x))
        batch_y = torch.as_tensor(np.asarray(batch_y))
        
        ### Return batched data and labels 
        return batch_x, batch_y

After some research, I found an article on ways to make PyTorch train faster, [Faster Deep Learning Training with PyTorch – a 2021 Guide](https://efficientdl.com/faster-deep-learning-in-pytorch-a-guide/). The article reports a 2x speed-up for a single training epoch from using four workers and pinned memory, which is why I added both to my two dataloaders.
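
The data-loading cell notes the rule of thumb `num_workers = 4 * num_GPU`. The sketch below turns that rule into a small helper; the name `suggest_num_workers` is hypothetical, and capping the result at the number of CPU cores is my own assumption rather than something stated in the article.

```python
import os


def suggest_num_workers(num_gpus, workers_per_gpu=4):
    """Rule-of-thumb worker count: 4 per GPU, capped at the available CPU cores."""
    cpu_count = os.cpu_count() or 1   # os.cpu_count() may return None
    wanted = workers_per_gpu * max(num_gpus, 0)
    return min(wanted, cpu_count)


print(suggest_num_workers(num_gpus=1))
```

With no GPU the helper returns 0, which `DataLoader` interprets as loading data in the main process.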

The batch size follows Piazza post [@411](https://colab.research.google.com/drive/1KAcru1O0asbOrY12v2fBl345_Eynrtbd?authuser=2#scrollTo=8c35946a-1e2e-4096-892b-628f05e99776).

In [None]:
#%% load training and validating data
print('loading data...')
X = np.load("train.npy", allow_pickle=True)
y = np.load("train_labels.npy", allow_pickle=True)
val_X = np.load("dev.npy", allow_pickle=True)
val_Y = np.load("dev_labels.npy", allow_pickle=True)
train_dataset = Train_Dataset(X, y)
#Rule of thumb: num_workers = 4 * number of GPUs.
#pin_memory = True allocates the samples in page-locked memory, which speeds up the transfer to the GPU
batch_size = 128
train_dataloader  = torch.utils.data.DataLoader(train_dataset, 
                        batch_size = batch_size, 
                        shuffle=True, 
                        collate_fn= Train_Dataset.collate_fn,
                        num_workers = 4,
                        pin_memory = True)

val_dataset = Train_Dataset(val_X, val_Y)
val_dataloader = torch.utils.data.DataLoader(val_dataset, 
                        batch_size = batch_size,
                        shuffle=False, 
                        num_workers = 4,                       
                        collate_fn= Train_Dataset.collate_fn,
                        pin_memory = True)
print('Data loaded')

I found another [article](https://towardsdatascience.com/batch-normalization-in-3-levels-of-understanding-14c2da90a338) that advises against relying on batch normalization to avoid overfitting, so I added dropout layers after the activations.

For the dropout rate, after discussing with some friends who are also taking this course, we concluded that starting with a high dropout rate such as 0.5 ~ 0.8 and then decreasing it after several layers can lead to better performance.
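
The schedule used in the model cell below (0.5 for the first two layers, 0.1 afterwards) can be written as a small helper. This is only a sketch; the function name `dropout_schedule` and its parameters are hypothetical.

```python
def dropout_schedule(n_layers, n_high=2, high=0.5, low=0.1):
    """Dropout rate per layer: `high` for the first `n_high` layers, `low` afterwards."""
    return [high if i < n_high else low for i in range(n_layers)]


rates = dropout_schedule(10)
print(rates)
```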


In [None]:
#%%neural network
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, output_size, hiddens, activation):
        super().__init__()
        in_h_out_0 = [input_size] + hiddens
        in_h_out_1 =  hiddens + [output_size]
        zip_hiddens = list(zip(in_h_out_0, in_h_out_1))
        layers = [nn.Flatten()]
        for i in range(len(zip_hiddens)):
            layers.append(nn.Linear(zip_hiddens[i][0], zip_hiddens[i][1]))
            layers.append(nn.BatchNorm1d(zip_hiddens[i][1]))
            layers.append(activation)
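            # Dropout decreases with depth: 0.5 for the first two layers, 0.1 afterwards.
            # Note: BatchNorm, activation, and dropout are also applied after the final output layer.
            if i < 2:
                layers.append(nn.Dropout(0.5))
            else:
                layers.append(nn.Dropout(0.1))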
        self.seq_layers = nn.Sequential(*layers)

    def forward(self, X):
        return self.seq_layers(X)

* The learning rate is set according to Piazza post [@411](https://colab.research.google.com/drive/1KAcru1O0asbOrY12v2fBl345_Eynrtbd?authuser=2#scrollTo=8c35946a-1e2e-4096-892b-628f05e99776)

* The sizes of the hidden layers were determined mostly by trial and error, guided by post [@411](https://colab.research.google.com/drive/1KAcru1O0asbOrY12v2fBl345_Eynrtbd?authuser=2#scrollTo=8c35946a-1e2e-4096-892b-628f05e99776)
* The loss function follows the open-source [PyTorch Tutorials: optimization](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html), which also addresses a classification task.
* I chose Adam as the optimizer because it might be the best overall choice according to this [article](https://ruder.io/optimizing-gradient-descent/), *An overview of gradient descent optimization algorithms.*
* I kept adding layers until the validation accuracy was at least 0.79; the size of each layer is based on post [@411](https://colab.research.google.com/drive/1KAcru1O0asbOrY12v2fBl345_Eynrtbd?authuser=2#scrollTo=8c35946a-1e2e-4096-892b-628f05e99776)
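
As a sanity check on these choices, the sketch below pairs up consecutive layer sizes and counts the parameters in the fully connected layers. The sizes are copied from the parameter cell below; the helper `linear_params` is hypothetical, and the count deliberately ignores the batch normalization parameters.

```python
context = 25
input_size = (1 + 2 * context) * 40
output_size = 70
hiddens = [2048, 2048, 1024, 1024, 512, 512, 256, 256, 128]

# Pair consecutive sizes: (in_features, out_features) for each linear layer.
sizes = [input_size] + hiddens + [output_size]
layer_shapes = list(zip(sizes[:-1], sizes[1:]))


def linear_params(shapes):
    """Weights plus biases over all fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


print(len(layer_shapes))            # number of linear layers
print(layer_shapes[0])              # first layer: input window -> 2048 units
print(linear_params(layer_shapes))  # parameters in the linear layers only
```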

In [None]:
#%% Set model parameters
learning_rate = 1e-3
activation = nn.LeakyReLU(0.1) 
input_size = (1 + 2 * context) * 40 
output_size = 70
hiddens = [2048,2048,1024,1024,512,512,256,256,128]
#%%Initialize model
#recommended to move a model to GPU before constructing an optimizer
model = NeuralNetwork(input_size= input_size, output_size= output_size, 
                      hiddens= hiddens, activation= activation).to(device)
#Initialize the loss function
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr= learning_rate)

In [None]:
#%%training
def training(dataloader, model, loss_fn, optimizer):
    sum_loss, accuracy = 0.0, 0.0
    n_batches = len(dataloader) #number of batches
    #train mode
    model.train()
    for (X, y) in dataloader:
        #sending data to device
        X, y = X.float().to(device), y.long().to(device)
        #Reset gradients, then forward pass
        optimizer.zero_grad()
        prediction = model(X)
        loss = loss_fn(prediction, y)

        # Backpropagation        
        loss.backward()
        optimizer.step()
        sum_loss += loss.item()  
        y_hat = prediction.argmax(1)
        correct = torch.sum(y_hat == y).item() / X.shape[0] #batch_size
        accuracy += correct 
    mean_loss = sum_loss / n_batches
    mean_accuracy = accuracy / n_batches
    return mean_loss, mean_accuracy

In [None]:
#%%validating
def testing(dataloader, model, loss_fn):
    sum_loss, accuracy = 0.0, 0.0
    n_batches = len(dataloader) #number of batches
    with torch.no_grad():
        model.eval()
        for (X, y) in dataloader:
            #sending data to device
            X, y = X.float().to(device), y.long().to(device)
            #Forward
            prediction = model(X)
            #calculating loss
            sum_loss += loss_fn(prediction, y).item()
            y_hat = prediction.argmax(1)            
            correct =  torch.sum(y_hat == y).item() / X.shape[0]
            accuracy += correct 
    mean_loss = sum_loss / n_batches
    mean_accuracy = accuracy / n_batches
    return mean_loss, mean_accuracy

I kept increasing the number of epochs until the validation accuracy was reliably over 0.8.
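
One way to read this off the recorded history (for example `list_accuracy_val` from the training cell) is sketched below. The helper name `first_epoch_reaching` is hypothetical, and the accuracy values are made up purely for illustration; they are not results from this model.

```python
def first_epoch_reaching(val_accuracies, threshold=0.8):
    """Return the 0-based index of the first epoch whose accuracy exceeds `threshold`, or None."""
    for epoch, acc in enumerate(val_accuracies):
        if acc > threshold:
            return epoch
    return None


# Made-up accuracies purely for illustration, not results from this model.
history = [0.71, 0.76, 0.79, 0.801, 0.805]
print(first_epoch_reaching(history))   # 3
```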

One epoch took about 25~30 minutes, and the model needs at least 64 epochs to reach a validation accuracy over 0.8.
Therefore, training took about 2 days in total.

The model checkpoints are stored in a folder called **test**, so to run this code successfully, one has to create a folder called **test** before training. Note that the save path in the code is `../test/`, which is relative to the parent of the folder where the code is run. Alternatively, one can modify the save path to avoid a save-path error.
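
A save-path error can also be avoided by creating the folder from code. The following is a minimal sketch, assuming the `../test` location and the `<epoch>.pth` naming used in the training cell; the helper name `checkpoint_path` is hypothetical.

```python
from pathlib import Path


def checkpoint_path(epoch, folder="../test"):
    """Create `folder` if needed and return the path for this epoch's .pth file."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{epoch}.pth"
```

The returned path could then be passed to `torch.save` in place of the hand-built string.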

In [None]:
#%%running model
print('Running model')
list_loss_train, list_accuracy_train, list_loss_val, list_accuracy_val = [], [], [], []
epochs = 70
for e in tqdm(range(epochs)):
    #training
    train_loss, train_accuracy = training(train_dataloader, model, loss_fn, optimizer)
    #Save a general checkpoint for inference and/or resuming training
    torch.save({
            'epoch': e,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': train_loss}, '../test/'+str(e)+'.pth')
    list_loss_train += [train_loss]
    list_accuracy_train += [train_accuracy]
    #validating
    val_loss, val_accuracy = testing(val_dataloader, model, loss_fn)
    list_loss_val += [val_loss]
    list_accuracy_val += [val_accuracy]
#     loss_scheduler.step()
    print("Epoch "+str(e)+": \n train_loss = " + str(train_loss) +
          ", train_accuracy:" + str(train_accuracy) +
          ",\n val_Loss:" + str(val_loss) +
          ", val_accuracy:" + str(val_accuracy))
print('Done.')
np.savetxt("train_loss.csv", list_loss_train, delimiter=",")
np.savetxt("train_accuracy.csv", list_accuracy_train, delimiter=",")
np.savetxt("val_loss.csv", list_loss_val, delimiter=",")
np.savetxt("val_accuracy.csv", list_accuracy_val, delimiter=",")

In [None]:
#%%model loading for testing
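#map_location lets a checkpoint saved on GPU load on a CPU-only machine as well
checkpoint = torch.load('../test/'+ str(epochs - 1)+'.pth', map_location=device)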
model.load_state_dict(checkpoint['model_state_dict'])

<All keys matched successfully>

In [None]:
#%%test data loader
class Test_Dataset(torch.utils.data.Dataset):
    def __init__(self, X, offset= offset, context= context):
        
        ### Assign data to self (1 line)
        self.X = X
        
        ### Define data index mapping (4-6 lines)
        index_map_X = []
        
        for i, x in enumerate(X):
            for j, xx in enumerate(x):
                index_pair_X = (i, j)
                index_map_X.append(index_pair_X)
                
        ### Assign data index mapping to self (1 line)
        self.index_map = index_map_X
        
        ### Assign length to self (1 line)
        self.length = len(self.index_map)
        
        ### Add context and offset to self (1-2 line)
        self.context = context
        self.offset = offset
        
        ### Zero pad data as-needed for the chosen context size
        for i, x in enumerate(self.X):
            self.X[i] = np.pad(x, ((context, context), (0, 0)), 'constant', constant_values=0)
        
    def __len__(self):
        
        ### Return length (1 line)
        return self.length
    
    def __getitem__(self, index):
        
        ### Get index pair from index map (1-2 lines)
        i, j = self.index_map[index]
        
        ### Calculate starting timestep using offset and context (1 line)
        start_j = j + self.offset - self.context
        
        ### Calculate ending timestep using offset and context (1 line)
        end_j = j + self.offset + self.context + 1
        
        ### Get data at index pair with context (1 line)
        feature = self.X[i][start_j:end_j,:]
        
        ### Return data (1 line)
        return feature
    
    @staticmethod
    def collate_fn(batch):
        
        ### Stack the list of arrays, then convert the batch to a tensor
        batch_x = torch.as_tensor(np.stack(batch))
        
        ### Return batched data
        return batch_x

In [None]:
#%%loading test data
test_X = np.load("test.npy", allow_pickle=True)
test_X = Test_Dataset(test_X)
test_dataloader = torch.utils.data.DataLoader(
                        dataset= test_X,
                        batch_size= 1,
                        shuffle= False,
                        num_workers= 4,
                        collate_fn= Test_Dataset.collate_fn,
                        pin_memory= True)

In [None]:
#%%testing
prediction = []
with torch.no_grad():
    model.eval()
    for X in tqdm(test_dataloader):
        #sending data to device
        X = X.float().to(device)
        #Predicting
        prediction += [model(X).argmax(1).item()]
#%%save prediction to csv
prediction_df = pd.DataFrame(data = {'id':np.arange(len(prediction)), 'label': prediction})
prediction_df.to_csv('hw1.csv', index= False)