# Question A2 (10 marks)

### In this question, we will determine the optimal batch size for mini-batch gradient descent. Find the optimal batch size by training the neural network and evaluating its performance for different batch sizes. Note: Use 5-fold cross-validation on the training partition to perform hyperparameter selection. You will have to reconsider the scaling of the dataset during the 5-fold cross-validation.

* Note: some cells are non-editable and cannot be filled in; leave them untouched. Fill in only the cells that are provided.

#### Plot mean cross-validation accuracies on the final epoch for different batch sizes as a scatter plot. Limit search space to batch sizes {128, 256, 512, 1024}. Next, create a table of time taken to train the network on the last epoch against different batch sizes. Finally, select the optimal batch size and state a reason for your selection.

This might take a while to run, so plan your time carefully.
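
To get a rough feel for why run time depends on batch size, the number of parameter updates per epoch can be counted for each candidate. The sketch below assumes 6751 training rows per fold (the fold size reported by the sanity check in step 3) and that the last partial batch is kept, which is the `DataLoader` default (`drop_last=False`).

```python
import math

# Assumption: 6751 training rows per fold, taken from the sanity-check output in step 3.
n_train_rows = 6751
batch_sizes = [128, 256, 512, 1024]

# With drop_last=False (the DataLoader default) the final partial batch still counts.
steps_per_epoch = {bs: math.ceil(n_train_rows / bs) for bs in batch_sizes}

for bs, steps in steps_per_epoch.items():
    print(f"batch size {bs:>4}: {steps} updates per epoch")
```

Fewer updates per epoch usually means a shorter epoch, but each update then averages over more samples, so the accuracy reached after a fixed number of epochs can differ between batch sizes.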

In [89]:
import tqdm
import time
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
from torch import nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from scipy.io import wavfile as wav

from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

from common_utils import set_seed

# setting seed
set_seed()

2. To reduce repeated code, place your

- network (MLP defined in QA1)
- torch datasets (CustomDataset defined in QA1)
- loss function (loss_fn defined in QA1)

in a separate file called **common_utils.py**

Import them into this file. You will not be penalised again here for any error in QA1, as the code in QA1 will not be re-marked.

The following code cell will not be marked.

In [90]:
# YOUR CODE HERE
from common_utils import MLP, CustomDataset, loss_fn, split_dataset
# from common_utils import split_dataset, preprocess_dataset

# def preprocess(df):
#     # YOUR CODE HERE
    
#     X_train, y_train, X_test, y_test = split_dataset(df, 'filename', 0.30, 1)
#     X_train_scaled, X_test_scaled = preprocess_dataset(X_train, X_test)
    
#     return X_train_scaled, y_train, X_test_scaled, y_test

# import the x train and y train datasets
df = pd.read_csv('simplified.csv')
df['label'] = df['filename'].str.split('_').str[-2]

df['label'].value_counts()
#change to train test split?
X_train, y_train, X_test, y_test = split_dataset(df, 'filename', 0.30, 1)
print(X_train)
print(y_train)

class BatchCustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# X_train_scaled, y_train, X_test_scaled, y_test = preprocess(df)
# X_train, y_train = X_train_scaled, y_train

            tempo  total_beats  average_beats  chroma_stft_mean  \
5358    95.703125         1874     187.400000          0.567137   
642    103.359375          477      79.500000          0.549953   
7565    78.302557          875     125.000000          0.646271   
9584   112.347147         3430     201.764706          0.599859   
9374   198.768029         6870     214.687500          0.724747   
...           ...          ...            ...               ...   
7813   151.999081         3349     176.263158          0.591543   
10955  107.666016         3107     194.187500          0.514742   
905    161.499023        16138     375.302326          0.492115   
5192    92.285156          247      61.750000          0.526634   
235     95.703125          602      86.000000          0.500863   

       chroma_stft_var  chroma_cq_mean  chroma_cq_var  chroma_cens_mean  \
5358          0.088985        0.515726       0.076869          0.262738   
642           0.088597        0.488051       

3. Define different folds for different batch sizes to get a dictionary of training and validation datasets. Preprocess your datasets accordingly.
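
The preprocessing matters here because the scaler's statistics should come from the training portion of each fold only. As a minimal sketch on synthetic data (the array shape and the names `X_demo` and `folds` are illustrative, not taken from the assignment), the scaler is fitted on the training rows and then reused, unchanged, on the validation rows:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # synthetic stand-in features

folds = []
cv = KFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, val_idx in cv.split(X_demo):
    # Fit on the training rows only, so no validation statistics leak in.
    scaler = StandardScaler().fit(X_demo[train_idx])
    X_tr = scaler.transform(X_demo[train_idx])
    X_va = scaler.transform(X_demo[val_idx])  # reuse the training statistics
    folds.append((X_tr, X_va))

print(len(folds), folds[0][0].shape, folds[0][1].shape)
```

The scaled training rows have zero mean per feature by construction, while the validation rows generally do not, which is expected.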

In [91]:
def generate_cv_folds_for_batch_sizes(parameters, X_train, y_train):
    """
    returns:
    X_train_scaled_dict(dict) where X_train_scaled_dict[batch_size] is a list of the preprocessed training matrix for the different folds.
    X_val_scaled_dict(dict) where X_val_scaled_dict[batch_size] is a list of the processed validation matrix for the different folds.
    y_train_dict(dict) where y_train_dict[batch_size] is a list of labels for the different folds
    y_val_dict(dict) where y_val_dict[batch_size] is a list of labels for the different folds
    """
    # YOUR CODE HERE
#     x train scaled and y train goes in here
#     x train scaled dict is the list of of 4/5 matrices
#     x val scaled dict is the last 1/5 matrix for testing
#     y train dict is the list of 4/5 labels
#     y val dict is the last 1/5 labels for testing
    
#     It will not differ by batch size. X_train_scaled_dict[128] is a list of train dataset for the different folds, and you should have 5 elements in the list in total. It is the same as X_train_scaled_dict[256], etc
#     X_train_scaled_dict should look like {128:[list of 5 folds] 256:[list of 5 folds], 512: [list of 5 folds], 1024: [list of 5 folds]}
#     y_train_dict is a dictionary of 4x5 elements as well, each element is the matrix of labels to train towards
    
#     customdataset = xtrain, ytrain
    
#     cv = KFold(n_splits=5, shuffle=True, random_state=1)
#     for train_idx, test_idx in cv.split(X_train, y_train):
#         X_train_scaled_dict, y_train_dict  = X_train[train_idx], y_train[train_idx]
#         X_val_scaled_dict, y__val_dict = X_train[test_idx], y_train[test_idx]
    batch_sizes = parameters  # list of batch sizes to build folds for
    
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    
#     X_train_scaled_dict = {batch_size: [] for batch_size in batch_sizes}
#     X_val_scaled_dict = {batch_size: [] for batch_size in batch_sizes}
#     y_train_dict = {batch_size: [] for batch_size in batch_sizes}
#     y_val_dict = {batch_size: [] for batch_size in batch_sizes}

    X_train_scaled_dict = {}
    X_val_scaled_dict = {}
    y_train_dict = {}
    y_val_dict = {}

    for batch_size in batch_sizes:
        X_train_scaled_dict[batch_size] = []
        X_val_scaled_dict[batch_size] = []
        y_train_dict[batch_size] = []
        y_val_dict[batch_size] = []
    
    X_train = X_train[:, 1:]
    for train_idx, val_idx in cv.split(X_train):
        X_train_fold, X_val_fold = X_train[train_idx], X_train[val_idx]
        y_train_fold, y_val_fold = y_train[train_idx], y_train[val_idx]
        
        standard_scaler = preprocessing.StandardScaler()
        X_train_fold_scaled = standard_scaler.fit_transform(X_train_fold)
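        # Reuse the statistics fitted on the training fold; refitting on the
        # validation fold would scale the two sets inconsistently.
        X_val_fold_scaled = standard_scaler.transform(X_val_fold)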
        
        for batch_size in batch_sizes:
            X_train_scaled_dict[batch_size].append(X_train_fold_scaled)
            X_val_scaled_dict[batch_size].append(X_val_fold_scaled)
            y_train_dict[batch_size].append(y_train_fold)
            y_val_dict[batch_size].append(y_val_fold)
    
    return X_train_scaled_dict, X_val_scaled_dict, y_train_dict, y_val_dict

batch_sizes = [128,256,512,1024]
X_train_scaled_dict, X_val_scaled_dict, y_train_dict, y_val_dict = generate_cv_folds_for_batch_sizes(batch_sizes, X_train.to_numpy(), y_train)
# sanity check: note that 6751 / 8439 is around 80%
# print(X_train_scaled_dict)
print(len(X_train_scaled_dict))
print(len(X_train_scaled_dict[128]))
print(len(X_train_scaled_dict[128][0]))
print(len(X_train_scaled_dict[256][2][0]))
print(X_train_scaled_dict[128][0])

# sanity check: this is the other 20%
# print(X_val_scaled_dict)
print(len(X_val_scaled_dict))
print(len(X_val_scaled_dict[128]))
print(len(X_val_scaled_dict[128][0]))
print(len(X_val_scaled_dict[256][2][0]))
print(X_val_scaled_dict[128][0])

# print(y_train_dict)
print(len(y_train_dict))
print(len(y_train_dict[128]))
print(len(y_train_dict[128][0]))
print(y_train_dict[128][0])
print(y_train_dict[128][0][1])

# print(y_val_dict)
print(len(y_val_dict))
print(len(y_val_dict[128]))
print(len(y_val_dict[128][0]))
print(y_val_dict[128][0])
print(y_val_dict[128][0][1])

4
5
6751
77
[[-0.49106531  0.01265512  0.05686221 ...  1.18119068 -0.38849754
  -1.03574435]
 [-0.79797693 -0.98345107 -0.22072241 ...  0.05499416  0.55841098
   0.96548922]
 [-0.71053897 -0.56340629  1.33516163 ...  1.38367374 -0.88628092
   0.96548922]
 ...
 [ 2.64264068  1.74732306 -1.15502705 ...  0.18848147 -0.14767633
  -1.03574435]
 [-0.8485064  -1.1473147  -0.59740884 ... -0.8292896  -0.76430263
  -1.03574435]
 [-0.77051526 -0.92344468 -1.01371018 ... -1.33465669  0.51198613
   0.96548922]]
4
5
1688
77
[[-0.82343749 -1.10706956 -0.41688652 ... -0.16468361 -0.68326526
  -1.02763283]
 [ 0.98136812  2.11967717  0.5703326  ...  1.19125302 -0.5304174
  -1.02763283]
 [ 2.26269824  1.56816612 -0.46104367 ... -0.4657886  -0.76811365
  -1.02763283]
 ...
 [-0.84217921 -1.13612265 -1.27788254 ... -0.73719861  0.23449046
   0.97311021]
 [-0.82817678 -1.14106785  0.41268208 ...  0.68536118 -0.9647467
  -1.02763283]
 [-0.6952614  -0.71454386 -0.90409557 ...  0.72641355 -0.53359531
   0.97311

4. Perform hyperparameter tuning for the different batch sizes with 5-fold cross validation.
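
One way to organise the tuning loop is to train a fresh model per fold and time only the final epoch. The following is a rough sketch on synthetic data. `make_model` is a hypothetical stand-in for the `MLP` from QA1 (whose definition is not shown here), `run_fold` is likewise a made-up name, and the 3-epoch run only keeps the example quick.

```python
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def make_model(n_features, n_hidden, n_classes):
    # Hypothetical stand-in for the MLP defined in QA1.
    return nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU(),
                         nn.Linear(n_hidden, n_classes))

def run_fold(x_tr, y_tr, x_va, y_va, batch_size, epochs):
    """Train a fresh model on one fold; return (validation accuracy, last-epoch time)."""
    model = make_model(x_tr.shape[1], 16, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(x_tr, y_tr), batch_size=batch_size, shuffle=True)

    last_epoch_time = 0.0
    for epoch in range(epochs):
        start = time.perf_counter()
        for xb, yb in loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()
        if epoch == epochs - 1:  # range(epochs) ends at epochs - 1
            last_epoch_time = time.perf_counter() - start

    with torch.no_grad():  # evaluation needs no gradients
        acc = (model(x_va).argmax(1) == y_va).float().mean().item()
    return acc, last_epoch_time

torch.manual_seed(0)
x = torch.randn(200, 5)
y = (x[:, 0] > 0).long()  # synthetic binary labels
acc, seconds = run_fold(x[:160], y[:160], x[160:], y[160:], batch_size=32, epochs=3)
print(f"accuracy={acc:.3f}, last epoch took {seconds:.4f}s")
```

Creating the model inside `run_fold` means each fold starts from new weights, so the mean over folds reflects the batch size rather than leftover training from earlier folds.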

In [92]:
def intialise_loaders_batch(X_train_scaled, y_train, X_test_scaled, y_test, batch_size):

#     print("X_train_scaled in initialise loaders batch")
#     print(len(X_train_scaled[0]))
    train_data = BatchCustomDataset(X_train_scaled,y_train)
#     print(len(train_data[1]))
    test_data = BatchCustomDataset(X_test_scaled,y_test)
    
    train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)
    
    return train_dataloader, test_dataloader

def train_loop_batch(dataloader, model, loss_fn, optimizer, x_test, y_test):
    # put within the epochs loop
#     size = len(dataloader.dataset)
#     num_batches = len(dataloader)
#     print(size)
#     print(num_batches)
#     train_loss, train_correct = 0, 0
    acc_ = []
#     print
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
#         print(len(X[0]))
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
#         print(len(x_test[0]))
        # Evaluate on the validation fold; no gradients are needed here
        with torch.no_grad():
            pred = model(torch.tensor(x_test, dtype=torch.float))
#         print(pred)
#         print(y_test)
#         acc__ = (pred.argmax(1) == torch.tensor(y_test, dtype=torch.float).argmax(1)).type(torch.float).mean()
        acc__ = (pred.argmax(1) == torch.tensor(y_test, dtype=torch.float)).type(torch.float).mean()
        
        acc_.append(acc__.item())
        
    return acc_
#         train_loss += loss.item()
#         train_correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    
#     train_loss /= num_batches
#     train_correct /=size

#     return train_loss, train_correct

# YOUR CODE HERE
def find_optimal_hyperparameter(X_train_scaled_dict, X_val_scaled_dict, y_train_dict, y_val_dict, batch_sizes):
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    
    model = MLP(77,128,2)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()
    
    cross_validation_times = []
#     acc = []
    foldaccuracyofabatchsize = []
    timeforafoldforthatbatchsize = []
    meanaccuracyofabatchsizelist = []
    meantimeofabatchsizelist = []
    for batch_size in batch_sizes:
        print(foldaccuracyofabatchsize)
        print(timeforafoldforthatbatchsize)
        foldaccuracyofabatchsize = []
        timeforafoldforthatbatchsize = []
        for idxy in range(0,5):
            x_train = X_train_scaled_dict[batch_size][idxy]
            y_train = y_train_dict[batch_size][idxy]
            x_test = X_val_scaled_dict[batch_size][idxy]
            y_test = y_val_dict[batch_size][idxy]
            
#             acc_ = []
#             time = []
#             print(len(x_train[0]))
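            # Start every fold from a fresh model and optimizer so that weights
            # learned on an earlier fold or batch size do not carry over.
            model = MLP(77,128,2)
            optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
            train_dataloader, test_dataloader = intialise_loaders_batch(x_train, y_train, x_test, y_test, batch_size)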
#             print(train_dataloader.dataset)
            for epoch in range(100):
                start = time.time()
                acc_ = train_loop_batch(train_dataloader, model, loss_fn, optimizer, x_test, y_test)
                end = time.time()
                # for a fold, the list of accuracies for the batches of that epoch^
                if epoch == 99:  # final epoch: range(100) yields 0..99, so epoch == 100 never fires
                    foldaccuracyofabatchsize.append(np.mean(np.array(acc_), axis = 0))
#                     the accuracy for the last epoch of that fold - the fold accuracy for that batch size, length is 5
                    timeforafoldforthatbatchsize.append(end-start)
        meanaccuracyofabatchsizelist.append(np.mean(np.array(foldaccuracyofabatchsize), axis = 0))
        meantimeofabatchsizelist.append(np.mean(np.array(timeforafoldforthatbatchsize), axis = 0))
        # length should be 4^

#         acc_ = []
#         for no_hidden in hidden_units:
        
#             model = FFN(no_inputs, no_hidden, no_outputs)
    
#             loss_fn = torch.nn.CrossEntropyLoss()
#             optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    
#             for epoch in range(100):
#                 pred = model(torch.tensor(x_train, dtype=torch.float))
#                 loss = loss_fn(pred, torch.tensor(y_train, dtype=torch.float))
    
#                 # Backpropagation
#                 optimizer.zero_grad()
#                 loss.backward()
#                 optimizer.step()
    
#             pred = model(torch.tensor(x_test, dtype=torch.float))
#             acc__ = (pred.argmax(1) == torch.tensor(y_test, dtype=torch.float).argmax(1)).type(torch.float).mean()
    
#             acc_.append(acc__.item())
#             accuracylistfor5foldsforabatchsize.append(acc_)
#         acc.append(accuracylistfor5foldsforabatchsize.mean)
#         acc.append(acc_)
    
#     cv_acc = np.mean(np.array(acc), axis = 0)
#     cross_validation_accuracies = cv_acc
    cross_validation_accuracies = meanaccuracyofabatchsizelist
    cross_validation_times = meantimeofabatchsizelist
    print(cross_validation_accuracies)
    print(cross_validation_times)
    return cross_validation_accuracies, cross_validation_times

batch_sizes = [128,256,512,1024]
cross_validation_accuracies, cross_validation_times = find_optimal_hyperparameter(X_train_scaled_dict, X_val_scaled_dict, y_train_dict, y_val_dict, batch_sizes)


[]
[]


KeyboardInterrupt: 

5. Plot scatterplot of mean cross validation accuracies for the different batch sizes.

In [None]:
# YOUR CODE HERE
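
A possible shape for this cell, as a sketch: the helper name `plot_cv_accuracies` is hypothetical, and it expects the `batch_sizes` list together with the `cross_validation_accuracies` list returned in step 4.

```python
import matplotlib.pyplot as plt

def plot_cv_accuracies(batch_sizes, accuracies):
    """Scatter plot of mean cross-validation accuracy against batch size."""
    fig, ax = plt.subplots()
    ax.scatter(batch_sizes, accuracies)
    ax.set_xticks(batch_sizes)  # one tick per candidate batch size
    ax.set_xlabel("Batch size")
    ax.set_ylabel("Mean 5-fold CV accuracy (final epoch)")
    ax.set_title("Cross-validation accuracy vs. batch size")
    return fig, ax

# In the notebook: plot_cv_accuracies(batch_sizes, cross_validation_accuracies)
```

Calling `plt.show()` afterwards displays the figure when running outside a notebook.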

6. Create a table of time taken to train the network on the last epoch against different batch sizes. Select the optimal batch size and state a reason for your selection.

In [None]:
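# Assumes step 4 ran to completion, so cross_validation_times holds one
# mean last-epoch time per batch size.
df = pd.DataFrame({'Batch Size': batch_sizes,
                   'Last Epoch Time': cross_validation_times
                  })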

df

In [None]:
# YOUR CODE HERE
optimal_batch_size =
reason =
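
Since the choice trades accuracy against training time, one possible rule (a sketch, not the required answer) is to keep every batch size whose mean accuracy is within a small tolerance of the best and then pick the fastest of those. The function name `pick_batch_size` and the 0.01 tolerance are assumptions for illustration, and the numbers in the demo call are placeholders, not results from this notebook.

```python
def pick_batch_size(batch_sizes, accuracies, times, tolerance=0.01):
    """Return the fastest batch size among those within `tolerance` of the best accuracy."""
    best_acc = max(accuracies)
    candidates = [
        (t, bs) for bs, acc, t in zip(batch_sizes, accuracies, times)
        if best_acc - acc <= tolerance
    ]
    return min(candidates)[1]  # smallest time wins; ties fall back to the smaller batch size

# Placeholder numbers purely to exercise the function (not measured values).
chosen = pick_batch_size([128, 256, 512, 1024],
                         [0.70, 0.695, 0.66, 0.60],
                         [2.0, 1.2, 0.8, 0.6])
print(chosen)
```

With real results, the same call would take `batch_sizes`, `cross_validation_accuracies` and `cross_validation_times` from step 4, and the stated reason can then cite both the accuracy gap and the time saved.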