# Machine Learning
## Programming Assessment 5: Neural Networks

### Instructions


*   The aim of this assignment is to learn machine learning tools - Keras, Sklearn and PyTorch.
*   You must use the Python programming language.
*   You can add as many code/markdown cells as required.
*   ALL cells must be run (and outputs visible) in order to get credit for your work.
*   Please use procedural programming style and comment your code thoroughly.
*   There are three parts of this assignment. The import statements for the required libraries is already given.


### Introduction
In this assignment, you will be using neural networks to implement a simplified version of a speech recognizer which aims to identify what digit has been spoken in a given audio file.

In order to accomplish this, you will be using different toolkits, popularly used in machine learning for training models. In this assignment, you will be introduced to Sklearn, Keras, and Pytorch. An implementation from scratch is not required for the purposes of this assignment.

Have fun!

In [1]:
import numpy as np
import pandas as pd

## Part 1: Feature Extraction
You will the MNIST audio dataset which can be downloaded from [here](https://www.kaggle.com/datasets/sripaadsrinivasan/audio-mnist). The dataset contains audio recordings, where speakers say digits (0 to 9) out loud. Use the following line of code to read the audio file:
```python
audio, sr = librosa.load(file_path, sr=16000)
```
You need to extract MFCC features for each audio file, the feature extraction code is give (you can read about MFCC from [here](https://link.springer.com/content/pdf/bbm:978-3-319-49220-9/1.pdf)). Length of each feature vector will be 13. You need to save all the feature vectors in a csv file with ith column representing ith feature, and each row representing an audio file. Add a 'y' column to the csv file and append the labels column at the end. Your csv file should look like this:

| x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| -11.347038 | -8.070062 | -0.915299 | 6.859546 | 8.754656 | -3.440287 | -5.738487 | -21.853178 | -9.859462 | 3.584948 | -2.661195	| 1.023747 | -4.574332 | 2 |

Print out 2 vectors in this notebook.

Split the dataset into train and test with 80:20 ratio. Print the train data size and test data size.

In [2]:
from glob import glob
import python_speech_features as mfcc
import librosa
from sklearn.model_selection import train_test_split

In [3]:
def get_MFCC(audio, sr):
    features = mfcc.mfcc(audio, sr, 0.025, 0.01, 13, appendEnergy = True)
    return np.mean(features, axis=0)

In [4]:
from IPython.display import Audio
file_path = 'data/01/0_01_0.wav'
audio, sr = librosa.load(file_path, sr=16000)
Audio(audio, rate=sr)

In [5]:
import os
data_dir = 'data/'
file_paths = glob(os.path.join(data_dir,"*/*.wav"))
data = []
labels = []
print(len(file_paths))

30000


In [14]:
from tqdm import tqdm
for file_path in tqdm(file_paths):
    label = int(os.path.basename(file_path).split('_')[0])
    audio,sr = librosa.load(file_path, sr=16000)
    features = get_MFCC(audio, sr)
    data.append(features)
    labels.append(label)

labels = np.array(labels)
df = pd.DataFrame(data, columns=[f"x{i+1}" for i in range(13)])
df['y'] = labels
df.to_csv("mnist_features.csv", index=False)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30000/30000 [09:02<00:00, 55.30it/s]


In [4]:
df = pd.read_csv("mnist_features.csv")

In [5]:
df.head()

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,y
0,-10.666645,-1.41154,-2.47352,7.168286,-3.494081,-2.785832,-14.398475,-4.217254,3.39559,-11.173119,-2.370115,3.899111,-11.044119,0
1,-11.414706,-2.149634,-0.706419,10.050852,1.945194,-5.537895,-12.362039,0.239243,1.692732,-6.133325,3.241282,5.820779,-7.441492,0
2,-10.242019,-2.336824,-5.271823,5.837609,-1.33789,-2.493693,-13.57208,-7.714594,0.306567,-6.59189,-1.882395,3.712337,-6.204464,0
3,-10.193638,-3.643128,-8.129836,2.650802,-0.697407,-8.648287,-18.1957,-2.643146,9.175978,-4.324699,-1.795747,7.319704,-11.465352,0
4,-10.485194,-4.505045,-0.209834,2.821504,-3.985702,-4.391896,-15.959238,-1.718976,2.344726,-8.538168,-1.72591,4.176098,-7.359689,0


In [6]:
df.y.unique()

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [7]:
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)
print("Train Size", train_df.shape)
print("Test Size", test_df.shape)

Train Size (24000, 14)
Test Size (6000, 14)


## Part 2: Neural Network Implementation

### Task 2.1:  Scikit-learn

In this part you will use the [Scikit-learn](https://scikit-learn.org/stable/index.html) to implement the [Neural Network](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) and apply it to the MNIST audio dataset (provided in part 1). Split the training dataset into train and evaluation data with 90:10 ratio. Run evaluation on X_eval while training on X_train. Tune the hyperparameters to get the best possible classification accuracy. You need to report accuracy, recall, precision and F1 score on the test dataset and print the confusion matrix.

Expected value for accuracy is 87 or above.

In [38]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [39]:
def normalisation(X_train, X_test):

    X_train = np.array(X_train, dtype=float)
    X_test = np.array(X_test, dtype=float)

    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1  # Avoid division by zero

    X_train_norm = (X_train - mean) / std
    X_test_norm = (X_test - mean) / std

    return X_train_norm, X_test_norm



In [40]:
X = train_df.drop(columns=['y'], axis=1)
y = train_df['y']

In [41]:
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=42)
print("Train Size", X_train.shape)
print("Val Size", X_eval.shape)

Train Size (21600, 13)
Val Size (2400, 13)


In [42]:
X_train_sc, X_eval_sc = normalisation(X_train, X_eval)

In [43]:
mlp = MLPClassifier(hidden_layer_sizes=(256,128,64,32), max_iter=200, learning_rate='adaptive', early_stopping=True, n_iter_no_change=5, random_state=42)
mlp.fit(X_train_sc, y_train)
y_eval_pred = mlp.predict(X_eval_sc)
print("Scikit-Learn MLP:")
print(classification_report(y_eval, y_eval_pred))
print(confusion_matrix(y_eval, y_eval_pred))

Scikit-Learn MLP:
              precision    recall  f1-score   support

           0       0.92      0.91      0.91       225
           1       0.95      0.96      0.95       243
           2       0.93      0.92      0.93       243
           3       0.92      0.95      0.94       255
           4       0.98      0.96      0.97       235
           5       0.97      0.99      0.98       255
           6       1.00      1.00      1.00       237
           7       0.97      0.98      0.98       235
           8       0.98      0.97      0.98       243
           9       0.97      0.95      0.96       229

    accuracy                           0.96      2400
   macro avg       0.96      0.96      0.96      2400
weighted avg       0.96      0.96      0.96      2400

[[205   2   6   5   2   2   0   3   0   0]
 [  1 233   2   0   1   2   0   0   0   4]
 [ 12   0 223   8   0   0   0   0   0   0]
 [  1   0   6 243   0   0   0   1   3   1]
 [  1   6   0   0 226   2   0   0   0   0]
 [  1   

In [44]:
X_test = test_df.drop(columns=['y'], axis=1)
y_test = test_df['y']

In [45]:
_,X_test_sc = normalisation(X_train, X_test)

In [46]:
y_test_pred = mlp.predict(X_test_sc)
print("Scikit-Learn MLP for Testing Data:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))

Scikit-Learn MLP for Testing Data:
Accuracy: 0.9521666666666667
              precision    recall  f1-score   support

           0       0.92      0.93      0.93       596
           1       0.93      0.97      0.95       599
           2       0.93      0.93      0.93       605
           3       0.95      0.95      0.95       589
           4       0.98      0.96      0.97       607
           5       0.97      0.99      0.98       603
           6       1.00      0.99      0.99       620
           7       0.98      0.97      0.98       589
           8       0.97      0.96      0.96       583
           9       0.97      0.94      0.96       609

    accuracy                           0.96      6000
   macro avg       0.96      0.96      0.96      6000
weighted avg       0.96      0.96      0.96      6000

[[552   6  26   2   3   1   0   4   0   2]
 [  1 582   2   0   4   1   0   0   0   9]
 [ 24   4 560  12   0   0   0   2   2   1]
 [  4   0  14 558   0   0   0   0  12   1]
 [  4

### Task 2.2: Tensorflow Keras

In this part you will use the [Keras](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential) to implement the [Neural Network](https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/) and apply it to the MNIST audio dataset (provided in part 1). Split the training dataset into train and evaluation data with 90:10 ratio. Run evaluation on X_eval while training on X_train. Tune the hyperparameters to get the best possible classification accuracy. You need to report accuracy, recall, precision and F1 score on the test dataset and print the confusion matrix.

Expected value for accuracy is 87 or above.

In [47]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import optimizers

In [48]:
# Set the parameters accordingly
LEARNING_RATE = 0.001
BATCH_SIZE = 32
EPOCHS = 30

In [49]:
model = Sequential([
    Input(shape=(13,)),
    Dense(128, activation='silu'),
    Dense(64, activation='silu'),
    Dense(32, activation='silu'),
    Dense(10, activation='softmax')
])

model.compile(optimizer=optimizers.Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_sc, y_train, validation_data=(X_eval_sc, y_eval), epochs=EPOCHS, batch_size=BATCH_SIZE)
y_pred = np.argmax(model.predict(X_test_sc), axis=1)
print(f"TensorFlow Keras MLP:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
TensorFlow Keras MLP:
Accuracy: 0.9543333333333334
              precision    recall  f1-score   support

           0       0.92      0.91      0.91       596
           1       0.94      0.95      0.95       599
           2       0.91      0.92      0.92       605
           3       0.95      0.93      0.94       589
           4       0.99      0.96      0.97       607
           5       0.97      0.99      0.98       603
           6       1.00      0.99      0.99       620
           7       0.97      0.98      0.98       589
           8       0.94      0.97      0.96       583
           9       0.95      0.94      0.95       609

  

### Task 2.3: Pytorch

In this part you will use the [Keras](https://pytorch.org/docs/stable/nn.html) to implement the [Neural Network](https://medium.com/analytics-vidhya/a-simple-neural-network-classifier-using-pytorch-from-scratch-7ebb477422d2) and apply it to the MNIST audio dataset (provided in part 1). Split the training dataset into train and evaluation data with 90:10 ratio. Run evaluation on X_eval while training on X_train. You need to use DataLoader to generate batches of data. Tune the hyperparameters to get the best possible classification accuracy. You need to report training loss, training accuracy, validation loss and validation accuracy after each epoch in the following format:
```
Epoch 1/2
loss: 78.67749792151153 - accuracy: 0.6759259259259259 - val_loss: 6.320814955048263 - val_accuracy: 0.7356481481481482
Epoch 2/2
loss: 48.70551285566762 - accuracy: 0.7901234567901234 - val_loss: 6.073690168559551 - val_accuracy: 0.7791666666666667
```
You need to report accuracy, recall, precision and F1 score on the test dataset and print the confusion matrix.

Expected value for accuracy is 87 or above.

In [30]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score

In [31]:
class Data(Dataset):
    def __init__(self, X_train, y_train):
        # Code here
        self.X = torch.tensor(X_train, dtype=torch.float32)
        self.y = torch.tensor(y_train.values, dtype=torch.long)

    def __getitem__(self, index):
        # Code here
        return self.X[index], self.y[index]

    def __len__(self):
        # Code here
        return len(self.X)

In [32]:
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden1_size, hidden2_size, output_size):
        super(NeuralNetwork, self).__init__()
        # Code here
        self.fc1 = nn.Linear(input_size, hidden1_size)
        self.fc2 = nn.Linear(hidden1_size, hidden2_size)
        self.fc3 = nn.Linear(hidden2_size, output_size)

    def forward(self, x):
        # Code here
        x = torch.functional.F.silu(self.fc1(x))
        x = torch.functional.F.silu(self.fc2(x))
        x = self.fc3(x)
        return x

In [33]:
# Set the parameters accordingly
LEARNING_RATE = 0.001
BATCH_SIZE = 32
EPOCHS = 30

In [34]:
# Set the loss function and optimizer accordingly
loss_function = nn.CrossEntropyLoss()
# Initialize the model
model = NeuralNetwork(input_size=13, hidden1_size=128, hidden2_size=64, output_size=10)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [35]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [36]:
train_loader = DataLoader(Data(X_train_sc, y_train), batch_size=BATCH_SIZE, shuffle=True)
eval_loader = DataLoader(Data(X_eval_sc, y_eval), batch_size=BATCH_SIZE, shuffle=False)
model.to(device)

for epoch in range(1, EPOCHS + 1):
    model.train()
    train_loss = 0.0
    train_correct = 0

    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)

        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = loss_function(outputs, y_batch)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        preds = torch.argmax(outputs, dim=1)
        train_correct += (preds == y_batch).sum().item()

    train_accuracy = train_correct / len(X_train)

    model.eval()
    val_loss = 0.0
    val_correct = 0

    with torch.no_grad():
        for X_batch, y_batch in eval_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            outputs = model(X_batch)
            loss = loss_function(outputs, y_batch)

            val_loss += loss.item()
            preds = torch.argmax(outputs, dim=1)
            val_correct += (preds == y_batch).sum().item()

    val_accuracy = val_correct / len(X_eval)
    
    print(f"Epoch {epoch}/{EPOCHS}")
    print(f"loss: {train_loss:.8f} - accuracy: {train_accuracy:.16f} " f"- val_loss: {val_loss:.8f} - val_accuracy: {val_accuracy:.16f}")

Epoch 1/30
loss: 411.84001663 - accuracy: 0.7943981481481481 - val_loss: 26.81809171 - val_accuracy: 0.8708333333333333
Epoch 2/30
loss: 222.01539905 - accuracy: 0.8752314814814814 - val_loss: 22.09319066 - val_accuracy: 0.8895833333333333
Epoch 3/30
loss: 190.18859660 - accuracy: 0.8937500000000000 - val_loss: 20.16711137 - val_accuracy: 0.8991666666666667
Epoch 4/30
loss: 170.62432621 - accuracy: 0.9057407407407407 - val_loss: 17.89054187 - val_accuracy: 0.9083333333333333
Epoch 5/30
loss: 156.77507711 - accuracy: 0.9137500000000000 - val_loss: 16.90262080 - val_accuracy: 0.9150000000000000
Epoch 6/30
loss: 145.75258925 - accuracy: 0.9215277777777777 - val_loss: 17.19172376 - val_accuracy: 0.9154166666666667
Epoch 7/30
loss: 135.99204376 - accuracy: 0.9256481481481481 - val_loss: 15.11402527 - val_accuracy: 0.9279166666666666
Epoch 8/30
loss: 128.56776706 - accuracy: 0.9295833333333333 - val_loss: 14.67488606 - val_accuracy: 0.9295833333333333
Epoch 9/30
loss: 120.40518783 - accuracy

In [37]:
test_loader = DataLoader(Data(X_test_sc, y_test), batch_size=BATCH_SIZE, shuffle=False)

model.eval()

all_preds = []
all_labels = []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)

        outputs = model(X_batch)
        preds = torch.argmax(outputs, dim=1)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(y_batch.cpu().numpy())

accuracy = accuracy_score(all_labels, all_preds)
clf_report = classification_report(all_labels, all_preds)
conf_matrix = confusion_matrix(all_labels, all_preds)

print("Test Set Evaluation:")
print(f"Accuracy : {accuracy:.4f}")
print(f"Classification Report :")
print(clf_report)
print("Confusion Matrix:")
print(conf_matrix)

Test Set Evaluation:
Accuracy : 0.9540
Classification Report :
              precision    recall  f1-score   support

           0       0.89      0.94      0.92       596
           1       0.94      0.95      0.95       599
           2       0.96      0.85      0.90       605
           3       0.93      0.96      0.95       589
           4       0.98      0.98      0.98       607
           5       0.98      0.99      0.98       603
           6       0.99      0.99      0.99       620
           7       0.97      0.97      0.97       589
           8       0.96      0.97      0.97       583
           9       0.95      0.94      0.94       609

    accuracy                           0.95      6000
   macro avg       0.95      0.95      0.95      6000
weighted avg       0.95      0.95      0.95      6000

Confusion Matrix:
[[563   2  15   4   4   0   1   3   0   4]
 [  4 568   3   0   3   1   0   0   0  20]
 [ 47   4 516  29   0   0   0   6   2   1]
 [  2   0   1 567   0   0   0  