# MTH 5320 - Neural Networks
## Tsz Chung Ho
## Homework 1


## Problem 1
Let $\mu$ and $\sigma$ be the mean and standard deviation of $\{x_i\}_{i=1}^{n} \subset \mathbb{R}$: 

$\mu = \frac{1}{n}\sum_{i=1}^n x_i$ and $\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2$

Assume $\sigma > 0$ and let $z_i = \frac{x_i - \mu}{\sigma}$.

We show that the mean ($\mu_z$) and variance ($\sigma_{z}^2$) of $\{z_i\}_{i=1}^{n}$ are 0 and 1, respectively.


We have 

\\[\mu_z = \frac{1}{n} \sum_{i=1}^n z_i = \frac{1}{n\sigma} \bigg(\sum_{i=1}^n x_i - n\mu \bigg) = \frac{1}{n\sigma} (n\mu - n\mu) = 0 \\]

and

\begin{equation}
\sigma_{z}^2 = \frac{1}{n} \sum_{i=1}^n (z_i - \mu_z)^2 = \frac{1}{n} \sum_{i=1}^n z_i^2 = \frac{1}{n\sigma^2} \sum_{i=1}^n (x_i - \mu)^2
=\frac{1}{n\sigma^2} n\sigma^2 = 1 
\end{equation}
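
As a quick numerical sanity check of this result, the following sketch standardizes a small sample and prints the resulting mean and variance. The values in `x` are made up for illustration; `np.std` is used with its default (division by $n$), which matches the definition of $\sigma$ above.

```python
import numpy as np

# Arbitrary illustrative sample (not part of the assignment data)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
sigma = x.std()  # population standard deviation: divides by n, as in the definition above

# Standardize: subtract the mean, divide by the standard deviation
z = (x - mu) / sigma

print(z.mean())  # approximately 0
print(z.var())   # approximately 1
```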




## Problem 2

Let $x=(x_1,...,x_m) \in \mathbb{R}^m$

Claim: $||x||_{\infty} = \lim_{p \to \infty} ||x||_p$

We have $||x||_p^p = \sum_{i=1}^m |x_i|^p \leq m\max_i\{|x_i|\}^p$, so

$||x||_p \leq m^{1/p} ||x||_{\infty}$.

We also have $(0 + ... + (\max_i{|x_i|})^p + ... + 0) \leq (|x_1|^p +...+|x_m|^p)$, so
$||x||_{\infty} = \max_i\{|x_i|\} \leq (|x_1|^p +...+|x_m|^p)^{1/p} = ||x||_p$. Hence

\\[||x||_{\infty} \leq ||x||_p \leq  m^{1/p} ||x||_{\infty}.\\]

Since $m^{1/p} \to 1$ as $p \to \infty$, the Squeeze theorem gives the result.
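
The two-sided bound can also be observed numerically. The sketch below uses an arbitrary illustrative vector (an assumption, not part of the problem) and checks that $||x||_p$ stays between $||x||_{\infty}$ and $m^{1/p}||x||_{\infty}$ while approaching $||x||_{\infty}$ as $p$ grows.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0, 2.0])   # arbitrary illustrative vector
m = x.size
inf_norm = np.linalg.norm(x, np.inf)  # max_i |x_i|

norms = []
for p in [1, 2, 4, 8, 16, 32, 64]:
    p_norm = np.linalg.norm(x, p)
    upper = m ** (1.0 / p) * inf_norm
    # the sandwich inequality from the proof (small tolerance for rounding)
    assert inf_norm <= p_norm + 1e-12
    assert p_norm <= upper + 1e-12
    norms.append(p_norm)
    print(p, p_norm, upper)

print(inf_norm)
```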

## Problem 3: Implement kNN with hyperparameters k and p
We allow the Lp norm to be varied as a hyperparameter as well. 

In [56]:
import numpy as np
from scipy.stats import mode
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
import matplotlib.pyplot as plt
import time

# increase the width of boxes in the notebook file (this is only cosmetic)
np.set_printoptions(linewidth=180)

In [57]:
# Create a class for the k-nearest neighbor classifier
class kNearestNeighborClassifier:
    # constructor to save the hyperparameters k and p
    def __init__(self, k = 5, p = 2):
        # initialize the number of neighbors and the Lp norm to use
        self.neighbors = k
        self.norm = p
        # print a warning if k is even
        #if k % 2 == 0:
        #    print('[WARNING] An odd number is recommended for k to avoid tie votes in the kNN classifier.')
        # print a warning if p <= 0
        #if p <= 0:
        #    print('[WARNING] A positive number is needed for p.')
    
    # fit the model to the training data (for kNN, there's no actual fitting involved)
    def fit(self, X, y):
        '''
        Record the class labels for training data
        
        Inputs
        ------
        
        X: a matrix of datapoints from the training data, each row is a point
        y: a vector of labels for each datapoint
        
        '''
        
        # record the unique class labels
        self.classes = np.unique(y)
        
        # print a warning if we only input one class
        if self.classes.shape[0] < 2:
            print('[WARNING] There should be at least two classes in the input data.')
            
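        # record the data and labels; the data is cast to float so that integer inputs
        # (e.g. uint8 pixel values) do not wrap around when differences are taken in predict
        self.data = np.asarray(X, dtype=float)
        self.labels = y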
    
    # use the classifier to predict the classifications of the testing data
    def predict(self, X):
        '''
        Predict the class labels for the input data
        
        Inputs
        ------
        
        X: a matrix of datapoints from the testing data, each row is a point
        
        Outputs
        -------
        
        classes: the class predicted by the k-nearest neighbor classifier for each testing datapoint
        
        '''
        # initialize the predicted classes
        yPredicted = np.empty([X.shape[0],1])
        
        # loop over the datapoints in X
        for row in range(X.shape[0]):
            datapoint = X[row,]
            
            # find the distances from the datapoint to each training point using the Lp norm
            distances = np.linalg.norm(self.data - datapoint, self.norm, axis = 1)
            
            # find the indices of the smallest k distances
            indices = np.argsort(distances)[:self.neighbors]
            
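            # find the class labels of the nearest neighbors (flattened to 1-D)
            nearestClasses = np.ravel(self.labels[indices])
            
            # determine the predicted class by finding the mode; ties go to the smallest label.
            # np.unique is used so the result does not depend on the shape returned by
            # scipy.stats.mode, which differs between SciPy versions
            values, counts = np.unique(nearestClasses, return_counts=True)
            yPredicted[row] = values[np.argmax(counts)]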
            
        return yPredicted
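
To illustrate how the classifier is meant to behave, here is a small self-contained sketch on a toy 2-D dataset. The function `knn_predict` and the toy points are hypothetical and written only for this illustration; it takes the majority vote with `np.unique`, so it is a compact stand-in for the class above rather than the class itself.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=5, p=2):
    """Majority vote among the k nearest training points under the Lp distance."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y).ravel()
    predictions = []
    for point in test_X:
        # Lp distance from this point to every training point
        distances = np.linalg.norm(train_X - point, ord=p, axis=1)
        # indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # majority vote; np.unique sorts labels, so ties go to the smallest label
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy data: two well-separated clusters with labels 0 and 1
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[0.5, 0.5], [5.5, 5.5]]

pred = knn_predict(train_X, train_y, test_X, k=3, p=2)
print(pred)  # [0 1]
```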

## Problem 4: Classify CIFAR10 dataset

We train and test our classifier using the first 10000 images from the CIFAR10 dataset.
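
One point worth checking before computing distances on raw images: `cifar10.load_data()` returns pixel arrays of dtype `uint8`, and subtracting unsigned 8-bit arrays in NumPy wraps around modulo 256 instead of producing negative values. The sketch below uses two made-up pixel vectors (not CIFAR10 data) to show the effect and how casting to float avoids it.

```python
import numpy as np

# Two made-up "pixel" vectors with the same dtype as raw CIFAR10 images
a = np.array([10, 200, 30], dtype=np.uint8)
b = np.array([20, 100, 40], dtype=np.uint8)

wrapped = a - b                            # uint8 arithmetic: 10 - 20 wraps to 246
exact = a.astype(float) - b.astype(float)  # float arithmetic keeps the sign

print(wrapped)  # [246 100 246]
print(exact)

# The resulting L2 distances differ substantially
print(np.linalg.norm(wrapped, 2))
print(np.linalg.norm(exact, 2))
```

If raw pixel data is passed to a distance computation, converting it to a floating-point type first is one way to make sure the differences are the intended ones.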

In [58]:
# Importing CIFAR10 PICTURES
from tensorflow.keras.datasets import cifar10 
cifarData = cifar10.load_data()

In [59]:
# Taking first 10000 images 
num_pics = 10000
# Storing picture data as 1D vectors for norm computation and labels
X = cifarData[0][0][:num_pics].reshape([num_pics, 32*32*3])
Y = cifarData[0][1][:num_pics]

# Randomly splitting data into training set and testing set (0.75-0.25 ratio)
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25, random_state = 1)

In [66]:
# Fit the model to training data
model = kNearestNeighborClassifier() #Default k=5, p=2
model.fit(trainX, trainY)
start = time.time()
print(classification_report(testY, model.predict(testX)))
end = time.time() - start
print(f"Time taken is: {end}. (k=5,p=2). Not normalized.")

              precision    recall  f1-score   support

           0       0.14      0.68      0.23       182
           1       0.25      0.05      0.08       214
           2       0.19      0.32      0.24       205
           3       0.20      0.10      0.14       221
           4       0.15      0.09      0.12       202
           5       0.14      0.10      0.12       182
           6       0.17      0.06      0.09       205
           7       0.23      0.05      0.08       216
           8       0.28      0.28      0.28       195
           9       0.62      0.06      0.10       178

    accuracy                           0.17      2000
   macro avg       0.24      0.18      0.15      2000
weighted avg       0.23      0.17      0.15      2000

Time taken is: 137.99938201904297. (k=5,p=2). Not normalized.


## Problem 5
We attempt to improve our accuracy as much as possible by varying the hyperparameters k and p, and by using some normalization methods for our data.
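
For reference, the two preprocessing routines used in this problem act along different axes with their default arguments: `sklearn.preprocessing.normalize` rescales each row (sample) to unit L2 norm, while `sklearn.preprocessing.scale` standardizes each column (feature) to zero mean and unit variance. The following NumPy-only sketch reproduces those two operations on a tiny made-up matrix; it is an illustration of the defaults, not a replacement for the library calls.

```python
import numpy as np

# Tiny made-up data matrix: 3 samples (rows), 2 features (columns)
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

# Row-wise L2 normalization: every sample ends up with unit length
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = X / row_norms

# Column-wise standardization: every feature ends up with mean 0 and variance 1
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_normalized)
print(X_scaled)
```

Note that the first two rows, which differ only by an overall factor of 2, become identical after row-wise normalization.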


In [67]:
# Taking first 10000 images 
num_pics = 10000
# Storing picture data as 1D vectors for norm computation and labels
X = cifarData[0][0][:num_pics].reshape([num_pics, 32*32*3])
Y = cifarData[0][1][:num_pics]

# Randomly splitting data into training and validation+test sets (0.6 - 0.2 - 0.2 ratio)
(trainX, vtX, trainY, vtY) = train_test_split(X, Y, test_size = 0.4, random_state = 1)
(validateX, testX, validateY, testY) = train_test_split(vtX, vtY, test_size = 0.5, random_state = 1)

In [68]:
# Checking that data is split properly
print(len(trainX))
print(len(testX))
print(len(validateX))

6000
2000
2000


In [69]:
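# Normalizing data in two ways
# normalize: rescale each sample (row) to unit L2 norm
normalized_trainX = normalize(trainX)
normalized_validateX = normalize(validateX)
normalized_testX = normalize(testX)

# scale: standardize each feature (column) to zero mean and unit variance;
# each split is standardized separately, using its own column statistics
scaled_trainX = scale(trainX)
scaled_validateX = scale(validateX)
scaled_testX = scale(testX)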

In [73]:
print(len(normalized_trainX))
print(len(normalized_testX))
print(len(normalized_validateX))

6000
2000
2000


### Tuning hyperparameters with validation sets
We fix p = 2 for the normalized data, because `normalize` rescales each sample to unit L2 norm by default. For now we vary p only with the scaled data. Because of limited computational resources, we only use p = 1, 2 and np.inf.
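
The selection procedure can be sketched as a loop that scores each candidate (k, p) on the validation set and keeps the best one. The sketch below is an assumption-laden illustration: it uses synthetic Gaussian clusters in place of CIFAR10 so that it runs in seconds, a hypothetical helper `knn_predict`, and a full grid of illustrative candidate values.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k, p):
    """Hypothetical helper: majority vote among the k nearest points (Lp distance)."""
    predictions = []
    for point in test_X:
        distances = np.linalg.norm(train_X - point, ord=p, axis=1)
        nearest = np.argsort(distances)[:k]
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Synthetic stand-in data: two Gaussian clusters in 4 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 4)),
               rng.normal(3.0, 1.0, size=(60, 4))])
y = np.array([0] * 60 + [1] * 60)

# Shuffle, then split into training and validation parts
order = rng.permutation(len(y))
X, y = X[order], y[order]
train_X, train_y = X[:80], y[:80]
val_X, val_y = X[80:], y[80:]

# Score every candidate pair on the validation set
results = {}
for k in [1, 3, 5, 7, 9]:
    for p in [1, 2, np.inf]:
        pred = knn_predict(train_X, train_y, val_X, k, p)
        results[(k, p)] = np.mean(pred == val_y)

# Keep the pair with the highest validation accuracy
best = max(results, key=results.get)
print(best, results[best])
```

Only the selected pair would then be evaluated once on the held-out test set, so that the test accuracy is not used for tuning.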


In [78]:
# Testing different k for normalized data (odd k from 1 to 9)
for k in range(1, 10, 2):
    model = kNearestNeighborClassifier(k = k, p = 2)
    model.fit(normalized_trainX, trainY)
    start = time.time()
    print(classification_report(validateY, model.predict(normalized_validateX)))
    end = time.time() - start
    print(f"Time taken is: {end}. (k={k},p=2). normalized.")

              precision    recall  f1-score   support

           0       0.30      0.43      0.35       217
           1       0.61      0.19      0.29       207
           2       0.20      0.33      0.25       203
           3       0.23      0.19      0.21       202
           4       0.28      0.30      0.29       208
           5       0.24      0.23      0.24       186
           6       0.27      0.26      0.26       188
           7       0.50      0.28      0.36       192
           8       0.38      0.61      0.47       209
           9       0.46      0.21      0.29       188

    accuracy                           0.31      2000
   macro avg       0.35      0.30      0.30      2000
weighted avg       0.35      0.31      0.30      2000

Time taken is: 148.77879667282104. (k=1,p=2). normalized.
              precision    recall  f1-score   support

           0       0.24      0.58      0.34       217
           1       0.44      0.21      0.28       207
           2       0

In [81]:
# Testing different k and p for scaled data
# each odd k from 1 to 11 is paired with a single p, cycling through 1, 2, inf:
# (k,p) = (1,1), (3,2), (5,inf), (7,1), (9,2), (11,inf)
ps = [1, 2, np.inf]
for i, k in enumerate(range(1, 12, 2)):
    p = ps[i % 3]
    model = kNearestNeighborClassifier(k = k, p = p)
    model.fit(scaled_trainX, trainY)
    start = time.time()
    print(classification_report(validateY, model.predict(scaled_validateX)))
    end = time.time() - start
    print(f"Time taken is: {end}. (k={k},p={p}). scaled.")

              precision    recall  f1-score   support

           0       0.34      0.43      0.38       217
           1       0.60      0.18      0.28       207
           2       0.18      0.33      0.23       203
           3       0.21      0.16      0.18       202
           4       0.20      0.31      0.24       208
           5       0.28      0.19      0.23       186
           6       0.21      0.23      0.22       188
           7       0.49      0.30      0.37       192
           8       0.38      0.53      0.44       209
           9       0.56      0.26      0.35       188

    accuracy                           0.29      2000
   macro avg       0.35      0.29      0.29      2000
weighted avg       0.35      0.29      0.29      2000

Time taken is: 146.7341730594635. (k=1,p=1). scaled.
              precision    recall  f1-score   support

           0       0.28      0.55      0.37       217
           1       0.50      0.17      0.25       207
           2       0.18  



### Trying chosen hyperparameters (k=1, p=2) and (k=9, p=2) on normalized testing data, and (k=7, p=1) on scaled testing data



In [84]:
model = kNearestNeighborClassifier(k = 1, p = 2)
model.fit(normalized_trainX, trainY)
start = time.time()
print(classification_report(testY, model.predict(normalized_testX)))
end = time.time() - start
print(f"Time taken is: {end}. (k=1,p=2). normalized.")

              precision    recall  f1-score   support

           0       0.28      0.46      0.35       182
           1       0.63      0.19      0.29       214
           2       0.21      0.34      0.26       205
           3       0.26      0.24      0.25       221
           4       0.27      0.28      0.27       202
           5       0.24      0.27      0.26       182
           6       0.30      0.24      0.27       205
           7       0.51      0.24      0.33       216
           8       0.33      0.60      0.42       195
           9       0.48      0.17      0.26       178

    accuracy                           0.30      2000
   macro avg       0.35      0.30      0.30      2000
weighted avg       0.35      0.30      0.30      2000

Time taken is: 147.04698181152344. (k=1,p=2). normalized.


In [83]:
model = kNearestNeighborClassifier(k = 9, p = 2)
model.fit(normalized_trainX, trainY)
start = time.time()
print(classification_report(testY, model.predict(normalized_testX)))
end = time.time() - start
print(f"Time taken is: {end}. (k=9,p=2). normalized.")

              precision    recall  f1-score   support

           0       0.24      0.58      0.34       182
           1       0.60      0.12      0.20       214
           2       0.20      0.40      0.27       205
           3       0.27      0.21      0.24       221
           4       0.26      0.27      0.26       202
           5       0.34      0.23      0.27       182
           6       0.36      0.18      0.24       205
           7       0.67      0.19      0.29       216
           8       0.32      0.65      0.43       195
           9       0.49      0.13      0.21       178

    accuracy                           0.29      2000
   macro avg       0.38      0.30      0.28      2000
weighted avg       0.38      0.29      0.27      2000

Time taken is: 147.26285529136658. (k=9,p=2). normalized.


In [82]:
model = kNearestNeighborClassifier(k = 7, p = 1)
model.fit(scaled_trainX, trainY)
start = time.time()
print(classification_report(testY, model.predict(scaled_testX)))
end = time.time() - start
print(f"Time taken is: {end}. (k=7,p=1). scaled.")

              precision    recall  f1-score   support

           0       0.33      0.55      0.42       182
           1       0.74      0.18      0.29       214
           2       0.20      0.50      0.29       205
           3       0.27      0.18      0.21       221
           4       0.21      0.38      0.27       202
           5       0.42      0.15      0.22       182
           6       0.29      0.21      0.24       205
           7       0.67      0.19      0.29       216
           8       0.41      0.63      0.49       195
           9       0.58      0.19      0.29       178

    accuracy                           0.31      2000
   macro avg       0.41      0.32      0.30      2000
weighted avg       0.41      0.31      0.30      2000

Time taken is: 146.24528813362122. (k=7,p=1). scaled.


## Extra
We use the hyperparameters (k=1, p=2) on all 50000 images loaded from CIFAR10, with a random 0.75-0.25 training-testing split; each part is normalized after the split.

In [51]:
# Using the whole dataset with hyperparameters (k=1,p=2)
X = cifarData[0][0].reshape([len(cifarData[0][0]), 32*32*3])
Y = cifarData[0][1]
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size = 0.25, random_state = 1)

# Normalizing data
normalized_trainX = normalize(trainX)
normalized_testX = normalize(testX)

# scaled_trainX = scale(trainX)
# scaled_testX = scale(testX)

model = kNearestNeighborClassifier(k = 1, p = 2)
model.fit(normalized_trainX, trainY)
start = time.time()
print(classification_report(testY, model.predict(normalized_testX)))
end = time.time() - start
print(f"Time taken is: {end}. (k,p) = (1,2)")

              precision    recall  f1-score   support

           0       0.37      0.50      0.43      1237
           1       0.66      0.26      0.37      1246
           2       0.28      0.36      0.31      1292
           3       0.25      0.26      0.26      1253
           4       0.28      0.40      0.33      1249
           5       0.32      0.33      0.32      1233
           6       0.36      0.29      0.32      1243
           7       0.53      0.30      0.38      1256
           8       0.39      0.68      0.49      1235
           9       0.49      0.23      0.31      1256

    accuracy                           0.36     12500
   macro avg       0.39      0.36      0.35     12500
weighted avg       0.39      0.36      0.35     12500

Time taken is: 5709.178003072739. (k,p) = (1,2)


## Conclusion
We see that normalizing or scaling the data improved our accuracy. In most of our tests normalizing did better than scaling, but one of the scaling tests, (k=7, p=1), turned out to be the best overall on the test set (31% accuracy, versus 30% and 29% for the two normalized runs).

We selected (k=1, p=2) with normalized data to test on the whole dataset and obtained an accuracy of 36%. We note, however, that this result may be specific to our choice of random seed, and the accuracy may change noticeably with another seed. Due to time constraints we were unable to test the case (k=7, p=1) with scaled data on the whole dataset.

With these limitations in mind, the method appears to be more accurate with more training data: accuracy rose from 30% with 6000 training images to 36% with 37500. For future experiments we will vary the seed to investigate further.

Lastly, if more computational resources become available we will test more values of p. When we ran the predictor with p = 3 on a dataset of size 1000, it took 45 seconds; with p = 1, 2 or np.inf it took about 3 seconds on average. We also note that while the np.inf norm can be computed quickly, it yields the least accurate results. A plausible explanation is that the L∞ distance depends only on the largest single-pixel difference: no matter how close two pictures look, if one pixel is far off, the images are considered very different.
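
The sensitivity of the L∞ distance to a single pixel can be illustrated with made-up "images" (plain vectors, not CIFAR10 data): one differs from a reference by a small amount in every pixel, the other is identical to the reference except for one pixel.

```python
import numpy as np

reference = np.zeros(3072)        # same length as a flattened 32x32x3 image
slightly_off = reference + 10.0   # every pixel differs by 10
one_pixel_off = reference.copy()
one_pixel_off[0] = 200.0          # a single pixel differs by 200

# Distance of each variant from the reference under L1, L2 and L-infinity
dist = {}
for p in [1, 2, np.inf]:
    d_all = np.linalg.norm(slightly_off - reference, p)
    d_one = np.linalg.norm(one_pixel_off - reference, p)
    dist[p] = (d_all, d_one)
    print(p, d_all, d_one)
```

Under the L1 and L2 norms the one-pixel variant is the closer of the two, while under the L∞ norm it is 20 times farther from the reference than the uniformly shifted image.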