# Classification of Images with Masks using Machine Learning

The data set from https://www.kaggle.com/rakshana0802/face-mask-detection-data consists of 3833 images in total (1915 images of people wearing face masks and 1918 images of people without face masks). The goal of this project is to train various machine learning algorithms on this data and to compare the accuracy and efficiency of each classification method.

## Data Cleaning

The first problem that I faced was that the images in the data set come in different sizes (some larger than others), file types (.PNG, .JPG, .JPEG), and modes (P, RGB, and RGBA). The differences in file type and mode are especially prevalent among images of people with face masks. Hence, I wrote a program, `img2vec.py`, that transforms all images to a standard size (128 $\times$ 128), file type, and mode. This "standard" can easily be changed at the top of `img2vec.py`. The output of the program is a `mask.csv` file, where the first column holds the class of each image (with or without mask) and each remaining column holds the value of one pixel, at a specific row and column, of the transformed image.
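
As a rough illustration of the standardization step, the sketch below converts images of different sizes and modes into flat pixel vectors. It is not the code of `img2vec.py`: the function name `image_to_vector`, the use of Pillow, and the choice of the grayscale mode `"L"` as the standard mode are all assumptions made for this example.

```python
import numpy as np
from PIL import Image

STANDARD_SIZE = (128, 128)   # (width, height), matching the size quoted above
STANDARD_MODE = "L"          # assumption: 8-bit grayscale, one value per pixel

def image_to_vector(img):
    """Convert a PIL image of any size/mode to a flat vector of pixel values."""
    img = img.convert(STANDARD_MODE).resize(STANDARD_SIZE)
    return np.asarray(img, dtype=np.uint8).ravel()

# In-memory stand-ins for image files of different sizes and modes.
samples = [Image.new("P", (40, 30)),
           Image.new("RGB", (300, 200)),
           Image.new("RGBA", (64, 64))]
vectors = np.stack([image_to_vector(s) for s in samples])
print(vectors.shape)  # (3, 16384)
```

Each row of `vectors` has the same length regardless of the original image, which is what allows the images to be stored as rows of a single CSV file.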

I have also written a program, `vec2img.py`, for viewing the transformed images.

## Increasing the Number of Data

One of the issues I faced in the project is a slight lack of data. The reasons are as follows:

1. For the classifiers to generalize better, more training data has to be provided. This means sacrificing some of the testing data.

2. The reverse is also true: using more of the data as test data means sacrificing training data. I would like to see how my classifiers fare in general, and a small testing data set can lead to unreliable (for example, higher than normal) accuracies.

3. In certain classification methods (e.g. Neural Network), I would like to use some validation data while training my model. This further reduces my training/testing data set.

In order to solve the above issues, I increased the number of data by
   
 * using the transformed version of the original images
 * horizontally flipping each of the transformed images (mirror images)
 * flipping each of the transformed images (upside down images)
 * horizontally flipping each of the upside down images (mirror of upside down images)
 * rotating the transformed version of the original images to the right (right rotated images)
 * flipping each of the right rotated images (upside down of right rotated images)
 * rotating the transformed version of the original images to the left (left rotated images)
 * flipping each of the left rotated images (upside down of left rotated images)

This way, I now have 8 times as many images as in my original data set. I then convert each image into a matrix of its pixel values before adding it to the data frame, which is later written to `mask.csv`.
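
The eight variants listed above can be sketched with NumPy as follows. This is only an illustration on a small array, not the project's own augmentation code; the function name `eight_variants` is made up for this example, and "rotating to the right" is assumed to mean a 90 degree clockwise rotation.

```python
import numpy as np

def eight_variants(img):
    """Return the 8 flipped/rotated versions of a 2-D image described above."""
    right = np.rot90(img, k=-1)   # rotated 90 degrees clockwise ("to the right")
    left = np.rot90(img, k=1)     # rotated 90 degrees counter-clockwise ("to the left")
    upside_down = np.flipud(img)
    return [
        img,                      # transformed original
        np.fliplr(img),           # mirror image
        upside_down,              # upside down image
        np.fliplr(upside_down),   # mirror of upside down image
        right,                    # right rotated image
        np.flipud(right),         # upside down of right rotated image
        left,                     # left rotated image
        np.flipud(left),          # upside down of left rotated image
    ]

demo = np.arange(9).reshape(3, 3)
variants = eight_variants(demo)
```

For an image without symmetries, all eight variants are different, so no duplicate rows are added to the data set.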

## Machine Learning Algorithms Used:

* Support Vector Machines

* Classification via Singular Value Decomposition (SVD) Properties

* Logistic Regression

* Linear Discriminant Analysis (LDA)

* Random Forest/Decision Trees

* Neural Network

## Code to import `mask.csv` into a dataframe and sort into training and test data

In [1]:
PERCENTAGE_OF_TRAIN = 0.8

import pandas as pd
import random
import math as m
import numpy as np
from sklearn.model_selection import train_test_split

# import mask.csv
full_data = pd.read_csv('mask.csv', header = 0)

# split into two separate data frames by images with and without masks
with_mask = full_data[full_data.with_mask == 'Yes']
without_mask = full_data[full_data.with_mask == 'No']

# split each class separately so that both classes keep the same train/test proportion;
# the test set is the remaining (1 - PERCENTAGE_OF_TRAIN) share of each class
(train_x_withmask, test_x_withmask, train_y_withmask, test_y_withmask) = train_test_split(
    with_mask.iloc[:,1:], with_mask.iloc[:,0], train_size = PERCENTAGE_OF_TRAIN, random_state = 1)
(train_x_withoutmask, test_x_withoutmask, train_y_withoutmask, test_y_withoutmask) = train_test_split(
    without_mask.iloc[:,1:], without_mask.iloc[:,0], train_size = PERCENTAGE_OF_TRAIN, random_state = 1)

train_x = np.vstack((train_x_withmask, train_x_withoutmask))
train_y = pd.concat([train_y_withmask, train_y_withoutmask], axis = 0).reset_index(drop = True)
train_y = np.array(train_y)

test_x = np.vstack((test_x_withmask, test_x_withoutmask))
test_y_true = pd.concat([test_y_withmask, test_y_withoutmask], axis = 0).reset_index(drop = True)
test_y_true = np.array(test_y_true)
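
For comparison, a similar class-balanced split can be obtained in a single call with the `stratify` argument of `train_test_split`. The sketch below uses a small made-up data frame in place of `mask.csv`; unlike the cell above, it returns the two classes shuffled together rather than stacked one after the other.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for mask.csv: a label column followed by pixel columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 256, size=(100, 4)),
                   columns=["p0", "p1", "p2", "p3"])
toy.insert(0, "with_mask", ["Yes"] * 50 + ["No"] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    toy.iloc[:, 1:].to_numpy(),
    toy.iloc[:, 0].to_numpy(),
    train_size=0.8,
    stratify=toy.iloc[:, 0],   # keep the Yes/No ratio equal in both parts
    random_state=1,
)
```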

In [2]:
# del unneeded variables to save space
del full_data
del with_mask
del without_mask
del train_x_withmask
del train_y_withmask
del test_x_withmask
del test_y_withmask
del train_x_withoutmask
del train_y_withoutmask
del test_x_withoutmask
del test_y_withoutmask

### Code to plot confusion matrices later on

In [3]:
import matplotlib.pyplot as plt
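from sklearn.metrics import ConfusionMatrixDisplay

def plot_confmat(clf, X, y):
    # plot_confusion_matrix was removed in scikit-learn 1.2, so use
    # ConfusionMatrixDisplay.from_estimator instead. Passing labels fixes the
    # row/column order to ['Yes', 'No']; without it the classes are sorted
    # (['No', 'Yes']) and [[tp, fn], [fp, tn]] would be unpacked incorrectly.
    labels = ['Yes', 'No']
    (fig, (ax1, ax2)) = plt.subplots(1,2, figsize=(10, 5))
    disp1 = ConfusionMatrixDisplay.from_estimator(clf, X, y, labels = labels, display_labels = labels,
                                                  cmap = plt.cm.Blues, normalize = None, ax = ax1)
    disp2 = ConfusionMatrixDisplay.from_estimator(clf, X, y, labels = labels, display_labels = labels,
                                                  cmap = plt.cm.Blues, normalize = 'true', ax = ax2)
    disp1.ax_.set_title('Non-normalized Confusion Matrix')
    disp2.ax_.set_title('Normalized Confusion Matrix')
    disp2.im_.set_clim(0,1)
    return disp1.confusion_matrix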

### Code to analyze classification result later on

In [4]:
classification_result = {}
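
Each entry stored in `classification_result` holds the time taken and the four confusion matrix counts. The sketch below shows one possible way to turn such an entry into summary metrics; the helper `summarize` and the numbers in `example_entry` are hypothetical and are not results from this project.

```python
def summarize(result):
    """Derive accuracy, precision and recall from one classification_result entry."""
    tp, tn = result['True Positive'], result['True Negative']
    fp, fn = result['False Positive'], result['False Negative']
    total = tp + tn + fp + fn
    return {
        'Accuracy': (tp + tn) / total,
        'Precision': tp / (tp + fp) if (tp + fp) else float('nan'),
        'Recall': tp / (tp + fn) if (tp + fn) else float('nan'),
    }

# Hypothetical counts, for illustration only.
example_entry = {'Time': 1.5, 'True Positive': 90, 'True Negative': 80,
                 'False Positive': 20, 'False Negative': 10}
summary = summarize(example_entry)
```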

## Support Vector Machines (SVM)

The general idea of a Support Vector Machine is that an $(n-1)$-dimensional hyperplane is "drawn" in an $n$-dimensional space to split the data into two classes. It is constructed such that the margin, i.e. the distance between the hyperplane and the data points closest to it, is as large as possible. The hyperplane can be expressed as the following equation:
$$w^T x + b = 0$$
where $w \in \mathbb{R}^{n}$ and $b \in \mathbb{R}$. The data points $x_i$ with $w^T x_i + b < 0$ are classified into one class and the data points with $w^T x_i + b > 0$ are classified into the other class. Running the SVM when our data points are linearly separable (i.e. the two classes can be separated by a flat hyperplane) is simple. Problems arise when the data points are not linearly separable, which is the case for most classification problems.
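
As a small numeric illustration of this decision rule, the sketch below evaluates $w^T x + b$ for three points in $\mathbb{R}^2$. The values of `w`, `b` and the points are made up for the example.

```python
import numpy as np

# Hypothetical hyperplane in R^2: w^T x + b = 0 with w = (1, -1), b = 0.5.
w = np.array([1.0, -1.0])
b = 0.5

points = np.array([[2.0, 0.0],    # w^T x + b =  2.5 -> positive side
                   [0.0, 2.0],    # w^T x + b = -1.5 -> negative side
                   [1.0, 1.0]])   # w^T x + b =  0.5 -> positive side
scores = points @ w + b
sides = np.where(scores > 0, 1, -1)
```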

The simple solution when the data points are not linearly separable is to map the data into a higher-dimensional space in which it becomes linearly separable. However, doing this explicitly can be very expensive in time and memory, especially when we already have more than 10,000 features, as is the case here. To tackle this problem, we can use what is called the "kernel trick".

By definition, "a function that takes as inputs vectors in the original space and returns the dot product of the vectors in the feature space is called a kernel function". In other words, if we have $x,z \in X$ and the map $\phi: X \rightarrow \mathbb{R}^N$ then 
$$k(x,z) = \langle \phi(x), \phi(z) \rangle$$
is a kernel function. (The Kernel Trick in Support Vector Classification, Towards Data Science)
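
To make the definition concrete, the sketch below checks it for one simple case chosen for this example: the kernel $k(x,z) = (x^T z)^2$ on $\mathbb{R}^2$, whose feature map can be written out explicitly as $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel k(x, z) = (x^T z)^2 on R^2."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def k(x, z):
    """Kernel evaluated directly in the original space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
lhs = k(x, z)            # (1*3 + 2*(-1))^2 = 1
rhs = phi(x) @ phi(z)    # dot product in the feature space
```

Both values agree, but `k` never constructs the feature vectors. This is the saving that matters when the feature space is very large.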

The above definition and function will be the key to solving non-linear classification problems. 

Note that the solution of the dual problem always has the following form (refer to Aarti Singh's slides for a fuller explanation):
$$w^* = \sum_{i = 1}^{N} \alpha_i y_i x_i$$
where $N$ is the number of data points used to train the SVM, $\alpha_i \geq 0$ are the dual variables, and $y_i$ is the class (either 1 or -1) of data point $i$. Recall the hyperplane equation mentioned previously: $w^*$ is the solution that gives us the separating hyperplane for the SVM classification. Substituting $w^*$ into the hyperplane equation, we get 
$$\sum_{i = 1}^{N} \alpha_i y_i (x_i^T x) + b = 0$$

Now, we can make predictions with the left-hand side of the above equation. Just like before, if $\sum_{i = 1}^{N} \alpha_i y_i (x_i^T x) + b > 0$, we assign $x$ to one class, and if $\sum_{i = 1}^{N} \alpha_i y_i (x_i^T x) + b < 0$ we assign it to the other. This is where the kernel trick comes into play. We want to map the input vectors $x_i$ and $x$ into a feature space, and we can do so by making the following tweak to the expression:
$$\sum_{i = 1}^{N} \alpha_i y_i (\phi(x_i)^T \phi(x)) + b$$

This then gives us
$$\sum_{i = 1}^{N} \alpha_i y_i \langle \phi(x_i), \phi(x)\rangle + b \text{ or } \sum_{i = 1}^{N} \alpha_i y_i k(x_i, x) + b$$

The same decision rule is then used to predict the class of each test data point.
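
The kernelized decision function can be checked against scikit-learn on a small synthetic data set (not the mask data). In a fitted `SVC`, `dual_coef_` holds the products $\alpha_i y_i$ for the support vectors, `support_vectors_` holds the corresponding $x_i$, and `intercept_` holds $b$. The RBF kernel and the value of `gamma` below are arbitrary choices for the example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic two-class data: labelled by whether a point lies outside the unit circle.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where((X ** 2).sum(axis=1) > 1.0, 1, -1)

gamma = 0.5
clf = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X, y)

X_new = rng.normal(size=(5, 2))
# dual_coef_ stores alpha_i * y_i for the support vectors only
# (alpha_i = 0 for every other training point, so those terms drop out).
K = rbf_kernel(clf.support_vectors_, X_new, gamma=gamma)   # k(x_i, x)
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()
```

The array `manual` should match `clf.decision_function(X_new)`, and its sign gives the predicted class.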

The next question that generally arises is: What is $k(x,z)$ or $\phi (x)$? This will be answered in each of the following subsections.

### Linear Kernel

In this simple case, $k(x,z) = x^T z$, which is equivalent to not using a kernel trick.

The code below is used to run SVM with a linear kernel. 

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
import time

start = time.time()
classifier1 = SVC(kernel = 'linear', C = 1)
classifier1.fit(train_x, train_y)
[[tp, fn], [fp, tn]] = plot_confmat(classifier1, test_x, test_y_true)
end = time.time()
time_taken = end - start

classification_result['SVM - Linear'] = {'Time' : time_taken,
                                      'True Positive' : tp,
                                      'True Negative' : tn,
                                      'False Positive' : fp,
                                      'False Negative' : fn}

print('Time taken:', round(time_taken,2), 'seconds')

In [None]:
# delete variables to free memory
del classifier1

#### Notes:

Changing the regularization parameter `C` does not change the results at all. However, there are slight differences in the time taken to complete the classification task. 

On the other hand, it seems that standardizing the variables (the pixels of the images in this case) improves the true positive rate very slightly, but it also increases the overall error rate. Hence, standardizing the variables is not a good option in this case.

### RBF Kernel

In this case, $k(x,z) = \exp (- \gamma ||x - z||^2)$, where $\gamma > 0$.

The code below is used to run SVM with an RBF kernel.

In [None]:
start = time.time()
classifier2 = SVC(kernel = 'rbf', cache_size = 500, C = 10)
classifier2.fit(train_x, train_y)
[[tp, fn], [fp, tn]] = plot_confmat(classifier2, test_x, test_y_true)
end = time.time()
time_taken = end - start

classification_result['SVM - RBF'] = {'Time' : time_taken,
                                      'True Positive' : tp,
                                      'True Negative' : tn,
                                      'False Positive' : fp,
                                      'False Negative' : fn}

print('Time taken:', round(time_taken,2), 'seconds')

In [None]:
# delete variables to free memory
del classifier2

#### Notes:

Changing the regularization parameter `C` makes a more significant difference in this case: setting `C = 10` produces the best results. However, the time taken to produce these results is still quite long.

### Polynomial Kernel Degree 2

In the next two cases, our kernel function is $k(x,z) = (x^T z + c)^d$, where $c$ is a constant and $d$ is the degree of the polynomial. In the current case $d = 2$, while in the next case $d = 3$.
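
As a quick check of this formula, the sketch below compares a hand computation with scikit-learn's `polynomial_kernel`, which evaluates $(\gamma x^T z + c)^d$ with $c$ passed as `coef0`. The vectors and the values $c = 1$, $\gamma = 1$ are made up for the example. Note that `SVC` defaults to `coef0 = 0.0` and `gamma = 'scale'`, so classifiers that do not set these parameters use $c = 0$ and a data-dependent $\gamma$.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[3.0, -1.0]])
c, d = 1.0, 2

by_hand = (x @ z.T + c) ** d                                  # (x^T z + c)^d
by_sklearn = polynomial_kernel(x, z, degree=d, gamma=1.0, coef0=c)
```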

The code below is used to run SVM with a polynomial kernel of degree 2.

In [None]:
start = time.time()
classifier3 = SVC(kernel = 'poly', degree = 2)
classifier3.fit(train_x, train_y)
[[tp, fn], [fp, tn]] = plot_confmat(classifier3, test_x, test_y_true)
end = time.time()
time_taken = end - start

classification_result['SVM - Degree 2'] = {'Time' : time_taken,
                                      'True Positive' : tp,
                                      'True Negative' : tn,
                                      'False Positive' : fp,
                                      'False Negative' : fn}

print('Time taken:', round(time_taken,2), 'seconds')

In [None]:
# delete variables to free memory
del classifier3

### Polynomial Kernel Degree 3

The code below is used to run SVM with a polynomial kernel of degree 3.

In [None]:
start = time.time()
classifier4 = SVC(kernel = 'poly', degree = 3, cache_size = 500)
classifier4.fit(train_x, train_y)
[[tp, fn], [fp, tn]] = plot_confmat(classifier4, test_x, test_y_true)
end = time.time()
time_taken = end - start

classification_result['SVM - Degree 3'] = {'Time' : time_taken,
                                      'True Positive' : tp,
                                      'True Negative' : tn,
                                      'False Positive' : fp,
                                      'False Negative' : fn}

print('Time taken:', round(time_taken,2), 'seconds')

In [None]:
# delete variables to free memory
del classifier4

#### Notes:

Running the SVM classifier with a polynomial kernel is significantly faster than with the other SVM kernels. Additionally, the error rate produced is not too bad either (slightly larger than that of the RBF kernel). The polynomial kernel of degree 2 can be chosen if speed matters more than accuracy.

### Sigmoid Kernel

In this case, $k(x,z) = \tanh(\alpha x^T z + c)$.

The code below is used to run SVM with a sigmoid kernel.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

start = time.time()
classifier5 = make_pipeline(StandardScaler(), SVC(kernel = 'sigmoid'))
classifier5.fit(train_x, train_y)
[[tp, fn], [fp, tn]] = plot_confmat(classifier5, test_x, test_y_true)
end = time.time()
time_taken = end - start

classification_result['SVM - Sigmoid'] = {'Time' : time_taken,
                                      'True Positive' : tp,
                                      'True Negative' : tn,
                                      'False Positive' : fp,
                                      'False Negative' : fn}

print('Time taken:', round(time_taken,2), 'seconds')

In [None]:
# delete variables to free memory
del classifier5

#### Notes:

The SVM classifier fares worst with the sigmoid kernel. Its accuracy is far lower than when using any of the other SVM kernels.

## Classification via Singular Value Decomposition properties

We now move on to classifying the images via properties of SVD. 

The function below is used to separate the training data into two matrices, one for images of class 'Yes' and the other for images of class 'No'. Then, the function returns the SVD matrices of each class matrix.

In [None]:
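import scipy as sc
import scipy.linalg  # make sure the linalg submodule is loaded

def classify_svd_training(train_mat, train_class):
    # after transposing, each column of X is one image
    X = train_mat.T
    y = train_class.T
    
    U = [[], []]
    S = [[], []]
    V = [[], []]
    for i, class_val in enumerate(['Yes', 'No']):
        index = (y == class_val)
        matrix = X[:, index]
        # thin SVD; note that scipy returns V transposed (V^T) as the third output
        (U[i], S[i], V[i]) = sc.linalg.svd(matrix, full_matrices = False)
    return (U, S, V)        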

The function below, which is used to classify our test data, does the following:

1) We take a test data point and convert it into a vector. Let's call this the vector $b$.

2) After performing SVD on the training data set in the previous section, we get the following:

$$A_i = U_i \Sigma_i V_i^T$$

* $A_i$ is the $m \times n$ training data matrix of class $i$ ($m$ is the number of pixels in an image and $n$ is the number of training images of class $i$). Note that the value of $m$ is the same for all images since we have already transformed each image to a standard size. 

* $U_i$ is an $m \times r$ matrix with orthonormal columns, where $r = \min(m, n)$.

* $\Sigma_i$ is an $r \times r$ diagonal matrix whose diagonal entries are the singular values of $A_i$, arranged in descending order.

* $V_i$ is an $n \times r$ matrix with orthonormal columns.

Now, we can think of the SVD of $A_i$ as organizing the data in $A_i$ so that the most important components of $A_i$ (the important details of images of class $i$) are in the first few columns of $U_i, \Sigma_i, V_i$, while the least important components (the noise in the images of class $i$) are in the last few columns.
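
These properties can be verified numerically on a small random matrix standing in for $A_i$. The sizes and the value of `k` below are arbitrary choices for the example. The last line uses the known fact that the spectral-norm error of keeping only the first $k$ components equals the $(k+1)$-th singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))      # stand-in for A_i: m = 50 "pixels", n = 8 "images"

# Thin SVD, r = min(m, n) = 8. NumPy returns V transposed as the third output.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep only the k leading components
err = np.linalg.norm(A - A_k, 2)              # spectral-norm error of the truncation
```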

With this in mind, we select the first $k$ columns of $U_i$ to perform our classification. Let's call this version of $U_i$, with only $k$ columns, $U'_i$. We now want to find the vector $x$ such that 

$$U'_i x = b$$

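However, since $U'_i$ is not square (and hence not invertible), this system generally has no exact solution, so we obtain the least-squares solution $x$ by solving the normal equation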

$$({U'_i}^T U'_i) x = {U'_i}^T b$$

Since the columns of $U'_i$ are orthonormal, ${U'_i}^T U'_i = I$ and the solution is simply $x = {U'_i}^T b$. After obtaining $x$, we compute the norm of the residual vector $r_i = b - U'_i x$.

3) Assuming we ran part (2) with the training matrix for the class 'Yes', run part (2) again with the training matrix for the class 'No' (or vice versa). We should now have the norms of both residual vectors $r_i$. Compare the norms and assign the test data point to the class with the smaller residual norm, i.e. if $||r_{Yes}|| < ||r_{No}||$ then we classify the test data point into the class 'Yes'.

4) Repeat the steps in parts (1) to (3) for each of the test data points. We now have our set of predictions and can compare them with their actual classes.
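
The steps above can be sketched in vectorized form on synthetic data. This is an illustration under stated assumptions and not the notebook's own function: the helper `residual_norms` is hypothetical, the two "classes" are random low-dimensional subspaces, and the least-squares solution is computed as $U^T b$, which is valid only because the columns of each basis are orthonormal.

```python
import numpy as np

def residual_norms(U_k, B):
    """Relative residual ||b - U_k U_k^T b|| / ||b|| for every column b of B."""
    coeffs = U_k.T @ B                 # least-squares solution, since U_k^T U_k = I
    residuals = B - U_k @ coeffs
    return np.linalg.norm(residuals, axis=0) / np.linalg.norm(B, axis=0)

rng = np.random.default_rng(0)
m, k = 40, 3
# Two synthetic "classes", each described by its own random k-dimensional subspace.
basis_yes = np.linalg.qr(rng.normal(size=(m, k)))[0]
basis_no = np.linalg.qr(rng.normal(size=(m, k)))[0]

# Five test vectors that lie close to the 'Yes' subspace, plus a little noise.
test_yes = basis_yes @ rng.normal(size=(k, 5)) + 0.01 * rng.normal(size=(m, 5))
r_yes = residual_norms(basis_yes, test_yes)
r_no = residual_norms(basis_no, test_yes)
predicted = np.where(r_yes < r_no, 'Yes', 'No')
```

Processing all test vectors as columns of one matrix avoids the Python loop over test points, which may matter when the test set is large.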

In [None]:
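def classify_test(test_mat, n, U):
    # test_mat: one test image per row
    # n: number of leading singular vectors (k) to use
    # U: left singular vectors of the 'Yes' (index 0) and 'No' (index 1) class matrices
    X = test_mat.T
    test_size = X.shape[1]
    classification = []
    for i in range(test_size):
        b = X[:, i]
        resnorm = np.empty(2)
        for j in range(2):
            A = U[j][:,:n]
            # solve the normal equation (A^T A) x = A^T b
            x = np.linalg.solve(A.T @ A, A.T @ b)
            res = b - A @ x
            # relative residual norm for class j
            resnorm[j] = np.linalg.norm(res, 2) / np.linalg.norm(b, 2)
        if resnorm[0] < resnorm[1]:
            classification.append('Yes')
        else:
            classification.append('No')
    return classification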

The code below is used to find the SVD of both class matrices.

Note that in the following code, we split our training data into two parts: training (80\%) and validation (20\%) data. We do this because we will be experimenting with different numbers of singular values to classify the images, and for each number we need the classification accuracy on data that was not used for training. The number of singular values ($k$) that produces the best predictions on the validation data will then be used to predict the real test data. This way, $k$ is chosen without looking at the test data, so the accuracy on the test data remains an honest estimate of how well the classifier generalizes. 

Conversely, if we had no validation data set and chose our best value of $k$ based solely on the real test data, it would be hard to tell whether that value of $k$ produces the best generalized predictions or merely obtains a high accuracy by coincidence (especially when the graph of accuracy against $k$ fluctuates a lot).

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# splitting train data into train + validation data
(svd_train_x, svd_val_x, svd_train_y, svd_val_y) = train_test_split(train_x, train_y, train_size = 0.8, test_size = 0.2, random_state = 1)

start = time.time()
(train_U, train_S, train_V) = classify_svd_training(svd_train_x, svd_train_y)
end = time.time()
time_taken = end - start
svd_train_time = time_taken

print('Time taken:', round(time_taken, 2), 'seconds')

### Magnitude of Singular Values of both classes

The following figures are plotted to give us an idea of the range of values of $k$ we should use to form $U'_i$. Recall that choosing a $k$ too small means that we miss out on important components of the training images and can negatively impact classification accuracy. Similarly, selecting a $k$ too large means that we include unnecessary components and white noise of the training images into our classifications, which can lower our predictive accuracy as well.

In [None]:
fig, ax = plt.subplots(1,3, figsize = (15, 5))

x1 = list(range(1,len(train_S[0])+1))
y1 = train_S[0]
ax[0].plot(x1, y1, label = 'Yes')

x2 = list(range(1, len(train_S[1])+1))
y2 = train_S[1]
ax[0].plot(x2, y2, label = 'No')

ax[0].set_xlabel('k-th Singular Value')
ax[0].set_ylabel('s')
ax[0].title.set_text('Original Values (s)')
ax[0].legend()

x1 = list(range(1,len(train_S[0])+1))
y1 = np.log(train_S[0])
ax[1].plot(x1, y1, label = 'Yes')

x2 = list(range(1, len(train_S[1])+1))
y2 = np.log(train_S[1])
ax[1].plot(x2, y2, label = 'No')

ax[1].set_xlabel('k-th Singular Value')
ax[1].set_ylabel('log(s)')
ax[1].title.set_text('Log Transformed Values (log(s))')
ax[1].legend()

x1 = list(range(1,151))
y1 = np.log(train_S[0][:150])
ax[2].plot(x1, y1, label = 'Yes')

x2 = list(range(1, 151))
y2 = np.log(train_S[1][:150])
ax[2].plot(x2, y2, label = 'No')

ax[2].set_xlabel('k-th Singular Value')
ax[2].set_ylabel('log(s)')
ax[2].title.set_text('Log Transformed First 150 Values (log(s))')
ax[2].legend()

fig.show()

As we can see in the first figure above, the change in magnitude of the singular values is so large that it is hard to tell which value of $k$ we should choose. 

We then proceeded to take the natural log of each singular value and plot the same figure. The results (shown in the middle figure) show that there is a sharp dip at around 2700. This suggests that the singular values occurring after 2700 all correspond to noise. Additionally, we also see a huge change in the magnitudes of the singular values at around 200. 

This brings us to the last figure on the right. Here, there is no straightforward way to tell which exact value of $k$ to choose. Hence, I have decided to run the predictions with $k = 5, 10, 15, \ldots, 600$ below and plot the prediction time and accuracy for each value of $k$. 

In [None]:
time_taken = []
accuracy = []
n = list(range(5,601,5))

for i in n:
    start = time.time()
    val_y_pred = classify_test(svd_val_x, i, train_U)
    end = time.time()
    time_taken.append(round(end - start, 2))
    confmat = confusion_matrix(svd_val_y, val_y_pred, labels = ['Yes', 'No'])
    acc = (confmat[0][0] + confmat[1][1]) / np.sum(confmat)
    accuracy.append(acc)

fig, ax1 = plt.subplots(figsize = (15,5))

color = 'tab:red'
ax1.set_xlabel('k')
ax1.set_ylabel('Time taken', color=color)
ax1.plot(n, time_taken, color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('Accuracy', color=color)  # we already handled the x-label with ax1
ax2.plot(n, accuracy, color=color)
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim([0.7,1])

fig.suptitle('Time taken vs Accuracy of Validation Data based on k', y = 1)
fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.show()

As we can see in the figure above, the time taken to predict increases rapidly as the value of $k$ increases. This is accompanied by a slight increase or decrease in accuracy. In this case, we can say that when $k = 545$, we have our most accurate prediction using this classification method. However, when $k = 155$, we seem to get a similar result in a significantly shorter amount of time.

Overall, if we do not have the luxury of running the above code to find the accuracy for each $k$, a reasonable choice is to pick a $k$ between 150 and 200, where the prediction accuracies start to stabilize (less fluctuation) and the prediction still takes a relatively short amount of time.
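
The trade-off between accuracy and prediction time can also be turned into a simple rule: take the smallest $k$ whose validation accuracy is within a small tolerance of the best one. The sketch below is one possible way to express this. The helper `smallest_k_within`, the tolerance, and the accuracies are all hypothetical and are not results from this project. The values of $k$ are assumed to be sorted in increasing order.

```python
def smallest_k_within(ks, accuracies, tolerance=0.005):
    """Smallest k whose accuracy is within `tolerance` of the best accuracy."""
    best = max(accuracies)
    for k, acc in zip(ks, accuracies):
        if acc >= best - tolerance:
            return k

# Hypothetical validation accuracies, for illustration only.
ks_demo = [50, 100, 150, 200, 250]
acc_demo = [0.80, 0.86, 0.895, 0.89, 0.90]
chosen = smallest_k_within(ks_demo, acc_demo, tolerance=0.01)
```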

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

k_singularval = (accuracy.index(max(accuracy)) + 1) * 5
varname = 'SVD - ' + str(k_singularval)

start = time.time()
test_y_pred = classify_test(test_x, k_singularval, train_U)
end = time.time()
time_taken = end - start

print('Time taken:', round(time_taken,2), 'seconds')

fig, ax = plt.subplots(1,2, figsize = (10,5))

for i, normal in enumerate([None, 'true']):
    confmat = confusion_matrix(test_y_true, test_y_pred, labels = ['Yes', 'No'], normalize = normal)
    disp = ConfusionMatrixDisplay(confmat, display_labels = ['Yes', 'No'])
    disp.plot(ax = ax[i], cmap = plt.cm.Blues)
    if i == 0:
        disp.ax_.set_title('Non-normalized Confusion Matrix')
        classification_result[varname] = {'Time' : svd_train_time + time_taken,
                                     'True Positive' : confmat[0][0],
                                     'True Negative' : confmat[1][1],
                                     'False Positive' : confmat[1][0],
                                     'False Negative' : confmat[0][1]}
    else:
        disp.ax_.set_title('Normalized Confusion Matrix')
        disp.im_.set_clim(0,1)
        


In [None]:
# best k among the first 40 candidates (k = 5, 10, ..., 200);
# the text above discusses k = 155
k_singularval = (accuracy.index(max(accuracy[:40])) + 1) * 5
varname = 'SVD - ' + str(k_singularval)

start = time.time()
test_y_pred = classify_test(test_x, k_singularval, train_U)
end = time.time()
time_taken = end - start

print('Time taken:', round(time_taken,2), 'seconds')

fig, ax = plt.subplots(1,2, figsize = (10,5))

for i, normal in enumerate([None, 'true']):
    confmat = confusion_matrix(test_y_true, test_y_pred, labels = ['Yes', 'No'], normalize = normal)
    disp = ConfusionMatrixDisplay(confmat, display_labels = ['Yes', 'No'])
    disp.plot(ax = ax[i], cmap = plt.cm.Blues)
    if i == 0:
        disp.ax_.set_title('Non-normalized Confusion Matrix')
        classification_result[varname] = {'Time' : svd_train_time + time_taken,
                                     'True Positive' : confmat[0][0],
                                     'True Negative' : confmat[1][1],
                                     'False Positive' : confmat[1][0],
                                     'False Negative' : confmat[0][1]}
    else:
        disp.ax_.set_title('Normalized Confusion Matrix')
        disp.im_.set_clim(0,1)

In [None]:
# delete variables to free memory
del classify_svd_training
del classify_test
del train_U
del train_S
del train_V
del time_taken
del accuracy
del svd_train_x
del svd_val_x
del svd_train_y
del svd_val_y
del val_y_pred
del fig
del ax1
del ax2

## Logistic Regression

Before training a model using logistic regression, my guess is that its results will be significantly worse than those of most of the previous methods, for the following reasons:

1. Logistic regression is a linear classifier, meaning the training data will be separated by a single flat hyperplane. Its only difference from the linear kernel SVM model is the way the hyperplane is produced. As we can see in the linear kernel SVM section, the classification took a long time and the resulting accuracy was not very high.

2. However, unlike the linear kernel SVM, logistic regression classifies data based on the estimated probability of belonging to each class, and this probability follows a sigmoid function of $w^T x + b$. This means that if a test data point has a 0.6 chance of being in class 'Yes' and a 0.4 chance of being in class 'No', it will be classified into the 'Yes' class. The SVM method, on the other hand, produces results that are binary, i.e. if the test data point lies on the 'Yes' side of the hyperplane, it will be classified as 'Yes'. Because the two methods fit their hyperplanes by optimizing different objectives, logistic regression could end up performing slightly better than the linear kernel SVM method.
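
The link between the sigmoid probability and the hyperplane can be seen in a few lines. The parameters `w` and `b` below are made up for the example and are not fitted to the mask data. The point is that the probability of 'Yes' exceeds 0.5 exactly when $w^T x + b > 0$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical fitted parameters of a logistic regression in R^2.
w = np.array([0.8, -0.5])
b = 0.1

points = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
scores = points @ w + b                 # w^T x + b
p_yes = sigmoid(scores)                 # estimated P(class = 'Yes' | x)
labels = np.where(p_yes > 0.5, 'Yes', 'No')
```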

In [None]:
from sklearn.linear_model import LogisticRegression

start = time.time()
classifier6 = LogisticRegression(solver = 'saga', penalty = 'l1', C =100, max_iter = 100)
classifier6.fit(train_x, train_y)
[[tp, fn], [fp, tn]] = plot_confmat(classifier6, test_x, test_y_true)
end = time.time()
time_taken = end - start

print('Time taken:', round(time_taken,2), 'seconds')

classification_result['Logistic Reg.'] = {'Time' : time_taken,
                                          'True Positive' : tp,
                                          'True Negative' : tn,
                                          'False Positive' : fp,
                                          'False Negative' : fn}

In [None]:
# delete variables to free memory
del classifier6

#### Notes:

As expected, classification via Logistic Regression produces only a slightly better result than the SVM with a linear kernel. This is the case even after tuning the parameters `C` and `penalty`. In fact, we always get the warning that `the coef_ did not converge`, regardless of how far the `max_iter` parameter was extended. This is probably because the data set is not linearly separable, so even separating the training data itself already produces large errors, whatever the parameter tuning.

## Linear Discriminant Analysis (LDA)

Just like Logistic Regression, LDA is a linear classifier. Hence, this classifier is also expected to fare badly.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

start = time.time()
classifier7 = LinearDiscriminantAnalysis(solver = 'svd')
classifier7.fit(train_x, train_y)
[[tp, fn], [fp, tn]] = plot_confmat(classifier7, test_x, test_y_true)
end = time.time()
time_taken = end - start

print('Time taken:', round(time_taken,2), 'seconds')

classification_result['LDA'] = {'Time' : time_taken,
                                'True Positive' : tp,
                                'True Negative' : tn,
                                'False Positive' : fp,
                                'False Negative' : fn}

In [None]:
# delete variables to free memory
del classifier7

#### Notes:

While I had predicted that LDA would fare badly compared to most of the classifiers used previously, it is unexpected that its overall error rate is so much worse than that of Logistic Regression and the linear kernel SVM.

### Logistic Regression vs LDA

After running the Logistic Regression and LDA classifiers, it is natural to wonder why there is such a huge difference in accuracy between the two classifiers.

According to several websites (links under the Reference and Resources section), the difference in accuracy is mainly due to the different assumptions made by these two classifiers:

* In Logistic Regression, the distributions of the predictors do not matter. On the other hand, LDA assumes that the predictors follow a multivariate normal distribution.

* In Logistic Regression, there is no requirement on the predictors' within-group covariance matrices. This is very different from LDA, which assumes that the within-group covariance matrices of the predictors (namely $S_{Yes}$ and $S_{No}$, estimated from the training data in our case) are both estimates of one common population covariance matrix (denoted $\Sigma$).

In our data, each "feature" or "predictor" is the value of one pixel, so the predictors are very likely not normally distributed, nor do the two classes necessarily share a common "population" covariance. I believe that because the predictor variables fail to meet the assumptions of the LDA classifier, LDA does not perform as well as Logistic Regression in this case.

## Random Forest

The Random Forest Classification method is basically done by running the Decision Tree Classification method multiple times with slight variations in the data and features used. So before we dive straight into Random Forest Classification, let's first train a Decision Tree Classifier on our training data.

The general idea of running a Decision Tree Classifier is simple:

1. We have $m$ training data with $n$ features. In our case, this translates to our 3449 rows of training data and 65,536 features (pixels).

2. We start off at a root node with all our training data. Then, a feature is selected to split the training data into two. Eg. we start with the pixel (1,1) in each image. Images with pixel (1,1) values more than $\alpha$ will be transferred to a new node at the right side of the root node, and the rest will be transferred to a new node at the left side of the root node.

3. As of right now, we have transferred all of our training data into nodes that are either at the left or right nodes of the root node. Now, we want to check if any of the nodes contain only a single class, ie. the node contains images that are all of class 'Yes' or all of class 'No'. If this is the case, we are done with this node, no further action is needed to work on this node. Otherwise, we will need to repeat the process of step 2, this time selecting a different feature to classify our training data. Note that in every classification step, we are not allowed to transfer our data into a used node (ie. it has to be a new node every step).

4. After running steps 2 and 3 multiple times, we will arrive at a point where every non-empty node contains only a single class. At this point, we are done.
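Step 2 can be sketched in a few lines. The example below is a simplified illustration of how one split might be chosen with the Gini impurity (the criterion passed to the classifier later in this section); it is not scikit-learn's actual implementation, and the functions `gini` and `best_split` and the toy data are invented for the example.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels (0 means the node is pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold alpha, weighted impurity) of the best single split."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # candidate thresholds: midpoints between consecutive distinct feature values
        for alpha in (values[:-1] + values[1:]) / 2:
            right = X[:, j] > alpha          # rows sent to the right child
            score = (right.sum() * gini(y[right]) + (~right).sum() * gini(y[~right])) / n
            if score < best[2]:
                best = (j, alpha, score)
    return best

# Toy data: 4 "images" with 2 "pixels" each
X_toy = np.array([[10, 200], [20, 210], [30, 50], [40, 60]], dtype=float)
y_toy = np.array(['Yes', 'Yes', 'No', 'No'])
feature, alpha, impurity = best_split(X_toy, y_toy)
print(feature, alpha, impurity)
```

On this toy data the first feature with $\alpha = 25$ already separates the two classes perfectly, so both child nodes are pure and no further splitting would be needed.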

Note that the Decision Tree Classification method has the following problems:

* The time taken can be quite long depending on the data set, or the chosen training data. This is due to the fact that the "learning of optimal decision tree is NP-complete" (Wikipedia). This can be observed even when running the classification method on this data set, where the time taken to train the model and classify the testing data ranged from less than 2 minutes to more than 2 hours.

* Decision Trees tend to overfit. Recall that we will keep transferring data into new nodes until each node contains only one class of data. This becomes a problem when we put our test data into the decision tree, since the properties of our test data are not exactly the same as those of the training data.
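The overfitting behaviour can be illustrated on a small synthetic data set. This is only a sketch: the data comes from `make_classification` with some label noise rather than from the mask images, and `max_depth = 3` is an arbitrary choice for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 10% label noise (flip_y), standing in for the image data
X_syn, y_syn = make_classification(n_samples=600, n_features=20, n_informative=5,
                                   flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# A fully grown tree keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# A depth-limited tree stops early and cannot memorize the noisy labels
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [('full', full_tree), ('max_depth=3', shallow_tree)]:
    print(name, '| train acc:', round(tree.score(X_tr, y_tr), 3),
          '| test acc:', round(tree.score(X_te, y_te), 3))
```

The fully grown tree fits the training data perfectly, noise included, so the gap between its training and test accuracy gives a sense of how much it has overfitted.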

In [None]:
from sklearn.tree import DecisionTreeClassifier

start = time.time()
classifier8 = DecisionTreeClassifier(criterion = 'gini')
classifier8.fit(train_x, train_y)
[[tp, fn], [fp, tn]] = plot_confmat(classifier8, test_x, test_y_true)
end = time.time()
time_taken = end - start

print('Time taken:', round(time_taken,2), 'seconds')

classification_result['Decision Tree'] = {'Time' : time_taken,
                                'True Positive' : tp,
                                'True Negative' : tn,
                                'False Positive' : fp,
                                'False Negative' : fn}

In [None]:
# delete the fitted classifier to free memory
del classifier8

After running the Decision Tree Classifier above, we have the following findings:

* The time taken for this to run is actually quite short. The only other classifier with such short time taken so far is the classifier that utilizes SVD properties.

* The results of this test are also very accurate. This is because the classifier is not restricted to linear decision boundaries, unlike the linear kernel SVM, Logistic Regression, and LDA.

Now that we have tried out the Decision Tree Classifier, we move on to the Random Forest Classifier. The Random Forest Classifier is different from the Decision Tree Classifier in the following ways:

* Instead of using all features to classify our training data, the features used are selected at random.

* Instead of using all of our training data for classification, the training data is selected at random with replacement (aka bootstrapping). This means that there is a chance that we might select the same exact data (although it is less likely if the size of data is huge).

Now, with our new randomly selected subset of training data with randomly selected features, we run the Decision Tree Classifier once again. 

This time, however, we repeat all of the above a number of times (note the parameter `n_estimators` in the code below) before concluding with a model that averages the predictions of all the trees we built.
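The two sources of randomness and the final aggregation can be sketched by hand. This is a simplified illustration and not what `RandomForestClassifier` does internally: the helpers `fit_bagged_trees` and `predict_majority` are hypothetical, the data is synthetic with 0/1 labels, and a plain majority vote is used here, whereas scikit-learn averages the trees' predicted class probabilities. Note also that `max_features = 'sqrt'` draws a fresh random subset of features at every split rather than once per tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_estimators=10, seed=0):
    """Fit n_estimators trees, each on its own bootstrap sample of the rows."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_estimators):
        # bootstrap: draw len(X) row indices with replacement
        rows = rng.integers(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(max_features='sqrt',
                                      random_state=int(rng.integers(1_000_000)))
        trees.append(tree.fit(X[rows], y[rows]))
    return trees

def predict_majority(trees, X):
    """Majority vote over the trees; assumes the labels are 0 and 1."""
    votes = np.stack([tree.predict(X) for tree in trees])   # shape (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

X_demo, y_demo = make_classification(n_samples=300, n_features=16,
                                     n_informative=4, random_state=0)
trees = fit_bagged_trees(X_demo, y_demo, n_estimators=11)
pred = predict_majority(trees, X_demo)
print('Training accuracy of the bagged trees:', (pred == y_demo).mean())
```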

In [None]:
from sklearn.ensemble import RandomForestClassifier

start = time.time()
classifier9 = RandomForestClassifier(n_estimators = 100, criterion = 'gini')
classifier9.fit(train_x, train_y)
[[tp, fn], [fp, tn]] = plot_confmat(classifier9, test_x, test_y_true)
end = time.time()
time_taken = end - start

print('Time taken:', round(time_taken,2), 'seconds')

classification_result['Random Forest'] = {'Time' : time_taken,
                                'True Positive' : tp,
                                'True Negative' : tn,
                                'False Positive' : fp,
                                'False Negative' : fn}

In [None]:
# delete the fitted classifier to free memory
del classifier9

#### Notes:

After running the Random Forest Classifier, the results obtained were very unexpected... in a good way. This classification method has given us one of the highest accuracy rates among all the methods we have used so far, and in the shortest amount of time too! 

A thing to note about the Random Forest Classifier is that, even after playing around with the `n_estimators` parameter with values 100, 200, 1000..., the accuracies of the classifiers are highly similar, although the time taken to build the classifiers increases rapidly. Hence, I have decided to use `n_estimators = 100` since it gives us the best accuracy vs time trade-off.

## Neural Network

This section is significantly trickier than the others, due to the following reasons:

* more parameters to tune

* there is an unlimited number of possible Neural Network architectures. Other than the neurons in the input and output layers being fixed, the number of hidden layers and the number of neurons in each of them can be chosen freely. A lot of experimentation has to be done regarding the number of hidden layers and neurons to be used in our model.

Due to these reasons, the changes in code for each parameter tune will not be repeated here, and will instead be done in a separate file, namely `keras_nn.py`. This file is written by importing the Keras library since it is easy to navigate and make parameter changes.

Additionally, for future references, I have also written `tensorflow_nn.py` that does the same exact thing but instead using the tensorflow library.

### Default Settings and Values in Training of Neural Network

The following are some settings that are set as our default in the training of our Neural Network, and will not be changed unless stated specifically in the section.

1. Optimizer: Adam

2. Learning Rate: `1e-4`

3. Batch Size: 256

4. Network Architecture: 16384 $\rightarrow$ 500 $\rightarrow$ 2

### Effects of Parameter Tuning

In training the best possible Neural Network model, there are certain parameters that need to be tuned and experimented on. In the following sections, I have varied these parameters in different ways and observed their effects on the cost and accuracy for my training and validation data.

* #### Standardization (Transformation) of data and L1/L2 regularization parameters

 In this section, we want to observe how transforming the data and adding regularization parameters can affect the accuracies of the training, validation and test data.

 The type of data transformations used here are:

 1. No transformation, ie. the same input data from before is used

 2. Normalized input, ie. each input neuron value is transformed with the following formula:
$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$
where $x$ is the initial input value for a particular neuron, and $x_{min}$ and $x_{max}$ are the smallest and largest input values for the particular neuron among the entire training data set.

 3. Standardized input, ie. each input neuron value is transformed to have zero mean and unit variance via the following formula:
$$x' = \frac{x - \mu_x}{\sigma_x}$$
where $x$ is the initial input value for a particular neuron, and $\mu_x$ and $\sigma_x$ are the mean and standard deviation respectively for the particular neuron among the entire training data set.
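The two formulas above can be checked directly against the scikit-learn scalers used later in this section. The following is a small sketch on random stand-in data (the arrays `X_fit` and `X_new` are invented for the example); the statistics are computed on the "training" array only and then applied to new data, which mirrors fitting the scaler on the training set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.integers(0, 256, size=(50, 4)).astype(float)   # stand-in for training pixels
X_new = rng.integers(0, 256, size=(10, 4)).astype(float)   # stand-in for validation/test pixels

# per-feature statistics, computed on the training data only
x_min, x_max = X_fit.min(axis=0), X_fit.max(axis=0)
mu, sigma = X_fit.mean(axis=0), X_fit.std(axis=0)

X_new_minmax = (X_new - x_min) / (x_max - x_min)   # normalization formula
X_new_std = (X_new - mu) / sigma                   # standardization formula

print(np.allclose(X_new_minmax, MinMaxScaler().fit(X_fit).transform(X_new)))
print(np.allclose(X_new_std, StandardScaler().fit(X_fit).transform(X_new)))
```

One practical difference is that a pixel with the same value in every training image would make the denominators above zero, whereas the scikit-learn scalers guard against that case.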

 As for our L1 and L2 regularization parameters, the reason we want to use them is this:  
 - We have our cost function which we want to minimize in order to improve our model. In general, the following are some of the types of cost functions:
 
   1. Quadratic cost:  
   $$C = \frac{1}{n} \sum_{i = 1}^{n}||y_i - y'_i||_2^2$$
      where $y_i$ is the actual value of the response variable for data point $i$ and $y'_i$ is its corresponding predicted value. This is usually used when we train our Neural Network model for regression purposes and our output layer has only one node.
   
   2. Cross Entropy cost:  
   $$C = -\frac{1}{n} \sum_{i = 1}^{n} \sum_{c = 1}^{k} y_{i,c} \log_e(y'_{i,c})$$
      where $y_{i,c} = \begin{cases} 1 & \text{if data point } i \text{ is in class } c \\ 0 & \text{otherwise} \end{cases}$ and $y'_{i,c}$ is the predicted probability that the data point $i$ is class $c$. This is usually used when we train our Neural Network model for classification purposes and our output layer has $k$ number of nodes (to classify data points into $k$ classes). Note that in the case of binary classification (2 classes), one can use an output layer with only one node.
      
   In this analysis, we will be using the cross entropy cost function.
   
 - Now, simply having a cost function might not be enough as it leads to a possible overfitting of the data set. To counter this issue, we add regularization parameter(s) into our cost function. There are 3 (main) types of regularization that we will use in the training of this Neural Network:
 
   1. L1 Regularization
      $$L1 = \lambda_1 \sum_{j = 1}^{p} |\beta_j|$$
      where $\beta_j$ represents each weight and bias in the neural network model.
      
   2. L2 Regularization  
      $$L2 = \lambda_2 \sum_{j = 1}^{p} \beta_j^2$$
   
   3. Dropout/Pruning  
      In a Neural Network, dropout regularization basically means randomly deactivating a fraction of the nodes (in the hidden layers), together with their arcs, at each training step in order for the model to better generalize the data set.
      
 - With this in mind, we can now add either the L1 or L2 (or both) regularization into our cost function. The dropout is a different type of regularization method that we will experiment later on. Our new cost function becomes $$C' = C + L1 + L2$$
   
   We can now experiment around with different $\lambda_1$ and $\lambda_2$ values to see which values produce the best model. Note that setting a $\lambda$ too large can lead to underfitting, whereas setting it too small can lead to overfitting.
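To make the regularized cost $C' = C + L1 + L2$ concrete, here is a minimal NumPy sketch. The function names `cross_entropy` and `penalized_cost`, the toy predictions, and the toy weight matrix are all assumptions made for illustration; Keras computes the equivalent quantities internally when `kernel_regularizer` is set.

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross entropy for one-hot y_true and predicted probabilities y_prob."""
    return -np.mean(np.sum(y_true * np.log(np.clip(y_prob, eps, 1.0)), axis=1))

def penalized_cost(y_true, y_prob, weights, lambda_1=0.0, lambda_2=0.0):
    """Cross entropy plus L1 and L2 penalties over a list of weight arrays."""
    l1 = lambda_1 * sum(np.abs(w).sum() for w in weights)
    l2 = lambda_2 * sum((w ** 2).sum() for w in weights)
    return cross_entropy(y_true, y_prob) + l1 + l2

# Toy example: 2 data points, 2 classes, and a single 2 x 2 weight matrix
y_true = np.array([[1, 0], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])
weights = [np.array([[1.0, -2.0], [0.5, 0.0]])]

print('C  =', cross_entropy(y_true, y_prob))
print("C' =", penalized_cost(y_true, y_prob, weights, lambda_1=1e-4, lambda_2=1e-4))
```

Because the penalties grow with the size of the weights, a large $\lambda$ pushes the optimizer towards small weights even when that hurts the cross entropy term, which is the underfitting risk mentioned above.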
 
 For our L1 and L2 regularization parameters, we tested each of them with the values `0`, `1e-5`, `1e-4`, `1e-3`, `1e-2`, and `1e-1`, ie:
 
 1. L1 = `0`, L2 = `0`
 2. L1 = `0`, L2 = `1e-5`
 3. L1 = `0`, L2 = `1e-4`
 4. L1 = `0`, L2 = `1e-3`
 5. L1 = `0`, L2 = `1e-2`
 6. L1 = `0`, L2 = `1e-1`
 7. L1 = `1e-5`, L2 = 0
 8. ... and so on
 
 The code and results are shown below.

In [None]:
(NUM_OF_DATA, INPUT_NODES) = train_x.shape
OUTPUT_NODES = 2

from sklearn.model_selection import train_test_split

# convert test and training y values from 'Yes' or 'No' to [1,0] or [0,1] respectively
train_y_01 = np.zeros((NUM_OF_DATA,2), dtype = int)
for i in range(NUM_OF_DATA):
    if train_y[i] == 'Yes':
        train_y_01[i][0] = 1
    else:
        train_y_01[i][1] = 1
test_y_01 = np.zeros((len(test_y_true), 2), dtype = int)
for i in range(len(test_y_true)):
    if test_y_true[i] == 'Yes':
        test_y_01[i][0] = 1
    else:
        test_y_01[i][1] = 1
        
# splitting training data into train and validation data sets
(nn_x_train, nn_x_validation, nn_y_train, nn_y_validation) = train_test_split(train_x, train_y_01, train_size = 0.8, test_size = 0.2, random_state = 1)

In [None]:
# Neural Network architecture (neurons in hidden layer)
LAYER1_NODES = 500

# Setting epochs, batch size and learning rate
EPOCHS = 50
BATCH_SIZE = 256
LEARNING_RATE = 1e-4

# L1 and L2 regularization parameters to experiment with
L1_list = [0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
L2_list = [0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import MaxNorm
from keras import regularizers #for l1 or l2 regularizers
from keras.callbacks import EarlyStopping #stop training when monitored argument stops decreasing/increasing
import matplotlib.pyplot as plt

# prepare dataset with an optional input scaler (can be None)
def transform_dataset(input_scaler, train_X, val_X, test_X):
    # scale inputs
    if input_scaler is not None:
        # fit scaler on the training data only
        input_scaler.fit(train_X)
        # transform training dataset
        trainX = input_scaler.transform(train_X)
        # transform validation dataset
        validX = input_scaler.transform(val_X)
        # transform test dataset
        testX = input_scaler.transform(test_X)
        return trainX, validX, testX
    else:
        return train_X, val_X, test_X

# train the Neural Network model
# (reads the globals LAYER1_NODES, L1, L2, LEARNING_RATE, EPOCHS and BATCH_SIZE)
def evaluate_model(train_X, train_y, val_X, val_y, test_X, test_y):
    start = time.time()
    # define model
    model = Sequential([
        #input to first hidden layer
        Dense(units = LAYER1_NODES, input_dim = INPUT_NODES,
              activation = 'relu', kernel_constraint = MaxNorm(4),
              kernel_regularizer = regularizers.l1_l2(l1 = L1, l2 = L2)
              ),

        #first hidden layer to output
        Dense(units = OUTPUT_NODES, activation = 'softmax'),
        ])

    # compile model
    opt = optimizers.Adam(learning_rate = LEARNING_RATE)
    model.compile(loss = 'categorical_crossentropy', optimizer = opt, metrics = ['accuracy'])
    
    # fit model
    history = model.fit(train_X, train_y, epochs = EPOCHS, batch_size = BATCH_SIZE, 
          validation_data = (val_X, val_y), verbose = 0,
          # callbacks = [EarlyStopping(monitor='val_accuracy', patience=20)]
          ) 

    # evaluate the model
    _, test_acc = model.evaluate(test_X, test_y)
    
    end = time.time()
    time_taken = end - start
    
    train_cost = history.history['loss']
    train_acc = history.history['accuracy']
    val_cost = history.history['val_loss']
    val_acc = history.history['val_accuracy']
    return (train_cost, train_acc, val_cost, val_acc, test_acc, time_taken)

# transform data and train model
def run_model(input_scaler, train_X, train_y, val_X, val_y, test_X, test_y):
    # get dataset
    trainX, valX, testX = transform_dataset(input_scaler, train_X, val_X, test_X)
    result = evaluate_model(trainX, train_y, valX, val_y, testX, test_y)
    return result

for j, L1 in enumerate(L1_list):
    fig1, ax1 = plt.subplots(2,3, figsize = (40,20))
    fig2, ax2 = plt.subplots(2,3, figsize = (40,20))
    for i, L2 in enumerate(L2_list):
        print('Running L1 =',L1, 'L2 =',L2)
        (none_train_cost, none_train_acc, none_val_cost, none_val_acc, none_test_acc, _) = run_model(None, nn_x_train, nn_y_train, nn_x_validation, nn_y_validation, test_x, test_y_01)
        (minmax_train_cost, minmax_train_acc, minmax_val_cost, minmax_val_acc, minmax_test_acc, _) = run_model(MinMaxScaler(), nn_x_train, nn_y_train, nn_x_validation, nn_y_validation, test_x, test_y_01)
        (std_train_cost, std_train_acc, std_val_cost, std_val_acc, std_test_acc, _) = run_model(StandardScaler(), nn_x_train, nn_y_train, nn_x_validation, nn_y_validation, test_x, test_y_01)
        
        ep = list(range(1,EPOCHS+1))
     
        # plot cost functions
        ax1[i//3, i%3].plot(ep, none_train_cost, 'r:', label = 'None (Train)')
        ax1[i//3, i%3].plot(ep, none_val_cost, 'ro-', label = 'None(Val)')
        ax1[i//3, i%3].plot(ep, minmax_train_cost, 'b:', label = 'MinMax (Train)')
        ax1[i//3, i%3].plot(ep, minmax_val_cost, 'bo-', label = 'MinMax (Val)')
        ax1[i//3, i%3].plot(ep, std_train_cost, 'k:', label = 'Std (Train)')
        ax1[i//3, i%3].plot(ep, std_val_cost, 'ko-', label = 'Std (Val)')
        ax1[i//3, i%3].set_xlabel('EPOCH', fontsize = 20)
        ax1[i//3, i%3].set_ylabel('cost', fontsize = 20)
        ax1[i//3, i%3].set_title('Change in cost (L1='+str(L1)+',L2='+str(L2)+')', fontsize = 25)
        ax1[i//3, i%3].legend(fontsize = 20)
        
        # plot accuracy functions
        ax2[i//3, i%3].plot(ep, none_train_acc, 'r:', label = 'None (Train)')
        ax2[i//3, i%3].plot(ep, none_val_acc, 'ro-', label = 'None(Val)')
        ax2[i//3, i%3].plot(ep, [none_test_acc]*len(ep), 'r-')
        ax2[i//3, i%3].plot(ep, minmax_train_acc, 'b:', label = 'MinMax (Train)')
        ax2[i//3, i%3].plot(ep, minmax_val_acc, 'bo-', label = 'MinMax (Val)')
        ax2[i//3, i%3].plot(ep, [minmax_test_acc]*len(ep), 'b-')
        ax2[i//3, i%3].plot(ep, std_train_acc, 'k:', label = 'Std (Train)')
        ax2[i//3, i%3].plot(ep, std_val_acc, 'ko-', label = 'Std (Val)')
        ax2[i//3, i%3].plot(ep, [std_test_acc]*len(ep), 'k-')
        ax2[i//3, i%3].set_xlabel('EPOCH', fontsize = 20)
        ax2[i//3, i%3].set_ylabel('accuracy', fontsize = 20)
        ax2[i//3, i%3].set_ylim(0.4, 1)
        ax2[i//3, i%3].set_title('Change in accuracy (L1='+str(L1)+',L2='+str(L2)+')', fontsize = 25)
        ax2[i//3, i%3].legend(fontsize = 20)
    fig1.show()
    fig2.show()


In [None]:
# delete variables that will no longer be used to free memory
del none_train_cost
del none_train_acc
del none_val_cost
del none_val_acc
del none_test_acc
del minmax_train_cost
del minmax_train_acc
del minmax_val_cost
del minmax_val_acc
del minmax_test_acc
del std_train_cost
del std_train_acc
del std_val_cost
del std_val_acc
del std_test_acc
del fig1
del ax1
del fig2
del ax2
del L1_list
del L2_list

 ##### Notes:
 
 1. $\lambda_1$ for the L1 regularization seems to have a hugely negative effect on the training of the model, especially when it is set to large values such as 0.1 and 0.01. In such cases, regardless of the value of $\lambda_2$, the resulting models perform very badly. The best of these cases are the ones that use the non-transformed original data as input, and even those produce highly fluctuating predictions on both the training and the validation data. With normalized/standardized inputs, on the other hand, we constantly obtain a prediction accuracy of around 0.5 on both the training and validation data sets, which shows that the models are not learning at all.
 
 2. Discounting the cases where $\lambda_1 = 0.1$ or $0.01$, we find that training inputs that are standardized perform best in regards to training a model that produces high training, validation and test accuracies. The models with standardized inputs can often produce training and test accuracies that are 5-10\% higher than that of the non-transformed original inputs, which produce the worst set of results among all 3 types of models.

 3. Similarly, when the value of $\lambda_2$ is set too large, we tend to observe a decrease in learning in the Neural Network models. In these cases, while the cost values constantly decrease, the training and validation accuracies either do not change or merely fluctuate around a less than optimal value. The fix to this problem is to decrease the values of $\lambda_1$ and $\lambda_2$.

 4. The issue that was encountered in notes 1 and 3 is known as underfitting.

 5. The best results are obtained when we use only the L2 regularizer and set $\lambda_2$ = `1e-4` or `1e-5`. There seem to be a lot more fluctuations and lower accuracy values if we have $\lambda$ go any smaller, although things might change if we also switch out certain other parameters.

 6. Regardless of L1 and L2 values used, we find that the models utilizing the original data (not transformed) tend to have significantly higher costs as compared to the normalized and standardized input data.


 * #### Batch size and Learning Rate
 
   1. In the paper "Don't Decay the Learning Rate, Increase the Batch Size" by Smith, Kindermans, Ying and Le, they wrote:  

    *** 
 "When we decay the learning rate, the noise scale falls, enabling us to converge to the minimum of the cost function. However we can achieve the same reduction in noise scale at constant learning rate by increasing the batch size. The main contribution of this work is to show that it is possible to make efficient use of vast training batches, if one increases the batch size during training at constant learning rate until $B \sim N/10$ (batch size around 1/10 of training data size). After this point, we revert to the use of decaying learning rates." 
    ***   
  Therefore, I had to play around with various different batch sizes and learning rates in this section to observe how the model fares with different combinations.
  
   2. The batch sizes that will be tested are 4, 32, 128, 256, 512, and 1024. The learning rates that will be tested are 0.1, 0.01, 0.001, 0.0001, `1e-5`, and `1e-6`.

  The code and results are shown below.

In [None]:
LAYER1_NODES = 500

# Setting epochs and L1, L2 regularization parameters
EPOCHS = 50
L1 = 0
L2 = 1e-5


batch_size_list = [4, 32, 128, 256, 512, 1024]
lr_list = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]

for j, BATCH_SIZE in enumerate(batch_size_list):
    fig1, ax1 = plt.subplots(2,3, figsize = (40,20))
    fig2, ax2 = plt.subplots(2,3, figsize = (40,20))
    for i, LEARNING_RATE in enumerate(lr_list):
        print('Running BS =',BATCH_SIZE, 'LR =',LEARNING_RATE)
        (train_cost, train_acc, val_cost, val_acc, test_acc, _) = run_model(StandardScaler(), nn_x_train, nn_y_train, nn_x_validation, nn_y_validation, test_x, test_y_01)
        
        ep = list(range(1,EPOCHS+1))
     
        # plot cost functions
        ax1[i//3, i%3].plot(ep, train_cost, 'r:', label = 'Train')
        ax1[i//3, i%3].plot(ep, val_cost, 'ro-', label = 'Val')
        ax1[i//3, i%3].set_xlabel('EPOCH', fontsize = 20)
        ax1[i//3, i%3].set_ylabel('cost', fontsize = 20)
        ax1[i//3, i%3].set_title('Change in cost (BS='+str(BATCH_SIZE)+', LR='+str(LEARNING_RATE)+')', fontsize = 25)
        ax1[i//3, i%3].legend(fontsize = 20)
        
        # plot accuracy functions
        ax2[i//3, i%3].plot(ep, train_acc, 'r:', label = 'Train')
        ax2[i//3, i%3].plot(ep, val_acc, 'ro-', label = 'Val')
        ax2[i//3, i%3].plot(ep, [test_acc]*len(ep), 'r-')
        ax2[i//3, i%3].set_xlabel('EPOCH', fontsize = 20)
        ax2[i//3, i%3].set_ylabel('accuracy', fontsize = 20)
        ax2[i//3, i%3].set_ylim(0.4, 1)
        ax2[i//3, i%3].set_title('Change in accuracy (BS='+str(BATCH_SIZE)+', LR='+str(LEARNING_RATE)+')', fontsize = 25)
        ax2[i//3, i%3].legend(fontsize = 20)
    fig1.show()
    fig2.show()


In [None]:
# delete variables that will no longer be used to free memory
del train_cost
del train_acc
del val_cost
del val_acc
del test_acc
del fig1
del ax1
del fig2
del ax2
del evaluate_model
del run_model
del batch_size_list
del lr_list

##### Notes:

1. Analysis on small batches as compared to large batches:  
 * Takes a long time to train, since the model has to be updated more times. In particular, the model updates every time we run through a batch. So if we have a training size $N$ and batch size $k$, the model is updated $N/k$ times per epoch.  
 * The smaller the batch size, the smaller the learning rate that should be used. With a small batch size, it is easier to overshoot in the gradient descent compared to bigger batch sizes.
 
2. Overall, it seems that using a batch size of between 32 and 256, learning rate of about `1e-5` to `1e-4`, and $\lambda$ of `1e-4` produces the most balanced result in terms of no underfitting/overfitting and highest training/validation/test accuracies.
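The $N/k$ relationship from note 1 can be tabulated for the batch sizes tested above. The value of `N_TRAIN` is an assumption: it takes the 3449 training rows mentioned earlier and keeps the 80% that remain after the validation split. The count is rounded up because the last, partial batch of an epoch still triggers an update.

```python
import math

N_TRAIN = 2759   # assumed: 80% of the 3449 training rows, after the validation split
batch_sizes = [4, 32, 128, 256, 512, 1024]

# number of weight updates in one epoch for each batch size
updates_per_epoch = {k: math.ceil(N_TRAIN / k) for k in batch_sizes}
for k, updates in updates_per_epoch.items():
    print(f'batch size {k:>4}: {updates} updates per epoch')
```

Under this assumption a batch size of 4 needs 690 updates per epoch, compared with only 3 for a batch size of 1024, which is in line with the much longer training times seen for small batches.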

 * #### Network Architecture
 
   1. From what we have achieved so far, we seem to be able to train relatively good models. In fact, we have even managed to train some models to the point where we achieve 100\% training accuracies. However, the corresponding validation accuracies have never broken the 88\% mark, signalling the issue of overfitting.
   
   2. Other than incorporating regularization and changing the batch size and learning rate, another thing that could be done to improve the validation accuracy and reduce the overfitting of the model is to change the network architecture.
   
   3. The network architectures that we will experiment with are:
     * 16384 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 50 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 500 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 1000 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 50 $\rightarrow$ 50 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 50 $\rightarrow$ 1000 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 500 $\rightarrow$ 50 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 500 $\rightarrow$ 500 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 500 $\rightarrow$ 1000 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 1000 $\rightarrow$ 50 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 1000 $\rightarrow$ 500 $\rightarrow$ 2  
     * 16384 $\rightarrow$ 1000 $\rightarrow$ 1000 $\rightarrow$ 2  
     
   The code and results are shown below.

In [None]:
# Setting epochs, batch size, learning rate and L1, L2 regularization parameters
EPOCHS = 50
L1 = 0
L2 = 1e-5
BATCH_SIZE = 256
LEARNING_RATE = 1e-4

NN_DESIGNS = [(INPUT_NODES, OUTPUT_NODES), # no hidden layer
              (INPUT_NODES, 50, OUTPUT_NODES), # 1 hidden layer
              (INPUT_NODES, 500, OUTPUT_NODES),
              (INPUT_NODES, 1000, OUTPUT_NODES),
              (INPUT_NODES, 50, 50, OUTPUT_NODES), # 2 hidden layers
              (INPUT_NODES, 50, 1000, OUTPUT_NODES),
              (INPUT_NODES, 500, 50, OUTPUT_NODES),
              (INPUT_NODES, 500, 500, OUTPUT_NODES),
              (INPUT_NODES, 500, 1000, OUTPUT_NODES),
              (INPUT_NODES, 1000, 50, OUTPUT_NODES),
              (INPUT_NODES, 1000, 500, OUTPUT_NODES),
              (INPUT_NODES, 1000, 1000, OUTPUT_NODES)]

def evaluate_model(nn_design, train_X, train_y, val_X, val_y, test_X, test_y):
    # define model
    model = Sequential()
    
    layers = len(nn_design)
    
    for i in range(1, layers):
        # for all hidden layers, activation function is ReLU
        if i < layers-1:
            act = 'relu'
        # for final layer, activation function is softmax
        else:
            act = 'softmax'
            
        # adding the layers of NN
        model.add(Dense(units = nn_design[i], 
                    input_dim = nn_design[i-1],
                    activation = act,
                    kernel_regularizer = regularizers.l1_l2(l1 = L1, l2 = L2)))
    
    # compile model
    opt = optimizers.Adam(learning_rate = LEARNING_RATE)
    model.compile(loss = 'categorical_crossentropy', optimizer = opt, metrics = ['accuracy'])

    # fit model
    history = model.fit(train_X, train_y, epochs = EPOCHS, batch_size = BATCH_SIZE, 
          validation_data = (val_X, val_y),
          # callbacks = [EarlyStopping(monitor='val_accuracy', patience=20)]
          ) 

    # evaluate the model
    _, test_acc = model.evaluate(test_X, test_y)
    
    train_cost = history.history['loss']
    train_acc = history.history['accuracy']
    val_cost = history.history['val_loss']
    val_acc = history.history['val_accuracy']
    return (train_cost, train_acc, val_cost, val_acc, test_acc)

(trainX, valX, testX) = transform_dataset(StandardScaler(), nn_x_train, nn_x_validation, test_x)

# one panel per design: 12 designs laid out on a 2 x 6 grid
fig1, ax1 = plt.subplots(2,6, figsize = (40,20))
fig2, ax2 = plt.subplots(2,6, figsize = (40,20))
for i, design in enumerate(NN_DESIGNS):
    print('Running design =', design)
    
    (train_cost, train_acc, val_cost, val_acc, test_acc) = evaluate_model(design, trainX, nn_y_train, valX, nn_y_validation, testX, test_y_01)
    
    ep = list(range(1,EPOCHS+1))
     
    # plot cost functions
    ax1[i//6, i%6].plot(ep, train_cost, 'k:', label = 'Train')
    ax1[i//6, i%6].plot(ep, val_cost, 'ko-', label = 'Val')
    ax1[i//6, i%6].set_xlabel('EPOCH', fontsize = 20)
    ax1[i//6, i%6].set_ylabel('cost', fontsize = 20)
    ax1[i//6, i%6].set_title('Change in cost (design='+str(design)+')', fontsize = 25)
    ax1[i//6, i%6].legend(fontsize = 20)
    
    # plot accuracy functions
    ax2[i//6, i%6].plot(ep, train_acc, 'r:', label = 'Train')
    ax2[i//6, i%6].plot(ep, val_acc, 'ro-', label = 'Val')
    ax2[i//6, i%6].plot(ep, [test_acc]*len(ep), 'r-')
    ax2[i//6, i%6].set_xlabel('EPOCH', fontsize = 20)
    ax2[i//6, i%6].set_ylabel('accuracy', fontsize = 20)
    ax2[i//6, i%6].set_ylim(0.4, 1)
    ax2[i//6, i%6].set_title('Change in accuracy (design='+str(design)+')', fontsize = 25)
    ax2[i//6, i%6].legend(fontsize = 20)
    
# show the figures once all designs have been plotted
fig1.show()
fig2.show()

In [None]:
# delete variables that will no longer be used to free memory
del train_cost
del train_acc
del val_cost
del val_acc
del test_acc
del fig1
del ax1
del fig2
del ax2
del evaluate_model
del trainX
del valX
del testX
del NN_DESIGNS

##### Notes:

1. As shown in this section, having more neurons does not necessarily equate to better results. I initially made the guess of using 1 hidden layer with 500 neurons due to the large number of input neurons. What ended up happening was that while the prediction accuracies were good in some cases, each run of the model took a really long time. Additionally this has often led to a lack of memory in my computer. In short, it is advisable to always start off experimenting with small number of neurons and hidden layers and build from there in order to save time and resources.

2. Based on the experimentations in this subsection, it seems that having 2 hidden layers with 500 and 500 neurons respectively is the best way to move forward. The training and validation accuracies with this hidden layer architecture are higher than most of the other architectures (plus they run faster too).

3. However, our job here isn't done. While we now have a model that fits the training data well (the Neural Network with the 16384 $\rightarrow$ 500 $\rightarrow$ 500 $\rightarrow$ 2 architecture achieves 100\% training accuracy when trained for more epochs), our validation and test accuracies have not improved much. Hence, we will move on and experiment with the dropout regularization technique.
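To get a rough feel for how the architectures compare in size, one can count the trainable parameters of each fully connected design. This is a back-of-the-envelope sketch: the helper `count_dense_parameters` is hypothetical, and it only counts weights and biases, so it ignores the memory taken by the data, the activations, and the optimizer state.

```python
def count_dense_parameters(design):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(design[:-1], design[1:]))

# a few of the designs tested above, written as (input, hidden..., output)
designs = [(16384, 2),
           (16384, 50, 2),
           (16384, 500, 2),
           (16384, 1000, 2),
           (16384, 500, 500, 2),
           (16384, 1000, 1000, 2)]

for design in designs:
    print(design, '->', format(count_dense_parameters(design), ','), 'parameters')
```

The first hidden layer dominates the count because of the 16384 inputs, so doubling its width roughly doubles the size of the model, whereas adding a second hidden layer of 500 neurons adds comparatively little.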


 * #### Adding Dropout
 
 Since dropout in a Neural Network is simply another type of regularizer, it may or may not be used along with the L1 and L2 regularizers. Therefore, in this section, we will experiment with training our Neural Network model in 3 cases:
    
    1. Without dropout, but with the best L1, L2 regularization parameter pairing concluded from previous sections
    
    2. With dropout, but without any L1, L2 regularization
    
    3. With dropout, and with the best L1, L2 regularization parameter pairing concluded from previous sections
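As a reminder of what dropout does mechanically, here is a minimal NumPy sketch of "inverted" dropout applied to one batch of hidden-layer activations. The function `inverted_dropout` and the all-ones activations are made up for illustration; the rate of 0.25 matches the value used in the code that follows. The Keras `Dropout` layer behaves in the same spirit: during training it zeroes a random fraction of its inputs and scales the remaining ones by $1/(1 - \text{rate})$, and at prediction time it does nothing.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((4, 500))            # stand-in for a batch of hidden-layer outputs
dropped = inverted_dropout(hidden, rate=0.25, rng=rng)

print('fraction of nodes dropped:', (dropped == 0).mean())
print('mean activation after dropout:', dropped.mean())
```

Thanks to the rescaling, the expected activation stays the same with and without dropout, so no extra correction is needed when the trained model is used for prediction.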
    
 The code and results are shown below:

In [None]:
# Setting epochs, batch size, learning rate and L1, L2 regularization parameters
EPOCHS = 50
L1 = 0
L2 = 1e-5
BATCH_SIZE = 256
LEARNING_RATE = 1e-4

DESIGN = (INPUT_NODES, 500, OUTPUT_NODES)

DROPOUT_L1L2_OPTIONS = [(False, True), (True, False), (True, True)] # (Dropout if True, L1/L2 regularized if True)

def evaluate_model(nn_design, dropout, l1_l2, train_X, train_y, val_X, val_y, test_X, test_y):
    # define model
    model = Sequential()
    
    layers = len(nn_design)
    
    for i in range(1, layers):
        # for all layers, activation function is ReLU
        if i < layers-1:
            act = 'relu'
        # for final layer, activation function is softmax
        else:
            act = 'softmax'
            
        # adding the layers of NN
        if l1_l2:
            model.add(Dense(units = nn_design[i], 
                    input_dim = nn_design[i-1],
                    activation = act,
                    kernel_regularizer = regularizers.l1_l2(l1 = L1, l2 = L2)))
        else:
            model.add(Dense(units = nn_design[i], 
                    input_dim = nn_design[i-1],
                    activation = act))
                      
        # apply dropout after hidden layers only, never after the softmax output
        if dropout and i < layers-1:
            model.add(Dropout(0.25))
    
    # compile model
    opt = optimizers.Adam(learning_rate = LEARNING_RATE)
    model.compile(loss = 'categorical_crossentropy', optimizer = opt, metrics = ['accuracy'])

    # fit model
    history = model.fit(train_X, train_y, epochs = EPOCHS, batch_size = BATCH_SIZE, 
          validation_data = (val_X, val_y),
          # callbacks = [EarlyStopping(monitor='val_accuracy', patience=20)]
          ) 

    # evaluate the model
    _, test_acc = model.evaluate(test_X, test_y)
    
    train_cost = history.history['loss']
    train_acc = history.history['accuracy']
    val_cost = history.history['val_loss']
    val_acc = history.history['val_accuracy']
    return (train_cost, train_acc, val_cost, val_acc, test_acc)

(trainX, valX, testX) = transform_dataset(StandardScaler(), nn_x_train, nn_x_validation, test_x)

# one figure for cost and one for accuracy, each with one panel per option
fig1, ax1 = plt.subplots(1,3, figsize = (40,20))
fig2, ax2 = plt.subplots(1,3, figsize = (40,20))
for i, (dropout, l1_l2) in enumerate(DROPOUT_L1L2_OPTIONS):
    option = 'dropout=' + str(dropout) + ', L1/L2=' + str(l1_l2)
    print('Running', option)
    
    (train_cost, train_acc, val_cost, val_acc, test_acc) = evaluate_model(DESIGN, dropout, l1_l2, trainX, nn_y_train, valX, nn_y_validation, testX, test_y_01)
    
    # early stopping is commented out above, so every run lasts exactly EPOCHS epochs
    ep = list(range(1,EPOCHS+1))
     
    ax1[i].plot(ep, train_cost, 'k:', label = 'Train')
    ax1[i].plot(ep, val_cost, 'ko-', label = 'Val')
    ax1[i].set_xlabel('EPOCH', fontsize = 20)
    ax1[i].set_ylabel('cost', fontsize = 20)
    ax1[i].set_title('Change in cost (' + option + ')', fontsize = 25)
    ax1[i].legend(fontsize = 20)
    
    ax2[i].plot(ep, train_acc, 'r:', label = 'Train')
    ax2[i].plot(ep, val_acc, 'ro-', label = 'Val')
    # the final test accuracy is drawn as a horizontal reference line
    ax2[i].plot(ep, [test_acc]*len(ep), 'r-', label = 'Test')
    ax2[i].set_xlabel('EPOCH', fontsize = 20)
    ax2[i].set_ylabel('accuracy', fontsize = 20)
    ax2[i].set_ylim(0.4, 1)
    ax2[i].set_title('Change in accuracy (' + option + ')', fontsize = 25)
    ax2[i].legend(fontsize = 20)

# show both figures once all three panels have been drawn
plt.show()

In [None]:
# delete variables that will no longer be used to free memory
del train_cost
del train_acc
del val_cost
del val_acc
del test_acc
del fig1
del ax1
del fig2
del ax2
del evaluate_model
del trainX
del valX
del testX
del DROPOUT_L1L2_OPTIONS

##### Notes:

Here we see that the dropout technique did not really help in our case, so we will stick with our original model (i.e. without dropout).
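
For reference, the effect of a dropout layer on a batch of activations can be sketched in plain NumPy. This is only an illustration of the "inverted dropout" idea under my own assumptions (the function name `inverted_dropout` and the toy array are hypothetical); it is not the code Keras runs internally.

```python
import numpy as np


def inverted_dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of the activations and rescale the rest."""
    if not training or rate == 0.0:
        # at prediction time the activations pass through unchanged
        return activations
    keep_prob = 1.0 - rate
    # keep each unit independently with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    # rescale the kept units so the expected activation stays the same
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((4, 8))  # toy batch: 4 samples, 8 hidden units
dropped = inverted_dropout(hidden, rate=0.25, rng=rng)
print(dropped)
```

Each entry of `dropped` is either 0 (the unit was dropped) or 1 / 0.75 (the unit was kept and rescaled), which is why no extra scaling is needed when dropout is switched off for prediction.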

### Training Best(?) Possible Neural Network in Keras

Using all the information obtained from the previous subsections, we attempt to construct and train the best possible Neural Network model to classify images of people who do and do not wear masks.

In [None]:
#Running NN

import time

from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers #for l1 or l2 regularizers
from keras.callbacks import EarlyStopping #stop training when monitored argument stops decreasing/increasing

EPOCHS = 300
L1 = 0
L2 = 1e-5
BATCH_SIZE = 256
LEARNING_RATE = 1e-4


start = time.time()
model = Sequential([
    #input to first hidden layer
    Dense(units = LAYER1_NODES, input_dim = INPUT_NODES,
          activation = 'relu',
          kernel_regularizer = regularizers.l1_l2(l1 = L1, l2 = L2)
          ),
    
    #first hidden layer to second hidden layer
    Dense(units = LAYER2_NODES, input_dim = LAYER1_NODES, 
          activation = 'relu',
          kernel_regularizer = regularizers.l1_l2(l1 = L1, l2 = L2)
          ),
    
    #second hidden layer to third hidden layer
    Dense(units = LAYER3_NODES, input_dim = LAYER2_NODES, 
          activation = 'relu',
          kernel_regularizer = regularizers.l1_l2(l1 = L1, l2 = L2)
          ),
    
    #third hidden layer to output
    Dense(units = OUTPUT_NODES, input_dim = LAYER3_NODES, activation = 'softmax'),
    ])

# compile models with the learning rates set
opt = optimizers.Adam(learning_rate = LEARNING_RATE)
model.compile(loss = 'categorical_crossentropy', optimizer = opt, metrics = ['accuracy'])

history = model.fit(nn_x_train, nn_y_train, epochs = EPOCHS, batch_size = BATCH_SIZE, 
          validation_data = (nn_x_validation, nn_y_validation), verbose = 0,
          callbacks = [EarlyStopping(monitor='val_accuracy', patience=30)]
          ) 

#evaluate model on test set
_, accuracy = model.evaluate(test_x, test_y_01)
end = time.time()
time_taken = end - start

print('Test Data Accuracy:', '{:.3f}'.format(accuracy))
print('Time taken:', round(time_taken,2), 'seconds')
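
The `EarlyStopping(monitor='val_accuracy', patience=30)` callback used above stops training once the validation accuracy has gone a number of epochs without improving. The pure-Python sketch below mimics that rule on a list of made-up validation accuracies. The function name `early_stopping_epoch` is hypothetical, and the sketch deliberately ignores other options of the real callback such as `min_delta` and `restore_best_weights`.

```python
def early_stopping_epoch(val_scores, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('-inf')
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            # strict improvement: remember it and reset the counter
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # the monitored score kept improving often enough to use every epoch
    return None


# illustrative validation accuracies, not results from this project
val_accuracy = [0.50, 0.60, 0.59, 0.58, 0.60, 0.57]
print(early_stopping_epoch(val_accuracy, patience=3))
```

Note that, unless `restore_best_weights=True` is passed to the callback, the model evaluated afterwards is the one from the last epoch that ran, not necessarily the one with the best validation accuracy.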


### Results

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# obtain predictions from model
test_y_pred_01 = model.predict(test_x, verbose = 0)

# convert the float values of test_y_pred_01 into a list of predictions 'Yes' or 'No'
test_y_pred = []
for i in range(len(test_y_pred_01)):
    if test_y_pred_01[i][0] > test_y_pred_01[i][1]:
        test_y_pred.append('Yes')
    else:
        test_y_pred.append('No')
        
# plot confusion matrix and record relevant data
fig, ax = plt.subplots(1,2, figsize = (10,5))

for i, normal in enumerate([None, 'true']):
    confmat = confusion_matrix(test_y_true, test_y_pred, labels = ['Yes', 'No'], normalize = normal)
    disp = ConfusionMatrixDisplay(confmat, display_labels = ['Yes', 'No'])
    disp.plot(ax = ax[i], cmap = plt.cm.Blues)
    if i == 0:
        disp.ax_.set_title('Non-normalized Confusion Matrix')
        classification_result['Neural Network'] = {'Time' : time_taken,
                                     'True Positive' : confmat[0][0],
                                     'True Negative' : confmat[1][1],
                                     'False Positive' : confmat[1][0],
                                     'False Negative' : confmat[0][1]}
    else:
        disp.ax_.set_title('Normalized Confusion Matrix')
        disp.im_.set_clim(0,1)
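
The loop that turns the softmax outputs into 'Yes'/'No' labels can also be written in vectorized form. The sketch below is one possible alternative, assuming (as in the loop above) that column 0 holds the 'Yes' probability and column 1 the 'No' probability; the function name `probabilities_to_labels` and the sample probabilities are hypothetical.

```python
import numpy as np


def probabilities_to_labels(probs, class_names=('Yes', 'No')):
    """Map each row of class probabilities to the name of its largest entry."""
    probs = np.asarray(probs)
    # argmax returns the column index of the highest probability in each row
    return [class_names[k] for k in probs.argmax(axis=1)]


sample_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # made-up softmax outputs
print(probabilities_to_labels(sample_probs))
```

The only behavioural difference is for exact ties: `argmax` picks the first column ('Yes'), whereas the loop above falls through to 'No'.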

##### Notes:

For some reason, even after all the parameter tuning I have done, the resulting model still seems to be overfitting the data. It is possible that the model I have trained is already the best possible one, and that a Neural Network cannot reach an accuracy of more than 90\% on the test data.

However, as mentioned previously, there is a practically unlimited number of configurations and architecture designs for a Neural Network model. It is more likely that there is a better hidden layer design, or a better tuning of parameters, that I have overlooked and that would produce a model with a higher test accuracy.

## Time and Accuracy Comparisons

The following figure compares the error rate, false positive rate, false negative rate, and running time of each method used.
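
As a small worked example of how these rates are derived from confusion-matrix counts, consider the sketch below. The counts are made up for illustration and the function name `confusion_rates` is hypothetical; the formulas are the same ones applied to the `classification_result` table in this section.

```python
def confusion_rates(tp, tn, fp, fn):
    """Compute error, false positive and false negative rates from raw counts."""
    total = tp + tn + fp + fn
    return {
        # share of all test images that were misclassified
        'error_rate': (fp + fn) / total,
        # share of actual negatives that were predicted positive
        'false_positive_rate': fp / (fp + tn),
        # share of actual positives that were predicted negative
        'false_negative_rate': fn / (fn + tp),
    }


# illustrative counts only, not results from this project
rates = confusion_rates(tp=90, tn=80, fp=20, fn=10)
print(rates)
```

With these counts, 30 of the 200 images are misclassified, 20 of the 100 actual negatives are false positives, and 10 of the 100 actual positives are false negatives.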

In [None]:
df = pd.DataFrame.from_dict(classification_result, orient = 'index')
df = df.reset_index()
total_sum = df['True Positive'] + df['True Negative'] + df['False Negative'] + df['False Positive']
df = df.assign(ErrorRate = (df['False Negative'] + df['False Positive']) / total_sum,
              TP_rate = df['True Positive'] / (df['True Positive'] + df['False Negative']),
              TN_rate = df['True Negative'] / (df['True Negative'] + df['False Positive']),
              FP_rate = df['False Positive'] / (df['True Negative'] + df['False Positive']),
              FN_rate = df['False Negative'] / (df['True Positive'] + df['False Negative']))
df = df.rename(columns = {'index' : 'Method', 'ErrorRate' : 'Error %', 'TP_rate' : 'True Positive %',
                         'TN_rate' : 'True Negative %', 'FP_rate' : 'False Positive %', 'FN_rate' : 'False Negative %'})
df = df.sort_values('Error %', ascending = False)

ind = np.arange(len(df))
width = 0.3

method = np.asarray(df['Method'])
time_taken = np.asarray(df['Time'])
fp = np.asarray(df['False Positive %'])
fn = np.asarray(df['False Negative %'])
err = np.asarray(df['Error %'])

fig, ax1 = plt.subplots(figsize = (20,10))
ax1.bar(ind - width, err, width,color = 'yellow', edgecolor = 'black')
ax1.bar(ind, fp, width, color = 'red', edgecolor = 'black')
ax1.bar(ind + width, fn, width, color = 'green', edgecolor = 'black')
ax1.legend(['Error Rate', 'False Positive Rate', 'False Negative Rate'], fontsize = 14)
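ax1.set_ylim(0,0.55)
ax1.set_ylabel('Rate', fontsize = 14)
# fix the tick positions explicitly so that the tick labels always match them
ax1.set_yticks([0, 0.1, 0.2, 0.3, 0.4, 0.5])
ax1.tick_params(axis = 'y', labelsize = 14)
# centre each method label under its group of three bars
ax1.set_xticks(ind)
ax1.set_xticklabels(method, rotation = 20, fontsize = 14)
ax1.set_title('Error and Time Analysis for Different ML Methods', fontsize = 18)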

for i in range(len(df)):
    ax1.text(ind[i] - width, err[i] + 0.01, str(round(err[i], 3)), 
             ha = 'center', va = 'bottom', size = 14, rotation = 'vertical')
    ax1.text(ind[i], fp[i] + 0.01, str(round(fp[i], 3)), 
             ha = 'center', va = 'bottom', size = 14, rotation = 'vertical')
    ax1.text(ind[i] + width, fn[i] + 0.01, str(round(fn[i], 3)), 
             ha = 'center', va = 'bottom', size = 14, rotation = 'vertical')

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

color = 'blue'
ax2.set_ylabel('Time (seconds)', color=color, fontsize = 14)  # we already handled the x-label with ax1
# plot against the numeric bar positions so the line aligns with the bar groups
ax2.plot(ind, time_taken, 'o-b')
# ax2.set_ylim(0,1300)
# ax2.set_yticklabels(list(range(0,1201,200)), fontsize = 14)
ax2.tick_params(axis='y', labelcolor=color)

## Key Takeaways

## References and Resources

Plots:
* https://stackoverflow.com/questions/61825227/plotting-multiple-confusion-matrix-side-by-side

* https://matplotlib.org/3.1.1/gallery/subplots_axes_and_figures/figure_title.html

* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html

* https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py

* https://matplotlib.org/examples/api/barchart_demo.html

SVM:
* https://scikit-learn.org/stable/modules/svm.html

* https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

* https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

* https://stats.stackexchange.com/questions/18030/how-to-select-kernel-for-svm

* https://towardsdatascience.com/a-guide-to-svm-parameter-tuning-8bfe6b8a452c

* https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f

* http://www.cs.cmu.edu/~aarti/Class/10601/slides/svm_11_22_2011.pdf

Logistic Regression:
* https://realpython.com/logistic-regression-python/

* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

* https://www.knime.com/blog/regularization-for-logistic-regression-l1-l2-gauss-or-laplace

* https://stackoverflow.com/questions/22851316/what-is-the-inverse-of-regularization-strength-in-logistic-regression-how-shoul

LDA:
* https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html

Logistic Regression vs LDA:
* https://stats.stackexchange.com/questions/95247/logistic-regression-vs-lda-as-two-class-classifiers

Random Forest:
* https://www.datacamp.com/community/tutorials/random-forests-classifier-python

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

* https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76

* https://en.wikipedia.org/wiki/Decision_tree_learning

* https://towardsdatascience.com/how-to-visualize-a-decision-tree-from-a-random-forest-in-python-using-scikit-learn-38ad2d75f21c

Neural Network:
* https://adventuresinmachinelearning.com/python-tensorflow-tutorial/

* https://adventuresinmachinelearning.com/improve-neural-networks-part-1/

* https://www.ritchieng.com/machine-learning/deep-learning/tensorflow/regularization/

* https://en.wikipedia.org/wiki/Backpropagation

* https://machinelearningmastery.com/early-stopping-to-avoid-overtraining-neural-network-models/

* https://www.kdnuggets.com/2019/12/5-techniques-prevent-overfitting-neural-networks.html

* https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

* https://towardsdatascience.com/pruning-deep-neural-network-56cae1ec5505

* https://towardsdatascience.com/how-to-train-neural-network-faster-with-optimizers-d297730b3713

* https://stats.stackexchange.com/questions/345990/why-does-the-loss-accuracy-fluctuate-during-the-training-keras-lstm

* https://stats.stackexchange.com/questions/255105/why-is-the-validation-accuracy-fluctuating

* https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

* https://stackoverflow.com/questions/45587378/how-to-get-predicted-values-in-keras

* https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/

* http://theorangeduck.com/page/neural-network-not-working

* https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/