# Multi-Layer Perceptron (MLP) - Large & Complex Data Set


In this notebook we will train an MLP classifier on a **large & non-linear data set**. We will use a multi-class image classification problem for experimentation.

For a comparative understanding, we will compare the performance of the MLP with the SVM and Logistic Regression classifiers.

We will use a dimensionality reduction technique (Principal Component Analysis) to project the features into a lower-dimensional space and reduce the training time.

Due to the non-linearity of the features (i.e., pixels), we will use the Gaussian Radial Basis Function (RBF) Kernel based Support Vector Machine (SVM). Previously we have seen that the Gaussian RBF Kernel based SVM performs better than the Softmax regression classifier.

In this notebook we will investigate whether MLP outperforms the Gaussian RBF Kernel SVM on a very large complex data set.

We will conduct the following experiments.


## Experiments

- Experiment 1: Multi-Layer Perceptron 
- Experiment 2: Multi-Layer Perceptron + PCA
- Experiment 3: Support Vector Machine (SVC with RBF Kernel) + PCA
- Experiment 4: Logistic Regression (Softmax Regression) + PCA


## Dataset: MNIST


We will use the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.


There are 70,000 images. Each image is 28x28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black).

Thus, each image has 784 features. 
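The relationship between the 28x28 pixel grid and the 784 features can be checked with a small sketch. The array below is a synthetic stand-in for one row of the data matrix (not real MNIST data), and the variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-in for one MNIST row: 784 pixel intensities in [0, 255].
flat_image = np.arange(784) % 256

# Each row of X is a flattened 28x28 image, so reshaping recovers the grid.
image_grid = flat_image.reshape(28, 28)

print(image_grid.shape)  # (28, 28)
print(image_grid.size)   # 784 = 28 * 28

# Flattening the grid again gives back the original feature vector.
assert np.array_equal(image_grid.ravel(), flat_image)
```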

In [48]:
import numpy as np
import pandas as pd



from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Load Data and Create Data Matrix (X) and the Label Vector (y)

In [37]:
# fetch_mldata has been removed from scikit-learn; fetch_openml provides the same MNIST data.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

# OpenML returns the labels as strings, so cast them to numbers.
X, y = mnist["data"], mnist["target"].astype(np.float64)

print(X.shape)
print(y.shape)

(70000, 784)
(70000,)




## Split Data Into Training and Test Sets

The MNIST dataset is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images).

We will shuffle the training set to ensure that all cross-validation folds will be similar. 

In [38]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
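The shuffle above works because the same permutation indexes both the data matrix and the label vector, so every image stays paired with its label. A minimal sketch on toy data (the arrays and the seeded generator are illustrative assumptions, not part of the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Toy data: row i of X_toy is [2*i, 2*i + 1], and its label is i.
X_toy = np.arange(10).reshape(5, 2)
y_toy = np.arange(5)

# One permutation, applied to both arrays.
perm = rng.permutation(len(X_toy))
X_shuffled, y_shuffled = X_toy[perm], y_toy[perm]

# Every row still matches its label after shuffling.
assert np.array_equal(X_shuffled[:, 0], 2 * y_shuffled)
```

Shuffling the two arrays with two independent permutations would break this pairing and make the labels meaningless.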

# Optimization Using Dimensionality Reduction

We can reduce the training time of the classifiers by reducing the number of features. Our assumption is that the essence or core content of the data does not span all dimensions. The technique for reducing the dimension of data is known as dimensionality reduction.

For a gentle introduction to various dimensionality reduction techniques, see the notebook "Dimensionality Reduction" in the GitHub repository.

We will use the Principal Component Analysis (PCA) dimensionality reduction technique to project the MNIST dataset (784 features) to a lower-dimensional space while retaining as much variance as possible.

The goal is to see the improvement in training time due to this dimensionality reduction.

Before we apply the PCA, we need to standardize the data.

## Standardize the Data

PCA is influenced by the scale of the data. Thus we need to scale the features before applying PCA.

For understanding the negative effect of not scaling the data, see the following post:

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

Note that we fit the scaler on the training set only and then use it to transform both the training set and the test set.

In [39]:
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(X_train)

# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
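To make the "fit on training data only" rule concrete, here is a NumPy-only sketch of the same idea. The helper names `fit_standardizer` and `standardize` are hypothetical and this is not StandardScaler's actual code, although it follows the same convention of using the population standard deviation and leaving constant features (such as always-blank border pixels) undivided.

```python
import numpy as np

def fit_standardizer(X_train):
    """Learn per-feature mean and scale from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Constant features have zero std; use 1 to avoid division by zero.
    scale = np.where(std == 0.0, 1.0, std)
    return mean, scale

def standardize(X, mean, scale):
    """Apply statistics learned elsewhere (here: on the training set)."""
    return (X - mean) / scale

X_train_toy = np.array([[0.0, 10.0, 0.0],
                        [2.0, 20.0, 0.0],
                        [4.0, 30.0, 0.0]])
X_test_toy = np.array([[6.0, 40.0, 0.0]])

mean, scale = fit_standardizer(X_train_toy)
X_train_std = standardize(X_train_toy, mean, scale)
X_test_std = standardize(X_test_toy, mean, scale)  # reuses training statistics

print(X_train_std.mean(axis=0))  # approximately zero for every feature
print(X_test_std)                # test values expressed in training units
```

The test set is never used to compute the mean or the scale, so no information about it leaks into the preprocessing step.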



## Apply PCA

While applying PCA we can set the number of principal components with the "n_components" parameter. More importantly, we can also use this parameter to specify the fraction of variance we want to retain in the extracted features.

For example, if we set it to 0.95, sklearn will choose the **minimum number of principal components** such that 95% of the variance is retained.

In [40]:
%%time
pca = PCA(n_components=0.95)

pca.fit(X_train)

CPU times: user 18 s, sys: 926 ms, total: 18.9 s
Wall time: 5.96 s
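The rule behind a fractional "n_components" can be sketched with NumPy: compute the share of variance carried by each principal component, accumulate the shares, and keep the smallest number of components whose cumulative share exceeds the target. The function names below are hypothetical and the sketch is an approximation of the idea, not scikit-learn's implementation.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of the total variance carried by each principal component."""
    X_centered = X - X.mean(axis=0)
    # Singular values come back sorted in descending order.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def n_components_for_variance(ratios, target=0.95):
    """Smallest number of components whose cumulative share exceeds target."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, target, side="right")) + 1
    return min(k, len(ratios))

# Hand-made variance shares: cumulative sums are 0.60, 0.90, 0.98, 1.00.
toy_ratios = np.array([0.60, 0.30, 0.08, 0.02])
print(n_components_for_variance(toy_ratios, 0.95))  # 3

# The same rule applied to synthetic data with very unequal feature scales.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4)) * np.array([10.0, 5.0, 1.0, 0.1])
data_ratios = explained_variance_ratio(X_toy)
print(n_components_for_variance(data_ratios, 0.95))
```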


## Number of Principal Components

We can find how many components PCA chose after fitting the model by using the following attribute: n_components_

We will see that 95% of the variance amounts to **331 principal components**.

In [41]:
print("Number of Principal Components: ", pca.n_components_)  

Number of Principal Components:  331


## Apply the Mapping (Transform) to both the Training Set and the Test Set

In [42]:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
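Conceptually, the transform step centers the data with the training mean and projects it onto the principal directions learned from the training set. The NumPy sketch below illustrates this on synthetic data; `fit_pca` and `pca_transform` are hypothetical helpers, and the signs of the components may differ from what scikit-learn's PCA returns.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Learn the mean and the top principal directions from training data."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Center with the training mean, then project onto the components."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X_train_toy = rng.normal(size=(50, 5))
X_test_toy = rng.normal(size=(10, 5))

mean, components = fit_pca(X_train_toy, n_components=2)
X_train_proj = pca_transform(X_train_toy, mean, components)
X_test_proj = pca_transform(X_test_toy, mean, components)  # same mapping

print(X_train_proj.shape, X_test_proj.shape)  # (50, 2) (10, 2)
```

As with the scaler, the mapping is learned from the training set only and then reused unchanged on the test set.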

## Experiments

We will conduct the following experiments.

- Experiment 1: Multi-Layer Perceptron 
- Experiment 2: Multi-Layer Perceptron + PCA
- Experiment 3: Support Vector Machine (SVC with RBF Kernel) + PCA
- Experiment 4: Logistic Regression (Softmax Regression) + PCA


## Experiment 1: MLP 

First we train the MLP without applying PCA on the data. So we use 784 features.

See the notebook "Perceptron-MLP-Nonlinear Data" for a discussion on various solvers that are used by MLP and the hyperparameters.

Since the data set for this experiment is large, we will use the "sgd" solver. Although "adam" is very similar to "sgd", it requires a lot more epochs (set by the "max_iter" hyperparameter) on this data set.

Although we did not do systematic hyperparameter tuning, through repeated experimentation we arrived at near-optimal values for the following hyperparameters.
- One hidden layer with 150 neurons worked pretty well.
- 200 epochs were enough. More epochs result in overtraining, which causes overfitting.
- The regularization constant alpha was set to 0.1.
- The 'logistic' activation function performed better than 'relu'.
- The 'learning_rate' is set to 'adaptive'. It is useful for faster convergence: the learning rate stays at 'learning_rate_init' while the training loss keeps decreasing and is reduced when learning slows down.
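The 'adaptive' schedule from the last item can be sketched in plain Python. This is a simplified model and not scikit-learn's code: the function name and the `patience` parameter are hypothetical, while the division by 5 follows the behaviour described in the scikit-learn documentation for `learning_rate='adaptive'`.

```python
def adaptive_learning_rates(losses, lr_init=0.1, tol=1e-5, patience=2, factor=5.0):
    """Return the learning rate used in each epoch of a simplified adaptive schedule.

    The rate stays at its current value while the loss improves by at least
    `tol`; after `patience` consecutive stalled epochs it is divided by `factor`.
    """
    lr = lr_init
    best_loss = float("inf")
    stalled = 0
    rates = []
    for loss in losses:
        rates.append(lr)  # rate in effect during this epoch
        if loss > best_loss - tol:
            stalled += 1  # no sufficient improvement this epoch
        else:
            stalled = 0
        best_loss = min(best_loss, loss)
        if stalled >= patience:
            lr /= factor
            stalled = 0
    return rates

# Hypothetical loss curve: improvement stalls in epochs 3 and 4.
toy_losses = [1.0, 0.5, 0.5, 0.5, 0.4]
toy_rates = adaptive_learning_rates(toy_losses)
print(toy_rates)  # the rate drops by a factor of 5 for epoch 5
```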

To prevent overfitting due to overtraining, a useful technique is to apply **early stopping**. So, the "early_stopping" parameter should be set to True. 

If we use early stopping, then we should also set "n_iter_no_change" to a suitable value. It is the maximum number of consecutive epochs that may fail to improve by at least "tol" before training stops. The default is 10.

We did not use these two hyperparameters because we stopped training at 200 epochs.
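For readers who want to see how "early_stopping", "tol" and "n_iter_no_change" interact, the following plain-Python sketch models the stopping rule on a list of validation scores. It is a simplified illustration under the assumption that a held-out validation score is computed after every epoch; the function name is hypothetical and the exact bookkeeping inside MLPClassifier may differ.

```python
def early_stopping_epoch(val_scores, tol=1e-4, n_iter_no_change=10):
    """Return the epoch (1-based) at which training would stop, or None.

    An epoch counts as "no improvement" when its validation score does not
    exceed the best score seen so far by more than `tol`.
    """
    best = float("-inf")
    no_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            no_improvement = 0
        else:
            no_improvement += 1
        best = max(best, score)
        if no_improvement >= n_iter_no_change:
            return epoch
    return None  # the patience budget was never exhausted

# Hypothetical validation accuracies: progress flattens after epoch 2.
toy_scores = [0.90, 0.95, 0.9505, 0.9502, 0.9508, 0.99]
print(early_stopping_epoch(toy_scores, tol=1e-3, n_iter_no_change=3))  # 5
```

In this toy run training stops at epoch 5, so the later jump to 0.99 is never reached; this is the trade-off that a too-small "n_iter_no_change" introduces.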


In [43]:
%%time
mlp_clf = MLPClassifier(hidden_layer_sizes=(150,), max_iter=200, alpha=0.1,
                    solver='sgd', verbose=True, tol=1e-5, random_state=1, 
                    learning_rate = 'adaptive', learning_rate_init=0.1, activation='logistic')

mlp_clf.fit(X_train, y_train)

print("No. of Iterations:", mlp_clf.n_iter_ )

y_train_predicted = mlp_clf.predict(X_train)

train_accuracy_mlp = np.mean(y_train_predicted == y_train)
print("\nTraining Accuracy: ", train_accuracy_mlp)

Iteration 1, loss = 0.41387661
Iteration 2, loss = 0.25759762
Iteration 3, loss = 0.22611659
Iteration 4, loss = 0.20767159
Iteration 5, loss = 0.19583482
Iteration 6, loss = 0.18801675
Iteration 7, loss = 0.18316432
Iteration 8, loss = 0.17839837
Iteration 9, loss = 0.17495118
Iteration 10, loss = 0.17279221
Iteration 11, loss = 0.17159208
Iteration 12, loss = 0.16934288
Iteration 13, loss = 0.16825178
Iteration 14, loss = 0.16750019
Iteration 15, loss = 0.16681934
Iteration 16, loss = 0.16650353
Iteration 17, loss = 0.16513859
Iteration 18, loss = 0.16393651
Iteration 19, loss = 0.16341908
Iteration 20, loss = 0.16301870
Iteration 21, loss = 0.16276893
Iteration 22, loss = 0.16169063
Iteration 23, loss = 0.16145966
Iteration 24, loss = 0.16092329
Iteration 25, loss = 0.16071513
Iteration 26, loss = 0.15961301
Iteration 27, loss = 0.15920282
Iteration 28, loss = 0.15883535
Iteration 29, loss = 0.15803681
Iteration 30, loss = 0.15770153
Iteration 31, loss = 0.15796250
Iteration 32, los




Training Accuracy:  0.9908166666666667
CPU times: user 13min 43s, sys: 2min 11s, total: 15min 54s
Wall time: 4min 21s


## Experiment 1: Evaluate MLP on Test Data

In [44]:
%%time
y_test_predicted = mlp_clf.predict(X_test)

test_accuracy_mlp = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", test_accuracy_mlp)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Test Accuracy:  0.9752

Test Confusion Matrix:
[[ 970    0    2    1    1    2    2    1    1    0]
 [   0 1124    4    0    0    1    2    1    3    0]
 [   5    0 1000    3    3    1    3    7   10    0]
 [   0    0    5  992    0    3    0    4    3    3]
 [   1    0    2    1  962    0    6    2    2    6]
 [   2    0    0    7    2  870    4    1    5    1]
 [   7    3    3    0    4    9  929    0    3    0]
 [   1    9   10    3    2    0    0  994    1    8]
 [   5    1    2    5    4    3    2    4  946    2]
 [   4    6    1   10   12    4    0    6    1  965]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.97      0.99      0.98       980
         1.0       0.98      0.99      0.99      1135
         2.0       0.97      0.97      0.97      1032
         3.0       0.97      0.98      0.98      1010
         4.0       0.97      0.98      0.98       982
         5.0       0.97      0.98      0.97       892
         6.0      
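The accuracy and the per-class figures in the report can be derived directly from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below copies the Experiment 1 matrix printed above into NumPy; the variable names are illustrative.

```python
import numpy as np

# Experiment 1 test confusion matrix (rows: true digit, columns: predicted digit).
cm = np.array([
    [970,    0,    2,   1,   1,   2,   2,   1,   1,   0],
    [  0, 1124,    4,   0,   0,   1,   2,   1,   3,   0],
    [  5,    0, 1000,   3,   3,   1,   3,   7,  10,   0],
    [  0,    0,    5, 992,   0,   3,   0,   4,   3,   3],
    [  1,    0,    2,   1, 962,   0,   6,   2,   2,   6],
    [  2,    0,    0,   7,   2, 870,   4,   1,   5,   1],
    [  7,    3,    3,   0,   4,   9, 929,   0,   3,   0],
    [  1,    9,   10,   3,   2,   0,   0, 994,   1,   8],
    [  5,    1,    2,   5,   4,   3,   2,   4, 946,   2],
    [  4,    6,    1,  10,  12,   4,   0,   6,   1, 965],
])

correct = np.diag(cm)                 # correctly classified images per class
accuracy = correct.sum() / cm.sum()   # overall fraction of correct predictions
recall = correct / cm.sum(axis=1)     # correct / number of true instances
precision = correct / cm.sum(axis=0)  # correct / number of predictions

print("Accuracy:", accuracy)
print("Recall per class:   ", np.round(recall, 2))
print("Precision per class:", np.round(precision, 2))
```

The row sums reproduce the "support" column of the report, and the diagonal divided by the total reproduces the test accuracy.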

## Experiment 2: MLP + PCA

In [45]:
%%time
mlp_clf_pca = MLPClassifier(hidden_layer_sizes=(150,), max_iter=200, alpha=0.1,
                    solver='sgd', verbose=True, tol=1e-5, random_state=1, 
                    learning_rate = 'adaptive', learning_rate_init=0.1, activation='logistic')

mlp_clf_pca.fit(X_train_pca, y_train)

print("No. of Iterations:", mlp_clf_pca.n_iter_ )

y_train_predicted = mlp_clf_pca.predict(X_train_pca)

train_accuracy_mlp = np.mean(y_train_predicted == y_train)
print("\nTraining Accuracy: ", train_accuracy_mlp)

Iteration 1, loss = 0.41019127
Iteration 2, loss = 0.25623357
Iteration 3, loss = 0.22501134
Iteration 4, loss = 0.20832951
Iteration 5, loss = 0.19745664
Iteration 6, loss = 0.19024809
Iteration 7, loss = 0.18504124
Iteration 8, loss = 0.18141359
Iteration 9, loss = 0.17838519
Iteration 10, loss = 0.17618330
Iteration 11, loss = 0.17466941
Iteration 12, loss = 0.17273813
Iteration 13, loss = 0.17061268
Iteration 14, loss = 0.17042505
Iteration 15, loss = 0.16969841
Iteration 16, loss = 0.16898439
Iteration 17, loss = 0.16804512
Iteration 18, loss = 0.16753382
Iteration 19, loss = 0.16659614
Iteration 20, loss = 0.16568833
Iteration 21, loss = 0.16475467
Iteration 22, loss = 0.16453055
Iteration 23, loss = 0.16369052
Iteration 24, loss = 0.16375910
Iteration 25, loss = 0.16297690
Iteration 26, loss = 0.16286800
Iteration 27, loss = 0.16390069
Iteration 28, loss = 0.16280298
Iteration 29, loss = 0.16208437
Iteration 30, loss = 0.16162074
Iteration 31, loss = 0.16133763
Iteration 32, los




Training Accuracy:  0.98895
CPU times: user 8min 34s, sys: 1min 33s, total: 10min 7s
Wall time: 2min 35s


## Experiment 2: Evaluate MLP + PCA on Test Data

In [46]:
%%time
y_test_predicted = mlp_clf_pca.predict(X_test_pca)

test_accuracy_mlp = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", test_accuracy_mlp)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Test Accuracy:  0.977

Test Confusion Matrix:
[[ 971    0    2    1    0    2    2    1    1    0]
 [   0 1124    2    1    0    1    3    1    3    0]
 [   5    0 1003    2    3    0    2    6   10    1]
 [   0    0    3  996    0    2    0    4    3    2]
 [   1    0    1    0  961    0    6    1    2   10]
 [   2    0    0    7    0  871    6    1    4    1]
 [   7    3    1    0    4    8  932    0    3    0]
 [   2   10    8    3    1    0    0  997    0    7]
 [   4    1    3    5    4    4    2    4  944    3]
 [   5    5    1    9    9    5    0    4    0  971]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.97      0.99      0.98       980
         1.0       0.98      0.99      0.99      1135
         2.0       0.98      0.97      0.98      1032
         3.0       0.97      0.99      0.98      1010
         4.0       0.98      0.98      0.98       982
         5.0       0.98      0.98      0.98       892
         6.0       

## Experiment 3: SVC (RBF Kernel) + PCA

In [17]:
%%time
svm_clf_pca = SVC(C=1, gamma=0.001)
svm_clf_pca.fit(X_train_pca, y_train)

CPU times: user 3min 34s, sys: 399 ms, total: 3min 35s
Wall time: 3min 35s


## Experiment 3: Evaluate SVC (RBF Kernel) + PCA on Test Data

In [18]:
%%time

y_test_predicted = svm_clf_pca.predict(X_test_pca)

accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", accuracy_score_test)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Test Accuracy:  0.9659

Test Confusion Matrix:
[[ 968    0    2    1    0    3    3    1    2    0]
 [   0 1126    3    0    0    1    3    0    2    0]
 [   6    2  993    3    2    0    1   14   10    1]
 [   0    0    2  984    1    7    0   10    6    0]
 [   1    0    8    0  945    2    4    7    2   13]
 [   2    0    1   12    3  854    7    5    7    1]
 [   6    2    1    0    4    9  930    2    4    0]
 [   0    7   17    3    1    1    0  986    0   13]
 [   3    0    4    9    6   12    3    9  926    2]
 [   4    6    4   12   17    2    0   14    3  947]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.98      0.99      0.98       980
         1.0       0.99      0.99      0.99      1135
         2.0       0.96      0.96      0.96      1032
         3.0       0.96      0.97      0.97      1010
         4.0       0.97      0.96      0.96       982
         5.0       0.96      0.96      0.96       892
         6.0      

## Experiment 4: Logistic Regression (Softmax Regression) + PCA

We use the best performing solver (i.e., lbfgs) from the previous notebook to train the logistic regression model on the PCA-transformed data.

In [13]:
%%time
softmax_reg_pca = LogisticRegression(solver='lbfgs', multi_class='multinomial')

softmax_reg_pca.fit(X_train_pca, y_train)

CPU times: user 32 s, sys: 2.11 s, total: 34.2 s
Wall time: 8.67 s




## Experiment 4: Evaluate Softmax Regression + PCA on Test Data

In [14]:
print("No. of Iterations:", softmax_reg_pca.n_iter_ )


y_test_predicted = softmax_reg_pca.predict(X_test_pca)


accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", accuracy_score_test)


print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))


print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))

No. of Iterations: [100]

Test Accuracy:  0.9265

Test Confusion Matrix:
[[ 957    0    1    2    1    6    8    3    2    0]
 [   0 1114    3    2    0    1    3    2   10    0]
 [   7    5  931   17   12    3    9   11   34    3]
 [   3    3   18  919    1   22    3   11   23    7]
 [   1    2    8    2  917    0   10    4    9   29]
 [   7    5    3   33    8  778   13    6   35    4]
 [  12    3    8    2    6   12  912    1    2    0]
 [   0    9   29    5    6    1    0  948    0   30]
 [   6    6    6   22    9   24    7   11  874    9]
 [   9    7    2   10   25    7    0   26    8  915]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.96      0.98      0.97       980
         1.0       0.97      0.98      0.97      1135
         2.0       0.92      0.90      0.91      1032
         3.0       0.91      0.91      0.91      1010
         4.0       0.93      0.93      0.93       982
         5.0       0.91      0.87      0.89    

# Summary of Results from 4 Experiments

In [47]:
data = [["MLP (200 epochs)", 0.9752, "4min 21s"], 
        ["MLP + PCA (200 epochs)", 0.977, "2min 35s"], 
        ["SVM(RBF) + PCA", 0.9659, "3min 35s"],
        ["Softmax + PCA", 0.9265, "8.67 s"]]

# Running-Time is the wall-clock training time reported by %%time.
pd.DataFrame(data, columns=["Classifier", "Accuracy", "Running-Time"])


| | Classifier | Accuracy | Running-Time |
|---|---|---|---|
| 0 | MLP (200 epochs) | 0.9752 | 4min 21s |
| 1 | MLP + PCA (200 epochs) | 0.977 | 2min 35s |
| 2 | SVM(RBF) + PCA | 0.9659 | 3min 35s |
| 3 | Softmax + PCA | 0.9265 | 8.67 s |


## Comparative Understanding

We have done 4 experiments using MLP (without PCA), MLP (with PCA), Kernel SVM (with PCA) and Logistic Regression (with PCA) classifiers.

We make the following observations.
- The MLP outperforms the other two classifiers.
- MLP with PCA performs slightly better than MLP without PCA (0.977 vs. 0.9752 test accuracy). It also trains faster (2min 35s vs. 4min 21s).
- Understandably, logistic regression performed poorly due to the non-linear nature of the data. However, it is by far the fastest to train.
- MLP (with PCA) trains faster than the Kernel SVM in wall-clock time, because the dual SVM optimization has a complexity of roughly $O(N^2d)$, which grows quickly with the number of training samples $N$.
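
The quadratic term in this complexity comes from the kernel (Gram) matrix, which holds one entry per pair of training samples, each costing $O(d)$ to compute. The NumPy sketch below builds the Gaussian RBF Gram matrix for a toy data set; `rbf_gram_matrix` is a hypothetical helper, and this is an illustration of the cost, not how SVC computes the kernel internally.

```python
import numpy as np

def rbf_gram_matrix(X, gamma):
    """Gaussian RBF kernel K[i, j] = exp(-gamma * ||x_i - x_j||^2) for all pairs."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for every pair at once.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative rounding errors
    return np.exp(-gamma * sq_dists)

X_toy = np.array([[0.0, 0.0],
                  [3.0, 4.0],
                  [6.0, 8.0]])
K = rbf_gram_matrix(X_toy, gamma=0.001)  # same gamma as the SVC above

print(K.shape)  # (3, 3): N x N entries for N samples
print(K[0, 1])  # exp(-0.001 * 25)
```

For the 60,000 training images the full matrix would have 3.6 billion entries (about 28.8 GB in 64-bit floats), which is why kernel SVM training becomes expensive as $N$ grows.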

### Thus, for a large non-linear data set (e.g., image classification), the MLP performs better than the RBF kernel based SVM.