## **PRINCIPAL COMPONENT ANALYSIS**
22.05.23

We will use the MNIST digits dataset, loaded here through Keras (`keras.datasets.mnist`). The dataset contains 28x28 pixel images of handwritten digits 0-9, and each image is flattened into a 784-element vector before modeling. The task is to classify each image by the digit it shows.

In [None]:
# Loading the MNIST dataset:

from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(X_train.shape[0],-1)
X_test = X_test.reshape(X_test.shape[0], -1)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
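
The `reshape(..., -1)` calls above turn each 28x28 image into a single row of 784 pixel values, which is the 2D layout scikit-learn estimators expect. A minimal sketch of that flattening, using a hypothetical stand-in array (`fake_images`) rather than the real download:

```python
import numpy as np

# Hypothetical stand-in for X_train: 5 blank "images" of 28x28 pixels.
fake_images = np.zeros((5, 28, 28), dtype=np.uint8)
fake_images[0, 3, 7] = 255  # mark one pixel so we can see where it lands

# Same call pattern as in the notebook: keep the sample axis, flatten the rest.
flat = fake_images.reshape(fake_images.shape[0], -1)

print(flat.shape)          # (5, 784)
print(np.argmax(flat[0]))  # row 3, column 7 -> 3 * 28 + 7 = 91
```

Rows are laid out one after another, so the pixel at row `r`, column `c` ends up at index `r * 28 + c`.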


In [None]:
y_train.shape

(60000,)

### **DATA PREPARATION**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA

In [None]:
# Instantiate the scaler:
scaler = StandardScaler()

In [None]:
from sklearn.pipeline import make_pipeline

# Instantiate PCA, keeping enough components to explain 95% of the variance:
pca = PCA(n_components=0.95)

# Preprocessing pipeline (scale first, then reduce dimensionality):
preprocessing = make_pipeline(scaler, pca)
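
When `n_components` is a float between 0 and 1, `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this on synthetic data, which is an assumption made so it runs quickly; `X_demo`, `demo_pipeline`, and the other names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not MNIST): 200 samples, 20 features that are
# noisy mixtures of 3 underlying signals.
rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X_demo = signals @ mixing + 0.05 * rng.normal(size=(200, 20))

# Same structure as the notebook's preprocessing pipeline.
demo_pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = demo_pipeline.fit_transform(X_demo)

# Inspect the fitted PCA step to see how many components were kept.
fitted_pca = demo_pipeline.named_steps["pca"]
cumulative = np.cumsum(fitted_pca.explained_variance_ratio_)

print("components kept:", fitted_pca.n_components_)
print("variance retained:", cumulative[-1])
```

The same inspection should work on the notebook's pipeline after fitting, via `preprocessing.named_steps["pca"].n_components_`.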

In [None]:
# Fit and transform on training data:

preprocessing.fit_transform(X_train)

array([[-9.22158806e-01, -4.81479035e+00,  6.75598364e-02, ...,
         6.48901824e-01, -5.58761091e-01,  7.00234248e-01],
       [ 8.70897698e+00, -7.75440302e+00, -3.44791044e+00, ...,
         3.84235441e-01,  1.21430123e-02,  5.67996671e-02],
       [ 2.32838932e+00,  9.43133817e+00, -6.18411405e+00, ...,
         7.15532228e-01, -5.28732321e-01, -1.57928342e+00],
       ...,
       [-3.77721201e+00, -3.23056436e+00, -3.80619883e+00, ...,
        -2.30269745e-01,  5.69853658e-01,  6.13243214e-01],
       [ 1.72236917e+00, -4.94812525e+00,  6.95122764e-03, ...,
         1.04435864e-02,  3.95711616e-01,  1.71289370e-01],
       [-1.42725062e+00, -6.17538558e+00, -2.96768709e+00, ...,
        -1.22635613e+00, -4.48097640e-01, -3.93178000e-01]])

### **KNN MODEL WITH PCA TRANSFORMED DATA**

In [None]:
from sklearn.neighbors import KNeighborsClassifier


# Instantiate Model: 
KNpca = KNeighborsClassifier(n_neighbors=6)

# Second Pipeline for preprocessed data and Model:
KNpca_pl = make_pipeline(preprocessing, KNpca)

# Fit of the pipeline on the training data:
KNpca_pl.fit(X_train, y_train)

Pipeline(steps=[('pipeline',
                 Pipeline(steps=[('standardscaler', StandardScaler()),
                                 ('pca', PCA(n_components=0.95))])),
                ('kneighborsclassifier', KNeighborsClassifier(n_neighbors=6))])

In [None]:
%%time
# Test predictions for the PCA model (%%time must be the first line of the cell):
pca_test_pred = KNpca_pl.predict(X_test)

CPU times: user 37.2 s, sys: 866 ms, total: 38.1 s
Wall time: 25.5 s


### **KNN MODEL WITHOUT PCA TRANSFORMED DATA**

In [None]:
# Preprocessing without PCA

# Instantiate a fresh scaler:
scaler = StandardScaler()

# Fit the scaler on the training data and transform it:
scaledx = scaler.fit_transform(X_train)

In [None]:
# Model Instantiation:
KN = KNeighborsClassifier(n_neighbors= 6)

# Fit on training data:
KN.fit(scaledx, y_train)

KNeighborsClassifier(n_neighbors=6)

In [None]:
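%%time
# Test predictions for the non-PCA model. KN was fit on scaled data, so the
# test set must go through the same fitted scaler before predicting.
# Note: the timing and report shown below were recorded with the unscaled X_test.
test_pred2 = KN.predict(scaler.transform(X_test))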

CPU times: user 1min 7s, sys: 1.21 s, total: 1min 8s
Wall time: 40.8 s
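
Because `KN` was fit on standardized features, any data passed to `KN.predict` needs the same scaling. One way to make that automatic is to wrap the scaler and classifier in a single pipeline, mirroring the PCA model. The sketch below assumes scikit-learn's small built-in 8x8 digits set as a fast stand-in for MNIST; `knn_scaled` and the other variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the small 8x8 digits set stands in for MNIST to keep this quick.
X_small, y_small = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_small, y_small, test_size=0.25, random_state=0, stratify=y_small
)

# The pipeline fits the scaler on the training data only and reapplies
# that same fitted scaler inside predict() and score().
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=6))
knn_scaled.fit(Xs_train, ys_train)

small_accuracy = knn_scaled.score(Xs_test, ys_test)
print(f"accuracy on the small digits set: {small_accuracy:.3f}")
```

With this structure, raw test data can be passed straight to `predict`, so the scaled and PCA variants are evaluated under the same preprocessing.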


### **MODEL RUNTIME AND PERFORMANCE COMPARISON**

In [None]:
print("Total CPU time for the PCA model's test predictions: 38.1 s (wall time 25.5 s)")
print("Total CPU time for the non-PCA model's test predictions: 1 min 8 s (wall time 40.8 s)")

Total CPU time for the PCA model's test predictions: 38.1 s (wall time 25.5 s)
Total CPU time for the non-PCA model's test predictions: 1 min 8 s (wall time 40.8 s)
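
The timings above are copied by hand from the `%%time` output. As a hedged alternative, wall-clock time can be captured in a variable with `time.perf_counter`, so it can be printed or compared programmatically. `timed_call` is a hypothetical helper, and the workload here is a stand-in for a call such as `KNpca_pl.predict(X_test)`.

```python
import time

def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this could be timed_call(KNpca_pl.predict, X_test).
total, seconds = timed_call(sum, range(1_000_000))
print(f"result={total}, wall time={seconds:.4f} s")
```

Note that this measures wall time only, whereas `%%time` also reports CPU time.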


In [None]:
from sklearn.metrics import classification_report

pca_test_scores = classification_report(y_test, pca_test_pred)
test_scores = classification_report(y_test, test_pred2)

print(f'Test Evaluation for PCA Model:\n {pca_test_scores}')
print('\n')
print(f'Test Evaluation for non PCA Model:\n {test_scores}')

Test Evaluation for PCA Model:
               precision    recall  f1-score   support

           0       0.95      0.99      0.97       980
           1       0.96      0.99      0.97      1135
           2       0.96      0.94      0.95      1032
           3       0.92      0.96      0.94      1010
           4       0.95      0.95      0.95       982
           5       0.93      0.92      0.93       892
           6       0.96      0.97      0.97       958
           7       0.94      0.93      0.94      1028
           8       0.97      0.91      0.94       974
           9       0.93      0.91      0.92      1009

    accuracy                           0.95     10000
   macro avg       0.95      0.95      0.95     10000
weighted avg       0.95      0.95      0.95     10000



Test Evaluation for non PCA Model:
               precision    recall  f1-score   support

           0       0.49      0.99      0.66       980
           1       1.00      0.90      0.95      1135
        

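In the recorded run, the PCA model predicted faster (25.5 s versus 40.8 s wall time) and, judging by the visible rows of the truncated non-PCA report (class 0 precision of 0.49 versus 0.95), scored better overall. It was not better on every metric, however: class 1 precision is 1.00 without PCA versus 0.96 with it. For a fair comparison, both models must receive test data that has been preprocessed the same way as their training data.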
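
A claim about "all metrics" can be checked programmatically rather than by eye. The sketch below uses `classification_report(..., output_dict=True)` on tiny made-up labels (purely illustrative, not the notebook's predictions) and lists every per-class metric where a hypothetical model A is not strictly better than a hypothetical model B.

```python
from sklearn.metrics import classification_report

# Tiny made-up labels for illustration only.
y_true   = [0, 0, 1, 1, 2, 2]
y_pred_a = [0, 0, 1, 1, 2, 2]  # hypothetical model A: all correct
y_pred_b = [0, 1, 1, 1, 2, 0]  # hypothetical model B: two mistakes

# output_dict=True returns nested dicts keyed by class label (as strings).
report_a = classification_report(y_true, y_pred_a, output_dict=True)
report_b = classification_report(y_true, y_pred_b, output_dict=True)

# Collect every per-class metric where A is not strictly better than B.
not_better = [
    (label, metric)
    for label in ("0", "1", "2")
    for metric in ("precision", "recall", "f1-score")
    if report_a[label][metric] <= report_b[label][metric]
]
print(not_better)  # [('1', 'recall'), ('2', 'precision')]
```

Even though model A is perfect here, it only ties model B on two metrics, which is why "better across all metrics" is worth verifying before stating it. Applied to the notebook, the same comparison would use `pca_test_pred` and `test_pred2` with labels `"0"` through `"9"`.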