# Dimensionality Reduction with MNIST

Using the MNIST dataset to test whether dimensionality reduction speeds up training, and what it costs in accuracy.


In [1]:
import numpy as np
import pandas as pd

np.random.seed(123)

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12


In [2]:
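# Note: fetch_mldata was removed in scikit-learn 0.22; on newer versions
# fetch_openml('mnist_784') is the replacement (its return format differs).
from sklearn.datasets import fetch_mldata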
mnist = fetch_mldata('MNIST original')
mnist

{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([ 0.,  0.,  0., ...,  9.,  9.,  9.])}

## Splitting into training & testing

In [3]:
X, y = mnist["data"], mnist["target"]
X.shape

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

# Binary target: True where the digit is a 5, False for every other digit
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

## Training a Random Forest Classifier

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import timeit

param_distribs = {'n_estimators': randint(10,1000), 'max_depth': randint(1, 10), 
                  'min_samples_split': randint(2, 10), 'min_samples_leaf': randint(1,10),
                  'max_leaf_nodes' : randint(2,10)}

rnd_search_rfc = RandomizedSearchCV(RandomForestClassifier(), param_distribs, scoring ='accuracy', n_iter=10, cv=3, n_jobs = -1)

start_rfc = timeit.default_timer()
rnd_search_rfc.fit(X_train, y_train_5)
stop_rfc = timeit.default_timer()
print("RFC Best Score:", rnd_search_rfc.best_score_)
print("Time taken to run:", stop_rfc - start_rfc) 

RFC Best Score: 0.933033333333
Time taken to run: 589.342932496239


In [6]:
from sklearn.metrics import accuracy_score
y_pred_rfc = rnd_search_rfc.predict(X_test)
rfc_results = accuracy_score(y_test_5, y_pred_rfc)
print("Random Forest Classifier Test Set Accuracy:", rfc_results)

Random Forest Classifier Test Set Accuracy: 0.9365


## PCA and training random forest classifier on reduced set 

In [7]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)
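
Passing a float such as `n_components=0.95` asks PCA to keep the smallest number of components whose cumulative explained variance ratio reaches 95%. The sketch below illustrates that selection on a small synthetic matrix; `X_demo` and the other names are stand-ins chosen for illustration, not the MNIST data used above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic stand-in data: 20 columns with steadily shrinking variance
X_demo = rng.randn(300, 20) * np.linspace(5.0, 0.1, 20)

# Fit a full PCA and accumulate the explained variance ratio by hand
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.95) + 1)

# Let PCA pick the number of components for a 95% variance target
pca_95 = PCA(n_components=0.95).fit(X_demo)

print("Components needed (manual):", n_needed)
print("Components kept by PCA:", pca_95.n_components_)
```

On the real data, `pca.n_components_` after the fit above would show how many of the original features' worth of dimensions were kept.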

In [8]:
param_distribs = {'n_estimators': randint(10,1000), 'max_depth': randint(1, 10), 
                  'min_samples_split': randint(2, 10), 'min_samples_leaf': randint(1,10),
                  'max_leaf_nodes' : randint(2,10)}

rnd_search_rfc_reduced = RandomizedSearchCV(RandomForestClassifier(), param_distribs, scoring ='accuracy', 
                                            n_iter=10, cv=3, n_jobs = -1)
start_rfc_reduced = timeit.default_timer()
rnd_search_rfc_reduced.fit(X_reduced, y_train_5)
stop_rfc_reduced = timeit.default_timer()
print("RFC Reduced Best Score:", rnd_search_rfc_reduced.best_score_)
print("Time taken to run:", stop_rfc_reduced - start_rfc_reduced) 

RFC Reduced Best Score: 0.90965
Time taken to run: 498.4436029112504


In [9]:
y_pred_rfc_red = rnd_search_rfc_reduced.predict(X_test_reduced)
rfc_reduced_results = accuracy_score(y_test_5, y_pred_rfc_red)
print("Random Forest Classifier Reduced Test Set Accuracy:", rfc_reduced_results)

Random Forest Classifier Reduced Test Set Accuracy: 0.9108


### Comparison of RFC on regular & reduced set

Running the randomized search on the original (unreduced) feature set took approx. 10 minutes (589 s), and the best estimator had a cross-validated accuracy of 0.93. On the test set it reached an accuracy of 0.94.

The search on the PCA-reduced set took just over 8 minutes (498 s), and the best estimator had a cross-validated accuracy of 0.91. On the test set it also reached 0.91, about 2.6 percentage points lower than the original model.

The time saving was modest (roughly 15%), although the reduced run was nonetheless faster. An accuracy loss of about 2.6 percentage points is small enough that the trade could be worthwhile on its own, but it would be much easier to justify if the time savings were larger.
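
Because the target is the binary "is this digit a 5?" label (`y_train_5`), these accuracies need a reference point: a model that always answers "not 5" already scores as high as the share of non-5 images. The sketch below computes that majority-class baseline on toy labels; `majority_baseline_accuracy` and `y_demo` are hypothetical names introduced for illustration.

```python
import numpy as np

def majority_baseline_accuracy(y_binary):
    """Accuracy of a classifier that always predicts the most common class."""
    y_binary = np.asarray(y_binary, dtype=bool)
    positive_rate = y_binary.mean()
    return max(positive_rate, 1.0 - positive_rate)

# Toy labels: 2 positives out of 10, so always predicting False scores 0.8
y_demo = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0], dtype=bool)
baseline = majority_baseline_accuracy(y_demo)
print("Majority-class baseline accuracy:", baseline)
```

Calling the same function on `y_train_5` and `y_test_5` would give the number to beat here. Assuming the split holds the standard MNIST counts (5,421 fives among the 60,000 training images and 892 among the 10,000 test images, which this notebook does not check), the baseline would be about 0.9097 and 0.9108 respectively, so both models are worth comparing against it.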
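
Each randomized search draws its own 10 hyperparameter combinations, so part of the gap between the two run times may come from the sampled settings (for example `n_estimators`) rather than from the number of features. A minimal sketch of a like-for-like timing follows, assuming fixed hyperparameters and a small synthetic stand-in for the data; `time_fit`, `X_demo` and the parameter values are illustrative and not taken from the runs above.

```python
import timeit

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

def time_fit(X, y, **rf_params):
    """Return the seconds needed to fit one forest with fixed settings."""
    clf = RandomForestClassifier(random_state=0, **rf_params)
    start = timeit.default_timer()
    clf.fit(X, y)
    return timeit.default_timer() - start

rng = np.random.RandomState(0)
X_demo = rng.rand(500, 50)        # synthetic stand-in for the image data
y_demo = X_demo[:, 0] > 0.5       # synthetic binary target

# Same 95% variance target as the PCA step in this notebook
X_demo_reduced = PCA(n_components=0.95).fit_transform(X_demo)

# Identical hyperparameters for both fits, so only the input differs
fixed_params = {"n_estimators": 50, "max_depth": 5}
t_full = time_fit(X_demo, y_demo, **fixed_params)
t_reduced = time_fit(X_demo_reduced, y_demo, **fixed_params)

print("Full feature set:", t_full, "s")
print("Reduced feature set:", t_reduced, "s")
```

With the hyperparameters held constant, any remaining difference can be attributed to the input representation. Repeating each fit a few times and averaging would make the comparison less noisy.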

## Using t-SNE

In [8]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, n_iter = 250, random_state=42)
start = timeit.default_timer()
X_reduced_tsne = tsne.fit_transform(X)
stop = timeit.default_timer()
print("Time to run:", stop - start) 

Time to run: 8816.29653494836


This reduced all features to 2 dimensions, but it took about 2.4 hours (8,816 s) to run on the full 70,000-sample dataset.

In [10]:
X_reduced_tsne.shape

(70000, 2)
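
One common way to shorten a t-SNE run is to embed a random subset and to reduce the data with PCA first; the scikit-learn documentation suggests bringing dense data down to around 50 dimensions before t-SNE. The sketch below shows that pipeline on synthetic stand-in data. The subset size, `X_demo` and the other names are assumptions made for illustration, and this was not run on MNIST here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(42)
X_demo = rng.rand(1000, 100)      # synthetic stand-in for the image data

# Work on a random subset instead of every sample
subset_idx = rng.permutation(len(X_demo))[:300]
X_subset = X_demo[subset_idx]

# Reduce to 50 dimensions with PCA, then to 2 with t-SNE
X_subset_pca = PCA(n_components=50, random_state=42).fit_transform(X_subset)
X_subset_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_subset_pca)

print(X_subset_tsne.shape)
```

The trade-off is that only the sampled points get coordinates, since t-SNE has no `transform` method for new data.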

I wrote the code to try LLE, but while it was running my computer repeatedly froze and crashed, which was a first for this machine. Sorry for the lack of results here.

In [None]:
from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42, n_jobs = -1)

start_lle = timeit.default_timer()
X_reduced_lle = lle.fit_transform(X)
stop_lle = timeit.default_timer()
print("Time to run:", stop_lle - start_lle)
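
A possible workaround, assuming the freezes came from fitting LLE on all 70,000 samples at once (the cause was not confirmed), is to fit on a random subset and then project the remaining samples in batches with `transform`. The sketch below uses synthetic stand-in data; the subset size, batch size and variable names are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.RandomState(42)
X_demo = rng.rand(600, 30)        # synthetic stand-in for the image data

# Fit the embedding on a random subset only
fit_idx = rng.permutation(len(X_demo))[:200]
lle_demo = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
lle_demo.fit(X_demo[fit_idx])

# Project every sample in small batches to keep peak memory low
batch_size = 100
X_demo_lle = np.vstack([
    lle_demo.transform(X_demo[i:i + batch_size])
    for i in range(0, len(X_demo), batch_size)
])

print(X_demo_lle.shape)
```

The embedding is then learned from the subset alone, so its quality depends on how well that subset covers the data.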