# Assignment 10: Dimensionality Reduction

Dataset(s) needed: MNIST ("Modified National Institute of Standards and Technology") dataset.

In [76]:
# Load the MNIST dataset (70,000 images, 784 pixel features each)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)
X = mnist.data / 255.0  # scale pixel values from [0, 255] to [0, 1]
y = mnist.target
print(X.shape, y.shape)

(70000, 784) (70000,)


<h3> Q.1. Split the data into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing).
</h3>

In [77]:
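from sklearn.model_selection import train_test_split
# Note: train_test_split shuffles by default, so this is a random 60,000 / 10,000 split
# rather than literally the first 60,000 rows; passing shuffle=False would keep the original order.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=60000, random_state=42)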
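
The question asks for the first 60,000 instances, which can also be done with plain slicing. The following is a minimal sketch, not the notebook's method: `split_first_n` and the small demo arrays are hypothetical stand-ins, and with MNIST the call would use `X`, `y` and `n_train=60000`.

```python
import numpy as np

def split_first_n(X, y, n_train):
    """Use the first n_train rows for training and the rest for testing."""
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Small stand-in arrays (assumption: 10 samples with 2 features each).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = split_first_n(X_demo, y_demo, n_train=6)
print(X_tr.shape, X_te.shape)  # (6, 2) (4, 2)
```

An ordered split keeps the result reproducible without a random seed, but it assumes the rows are not sorted by label.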

<h3> Q.2. Train a Logistic Regression classifier on the dataset and see how long it takes.</h3>

In [78]:
from sklearn.linear_model import LogisticRegression
import time

log_clf = LogisticRegression(fit_intercept=False, max_iter=1000, solver='lbfgs')
start_time = time.time()
# Train the classifier
log_clf.fit(X_train, y_train)
end_time = time.time()

print("Training took {:.2f}s".format(end_time - start_time))

Training took 104.98s


<h3> Q.3. Evaluate the resulting model on the test set.</h3>

In [79]:
from sklearn.metrics import accuracy_score

y_pred = log_clf.predict(X_test)
print("Logistic Accuracy Score:", accuracy_score(y_true=y_test, y_pred=y_pred))

Logistic Accuracy Score: 0.9212


<h3> Q.4. Use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.</h3>

In [80]:
from sklearn.decomposition import PCA
import pandas as pd

# Build a DataFrame of the pixel features
feat_cols = ['pixel' + str(i) for i in range(X_train.shape[1])]
df = pd.DataFrame(X_train, columns=feat_cols)
df['y'] = y_train
df['label'] = df['y'].apply(lambda i: str(i))

# Fit PCA on the training data only; 155 is a fixed component count intended to reach the 95% target
pca = PCA(n_components=155)
pca_result = pca.fit_transform(df[feat_cols].values)
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
print("Total Explained Ratio:", pca.explained_variance_ratio_.sum())

Explained variation per principal component: [0.09736019 0.07162769 0.06157279 0.05407583 0.04894241 0.04314663
 0.0326955  0.02886339 0.02755206 0.02336354 0.02114186 0.02036159
 0.01710273 0.01697588 0.01579852 0.01483028 0.01315072 0.01277798
 0.01188548 0.01154643 0.01069553 0.01010967 0.00954102 0.00907833
 0.00882614 0.00838996 0.00809334 0.00785285 0.00740609 0.00689452
 0.00657504 0.00644894 0.00601529 0.00586087 0.00568734 0.00542785
 0.00505607 0.00487531 0.00479006 0.00466511 0.00454422 0.00445376
 0.00419137 0.00396211 0.00384115 0.00375532 0.00361444 0.00350354
 0.00338201 0.00319514 0.00316586 0.00309288 0.00295258 0.00287322
 0.00282207 0.00269456 0.00267291 0.00256465 0.00253613 0.00243878
 0.00239702 0.00238198 0.00229797 0.00221263 0.00212635 0.00205955
 0.00202272 0.00194566 0.00191948 0.00188817 0.00187128 0.0018004
 0.00176297 0.00172727 0.0016457  0.00163152 0.00161328 0.00154714
 0.0014698  0.00142147 0.00140752 0.00140012 0.00139287 0.00134772
 0.00132494 0.0013
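
Instead of hard-coding the number of components, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below is illustrative only: it uses the small bundled `load_digits` dataset as a stand-in for MNIST (an assumption, chosen so the example runs without a download), and the variable names are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): 1,797 digit images with 64 features each.
X_digits, _ = load_digits(return_X_y=True)

# A float n_components asks PCA to retain at least that fraction of the variance.
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_digits)

total_ratio = pca_95.explained_variance_ratio_.sum()
print("Components kept:", pca_95.n_components_)
print("Total explained variance ratio:", total_ratio)
```

After fitting, `n_components_` reports how many components were actually kept.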

<h3> Q.5. Train a new Logistic Regression classifier on the reduced dataset and see how long it takes. Was training much faster? Explain your results.
</h3>

In [81]:
log_mod = LogisticRegression(fit_intercept=False, max_iter=1000, solver='lbfgs')
start_time = time.time()
# Train the classifier
log_mod.fit(pca_result, y_train)
end_time = time.time()
print("Training took {:.2f}s".format(end_time - start_time))
# Training was much faster, nearly 10x. In the runs recorded here it took 104.98 seconds without PCA and 10.82
# seconds with PCA. With far fewer features (155 instead of 784), each optimizer iteration has less work to do,
# so the dimensionality reduction was effective in lowering the training time.

Training took 10.82s
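
The speed-up can be computed directly from the two timings printed in this notebook. The sketch below also includes a small timing helper; `timed` is a hypothetical name, and `time.perf_counter` is used here as an alternative to `time.time` because it is designed for measuring short intervals.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example call on a cheap built-in; in the notebook this could wrap log_clf.fit.
result, elapsed = timed(sum, range(1000))

# Speed-up from the timings printed above (104.98 s without PCA, 10.82 s with PCA).
speedup = 104.98 / 10.82
print("Speed-up: {:.1f}x".format(speedup))  # Speed-up: 9.7x
```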


<h3> Q.6. Evaluate the new classifier on the test set: how does it compare to the previous classifier? Discuss the speed / accuracy trade-off and in which cases you'd accept a very slight drop in model performance for an x-times speedup in training.
</h3>

In [84]:
test_feat_cols = [ 'pixel'+str(i) for i in range(X_test.shape[1]) ]
test_df = pd.DataFrame(X_test,columns=test_feat_cols)

# Reuse the PCA fitted on the training set; do not fit a new PCA on the test data
pca_result_test = pca.transform(test_df[test_feat_cols].values)

y_pred_pca = log_mod.predict(pca_result_test)
print(accuracy_score(y_true=y_test, y_pred=y_pred_pca))
# The accuracy score dropped slightly to 91.45%. The previous model (no PCA) scored 92.12%.
# Assuming these numbers were arrived at correctly, I would be willing to use PCA with this reduced accuracy, especially
# on a larger dataset where it could take hours or days to fit a model. If the accuracy score were drastically reduced,
# say to below 70%, I would consider keeping the 'slow' model. It really depends on what accuracy is required
# and on the overall computation time.

0.9145
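
One way to make sure the test set is always projected with the PCA fitted on the training set is to chain both steps in a scikit-learn `Pipeline`. This is a sketch under stated assumptions rather than the notebook's approach: it uses the bundled `load_digits` data as a stand-in for MNIST, a default `LogisticRegression` (the notebook sets `fit_intercept=False`), and hypothetical variable names.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in data (assumption): digits pixels range from 0 to 16, so scale to [0, 1].
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits / 16.0, y_digits, test_size=0.2, random_state=42
)

# fit() fits PCA on the training data only, then trains the classifier on the projection.
pipe = make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted PCA to the test data before predicting.
acc = pipe.score(X_te, y_te)
print("Pipeline accuracy:", acc)
```

Because the pipeline only ever calls `transform` on the PCA step at prediction time, the test data cannot accidentally be used to refit it.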


<h3> Q.7. Create a new text cell in your Notebook: Complete a 50-100 word summary
    (or short description of your thinking in applying this week's learning to the solution)
    of your experience in this assignment. Include:
<br>
What was your incoming experience with this model, if any?
<br>
What steps you took and what obstacles you encountered.
<br>
How you link this exercise to real-world machine learning problem-solving. (What steps were missing? What else do you need to learn?)
<br>
This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.
</h3>

In [None]:
# Enter summary here
# No incoming knowledge. The biggest issue I had was reducing the dimensionality of the test set. I wasn't sure whether
# I should have run PCA on the entire set and then split it, or whether the way I did it was correct.
# The other problem I encountered was that when I first reduced the test set, I ran fit_transform on it and
# received a very low accuracy (12%). I then read that fit_transform should only be used on the training data,
# while transform should be used on the test data, so that the test set is projected onto the components learned
# from the training set and the evaluation is not biased. This raised my accuracy to the number
# I had expected to see (>90%). I can imagine that on much larger datasets PCA may be not just nice to have but
# necessary to fit models within a reasonable timeframe.
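
The `fit_transform` pitfall described in the summary can be reproduced on a small scale. The sketch below is an illustration under assumptions, not a rerun of the notebook: it uses the bundled `load_digits` data in place of MNIST, an arbitrary `n_components=20`, and hypothetical variable names. A PCA refitted on the test set learns its own mean and component axes, so its output no longer lives in the feature space the classifier was trained on.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits / 16.0, y_digits, test_size=0.2, random_state=42
)

# Fit PCA on the training set and train the classifier on the projection.
pca_train = PCA(n_components=20).fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(pca_train.transform(X_tr), y_tr)

# Correct: project the test set with the PCA fitted on the training set.
Z_good = pca_train.transform(X_te)
acc_good = clf.score(Z_good, y_te)

# Pitfall: a second PCA fitted on the test set produces different axes.
Z_bad = PCA(n_components=20).fit_transform(X_te)
acc_bad = clf.score(Z_bad, y_te)

print("transform with training PCA:", acc_good)
print("fit_transform on test set:  ", acc_bad)
```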