**PCA + Logistic Regression (MNIST)**


In [0]:
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd

**Download and Load the Data**

In [0]:
# Pass the data_home parameter to choose where the data is downloaded.
# Note: fetch_mldata was removed in scikit-learn 0.22; newer releases provide fetch_openml instead.
mnist = fetch_mldata('MNIST original')



In [0]:
mnist

{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([0., 0., 0., ..., 9., 9., 9.])}

In [0]:
# These are the images
mnist.data.shape

(70000, 784)

In [0]:
# These are the labels
mnist.target.shape

(70000,)

### Splitting Data into Training and Test Sets

In [0]:
# test_size: what proportion of original data is used for test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)

In [0]:
print(train_img.shape)

(60000, 784)


In [0]:
print(train_lbl.shape)

(60000,)


In [0]:
print(test_img.shape)

(10000, 784)


In [0]:
print(test_lbl.shape)

(10000,)


### Standardizing the Data
Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if the features were measured on different scales.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (zero mean and unit variance).

Notebook going over the importance of feature scaling: http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(train_img)

# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
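
As a quick check of what the scaler does, the sketch below uses small synthetic arrays (an assumption standing in for the MNIST images; the names `X_train` and `X_test` are illustrative only). It shows that the training columns end up with zero mean and unit standard deviation, and that the test set is scaled with the statistics learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the image data (assumption: not MNIST itself)
rng = np.random.RandomState(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 5))
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 5))

# Fit on the training set only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Each training column now has mean ~0 and standard deviation ~1
train_means = X_train_std.mean(axis=0)
train_stds = X_train_std.std(axis=0)

# The test set is scaled with the *training* mean and standard deviation
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(train_means.round(6), train_stds.round(6))
print(np.allclose(manual_test, X_test_std))
```

The test columns will generally be close to, but not exactly, zero mean and unit variance, because they are scaled with statistics they did not contribute to.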

### PCA to Speed up Machine Learning Algorithms (Logistic Regression)
**Step 0**: Import and use PCA. After PCA, you will apply a machine learning algorithm of your choice to the transformed data.

In [0]:
from sklearn.decomposition import PCA

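Make an instance of the model. Passing `.95` as `n_components` tells scikit-learn to choose the minimum number of principal components such that 95% of the variance is retained.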

In [0]:
pca = PCA(.95)

Fit PCA on training set. **Note: you are fitting PCA on the training set only**

In [0]:
pca.fit(train_img)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [0]:
pca.n_components_

330
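
To see where a number like this comes from, the following sketch reproduces the selection by hand from the cumulative explained variance ratio. It runs on synthetic data (an assumption, so the count will not be 330), and `svd_solver="full"` is set explicitly so both fits use the same solver.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with decaying variances (assumption: a stand-in for MNIST)
rng = np.random.RandomState(0)
latent = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.1, 20)
X = latent @ rng.normal(size=(20, 20))

# Fit PCA with all components and accumulate the explained variance ratio
pca_full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

# Let scikit-learn make the same choice from the variance fraction
pca_95 = PCA(0.95, svd_solver="full").fit(X)

print(n_needed, pca_95.n_components_)
```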

Apply the mapping (transform) to both the training set and the test set.

In [0]:
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)
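
The effect of the transform can be illustrated on a small synthetic example (an assumption; the array names are hypothetical). The projected data has `n_components_` columns, and `inverse_transform` maps it back to the original number of features with some reconstruction error.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 3 strong directions embedded in 10 features, plus small noise
rng = np.random.RandomState(0)
basis = rng.normal(size=(3, 10))
X_train = rng.normal(size=(300, 3)) @ basis + 0.01 * rng.normal(size=(300, 10))
X_test = rng.normal(size=(100, 3)) @ basis + 0.01 * rng.normal(size=(100, 10))

# Fit on the training set only, then project both sets
pca = PCA(0.95, svd_solver="full").fit(X_train)
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)

# Map the reduced test data back to the original feature space
X_test_back = pca.inverse_transform(Z_test)
rel_error = np.linalg.norm(X_test - X_test_back) / np.linalg.norm(X_test)

print(Z_train.shape, Z_test.shape, X_test_back.shape)
print(rel_error)
```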

**Step 1**: Import the model you want to use

In sklearn, all machine learning models are implemented as Python classes

In [0]:
from sklearn.linear_model import LogisticRegression

**Step 2**: Make an instance of the Model

In [0]:
# All parameters not specified are set to their defaults.
# The default solver (liblinear in older scikit-learn releases) is very slow here,
# which is why we change it to 'lbfgs'.
logisticRegr = LogisticRegression(solver = 'lbfgs')



**Step 3**: Train the model on the data, storing the information learned from the data

The model learns the relationship between x (digits) and y (labels).

In [0]:
logisticRegr.fit(train_img, train_lbl)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

**Step 4**: Predict the labels of new data (new images)

This step uses the information the model learned during training.

In [0]:
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(test_img[0].reshape(1,-1))

array([1.])

In [0]:
# Predict for Multiple Observations (images) at Once
logisticRegr.predict(test_img[0:10])

array([1., 9., 2., 2., 7., 1., 8., 3., 3., 7.])

### Measuring Model Performance
Accuracy (fraction of correct predictions): correct predictions / total number of data points

Here it is computed on the test set, so it measures how the model performs on new data.

In [0]:
score = logisticRegr.score(test_img, test_lbl)
print(score)

0.92
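
The sketch below checks, on toy two-class data (an assumption, not MNIST), that `score` matches the accuracy computed by hand, and that the same number can be read off the diagonal of a confusion matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy two-class problem (assumption: a stand-in for the digit data)
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

clf = LogisticRegression(solver="lbfgs").fit(X_train, y_train)
pred = clf.predict(X_test)

# Accuracy by hand: correct predictions / total number of data points
manual_accuracy = np.mean(pred == y_test)
score = clf.score(X_test, y_test)

# Correct predictions sit on the diagonal of the confusion matrix
cm = confusion_matrix(y_test, pred)
cm_accuracy = cm.trace() / cm.sum()

print(manual_accuracy, score, cm_accuracy)
```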


### Number of Components, Variance, Time Table

In [0]:
pd.DataFrame(data = [[1.00, 784, 48.94, .9158],
                     [.99, 541, 34.69, .9169],
                     [.95, 330, 13.89, .92],
                     [.90, 236, 10.56, .9168],
                     [.85, 184, 8.85, .9156]], 
             columns = ['Variance Retained',
                      'Number of Components', 
                      'Time (seconds)',
                      'Accuracy'])

| | Variance Retained | Number of Components | Time (seconds) | Accuracy |
|---|---|---|---|---|
| 0 | 1.0 | 784 | 48.94 | 0.9158 |
| 1 | 0.99 | 541 | 34.69 | 0.9169 |
| 2 | 0.95 | 330 | 13.89 | 0.92 |
| 3 | 0.9 | 236 | 10.56 | 0.9168 |
| 4 | 0.85 | 184 | 8.85 | 0.9156 |
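
A table of this kind could be produced with a loop like the one below. This is a minimal sketch under several assumptions: it uses scikit-learn's small bundled 8x8 digits dataset instead of MNIST, it times only the logistic regression fit, and it raises `max_iter` to 1000 to avoid convergence warnings. Its numbers will therefore not match the table above.

```python
import time

import pandas as pd
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled digits dataset (64 features) stands in for MNIST
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0)

# Standardize using training-set statistics only
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

rows = []
for variance in [1.00, 0.99, 0.95, 0.90, 0.85]:
    if variance < 1.0:
        # Keep the fewest components that retain this fraction of variance
        pca = PCA(variance, svd_solver="full").fit(X_train)
        Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)
        n_components = pca.n_components_
    else:
        # Baseline row: no PCA, all original features
        Z_train, Z_test = X_train, X_test
        n_components = X_train.shape[1]

    clf = LogisticRegression(solver="lbfgs", max_iter=1000)
    start = time.perf_counter()
    clf.fit(Z_train, y_train)
    elapsed = time.perf_counter() - start

    rows.append([variance, n_components, elapsed, clf.score(Z_test, y_test)])

results = pd.DataFrame(rows, columns=['Variance Retained',
                                      'Number of Components',
                                      'Time (seconds)',
                                      'Accuracy'])
print(results)
```

Timings depend on the machine and library versions, so repeated runs give different values in the time column.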
