# Exploration of dimension reduction

In [1]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier as rf

from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_digits
from sklearn.datasets import fetch_olivetti_faces

from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from sklearn.decomposition import PCA

# Digits

We use PCA to reduce the dimension of this data, since the samples are images and every feature has the same unit (pixel intensity).
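To get a feel for how much information a quarter of the components retains, one can look at the explained variance. The following is a minimal sketch, not part of the original run; the names `n_keep` and `cum_var` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 1797 images, 64 pixel features each
n_keep = int(.25 * X.shape[1])           # 16 components, a quarter of the features

pca = PCA(n_components=n_keep).fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)  # running share of variance kept

print("components kept: %d" % n_keep)
print("share of variance retained: %.3f" % cum_var[-1])
```

The last entry of `cum_var` is the fraction of total variance that survives the reduction.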

### Before PCA

In [2]:
# Get data
dX, dy = load_digits(return_X_y=True)
dX_train, dX_test, dy_train, dy_test = train_test_split(dX, dy, test_size=.3, random_state=61)

In [3]:
%%timeit
preds = svm.SVC(C=10**1, kernel='poly', gamma=10**-2, coef0=1).fit(dX_train, dy_train).predict(dX_test)

10 loops, best of 3: 42.1 ms per loop


In [4]:
preds = svm.SVC(C=10**1, kernel='poly', gamma=10**-2, coef0=1).fit(dX_train, dy_train).predict(dX_test)
print "Pre-pca Accuracy: ", 1.* sum(preds == dy_test) / len(preds)

Pre-pca Accuracy:  0.983333333333


### After PCA, keeping .25 of the features

In [5]:
# Reduce features
red_dX = PCA(int(.25 * dX.shape[1])).fit_transform(dX)
dX_train, dX_test, dy_train, dy_test = train_test_split(red_dX, dy, test_size=.3, random_state=61)

In [6]:
%%timeit
preds = svm.SVC(C=10**1, kernel='poly', gamma=10**-2, coef0=1).fit(dX_train, dy_train).predict(dX_test)

10 loops, best of 3: 28.2 ms per loop


In [7]:
preds = svm.SVC(C=10**1, kernel='poly', gamma=10**-2, coef0=1).fit(dX_train, dy_train).predict(dX_test)
print "Post-pca Accuracy: ", 1.* sum(preds == dy_test) / len(preds)

Post-pca Accuracy:  0.987037037037


Reducing the dimension slightly increased accuracy (0.983 to 0.987, a difference of two test images out of 540) and shortened the combined fit-and-predict time (42.1 ms to 28.2 ms).  The shorter time is expected, since there are fewer features to train on.  The higher accuracy could come from discarding noise or other superfluous information, although a gap this small may also be due to chance.
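A single 70/30 split gives a fairly noisy estimate, and the cells above fit PCA on all samples before splitting. One possible check, sketched here under the assumption that 5-fold cross-validation is acceptable (the value `cv=5` and the name `pipe` are choices made for this example, not taken from the original run), is to put PCA and the SVM in a pipeline so the projection is learned from the training folds only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# PCA is refit inside every fold, so held-out images never influence the projection.
pipe = make_pipeline(
    PCA(n_components=int(.25 * X.shape[1])),
    SVC(C=10, kernel='poly', gamma=1e-2, coef0=1),
)
scores = cross_val_score(pipe, X, y, cv=5)   # cv=5 is an arbitrary choice
print("mean accuracy over 5 folds: %.3f" % scores.mean())
```

The spread of `scores` across folds gives a rough idea of how much of the before/after difference could be noise.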


# Faces

Again we use PCA to reduce the dimension of this data, since these samples are also images and every feature is a pixel intensity.

### Before PCA

In [8]:
# Get data
item = fetch_olivetti_faces()
fX = item.data
fy = item.target
fX_train, fX_test, fy_train, fy_test = train_test_split(fX, fy, test_size = .3, random_state=100)

In [9]:
%%timeit
preds = svm.SVC(C=10**1, kernel='poly', gamma=10**-2, coef0=1).fit(fX_train, fy_train).predict(fX_test)

1 loop, best of 3: 599 ms per loop


In [10]:
preds = svm.SVC(C=10**1, kernel='poly', gamma=10**-2, coef0=1).fit(fX_train, fy_train).predict(fX_test)
print "Pre-pca Accuracy: ", 1.* sum(preds == fy_test) / len(preds)

Pre-pca Accuracy:  0.933333333333


### After PCA, keeping .25 of the features

In [11]:
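# Reduce features
# PCA yields at most min(n_samples, n_features) = 400 components here, so the
# 1024 requested (.25 * 4096) is capped; recent scikit-learn raises otherwise.
n_comp = min(int(.25 * fX.shape[1]), min(fX.shape))
red_fX = PCA(n_comp).fit_transform(fX)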
fX_train, fX_test, fy_train, fy_test = train_test_split(red_fX, fy, test_size=.3, random_state=100)

In [12]:
%%timeit
preds = svm.SVC(C=10**1, kernel='poly', gamma=10**-2, coef0=1).fit(fX_train, fy_train).predict(fX_test)

10 loops, best of 3: 75.1 ms per loop


In [13]:
preds = svm.SVC(C=10**1, kernel='poly', gamma=10**-2, coef0=1).fit(fX_train, fy_train).predict(fX_test)
print "Post-pca Accuracy: ", 1.* sum(preds == fy_test) / len(preds)

Post-pca Accuracy:  0.933333333333


Reducing the dimension cut the combined fit-and-predict time from 599 ms to 75.1 ms, with identical accuracy (0.933) before and after.
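One detail worth keeping in mind for the faces data: the Olivetti set has 400 images of 4096 pixels each, so a quarter of the features is 1024, while PCA can return at most min(n_samples, n_features) = 400 components. The sketch below shows one way to cap the request. The helper `capped_components` is hypothetical, and a synthetic array of the same shape stands in for the real images so that nothing has to be downloaded.

```python
import numpy as np
from sklearn.decomposition import PCA

def capped_components(X, fraction):
    """Requested fraction of the features, capped at what PCA can return."""
    requested = int(fraction * X.shape[1])
    return min(requested, min(X.shape))

# Synthetic stand-in with the same shape as the Olivetti data (400 x 4096),
# used here only to avoid a download.
rng = np.random.RandomState(0)
X = rng.rand(400, 4096)

k = capped_components(X, .25)          # 1024 requested, 400 available
X_red = PCA(n_components=k).fit_transform(X)
print(X_red.shape)
```

With real data the same helper would be called on `fX` before constructing the `PCA` object.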


# Cancer

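We use a tree-based dimension reduction method here: fit a random forest and keep the features with the highest importances.  Since the features have different units, PCA on the raw values would be dominated by the largest-scaled features, so we do not use it.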
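The scale issue can be illustrated on the breast cancer data itself. The sketch below is an addition and not something the original experiment ran: it compares the variance share of the first principal component on raw features against the same quantity after standardizing with `StandardScaler`, which is one common workaround when units differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # 569 samples, 30 features

# Without scaling, the first component mostly tracks the largest-valued features.
raw = PCA(n_components=1).fit(X)
scaled = make_pipeline(StandardScaler(), PCA(n_components=1)).fit(X)

raw_share = raw.explained_variance_ratio_[0]
scaled_share = scaled.named_steps['pca'].explained_variance_ratio_[0]
print("variance share of PC1, raw:    %.3f" % raw_share)
print("variance share of PC1, scaled: %.3f" % scaled_share)
```

A first component that absorbs nearly all the variance on raw data is a sign that a few large-scale features are driving the projection.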

### Before reduction

In [14]:
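# Get data
# Note: load_digits is called here, so this section runs on the digits data,
# not on the breast cancer data (load_breast_cancer).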
dX, dy = load_digits(return_X_y=True)
dX_train, dX_test, dy_train, dy_test = train_test_split(dX, dy, test_size=.3, random_state=61)

In [15]:
%%timeit
preds = rf().fit(dX_train, dy_train).predict(dX_test)

The slowest run took 4.08 times longer than the fastest. This could mean that an intermediate result is being cached.
10 loops, best of 3: 38.6 ms per loop


In [16]:
preds = rf().fit(dX_train, dy_train).predict(dX_test)
print "Pre-reduction Accuracy: ", 1.* sum(preds == dy_test) / len(preds)

Pre-reduction Accuracy:  0.940740740741


### After reduction

In [17]:
# Reduce features
model = rf().fit(dX_train, dy_train)
imps = model.feature_importances_  # get importances
best = imps.argsort()[-int(.25*len(imps)):][::-1]  # indices of the top 25% of features, most important first
red_dX = dX[:,best]
dX_train, dX_test, dy_train, dy_test = train_test_split(red_dX, dy, test_size=.3, random_state=61)

In [18]:
%%timeit
preds = rf().fit(dX_train, dy_train).predict(dX_test)

10 loops, best of 3: 35.2 ms per loop


In [19]:
preds = rf().fit(dX_train, dy_train).predict(dX_test)
print "Post-reduction Accuracy: ", 1.* sum(preds == dy_test) / len(preds)

Post-reduction Accuracy:  0.927777777778


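Keeping the top 25% of features by random-forest importance slightly decreased accuracy (0.941 to 0.928) and only marginally sped up fitting and prediction (38.6 ms to 35.2 ms).  Two caveats apply: the cells in this section load the digits data with `load_digits` rather than the breast cancer data, so these numbers describe digits; and the forests are unseeded, so part of the accuracy gap may be run-to-run variation.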
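For reference, here is a minimal sketch of the same tree-based reduction applied to the breast cancer data via `load_breast_cancer`. It is not the run recorded above: the seed `random_state=0`, the choice of `n_estimators=100`, and the variable names are assumptions made for this example, and the importances are computed from the training split only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # 569 samples, 30 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=61)

# Rank features using the training split only; random_state=0 is an arbitrary seed.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
n_keep = int(.25 * X.shape[1])                               # 7 of 30 features
best = np.argsort(forest.feature_importances_)[::-1][:n_keep]

full_acc = forest.score(X_test, y_test)
red_forest = RandomForestClassifier(n_estimators=100, random_state=0)
red_acc = red_forest.fit(X_train[:, best], y_train).score(X_test[:, best], y_test)

print("accuracy, all features: %.3f" % full_acc)
print("accuracy, top %d features: %.3f" % (n_keep, red_acc))
```

Seeding both forests makes the before/after comparison repeatable, so any remaining gap reflects the dropped features rather than a different random draw.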