# Dimensionality Reduction With PCA

Principal component analysis (PCA) is a linear algebra technique for reducing the number of dimensions (attributes) of a dataset without losing too much of its explanatory power.

Why reduce dimensionality? Because datasets with fewer dimensions are (a) computationally cheaper to process and (b) easier to visualize.

Ultimately, the goal of PCA is to alleviate the so-called curse of dimensionality.

The scikit-learn package provides a PCA class in the *decomposition* module, which will be used on the [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). 

In [1]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data
y = iris.target

# shape of X and y
print(X.shape, y.shape)

(150, 4) (150,)


In [2]:
# instantiate a PCA object with default values
pca = PCA()

# transform iris data
pca_iris = pca.fit_transform(X)

# check the variance explained by each dimension
pca.explained_variance_ratio_

array([ 0.92461621,  0.05301557,  0.01718514,  0.00518309])

As the output above shows, the first principal component (the first column of the transformed data) alone explains 92.5% of the variance in the iris dataset.
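
To see where these ratios come from, here is a minimal sketch that recomputes them from the eigenvalues of the feature covariance matrix. The names `cov`, `eigenvalues`, `manual_ratio` and `sklearn_ratio` are introduced here for illustration only, and the sketch shows the underlying idea rather than how scikit-learn computes the values internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# covariance matrix of the four features (np.cov centers the data itself)
cov = np.cov(X, rowvar=False)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order,
# so reverse them to list the largest variance first
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# each eigenvalue's share of the total variance
manual_ratio = eigenvalues / eigenvalues.sum()

# compare with the ratios reported by scikit-learn
sklearn_ratio = PCA().fit(X).explained_variance_ratio_
print(np.allclose(manual_ratio, sklearn_ratio))
```

Each eigenvalue is the variance of the data along one principal direction, so dividing by the sum of all eigenvalues gives the fraction of total variance that direction explains.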

The PCA transformation is now repeated, keeping only the first 2 components.

In [3]:
pca = PCA(n_components=2)
pca_iris = pca.fit_transform(X)

# check the shape of the transformed dataset
pca_iris.shape

(150, 2)

It is clear from the shape of the PCA-transformed dataset that the 150 observations now have only 2 columns.
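
Each of the 2 new columns is a linear combination of the 4 original features. As a rough illustration, the sketch below reproduces the transformed data by hand from the fitted `mean_` and `components_` attributes; the variable name `manual_projection` is introduced here and is not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2)
pca_iris = pca.fit_transform(X)

# one row per retained component, one column per original feature
print(pca.components_.shape)

# center the data with the fitted mean, then project onto the components
manual_projection = (X - pca.mean_) @ pca.components_.T

# the manual projection should match the output of fit_transform
print(np.allclose(manual_projection, pca_iris))
```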

We can easily check the total variance explained by the two columns.

In [4]:
print("{:.3f}".format(pca.explained_variance_ratio_.sum()))

0.978
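
The same check can be made for every possible number of components at once. The following sketch, which assumes a fresh PCA fit with all four components, prints the running total of the explained variance ratios; `ratios` and `cumulative` are names chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# fit with all components so that every ratio is available
ratios = PCA().fit(X).explained_variance_ratio_

# running total: variance explained by the first k components
cumulative = np.cumsum(ratios)

for k, total in enumerate(cumulative, start=1):
    print("{} component(s): {:.3f}".format(k, total))
```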


Next, perform knn classification using the PCA-transformed 2-component dataset:

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# values for tuning the knn parameters n_neighbors and weights
n_neighbor_values = list(range(1,31))
weight_values = ['distance', 'uniform']
grid_values = dict(n_neighbors=n_neighbor_values, weights=weight_values)
knn = KNeighborsClassifier()

# grid search over the parameters with 10-fold cross-validation
gridKnn = GridSearchCV(estimator=knn, param_grid=grid_values, scoring='accuracy', cv=10)
gridKnn.fit(pca_iris,y)

GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'weights': ['distance', 'uniform']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

Print the best parameters:

In [6]:
print("The best parameters are {} with a mean score of {:.2f}".format(gridKnn.best_params_, gridKnn.best_score_))

The best parameters are {'n_neighbors': 8, 'weights': 'distance'} with a mean score of 0.97


In the previous notebook, a grid search on the full iris dataset using the knn estimator and the same grid search values returned '13' and 'uniform' as the optimal values for *n_neighbors* and *weights* hyper-parameters, respectively.  

More importantly, the best score achieved by training with all four original features was 0.98, just 1 percentage point better!
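
The grid search above ran on data that PCA had already transformed using all 150 observations. One possible variation, sketched here as an assumption rather than the approach used in this notebook, wraps PCA and knn in a scikit-learn `Pipeline` so that PCA is refit on the training portion of each cross-validation fold. The step names `"pca"` and `"knn"` and the variables `pipe` and `grid_pipe` are arbitrary choices for this example, and the resulting best parameters may differ from those reported above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# chain the PCA step and the classifier into a single estimator
pipe = Pipeline([
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier()),
])

# pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "knn__n_neighbors": list(range(1, 31)),
    "knn__weights": ["distance", "uniform"],
}

# same scoring and 10-fold cross-validation as the grid search above
grid_pipe = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=10)
grid_pipe.fit(X, y)

print(grid_pipe.best_params_, "{:.2f}".format(grid_pipe.best_score_))
```

A fitted pipeline also accepts raw four-feature measurements in `predict`, because the PCA step is applied automatically.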

The knn estimator, tuned by a cross-validated grid search over the *n_neighbors* and *weights* hyper-parameters, can now be used to make predictions. Because it was trained on PCA-transformed data, the input must be a point in the 2-component PCA space:

In [7]:
gridKnn.predict([[1.5,2.7]])

array([1])
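
In practice a new flower arrives as four raw measurements, not as a point in PCA space. The sketch below shows one way to handle that: project the measurements with the fitted PCA object's `transform` method before calling `predict`. The measurement values in `new_flower` are hypothetical, and the classifier is built directly with the best parameters reported above instead of rerunning the grid search.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# fit the 2-component PCA and train knn on the transformed data
pca = PCA(n_components=2)
pca_iris = pca.fit_transform(X)
knn = KNeighborsClassifier(n_neighbors=8, weights="distance").fit(pca_iris, y)

# hypothetical raw measurements: sepal length, sepal width,
# petal length, petal width (in cm)
new_flower = [[6.0, 3.0, 5.0, 1.8]]

# project into the same 2-component space used for training
new_flower_pca = pca.transform(new_flower)
print(new_flower_pca.shape)

# predict the class index (0, 1 or 2) from the projected point
prediction = knn.predict(new_flower_pca)
print(prediction)
```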

What if the requirement were that the PCA must explain at least a certain fraction of the variance rather than have a fixed number of components? In this case, *n_components* must be set to the required fraction (a float between 0 and 1) rather than the required number of components when instantiating the PCA object:

In [8]:
# explain at least 90% of variance
pca = PCA(n_components=0.90)

pca_iris_var = pca.fit_transform(X)
pca_iris_var.shape

(150, 1)

In [9]:
# total variance explained
print("{:.3f}".format(pca.explained_variance_ratio_.sum()))

0.925
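
The number of components that PCA settled on is stored in the fitted object's `n_components_` attribute. As a rough sketch of the selection rule, the example below compares that attribute with a manual count based on the cumulative explained variance; the thresholds other than 0.90 and the names `cumulative`, `chosen` and `manual` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# cumulative explained variance from a fit that keeps all components
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)

chosen = {}
manual = {}
for threshold in (0.90, 0.95, 0.99):
    # let PCA pick the number of components for this variance target
    chosen[threshold] = PCA(n_components=threshold).fit(X).n_components_
    # smallest number of components whose cumulative ratio exceeds the target
    manual[threshold] = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    print(threshold, chosen[threshold], manual[threshold])
```

For the 0.90 target, a single component is enough, which matches the `(150, 1)` shape shown above.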
