# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Intro to Dimensionality Reduction
Week 8 | Lesson 1.1

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Understand the motivations for dimensionality reduction
- Follow the logical workflow behind dimensionality reduction
- Describe the basic intuition of Principal Component Analysis
- Calculate eigenvectors and eigenvalues for use in Principal Component Analysis


### STUDENT PRE-WORK
*Before this lesson, you should already be able to:*
- Have a working understanding of scikit-learn and NumPy
- Be able to create functions from scratch in Python
- Have a basic understanding of linear algebra concepts such as matrices

<a name="introduction"></a>
## Introduction: What is Dimensionality Reduction? 

Dimensionality reduction reduces the number of random variables that you are considering for analysis until you are left with the most important variables.

Dimensionality reduction is not an end goal in itself, but a tool to form a dataset with more parsimonious features for further visualization and/or modelling.

> Check: where have we already done dimensionality reduction? What are the potential benefits?

Imagine a scatter plot with one variable on the x axis and another on the y axis, where the points fall roughly along a 45 degree line. Fitting a line captures most of the information in the data (but leaves some noise). If we rotate the axes until that 45 degree line is completely horizontal, nearly all of the variation lies along a single axis, so each point can be described by one coordinate: the data is effectively *one-dimensional*.

![graph1](./assets/images/graph1.jpg)

![graph2](./assets/images/graph2.jpg)

So our goal is to reduce dimensions without losing information.

In other words, we want to remove redundancies in our data.


In [2]:
import numpy as np
import pandas as pd
x1 = [np.random.randn() for i in range(20)]
x2 = [np.random.randn() for i in range(20)]
x3 = [x*3 + np.random.randn()/10 for x in x1]
pd.DataFrame(list(zip(x1, x2, x3))).corr()  # list() so the zip iterator also works on older pandas under Python 3

Unnamed: 0,0,1,2
0,1.0,-0.153276,0.999527
1,-0.153276,1.0,-0.163417
2,0.999527,-0.163417,1.0


Where might there be redundant information here?

## A refresher on covariance and correlation

What is covariance?

Covariance is a measure of how two variables covary (i.e. how much the change in one is associated with the change in the other).  It specifically looks at how variables covary linearly.

$$
\text{COV}(X,Y) =  \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n}
$$


Covariance can take both positive and negative values.

What is correlation?

Correlation is covariance normalized by the standard deviations of the two variables.

$$
\text{corr}(X,Y) =  \frac{\text{COV}(X,Y)}{\sigma_X \sigma_Y}
$$

Correlation can take values from -1 to 1.
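
As a quick check of the two formulas above, the following sketch computes covariance and correlation by hand and compares them with `np.cov` and `np.corrcoef`. The arrays `a` and `b` and the two helper functions are made up for illustration. Note that the covariance formula above divides by $n$, whereas `np.cov` and pandas' `.cov()` divide by $n - 1$ by default, so `bias=True` is passed here to match.

```python
import numpy as np

# Made-up illustration data (not from the lesson's datasets)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def covariance(x, y):
    """Population covariance: mean of the products of deviations (divides by n)."""
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Covariance normalized by the two population standard deviations."""
    return covariance(x, y) / (x.std() * y.std())

cov_ab = covariance(a, b)
corr_ab = correlation(a, b)

# bias=True makes np.cov divide by n, matching the formula above
np_cov_ab = np.cov(a, b, bias=True)[0, 1]
np_corr_ab = np.corrcoef(a, b)[0, 1]

print(cov_ab, np_cov_ab)
print(corr_ab, np_corr_ab)
```

The correlation is the same whichever denominator is used, because the $n$ (or $n - 1$) factors cancel between the numerator and the denominator.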


## Why do we care about covariance?

It incorporates the signal (variance) and also redundancy.

Covariance is calculated between two variables.  What if we have many variables?

We can calculate a covariance matrix (similar to the correlation matrix we've seen).

In [3]:
pd.DataFrame(list(zip(x1, x2, x3))).cov()

Unnamed: 0,0,1,2
0,1.126418,-0.18025,3.378774
1,-0.18025,1.227733,-0.576718
2,3.378774,-0.576718,10.144468


The values on the diagonal are just variance since $COV(X,X) = VAR(X)$.

What would an 'ideal' covariance matrix look like?


An "ideal" covariance matrix for data would have large numbers (variances) along the diagonal because this would indicate a large amount of signal in the data. It would also have zero values in the off-diagonal elements because these values indicate redundancy across our variables.

What can we do to try to remove any redundancies and preserve the signal?

Enter PCA!

<a name="demo"></a>
## Demo: Applications of Dimensionality Reduction

Our first priority is to get comfortable with the initial manual workflow of PCA. (We'll expand in a following lesson.)

- Isolate the feature data
- Center and scale the feature data
- Calculate their covariance matrix
- Calculate the eigenvalues and eigenvectors
- Choose the best n principal components
- Calculate newly extracted feature data
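
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `manual_pca` is a hypothetical helper name, and the data is synthetic (the third column is nearly a multiple of the first, so one dimension is redundant).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def manual_pca(features, n_components):
    """Hypothetical helper: project `features` (observations x features)
    onto its top `n_components` principal components."""
    # Center and scale the feature data
    standardized = StandardScaler().fit_transform(features)
    # Covariance matrix of the features (np.cov wants features on the rows)
    cov_mat = np.cov(standardized.T)
    # Eigenvalues and eigenvectors of the covariance matrix
    eigen_values, eigen_vectors = np.linalg.eig(cov_mat)
    # Order components from largest to smallest eigenvalue and keep the top n
    order = np.argsort(eigen_values)[::-1][:n_components]
    W = eigen_vectors[:, order]
    # Calculate the newly extracted feature data
    return standardized.dot(W), eigen_values[order]

# Synthetic data: x3 is almost entirely determined by x1
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 3 * x1 + rng.normal(size=200) / 10
features = np.column_stack([x1, x2, x3])

reduced, top_eigen_values = manual_pca(features, n_components=2)
print(reduced.shape)       # (200, 2)
print(top_eigen_values)
```

Each of these steps is walked through individually below.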



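```python
from sklearn.preprocessing import StandardScaler

# `data`, `feature_columns` and `target_column` are placeholders for your own
# DataFrame and column selections (.ix has been removed from pandas; use .loc or .iloc)
x = data.loc[:, feature_columns].values
y = data.loc[:, target_column].values
x_standard = StandardScaler().fit_transform(x)
```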

A **covariance matrix** of n-features is just an n x n matrix, where the elements are the [covariances](https://en.wikipedia.org/wiki/Covariance) for each pair of _n_ features.

```
cov_mat = np.cov(x_standard.T)
```

(We're **transposing** the matrix only because `np.cov` expects each row to hold a feature and each column to hold an observation.)
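
To see why the transpose matters, here is a small sketch using a made-up array `X_demo` with 5 observations (rows) and 3 features (columns). `rowvar=False` is an existing `np.cov` option that tells NumPy the columns are the variables, which gives the same result as transposing.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))   # 5 observations, 3 features (illustrative data)

cov_transposed = np.cov(X_demo.T)           # features moved onto the rows
cov_rowvar = np.cov(X_demo, rowvar=False)   # tell np.cov that columns are the features
cov_wrong = np.cov(X_demo)                  # treats each observation as a variable

print(cov_transposed.shape)   # (3, 3): one row and column per feature
print(cov_wrong.shape)        # (5, 5): not the matrix we want
```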

Now, we decompose our matrix by calling the numpy linear algebra function `linalg.eig()` to calculate the [**eigenvectors** and **eigenvalues**](https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors).

```
eigenValues, eigenVectors = np.linalg.eig(cov_mat)
```

The eigenvectors of a linear transformation are vectors that do not change direction under that transformation, but only have their magnitude scaled by some scalar value (the eigenvalue).

In this context, the eigenvectors are the new dimensions of our data.  These are the principal components.

The larger an eigenvalue, the more variance (information) in our data its corresponding eigenvector explains.
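
The definition can be checked numerically. The sketch below uses a small symmetric matrix `A`, chosen only for illustration (it has the shape of a covariance matrix for two features), and confirms that multiplying each eigenvector by `A` merely rescales it by its eigenvalue.

```python
import numpy as np

# Illustrative 2 x 2 symmetric matrix (not taken from the lesson's data)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigen_values, eigen_vectors = np.linalg.eig(A)

for i in range(len(eigen_values)):
    v = eigen_vectors[:, i]   # eigenvectors are the COLUMNS of the returned matrix
    # Applying A only rescales v by its eigenvalue; the direction is unchanged
    print(np.allclose(A.dot(v), eigen_values[i] * v))

# Share of the total variance associated with each eigenvector
print(eigen_values / eigen_values.sum())
```

Note that `np.linalg.eig` does not promise any particular ordering of the eigenvalues, which is why the guided practice below sorts them explicitly.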

Once we have our eigenvalues, we can work on transforming our data onto another dimensional space. Remember the visual representation from above - this is exactly what we are doing in this step. 


<a name="guided-practice"></a>
## Guided Practice: Conducting Dimensionality Analysis

Now that you know the procedure, let's run through an implementation of dimensionality reduction with a real dataset.

We're going to revisit the [wine](./assets/datasets/wine_v.csv) dataset, which lists the attributes of different wine varieties.

In [4]:
import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt
import math
from sklearn.preprocessing import StandardScaler

In [5]:
wine = pd.read_csv('./assets/datasets/wine_v.csv')
wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Varietal
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Cabernet
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,Cabernet
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,Cabernet
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,Cabernet
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Cabernet


In [6]:
""" Isolate the feature data."""
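# .ix has been removed from pandas; .iloc selects by position
x = wine.iloc[:, 0:11].values   # the 11 feature columns
y = wine.iloc[:, 12].values     # the Varietal column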


Let's look at the covariance matrix for the data.

In [7]:
pd.DataFrame(np.cov(x.T))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,3.031416,-0.079851,0.22782,0.281756,0.007679,-2.800921,-6.482346,0.002195,-0.183586,0.05401,-0.114421
1,-0.079851,0.032062,-0.019272,0.000484,0.000517,-0.019674,0.450426,7e-06,0.006495,-0.007921,-0.0386
2,0.22782,-0.019272,0.037947,0.039434,0.001869,-0.124252,0.227697,0.000134,-0.016298,0.010328,0.022815
3,0.281756,0.000484,0.039434,1.987897,0.00369,2.758611,9.416441,0.000945,-0.018644,0.001321,0.063219
4,0.007679,0.000517,0.001869,0.00369,0.002215,0.002738,0.073387,1.8e-05,-0.001926,0.002962,-0.011092
5,-2.800921,-0.019674,-0.124252,2.758611,0.002738,109.414884,229.737521,-0.000433,0.113653,0.091592,-0.773698
6,-6.482346,0.450426,0.227697,9.416441,0.073387,229.737521,1082.102373,0.004425,-0.337699,0.239471,-7.209298
7,0.002195,7e-06,0.000134,0.000945,1.8e-05,-0.000433,0.004425,4e-06,-0.0001,4.8e-05,-0.000998
8,-0.183586,0.006495,-0.016298,-0.018644,-0.001926,0.113653,-0.337699,-0.0001,0.023835,-0.005146,0.033832
9,0.05401,-0.007921,0.010328,0.001321,0.002962,0.091592,0.239471,4.8e-05,-0.005146,0.028733,0.016907


In [8]:
""" Center and scale the feature data."""
x_standard = StandardScaler().fit_transform(x)

""" Calculate their covariance matrix. """
cov_mat = np.cov(x_standard.T)

pd.DataFrame(cov_mat)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,1.000626,-0.256291,0.672124,0.114849,0.093764,-0.15389,-0.113252,0.668465,-0.683406,0.18312,-0.061707
1,-0.256291,1.000626,-0.552841,0.001919,0.061336,-0.01051,0.076518,0.02204,0.235084,-0.26115,-0.202415
2,0.672124,-0.552841,1.000626,0.143667,0.20395,-0.061016,0.035555,0.365176,-0.542243,0.312966,0.109972
3,0.114849,0.001919,0.143667,1.000626,0.055644,0.187166,0.203155,0.355506,-0.085706,0.005531,0.042102
4,0.093764,0.061336,0.20395,0.055644,1.000626,0.005566,0.04743,0.200758,-0.265192,0.371493,-0.221279
5,-0.15389,-0.01051,-0.061016,0.187166,0.005566,1.000626,0.668084,-0.02196,0.070422,0.05169,-0.069452
6,-0.113252,0.076518,0.035555,0.203155,0.04743,0.668084,1.000626,0.071314,-0.066536,0.042974,-0.205783
7,0.668465,0.02204,0.365176,0.355506,0.200758,-0.02196,0.071314,1.000626,-0.341913,0.148599,-0.49649
8,-0.683406,0.235084,-0.542243,-0.085706,-0.265192,0.070422,-0.066536,-0.341913,1.000626,-0.196771,0.205761
9,0.18312,-0.26115,0.312966,0.005531,0.371493,0.05169,0.042974,0.148599,-0.196771,1.000626,0.093653


In [9]:
"""Calculate the eigenvalues and eigenvectors."""
eigenValues, eigenVectors = np.linalg.eig(cov_mat)

In [16]:
""" Choose the best n principal components.  Calculate newly extracted feature data."""

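# Pair each eigenvalue with its eigenvector (a column of eigenVectors)
eig_pairs = [(np.abs(eigenValues[i]), eigenVectors[:, i])
             for i in range(len(eigenValues))]
# Sort on the eigenvalue only, largest first (avoids comparing arrays on ties)
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)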
for i in eig_pairs[:2]:
    print(i[0],i[1])

(3.1010718226758938, array([ 0.48931422, -0.23858436,  0.46363166,  0.14610715,  0.21224658,
       -0.03615752,  0.02357485,  0.39535301, -0.43851962,  0.24292133,
       -0.11323207]))
(1.9271148896490469, array([-0.11050274,  0.27493048, -0.15179136,  0.27208024,  0.14805156,
        0.51356681,  0.56948696,  0.23357549,  0.00671079, -0.03755392,
       -0.38618096]))


### Now what?

We can use this to transform our data onto a lower dimension space.

In [17]:
W = np.hstack((eig_pairs[0][1].reshape(11,1), \
               eig_pairs[1][1].reshape(11,1))) # Our transformation matrix
W

array([[ 0.48931422, -0.11050274],
       [-0.23858436,  0.27493048],
       [ 0.46363166, -0.15179136],
       [ 0.14610715,  0.27208024],
       [ 0.21224658,  0.14805156],
       [-0.03615752,  0.51356681],
       [ 0.02357485,  0.56948696],
       [ 0.39535301,  0.23357549],
       [-0.43851962,  0.00671079],
       [ 0.24292133, -0.03755392],
       [-0.11323207, -0.38618096]])

In [12]:
X_reduced = x_standard.dot(W)
X_reduced

array([[-1.61952988,  0.45095009],
       [-0.79916993,  1.85655306],
       [-0.74847909,  0.88203886],
       ..., 
       [-1.45612897,  0.31174559],
       [-2.27051793,  0.97979111],
       [-0.42697475, -0.53669021]])

This matrix represents the original data transformed into the two-dimensional space.

Let's look at the covariance matrix again.  

Remember we now only have two features.

Is it better?

In [13]:
cov_mat = np.cov(X_reduced.T)

pd.DataFrame(cov_mat)

Unnamed: 0,0,1
0,3.101072,-6.280611e-16
1,-6.280611e-16,1.927115
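
The near-zero off-diagonal values are not a coincidence. As a short derivation, write $C$ for the covariance matrix of the standardized data $X$, $W$ for the matrix whose columns are the two chosen eigenvectors $w_1, w_2$, and $\lambda_1, \lambda_2$ for their eigenvalues (this notation is introduced here for illustration only). Because $X$ is centered and the eigenvectors returned by NumPy have unit length and are mutually orthogonal for a symmetric matrix with distinct eigenvalues:

$$
\text{COV}(XW) = W^T C W, \qquad C w_i = \lambda_i w_i, \qquad w_i^T w_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}
$$

$$
\Rightarrow \quad W^T C W = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}
$$

So the diagonal holds the two largest eigenvalues (3.101072 and 1.927115 above), and the off-diagonal entries are zero up to floating-point error.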


## How else can we see how well we managed to keep the signal, but remove the redundancy?

Fraction of variance explained by our new features!

This is calculated using the eigenvalues.

In [14]:
tot = sum(eigenValues)
var_exp = [(i / tot)*100 for i in sorted(eigenValues, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)

[  28.17393128   45.68220118   59.77805108   70.80743772   79.52827474
   85.52471351   90.83190641   94.67696732   97.81007747   99.4585608   100.        ]
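
For comparison, scikit-learn's `PCA` class reports the same quantity through its `explained_variance_ratio_` attribute. The sketch below uses synthetic stand-in data (not the wine dataset), so it illustrates the idea rather than reproducing the numbers above. `np.linalg.eigvalsh` is used as a shortcut that returns only the eigenvalues of a symmetric matrix, in ascending order.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 4 features, the last one nearly redundant with the first
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
redundant = 2 * base[:, [0]] + rng.normal(size=(300, 1)) / 10
X_demo = StandardScaler().fit_transform(np.hstack([base, redundant]))

# Manual route: eigenvalues of the covariance matrix, reversed to largest first
eigen_values = np.linalg.eigvalsh(np.cov(X_demo.T))[::-1]
manual_ratio = eigen_values / eigen_values.sum()

# scikit-learn route
pca = PCA(n_components=4).fit(X_demo)

print(np.cumsum(manual_ratio) * 100)
print(np.cumsum(pca.explained_variance_ratio_) * 100)
```

The two printed arrays should agree, which suggests the manual eigenvalue calculation and the library are measuring the same thing.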


## Remember, this was just pre-processing.  We can build models in the same way with the new data

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x_standard, y, test_size=0.33, random_state=1)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test), "mean accuracy, using {0} dimensions.".format(x_standard.shape[1]))

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.33, random_state=1)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test), "mean accuracy, using {0} principal component dimensions.".format(X_reduced.shape[1]))

0.617424242424 mean accuracy, using 11 dimensions.
0.513257575758 mean accuracy, using 2 principal component dimensions.


<a name="ind-practice"></a>
## Independent Practice: Dimensionality Reduction on the Iris dataset
Now that we've gone over the long-form approach to dimensionality reduction and worked through an example, let's put your skills to the test! We're going to be working with the classic [iris dataset](./assets/datasets/iris.csv). We want to decompose the data to the point of finding the eigenvectors and eigenvalues. Grab the [starter code](./code/starter-code/w7d2-dimensionality-reduction-iris-starter-code.ipynb) to begin!



<a name="conclusion"></a>
## Conclusion (5 mins)
- Recap and recall the process steps in dimensionality reduction
    -  Covariance Matrix: First, we center and scale the features and create their covariance matrix, which we then decompose to find our eigenvalues and eigenvectors.
    -  Eigenvectors & Eigenvalues: We decompose the covariance matrix to derive our eigenvectors and eigenvalues, and select the eigenpairs with the largest eigenvalues to become our principal components.
    -  Projection: Lastly, we project the data onto the new feature subspace spanned by the selected eigenvectors.

***



### ADDITIONAL RESOURCES

- [Unsupervised Dimensionality Reduction in sklearn](http://scikit-learn.org/stable/modules/unsupervised_reduction.html)
- [In depth overview of Dimensionality Reduction and PCA from Stanford University](http://ufldl.stanford.edu/wiki/index.php/PCA)