# PCA

PCA, or Principal Component Analysis, is a technique for dimensionality reduction. It projects each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible. It does so by creating new, uncorrelated variables that successively maximize variance.

## What is dimensionality reduction?

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.

## Why do we need dimensionality reduction?

- High-dimensional data with too many features is harder and slower to process.
- Many features are often correlated (e.g. humidity and rainfall), so processing them independently is redundant.
- Many machine learning algorithms break down when working with high-dimensional data. This phenomenon is commonly referred to as the **curse of dimensionality**.
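
The redundancy of correlated features can be illustrated with a small sketch. The data below is synthetic and purely hypothetical (the `humidity` and `rainfall` arrays are made up for illustration); it only shows that when two features are strongly correlated, a single direction carries almost all of their combined variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: "rainfall" is mostly a linear function of "humidity".
humidity = rng.normal(size=1000)
rainfall = 0.9 * humidity + 0.1 * rng.normal(size=1000)
features = np.column_stack([humidity, rainfall])

# Correlation between the two columns.
corr = np.corrcoef(features, rowvar=False)[0, 1]

# Eigenvalues of the covariance matrix give the variance along each principal direction.
eigenvalues = np.linalg.eigvalsh(np.cov(features, rowvar=False))
share_of_top_direction = eigenvalues[-1] / eigenvalues.sum()

print(f"correlation: {corr:.3f}")
print(f"share of variance along one direction: {share_of_top_direction:.3f}")
```

In this toy setting one new variable would be enough to describe both features with very little loss.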

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.linalg import eigh
from sklearn import decomposition

In [None]:
train_df = pd.read_csv("../input/digit-recognizer/train.csv")

In [None]:
train_df.head()

In [None]:
label = train_df.label
train = train_df.drop('label', axis=1)

In [None]:
print(label.shape)
print(train.shape)

In [None]:
label[0]

In [None]:
train.head()

In [None]:
plt.figure(figsize=(2,2))
# reshape the first row from a flat 784-pixel vector into a 28 x 28 image
grid_data = train.loc[0].values.reshape(28,28)
# plot the image in grayscale without interpolation
plt.imshow(grid_data,interpolation='none',cmap='gray')

# display the plot
plt.show()

# Data Preprocessing 

In [None]:
standardized_data = StandardScaler().fit_transform(train)
standardized_data.shape
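
As a rough sketch of what standardization does, the snippet below rescales each column to zero mean and unit variance by hand with NumPy. The toy matrix `X` is hypothetical, and treating constant columns by dividing by 1 is an assumption that mirrors how `StandardScaler` is understood to handle zero-variance features (such as a pixel that is always 0).

```python
import numpy as np

# Hypothetical toy matrix: 4 samples, 3 features; the last column is constant.
X = np.array([[1.0, 10.0, 0.0],
              [2.0, 20.0, 0.0],
              [3.0, 30.0, 0.0],
              [4.0, 40.0, 0.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)                      # population standard deviation (ddof=0)
std_safe = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns

X_standardized = (X - mean) / std_safe

print(X_standardized.mean(axis=0))  # every column is centered at 0
print(X_standardized.std(axis=0))   # non-constant columns have unit standard deviation
```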

# PCA Implementation 

To implement PCA we need three steps:
1. Covariance matrix
2. Eigenvectors and eigenvalues
3. Projection onto a 2D plane

### 1. Covariance Matrix

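The covariance matrix describes how every pair of features varies together: its diagonal entries are the variances of the individual features, and its off-diagonal entries are the covariances between pairs of features.

Equation:

$\mathbf{S} = \frac{1}{n-1} \mathbf{A}^\intercal \mathbf{A}$

where $\mathbf{A}$ is the standardized data matrix (one row per sample) and $n$ is the number of samples. The constant factor $\frac{1}{n-1}$ only rescales the eigenvalues and leaves the eigenvectors unchanged, so it can be omitted when only the principal directions are needed.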


In [None]:
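# A^T A is proportional to the covariance matrix of the standardized data;
# the omitted 1/(n-1) factor does not change the eigenvectors
covar_matrix = np.matmul(standardized_data.T,standardized_data)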
covar_matrix
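
As a sanity check on the relationship between $\mathbf{A}^\intercal \mathbf{A}$ and the covariance matrix, the following sketch uses a small, hypothetical random matrix `A` (not the digit data) and compares the scaled product against `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 features, standardized column by column.
A = rng.normal(size=(200, 5))
A = (A - A.mean(axis=0)) / A.std(axis=0)

n = A.shape[0]
gram = A.T @ A        # the unnormalized product used in the cell above
cov = gram / (n - 1)  # sample covariance of the centered data

# np.cov centers the data itself and divides by n - 1 by default.
matches_numpy = np.allclose(cov, np.cov(A, rowvar=False))
print(matches_numpy)
```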

### 2. Eigenvectors and Eigenvalues

The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues give the amount of variance along each of those directions.

The basic equation is:

$\mathbf{S}\mu = \lambda \mu$

where $\lambda$ is an eigenvalue, $\mathbf{S}$ is the covariance matrix, and $\mu$ is the corresponding eigenvector.
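
The eigenvalue equation can be checked numerically on a tiny example. This is only a sketch: the 2 x 2 matrix `S` is hypothetical, and `np.linalg.eigh` is used here in place of SciPy's `eigh` to keep the snippet NumPy-only.

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order;
# column i of `eigenvectors` belongs to eigenvalues[i].
eigenvalues, eigenvectors = np.linalg.eigh(S)

for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    mu = eigenvectors[:, i]
    # S mu and lambda mu should agree up to floating-point error.
    assert np.allclose(S @ mu, lam * mu)

print(eigenvalues)  # the eigenvalues of this matrix are 1 and 3
```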

In [None]:
# we project (42000 x 784) down to (42000 x 2), so we need the eigenvectors of the two largest eigenvalues
# eigh returns eigenvalues in ascending order, so indices 782 and 783 select the top 2
# (subset_by_index is used because the eigvals keyword is deprecated in newer SciPy versions)
values, vectors = eigh(covar_matrix, subset_by_index=[782, 783])

print("Shape of eigen vectors = ",vectors.shape)
print(vectors)

# transpose so that each row is one eigenvector, shape (2, 784)
vectors = vectors.T
print("Updated shape of eigen vectors = ",vectors.shape)
print(vectors)
# vectors[1] is the eigenvector of the largest eigenvalue (1st principal component)
# vectors[0] is the eigenvector of the second-largest eigenvalue (2nd principal component)

### 3. Projection onto 2D Plane

In [None]:
new_coord =  np.matmul(vectors, standardized_data.T)
print(new_coord)
print(new_coord.shape)
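
To make the projection step more concrete, here is a minimal sketch on hypothetical random data (the matrix `X` is made up and much smaller than the digit data). It suggests two properties of the projected coordinates: their variances equal the two largest eigenvalues, and the two new coordinates are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples, 6 correlated features, centered column by column.
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)

S = X.T @ X / (X.shape[0] - 1)
eigenvalues, eigenvectors = np.linalg.eigh(S)  # ascending order

# Top 2 eigenvectors as rows, largest eigenvalue first.
top2 = eigenvectors[:, ::-1][:, :2].T          # shape (2, 6)
projected = top2 @ X.T                         # shape (2, 500), same layout as new_coord

# Covariance of the projected coordinates (np.cov treats rows as variables by default).
projected_cov = np.cov(projected)
print(np.round(projected_cov, 6))
print(eigenvalues[::-1][:2])
```

The off-diagonal entries of `projected_cov` should be numerically zero, which is what "uncorrelated new variables" means in practice.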

In [None]:
pca_data = pd.DataFrame({"1st_principal" : new_coord[1]
                         ,"2nd_principal" : new_coord[0], "label" : label})

In [None]:
pca_data

In [None]:
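# color points by label via hue; only x and y are passed to plt.scatter
# (a third positional column would be interpreted as the marker size)
sns.FacetGrid(pca_data, hue='label', height=8).map(plt.scatter, "1st_principal", "2nd_principal").add_legend()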
plt.show()

# PCA Implementation using Scikit-Learn

In [None]:
pca = decomposition.PCA(n_components=2)
pca_data_sci = pca.fit_transform(standardized_data)
pca_data_sci.shape

In [None]:
pca_data_sci_new = pd.DataFrame({"1st_principal" : pca_data_sci.T[0]
                         , "2nd_principal" : pca_data_sci.T[1], "label" : label})

In [None]:
pca_data_sci_new

In [None]:
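# color points by label via hue; only x and y are passed to plt.scatter
# (a third positional column would be interpreted as the marker size)
sns.FacetGrid(pca_data_sci_new, hue='label',height=8).map(plt.scatter, "1st_principal", "2nd_principal").add_legend()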
plt.show()
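
The scikit-learn plot may look mirrored compared with the manual one. A likely reason is that the sign of an eigenvector is arbitrary, and scikit-learn's `PCA` typically works through a singular value decomposition (SVD) rather than an explicit eigendecomposition. The sketch below, on hypothetical random data and with NumPy only, compares the two routes and checks that they agree up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 300 samples, 5 correlated features, centered column by column.
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

# Route 1: eigendecomposition of X^T X, keeping the two largest eigenvalues.
_, eigenvectors = np.linalg.eigh(X.T @ X)
eig_proj = X @ eigenvectors[:, ::-1][:, :2]

# Route 2: SVD of X; the rows of Vt are the principal directions, largest first.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
svd_proj = X @ Vt[:2].T

# Each projected column may differ by a factor of -1, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(eig_proj), np.abs(svd_proj))
print(same_up_to_sign)
```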

# Variance Explained by PCA

In [None]:
pca.n_components = 784

pca_data = pca.fit_transform(standardized_data)

percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)

#cumulative sum of the percentage_var_explained
cumulative_explained_variance = np.cumsum(percentage_var_explained)

In [None]:
plt.figure(figsize=(6,4))
plt.plot(cumulative_explained_variance,linewidth=3)
plt.grid()

plt.xlabel('n_components')
plt.ylabel('Cumulative_explained_variance')
plt.show()

The graph above shows that with about 200 principal components we retain around 90% of the variance.
Thus, instead of working with all 784 dimensions, we can work with around 200 dimensions without any major loss of information.
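
Instead of reading the number of components off the plot, it can be computed from the cumulative curve. The sketch below uses a short, hypothetical array of explained-variance ratios rather than the real values from the digit data; with the notebook's variables, `cumulative_explained_variance` would take the place of `cumulative`.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in descending order and summing to 1.
explained_variance_ratio = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
cumulative = np.cumsum(explained_variance_ratio)

threshold = 0.90
# searchsorted returns the first index whose cumulative value reaches the threshold;
# adding 1 converts that index into a number of components.
n_components_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(n_components_needed)  # 4
```

scikit-learn's `PCA` can also take a float between 0 and 1 as `n_components`, in which case it selects the number of components needed to reach that fraction of explained variance.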