# Principal Component Analysis

This notebook walks through Principal Component Analysis (PCA) in detail using the Horse Colic dataset.
The dataset is used to predict whether a horse will survive, based on its past medical conditions.
The data is available from the following sources:
1.  [Kaggle](http://www.kaggle.com/uciml/horse-colic)
2. [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Horse+Colic)

Importing required libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for plots
#import seaborn as sns #fo
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')

Read the data from the **CSV file** into a **pandas DataFrame**.

In [None]:
print(os.listdir("../input"))
data = pd.read_csv('../input/horse.csv')
data.head()

The data is then cleaned, converted to numeric form for processing, and stored as **data_merge**.

Tutorial link for data cleaning - [Data Cleaning](http://www.kaggle.com/sabasiddiqi/data-examination-and-cleaning).

In [None]:
#print("Shape of data (samples, features): ",data.shape)
data.dtypes.value_counts()

nan_per=data.isna().sum()/len(data)*100
plt.bar(range(len(nan_per)),nan_per)
plt.xlabel('Features')
plt.ylabel('% of NAN values')
plt.plot([0, 25], [40,40], 'r--', lw=1)
plt.xticks(list(range(len(data.columns))),list(data.columns.values),rotation='vertical')

obj_columns=[]
nonobj_columns=[]
for col in data.columns.values:
    if data[col].dtype=='object':
        obj_columns.append(col)
    else:
        nonobj_columns.append(col)
#print(len(obj_columns)," Object Columns are \n",obj_columns,'\n')
#print(len(nonobj_columns),"Non-object columns are \n",nonobj_columns)

data_obj=data[obj_columns]
data_nonobj=data[nonobj_columns]

#print("Data Size Before Numerical NAN Column(>40%) Removal :",data_nonobj.shape)
for col in data_nonobj.columns.values:
    if (pd.isna(data_nonobj[col]).sum())>0:
        if pd.isna(data_nonobj[col]).sum() > (40/100*len(data_nonobj)):
            #print(col,"removed")
            data_nonobj=data_nonobj.drop([col], axis=1)
        else:
            data_nonobj[col]=data_nonobj[col].fillna(data_nonobj[col].median())
#print("Data Size After Numerical NAN Column(>40%) Removal :",data_nonobj.shape)

for col in data_obj.columns.values:
    data_obj[col]=data_obj[col].astype('category').cat.codes
data_merge=pd.concat([data_nonobj,data_obj],axis=1)

target=data['outcome']
temp=target
#print(target.value_counts())
target=data_merge['outcome']
#print(target.value_counts())

train_corr=data_merge.corr()
#sns.heatmap(train_corr, vmax=0.8)
corr_values=train_corr['outcome'].sort_values(ascending=False)
corr_values=abs(corr_values).sort_values(ascending=False)
#print("Correlation of mentioned features wrt outcome in ascending order")
#print(abs(corr_values).sort_values(ascending=False))

#print("Data Size Before Correlated Column Removal :",data_merge.shape)

for col in range(len(corr_values)):
    # corr_values is indexed by column name, so use .iloc for positional access
    if abs(corr_values.iloc[col]) < 0.1:
        data_merge=data_merge.drop([corr_values.index[col]], axis=1)
        #print(corr_values.index[col],"removed")
#print("Data Size After Correlated Column Removal :",data_merge.shape)

In [None]:
data_merge.head()

## Principal Component Analysis 

### Dimensionality Reduction

High-dimensional data is difficult to visualize and interpret. Such data often contains redundancy because some features are strongly correlated with one another. Also, as the number of dimensions increases, so do the computation time and power required. These issues can be addressed with ***Dimensionality Reduction***. 

It is the process of reducing the number of random variables under consideration (also known as features) while losing as little information as possible. It does so by obtaining a set of principal variables. Its benefits are:
1. Reduces storage space by compressing the data
2. Reduces computation time and power
3. Deals with multicollinearity, i.e. redundant data
4. Helps with data visualization and interpretation (e.g. projection into two dimensions)

### Principal Component Analysis

One way to perform dimensionality reduction is ***Principal Component Analysis (PCA)***. It is a linear transformation technique that identifies strong patterns in data by exploiting the correlation between variables. It maps the data onto a lower-dimensional subspace chosen so that the variance of the projected data is maximized, which retains most of the information.

Steps to perform PCA:
* Step 1: [Data Standardization](#step1)
* Step 2: [Covariance Matrix](#step2)
* Step 3: [Eigen Decomposition of Covariance Matrix](#step3)
* Step 4: [Projection Onto New Feature Space](#step4)


## <a id='step1'>Step 1 - Data Standardization</a>

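PCA looks for the directions of maximum variance, so a feature measured on a larger scale would dominate the result simply because of its units. It is therefore important to bring all features onto the same scale (zero mean and unit variance) first.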


In [None]:
data.describe()

In [None]:
from sklearn.preprocessing import StandardScaler
Xstd = StandardScaler().fit_transform(data_merge)
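
As a rough illustration of what the scaling step does, the sketch below standardizes a small array by hand with NumPy. The array `X_demo` and its values are made up for illustration and are not taken from the horse colic data. It assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
X_demo = np.array([[1.0, 1000.0],
                   [2.0, 3000.0],
                   [3.0, 2000.0],
                   [4.0, 6000.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)          # ddof=0, the same convention StandardScaler uses
X_demo_std = (X_demo - mean) / std

print(X_demo_std.mean(axis=0))    # approximately 0 for every column
print(X_demo_std.std(axis=0))     # approximately 1 for every column
```

After this step both columns contribute on an equal footing, regardless of their original units.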

## <a id='step2'>Step 2 - Covariance Matrix</a>

The standardized data is then used to compute the covariance matrix, which we need in order to find the eigenvalues and eigenvectors in the next step. 

A covariance matrix describes the linear relationship between each pair of variables, i.e. how the variables change together. 
Let's say we have two random variables x and y: 
* if x increases as y increases -> positive linear relationship ***(+ve covariance value)***
* if x decreases as y decreases -> positive linear relationship ***(+ve covariance value)***
* if x increases as y decreases -> negative linear relationship  ***(-ve covariance value)***
* if x decreases as y increases -> negative linear relationship  ***(-ve covariance value)***

![Sketch.png](attachment:Sketch.png)

In [None]:
print('Covariance matrix: \n', np.cov(Xstd.T))
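
To make the sign rules above concrete, here is a minimal sketch that computes a covariance matrix by hand and compares it with `np.cov`. The toy variables `x`, `y_pos` and `y_neg` are assumptions made up for this example; they are not columns of the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: y_pos rises with x, y_neg falls as x rises
x = rng.normal(size=200)
y_pos = 2.0 * x + rng.normal(scale=0.5, size=200)
y_neg = -1.5 * x + rng.normal(scale=0.5, size=200)
X_demo = np.column_stack([x, y_pos, y_neg])

# Covariance by hand: center each column, then average the pairwise products
n = X_demo.shape[0]
X_centered = X_demo - X_demo.mean(axis=0)
cov_manual = X_centered.T @ X_centered / (n - 1)

# np.cov expects variables in rows, hence the transpose
cov_numpy = np.cov(X_demo.T)

print(np.allclose(cov_manual, cov_numpy))   # both routes give the same matrix
print(cov_manual[0, 1], cov_manual[0, 2])   # positive for y_pos, negative for y_neg
```

The transpose in `np.cov(X_demo.T)` is the same one used on `Xstd` in the cell above, since samples are stored in rows.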

## <a id='step3'>Step 3 - Eigen Decomposition of Covariance Matrix</a>

When a random vector is multiplied by a covariance matrix, it is pulled towards the direction of greatest variation. In this way we can find the dimension along which the data is spread the most.

The vectors we are looking for are those whose direction does not change when they are multiplied by the covariance matrix; they already point along a principal direction of variation.
They are found by computing eigenvalues and eigenvectors, where an eigenvector gives the direction and its eigenvalue gives the scale (the variance along that direction):

***Covariance Matrix x Vector = Scalar Component x Vector***

where Vector is the eigenvector and Scalar Component is the eigenvalue.
![Sketch_1.png](attachment:Sketch_1.png)
            

In [None]:
cov_mat = np.cov(Xstd.T)
eigen_values, eigen_vectors = np.linalg.eig(cov_mat)
print('Eigen vectors \n',eigen_vectors)
print('\nEigen values \n',eigen_values)
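
The defining relation, Covariance Matrix x Vector = Scalar Component x Vector, can be checked numerically. The sketch below does so on a small symmetric matrix `cov_demo` that is made up for illustration. It uses `np.linalg.eigh`, NumPy's routine for symmetric matrices, as an assumption of convenience; the notebook itself uses `np.linalg.eig`.

```python
import numpy as np

# Hypothetical 2x2 covariance matrix (symmetric), chosen for illustration
cov_demo = np.array([[2.0, 1.0],
                     [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns
values, vectors = np.linalg.eigh(cov_demo)

for i in range(len(values)):
    v = vectors[:, i]                 # i-th eigenvector
    lhs = cov_demo @ v                # Covariance Matrix x Vector
    rhs = values[i] * v               # Scalar Component x Vector
    print(values[i], np.allclose(lhs, rhs))
```

Note that the eigenvectors are the columns of the returned array, which is why the notebook indexes them as `eigen_vectors[:,i]`.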

Next, we pair each eigenvalue with its eigenvector and sort the pairs by eigenvalue in descending order; the largest eigenvalues correspond to the directions of highest variation. 

In [None]:
pairs = [(np.abs(eigen_values[i]), eigen_vectors[:,i]) for i in range(len(eigen_values))]
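# sort on the eigenvalue only, so ties never fall back to comparing the vectors
pairs.sort(key=lambda pair: pair[0], reverse=True)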

print('Eigen Values in descending order:')
for i in pairs:
    print(i[0])

To check how much of the variation each principal component represents, let us compute each eigenvalue's share of the total and plot it together with the cumulative sum.

In [None]:
tot = sum(eigen_values)
var_per = [(i / tot)*100 for i in sorted(eigen_values, reverse=True)]
cum_var_per = np.cumsum(var_per)

plt.figure(figsize=(10,10))
x=['PC %s' %i for i in range(1,len(var_per)+1)]  # one label per component
ind = np.arange(len(var_per)) 
plt.bar(ind,var_per)
plt.plot(ind,cum_var_per,marker="o",color='orange')  # cumulative percentage
plt.xticks(ind,x);


The plot shows that the first component carries almost 20% of the information, and the first 9 components together carry 80% of it.
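
The number of components needed to reach a given share of variance can also be read off programmatically. The following is a minimal sketch; the eigenvalues in `eig_demo` and the 80% `threshold` are illustrative assumptions, not the values computed from the horse colic data.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order
eig_demo = np.array([4.0, 3.0, 2.0, 0.5, 0.5])

ratio = eig_demo / eig_demo.sum()        # share of variance per component
cumulative = np.cumsum(ratio)            # running total of those shares

threshold = 0.80
# argmax returns the first position where the running total reaches the threshold
n_keep = int(np.argmax(cumulative >= threshold)) + 1

print(ratio)
print(cumulative)
print(n_keep)
```

With the notebook's own `eigen_values`, the same idea would apply after sorting them in descending order.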

## <a id='step4'>Step 4 - Projection onto New Feature Space</a>

The top eigenvectors are stacked column-wise to form a projection matrix, which is then multiplied with the samples to transform the data into the new feature space.

In [None]:
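N=Xstd.shape[1]  # number of features
value=9          # number of principal components to keep
a = np.ndarray(shape = (N, 0))
for x in range(value):  # start at 0 so the first principal component is included
    b=pairs[x][1].reshape(N,1)
    a = np.hstack((a,b))
print("Projection Matrix:\n",a)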

In [None]:
Y = Xstd.dot(a)
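
One way to sanity-check a projection is to look at the covariance of the projected data: its off-diagonal entries should be close to zero and its diagonal should hold the selected eigenvalues. The sketch below illustrates this on synthetic data; `X_demo`, `W` and `Y_demo` are hypothetical names and do not refer to the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated toy data: 300 samples, 3 features
base = rng.normal(size=(300, 1))
X_demo = np.hstack([base + 0.1 * rng.normal(size=(300, 1)),
                    2.0 * base + 0.1 * rng.normal(size=(300, 1)),
                    rng.normal(size=(300, 1))])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

values, vectors = np.linalg.eigh(np.cov(X_demo.T))
order = np.argsort(values)[::-1]           # largest eigenvalue first
W = vectors[:, order[:2]]                  # projection matrix, shape (3, 2)

Y_demo = X_demo @ W                        # projected data, shape (300, 2)
cov_Y = np.cov(Y_demo.T)

print(Y_demo.shape)
print(np.round(cov_Y, 6))                  # off-diagonal entries are approximately 0
```

The projected features are uncorrelated, which is how PCA removes the redundancy between the original features.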

Now let us compare the visualization of the data before and after PCA (after PCA: 1st principal component vs 2nd).

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
for name in ('died', 'euthanized', 'lived'):
    plt.scatter(
        x=Y[temp==name,0],  # first column of the projected data
        y=Y[temp==name,1],  # second column of the projected data
    )
plt.legend( ('died', 'euthanized', 'lived'))
plt.title('After PCA')

plt.subplot(1,2,2)
for name in ('died', 'euthanized', 'lived'):
    plt.scatter(
        x=Xstd[temp==name,3],
        y=Xstd[temp==name,4],
    )
plt.title('Before PCA')
plt.legend( ('died', 'euthanized', 'lived'))


**References:**

[Principal Component Analysis in Python](http://plot.ly/ipython-notebooks/principal-component-analysis/)

In [None]:
# TODO: add the scikit-learn PCA shortcut
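
A scikit-learn version of the four steps might look like the sketch below. It runs on synthetic data, since `X_demo` is a made-up stand-in for the cleaned data, and the choice of 2 components is an assumption for illustration. `PCA` centers the data itself but does not scale it, so the standardization step is kept.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical stand-in for the cleaned data: 100 samples, 5 features
X_demo = rng.normal(size=(100, 5))
X_demo[:, 1] += 2.0 * X_demo[:, 0]          # make two features correlated

X_demo_std = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=2)
Y_demo = pca.fit_transform(X_demo_std)      # standardized data -> 2 components

print(Y_demo.shape)
print(pca.explained_variance_ratio_)        # share of variance per component
```

In this notebook, `Xstd` would take the place of `X_demo_std`. The signs of individual components may differ from the manual result, because an eigenvector is only defined up to its sign.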