# Principal Components Analysis

Dimension Reduction: [Link](https://docs.google.com/presentation/d/17iiHw0ShtvWybPVkVUkB9BXpTOCwoBQaXc_I8d_EJEI/edit?usp=sharing)

Goals:
- Review components of PCA and its role in modeling
- Go through a couple of data demos
- Pick the relevant components to keep
- Introduce t-SNE

## Quick Maths

- The first principal component of a set of features $X_1, X_2, \ldots, X_p$ is the normalized linear combination of the features, $Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$, that has the largest variance.

- By normalized, we mean $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.

- We refer to the elements $\phi_{11}, \ldots, \phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector, $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^T$.
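
The definition above can be checked numerically. The sketch below is only an illustration on synthetic data (the mixing matrix `A` and the random seed are assumptions, not part of the demos that follow): it takes the top eigenvector of the sample covariance matrix as $\phi_1$, confirms the normalization constraint, and compares the variance of $Z_1$ with the variance along random unit-length directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 observations of 3 correlated features.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.2, 0.3]])
X = rng.normal(size=(200, 3)) @ A
Xc = X - X.mean(axis=0)                  # center each feature

cov = np.cov(Xc, rowvar=False)           # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
phi_1 = eigvecs[:, -1]                   # loading vector of the first principal component
z_1 = Xc @ phi_1                         # scores Z_1

norm_sq = float(np.sum(phi_1 ** 2))      # normalization constraint: sum of squared loadings
var_z1 = float(z_1.var(ddof=1))          # variance of Z_1, equal to the largest eigenvalue

# Variance along 1000 random unit-length directions, for comparison.
dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
max_random_var = float((Xc @ dirs.T).var(axis=0, ddof=1).max())

print(norm_sq, var_z1, max_random_var)
```

No random direction should give a larger variance than `phi_1`, which is what "has the largest variance" means in the definition.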


In [None]:
%matplotlib inline
import numpy as np
#import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

Analysis borrowed from "An Introduction to Statistical Learning with Applications in R" [link](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf)

In [None]:
df = pd.read_csv('USArrests.csv', index_col=0)
df.head()

In [None]:
df.info()

In [None]:
df.mean()

In [None]:
df.var()

In [None]:
# Calculate and plot
corr_matrix = df.corr()
sns.heatmap(corr_matrix);

## Given the values of the mean and variance, what does this suggest we do?

In [None]:
from sklearn.preprocessing import scale
X = pd.DataFrame(scale(df), index=df.index, columns=df.columns)
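
To see why features on very different scales call for standardization before PCA, here is a small sketch on made-up data (the two features and their scales are assumptions chosen for illustration). The helper `first_loading` is hypothetical and simply returns the first right singular vector of the centered data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical features on very different scales.
small = rng.normal(0.0, 1.0, size=n)
large = 50.0 * small + rng.normal(0.0, 100.0, size=n)
X = np.column_stack([small, large])

def first_loading(M):
    """Loading vector of the first principal component of M (sign is arbitrary)."""
    Mc = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Vt[0]

raw_loading = first_loading(X)

# Standardize with the population standard deviation (ddof=0), as sklearn's scale does.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_loading = first_loading(X_std)

print("unscaled:    ", np.abs(raw_loading))
print("standardized:", np.abs(std_loading))
```

On the unscaled data the first component is almost entirely the high-variance feature; after standardizing, both features contribute equally.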

In [None]:
# Fit PCA model
from sklearn.decomposition import PCA
pca = PCA()
X2D = pca.fit_transform(X)


In [None]:
# Obtain the loadings.
# What do these mean?
pca_loadings = pd.DataFrame(pca.components_.T, index=df.columns, columns=['Z1', 'Z2','Z3','Z4'])
pca_loadings

In [None]:
df_plot = pd.DataFrame(pca.fit_transform(X), columns=['PC1', 'PC2', 'PC3', 'PC4'], index=X.index)
df_plot.head()
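
One way to read the loadings and scores together: each score column is the standardized data multiplied by one loading vector. The sketch below assumes synthetic data and uses NumPy's SVD rather than scikit-learn, so signs may differ from what `PCA` reports; it shows that the loading vectors are orthonormal and the resulting scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 50 observations of 4 correlated features, standardized.
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T              # columns play the role of phi_1 ... phi_4
scores = X @ loadings        # Z = X * Phi, one column per principal component

score_cov = np.cov(scores, rowvar=False)   # covariance of the scores
print(np.round(score_cov, 6))
```

The off-diagonal entries of `score_cov` are zero up to rounding, and the diagonal holds the variance captured by each component.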

In [None]:
# Variance Explained
pca.explained_variance_ratio_

In [None]:
plt.plot([1,2,3,4],np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
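
Reading the number of components off the curve can also be done programmatically. This is a minimal sketch: the `ratios` array holds made-up values standing in for `pca.explained_variance_ratio_`, and `components_for` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

# Hypothetical explained-variance ratios, standing in for pca.explained_variance_ratio_.
ratios = np.array([0.62, 0.25, 0.09, 0.04])
cumulative = np.cumsum(ratios)

def components_for(threshold, cumulative):
    """Smallest number of components whose cumulative explained variance reaches threshold."""
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))   # guard against rounding at a threshold of 1.0

k_80 = components_for(0.80, cumulative)
k_95 = components_for(0.95, cumulative)
print(k_80, k_95)
```

scikit-learn offers a similar shortcut: passing a float between 0 and 1 as `n_components` to `PCA` keeps enough components to reach that share of the variance.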

## We really only need two components!

#### Biplot

In [None]:
fig , ax1 = plt.subplots(figsize=(15,12))

ax1.set_xlim(-3.5,3.5)
ax1.set_ylim(-3.5,3.5)

# Plot the scores on principal components 1 and 2.
# PC2 is left un-negated so the scores line up with the loading vectors drawn below.
for i in df_plot.index:
    ax1.annotate(i, (df_plot.PC1.loc[i], df_plot.PC2.loc[i]), ha='center')

# Plot reference lines
ax1.hlines(0,-3.5,3.5, linestyles='dotted', colors='b')
ax1.vlines(0,-3.5,3.5, linestyles='dotted', colors='b')

ax1.set_xlabel('First Principal Component')
ax1.set_ylabel('Second Principal Component')
    
# Plot Principal Component loading vectors, using a second pair of axes.
ax2 = ax1.twinx().twiny() 

ax2.set_ylim(-1,1)
ax2.set_xlim(-1,1)
ax2.set_xlabel('Principal Component loading vectors', color='red')

# Plot labels for vectors. Variable 'a' is a small offset parameter to separate arrow tip and text.
a = 1.07  
for i in pca_loadings[['Z1', 'Z2']].index:
    ax2.annotate(i, (pca_loadings.Z1.loc[i]*a, pca_loadings.Z2.loc[i]*a), color='red')

# Plot vectors: one arrow per original feature (label-based lookup avoids
# positional indexing on a string-labelled Series)
for feature in pca_loadings.index:
    ax2.arrow(0, 0, pca_loadings.Z1.loc[feature], pca_loadings.Z2.loc[feature])
plt.show()

## Back To Cancer Data

Breast Cancer [link](https://www.mldata.io/dataset-details/breast_cancer/#customize_download)

In [None]:
bc=pd.read_csv('breast_cancer_scikit_onehot_dataset.csv')
bc.shape

In [None]:
# Binary outcome: 1 when class == 4, otherwise 0
target = bc['class'].map(lambda x: 1 if x == 4 else 0)
target = pd.DataFrame(target)
target.columns = ['outcome']


In [None]:
X=bc.drop(columns=['class'])

In [None]:
predictor = pd.DataFrame(scale(X), index=X.index, columns=X.columns)

In [None]:
from sklearn.decomposition import PCA 
pca = PCA()
X2D = pca.fit_transform(predictor)

In [None]:
pca.explained_variance_ratio_

In [None]:
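pca = PCA().fit(predictor)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Derive the x-axis from the number of components instead of hard-coding it
plt.plot(range(1, len(cumulative) + 1), cumulative)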
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

In [None]:
#pca.components_.T[:, 0:2]    

In [None]:
pca_loadings = pd.DataFrame(PCA(n_components = 2).fit(predictor).components_.T, index=predictor.columns, 
columns=['Z1','Z2'])
pca_loadings

In [None]:
plt.scatter(X2D[:, 0], X2D[:, 1], edgecolor='b',
            c=target.outcome.map({0: 'blue', 1: 'orange'}), alpha=0.5)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

In [None]:
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(predictor, target, test_size=0.3, random_state=9)

In [None]:
from sklearn.tree import DecisionTreeClassifier
# Initial parameters used in model
clf_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=2, class_weight='balanced')

In [None]:
clf_tree.fit(X_train, y_train)

In [None]:
y_pred = clf_tree.predict(X_test)

In [None]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
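# Note: X2D holds every principal component, because PCA() was fit without n_components.
# Use X2D[:, :2] here to train on only the first two components.
X_train, X_test, y_train, y_test = train_test_split(X2D, target, test_size=0.3, random_state=9)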

In [None]:
clf_tree.fit(X_train, y_train)

In [None]:
y_pred = clf_tree.predict(X_test)

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

### t-distributed Stochastic Neighbor Embedding (t-SNE)

- Find a projection of a high-dimensional feature space onto a plane (or into a 3D space) such that points that were far apart in the original n-dimensional space end up far apart in the projection, while points that were originally close remain close to each other.


- Essentially, it is a search for a new, lower-dimensional representation of the data that preserves the neighborhoods of the examples.
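
To make "preserving neighborhoods" concrete, the sketch below computes the kind of neighbor probabilities t-SNE starts from in the original space: for each point, a Gaussian similarity to every other point, normalized so each row sums to 1. It is a simplified illustration, not scikit-learn's implementation. The function name `conditional_affinities` and the fixed bandwidth `sigma` are assumptions; the actual algorithm tunes a bandwidth per point from the perplexity setting, then searches for a low-dimensional layout whose Student-t similarities match these probabilities.

```python
import numpy as np

def conditional_affinities(X, sigma=1.0):
    """Row i holds p(j | i): how likely point i is to pick point j as its neighbor."""
    # Pairwise squared Euclidean distances between all rows of X.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

# Three hypothetical points: two close together, one far away.
X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [5.0, 5.0]])
P = conditional_affinities(X)
print(np.round(P, 4))
```

For the first point, nearly all of the probability mass lands on its close neighbor and almost none on the distant point; a good embedding keeps that pattern.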

t-SNE Tutorial: [link](https://distill.pub/2016/misread-tsne/)

In [None]:
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

In [None]:
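# predictor was already standardized with scale() above, so this step leaves it essentially unchanged
scaler = StandardScaler()
X_scaled = scaler.fit_transform(predictor)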

In [None]:
tsne = TSNE(random_state=17)
tsne_repr = tsne.fit_transform(X_scaled)

In [None]:
#plt.scatter(tsne_repr[:, 0], tsne_repr[:, 1], alpha=.5);

In [None]:
plt.scatter(tsne_repr[:, 0], tsne_repr[:, 1],
            c=target.outcome.map({0: 'blue', 1: 'orange'}), alpha=.5);