### Principal Component Analysis
This notebook uses [sklearn](https://scikit-learn.org/stable/) to perform principal component analysis (PCA) on the data set cleaned by [this script](https://github.com/stibbs1998/admissions_internship/blob/master/src/data/001-st-clean_data.py).

Import necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib notebook

import warnings
warnings.filterwarnings('ignore')

Load the .csv file as a DataFrame. All non-object columns except `Enrolled`, the target variable, are assigned to `X`, with missing values filled with the sentinel `-999`. `Enrolled` itself is assigned to `y`.

In [None]:
df = pd.read_csv('../data/interim/Third_order_clean_confidential.csv').drop(columns='Unnamed: 0')
X = df.select_dtypes(exclude='object').drop(columns='Enrolled').fillna(-999)
y = df.Enrolled

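Because PCA is driven by variance, a sentinel such as `-999` can behave like an extreme outlier and inflate the variance of any column that has missing values. As a hedged alternative (this is not what the notebook does), the sketch below compares the sentinel fill with median imputation via the `SimpleImputer` class imported above. The toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the real data; the column names are hypothetical.
toy = pd.DataFrame({
    "gpa": [3.2, np.nan, 3.8, 2.9],
    "test_score": [1200.0, 1350.0, np.nan, 1100.0],
})

# Sentinel fill, as in the cell above: -999 sits far outside the observed range.
sentinel_filled = toy.fillna(-999)

# Median fill keeps every imputed value inside the observed range.
imputer = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Compare per-column variance under the two strategies.
print(sentinel_filled.var())
print(median_filled.var())
```

Whether median imputation is appropriate here depends on why the values are missing, so treat this as a starting point rather than a recommendation.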
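Before performing PCA, we standardize `X` so that every column has zero mean and unit variance. Otherwise, columns with large numeric ranges would dominate the components. We then fit PCA on the standardized data and assign the projected data (the principal component scores) to `x_new`.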

In [None]:
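# Scale each column to zero mean and unit variance so that no single
# feature dominates the components because of its units.
scaler = StandardScaler()
X = scaler.fit_transform(X)  # note: X is now a NumPy array, not a DataFrame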

pca = PCA()
x_new = pca.fit_transform(X)

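As a quick sanity check on the standardize-then-fit steps, the following sketch runs the same pipeline on synthetic data and prints the properties one would expect afterwards. The array `demo` and the names `demo_scaled`, `demo_pca`, and `demo_scores` are illustrative only and are not part of the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 3 columns on very different scales.
demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])

# Standardize, then fit PCA and project the data onto the components.
demo_scaled = StandardScaler().fit_transform(demo)
demo_pca = PCA()
demo_scores = demo_pca.fit_transform(demo_scaled)

print(demo_scaled.mean(axis=0).round(6))            # column means after scaling
print(demo_scaled.std(axis=0).round(6))             # column standard deviations
print(demo_pca.explained_variance_ratio_.sum())     # ratios over all components
print(demo_scores.shape)                            # one score column per component
```

With no `n_components` argument, `PCA()` keeps every component, so the scores have as many columns as the input.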
Create a [scree plot](https://en.wikipedia.org/wiki/Scree_plot) showing the percentage of variance left unexplained as components are added.

In [None]:
f, axes = plt.subplots(figsize=(8,6))
# Percent of total variance not yet captured after keeping the first k components.
unexplained = (1 - np.cumsum(pca.explained_variance_ratio_)) * 100
plt.plot(np.arange(1, len(unexplained) + 1), unexplained, 'o--');
plt.xlabel("Number of Components",size=15);
plt.ylabel("Percent of Variance Left to Explain",size=15);
plt.title('Scree Plot of Data',size=15);

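A scree plot is often read to decide how many components to keep. One way to make that choice explicit is sketched below, assuming a target fraction of explained variance. The helper `components_for_variance` and the 0.90 threshold are hypothetical and do not appear in the original analysis.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.90):
    """Smallest number of components whose cumulative ratio reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off when threshold is close to 1.
    return min(count, len(cumulative))

# Made-up ratios for illustration; in the notebook one would pass
# pca.explained_variance_ratio_ instead.
ratios = np.array([0.5, 0.3, 0.15, 0.05])
print(components_for_variance(ratios, threshold=0.90))
```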
Create a [biplot](https://en.wikipedia.org/wiki/Biplot) to identify which variables are most strongly associated with the first two principal components.

In [None]:
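def myplot(score,coeff,labels=None, n = None):
    """Biplot of the first two PCs; `n` lists the variable indices to draw as arrows."""
    xs = score[:,0]
    ys = score[:,1]
    
    # Default to drawing an arrow for every variable.
    if n is None:
        n = np.arange(coeff.shape[0])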
        
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    
    plt.subplots(figsize=(10,6))
    plt.scatter(xs * scalex,ys * scaley, c = y,alpha=0.3,cmap=cm.bone)
    plt.colorbar()

    for i in n:
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))

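The arrows drawn by `myplot` are the rows of the transposed `components_` matrix, that is, each variable's loading on PC1 and PC2. The same information can be read numerically, which is sometimes easier than judging arrow lengths. The sketch below assumes synthetic data with hypothetical column names `a`, `b`, and `c`; `toy_pca`, `loadings`, and `top_pc1` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic features: "a" and "b" share a common signal, "c" is independent.
base = rng.normal(size=200)
features = pd.DataFrame({
    "a": base + rng.normal(scale=0.1, size=200),
    "b": base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

toy_pca = PCA().fit(StandardScaler().fit_transform(features))

# Rows are variables, columns are components (the transpose of components_).
loadings = pd.DataFrame(
    toy_pca.components_.T,
    index=features.columns,
    columns=[f"PC{i + 1}" for i in range(toy_pca.n_components_)],
)

# Rank variables by the absolute size of their loading on the first component.
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False)
print(top_pc1)
```

Sorting by absolute value matters because the sign of a principal component is arbitrary.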
In [None]:
myplot(score = x_new[:,0:],
       coeff = np.transpose(pca.components_[0:, :]),
       # Drop 'Enrolled' so the labels line up with the columns of X.
       labels = df.select_dtypes(exclude=['object']).drop(columns='Enrolled').columns.values,
       n = [3,4,5,6])
plt.savefig('../reports/figures/biplot.png')