### Principal Component Analysis
This notebook uses [sklearn](https://scikit-learn.org/stable/) to perform PCA on the data set produced by [this source code](https://github.com/stibbs1998/admissions_internship/blob/master/src/data/001-st-clean_data.py).

Import necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

import matplotlib.pyplot as plt
import matplotlib.cm as cm
plt.style.use('fivethirtyeight') 

import warnings
warnings.filterwarnings('ignore')

import sys
sys.path.insert(0, '../src/visualization/')
import visualize as vis

Load the .csv file as a DataFrame. All non-object columns except `Enrolled` are assigned to `X`, with missing values replaced by the sentinel value `-999`. `Enrolled`, the target variable, is assigned to `y`.

In [None]:
df = pd.read_csv('../data/interim/Third_order_clean_confidential.csv').drop(columns='Unnamed: 0')
X = df.select_dtypes(exclude='object').drop(columns='Enrolled').fillna(-999)
y = df.Enrolled
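
The column selection and missing-value handling above can be sketched on a small, made-up DataFrame (the confidential file is not reproduced here, so the column names and values below are hypothetical). The sketch also shows median imputation with `SimpleImputer`, which this notebook imports; that is an alternative for comparison, not the method used in this notebook. A sentinel such as `-999` inflates a column's variance, which may matter once the data is standardized.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the confidential admissions file.
toy = pd.DataFrame({
    "GPA": [3.2, np.nan, 3.8, 2.9],
    "SAT": [1200.0, 1350.0, np.nan, 1100.0],
    "State": ["PA", "NY", "PA", "NJ"],  # object column, excluded from X
    "Enrolled": [1, 0, 1, 0],           # target variable
})

# Same selection pattern as the notebook: numeric columns minus the target.
X_toy = toy.select_dtypes(exclude="object").drop(columns="Enrolled")
y_toy = toy["Enrolled"]

# Sentinel fill, as done in the notebook.
X_sentinel = X_toy.fillna(-999)

# Alternative (assumption, not the notebook's choice): median imputation.
imputer = SimpleImputer(strategy="median")
X_median = pd.DataFrame(imputer.fit_transform(X_toy), columns=X_toy.columns)

# Compare how each strategy affects the spread of a column.
print(X_sentinel["GPA"].std(), X_median["GPA"].std())
```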

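Before performing PCA on the data set, we have to standardize the `X` DataFrame so that every column has zero mean and unit variance; otherwise columns measured on large scales dominate the components. We then fit a six-component PCA to the standardized DataFrame (`scaled`) and assign the projected samples to `pca_samples`.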

In [None]:
# Standardize each column to zero mean and unit variance before PCA
from sklearn.preprocessing import scale
scaled = pd.DataFrame(scale(X), columns=X.columns.values)

# Fit a six-component PCA (PCA is already imported above) and project the samples
pca = PCA(n_components=6).fit(scaled)
pca_samples = pca.transform(scaled)
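
As a quick sanity check of the standardize-then-fit pattern, here is a minimal sketch on synthetic data (the array, column names, and `_demo` variables are made up for illustration). It checks that `scale` gives the same result as `StandardScaler`, and that the projected samples have one column per component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, scale

# Synthetic stand-in for X: 200 rows, 8 columns on very different scales.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(200, 8)) * np.arange(1, 9),
    columns=[f"f{i}" for i in range(8)],
)

# scale() standardizes each column to zero mean and unit variance.
scaled_demo = pd.DataFrame(scale(X_demo), columns=X_demo.columns)

# StandardScaler performs the same transformation as an estimator object.
same = np.allclose(scaled_demo.values, StandardScaler().fit_transform(X_demo))

# Fit PCA on the standardized data and project the samples.
pca_demo = PCA(n_components=6).fit(scaled_demo)
samples_demo = pca_demo.transform(scaled_demo)

print(same, samples_demo.shape)
```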

Create a [scree plot](https://en.wikipedia.org/wiki/Scree_plot) of the variance explained by each principal component.

In [None]:
vis.plot_explained_variance_ratio(pca)
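
The body of `vis.plot_explained_variance_ratio` is not shown in this notebook. A minimal sketch of a comparable plot, assuming only a fitted `PCA` object, might look like the following; `scree_plot` is a hypothetical helper, not the function from `visualize.py`. It plots explained variance ratios, which are the eigenvalues divided by the total variance.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def scree_plot(fitted_pca, ax=None):
    """Hypothetical helper: per-component and cumulative explained variance."""
    ratios = fitted_pca.explained_variance_ratio_
    idx = np.arange(1, len(ratios) + 1)
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    ax.bar(idx, ratios, alpha=0.6, label="Per component")
    ax.step(idx, np.cumsum(ratios), where="mid", label="Cumulative")
    ax.set_xlabel("Principal component")
    ax.set_ylabel("Explained variance ratio")
    ax.set_xticks(idx)
    ax.legend()
    return ax

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
demo = rng.normal(size=(100, 10))
demo_pca = PCA(n_components=6).fit(demo)
ax = scree_plot(demo_pca)
```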

Inspect the feature weights behind the variance explained by each component. Since there are so many columns, it makes sense to look at them in smaller groups of five.

In [None]:
# dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]

# # PCA components
# components = pd.DataFrame(np.round(pca.components_, 4), columns = scaled.keys()) 
# components.index = dimensions

# # PCA explained variance
# ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1) 
# variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance']) 
# variance_ratios.index = dimensions

# for i in range(int(len(scaled.columns)/5)):

#     # Create a bar plot visualization
#     fig, ax = plt.subplots(figsize = (12,6))

#     # Plot the feature weights as a function of the components
#     components[components.columns[5*i:5*(i+1)]].plot(ax = ax, kind = 'bar')
#     ax.set_ylabel("Feature Weights") 
#     ax.set_xticklabels(dimensions, rotation=0)
#     plt.legend(bbox_to_anchor=(1.25,0.5))

#     # Display the explained variance ratios
#     for j, ev in enumerate(pca.explained_variance_ratio_): 
#         ax.text(j-0.40, ax.get_ylim()[1] + 0.05, "Explained Var\n %.4f"%(ev))

In [None]:
# Build the table of component weights; the next cell depends on `pca_results`
pca_results = vis.pca_results(scaled, pca)

In [None]:
abs(pca_results).sort_values('Dimension 1',axis=1,ascending=False)
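
Ranking features by the magnitude of their weight on the first component can be illustrated without the `vis` module. The sketch below builds a weights table directly from `components_` on synthetic data; the column names and the `loadings` table are hypothetical, and the real `vis.pca_results` output may contain additional columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic data: "a" and "b" share a common signal, "c" and "d" are noise.
rng = np.random.default_rng(2)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "a": base + 0.1 * rng.normal(size=300),
    "b": base + 0.1 * rng.normal(size=300),
    "c": rng.normal(size=300),
    "d": rng.normal(size=300),
})
scaled_demo = pd.DataFrame(scale(demo), columns=demo.columns)
pca_demo = PCA(n_components=3).fit(scaled_demo)

# One row per component, one column per feature.
dims = [f"Dimension {i}" for i in range(1, pca_demo.n_components_ + 1)]
loadings = pd.DataFrame(pca_demo.components_, index=dims, columns=scaled_demo.columns)

# Order the feature columns by absolute weight on the first component.
ranked = loadings.abs().sort_values("Dimension 1", axis=1, ascending=False)
print(ranked.round(3))
```

In this toy case the two correlated columns should come first, because the first component captures their shared signal.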

Create a [biplot](https://en.wikipedia.org/wiki/Biplot) to identify which variables are related to the first principal components.

In [None]:
vis.biplot(X,scaled,pca);
# plt.ylim(-5,5);
# plt.xlim(-3,3);
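
The implementation of `vis.biplot` is not shown here. As a rough sketch of the idea, assuming a standardized DataFrame and a fitted `PCA`, a biplot overlays the samples projected onto the first two components with one arrow per feature. `simple_biplot` and its `arrow_scale` parameter are hypothetical names, not part of `visualize.py`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

def simple_biplot(scaled_df, fitted_pca, ax=None, arrow_scale=3.0):
    """Hypothetical helper: sample scores plus feature-weight arrows."""
    scores = fitted_pca.transform(scaled_df)
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 6))
    # Samples projected onto the first two components.
    ax.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.4)
    # One arrow per feature, using its weights on components 1 and 2.
    for name, (w1, w2) in zip(scaled_df.columns, fitted_pca.components_[:2].T):
        ax.arrow(0, 0, arrow_scale * w1, arrow_scale * w2,
                 color="red", head_width=0.08)
        ax.text(arrow_scale * w1 * 1.1, arrow_scale * w2 * 1.1, name)
    ax.set_xlabel("Dimension 1")
    ax.set_ylabel("Dimension 2")
    return ax

# Synthetic data for illustration only.
rng = np.random.default_rng(3)
demo = pd.DataFrame(rng.normal(size=(150, 4)), columns=["a", "b", "c", "d"])
scaled_demo = pd.DataFrame(scale(demo), columns=demo.columns)
pca_demo = PCA(n_components=2).fit(scaled_demo)
ax = simple_biplot(scaled_demo, pca_demo)
```

Features whose arrows point in similar directions tend to be positively correlated, and a long arrow along an axis indicates a large weight on that component.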