# PCA - Principal Component Analysis
---
In this kernel we're going to reduce the number of features in our dataset with PCA. This should cut the computational time of our models, and it may also improve overall model precision if we're dealing with overfitting. We'll then use both datasets (the original one and the one produced by PCA) to compare our models.

# Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

# load the train data files (original and resampled)
train = pd.read_csv("../input/feature-exploration-and-dataset-preparation/train_clean_standarized.csv", index_col=0)
train_resampled = pd.read_csv("../input/resampling/train_resampled.csv", index_col=0)

In [None]:
# let's separate our features from the target
X = train.drop('TARGET', axis=1)
y = train['TARGET'].values

# 1. Feature correlation

Let's have a look at the feature correlation. It's difficult to conclude anything by observing the heatmap below because of the high number of features. However, at first glance the correlation is generally not very high, so we'll probably need a large number of principal components to explain the variance.

In [None]:
sns.heatmap(X.corr());

In [None]:
X.corr()

Nevertheless, looking at the correlation matrix we can see that some of the features are highly correlated, probably the ones that reference different time periods (e.g. imp_op_var39_comer_ult1 and imp_op_var39_comer_ult3).
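With this many features, scanning the full matrix by eye is error-prone. The following is a minimal sketch of how the most correlated pairs could be listed instead. The helper `top_correlated_pairs`, the 0.9 threshold and the toy column names are assumptions made for illustration; they are not part of the original kernel.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so that each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# toy data standing in for X (hypothetical column names)
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "feat_ult1": base,
    "feat_ult3": base + rng.normal(scale=0.1, size=500),  # near-duplicate of feat_ult1
    "other": rng.normal(size=500),                        # independent feature
})

high_pairs = top_correlated_pairs(demo, threshold=0.9)
print(high_pairs)
```

On the real data the same call would be `top_correlated_pairs(X)`, with the threshold adjusted to taste.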

# 2. PCA
We're going to calculate the explained variance for different numbers of components:

In [None]:
pca_test = PCA().fit(X)

In [None]:
# plot the Cumulative Summation of the Explained Variance for the different number of components
plt.figure()
plt.plot(np.cumsum(pca_test.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (fraction)')
plt.title('Dataset Explained Variance')
plt.show()

We can see that somewhere between 50 and 100 components still explains most of the variance. Let's try with 80 components:
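Instead of reading the number of components off the plot, it can also be derived from the cumulative curve. This is a minimal sketch on synthetic data: the 0.90 target and the `X_demo` matrix are assumptions made for illustration, not values taken from this kernel.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic stand-in for X: 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

# cumulative fraction of variance explained by the first k components
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)

target = 0.90  # assumed threshold
# smallest k whose cumulative explained variance reaches the target
n_needed = int(np.searchsorted(cumulative, target) + 1)

# scikit-learn can make the same choice itself when n_components is a float in (0, 1)
pca_auto = PCA(n_components=target, svd_solver="full").fit(X_demo)
print(n_needed, pca_auto.n_components_)
```

Passing a float to `n_components` keeps the choice tied to a variance target rather than to a fixed count, which may be easier to justify than a number picked from the plot.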

In [None]:
# instantiate PCA
pca = PCA(n_components=80)

# fit PCA and project the data onto the principal components
principalComponents = pca.fit_transform(X)

Let's have a look at the dataset containing our 80 principal components:

In [None]:
train_pc = pd.DataFrame(data = principalComponents)
train_target = pd.Series(y, name='TARGET')

train_pc_df = pd.concat([train_pc, train_target], axis=1)
train_pc_df.head(5)

In [None]:
sns.heatmap(train_pc.corr());

In [None]:
# we calculate the variance explained by each principal component
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(list(pca.explained_variance_ratio_))*100, 2))

As we can see from the results above, these 80 principal components explain 91% of the variance in the data.

We're going to use these components to compare the result of our models against the ones built using the full dataset.
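For that comparison to be fair, any held-out data has to be projected with the PCA that was fitted on the training data, not with a new fit. A minimal sketch follows, in which `X_train_demo` and `X_test_demo` are hypothetical random matrices, not the files used in this kernel.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(100, 6))  # hypothetical training features
X_test_demo = rng.normal(size=(20, 6))    # hypothetical held-out features

# fit on the training data only
pca_demo = PCA(n_components=3).fit(X_train_demo)

# reuse the fitted mean and components for both sets
train_proj = pca_demo.transform(X_train_demo)
test_proj = pca_demo.transform(X_test_demo)
print(train_proj.shape, test_proj.shape)

# transform() centres with the training mean and projects onto the fitted components
manual = (X_test_demo - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(test_proj, manual))
```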

# 3. PCA on resampled data

We're going to use the output from the resampling exercise to run a Principal Component Analysis on these data:

In [None]:
# let's separate our features from the target
X_resampled = train_resampled.drop('TARGET', axis=1)
y_resampled = train_resampled['TARGET'].values

In [None]:
sns.heatmap(X_resampled.corr());

In [None]:
X_resampled.corr()

In [None]:
pca_resampled_test = PCA().fit(X_resampled)

# plot the Cumulative Summation of the Explained Variance for the different number of components
plt.figure()
plt.plot(np.cumsum(pca_resampled_test.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (fraction)')
plt.title('Dataset Explained Variance')
plt.show()

In [None]:
# instantiate PCA
pca_resampled = PCA(n_components=80)

# fit PCA and project the data onto the principal components
principalComponents_resampled = pca_resampled.fit_transform(X_resampled)

In [None]:
train_resampled_pc = pd.DataFrame(data = principalComponents_resampled)
train_resampled_target = pd.Series(y_resampled, name='TARGET')

train_resampled_pc_df = pd.concat([train_resampled_pc, train_resampled_target], axis=1)
train_resampled_pc_df.head(5)

In [None]:
sns.heatmap(train_resampled_pc.corr());

In [None]:
# we calculate the variance explained by each principal component
print('Variance of each component:', pca_resampled.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(list(pca_resampled.explained_variance_ratio_))*100, 2))
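Sections 2 and 3 run the same steps on two different dataframes, so they could be wrapped in one helper. This is only a sketch: `pca_with_target` is a hypothetical name, and unlike the cells above it keeps the original row index instead of resetting it.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pca_with_target(df, target_col="TARGET", n_components=80):
    """Fit PCA on the feature columns and return (components + target, fitted PCA)."""
    X = df.drop(target_col, axis=1)
    pca = PCA(n_components=n_components)
    # keep the original index so the target column aligns row by row
    components = pd.DataFrame(pca.fit_transform(X), index=df.index)
    out = pd.concat([components, df[target_col]], axis=1)
    return out, pca

# toy dataframe standing in for train / train_resampled
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 5)), columns=list("abcde"),
                    index=np.arange(100, 150))
demo["TARGET"] = rng.integers(0, 2, size=50)

demo_pc_df, demo_pca = pca_with_target(demo, n_components=2)
print(demo_pc_df.head())
print("Total variance explained:",
      round(demo_pca.explained_variance_ratio_.sum() * 100, 2))
```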

# Output

In [None]:
train_pc_df.to_csv('train_PCA.csv')
train_resampled_pc_df.to_csv('train_resampled_PCA.csv')