### Distributions & PCA Analysis

This kernel is dedicated to exploratory data analysis (EDA) for the Santander Value Prediction Challenge. The competition attracted me because of the size of its dataset, which finally lets a student like me, with limited resources, work at full potential. The kernel is also marked Beginner Friendly, so please leave your questions in the comments and I will answer them and link resources on request. So, let's get exploring...


In [None]:
import warnings
warnings.simplefilter("ignore")
import numpy as np
import pandas as pd
from scipy.special import boxcox
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#PLOTLY
import plotly.offline as offline
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import cufflinks as cf
from plotly.graph_objs import Scatter, Figure, Layout
cf.set_config_file(offline=True)

Interactive plots grab attention easily and are fun to explore. This notebook uses Plotly, an interactive plotting framework available for both R and Python. What only a few people know is that it works very well with the cufflinks integration. You will shortly see how easy it is to produce these interactive plots. But before that, let's load our datasets.

One handy way to check the contents of a directory is the notebook's ability to run shell commands inline. To see what is in the input directory, all you have to do is prefix the `ls` command with a `!` in the cell.

In [None]:
!ls ../input

The dataset files are in the input folder. Without further ado, let's import them.

In [None]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

A quick look at the shapes...

In [None]:
print("Train shape {}".format(train.shape))
print("Test shape {}".format(test.shape))

After going through other kernels, I found that the training set contains constant columns as well as duplicate columns. So, let's quickly get rid of these.

In [None]:
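# Code credits: SRK and Scirpus
# Count unique values per column; a column with a single unique value is constant.
unique_df = train.nunique().reset_index()
unique_df.columns = ["col_name", "unique_count"]
constant_df = unique_df[unique_df["unique_count"] == 1]
constant_cols = list(constant_df["col_name"].values)
train.drop(constant_cols, axis=1, inplace=True)
# Transpose so that columns become rows, drop repeated rows, and transpose back.
# Note: this round trip leaves every column with dtype object.
train = train.T.drop_duplicates().T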
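
To make the clean-up step easier to follow, here is a minimal sketch on a toy frame. The frames `toy_train` and `toy_test` and their values are made up for illustration. The last step, applying the same column selection to the test frame, is an assumption on my side and is not done in the cell above.

```python
import pandas as pd

# Toy frames standing in for train/test (hypothetical data, not the competition files).
toy_train = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [0, 0, 0],   # constant column
    "c": [1, 2, 3],   # duplicate of "a"
    "d": [5, 0, 7],
})
toy_test = pd.DataFrame({
    "a": [4, 5],
    "b": [9, 9],
    "c": [6, 7],
    "d": [1, 2],
})

# 1. Constant columns: exactly one unique value.
nunique = toy_train.nunique()
constant_cols = nunique[nunique == 1].index.tolist()

# 2. Duplicate columns: transpose so columns become rows, then flag repeated rows.
reduced = toy_train.drop(columns=constant_cols)
duplicate_mask = reduced.T.duplicated().to_numpy()
duplicate_cols = reduced.columns[duplicate_mask].tolist()

# 3. Apply the same selection to both frames so their columns stay aligned.
keep_cols = [c for c in reduced.columns if c not in duplicate_cols]
toy_train_clean = toy_train[keep_cols]
toy_test_clean = toy_test[keep_cols]

print(constant_cols)
print(duplicate_cols)
print(keep_cols)
```

Selecting columns by name, instead of transposing the whole frame twice, also keeps the original dtypes intact.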

## Target Value Distribution

With the cufflinks integration, just call `.iplot` on a DataFrame or Series and your interactive plot is ready. It is as easy as that!

In [None]:
target = train['target'].astype('float64')
target.iplot(kind='hist',
             color='blue',
             title='Target Distribution Plot')

In [None]:
log_target = np.log(train['target'].astype('float64'))
log_target.iplot(kind='hist',
                 color='blue',
                 title='log(target) distribution')

In [None]:
box_cox_target = boxcox(train['target'].astype('float64'), 0.1)
box_cox_target.iplot(kind='hist',
                     color='darkblue',
                     title='boxcox(target) distribution')

Looks good! I have not explored model building yet, but it is good to see such a distribution. Let me know what you think in the comments below.
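
For reference, the Box-Cox transform with parameter λ is (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0, so the log plot is the limiting case of the Box-Cox plot. The sketch below re-implements the formula in plain NumPy to show this; the helper `box_cox` and the sample values are hypothetical and only illustrate the idea.

```python
import numpy as np

def box_cox(x, lmbda):
    """Box-Cox transform: (x**lmbda - 1) / lmbda, or log(x) when lmbda == 0."""
    x = np.asarray(x, dtype="float64")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

# Made-up, strictly positive values spanning several orders of magnitude.
sample = np.array([1.0, 10.0, 1000.0, 40_000_000.0])

print(box_cox(sample, 0.1))   # the lambda used in the cell above
print(np.log(sample))         # the lambda = 0 limit

# As lambda shrinks towards 0, the transform approaches the plain log.
gap = np.max(np.abs(box_cox(sample, 1e-6) - np.log(sample)))
print(gap)
```

Both transforms are only defined for strictly positive inputs, which holds for the target here.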

## Exploring PCA and TSVD

### PCA Components Analysis

In [None]:
y = train.target
train.drop(['ID','target'], axis=1, inplace=True)

In [None]:
from sklearn.decomposition import PCA, TruncatedSVD
from tqdm import tqdm

comp_dict = dict()
for COMP in tqdm([1000, 500, 200, 100, 80, 20]):
    pca = PCA(n_components=COMP)
    pca.fit(train)
    # Cumulative explained variance in percent; round after summing so that
    # rounding errors do not accumulate.
    var1 = np.round(np.cumsum(pca.explained_variance_ratio_) * 100, decimals=2)
    comp_dict[str(COMP) + ' Components'] = var1

The handy progress bar may be new to you.  
> **Fact of the day**: tqdm means "progress" in Arabic (taqadum, تقدّم) and is an abbreviation for "I love you so much" in Spanish (te quiero demasiado). See [this repo](https://github.com/tqdm/tqdm) for more information on usage.

In [None]:
components_vars = pd.DataFrame.from_dict(comp_dict, orient='index')
components_vars = components_vars.transpose()
components_vars.iplot(theme='white', kind='bar', title='PCA Explained Variance for up to 1000 components',
                     xTitle='Nth Component', yTitle='Cumulative Variance Ratio (%)')
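
To show where the cumulative curve comes from, here is a minimal NumPy sketch on synthetic data (the matrix `X` is random and is not the competition set). It computes the explained variance ratios from the singular values of the mean-centred data and takes one cumulative sum. With an exact solver, the first k ratios of a larger decomposition should match those of a k-component one, so a single fit at the largest size may be enough. With a randomized solver the match is only approximate, so treat this as a sketch and not as a drop-in replacement for the loop above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 30 columns with decreasing spread per column.
X = rng.normal(size=(200, 30)) * np.linspace(5.0, 0.1, 30)

# PCA works on mean-centred data; the squared singular values are proportional
# to the variance captured by each component.
X_centred = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centred, compute_uv=False)
explained_variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

# One cumulative sum covers every cut-off; indexing gives the value for k components.
cumulative_pct = np.cumsum(explained_variance_ratio) * 100
for k in (5, 10, 20, 30):
    print(f"{k:>2} components: {cumulative_pct[k - 1]:.2f}% of variance")
```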

### Truncated SVD Explained Variance Ratios

In [None]:
comp_dict = dict()
for COMP in tqdm([400, 200, 100, 80, 20]):
    svd = TruncatedSVD(n_components=COMP)
    svd.fit(train)
    # Cumulative explained variance in percent; round after summing so that
    # rounding errors do not accumulate.
    var1 = np.round(np.cumsum(svd.explained_variance_ratio_) * 100, decimals=2)
    comp_dict[str(COMP) + ' Components'] = var1

In [None]:
components_vars = pd.DataFrame.from_dict(comp_dict, orient='index')
components_vars = components_vars.transpose()
components_vars.iplot(theme='white', kind='bar', title='SVD Explained Variance for up to 400 components',
                     xTitle='Nth Component', yTitle='Cumulative Variance Ratio (%)')

## Feature Importance

In [None]:
from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

In [None]:
lgbm = LGBMRegressor()
xgbm = XGBRegressor()

In [None]:
train = train.astype('float32') # For faster computation
xgbm.fit(train, log_target ,verbose=False)
lgbm.fit(train, log_target ,verbose=False)

In [None]:
LGBM_FEAT_IMP = pd.DataFrame({'Features':train.columns, "IMP": lgbm.feature_importances_}).sort_values(by='IMP', ascending=False)

XGBM_FEAT_IMP = pd.DataFrame({'Features':train.columns, "IMP": xgbm.feature_importances_}
                            ).sort_values(
                              by='IMP', ascending=False)

### Light GBM Top 10 feature importance

In [None]:
LGBM_FEAT_IMP.head(10).transpose()

### XGBoost Top 10 feature importance

In [None]:
XGBM_FEAT_IMP.head(10).transpose()

In [None]:
data = [go.Bar(
            x= LGBM_FEAT_IMP.head(50).Features,
            y= LGBM_FEAT_IMP.head(50).IMP, 
            marker=dict(color='green',))
       ]
layout = go.Layout(title = "LGBM Top 50 Feature Importances")
fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
data = [go.Bar(
            x= XGBM_FEAT_IMP.head(50).Features,
            y= XGBM_FEAT_IMP.head(50).IMP, 
            marker=dict(color='blue',))
       ]
layout = go.Layout(title = "XGBM Top 50 Feature Importances")
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Extracting the set of important features common to LGBM and XGBoost:

In [None]:
cols_imp = list(set(LGBM_FEAT_IMP[LGBM_FEAT_IMP.IMP > 0 ].Features.values) & set(
    XGBM_FEAT_IMP[XGBM_FEAT_IMP.IMP > 0 ].Features.values))
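
The intersection above can be tried out on small, made-up importance tables. The frames `lgbm_imp` and `xgb_imp` below are hypothetical stand-ins with the same two columns as `LGBM_FEAT_IMP` and `XGBM_FEAT_IMP`. Sorting the result is my own addition, since a plain `set` has no guaranteed order.

```python
import pandas as pd

# Hypothetical importance tables with the same layout as LGBM_FEAT_IMP / XGBM_FEAT_IMP.
lgbm_imp = pd.DataFrame({"Features": ["f1", "f2", "f3", "f4"], "IMP": [12, 0, 5, 3]})
xgb_imp = pd.DataFrame({"Features": ["f1", "f2", "f3", "f4"], "IMP": [0.4, 0.1, 0.0, 0.5]})

# Features each model actually used (non-zero importance).
lgbm_used = set(lgbm_imp.loc[lgbm_imp["IMP"] > 0, "Features"])
xgb_used = set(xgb_imp.loc[xgb_imp["IMP"] > 0, "Features"])

# Sort the intersection so the resulting column order is reproducible.
common = sorted(lgbm_used & xgb_used)
print(common)
```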

In [None]:
# Columns to compare between train and test
COMPARE_COLS = ['f190486d6', '58e2e02e6', 'eeb9cd3aa', '15ace8c9f', '58e056e12',
                '9fd594eec', 'c47340d97', 'b43a7cfd5', 'c5a231d81', 'fb0f5dbfe',
                '58232a6fb', '2288333b4', 'f74e8f13d', '20aa07010', '66ace2992',
                '6eef030c1', '024c577b9', '26ab20ff9', '491b9ee45', '9306da53f']
DIFF_DESCRIBE = (train[COMPARE_COLS].describe().transpose()
                 - test[COMPARE_COLS].describe().transpose())

### Difference of train.describe() - test.describe() to see how the two datasets differ

In [None]:
DIFF_DESCRIBE.style.format("{:.2f}").bar(align='mid', color=['#d65f5f', '#5fba7d'])
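
As a small illustration of how to read this table, the sketch below builds the same kind of difference on two tiny, made-up frames. A value near zero means the statistic is similar in both sets, while the `count` column only reflects the difference in the number of rows.

```python
import pandas as pd

# Hypothetical frames: "x" is shifted by 1 in the test set, "y" is identical.
toy_train = pd.DataFrame({"x": [0.0, 2.0, 4.0], "y": [1.0, 1.0, 4.0]})
toy_test = pd.DataFrame({"x": [1.0, 3.0, 5.0], "y": [1.0, 1.0, 4.0]})

cols = ["x", "y"]
# One row per column, one column per summary statistic (count, mean, std, ...).
diff = toy_train[cols].describe().T - toy_test[cols].describe().T
print(diff[["mean", "std", "min", "max"]])
```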

#### The kernel will be updated on a weekly basis. I will make sure that the kernel stays unique and also summarize any interesting findings here...

Last updated - **29 June 2018**

### TODO

1. Cross Validation Strategy
2. Ensembling
3. Stacking
4. More EDA...