# Santander Customer Transaction Prediction - Random Forest Details

In this Kaggle competition, the objective is to identify which customers will make a transaction in the future.

**Link to the competition**: https://www.kaggle.com/c/santander-customer-transaction-prediction/  
**Type of Problem**: Classification  
**Metric for evaluation**: AUC (Area Under the ROC Curve)
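
As a quick illustration of the metric, the sketch below computes AUC with `sklearn.metrics.roc_auc_score` on four made-up labels and scores (the values are illustrative, not competition data). AUC depends only on how the scores rank positives relative to negatives, which is why predicted probabilities rather than hard class labels are the natural thing to score.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores, made up for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC = fraction of (positive, negative) pairs that are ranked correctly
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```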

This Python 3 environment comes with many helpful analytics libraries installed.
It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.ensemble import RandomForestClassifier
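try:
    from sklearn.inspection import plot_partial_dependence
except ImportError:  # plot_partial_dependence was removed in scikit-learn 1.2
    from sklearn.inspection import PartialDependenceDisplay
    plot_partial_dependence = PartialDependenceDisplay.from_estimator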
from sklearn import metrics

import matplotlib.pyplot as plt

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Step1: Read CSV
Read the training CSV file and look at the data. There are 200,000 rows and 200 independent variables.

In [None]:
input_dir = '/kaggle/input/santander-customer-transaction-prediction/'
df_train = pd.read_csv(input_dir + 'train.csv')
df_train

Split the data into independent and dependent variables. This is required to train the model using sklearn.

In [None]:
var_columns = [c for c in df_train if c not in ['ID_code','target']]
X = df_train.loc[:,var_columns]
y = df_train.loc[:,'target']
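
Before modelling, it can help to check how imbalanced the target is, since the model in the next step uses `class_weight='balanced'`. The sketch below uses a made-up target with 10% positives (an assumption for illustration; run `y.value_counts(normalize=True)` on the real data) and shows the weights that setting implies, namely `n_samples / (n_classes * class_count)` for each class.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical target: 90 negatives and 10 positives (illustrative only)
y_demo = pd.Series([0] * 90 + [1] * 10, name='target')

# Share of each class in the target
class_share = y_demo.value_counts(normalize=True).sort_index()
print(class_share)

# Weights implied by class_weight='balanced'
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]),
                               y=y_demo.to_numpy())
print(dict(zip([0, 1], weights)))
```

The rarer class receives the larger weight, so errors on positives count more during training.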

## Step2: Create Random Forest Model
Use the parameters obtained from hyperparameter tuning.

In [None]:
model_rf = RandomForestClassifier(class_weight='balanced',
                                  criterion='gini',
                                  max_depth=55,
                                  max_features='log2',
                                  min_samples_leaf=0.005,
                                  min_samples_split=0.005,
                                  n_estimators=190)
model_rf.fit(X, y)
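
The tuning run that produced these parameters is not part of this notebook. A minimal sketch of how such a search could look, assuming `RandomizedSearchCV` scored by cross-validated AUC on synthetic data; the search space, `n_iter`, and data sizes are illustrative assumptions, not the settings actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic, imbalanced stand-in for the training data
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=5, weights=[0.9, 0.1],
                                     random_state=0)

# Hypothetical search space around the parameters used above
param_distributions = {
    'max_depth': [5, 15, 55],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [0.005, 0.01, 0.05],
    'n_estimators': [20, 40],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=0),
    param_distributions=param_distributions,
    n_iter=4,            # number of sampled parameter combinations
    scoring='roc_auc',   # same metric as the competition
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the notebook fits the final model on all training rows, a cross-validated score like `search.best_score_` is also the only estimate of AUC available before submitting.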

## Step3: Variable Importance
Convert the variable importances into a pandas DataFrame, sorted from most to least important.

In [None]:
df_var_imp = pd.DataFrame({'Variable': var_columns,
                           'Importance': model_rf.feature_importances_}) \
                .sort_values(by='Importance', ascending=False) \
                .reset_index(drop=True)

Let us plot the 15 most important variables as a horizontal bar chart.

In [None]:
df_var_imp[:15].sort_values('Importance').plot(x='Variable', y='Importance', kind='barh', figsize=(15,5), legend=False)
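
The impurity-based `feature_importances_` are computed from training-set statistics, so they do not show whether a variable helps on unseen data. One common cross-check is permutation importance on a held-out split. The sketch below runs it on synthetic data; the column names, split size, and model settings are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=500, n_features=8,
                                    n_informative=3, random_state=0)
cols = [f'var_{i}' for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=cols)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3,
                                            stratify=y_demo, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Drop in validation AUC when each column is shuffled, averaged over repeats
result = permutation_importance(rf, X_val, y_val, scoring='roc_auc',
                                n_repeats=5, random_state=0)

df_perm_imp = pd.DataFrame({'Variable': cols,
                            'Importance': result.importances_mean}) \
                .sort_values(by='Importance', ascending=False) \
                .reset_index(drop=True)
print(df_perm_imp)
```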

## Step4: Partial Dependence of Variables
`var_81`, `var_139` and `var_110` are the top variables on the basis of variable importance. Let us see how they relate to the dependent variable.

In [None]:
fig,ax = plt.subplots(figsize=(18, 4))
plot_partial_dependence(model_rf, X, ['var_81','var_139','var_110'],
                        grid_resolution=20, ax=ax);

For `var_81` and `var_139`, the event rate is higher for lower values of the variable. For `var_110`, higher values lead to a higher event rate. There also appears to be a cut-off value for each variable that could be used for classification.
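
To make explicit what a partial dependence curve shows, the sketch below computes one by hand: set the variable to each grid value for every row, then average the predicted probability of class 1. It runs on synthetic data, and `manual_partial_dependence` is a hypothetical helper written for this illustration, not a scikit-learn function. The 5th to 95th percentile grid is an assumption chosen to resemble the library's default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def manual_partial_dependence(model, X, feature_idx, grid_resolution=20):
    """Average predicted probability of class 1 over a grid of one feature.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # Evenly spaced grid between the 5th and 95th percentiles of the feature
    lo, hi = np.percentile(X[:, feature_idx], [5, 95])
    grid = np.linspace(lo, hi, grid_resolution)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # same value for every row
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return grid, np.array(averages)


X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_demo, y_demo)

grid, pd_values = manual_partial_dependence(rf, X_demo, feature_idx=0)
print(np.column_stack([grid, pd_values])[:5])
```

A sharp step in `pd_values` along the grid is what the cut-off observation above refers to.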

Let us also look at the distribution of the three variables as histograms.

In [None]:
fig,ax = plt.subplots(1, 3, figsize=(18, 4))
X['var_81'].hist(ax=ax[0], legend=True)
X['var_139'].hist(ax=ax[1], legend=True)
X['var_110'].hist(ax=ax[2], legend=True)

## Step5: Prediction on Test Data
Read the test and sample submission CSV files.

In [None]:
# input_dir already ends with '/', so no leading slash is needed
df_test = pd.read_csv(input_dir + 'test.csv')
df_sample_submission = pd.read_csv(input_dir + 'sample_submission.csv')

df_test.shape, df_sample_submission.shape

Select the independent variables from the test data and compute the predicted probabilities.

In [None]:
X_test = df_test.loc[:,var_columns]

df_sample_submission['target'] = model_rf.predict_proba(X_test)[:,1]
df_sample_submission

## Step6: Confidence of prediction
While the predicted probability can be used to judge how confident we are about the prediction for an observation, another way is to use the standard deviation of the predictions from the individual trees in the random forest.

In [None]:
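# Hard 0/1 prediction from each tree; pass a NumPy array because the
# individual trees were fitted without feature names
y_test_pred_trees = np.stack([m.predict(X_test.values) for m in model_rf.estimators_])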
y_test_pred_trees.shape

In [None]:
y_test_pred_std = y_test_pred_trees.std(0)

df_sample_submission['pred_prob'] = model_rf.predict_proba(X_test)[:,1]
df_sample_submission['pred_std'] = y_test_pred_std
df_sample_submission[:10]
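
Note that each tree's `predict` returns a hard 0/1 vote, so the standard deviation of the votes is fully determined by the share of trees voting 1: with vote share `p` it equals `sqrt(p * (1 - p))`. One possible alternative, sketched below on synthetic data, is to take the spread of the per-tree probabilities from `predict_proba`, which is not tied to the mean in the same way. The data, model settings, and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, random_state=0)
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=0.02,
                            random_state=0).fit(X_demo, y_demo)
X_new = X_demo[:100]  # stand-in for the test rows

# Hard votes: one 0/1 prediction per tree and row
votes = np.stack([m.predict(X_new) for m in rf.estimators_])
vote_share = votes.mean(0)
vote_std = votes.std(0)

# The std of 0/1 votes carries the same information as the vote share
assert np.allclose(vote_std, np.sqrt(vote_share * (1 - vote_share)))

# Per-tree probabilities of class 1; their mean is the forest's probability
tree_probs = np.stack([m.predict_proba(X_new)[:, 1] for m in rf.estimators_])
prob_mean = tree_probs.mean(0)
prob_std = tree_probs.std(0)
print(prob_mean[:5], prob_std[:5])
```

Rows with a similar `prob_mean` but a larger `prob_std` are those on which the trees disagree more.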

## Step7: Export Predictions

In [None]:
output_dir = '/kaggle/working/'
df_sample_submission[['ID_code','target']].to_csv(output_dir + '02_random_forest_scores.csv', index=False)