# Santander Customer Transaction Prediction - GBM

In this Kaggle competition, the objective is to identify which customers will make a specific transaction in the future.

**Link to the competition**: https://www.kaggle.com/c/santander-customer-transaction-prediction/  
**Type of Problem**: Classification  
**Metric for evaluation**: AUC (Area Under the ROC Curve)

This Python 3 environment comes with many helpful analytics libraries installed.
It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

import matplotlib.pyplot as plt # plotting (pyplot is the recommended interface; pylab is discouraged)

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Step1: Read Training Data from CSV
Use pandas `read_csv` function to read the data from train.csv into a pandas dataframe.  

The dataframe is then split into training and validation sets using sklearn's `train_test_split` function.

In [None]:
input_dir = '/kaggle/input/santander-customer-transaction-prediction/'
df_train = pd.read_csv(input_dir + 'train.csv')  # input_dir already ends with '/'
df_train

In [None]:
var_columns = [c for c in df_train.columns if c not in ['ID_code','target']]

X = df_train.loc[:,var_columns]
y = df_train.loc[:,'target']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape
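
The split above is purely random. If the target classes are imbalanced, one optional variation is to pass `stratify=y` so that both parts keep the same share of positives. The following is a minimal sketch on synthetic labels (the `*_demo` names and the 10% positive rate are assumptions for illustration, not properties read from train.csv):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 100 positives and 900 negatives (illustrative only)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions equal in both parts of the split
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each part
print(y_tr.mean(), y_va.mean())
```

Whether stratification changes the validation AUC noticeably depends on the data; with a large training set the difference is often small.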

## Step2: Create a simple GBM Model and evaluate performance

Let us look at the meaning of the parameters passed to `GradientBoostingClassifier`:  
- `n_estimators`: **5000** is the maximum number of trees (boosting stages) in the model
- `learning_rate`: **0.05** is the factor by which each tree's contribution is shrunk before it is added to the ensemble
- `max_depth`: **3** is the maximum depth of any one tree in the model
- `subsample`: **50%** of the observations are randomly sampled to fit each individual tree
- `validation_fraction`: **10%** of the training observations are held out as a validation set for early stopping
- `n_iter_no_change`: **20** is the early-stopping criterion. If the validation score does not improve for 20 consecutive iterations, training stops
- `max_features`: **log2(# features)** randomly chosen features are considered when looking for the best split
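
To see how these parameters interact, here is a small sketch on synthetic data generated with `make_classification` (an assumption, used only so the example runs quickly; the `*_demo` names and the smaller parameter values are illustrative). It fits a model with early stopping enabled and reads back how many trees were actually built:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the competition data (not the real train.csv)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=5, random_state=0)

demo_gbm = GradientBoostingClassifier(n_estimators=500,        # upper bound on trees
                                      learning_rate=0.1,
                                      max_depth=3,
                                      subsample=0.5,
                                      validation_fraction=0.1,  # held out for early stopping
                                      n_iter_no_change=5,       # patience in iterations
                                      max_features='log2',
                                      random_state=0)
demo_gbm.fit(X_demo, y_demo)

# n_estimators_ is the number of trees actually fitted after early stopping
n_trees_fitted = demo_gbm.n_estimators_
print(n_trees_fitted, "trees fitted out of a maximum of", demo_gbm.n_estimators)
```

The fitted attribute `n_estimators_` reports the same count as `len(estimators_)`; it can be well below `n_estimators` when early stopping triggers.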

In [None]:
model_gbm = GradientBoostingClassifier(n_estimators=5000,
                                       learning_rate=0.05,
                                       max_depth=3,
                                       subsample=0.5,
                                       validation_fraction=0.1,
                                       n_iter_no_change=20,
                                       max_features='log2',
                                       verbose=1)
model_gbm.fit(X_train, y_train)

Look at how many estimators/trees were finally created during training.

In [None]:
len(model_gbm.estimators_)

In [None]:
y_train_pred = model_gbm.predict_proba(X_train)[:,1]
y_valid_pred = model_gbm.predict_proba(X_valid)[:,1]

print("AUC Train: {:.4f}\nAUC Valid: {:.4f}".format(roc_auc_score(y_train, y_train_pred),
                                                    roc_auc_score(y_valid, y_valid_pred)))
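
As a reminder of what the metric measures, AUC can be read as the fraction of (positive, negative) pairs in which the positive observation receives the higher score. A minimal sketch with four toy observations (the values are illustrative, not model output) compares `roc_auc_score` with a direct pairwise count:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc_sklearn = roc_auc_score(y_true, y_score)

# Pairwise view: compare every positive score with every negative score
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]

# Correctly ordered pairs count 1, tied pairs count 0.5
auc_pairwise = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc_sklearn, auc_pairwise)
```

Because only the ordering of scores matters, AUC is unaffected by any monotonic rescaling of the predicted probabilities.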

## Step3: Look at performance with respect to number of trees
The `staged_predict_proba` function yields the predictions after each boosting stage, which lets us look at performance for different numbers of trees in the model.

In [None]:
y_train_pred_trees = np.stack(list(model_gbm.staged_predict_proba(X_train)))[:,:,1]
y_valid_pred_trees = np.stack(list(model_gbm.staged_predict_proba(X_valid)))[:,:,1]

y_train_pred_trees.shape, y_valid_pred_trees.shape

In [None]:
auc_train_trees = [roc_auc_score(y_train, y_pred) for y_pred in y_train_pred_trees]
auc_valid_trees = [roc_auc_score(y_valid, y_pred) for y_pred in y_valid_pred_trees]

In [None]:
plt.figure(figsize=(12,5))

plt.plot(auc_train_trees, label='Train Data')
plt.plot(auc_valid_trees, label='Valid Data')

plt.title('AUC vs Number of Trees')
plt.ylabel('AUC')
plt.xlabel('Number of Trees')
plt.legend()

plt.show()
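
The curve can also be summarised numerically by locating the stage with the highest validation AUC. The sketch below uses a short hypothetical list, `auc_valid_demo`, in place of `auc_valid_trees` (the values are made up for illustration and are not results from this model):

```python
import numpy as np

# Hypothetical per-stage validation AUCs (illustrative values only)
auc_valid_demo = [0.70, 0.78, 0.83, 0.85, 0.86, 0.858, 0.855]

best_idx = int(np.argmax(auc_valid_demo))  # 0-based index of the best stage
best_n_trees = best_idx + 1                # stage i uses i + 1 trees
best_auc = auc_valid_demo[best_idx]

print("Best number of trees:", best_n_trees, "with AUC", best_auc)
```

Note the off-by-one: the first element returned by `staged_predict_proba` corresponds to a model with one tree, so the list index plus one gives the tree count.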

## Step4: Feature Importance
Low-importance features can be removed to obtain a simpler, faster and more stable model.

In [None]:
pd.DataFrame({"Variable_Name":var_columns,
              "Importance":model_gbm.feature_importances_}) \
            .sort_values('Importance', ascending=False)
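
One way to act on this table is to keep only the columns whose importance exceeds a cutoff. The following is a minimal sketch on synthetic data; the `*_demo` names, the `var_i` column names and the `threshold` of 0.01 are assumptions chosen for illustration, not values tuned for this competition:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data with a few informative columns (not the real train.csv)
X_arr, y_demo = make_classification(n_samples=1000, n_features=10,
                                    n_informative=3, n_redundant=0,
                                    random_state=0)
demo_columns = [f"var_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=demo_columns)

demo_gbm = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                      random_state=0).fit(X_demo, y_demo)

# Importances are normalised to sum to 1 across all features
importance = pd.Series(demo_gbm.feature_importances_,
                       index=demo_columns).sort_values(ascending=False)

threshold = 0.01  # hypothetical cutoff, to be chosen per problem
selected_columns = importance[importance >= threshold].index.tolist()

# Reduced feature matrix that a new model could be fitted on
X_demo_reduced = X_demo.loc[:, selected_columns]
print(len(selected_columns), "of", len(demo_columns), "features kept")
```

After dropping features, the model should be refitted and the validation AUC rechecked, since impurity-based importances are only a rough guide.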

## Step5: Scoring for Test Data
First, read test.csv and sample_submission.csv

In [None]:
df_test = pd.read_csv(input_dir + 'test.csv')
df_sample_submission = pd.read_csv(input_dir + 'sample_submission.csv')

df_test.shape, df_sample_submission.shape

In [None]:
X_test = df_test.loc[:,var_columns]

df_sample_submission['target'] = model_gbm.predict_proba(X_test)[:,1]
df_sample_submission

Save the output as a CSV file.

In [None]:
output_dir = '/kaggle/working/'
df_sample_submission.to_csv(output_dir + '03_gbm_scores.csv', index=False)  # output_dir already ends with '/'