# Santander Customer Transaction Prediction - Decision Tree

In this Kaggle competition, the objective is to identify which customers will make a specific transaction in the future.

**Link to the competition**: https://www.kaggle.com/c/santander-customer-transaction-prediction/  
**Type of Problem**: Classification  
**Metric for evaluation**: AUC (Area Under the ROC Curve)

This Python 3 environment comes with many helpful analytics libraries installed.  
It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Step1: Read the datasets

In [None]:
input_dir = '/kaggle/input/santander-customer-transaction-prediction/'

df_train = pd.read_csv(input_dir + 'train.csv')
df_train

We already know the profile of the data from the overview provided by Kaggle.  
Let us confirm the event rate in the training data, which is approximately 10%.

In [None]:
df_train.groupby('target').size()
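The counts above can also be expressed as shares, which gives the event rate directly. A minimal sketch on a toy frame (the data below is hypothetical, standing in for `df_train`):

```python
import pandas as pd

# Toy stand-in for df_train: 9 non-events and 1 event (hypothetical data)
df_toy = pd.DataFrame({"target": [0] * 9 + [1]})

# Share of rows in each class; the share of class 1 is the event rate
class_shares = df_toy["target"].value_counts(normalize=True)
event_rate = class_shares.loc[1]

print(class_shares)
print("Event rate: {:.1%}".format(event_rate))
```

On the real data the same call would be `df_train['target'].value_counts(normalize=True)`.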

## Step2: Split the data into training and validation data
20% of the data is held out for validation.

In [None]:
var_columns = [c for c in df_train.columns if c not in ['ID_code','target']]
X = df_train.loc[:,var_columns]
y = df_train.loc[:,'target']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape
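With an event rate of about 10%, a purely random split can leave the two parts with slightly different class shares. One possible variation, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below uses hypothetical synthetic data rather than the competition data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 5 features, 10% events
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the class proportions the same in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train event rate:", y_tr.mean())
print("Valid event rate:", y_va.mean())
```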

## Step3: Simple decision tree

Let us try to use a simple decision tree to predict the target variable.  
Also plot the tree to make sure it looks fine.

In [None]:
model_tree = DecisionTreeClassifier(max_leaf_nodes=8, class_weight='balanced')
model_tree.fit(X_train, y_train)
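The `class_weight='balanced'` option weights each class inversely to its frequency, using `n_samples / (n_classes * np.bincount(y))`. A small sketch with a hypothetical 90/10 target shows the resulting weights:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical target with 90 non-events and 10 events
y_demo = np.array([0] * 90 + [1] * 10)

# Weights that class_weight='balanced' would assign to classes 0 and 1
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

# The same weights computed by hand from the formula
manual = len(y_demo) / (2 * np.bincount(y_demo))

print(weights)
```

The rare class receives the larger weight, so both classes contribute equally in total when the tree chooses its splits.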

In [None]:
#Create the figure
plt.figure(figsize=(20,10))

#Create the tree plot
plot_tree(model_tree,
           feature_names = var_columns, #Feature names
           class_names = ["0","1"], #Class names
           rounded = True,
           filled = True)

plt.show()
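When the plot is hard to read, the same tree can be printed as text with `export_text`. This is a sketch on hypothetical synthetic data; the `var_0` to `var_4` names are made up for the example:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
feature_names = ["var_{}".format(i) for i in range(X_demo.shape[1])]

tree_demo = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0)
tree_demo.fit(X_demo, y_demo)

# One line per split or leaf, indented by depth
tree_text = export_text(tree_demo, feature_names=feature_names)
print(tree_text)
```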

Let us look at the training and validation performance

In [None]:
y_train_pred = model_tree.predict(X_train)
y_valid_pred = model_tree.predict(X_valid)

In [None]:
auc_train = metrics.roc_auc_score(y_train, y_train_pred)
auc_valid = metrics.roc_auc_score(y_valid, y_valid_pred)

print("AUC Train = {}\nAUC Valid = {}".format(round(auc_train,4), round(auc_valid,4)))
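Note that `roc_auc_score` is computed here from hard 0/1 predictions. The function also accepts continuous scores such as `predict_proba(...)[:, 1]`, which measures how well the model ranks customers rather than how well one fixed cut-off separates them. The sketch below compares the two on hypothetical synthetic data, so its numbers say nothing about the competition data:

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced data (about 10% events), not the competition data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

tree_demo = DecisionTreeClassifier(max_leaf_nodes=8, class_weight="balanced", random_state=0)
tree_demo.fit(X_tr, y_tr)

labels_va = tree_demo.predict(X_va)              # hard 0/1 labels
scores_va = tree_demo.predict_proba(X_va)[:, 1]  # probability of class 1

auc_from_labels = metrics.roc_auc_score(y_va, labels_va)
auc_from_scores = metrics.roc_auc_score(y_va, scores_va)

print("AUC from labels = {:.4f}".format(auc_from_labels))
print("AUC from scores = {:.4f}".format(auc_from_scores))
```

A tree with 8 leaves produces at most 8 distinct probabilities, so even the score-based ROC curve has only a handful of points.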

## Step4: Iterate over number of leaf nodes
Let us repeat these steps for different tree sizes to find an appropriate value of `max_leaf_nodes`.  
To do that, we wrap all the steps in a function and call it in a loop.

In [None]:
def tree_training(max_leaf_nodes, X_train, y_train, X_valid, y_valid):
    model_tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, class_weight='balanced')
    model_tree.fit(X_train, y_train)
    
    y_train_pred = model_tree.predict(X_train)
    y_valid_pred = model_tree.predict(X_valid)
    
    auc_train = metrics.roc_auc_score(y_train, y_train_pred)
    auc_valid = metrics.roc_auc_score(y_valid, y_valid_pred)
    
    print("Nodes:{}, Train:{:.4f}, Valid:{:.4f}, Diff:{:.4f}".format(max_leaf_nodes,
                                                                     auc_train,
                                                                     auc_valid,
                                                                     auc_train-auc_valid))
          

# Run a few iterations to find which max_leaf_nodes value works best
for i in range(2, 20):
    tree_training(i, X_train, y_train, X_valid, y_valid)

The performance on validation data peaks at a small number of leaf nodes, so a large tree does not appear to be necessary.

At `6 leaf nodes`, we are getting the highest validation AUC. Performance of the model on train and validation is virtually the same.
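Instead of reading the best value off the printed lines, the scores can be collected and the best tree size picked programmatically. A minimal sketch, assuming hypothetical synthetic data and a helper `tree_scores` that is not part of the notebook:

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced data, not the competition data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def tree_scores(max_leaf_nodes):
    """Return (train AUC, validation AUC) for one tree size, scored on hard labels."""
    model = DecisionTreeClassifier(
        max_leaf_nodes=max_leaf_nodes, class_weight="balanced", random_state=0
    )
    model.fit(X_tr, y_tr)
    auc_tr = metrics.roc_auc_score(y_tr, model.predict(X_tr))
    auc_va = metrics.roc_auc_score(y_va, model.predict(X_va))
    return auc_tr, auc_va

# Map each tree size to its scores, in increasing order of size
results = {n: tree_scores(n) for n in range(2, 20)}

# max() returns the first maximum, so ties resolve to the smallest tree
best_nodes = max(results, key=lambda n: results[n][1])
print("Best max_leaf_nodes:", best_nodes, "Valid AUC: {:.4f}".format(results[best_nodes][1]))
```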

## Step5: k-fold cross validation

A single train/validation split can give a noisy estimate, so I felt the need to check these results with k-fold cross validation. Let us try `5-fold cross validation`.

In [None]:
kfold = KFold(5, shuffle=True, random_state=1)

for idx_train, idx_valid in kfold.split(df_train):
    X_train = df_train.loc[idx_train, var_columns]
    y_train = df_train.loc[idx_train, 'target']
    
    X_valid = df_train.loc[idx_valid, var_columns]
    y_valid = df_train.loc[idx_valid, 'target']
    
    # Try 2 to 15 leaf nodes; we saw that larger trees don't increase performance
    print("Iteration Starts")
    for i in range(2, 16):
        tree_training(i, X_train, y_train, X_valid, y_valid)
    
    print("Iteration Ends\n-----------------------")
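`KFold` splits rows without looking at the target, so the event rate can vary a little between folds. A possible alternative, shown here only as a sketch on a hypothetical 10%-event target, is `StratifiedKFold`, which keeps the class shares the same in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical target with a 10% event rate
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Event rate of the validation part of each fold
fold_rates = [y_demo[idx_valid].mean() for _, idx_valid in skf.split(X_demo, y_demo)]
print(fold_rates)
```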

A better way to perform 5-fold cross validation is to use the sklearn function `cross_validate`.

I will iterate over the number of nodes and take average AUC for each iteration

In [None]:
# CV function requires a scorer of this form
def cv_roc_auc_scorer(model, X, y): return metrics.roc_auc_score(y, model.predict(X))

# Loop through multiple values of max_leaf_nodes to find best parameter
for num_leaf_node in range(2,16):
    model_tree = DecisionTreeClassifier(max_leaf_nodes=num_leaf_node, class_weight='balanced')
    kfold_scores = cross_validate(model_tree,
                                  X,
                                  y,
                                  cv=5,
                                  scoring=cv_roc_auc_scorer,
                                  return_train_score=True)

    # Find average train and test score
    train_auc_avg = np.mean(kfold_scores['train_score'])
    test_auc_avg = np.mean(kfold_scores['test_score'])

    print("Nodes:{}, Train:{:.4f}, Valid:{:.4f}, Diff:{:.4f}".format(num_leaf_node,
                                                                     train_auc_avg,
                                                                     test_auc_avg,
                                                                     train_auc_avg-test_auc_avg))

The best performance on the validation folds (with the minimum number of leaf nodes) is for 8 nodes.
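`cross_validate` also accepts the built-in scorer name `'roc_auc'` in place of a custom function. That scorer uses the model's continuous output (here `predict_proba`) rather than hard labels, so its values are not directly comparable with the numbers printed above. A sketch on hypothetical synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced data, not the competition data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9], random_state=0
)

tree_demo = DecisionTreeClassifier(max_leaf_nodes=8, class_weight="balanced", random_state=0)

# The built-in 'roc_auc' scorer ranks rows by predicted probability
cv_scores = cross_validate(
    tree_demo, X_demo, y_demo, cv=5, scoring="roc_auc", return_train_score=True
)

print("Train: {:.4f}".format(np.mean(cv_scores["train_score"])))
print("Valid: {:.4f}".format(np.mean(cv_scores["test_score"])))
```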

## Step6: Final Model using Trees
The final model has `8 leaf nodes`. Let us fit that model on the entire training data and look at the output.

In [None]:
model_tree = DecisionTreeClassifier(max_leaf_nodes=8, class_weight='balanced')
model_tree.fit(X, y)

Print the final tree

In [None]:
plt.figure(figsize=(20,10))

plot_tree(model_tree,
          feature_names=var_columns,
          class_names = ['0','1'],
          rounded=True,
          filled=True)

plt.show()

Let us find the final AUC value on the training data  
and also plot the ROC curve.

In [None]:
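# Use the predicted probability of class 1 so that the ROC curve
# has one point per distinct score instead of a single cut-off
y_pred = model_tree.predict_proba(X)[:, 1]

fpr, tpr, threshold = metrics.roc_curve(y, y_pred)
metrics.auc(fpr, tpr)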

In [None]:
zeros_probs = [0 for _ in range(len(y))]
fpr_zeros, tpr_zeros, _ = metrics.roc_curve(y, zeros_probs)

# Plot the roc curve for the model
plt.plot(fpr_zeros, tpr_zeros, linestyle='--', label='No Model')
plt.plot(fpr, tpr, marker='.', label='Model')

# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

# Add legend
plt.legend()

plt.show()
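The area reported by `metrics.auc` is the trapezoidal area under the plotted points. The sketch below reproduces that calculation by hand for three hypothetical ROC points (the values are made up for illustration):

```python
import numpy as np

# Hypothetical ROC points: (0, 0), one operating point, and (1, 1)
fpr_demo = np.array([0.0, 0.2, 1.0])
tpr_demo = np.array([0.0, 0.7, 1.0])

# Trapezoidal rule: width of each segment times its average height
widths = np.diff(fpr_demo)
avg_heights = (tpr_demo[1:] + tpr_demo[:-1]) / 2
auc_manual = float(np.sum(widths * avg_heights))

print(auc_manual)
```

With a single operating point the area reduces to `(TPR + (1 - FPR)) / 2`, which is `(0.7 + 0.8) / 2 = 0.75` for these values.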

## Step7: Find Predictions for Test Data and store as final CSV

In [None]:
df_test = pd.read_csv(input_dir + 'test.csv')
df_test

In [None]:
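X_test = df_test.loc[:, var_columns]
# The competition is scored on AUC, so submit the predicted
# probability of class 1 rather than hard 0/1 labels
y_test_pred = model_tree.predict_proba(X_test)[:, 1]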

In [None]:
df_sample_subm = pd.read_csv(input_dir + 'sample_submission.csv')
df_sample_subm

In [None]:
df_sample_subm['target'] = y_test_pred
df_sample_subm

In [None]:
output_dir = '/kaggle/working/'
# output_dir already ends with a slash, so no extra separator is needed
df_sample_subm.to_csv(output_dir + '01_tree_scores.csv', index=False)
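Before uploading, it can help to check the file's shape and value range. This is a sketch with a hypothetical three-row frame; the `test_0`-style IDs and the temporary output path are assumptions made for the example:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the submission frame
df_subm = pd.DataFrame({
    "ID_code": ["test_0", "test_1", "test_2"],
    "target": [0.1, 0.8, 0.3],
})

# Basic sanity checks: expected columns, valid scores, no duplicate IDs
assert list(df_subm.columns) == ["ID_code", "target"]
assert df_subm["target"].between(0, 1).all()
assert df_subm["ID_code"].is_unique

# Write to a temporary directory and read the file back
out_path = os.path.join(tempfile.mkdtemp(), "01_tree_scores.csv")
df_subm.to_csv(out_path, index=False)
df_check = pd.read_csv(out_path)
print(df_check)
```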