# Santander Customer Transaction Prediction - Random Forest Basics

In this Kaggle competition, the objective is to identify which customers will make a specific transaction in the future.

**Link to the competition**: https://www.kaggle.com/c/santander-customer-transaction-prediction/  
**Type of Problem**: Classification  
**Metric for evaluation**: AUC (Area Under the ROC Curve)


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt 

from sklearn import metrics

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Step 1: Read Training Dataset

In [None]:
input_dir = '/kaggle/input/santander-customer-transaction-prediction'

df_train = pd.read_csv(input_dir + '/train.csv')
df_train

## Step 2: Split data into train and validation set

In [None]:
var_columns = [c for c in df_train.columns if c not in ('ID_code','target')]

X = df_train.loc[:, var_columns]
y = df_train.loc[:, 'target']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape
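
The model below is fitted with `class_weight='balanced'`, which suggests the two classes are not equally frequent. In that situation a stratified split keeps the positive rate the same in both parts. The following is a minimal sketch on made-up toy labels (`X_toy` and `y_toy` are hypothetical, not the competition data), using the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (made up for illustration): 10% positives
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Share of positives in the training and validation parts
print(y_tr.mean(), y_va.mean())
```

Whether stratification changes the results noticeably on the real data is not something this notebook measures; with 200k rows a random split is usually close to stratified already.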

## Step 3: Create a Random Forest Model

Define Model Parameters and create the model

In [None]:
num_trees = 150

model_rf = RandomForestClassifier(n_estimators=num_trees,
                                  max_depth=4,
                                  class_weight='balanced',
                                  random_state=42)  # fixed seed so results are reproducible
model_rf.fit(X_train, y_train)
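
For reference, `class_weight='balanced'` weights each class inversely to its frequency, using `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula on made-up toy labels (`y_toy` is hypothetical) and compares it with scikit-learn's `compute_class_weight` helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (made up for illustration): 9 negatives and 1 positive
y_toy = np.array([0] * 9 + [1])

# 'balanced' weight for class c = n_samples / (n_classes * count(c))
manual = len(y_toy) / (2 * np.bincount(y_toy))

# The same weights computed by scikit-learn
from_sklearn = compute_class_weight(class_weight='balanced',
                                    classes=np.array([0, 1]),
                                    y=y_toy)
print(manual, from_sklearn)
```

The rare class receives the larger weight, so errors on it count more when the trees are grown.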

Let us look at the performance on Training and Validation data

In [None]:
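y_train_pred = model_rf.predict(X_train)
y_valid_pred = model_rf.predict(X_valid)

# roc_auc_score expects scores, so pass the positive-class probability instead of 0/1 labels
y_train_score = model_rf.predict_proba(X_train)[:, 1]
y_valid_score = model_rf.predict_proba(X_valid)[:, 1]

print('AUC Train: {:.4f}\nAUC Valid: {:.4f}'.format(metrics.roc_auc_score(y_train, y_train_score),
                                                     metrics.roc_auc_score(y_valid, y_valid_score)))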
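
AUC is a ranking metric, so its value depends on whether `roc_auc_score` receives probability scores or thresholded 0/1 labels. The toy sketch below (the arrays are made up for illustration) shows the two giving different results on the same predictions:

```python
import numpy as np
from sklearn import metrics

# Toy data (made up): 2 positives among 6 samples
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.45, 0.40, 0.80])

# Hard labels at a 0.5 threshold discard the ordering within each side of the cut-off
y_label = (y_score > 0.5).astype(int)

auc_from_scores = metrics.roc_auc_score(y_true, y_score)
auc_from_labels = metrics.roc_auc_score(y_true, y_label)
print(auc_from_scores, auc_from_labels)
```

Here the scores rank 7 of the 8 positive/negative pairs correctly, while the hard labels tie one positive with every negative, so the label-based AUC comes out lower.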

## Step 4: Understanding Probability of prediction

We can also look at the predicted class probabilities from the random forest

In [None]:
y_valid_prob = model_rf.predict_proba(X_valid)

print("Probabilities",
      "\n",
      y_valid_prob[:10], 
      "\n\nPredictions\n",
      np.array(y_valid_pred[:10]))

In [None]:
model_rf.classes_

We can also find out the probability of prediction from each individual tree

In [None]:
y_train_prob_tree = np.stack([m.predict_proba(X_train)[:,1] for m in model_rf.estimators_])
y_valid_prob_tree = np.stack([m.predict_proba(X_valid)[:,1] for m in model_rf.estimators_])

y_train_prob_tree.shape, y_valid_prob_tree.shape

The random forest's probability score is the mean of the probabilities predicted by the individual trees.  
A **threshold value** of `0.5` on this mean is used to assign classes in this binary problem.  
This matches what `predict()` does: it chooses the class label with the highest mean probability across the trees.  

In [None]:
y_train_pred_tree = (y_train_prob_tree.mean(0) > 0.5).astype(int)
y_valid_pred_tree = (y_valid_prob_tree.mean(0) > 0.5).astype(int)

y_train_pred_tree.shape, y_valid_pred_tree.shape
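
The same relationship can be checked on a small synthetic dataset. This is a sketch under stated assumptions: `make_classification` stands in for the competition data, and the names `X_demo`, `y_demo` and `rf_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the competition data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Columns of predict_proba follow rf_demo.classes_, so column 1 is class 1
tree_probs = np.stack([t.predict_proba(X_demo)[:, 1] for t in rf_demo.estimators_])
mean_prob = tree_probs.mean(axis=0)
forest_prob = rf_demo.predict_proba(X_demo)[:, 1]

# The averaged tree probabilities should match the forest's own probabilities
print(np.allclose(mean_prob, forest_prob))

# Number of samples where thresholding the mean disagrees with predict()
n_mismatch = int(((mean_prob > 0.5).astype(int) != rf_demo.predict(X_demo)).sum())
print(n_mismatch)
```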

Let us compare the classes predicted from the individual trees with the predictions from `model_rf.predict()`.  
Notice below that the class labels are exactly the same for both the training and validation datasets.

In [None]:
# Count mismatches directly; summing signed differences would let +1 and -1 cancel out
(y_train_pred_tree != y_train_pred).sum(), (y_valid_pred_tree != y_valid_pred).sum()

## Step 5: Find the model performance with respect to number of trees

Let us repeat the earlier steps to find probabilities for each tree

In [None]:
# Repeating same code
y_train_prob_tree = np.stack([m.predict_proba(X_train)[:,1] for m in model_rf.estimators_])
y_valid_prob_tree = np.stack([m.predict_proba(X_valid)[:,1] for m in model_rf.estimators_])

# Find AUC using the first i+1 trees, scoring with the mean probability rather than thresholded labels
train_auc_trees = [metrics.roc_auc_score(y_train, y_train_prob_tree[:i+1].mean(0)) for i in range(num_trees)]
valid_auc_trees = [metrics.roc_auc_score(y_valid, y_valid_prob_tree[:i+1].mean(0)) for i in range(num_trees)]

len(train_auc_trees), len(valid_auc_trees)
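
Averaging the first `i+1` rows separately for every `i` repeats a lot of work. A running mean gives the same per-step averages in a single pass. The sketch below uses a small made-up array (`tree_probs` is hypothetical) in place of `y_train_prob_tree`:

```python
import numpy as np

# Hypothetical per-tree probabilities: 4 trees (rows) x 3 samples (columns)
tree_probs = np.array([[0.2, 0.6, 0.9],
                       [0.4, 0.4, 0.7],
                       [0.3, 0.8, 0.8],
                       [0.1, 0.6, 0.6]])

# Row i of the running mean is the forest score using the first i+1 trees
n_used = np.arange(1, tree_probs.shape[0] + 1)[:, None]
running_mean = np.cumsum(tree_probs, axis=0) / n_used

# Reference computation: slice and average for every i
slow = np.stack([tree_probs[:i + 1].mean(axis=0) for i in range(tree_probs.shape[0])])
print(np.allclose(running_mean, slow))
```

Each row of `running_mean` could then be passed to `metrics.roc_auc_score` in place of the sliced mean.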

Plot AUC against the number of trees.  
Notice that the AUC is more or less stable after about 100 trees

In [None]:
plt.figure(figsize=(10,5))

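# Use a 1-based x-axis so each point lines up with the number of trees used
plt.plot(range(1, len(train_auc_trees) + 1), train_auc_trees, label='Train AUC')
plt.plot(range(1, len(valid_auc_trees) + 1), valid_auc_trees, label='Validation AUC')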

plt.ylabel('Area under Curve (AUC)')
plt.xlabel('Number of Trees')

plt.legend()
plt.show()

## Step 6: Find predicted values for test data
Create the final model with the entire training data

In [None]:
model_rf = RandomForestClassifier(n_estimators=100,
                                  max_depth=4,
                                  class_weight='balanced')
model_rf.fit(X, y)

Next, read the test and sample submission data

In [None]:
df_test = pd.read_csv(input_dir + '/test.csv')
df_sample_subm = pd.read_csv(input_dir + '/sample_submission.csv')

df_test.shape, df_sample_subm.shape

In [None]:
X_test = df_test.loc[:,var_columns]

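# The competition is scored on AUC, so submit the positive-class probability rather than 0/1 labels
df_sample_subm['target'] = model_rf.predict_proba(X_test)[:, 1]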
df_sample_subm

In [None]:
output_dir = '/kaggle/working/'
df_sample_subm.to_csv(output_dir + '02_random_forest_scores.csv', index=False)