# Random Forest
---
In this kernel we're going to use a Decision Tree as a base model and then build a Random Forest classifier. Building on the previous analysis, we'll train several models on different training datasets:

1. The initial dataset (cleaned and standardized)
2. Resampled / PCA data
3. Balanced data

We'll then compare the scores of these models and determine the best preprocessing strategy for this case.

# Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import scikitplot as skplt

In [None]:
# load the train and test data files
train_clean_standarized = pd.read_csv("../input/feature-exploration-and-dataset-preparation/train_clean_standarized.csv", index_col=0)
train_resampled_PCA = pd.read_csv("../input/pca-principal-component-analysis/train_PCA.csv", index_col=0)
train_resampled = pd.read_csv("../input/resampling/train_resampled.csv", index_col=0)
test = pd.read_csv("../input/santander-customer-satisfaction/test.csv", index_col=0)

# 1. Base model - Decision tree

## 1.1 First attempt - Using the initial dataset

We're going to start with a simple decision tree as the base model against which we'll compare the rest of the models. We'll first use the original standardized dataset and look at the model performance:

In [None]:
# get our train test split data (25% test data)
y = train_clean_standarized.TARGET
X = train_clean_standarized.drop("TARGET", axis=1)
data_train, data_test, target_train, target_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# instantiate and fit the base model
# the hyperparameters are arbitrary at this stage (no tuning yet)
tree_clf = DecisionTreeClassifier(criterion='gini', max_depth=50)
tree_clf.fit(data_train, target_train)

In [None]:
# collect the feature importances and keep only those above 0.005
fea_imp = pd.DataFrame({'imp': tree_clf.feature_importances_, 'col': X.columns})
fea_imp = fea_imp[fea_imp.imp > .005].sort_values(['imp', 'col'], ascending=[True, False])
# horizontal bar chart; the ascending sort puts the most important feature at the top
fea_imp.plot(kind='barh', x='col', y='imp', legend=None)
plt.title('Decision Tree - Feature importance')
plt.ylabel('Features')
plt.xlabel('Importance');

As commented in other Kaggle kernels, we find that the most important features are:

- var38: Mortgage
- var15: Customer age
- saldo_var30: This may correspond to the current account balance

Now let's have a look at the model performance. For this, we're going to look at the confusion matrix and the classification report for the test set predictions:

In [None]:
# calculate the test set predictions
pred = tree_clf.predict(data_test)
skplt.metrics.plot_confusion_matrix(target_test, pred);

In [None]:
print(classification_report(target_test, pred))

Here we can see the problem of having a highly imbalanced dataset: the number of false positives (unsatisfied customers predicted as satisfied, taking class 0 as the positive class) is very high in relation to the true negatives, so even though the accuracy of the model is high (93%), this is the result of the model predicting the majority class (0 - customer satisfied) in most cases. In cases like this, where the classes are imbalanced, the f1-score is a better metric than accuracy.
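
To make the accuracy-versus-F1 point concrete, here is a minimal sketch on made-up labels (not the competition data). A classifier that always predicts the majority class reaches a high accuracy while its F1-score for the minority class is zero. The helper names `accuracy` and `f1_for_class` are illustrative and not part of this kernel.

```python
# Toy illustration with invented labels: 95 satisfied (0) and 5 unsatisfied (1) customers.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1-score of one class: harmonic mean of its precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


acc = accuracy(y_true, y_pred)
f1_minority = f1_for_class(y_true, y_pred, label=1)
f1_majority = f1_for_class(y_true, y_pred, label=0)
macro_f1 = (f1_minority + f1_majority) / 2  # unweighted mean over both classes

print(f"accuracy={acc:.2f}, f1(class 1)={f1_minority:.2f}, macro f1={macro_f1:.2f}")
```

The accuracy looks excellent, yet the model never finds a single unsatisfied customer, which is exactly what the F1-score exposes.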

## 1.2 Second attempt - Using the resampled dataset, a stratified train/test split and class weights
To overcome the issues caused by the imbalanced dataset we'll:
1. Use the dataset in which the majority class (satisfied customers) was downsampled.
2. Stratify the training / test split, so that both sets keep the same class proportions.
3. Assign different weights to the two classes in our model (even though we have resampled our data, satisfied customers are still the majority).
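
The class weight in step 3 can be derived from the class counts. A minimal sketch, assuming invented counts chosen so that the ratio comes out at 6.6 (the real counts would come from `train_resampled.TARGET`):

```python
from collections import Counter

# Invented labels: 660 satisfied (0) and 100 unsatisfied (1) customers.
labels = [0] * 660 + [1] * 100

counts = Counter(labels)
majority_count = max(counts.values())

# Weight each class by how many times smaller it is than the majority class.
class_weight = {label: majority_count / n for label, n in counts.items()}

print(class_weight)
```

The resulting dictionary has the same shape as the `class_weight` argument passed to the classifiers in this kernel. scikit-learn also accepts `class_weight='balanced'`, which computes weights inversely proportional to the class frequencies.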

In [None]:
# get our train test split data (25% test data)
y = train_resampled.TARGET
X = train_resampled.drop("TARGET", axis=1)

# stratify=y keeps the class proportions the same in the train and test sets
data_train, data_test, target_train, target_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)

# we give the unsatisfied customer class a higher weight (6.6, as there are 6.6 times more satisfied customers in the dataset)
tree_clf_resampled = DecisionTreeClassifier(criterion='gini', max_depth=50, class_weight={0:1,1:6.6})
tree_clf_resampled.fit(data_train, target_train)

# calculate the test set predictions and display the confusion matrix and classification report
pred = tree_clf_resampled.predict(data_test)
skplt.metrics.plot_confusion_matrix(target_test, pred)
print(classification_report(target_test, pred))
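
As a rough illustration of what `stratify=y` does in the split above, the following standard-library sketch splits each class separately so that both sets keep the same class proportions. The function `stratified_split` and the toy labels are hypothetical; the kernel itself relies on `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with the class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Invented labels: 80 satisfied (0) and 20 unsatisfied (1) customers.
toy_labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(toy_labels)

# Share of unsatisfied customers in each set.
train_share = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(train_share, test_share)
```

Note that stratifying does not balance the classes; it only guarantees that the minority class is represented in the same proportion in both sets.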

## 1.3 Conclusions

We can see that although the accuracy has decreased (from 93% to 82%), the precision of the model has increased notably and we have reduced the number of false positives.

# 2. Random forest

Let's build a Random Forest classifier and compare it with our base model. Initially, we'll pick some arbitrary hyperparameters. Later on, we'll run a Grid Search to find better ones:

In [None]:
# get our train test split data (25% test data)
y = train_resampled.TARGET
X = train_resampled.drop("TARGET", axis=1)

# stratify=y keeps the class proportions the same in the train and test sets
data_train, data_test, target_train, target_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)

# we give the unsatisfied customer class a higher weight (6.6, as there are 6.6 times more satisfied customers in the dataset)
forest_clf = RandomForestClassifier(n_estimators=100, max_depth=50, class_weight={0:1,1:6.6})
forest_clf.fit(data_train, target_train)

# calculate the test set predictions and display the confusion matrix and classification report
pred = forest_clf.predict(data_test)
skplt.metrics.plot_confusion_matrix(target_test, pred)
print(classification_report(target_test, pred))

Comparing the Random Forest with our base model, we can see that the precision is very similar and the accuracy has increased a little (from 82% to 87%).

## 2.1 Hyperparameter tuning
So far we've picked the Random Forest parameters arbitrarily. Now we're going to use a Grid Search to look for better values. Note that this is computationally expensive, so we're only going to test a small grid of parameters:

In [None]:
rf_param_grid = {
    'class_weight': [{0:1,1:3}, {0:1,1:6.6}, {0:1,1:10}],
    'max_depth': [None, 50, 100],
    'min_samples_split': [2, 40, 60],
    'min_samples_leaf': [1, 4]
}
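
To see why the grid is kept small, we can count how many models it implies. This sketch repeats the grid above and assumes the 3-fold cross-validation (`cv=3`) that this kernel passes to `GridSearchCV`:

```python
from itertools import product

rf_param_grid = {
    'class_weight': [{0: 1, 1: 3}, {0: 1, 1: 6.6}, {0: 1, 1: 10}],
    'max_depth': [None, 50, 100],
    'min_samples_split': [2, 40, 60],
    'min_samples_leaf': [1, 4],
}
cv_folds = 3  # matches cv=3 in the GridSearchCV call

# Every combination of one value per hyperparameter is one candidate.
candidates = list(product(*rf_param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # each candidate is fitted once per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} forest fits")
```

With `n_estimators=100`, each of those fits trains 100 trees, and `GridSearchCV` refits the best candidate once more on the full training set (its default `refit=True`).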

In [None]:
rf_grid_search = GridSearchCV(RandomForestClassifier(n_estimators=100), rf_param_grid, cv=3, return_train_score=True)
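
By default `GridSearchCV` ranks the candidates with the estimator's own `score` method, which is accuracy for classifiers. Given the earlier point about imbalanced classes, one option is to pass `scoring='f1_macro'` instead. A minimal sketch on synthetic data; the dataset from `make_classification`, the reduced grid and the small forest are assumptions chosen to keep it fast, not this kernel's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced toy data: roughly 90% class 0 and 10% class 1.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

toy_grid = {'max_depth': [3, None], 'min_samples_leaf': [1, 4]}

# scoring='f1_macro' makes the search select on macro-averaged F1 instead of accuracy.
toy_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    toy_grid,
    scoring='f1_macro',
    cv=3,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_, round(toy_search.best_score_, 3))
```

Whether this changes the selected parameters on the real data would have to be checked by running the search.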

In [None]:
rf_grid_search.fit(data_train, target_train)

In [None]:
rf_grid_search.best_params_

In [None]:
# calculate the test set predictions and display the confusion matrix and classification report
pred = rf_grid_search.best_estimator_.predict(data_test)
skplt.metrics.plot_confusion_matrix(target_test, pred)
print(classification_report(target_test, pred))

## 2.2 Random Forest with PCA

In [None]:
# get our train test split data (25% test data)
y = train_resampled_PCA.TARGET
X = train_resampled_PCA.drop("TARGET", axis=1)

# stratify=y keeps the class proportions the same in the train and test sets
data_train, data_test, target_train, target_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)

forest_clf_PCA = RandomForestClassifier(n_estimators=100, max_depth=50, class_weight={0:1,1:25}) 
forest_clf_PCA.fit(data_train, target_train)

# calculate the test set predictions and display the confusion matrix and classification report
pred = forest_clf_PCA.predict(data_test)
skplt.metrics.plot_confusion_matrix(target_test, pred)
print(classification_report(target_test, pred))

## 2.3 Conclusions
Comparing the Random Forest with our base Decision Tree, we can see that we have slightly improved our model score while maintaining a similar ratio of false positives, which is the main issue given that our dataset is highly imbalanced. Let's compare the different models:

Model                      | data         | accuracy | f1 avg | f1 weighted
---------------------------|--------------|----------|--------|------------
Base Model (Decision Tree) | standardized | .93      | .54    | .93
Base Model (Decision Tree) | resampled    | .81      | .60    | .81
Random Forest              | resampled    | .85      | .62    | .84
GridSearch Random Forest   | resampled    | .87      | .60    | .84
Random Forest              | standard PCA | .95      | .53    | .94
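
The gap between the two F1 columns comes from how the per-class scores are averaged. The sketch below assumes that "f1 avg" is the unweighted (macro) average and uses invented per-class scores and class sizes, not values taken from the models above. The macro average treats both classes equally, while the weighted average is dominated by the majority class.

```python
# Invented per-class results: F1-score and number of test samples (support).
per_class = {
    0: {'f1': 0.96, 'support': 960},  # satisfied customers (majority)
    1: {'f1': 0.12, 'support': 40},   # unsatisfied customers (minority)
}

# Macro average: every class counts the same, regardless of its size.
macro_f1 = sum(c['f1'] for c in per_class.values()) / len(per_class)

# Weighted average: every class counts in proportion to its support.
total_support = sum(c['support'] for c in per_class.values())
weighted_f1 = sum(c['f1'] * c['support'] for c in per_class.values()) / total_support

print(f"macro f1={macro_f1:.2f}, weighted f1={weighted_f1:.2f}")
```

This is why a model can show a weighted F1 above .90 while still performing poorly on the unsatisfied customers.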

# 3. Submission
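We're going to prepare the data to be submitted to the competition. However, we found that most models do not predict any unsatisfied customers. This is probably because we dropped too many data points during the resampling phase. Note also that `test.csv` is used here without the standardization applied to the first training dataset, which may contribute to the problem: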

In [None]:
# prepare submission test data
column_diff = np.setdiff1d(test.columns.values, train_resampled.columns.values)
test_clean = test.drop(column_diff, axis=1)
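
Dropping the extra columns leaves the remaining ones in the order of `test.csv`. A hedged alternative is to select the training feature columns explicitly, so that the test frame has the same columns in the same order as the data the model was fitted on. The toy frames below are invented, and `feature_cols` is a name introduced only for this sketch.

```python
import pandas as pd

# Invented stand-ins for the training and test frames.
train_toy = pd.DataFrame({'var15': [23, 40], 'var38': [1.0, 2.0], 'TARGET': [0, 1]})
test_toy = pd.DataFrame({'var38': [3.0], 'unused_col': [9], 'var15': [31]})

# Feature columns in training order, without the target.
feature_cols = train_toy.columns.drop('TARGET')

# Selecting by name drops the unused columns and fixes the column order;
# it raises a KeyError if a training feature is missing from the test data.
test_toy_clean = test_toy[feature_cols]

print(list(test_toy_clean.columns))
```

This only aligns the columns; any scaling applied to the training data would still have to be applied to the test data separately.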

In [None]:
# Decision Tree predictions
pred = tree_clf.predict(test_clean)
submission = pd.DataFrame({"ID":test_clean.index, "TARGET":pred})
#submission.to_csv("submission_DecisionTree.csv", index=False)
submission.TARGET.value_counts()

In [None]:
# Decision Tree predictions (resampled data)
pred = tree_clf_resampled.predict(test_clean)
submission = pd.DataFrame({"ID":test_clean.index, "TARGET":pred})
#submission.to_csv("submission_DecisionTree_Resampled.csv", index=False)
submission.TARGET.value_counts()

In [None]:
# Random Forest predictions
pred = forest_clf.predict(test_clean)
submission = pd.DataFrame({"ID":test_clean.index, "TARGET":pred})
#submission.to_csv("submission_RandomForest.csv", index=False)
submission.TARGET.value_counts()

In [None]:
# Grid Search Random Forest predictions
pred = rf_grid_search.best_estimator_.predict(test_clean)
submission = pd.DataFrame({"ID":test_clean.index, "TARGET":pred})
#submission.to_csv("submission_GridSearchRandomForest.csv", index=False)
submission.TARGET.value_counts()