# Building different tree-based models

You have now built two basic decision trees.

In the lecture, we talked about the disadvantages of decision trees and how they are sometimes not very accurate in their predictions. There are a couple of ways of overcoming that.

Inaccuracy through overfitting can be reduced by pruning. Another way of improving accuracy is to use multiple trees at once: an ensemble of them. Ensemble learning is a whole area of machine learning that aims to improve accuracy by combining several models.

In this activity we will build a random forest and a boosting model for the churn dataset. 
Let's first import the data.

## Dataset

We use our churn dataset again.

In [11]:
##### added line to ensure plots are showing
%matplotlib inline
#####

import pandas as pd
import numpy as np

df = pd.read_csv('churn_ibm.csv')
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


You already know the pre-processing steps from before:

In [12]:
y = df['Churn']
X = df.drop(['Churn','customerID'],axis=1)

for column in X.columns:
    if X[column].dtype == object:
        X = pd.concat([X,pd.get_dummies(X[column], prefix=column, drop_first=True)],axis=1).drop([column],axis=1)
        
y = pd.get_dummies(y, prefix='churn', drop_first=True)
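
As a side note, the column-by-column loop above can usually be written more compactly, because `pd.get_dummies` called on a whole DataFrame encodes only the text-like columns and leaves numeric ones alone. The sketch below uses a small made-up frame (the values are hypothetical; only the column names mirror the churn data) and should produce the same kind of dummy columns as the loop.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the churn data
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "gender": ["Female", "Male", "Male"],
})

# Called on a DataFrame, get_dummies encodes the text columns only;
# drop_first=True drops the first category of each encoded column
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```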

## Random forest

Remember random forests: they are ensembles of decision trees in which every split may only consider a random subset of the predictors. The trees then all vote together on an outcome.

Scikit-learn allows us to construct a random forest quite easily. Have a look at the documentation first:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

rf = RandomForestClassifier(n_estimators=20) 
# pick your number of trees for n_estimators: as usual, more trees can be more accurate but 
# are computationally more expensive. There is no fixed "perfect" number that you can use, it'll depend on
# your data and your computer's capacities

# you can also specify the splitting criterion, maximum depth, minimum number of samples per leaf etc. as usual

# also note: the parameter "bootstrap" defaults to True. We talked about bootstrapping in the lecture
# as another way of introducing randomness by sampling with replacement from the data and generating 
# multiple training sets that way




# scikit-learn expects the dependent variable as a 1-dimensional array, but y_train is a one-column DataFrame
# For this purpose, its values are flattened using ravel()
rf.fit(X_train,y_train.values.ravel())
prediction = rf.predict(X_test)
print('Accuracy:', accuracy_score(y_test,prediction))


Accuracy: 0.7800947867298578


In [14]:
# Ravel is a pretty handy function from numpy: 
# https://numpy.org/doc/stable/reference/generated/numpy.ravel.html
# It returns our Y as a 1-dimensional array of values

# Check this out to see what it does:

y

Unnamed: 0,churn_Yes
0,0
1,0
2,1
3,0
4,1
...,...
7027,0
7028,0
7029,0
7030,1


In [15]:
y.values.ravel()

array([0, 0, 1, ..., 0, 1, 0], dtype=uint8)
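
To make the effect of `ravel()` concrete, here is a minimal sketch on a made-up five-row target (the values are hypothetical). It only compares the array shapes before and after flattening.

```python
import pandas as pd

# Hypothetical one-column target, shaped like y in this notebook
y_toy = pd.DataFrame({"churn_Yes": [0, 0, 1, 0, 1]})

two_d = y_toy.values      # 2-dimensional: one row per customer, one column
flat = two_d.ravel()      # 1-dimensional: what fit() expects for y

print(two_d.shape)
print(flat.shape)
```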

There's a lot more to play around with in ```RandomForestClassifier```. You can print the parameters of your ```RandomForestClassifier``` instance with ```get_params()``` to learn about the default settings used.

This can be very important when reporting your results. For example, we've already said that ```bootstrap``` defaults to ```True```.

In [16]:
print(rf.get_params())

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 20, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


We can also change the parameters. Let's build a larger forest by raising the parameter ```n_estimators```:

In [17]:
rf2 = RandomForestClassifier(n_estimators=100)
rf2.fit(X_train,y_train.values.ravel())
prediction = rf2.predict(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))

Accuracy: 0.790521327014218


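As expected, our accuracy did improve a little, from 0.7801 to 0.7905. Because neither the train/test split nor the forest is seeded, the exact numbers will differ slightly from run to run.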
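
A difference of about one percentage point on a single split can easily be noise. One way to check, sketched below, is to fix `random_state` and compare cross-validated accuracy instead. The data here is a synthetic stand-in from `make_classification` (an assumption, not the churn dataset), and the names `X_demo`, `y_demo` and `mean_scores` are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

mean_scores = {}
for n_trees in (20, 100):
    # A fixed random_state makes the forest reproducible between runs
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[n_trees] = scores.mean()
    print(n_trees, "trees: mean accuracy", round(scores.mean(), 3),
          "+/-", round(scores.std(), 3))
```

If the gap between the two means is smaller than the spread across folds, the extra trees have probably not bought much.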

Notice what is contained in the prediction array: it's all binary, as our to-be-predicted class (churn) is measured as a yes/no question.

In [14]:
print('Prediction:',prediction[0:10])

Prediction: [0 0 0 0 1 0 0 0 0 0]


We can also output the probabilities for every class, as follows, and use them to compute the AUC:

In [18]:
prediction_prob = rf2.predict_proba(X_test)
print('Prediction:',prediction_prob[0:10])
print('AUC:',roc_auc_score(y_test,prediction_prob[:,1]))

Prediction: [[0.99 0.01]
 [0.99 0.01]
 [0.68 0.32]
 [0.74 0.26]
 [0.84 0.16]
 [0.29 0.71]
 [1.   0.  ]
 [0.59 0.41]
 [0.29 0.71]
 [0.32 0.68]]
AUC: 0.8287332790746316


Remember our AUC from earlier? It was 0.7174 for the classification tree that we built in the first exercise.

We can see that our RF improved that quite a bit.
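
AUC measures how well a model ranks positive cases above negative ones, which is why it is computed from `predict_proba` rather than from the 0/1 output of `predict`. The sketch below uses six made-up observations (hypothetical values) to show that thresholding the probabilities first throws away ranking information and generally changes the score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 1, 1, 1])
prob_pos = np.array([0.10, 0.30, 0.45, 0.40, 0.60, 0.90])

# Hard labels, as predict() would return with a 0.5 threshold
hard_labels = (prob_pos >= 0.5).astype(int)

auc_from_prob = roc_auc_score(y_true, prob_pos)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print("AUC from probabilities:", auc_from_prob)
print("AUC from hard labels:  ", auc_from_labels)
```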

## Boosting

Boosting is a way of learning from past mistakes. We build a sequence of small trees, each of which uses the errors that the previous one made to improve its own performance.
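
The idea of learning from the previous tree's errors can be sketched as one reweighting round of the textbook (discrete) AdaBoost algorithm. This is an illustration under simplifying assumptions, not scikit-learn's exact implementation: the labels and the weak learner's predictions below are made up, and classes are coded as -1/+1.

```python
import numpy as np

# Hypothetical true labels and predictions of one weak learner
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # the second observation is misclassified

# Start with equal weights on all observations
w = np.full(len(y_true), 1 / len(y_true))

miss = y_pred != y_true
err = np.sum(w[miss]) / np.sum(w)        # weighted error rate of this learner
alpha = 0.5 * np.log((1 - err) / err)    # the learner's say in the final vote

# Increase the weights of mistakes, decrease the others, then renormalise
w = w * np.exp(np.where(miss, alpha, -alpha))
w = w / w.sum()

print("error:", err, "alpha:", alpha)
print("new weights:", w)
```

After the update the misclassified observation carries much more weight than before, so the next tree in the sequence is pushed to get it right.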

sklearn allows us to implement two versions of the boosting method: AdaBoost and Gradient Boosting. Let's implement AdaBoost as an example.

In [20]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier()

# As usual, I would recommend checking out the documentation for this method. You'll find out which parameters
# are set or can be chosen by you.

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

# You can give AdaBoost a base estimator to boost (the base_estimator parameter), or you can let it use
# its default, a decision tree of depth 1.
# The method learns by looking at past incorrect classifications and then weighting them more heavily in its
# next iteration.

ada.fit(X_train,y_train.values.ravel())
prediction = ada.predict(X_test)
prediction_prob = ada.predict_proba(X_test)


print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_prob[:,1]))

Accuracy: 0.7943127962085308
AUC: 0.8448335637054754


Look at that accuracy and AUC - looks like we're improving just a little from our previous RF (Accuracy: 0.7905; AUC: 0.8287).

By default, AdaBoost uses 50 trees, which we can find out by looking at its parameters.

In [21]:
print(ada.get_params())

{'algorithm': 'SAMME.R', 'base_estimator': None, 'learning_rate': 1.0, 'n_estimators': 50, 'random_state': None}


Let's increase that number of trees to check whether we can improve our performance even further.

In [22]:
ada2 = AdaBoostClassifier(n_estimators=100)
ada2.fit(X_train,y_train.values.ravel())
prediction = ada2.predict(X_test)
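# take the probabilities from ada2 (the output recorded below was produced with ada, hence the identical AUC)
prediction_prob = ada2.predict_proba(X_test)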
print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_prob[:,1]))

Accuracy: 0.7943127962085308
AUC: 0.8448335637054754


Neither accuracy nor AUC improves: both printed values are identical to those of the 50-tree model. AdaBoost still performs slightly better than the random forest we built.

What does that tell us about our data? There might be observations that are particularly difficult for the tree(s) to classify. Boosted trees can perform well in such situations by focusing on those mistakes and weighting them more heavily.

## Grid search

A lot of these efforts can be streamlined using ```GridSearchCV```. Below, you can find code that tests different parameter combinations for random forests using cross-validation:

In [23]:
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.model_selection import GridSearchCV

# First we create a dictionary containing the parameters we want to test
# We include the values we want to test as lists 
parameters = {'min_samples_leaf':[1,5],'max_depth':[None,10]}

# Then, we bring together a classifier, the parameters, and set the number of folds for the CV
grid_search = GridSearchCV(RandomForestClassifier(n_estimators=20), parameters, cv=10)
grid_search.fit(X_train, y_train.values.ravel())

# The best predictor will be used for the prediction
prediction = grid_search.predict(X_test)
    
best_classifier = grid_search.best_estimator_

print('Best classifier:',best_classifier)
print('Accuracy:', accuracy_score(y_test,prediction))
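# Note: this AUC is computed from hard 0/1 predictions rather than probabilities,
# so it is not comparable to the probability-based AUC values reported earlier
print('AUC:',roc_auc_score(y_test,prediction))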

Best classifier: RandomForestClassifier(min_samples_leaf=5, n_estimators=20)
Accuracy: 0.7985781990521327
AUC: 0.693830876535688


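It seems that having a minimum of 5 samples per leaf is preferable. The printed classifier shows no ```max_depth```, which means the default (```None```, no depth limit) was selected rather than a maximum depth of 10.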
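
Rather than reading the chosen settings off the printed classifier, you can ask the fitted search object directly through `best_params_` and `best_score_`. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption, not the churn dataset), and also computes the AUC from `predict_proba`, since the cell above passes hard labels to `roc_auc_score`. The names `X_demo`, `search` and `auc_demo` are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

param_grid = {"min_samples_leaf": [1, 5], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                      param_grid, cv=3)
search.fit(X_tr, y_tr)

# The winning combination and its mean cross-validated accuracy
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)

# AUC from the predicted probability of the positive class
auc_demo = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
print("AUC:", auc_demo)
```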

## Feature importance

For all our models, we can calculate the feature importance. This is the (average) reduction in Gini impurity that splits on a given feature achieve across all trees.

Feature importance can be an important way of interpreting your results. It's also very interesting to compare this across different methods and understand how different algorithms value different features.

In [24]:
# Random forest - 5 most important features
for c, column in enumerate(X_test.columns):
    if rf.feature_importances_[c] in sorted(rf.feature_importances_)[-5:]:
        print('Variable',column,rf.feature_importances_[c])

Variable tenure 0.16727768165830217
Variable MonthlyCharges 0.17709360868438845
Variable TotalCharges 0.19947377470287506
Variable InternetService_Fiber optic 0.04238898476420349
Variable PaymentMethod_Electronic check 0.03549385541802868


In [25]:
# AdaBoost - 5 most important features (features tied at the cut-off are all printed)
for c, column in enumerate(X_test.columns):
    if ada.feature_importances_[c] in sorted(ada.feature_importances_)[-5:]:
        print('Variable',column,ada.feature_importances_[c])

Variable tenure 0.2
Variable MonthlyCharges 0.16
Variable TotalCharges 0.28
Variable InternetService_Fiber optic 0.04
Variable Contract_One year 0.04
Variable Contract_Two year 0.04
Variable PaymentMethod_Electronic check 0.04
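
The loops above print every feature tied with the fifth-largest value, which is why AdaBoost lists seven variables. A sketch of an alternative is to wrap the importances in a pandas Series and use `nlargest`, which returns exactly five rows, sorted. The values below are copied from the AdaBoost output above; with a fitted model you would presumably build the Series from `ada.feature_importances_` and `X_train.columns` instead.

```python
import pandas as pd

# Importances copied from the AdaBoost output above
importances = pd.Series({
    "tenure": 0.2,
    "MonthlyCharges": 0.16,
    "TotalCharges": 0.28,
    "InternetService_Fiber optic": 0.04,
    "Contract_One year": 0.04,
    "Contract_Two year": 0.04,
    "PaymentMethod_Electronic check": 0.04,
})

# nlargest sorts in descending order and keeps exactly five entries;
# among tied values, the ones appearing first are kept by default
top5 = importances.nlargest(5)
print(top5)
```

Keep in mind that with ties the cut-off is arbitrary, so the fifth entry is no more important than the features left out.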


It seems that random forests and AdaBoost largely agree on which variables drive Gini impurity down: mostly the length of tenure, monthly charges, total charges, being connected through fiber and paying by electronic check. AdaBoost additionally ranks the contract length on a par with the last two.

Having both methods value the same features can strengthen the recommendations we make to the company.