# Mini Project: Tree-Based Algorithms

## The "German Credit" Dataset

### Dataset Details

This dataset has two classes (these would be considered labels in Machine Learning terms) to describe the worthiness of a personal loan: "Good" or "Bad". There are predictors related to attributes, such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, installment rate in percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, number of people being liable to provide maintenance for, telephone, and foreign worker status.

Many of these predictors are discrete and have been expanded into several 0/1 indicator variables (a.k.a. they have been one-hot-encoded).
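To make the idea of indicator variables concrete, here is a minimal sketch of one-hot encoding with pandas. The toy frame and its values are hypothetical (they are not rows from the dataset); only the dotted column-naming pattern is borrowed from the dataset's column names.

```python
import pandas as pd

# Hypothetical toy frame; the category labels imitate the dataset's column suffixes
toy = pd.DataFrame({
    "Duration": [6, 48, 12],
    "CheckingAccountStatus": ["lt.0", "0.to.200", "none"],
})

# Expand the categorical column into one 0/1 indicator column per category
encoded = pd.get_dummies(
    toy, columns=["CheckingAccountStatus"], prefix_sep=".", dtype=int
)
print(encoded.columns.tolist())
```

Each row has exactly one indicator set to 1, which is what "one-hot" refers to.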

This dataset has been kindly provided by Professor Dr. Hans Hofmann of the University of Hamburg, and can also be found on the UCI Machine Learning Repository.

## Decision Trees

As we have learned in the previous lectures, Decision Trees as a family of algorithms (irrespective of the particular implementation) are powerful and can produce models with higher predictive accuracy than linear models such as Linear or Logistic Regression. This is primarily because Decision Trees can model nonlinear relationships and have a number of tuning parameters that allow the practitioner to achieve the best possible model. An added bonus is the ability to visualize the trained Decision Tree model, which gives some insight into how the model has produced its predictions. One caveat to keep in mind is that, depending on the size of the dataset (both the number of records and the number of features), the visualization might be very large and complex, making it harder to interpret.

To give you a very good example of how Decision Trees can be visualized and interpreted, we strongly recommend that, before continuing with the problems in this Mini Project, you take the time to read this fantastic, detailed and informative blog post: http://explained.ai/decision-tree-viz/index.html

## Building Your First Decision Tree Model

So, now it's time to jump straight into the heart of the matter. Your first task is to build a Decision Tree model using the aforementioned "German Credit" dataset, which contains 1,000 records and 62 columns (one of them holds the labels, and the other 61 are the potential features for the model).

For this task, you will be using the scikit-learn library, which comes pre-installed with the Anaconda Python distribution. In case you're not using Anaconda, you can easily install it using pip.

Before embarking on creating your first model, we would strongly encourage you to read the short tutorial for Decision Trees in scikit-learn (http://scikit-learn.org/stable/modules/tree.html), and then dive a bit deeper into the documentation of the algorithm itself (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). 

Also, since you want to be able to present the results of your model, we suggest you take a look at the tutorial for accuracy metrics for classification models (http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) as well as the more detailed documentation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

Finally, an *amazing* resource that explains the various classification model accuracy metrics, as well as the relationships between them, can be found on Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix
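As a small illustration of how these metrics relate to the confusion matrix, the sketch below derives precision, recall and F1 by hand and compares them with scikit-learn's own functions. The label vectors are made up for the example and have nothing to do with the credit data.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 1 is the positive class, 0 the negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

Note that scikit-learn's metric functions take the true labels first and the predictions second.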

(Note: as you've already learned in the Logistic Regression mini project, a standard practice in Machine Learning for achieving the best possible result when training a model is hyperparameter tuning through Grid Search and k-fold Cross Validation. We strongly encourage you to use it here as well, not just because it's standard practice, but also because it's not going to be too computationally intensive, given the size of the dataset that you're working with. Our suggestion is that you split the data into 70% training and 30% testing. Then do the hyperparameter tuning and Cross Validation on the training set, and afterwards do a final test on the testing set.)

### Now we pass the torch onto you! You can start building your first Decision Tree model! :)

In [155]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, f1_score, make_scorer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

In [156]:
credit_dt = pd.read_csv('GermanCredit.csv')
print(credit_dt.columns)
print(credit_dt.shape)
print(credit_dt.head(10))

Index(['Duration', 'Amount', 'InstallmentRatePercentage', 'ResidenceDuration',
       'Age', 'NumberExistingCredits', 'NumberPeopleMaintenance', 'Telephone',
       'ForeignWorker', 'Class', 'CheckingAccountStatus.lt.0',
       'CheckingAccountStatus.0.to.200', 'CheckingAccountStatus.gt.200',
       'CheckingAccountStatus.none', 'CreditHistory.NoCredit.AllPaid',
       'CreditHistory.ThisBank.AllPaid', 'CreditHistory.PaidDuly',
       'CreditHistory.Delay', 'CreditHistory.Critical', 'Purpose.NewCar',
       'Purpose.UsedCar', 'Purpose.Furniture.Equipment',
       'Purpose.Radio.Television', 'Purpose.DomesticAppliance',
       'Purpose.Repairs', 'Purpose.Education', 'Purpose.Vacation',
       'Purpose.Retraining', 'Purpose.Business', 'Purpose.Other',
       'SavingsAccountBonds.lt.100', 'SavingsAccountBonds.100.to.500',
       'SavingsAccountBonds.500.to.1000', 'SavingsAccountBonds.gt.1000',
       'SavingsAccountBonds.Unknown', 'EmploymentDuration.lt.1',
       'EmploymentDuration.1.to

In [157]:
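credit_dt.describe()
# No missing values
# Encode the target as Good -> 1, Bad -> 0 (assign back rather than using a chained inplace replace)
credit_dt['Class'] = credit_dt['Class'].map({'Good': 1, 'Bad': 0})
credit_dt['Class'].value_counts()
credit_dt.shape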

(1000, 62)

In [158]:
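from sklearn.feature_selection import VarianceThreshold
# Drop features whose variance is below 0.8 * (1 - 0.8) = 0.16;
# for a 0/1 indicator this means one value appears in more than 80% of rows
threshold_n = 0.8
sel = VarianceThreshold(threshold=(threshold_n * (1 - threshold_n)))
sel_var = sel.fit_transform(credit_dt)
# Columns that survive the filter (inspected only; the models below still use all features)
credit_dt[credit_dt.columns[sel.get_support(indices=True)]]
print(sel_var)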

[[   6 1169    4 ...,    1    0    1]
 [  48 5951    2 ...,    1    0    1]
 [  12 2096    2 ...,    1    1    0]
 ..., 
 [  12  804    4 ...,    1    0    1]
 [  45 1845    4 ...,    0    0    1]
 [  45 4576    3 ...,    1    0    1]]


# Split training and testing datasets

In [159]:
X = credit_dt.drop('Class', axis=1)
y = credit_dt['Class']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state =5)

In [160]:
# DecisionTreeClassifier
dtree_model = DecisionTreeClassifier()

scores = cross_val_score(dtree_model, X_train, y_train, cv=5, scoring='f1_macro')
scores.mean()

0.65346827247647721

In [161]:
# Fit the model
dtree_model.fit(X_train, y_train)

# Make predictions
train_predictions = dtree_model.predict(X_train)
test_predictions = dtree_model.predict(X_test)

print (classification_report(y_test, test_predictions))
scores = cross_val_score(dtree_model,X_train, y_train, cv= 5, scoring = 'f1_macro')
print('scores: ',scores)
scores.mean()

             precision    recall  f1-score   support

          0       0.43      0.48      0.45        82
          1       0.79      0.76      0.78       218

avg / total       0.69      0.68      0.69       300

scores:  [ 0.6229732   0.77632515  0.59528215  0.61353105  0.6713948 ]


0.65590126848529495

In [162]:
# f1_score expects the true labels first, then the predictions
print('The Training F1 Score is', f1_score(y_train, train_predictions))
print('The Testing F1 Score is', f1_score(y_test, test_predictions))

# The model is overfitting: it scores perfectly on the training set but noticeably lower on the testing set.

The Training F1 Score is 1.0
The Testing F1 Score is 0.777517564403


In [164]:
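# np.linspace(5, 10, 5) yields floats (5.0, 6.25, 7.5, 8.75, 10.0); recent scikit-learn
# releases require an integer max_depth, in which case use e.g. list(range(5, 11)) instead.
max_depth = np.linspace(5, 10, 5)
parameters={'criterion': ['gini', 'entropy'], 'max_depth': max_depth}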

scorer = make_scorer(f1_score)
#use gridsearch to test all values
dtree_gscv = GridSearchCV(dtree_model, parameters, scoring = scorer)

# fit model to data
dtree_gscv.fit(X_train, y_train)

print( dtree_gscv.best_params_)

best_dtree_model= dtree_gscv.best_estimator_

scores = cross_val_score(best_dtree_model, X_train, y_train, cv=5, scoring='f1_macro')
print(scores.mean())

# Fit the best model
best_dtree_model.fit(X_train, y_train)

# Make predictions using the new model.
best_train_predictions = best_dtree_model.predict(X_train)
best_test_predictions = best_dtree_model.predict(X_test)

# Calculate the f1_score of the new model (true labels first, then predictions).
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))

{'criterion': 'entropy', 'max_depth': 6.25}
0.650348459743
The training F1 Score is 0.878296146045
The testing F1 Score is 0.773455377574


### After you've built the best model you can, now it's time to visualize it!

Remember that amazing blog post from a few paragraphs ago, which demonstrated how to visualize and interpret the results of a Decision Tree model? We've seen that this approach can work very well, but let's see how it does on the "German Credit" dataset that we're working on, since it is a bit larger than the one used by the blog authors.

First, we're going to need to install their package. If you're using Anaconda, this can be done easily by running:

In [181]:
! pip install dtreeviz
#! pip install graphviz



If for any reason this way of installing doesn't work for you straight out of the box, please refer to the more detailed documentation here: https://github.com/parrt/dtreeviz

Now you're ready to visualize your Decision Tree model! Please feel free to use the blog post for guidance and inspiration!

In [184]:
# Legacy dtreeviz (pre-2.0) API: pass the fitted tree plus the training data it was fit on
from dtreeviz.trees import dtreeviz
viz = dtreeviz(best_dtree_model,
               X_train.values,
               y_train.values,
               target_name='Class',
               feature_names=list(X_train.columns),
               class_names=["Bad", "Good"],  # index order follows the encoding: 0 = Bad, 1 = Good
               fancy=False)  # fancy=False removes histograms/scatterplots from decision nodes

viz.view()

ModuleNotFoundError: No module named 'dtreeviz'
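If dtreeviz cannot be installed in your environment, scikit-learn's built-in `export_text` offers a plain-text view of a fitted tree. The sketch below is only an illustration: it uses synthetic data from `make_classification` and placeholder feature names, not the credit data. In the notebook you would pass your own fitted tree and `list(X_train.columns)` instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption: used here only so the example is self-contained)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # placeholder names

# A shallow tree keeps the printed rules short enough to read
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# One line per split or leaf, indented by depth
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

Limiting `max_depth` is a deliberate choice here: deep trees produce rule listings that are as hard to read as a very large diagram.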

## Random Forests

As discussed in the lecture videos, Decision Tree algorithms also have certain undesirable properties. Mainly, they have low bias, which is good, but tend to have high variance, which is *not* so good (more about this problem here: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).

Noticing these problems, the late Professor Leo Breiman, in 2001, developed the Random Forests algorithm, which mitigates these problems, while at the same time providing even higher predictive accuracy than the majority of Decision Tree algorithm implementations. While the curriculum contains two excellent lectures on Random Forests, if you're interested, you can dive into the original paper here: https://link.springer.com/content/pdf/10.1023%2FA%3A1010933404324.pdf.

In the next part of this assignment, you are going to use the same "German Credit" dataset to train, tune, and measure the performance of a Random Forests model. You will also see certain functionalities that this model, even though it's a bit of a "black box", provides for some degree of interpretability.

First, let's build a Random Forests model, using the same best practices that you've used for your Decision Trees model. You can reuse the things you've already imported there, so no need to do any re-imports, new train/test splits, or loading up the data again.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [187]:
rf_model = RandomForestClassifier().fit(X_train, y_train)

# Predict target variables y for test data
# Make predictions
train_predictions = rf_model.predict(X_train)
test_predictions = rf_model.predict(X_test)

print (classification_report(y_test, test_predictions))
scores = cross_val_score(rf_model,X_train, y_train, cv= 5, scoring = 'f1_macro')
print('scores: ',scores)
scores.mean()
# f1_score expects the true labels first, then the predictions
print('The Training F1 Score is', f1_score(y_train, train_predictions))
print('The Testing F1 Score is', f1_score(y_test, test_predictions))

             precision    recall  f1-score   support

          0       0.48      0.45      0.47        82
          1       0.80      0.82      0.81       218

avg / total       0.71      0.72      0.71       300

scores:  [ 0.70267104  0.74468085  0.67058824  0.64030812  0.6198079 ]
The Training F1 Score is 0.991718426501
The Testing F1 Score is 0.807256235828


In [189]:
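# np.linspace(5, 10, 5) yields floats (5.0, 6.25, 7.5, 8.75, 10.0); recent scikit-learn
# releases require an integer max_depth, in which case use e.g. list(range(5, 11)) instead.
max_depth = np.linspace(5, 10, 5)
parameters={'criterion': ['gini', 'entropy'], 'max_depth': max_depth}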

scorer = make_scorer(f1_score)
#use gridsearch to test all values
rf_gscv = GridSearchCV(rf_model, parameters, scoring = scorer)

# fit model to data
rf_gscv.fit(X_train, y_train)

print( rf_gscv.best_params_)

best_rf_model= rf_gscv.best_estimator_

scores = cross_val_score(best_rf_model, X_train, y_train, cv=5, scoring='f1_macro')
print(scores.mean())

# Fit the best model
best_rf_model.fit(X_train, y_train)

# Make predictions using the new model.
best_train_predictions = best_rf_model.predict(X_train)
best_test_predictions = best_rf_model.predict(X_test)

# Calculate the f1_score of the new model (true labels first, then predictions).
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))

{'criterion': 'entropy', 'max_depth': 10.0}
0.656803076735
The training F1 Score is 0.976506639428
The testing F1 Score is 0.826086956522


As mentioned, there are certain ways to "peek" into a model created by the Random Forests algorithm. The first, and most popular one, is the Feature Importance calculation functionality. This allows the ML practitioner to see an ordering of the importance of the features that have contributed the most to the predictive accuracy of the model. 

You can see how to use this in the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_). Now, if you tried this, you would just get an ordered table of not directly interpretable numeric values. Thus, it's much more useful to show the feature importance in a visual way. You can see an example of how that's done here: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py

Now you try! Let's visualize the importance of features from your Random Forests model!

In [None]:
# Your code here
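One possible starting point, sketched on synthetic data rather than the credit data: collect `feature_importances_` into a labeled, sorted pandas Series. The data from `make_classification` and the `feature_i` names are assumptions made for the example. With your own model you would use `best_rf_model.feature_importances_` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

If matplotlib is installed, `importances.plot.barh()` turns the sorted Series into a horizontal bar chart, which is usually easier to read than the raw numbers.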

A final method for gaining some insight into the inner working of your Random Forests models is a so-called Partial Dependence Plot. The Partial Dependence Plot (PDP or PD plot) shows the marginal effect of a feature on the predicted outcome of a previously fit model. The prediction function is fixed at a few values of the chosen features and averaged over the other features. A partial dependence plot can show if the relationship between the target and a feature is linear, monotonic or more complex. 
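The averaging described above can be written out by hand in a few lines. The following is a minimal sketch, not the PDPbox implementation: the helper `partial_dependence_1d` is a hypothetical name introduced here, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def partial_dependence_1d(model, X, feature_idx, grid):
    """Average predicted P(class 1) with one feature forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # fix the chosen feature for every row
        # Averaging over rows marginalizes over the remaining features
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)


# Synthetic stand-in data (assumption: not the credit dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Evaluate the first feature on 10 evenly spaced values across its observed range
grid = np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), 10)
pd_values = partial_dependence_1d(forest, X_demo, 0, grid)
print(np.column_stack([grid, pd_values]))
```

Plotting `pd_values` against `grid` gives the partial dependence curve for that feature. A flat curve suggests the model's average prediction barely depends on it.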

In scikit-learn, PDPs are implemented and available for certain algorithms, but at the time of writing (version 0.20.0) they were not yet implemented for Random Forests; from version 0.22 onward, the `sklearn.inspection` module provides partial dependence tools that work with any fitted estimator. For older versions, there is an add-on package called **PDPbox** (https://pdpbox.readthedocs.io/en/latest/) which adds this functionality to Random Forests. The package is easy to install through pip.

In [70]:
! pip install pdpbox

Collecting pdpbox
  Downloading https://files.pythonhosted.org/packages/87/23/ac7da5ba1c6c03a87c412e7e7b6e91a10d6ecf4474906c3e736f93940d49/PDPbox-0.2.0.tar.gz (57.7MB)
Building wheels for collected packages: pdpbox
  Running setup.py bdist_wheel for pdpbox ... done
  Stored in directory: /home/ubuntu/.cache/pip/wheels/7d/08/51/63fd122b04a2c87d780464eeffb94867c75bd96a64d500a3fe
Successfully built pdpbox
mxnet-cu80 1.2.0 has requirement graphviz<0.9.0,>=0.8.1, but you'll have graphviz 0.13.2 which is incompatible.
Installing collected packages: pdpbox
Successfully installed pdpbox-0.2.0
You are using pip version 10.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


While we encourage you to read the documentation for the package (and reading package documentation in general is a good habit to develop), the authors of the package have also written an excellent blog post on how to use it, showing examples on different algorithms from scikit-learn (the Random Forests example is towards the end of the blog post): https://briangriner.github.io/Partial_Dependence_Plots_presentation-BrianGriner-PrincetonPublicLibrary-4.14.18-updated-4.22.18.html

So, armed with this new knowledge, feel free to pick a few features, and make a couple of Partial Dependence Plots of your own!

In [None]:
# Your code here!

## (Optional) Advanced Boosting-Based Algorithms

As explained in the video lectures, the next generation of algorithms after Random Forests (which use Bagging, a.k.a. Bootstrap Aggregation) were developed using Boosting. The first of these was Gradient Boosted Machines, which are implemented in scikit-learn (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting).

Still, in recent years, a number of variations on GBMs have been developed by different research and industry groups, all of them bringing improvements in speed, accuracy and functionality to the original Gradient Boosting algorithms.

In no order of preference, these are:
1. **XGBoost**: https://xgboost.readthedocs.io/en/latest/
2. **CatBoost**: https://tech.yandex.com/catboost/
3. **LightGBM**: https://lightgbm.readthedocs.io/en/latest/

If you're using the Anaconda distribution, these are all very easy to install:

In [None]:
! conda install -c anaconda py-xgboost

In [86]:
#! conda install -c conda-forge catboost
! pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/97/c4/586923de4634f88a31fd1b4966e15707a912b98b6f4566651b5ef58f36b5/catboost-0.20.2-cp36-none-manylinux1_x86_64.whl (63.9MB)
Collecting pandas>=0.24.0 (from catboost)
  Downloading https://files.pythonhosted.org/packages/52/3f/f6a428599e0d4497e1595030965b5ba455fd8ade6e977e3c819973c4b41d/pandas-0.25.3-cp36-cp36m-manylinux1_x86_64.whl (10.4MB)
Collecting numpy>=1.16.0 (from catboost)
  Downloading https://files.pythonhosted.org/packages/62/20/4d43e141b5bc426ba38274933ef8e76e85c7adea2c321ecf9ebf7421cedf/numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl (20.1MB)

In [None]:
! conda install -c conda-forge lightgbm

Your task in this optional section of the mini project is to read the documentation of these three libraries, and apply all of them to the "German Credit" dataset, just like you did in the case of Decision Trees and Random Forests.

In [76]:
from sklearn.ensemble import GradientBoostingClassifier
X = credit_dt.drop('Class', axis=1)
y = credit_dt['Class']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state = 5)
gbclf = GradientBoostingClassifier().fit(X_train, y_train)

# Predict target variables y for test data
y_pred = gbclf.predict(X_test)
# Report on y_pred (the original referenced an undefined name, y_predict)
print(classification_report(y_test, y_pred))


             precision    recall  f1-score   support

          0       0.25      0.27      0.26        79
          1       0.73      0.71      0.72       221

avg / total       0.60      0.60      0.60       300



In [190]:
# Credit worthiness is a classification task, so use the classifier rather than the regressor
from catboost import CatBoostClassifier
X = credit_dt.drop('Class', axis=1)
y = credit_dt['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
catclf = CatBoostClassifier(verbose=0).fit(X_train, y_train)  # verbose=0 silences per-iteration logs

# Predict target variables y for test data
y_pred = catclf.predict(X_test)
print(classification_report(y_test, y_pred))

ModuleNotFoundError: No module named 'catboost'

The final deliverable of this section should be a table (it can be a pandas DataFrame) that shows the accuracy of all five algorithms taught in this mini project in one place.
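One way such a table could be assembled is sketched below. To stay self-contained, it uses synthetic data and only the three scikit-learn models; the dictionary keys, the `results` name and the choice of columns are illustrative assumptions. XGBoost, CatBoost and LightGBM classifiers could be added to the dictionary in the same way once installed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with the credit features and labels)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, preds),  # true labels first
        "f1": f1_score(y_te, preds),
    })

# One row per model, one column per metric
results = pd.DataFrame(rows).set_index("model")
print(results)
```

Scoring every model on the same held-out split keeps the rows of the table comparable.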

Happy modeling! :)