# Mini Project: Tree-Based Algorithms

## The "German Credit" Dataset

### Dataset Details

This dataset has two classes (these would be considered labels in Machine Learning terms) to describe the worthiness of a personal loan: "Good" or "Bad". There are predictors related to attributes, such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, installment rate in percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, number of people being liable to provide maintenance for, telephone, and foreign worker status.

Many of these predictors are discrete and have been expanded into several 0/1 indicator variables (a.k.a. they have been one-hot-encoded).
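As a rough illustration of what one-hot encoding does, here is a minimal sketch using `pandas.get_dummies`. The toy rows are hypothetical; only the `Housing.*` naming pattern is taken from the dataset, and `prefix_sep="."` is an assumption chosen to mimic it.

```python
import pandas as pd

# Hypothetical toy data; the category values mirror the Housing.* columns.
raw = pd.DataFrame({
    "Age": [67, 22, 49],
    "Housing": ["Own", "Rent", "ForFree"],
})

# Expand the categorical column into one 0/1 indicator column per category.
encoded = pd.get_dummies(raw, columns=["Housing"], prefix_sep=".", dtype=int)
print(encoded)
```

Each row ends up with exactly one `1` across the three `Housing.*` columns, which is what "one-hot" refers to.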

This dataset has been kindly provided by Professor Dr. Hans Hofmann of the University of Hamburg, and can also be found on the UCI Machine Learning Repository.

## Decision Trees

As we have learned in the previous lectures, Decision Trees as a family of algorithms (irrespective of the particular implementation) are powerful algorithms that can produce models with higher predictive accuracy than linear models such as Linear or Logistic Regression. This is primarily because DTs can model nonlinear relationships and have a number of tuning parameters that allow the practitioner to achieve the best possible model. An added bonus is the ability to visualize the trained Decision Tree model, which gives some insight into how the model has produced its predictions. One caveat to keep in mind is that, depending on the size of the dataset (both the number of records and the number of features), the visualization might prove to be very large and complex, making it harder to interpret.
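To make the point about nonlinear relationships concrete, here is a small sketch on synthetic data (not the German Credit data). The positive class sits in the middle of a one-dimensional range, so no single linear threshold can separate it. The scores are training accuracy, so they show what each model can express, not how well it generalizes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic 1-D data: class 1 only for |x| < 1, class 0 on both sides.
x_toy = np.linspace(-3, 3, 61).reshape(-1, 1)
y_toy = (np.abs(x_toy.ravel()) < 1).astype(int)

toy_tree = DecisionTreeClassifier(random_state=0).fit(x_toy, y_toy)
toy_logit = LogisticRegression().fit(x_toy, y_toy)

tree_acc = toy_tree.score(x_toy, y_toy)
logit_acc = toy_logit.score(x_toy, y_toy)
print(f"tree training accuracy:     {tree_acc:.2f}")
print(f"logistic training accuracy: {logit_acc:.2f}")
```

The tree needs only two splits to isolate the middle interval, while the logistic model is limited to a single threshold on `x`.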

To give you a very good example of how Decision Trees can be visualized and interpreted, we strongly recommend that, before continuing with the problems in this Mini Project, you take the time to read this fantastic, detailed and informative blog post: http://explained.ai/decision-tree-viz/index.html

## Building Your First Decision Tree Model

So, now it's time to jump straight into the heart of the matter. Your first task is to build a Decision Tree model using the aforementioned "German Credit" dataset, which contains 1,000 records and 62 columns (one holds the labels, and the other 61 hold the potential features for the model).

For this task, you will be using the scikit-learn library, which comes already pre-installed with the Anaconda Python distribution. In case you're not using that, you can easily install it using pip.

Before embarking on creating your first model, we would strongly encourage you to read the short tutorial for Decision Trees in scikit-learn (http://scikit-learn.org/stable/modules/tree.html), and then dive a bit deeper into the documentation of the algorithm itself (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). 

Also, since you want to be able to present the results of your model, we suggest you take a look at the tutorial for accuracy metrics for classification models (http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) as well as the more detailed documentation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

Finally, an *amazing* resource that explains the various classification model accuracy metrics, as well as the relationships between them, can be found on Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix

(Note: as you've already learned in the Logistic Regression mini project, a standard practice in Machine Learning for achieving the best possible result when training a model is to use hyperparameter tuning, through Grid Search and k-fold Cross Validation. We strongly encourage you to use it here as well, not just because it's standard practice, but also because it's not going to be too computationally intensive, given the size of the dataset that you're working with. Our suggestion here is that you split the data into 70% training and 30% testing. Then, do the hyperparameter tuning and Cross Validation on the training set, and afterwards do a final test on the testing set.)
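The workflow suggested in the note can be sketched as follows. This is a minimal outline on synthetic stand-in data, not a reference solution. The parameter grid, the `stratify` argument, and all variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (1,000 rows, binary label).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70% training / 30% testing; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Tune on the training set only, using 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# Touch the test set once, at the very end.
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

Keeping the test set out of the tuning loop is what makes the final score an honest estimate of performance on unseen data.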

### Now we pass the torch onto you! You can start building your first Decision Tree model! :)

In [108]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [109]:
# Your code here! :)
# data = pd.read_csv('GermanCredit.csv.zip', dtype = {'POSTAL_CODE': str})
data = pd.read_csv('GermanCredit.csv.zip')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 62 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   Duration                                1000 non-null   int64 
 1   Amount                                  1000 non-null   int64 
 2   InstallmentRatePercentage               1000 non-null   int64 
 3   ResidenceDuration                       1000 non-null   int64 
 4   Age                                     1000 non-null   int64 
 5   NumberExistingCredits                   1000 non-null   int64 
 6   NumberPeopleMaintenance                 1000 non-null   int64 
 7   Telephone                               1000 non-null   int64 
 8   ForeignWorker                           1000 non-null   int64 
 9   Class                                   1000 non-null   object
 10  CheckingAccountStatus.lt.0              1000 non-null   int64 
 11  Check

In [110]:
data.describe()

||Duration|Amount|InstallmentRatePercentage|ResidenceDuration|Age|NumberExistingCredits|NumberPeopleMaintenance|Telephone|ForeignWorker|CheckingAccountStatus.lt.0|...|OtherInstallmentPlans.Bank|OtherInstallmentPlans.Stores|OtherInstallmentPlans.None|Housing.Rent|Housing.Own|Housing.ForFree|Job.UnemployedUnskilled|Job.UnskilledResident|Job.SkilledEmployee|Job.Management.SelfEmp.HighlyQualified|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|count|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|...|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|1000.0|
|mean|20.903|3271.258|2.973|2.845|35.546|1.407|1.155|0.596|0.963|0.274|...|0.139|0.047|0.814|0.179|0.713|0.108|0.022|0.2|0.63|0.148|
|std|12.058814|2822.736876|1.118715|1.103718|11.375469|0.577654|0.362086|0.490943|0.188856|0.446232|...|0.34612|0.211745|0.389301|0.383544|0.452588|0.310536|0.146757|0.4002|0.483046|0.355278|
|min|4.0|250.0|1.0|1.0|19.0|1.0|1.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|12.0|1365.5|2.0|2.0|27.0|1.0|1.0|0.0|1.0|0.0|...|0.0|0.0|1.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|50%|18.0|2319.5|3.0|3.0|33.0|1.0|1.0|1.0|1.0|0.0|...|0.0|0.0|1.0|0.0|1.0|0.0|0.0|0.0|1.0|0.0|
|75%|24.0|3972.25|4.0|4.0|42.0|2.0|1.0|1.0|1.0|1.0|...|0.0|0.0|1.0|0.0|1.0|0.0|0.0|0.0|1.0|0.0|
|max|72.0|18424.0|4.0|4.0|75.0|4.0|2.0|1.0|1.0|1.0|...|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|


In [111]:
data.head(5)

||Duration|Amount|InstallmentRatePercentage|ResidenceDuration|Age|NumberExistingCredits|NumberPeopleMaintenance|Telephone|ForeignWorker|Class|...|OtherInstallmentPlans.Bank|OtherInstallmentPlans.Stores|OtherInstallmentPlans.None|Housing.Rent|Housing.Own|Housing.ForFree|Job.UnemployedUnskilled|Job.UnskilledResident|Job.SkilledEmployee|Job.Management.SelfEmp.HighlyQualified|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|6|1169|4|4|67|2|1|0|1|Good|...|0|0|1|0|1|0|0|0|1|0|
|1|48|5951|2|2|22|1|1|1|1|Bad|...|0|0|1|0|1|0|0|0|1|0|
|2|12|2096|2|3|49|1|2|1|1|Good|...|0|0|1|0|1|0|0|1|0|0|
|3|42|7882|2|4|45|1|2|1|1|Good|...|0|0|1|0|0|1|0|0|1|0|
|4|24|4870|3|4|53|2|2|1|1|Bad|...|0|0|1|0|0|1|0|0|1|0|


In [112]:
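import numpy as np

# Encode the target as a boolean: True for "Good" credit, False for "Bad".
# Building a fresh DataFrame avoids pandas' SettingWithCopyWarning, which the
# original slice-then-assign pattern can trigger.
y = pd.DataFrame({'Label': data['Class'] == 'Good'})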

In [113]:
X = data.drop(['Class'], axis=1)

In [114]:
# Check label distribution
y.value_counts()

Label
True     700
False    300
dtype: int64

In [115]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 61 columns):
 #   Column                                  Non-Null Count  Dtype
---  ------                                  --------------  -----
 0   Duration                                1000 non-null   int64
 1   Amount                                  1000 non-null   int64
 2   InstallmentRatePercentage               1000 non-null   int64
 3   ResidenceDuration                       1000 non-null   int64
 4   Age                                     1000 non-null   int64
 5   NumberExistingCredits                   1000 non-null   int64
 6   NumberPeopleMaintenance                 1000 non-null   int64
 7   Telephone                               1000 non-null   int64
 8   ForeignWorker                           1000 non-null   int64
 9   CheckingAccountStatus.lt.0              1000 non-null   int64
 10  CheckingAccountStatus.0.to.200          1000 non-null   int64
 11  CheckingAccountSta

In [116]:
# Split training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

In [117]:
print(X_train.shape)
print(y_train.shape)

(700, 61)
(700, 1)


In [118]:
from sklearn.tree import DecisionTreeClassifier

clfr = DecisionTreeClassifier(random_state=10)
clfr.fit(X_train, y_train)
y_pred = clfr.predict(X_test)

In [119]:
y_pred[:10]

array([ True,  True,  True,  True,  True, False,  True,  True, False,
       False])

In [120]:
from sklearn.metrics import precision_score, recall_score, roc_auc_score

print('train accuracy={}'.format(clfr.score(X_train, y_train)))
print('test accuracy={}'.format(clfr.score(X_test, y_test)))
# same as sklearn.metrics.accuracy_score(y_test, y_pred)
# print('precision_score={}'.format(precision_score(y_test, y_pred, pos_label="Good")))
# print('recall_score={}'.format(recall_score(y_test, y_pred, pos_label="Good")))
print('precision_score={}'.format(precision_score(y_test, y_pred)))
print('recall_score={}'.format(recall_score(y_test, y_pred)))
print('auc_score={}'.format(roc_auc_score(y_test, y_pred)))

train accuracy=1.0
test accuracy=0.71
precision_score=0.7735849056603774
recall_score=0.8078817733990148
auc_score=0.6565182062871362


In [121]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.56      0.51      0.53        97
        True       0.77      0.81      0.79       203

    accuracy                           0.71       300
   macro avg       0.67      0.66      0.66       300
weighted avg       0.70      0.71      0.71       300



In [122]:
# View confusion matrix for test data and predictions
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[ 49,  48],
       [ 39, 164]], dtype=int64)

#### From confusion matrix

||Predict Negative|Predict Positive|
|-----|-------|-------|
|Actual Negative|TN=49|FP=48|
|Actual Positive|FN=39|TP=164|
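As a sanity check, the headline metrics can be recomputed by hand from the four counts in the table above. This sketch uses only the standard definitions and the numbers already shown:

```python
# Counts copied from the confusion matrix above.
tn, fp, fn, tp = 49, 48, 39, 164

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of predicted "Good", how many are truly "Good"
recall = tp / (tp + fn)                     # of actual "Good", how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# → accuracy=0.7100 precision=0.7736 recall=0.8079
```

These values agree with the `test accuracy`, `precision_score`, and `recall_score` printed earlier.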

In [123]:
# False Positive Rate = False Positives / (False Positives + True Negatives)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
fpr = fp / (fp + tn)
print('False Positive Rate=' + str(fpr))

False Positive Rate=0.4948453608247423


### K-fold cross-validation & confusion matrices

In [124]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

clfr = DecisionTreeClassifier(random_state=10)

y_train_array = np.ravel(y_train)
y_train_pred = cross_val_predict(clfr, X_train, y_train_array, cv=5)
# tn, fp, fn, tp = confusion_matrix(y_train_array, y_train_pred).ravel()
a = confusion_matrix(y_train_array, y_train_pred).ravel()
print(a)

[ 93 110 115 382]


#### From confusion matrix

||Predict Negative|Predict Positive|
|-----|-------|-------|
|Actual Negative|TN=93|FP=110|
|Actual Positive|FN=115|TP=382|

In [125]:
from sklearn.metrics import precision_score, recall_score

# print('train accuracy={}'.format(clfr.score(X_train, y_train)))
print('precision_score={}'.format(precision_score(y_train_array, y_train_pred)))
print('recall_score={}'.format(recall_score(y_train_array, y_train_pred)))

precision_score=0.7764227642276422
recall_score=0.7686116700201208


### Using GridSearchCV
Refer to https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680

In [126]:
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, accuracy_score
# import matplotlib.pyplot as plt
# plt.style.use("ggplot")


clf = DecisionTreeClassifier(random_state=10)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'], 
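    # Note: 'auto' is an alias for 'sqrt' in this classifier and was removed in
    # scikit-learn 1.3; drop it from the list when running on newer versions.
    'max_features': ['auto', 'sqrt', 'log2'],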
    'max_depth': [None, 20, 30],
    'min_samples_split': [2, 4, 6],
    'class_weight': [None, 'balanced']
}

scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score),
    'roc_auc_score': make_scorer(roc_auc_score)
}

In [127]:
# from sklearn.model_selection import StratifiedKFold

def grid_search_wrapper(refit_score='precision_score'):
    """
    fits a GridSearchCV classifier using refit_score for optimization
    prints classifier performance metrics
    """
    grid_search = GridSearchCV(clf, param_grid, scoring=scorers, refit=refit_score, cv=5, return_train_score=True)
    grid_search.fit(X_train.values, y_train.values)

    # make the predictions
    y_pred = grid_search.predict(X_test.values)

    print('Best params for {}'.format(refit_score))
    print(grid_search.best_params_)

    # confusion matrix on the test data.
    print('\nConfusion matrix of Decision Tree optimized for {} on the test data:'.format(refit_score))
#     print(pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['pred_neg', 'pred_pos'], index=['neg', 'pos']))
    print(confusion_matrix(y_test, y_pred))
    return grid_search

In [128]:
grid_search_clf = grid_search_wrapper(refit_score='precision_score')

Best params for precision_score
{'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': None, 'max_features': 'log2', 'min_samples_split': 4, 'splitter': 'random'}

Confusion matrix of Decision Tree optimized for precision_score on the test data:
[[ 49  48]
 [ 65 138]]


#### From confusion matrix

||Predict Negative|Predict Positive|
|-----|-------|-------|
|Actual Negative|TN=49|FP=48|
|Actual Positive|FN=65|TP=138|

In [130]:
grid_search_clf = grid_search_wrapper(refit_score='recall_score')

Best params for recall_score
{'class_weight': None, 'criterion': 'entropy', 'max_depth': None, 'max_features': 'log2', 'min_samples_split': 2, 'splitter': 'random'}

Confusion matrix of Decision Tree optimized for recall_score on the test data:
[[ 41  56]
 [ 47 156]]


#### From confusion matrix

||Predict Negative|Predict Positive|
|-----|-------|-------|
|Actual Negative|TN=41|FP=56|
|Actual Positive|FN=47|TP=156|

In [132]:
grid_search_clf = grid_search_wrapper(refit_score='accuracy_score')

Best params for accuracy_score
{'class_weight': None, 'criterion': 'entropy', 'max_depth': None, 'max_features': 'auto', 'min_samples_split': 2, 'splitter': 'random'}

Confusion matrix of Decision Tree optimized for accuracy_score on the test data:
[[ 49  48]
 [ 53 150]]


#### From confusion matrix

||Predict Negative|Predict Positive|
|-----|-------|-------|
|Actual Negative|TN=49|FP=48|
|Actual Positive|FN=53|TP=150|

In [134]:
grid_search_clf = grid_search_wrapper(refit_score='roc_auc_score')

Best params for roc_auc_score
{'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'log2', 'min_samples_split': 6, 'splitter': 'best'}

Confusion matrix of Decision Tree optimized for roc_auc_score on the test data:
[[ 35  62]
 [ 50 153]]


#### From confusion matrix

||Predict Negative|Predict Positive|
|-----|-------|-------|
|Actual Negative|TN=35|FP=62|
|Actual Positive|FN=50|TP=153|

## Model Evaluation
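Across all the confusion matrices, the goal is the fewest False Positives (equivalently, the highest precision) for the "Good" class, since a False Positive here is a "Bad" loan predicted as "Good". On the test set, the first model, trained without any hyperparameter tuning, gave the best results: it has the most True Positives (164), and its False Positive count (48) is matched but not beaten by any of the tuned models.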
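The comparison can be checked directly from the test-set counts reported in the confusion matrices above. The sketch below only recomputes precision, TP / (TP + FP), for each model. The dictionary labels are shorthand introduced here for illustration.

```python
# (TN, FP, FN, TP) copied from the test-set confusion matrices above.
matrices = {
    "untuned": (49, 48, 39, 164),
    "tuned for precision": (49, 48, 65, 138),
    "tuned for recall": (41, 56, 47, 156),
    "tuned for accuracy": (49, 48, 53, 150),
    "tuned for roc_auc": (35, 62, 50, 153),
}

# Precision = TP / (TP + FP) for each model.
precision_by_model = {
    name: tp / (tp + fp) for name, (tn, fp, fn, tp) in matrices.items()
}

# Print the models from highest to lowest precision.
for name, value in sorted(precision_by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} precision={value:.3f}")

best_model = max(precision_by_model, key=precision_by_model.get)
print("highest precision:", best_model)
```

On these numbers the untuned tree comes out on top (about 0.774), followed by the accuracy-tuned model (about 0.758).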


### After you've built the best model you can, now it's time to visualize it!

Remember that amazing blog post from a few paragraphs ago, which demonstrated how to visualize and interpret the results of your Decision Tree model? We've seen that this approach can work very well, but let's see how it does on the "German Credit" dataset that we're working on, which is a bit larger than the one used by the blog authors.

First, we're going to need to install their package. If you're using Anaconda, this can be done easily by running:

In [None]:
! pip install dtreeviz

If for any reason this way of installing doesn't work for you straight out of the box, please refer to the more detailed documentation here: https://github.com/parrt/dtreeviz

Now you're ready to visualize your Decision Tree model! Please feel free to use the blog post for guidance and inspiration!
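If dtreeviz is not available in your environment, a plain-text view of a tree is a possible fallback. The sketch below uses scikit-learn's own `export_text` (a swapped-in technique, not dtreeviz) on the built-in iris data rather than the German Credit data. Capping `max_depth` is an assumption made here to keep the output readable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small built-in dataset used purely for illustration.
iris = load_iris()

# A shallow tree keeps the printed rules short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else rules.
tree_rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(tree_rules)
```

Each indented `|---` line is one split or leaf, so the depth of the indentation mirrors the depth of the tree.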

In [None]:
# Your code here! :)

## Random Forests

As discussed in the lecture videos, Decision Tree algorithms also have certain undesirable properties. Mainly, they have low bias, which is good, but tend to have high variance, which is *not* so good (more about this problem here: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).

Noticing these problems, the late Professor Leo Breiman, in 2001, developed the Random Forests algorithm, which mitigates these problems, while at the same time providing even higher predictive accuracy than the majority of Decision Tree algorithm implementations. While the curriculum contains two excellent lectures on Random Forests, if you're interested, you can dive into the original paper here: https://link.springer.com/content/pdf/10.1023%2FA%3A1010933404324.pdf.

In the next part of this assignment, you are going to use the same "German Credit" dataset to train, tune, and measure the performance of a Random Forests model. You will also see certain functionalities that this model, even though it's a bit of a "black box", provides for some degree of interpretability.

First, let's build a Random Forests model, using the same best practices that you've used for your Decision Trees model. You can reuse the things you've already imported there, so no need to do any re-imports, new train/test splits, or loading up the data again.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Your code here! :)

As mentioned, there are certain ways to "peek" into a model created by the Random Forests algorithm. The first, and most popular one, is the Feature Importance calculation functionality. This allows the ML practitioner to see an ordering of the importance of the features that have contributed the most to the predictive accuracy of the model. 

You can see how to use this in the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_). Now, if you tried this, you would just get an ordered table of not directly interpretable numeric values. Thus, it's much more useful to show the feature importance in a visual way. You can see an example of how that's done here: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py
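As a starting point, here is a minimal sketch of extracting and ordering feature importances. It uses synthetic data with hypothetical feature names (`feature_0`, `feature_1`, ...), so the ranking itself means nothing; only the pattern carries over.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_syn.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_syn, y_syn)

# Impurity-based importances: non-negative and normalised to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With matplotlib installed, `importances.plot.barh()` would turn the same Series into a horizontal bar chart.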

Now you try! Let's visualize the importance of features from your Random Forests model!

In [None]:
# Your code here

A final method for gaining some insight into the inner workings of your Random Forests models is a so-called Partial Dependence Plot. The Partial Dependence Plot (PDP or PD plot) shows the marginal effect of a feature on the predicted outcome of a previously fit model. The prediction function is fixed at a few values of the chosen features and averaged over the other features. A partial dependence plot can show if the relationship between the target and a feature is linear, monotonic or more complex.
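The definition above can be computed by hand, which may help make it less abstract. This is a sketch on synthetic data, not the PDPbox implementation. The choice of feature 0 and a 10-point grid are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a fitted forest.
X_pd, y_pd = make_classification(n_samples=300, n_features=5, random_state=0)
rf_pd = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_pd, y_pd)

feature_idx = 0  # hypothetical choice of feature to inspect
grid = np.linspace(X_pd[:, feature_idx].min(), X_pd[:, feature_idx].max(), 10)

partial_dependence = []
for value in grid:
    X_fixed = X_pd.copy()
    X_fixed[:, feature_idx] = value              # fix the chosen feature for every row
    proba = rf_pd.predict_proba(X_fixed)[:, 1]   # predicted probability of class 1
    partial_dependence.append(proba.mean())      # average over the other features

for value, avg in zip(grid, partial_dependence):
    print(f"{value:6.2f} -> {avg:.3f}")
```

Plotting `partial_dependence` against `grid` gives the partial dependence curve for that feature.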

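In scikit-learn, PDPs are implemented and available for certain algorithms, but as of version 0.20.0 (current when this project was written) they were not yet implemented for Random Forests. Later releases added model-agnostic partial dependence tools in the `sklearn.inspection` module, so check the documentation for your installed version. Alternatively, there is an add-on package called **PDPbox** (https://pdpbox.readthedocs.io/en/latest/) which adds this functionality to Random Forests. The package is easy to install through pip.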

In [None]:
! pip install pdpbox

While we encourage you to read the documentation for the package (and reading package documentation in general is a good habit to develop), the authors of the package have also written an excellent blog post on how to use it, showing examples on different algorithms from scikit-learn (the Random Forests example is towards the end of the blog post): https://briangriner.github.io/Partial_Dependence_Plots_presentation-BrianGriner-PrincetonPublicLibrary-4.14.18-updated-4.22.18.html

So, armed with this new knowledge, feel free to pick a few features, and make a couple of Partial Dependence Plots of your own!

In [None]:
# Your code here!

## (Optional) Advanced Boosting-Based Algorithms

As explained in the video lectures, the next generation of algorithms after Random Forests (which use Bagging, a.k.a. Bootstrap Aggregation) were developed using Boosting, and the first of these were Gradient Boosted Machines, which are implemented in scikit-learn (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting).

Still, in recent years, a number of variations on GBMs have been developed by different research and industry groups, all of them bringing improvements in speed, accuracy, and functionality to the original Gradient Boosting algorithms.

In no order of preference, these are:
1. **XGBoost**: https://xgboost.readthedocs.io/en/latest/
2. **CatBoost**: https://tech.yandex.com/catboost/
3. **LightGBM**: https://lightgbm.readthedocs.io/en/latest/

If you're using the Anaconda distribution, these are all very easy to install:

In [None]:
! conda install -c anaconda py-xgboost

In [None]:
! conda install -c conda-forge catboost

In [None]:
! conda install -c conda-forge lightgbm

Your task in this optional section of the mini project is to read the documentation of these three libraries, and apply all of them to the "German Credit" dataset, just like you did in the case of Decision Trees and Random Forests.

The final deliverable of this section should be a table (can be a pandas DataFrame) which shows the accuracy of all five algorithms taught in this mini project in one place.
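One possible shape for that table is sketched below. It uses synthetic data and only models that ship with scikit-learn, with its own Gradient Boosting standing in for the boosting family. The column names `model` and `test_accuracy` are illustrative choices, and the XGBoost, CatBoost, and LightGBM estimators would need to be added to the dictionary after installing those packages.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 70/30 as elsewhere in this project.
X_all, y_all = make_classification(n_samples=1000, n_features=20, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting (scikit-learn)": GradientBoostingClassifier(random_state=0),
    # XGBoost, CatBoost and LightGBM estimators could be added here once installed.
}

# Fit each model on the training part and score it on the held-out part.
rows = [
    {"model": name, "test_accuracy": model.fit(X_fit, y_fit).score(X_hold, y_hold)}
    for name, model in models.items()
]
results = pd.DataFrame(rows).sort_values("test_accuracy", ascending=False)
print(results.to_string(index=False))
```

Scoring every model on the same held-out split keeps the accuracies directly comparable.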

Happy modeling! :)