<img src="../src/img/h2o_banner2.png">

## License 

<span style="color:gray"> Copyright 2019 David Whiting and the H2O.ai team </span>

<span style="color:gray"> Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at </span>

<span style="color:gray">     http://www.apache.org/licenses/LICENSE-2.0 </span>

<span style="color:gray"> Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. </span>

<span style="color:gray"> **DISCLAIMER:** This notebook is not legal compliance advice. </span>

<hr style="background-color: gray;height: 2.0px;"/>

# H2O.ai Lesson 2: Gradient Boosting Models 

This is the second in a series of instructional Jupyter notebooks on Sparkling Water. These notebooks are built to be run on the H2O.ai Aquarium training platform [https://aquarium.h2o.ai](https://aquarium.h2o.ai). 

_**Experienced modelers should find the content in these notebooks sufficient to be up-and-running in H2O Sparkling Water immediately**_. 

----

<div style="margin-left: 3em;">

### Intended Audience

The target audience for this training notebook is data scientists, machine learning engineers, and other experienced modelers. Technically advanced analysts may also find this training understandable. A working knowledge of Python and previous experience building statistical or machine learning models are assumed.

### Prerequisites

Successful completion of 
    
<ul style="list-style: none;">
    <li><input type="checkbox"><span style="color:blue">
        H2O.ai Lesson 1: Introduction to H2O Sparkling Water
        </span></li>
</ul> 

### Learning Outcomes

By the end of this notebook, you will be able to ...
<ul style="list-style: none;">
    <li><input type="checkbox" disabled ><span style="color:black">
    Explain at a high level how Gradient Boosting Models work, and how the GBM and XGBoost algorithms differ
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Build gradient boosting predictive models with H2O GBM and H2O XGBoost algorithms
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Use H2O Flow to investigate model builds and performance 
    </span></li>
</ul>

</div>

<hr style="background-color: black;height: 2.0px;"/>

# 1. Gradient Boosting Models

## 1.1. Introduction

These lessons assume some familiarity with statistical and machine learning models such as logistic regression, decision trees, random forests, gradient boosting models, etc. In this section, we will give a high-level overview of decision trees and gradient boosting models, concentrating on two specific implementations: H2O GBM and H2O XGBoost.

## 1.2. Decision Trees

At the heart of each GBM implementation is the concept of a **decision tree**.

A decision tree can be used for either

- _classification:_ assign observations to discrete groups 
- _regression:_ assign observations a predicted continuous outcome 

Observation assignment is made through _conditional control statements_ that form a tree-like structure. 

<table>
<tr>
<td style="width: 550px; text-align: left; vertical-align: top;">
    <h2>How does it work?</h2>
    <ul><li style="margin: 6px 0;">
        Search through all candidate predictors. Identify the variable <strong>split</strong> that yields the greatest predictive power. 
    </li><li style="margin: 6px 0;">
    For each created branch, follow the same process again.
    </li><li style="margin: 6px 0;">
        Repeat until <strong>stopping criteria</strong> are reached.
    </li></ul>
    <h2>Examples of splitting functions</h2>
    <ul><li style="margin: 6px 0;">
    Gini function
    </li><li style="margin: 6px 0;">
    Information entropy
    </li></ul>
    <h2>Examples of stopping criteria</h2>
    <ul><li style="margin: 6px 0;">
    Minimum number of observations needed at each node after splitting
    </li><li style="margin: 6px 0;">
    Entropy reduction from the best available split falls below some cutoff
    </li><li style="margin: 6px 0;">
    Maximum layers of tree (i.e., depth)
    </li></ul> 
</td>
<td style="width: 500px; text-align: left;">
    <img src="../src/img/titanic_decision_tree.png" style="height:400px"></td>
</tr>
</table>
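
The split search described above can be sketched in a few lines of plain Python. This is an illustration only, not H2O's implementation: it scores every candidate threshold of a single numeric predictor with the Gini function and keeps the best one. The toy `int_rate` and `bad_loan` values are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    n = len(labels)
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    # Skip the largest value: splitting there would leave the right branch empty
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Toy data (made up): interest rate and whether the loan went bad
int_rate = [5.0, 7.5, 9.0, 12.0, 15.5, 18.0]
bad_loan = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(int_rate, bad_loan)
print(threshold, impurity)
```

A full tree repeats this search over all predictors and then recurses into each branch until a stopping criterion is met.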

<h2>Decision tree strengths and weaknesses</h2>

<table align=left>
<tr>
<td style="width: 300px; text-align: left; vertical-align: top;">
    <h2>Strengths</h2>
    <ul><li style="margin: 6px 0;">
    Simple to understand, easy to interpret
    </li><li style="margin: 6px 0;">
    Robust to
    <ul><li style="margin: 6px 0;">
    nonlinear relationships
    </li><li style="margin: 6px 0;">
    correlated features
    </li><li style="margin: 6px 0;">
    feature distributions
    </li><li style="margin: 6px 0;">
    missing values
        </li></ul></li><li style="margin: 6px 0;">
    Fast to train
    </li><li style="margin: 6px 0;">
    Fast to score
    </li></ul> 
</td>
<td style="width: 300px; text-align: left; vertical-align: top;">
    <h2>Weaknesses</h2>
    <ul><li style="margin: 6px 0;">
    High variance (can easily overfit)
    </li><li style="margin: 6px 0;">
    Poor predictive accuracy
    </li><li style="margin: 6px 0;">
    Inefficient for linear relationships
    </li></ul> 
</td>
</tr>
</table>

## 1.3. Boosting

Boosting is an ensemble method that combines models sequentially: each new model is fit to the residuals of the models built before it.

### Boosted trees

- Start by building a relatively shallow decision tree
- Sequentially build the next tree based on residuals from the previous tree
- The objective is to take an initial weak learner and gradually turn it into a strong learner by concentrating on where the model is not predicting well
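
The steps above can be sketched as a tiny boosting loop for squared error. This is a minimal illustration under simplifying assumptions (one numeric predictor, depth-1 trees, made-up data), not the H2O GBM algorithm; the names `fit_stump` and `boost` are hypothetical helpers.

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree (stump) to the residuals under squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, ntrees=50, learn_rate=0.1):
    """Return the training MSE after each tree is added."""
    predictions = [sum(y) / len(y)] * len(y)  # start from the mean response
    mse_history = []
    for _ in range(ntrees):
        # Each new stump is fit to what the current ensemble still gets wrong
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        # Shrink the stump's contribution by the learning rate
        predictions = [pi + learn_rate * stump(xi) for xi, pi in zip(x, predictions)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))
    return mse_history

# Toy data (made up)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]
history = boost(x, y)
print(round(history[0], 3), round(history[-1], 3))
```

The training error never increases as stumps are added, which is exactly why training error alone cannot tell us when to stop.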

### Be careful:

- In theory, a GBM will keep reducing the training error as trees are added, until it eventually fits the noise in the training data (overfitting).
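
A common guard against this is early stopping on a validation metric. H2O exposes it through the `stopping_rounds`, `stopping_metric`, and `stopping_tolerance` parameters. The sketch below shows only the general idea with a simplified patience rule. It is not H2O's exact stopping criterion, and the error values are made up.

```python
def trees_to_keep(validation_errors, patience=3):
    """Return the tree count with the lowest validation error seen before the
    error has failed to improve for `patience` consecutive trees."""
    best_error = float("inf")
    best_ntrees = 0
    for ntrees, error in enumerate(validation_errors, start=1):
        if error < best_error:
            best_error, best_ntrees = error, ntrees
        elif ntrees - best_ntrees >= patience:
            break  # no improvement for `patience` trees: stop adding trees
    return best_ntrees

# Made-up validation errors after 1, 2, 3, ... trees
validation_errors = [0.40, 0.31, 0.27, 0.25, 0.26, 0.27, 0.28, 0.24]
print(trees_to_keep(validation_errors))  # prints 4
```

Note the trade-off in the toy data: with a patience of 3 the search stops before the later improvement at tree 8 is ever seen.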

## Boosting implementations in H2O

- For all random forest and boosting implementations, tree building is parallelized in H2O:
  - The construction of each tree is parallelized across the cluster (boosted trees are still added one after another).
  - Categoricals can be split into groups (instead of just using Boolean splits).
  - Shared histograms are used to calculate cut-points.
  - A greedy search over the histogram bins selects each split, optimizing squared error.

<img src="../src/img/GBM_in_H2O.png" style="height:400px">

### XGBoost

XGBoost is very similar to GBM, with the following modifications:

- XGBoost adds regularization terms to the cost function.
- These terms penalize model complexity: the number of leaves in each tree and the size of the leaf weights.
- By default, trees are grown level by level (in **breadth**), in contrast to the leaf-wise growth (in **depth**) used by LightGBM.
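
For reference, the regularized objective has the following general form in the XGBoost documentation. Here $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, $w_j$ are its leaf weights, and $\gamma$ and $\lambda$ are penalty strengths. Treat this as a summary of the standard formulation (an optional L1 penalty on the leaf weights is omitted).

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
```

Larger values of $\gamma$ discourage additional leaves, and larger values of $\lambda$ shrink the leaf weights toward zero.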

### LightGBM

LightGBM builds trees as deep as necessary by repeatedly splitting the one leaf that gives the biggest gain.

- Trees are grown leaf by leaf (in **depth**) instead of level by level (in **breadth**)
- In theory, LightGBM is optimized for sparse data

(H2O does not implement LightGBM directly, but instead provides a method for emulating the approach using a certain set of options within XGBoost.)
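
As a sketch, the options that the H2O XGBoost documentation lists for this emulation are shown below. The dictionary name `lightgbm_like_params` and the commented usage are our own illustration. Running the estimator requires a live H2O cluster, so that part is left as a comment.

```python
# Options listed in the H2O XGBoost documentation for LightGBM-style tree growth
lightgbm_like_params = {
    "tree_method": "hist",       # histogram-based split finding
    "grow_policy": "lossguide",  # split the leaf with the largest loss reduction first
}

# Assumed usage inside this notebook (needs a running H2O cluster):
# from h2o.estimators import H2OXGBoostEstimator
# xgb_leafwise = H2OXGBoostEstimator(seed = 12345, **lightgbm_like_params)
print(lightgbm_like_params)
```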

## Boosting hyperparameters

- Boosting methods share the following core hyperparameters:
  - Number of trees to be built
  - Shrinkage parameter, also called the learning rate (specifies the rate at which the model learns)
  - Depth of each tree

- Note that simply adding trees in boosting approaches (without further restrictions) can lead to overfitting

- A grid search can aid the process of hyperparameter selection
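
To show what a grid search enumerates, here is a minimal pure-Python sketch that expands a dictionary of candidate values into every combination. The parameter names `ntrees`, `learn_rate`, and `max_depth` are H2O GBM parameters. The candidate values are arbitrary, and in practice H2O's `H2OGridSearch` class performs the enumeration and the model training for you.

```python
from itertools import product

# Candidate values for the three core hyperparameters (arbitrary choices)
hyper_params = {
    "ntrees": [50, 100],
    "learn_rate": [0.05, 0.1],
    "max_depth": [3, 5, 7],
}

# Cartesian product: one dictionary per model that the grid would train
names = list(hyper_params)
grid = [dict(zip(names, combo))
        for combo in product(*(hyper_params[name] for name in names))]

print(len(grid))  # 2 * 2 * 3 = 12 candidate models
print(grid[0])
```

The number of models grows multiplicatively with each added hyperparameter, which is why large grids are usually searched randomly or with a time budget.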

<div class="alert alert-block alert-warning"><span style="color:black">

## Completed Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked>
    Explain at a high level how Gradient Boosting Models work, and how the GBM and XGBoost algorithms differ
    </li>
</ul>
</span>
</div>

# 2. Preliminaries

In this section, we set up the Sparkling Water environment and perform all of the preliminary data tasks that we developed in H2O Lesson 1.

### Environment preparation

In Lesson 1, we used the `spark` command to initiate a `SparkSession`. Because of the way we have configured the notebook `PySparkling` kernel, a `SparkSession` is actually created during startup of the Jupyter notebook. In other words, the `spark` command is not needed on this platform. 

In [None]:
from pysparkling import H2OContext
hc = H2OContext.getOrCreate()

<div class="alert alert-block alert-info"><span style="color:black">
Remember that how you start SparkSession and Sparkling Water will depend on your specific installation setup.
</span></div>

### Load the data


In [None]:
import h2o
import os
input_csv = "/home/h2o/data/lending_club/LoanStats3a.csv"

if not os.path.exists(input_csv):
    input_csv = "https://s3-us-west-2.amazonaws.com/h2o-tutorials/data/topics/lending/lending_club/LoanStats3a.csv"

loans = h2o.import_file(input_csv,
                        col_types = {"int_rate":"string", 
                                     "revol_util":"string", 
                                     "emp_length":"string", 
                                     "verification_status":"string"})

### Data munging and feature engineering


<div class="alert alert-block alert-info"><span style="color:black">
Please reference Lesson 1 for details on how and why we performed these steps.
</span></div>

In [None]:
ongoing_status = ["Current",
                  "In Grace Period",
                  "Late (16-30 days)",
                  "Late (31-120 days)",
                  "Does not meet the credit policy.  Status:Current",
                  "Does not meet the credit policy.  Status:In Grace Period"]
loans = loans[~loans["loan_status"].isin(ongoing_status)]

response = "bad_loan"
fully_paid = ["Fully Paid",
              "Does not meet the credit policy.  Status:Fully Paid"]
loans[response] = ~(loans["loan_status"].isin(fully_paid))
loans[response] = loans[response].asfactor()

loans["int_rate"] = loans["int_rate"].gsub(pattern = "%", replacement = "") # strip %
loans["int_rate"] = loans["int_rate"].trim() # trim whitespace
loans["int_rate"] = loans["int_rate"].asnumeric() # change to numeric 

loans["revol_util"] = loans["revol_util"].gsub(pattern="%", replacement="")
loans["revol_util"] = loans["revol_util"].trim()
loans["revol_util"] = loans["revol_util"].asnumeric()

loans["emp_length"] = loans["emp_length"].gsub(pattern="([ ]*+[a-zA-Z].*)|(n/a)", replacement="") 
loans["emp_length"] = loans["emp_length"].trim()
loans["emp_length"] = loans["emp_length"].gsub(pattern="< 1", replacement="0") # convert "< 1" to 0
loans["emp_length"] = loans["emp_length"].gsub(pattern="10\\+", replacement="10") # convert "10+" to 10
loans["emp_length"] = loans["emp_length"].asnumeric()

loans["verification_status"] = loans["verification_status"].sub(pattern="VERIFIED - income source", 
                                                                replacement="verified")
loans["verification_status"] = loans["verification_status"].sub(pattern="VERIFIED - income", 
                                                                replacement="verified")
loans["verification_status"] = loans["verification_status"].asfactor()

loans["credit_length"] = loans["issue_d"].year() - loans["earliest_cr_line"].year()
loans["issue_d_year"] = loans["issue_d"].year()
loans["issue_d_month"] = loans["issue_d"].month().asfactor()
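
To see what the `emp_length` steps do to individual values, the sketch below mirrors them on plain Python strings with the `re` module. It is an approximation for illustration: the H2O pattern uses a possessive quantifier (`*+`), which is replaced here by a plain `*`, and the helper name `clean_emp_length` is hypothetical.

```python
import re

def clean_emp_length(raw):
    """Mirror the emp_length cleanup on one string; returns None for missing values."""
    s = re.sub(r"([ ]*[a-zA-Z].*)|(n/a)", "", raw)  # drop " years", " year", and "n/a"
    s = s.strip()                                    # trim whitespace
    s = s.replace("< 1", "0")                        # convert "< 1" to 0
    s = re.sub(r"10\+", "10", s)                     # convert "10+" to 10
    return float(s) if s else None                   # empty string becomes a missing value

for raw in ["10+ years", "< 1 year", "3 years", "n/a"]:
    print(repr(raw), "->", clean_emp_length(raw))
```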

### Create the predictor list used in modeling


Note that, unlike in the Lesson 1 assignment, we do not exclude `int_rate` from the list of predictors here.

In [None]:
cols_to_remove = ["initial_list_status",
                  "out_prncp",
                  "out_prncp_inv",
                  "total_pymnt",
                  "total_pymnt_inv",
                  "total_rec_prncp", 
                  "total_rec_int",
                  "total_rec_late_fee",
                  "recoveries",
                  "issue_d",
                  "collection_recovery_fee",
                  "last_pymnt_d", 
                  "last_pymnt_amnt",
                  "next_pymnt_d",
                  "last_credit_pull_d",
                  "collections_12_mths_ex_med" , 
                  "mths_since_last_major_derog",
                  "policy_code",
                  "loan_status",
                  "funded_amnt",
                  "funded_amnt_inv",
                  "mths_since_last_delinq",
                  "mths_since_last_record",
                  "id",
                  "member_id",
                  "desc",
                  "zip_code"]

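# Keep the original column order and drop the response so it is not listed as a predictor
excluded = set(cols_to_remove) | {response}
predictors = [col for col in loans.col_names if col not in excluded]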

<div class="alert alert-block alert-success"><span style="color:black">
    <strong>Recall</strong>: Explain why we might want to include or exclude interest rate in a model. What are the different possible use cases?
</span></div>

<div class="alert alert-block alert-success"><span style="color:black">

If our loan underwriting is any good at all, then interest rate should be one of the most important predictors of default risk, since the interest rate a customer is offered is based in large part on risk estimation. (Customer demand may also come into play, especially if we are optimizing price.) Therefore ...

**Exclude interest rate**: when we are building a risk model and/or determining interest rate from the risk profile;

**Include interest rate**: when we are monitoring the effect of our pricing model, when we want to use it as an offset and see what other variables are important or might have been missed in building our risk model, etc.

</span></div>

# 3. Baseline Model Building: GBM and XGBoost

## 3.1. Splitting data

Splitting data into training, validation, and test sets is the accepted standard for model building when your data size is sufficiently large. Alternatively, we can split data into 80% training and 20% test sets and use k-fold cross-validation on the training data. This is computationally more expensive but allows the model to see more data in training. 

<div class="alert alert-block alert-info"><span style="color:black">
The definition of "sufficiently large" is data and problem specific. We will demonstrate both approaches.  
    </span></div>

### Traditional train, validate, and test set splits

We split the data into three parts: 60% for training, 20% for validation, and 20% for final testing. 

In [None]:
train, valid, test = loans.split_frame(seed = 12345,
                                       ratios = [0.6, 0.2],
                                       destination_frames = ['train.hex', 'valid.hex', 'test.hex'])
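
Keep in mind that `split_frame` assigns rows probabilistically, so the resulting frames are only approximately 60/20/20. The pure-Python sketch below illustrates the idea with one uniform random draw per row. It is not H2O's implementation, and `split_rows` is a hypothetical helper.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def split_rows(n_rows, ratios, seed):
    """Assign each row index to a part by comparing a uniform draw to cumulative ratios."""
    rng = random.Random(seed)
    cutoffs = list(accumulate(ratios))           # e.g. [0.6, 0.8] for ratios [0.6, 0.2]
    parts = [[] for _ in range(len(ratios) + 1)] # the last part takes the remainder
    for row in range(n_rows):
        parts[bisect_right(cutoffs, rng.random())].append(row)
    return parts

train_rows, valid_rows, test_rows = split_rows(10000, [0.6, 0.2], seed=12345)
print(len(train_rows), len(valid_rows), len(test_rows))
```

Every row lands in exactly one part, and the part sizes approach the requested ratios as the number of rows grows.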

### Train and test split for cross-validation

We can also split the data into two parts: 80% for training and 20% for final testing. 

In [None]:
train_cv, test_cv = loans.split_frame(seed = 12345,
                                      ratios = [0.8],
                                      destination_frames = ['train_cv.hex', 'test_cv.hex'])
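
With this split, k-fold cross-validation rotates a holdout fold through the 80% training data. The sketch below shows the bookkeeping using modulo fold assignment, which is one of the options of H2O's `fold_assignment` parameter. The generator `kfold_indices` is a hypothetical helper, not part of the H2O API.

```python
def kfold_indices(n_rows, nfolds):
    """Yield (train_idx, holdout_idx) pairs using modulo fold assignment."""
    fold_of = [row % nfolds for row in range(n_rows)]
    for fold in range(nfolds):
        holdout_idx = [row for row in range(n_rows) if fold_of[row] == fold]
        train_idx = [row for row in range(n_rows) if fold_of[row] != fold]
        yield train_idx, holdout_idx

folds = list(kfold_indices(n_rows=10, nfolds=5))
for train_idx, holdout_idx in folds:
    print("train:", train_idx, "holdout:", holdout_idx)
```

Each row is held out exactly once, so every observation contributes to both model fitting and model evaluation.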

## 3.2. Baseline GBM train-validate-test model

The first model we fit is a GBM with default settings, trained on the 60% training split and scored on the 20% validation split:

In [None]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(seed = 12345)
gbm.train(x = predictors,
          y = response,
          training_frame = train,
          validation_frame = valid,
          model_id = "gbm_baseline")

The plot below shows the performance of the model as more trees are built.  This graph can help us see at what point our model begins overfitting.  

In [None]:
%matplotlib inline
gbm.plot()

On our data, the error rate stops improving at around 10-15 trees.

We can get a detailed model summary using

In [None]:
print(gbm)

### GBM model performance

Let's visualize the performance of the default GBM model across all splits. 

In [None]:
print("Train")
gbm.model_performance(train).plot()

print("Validation")
gbm.model_performance(valid).plot()

print("Test")
gbm.model_performance(test).plot()

To get the AUC on the validation data split, we enter

In [None]:
print(gbm.model_performance(valid).auc())

These results confirm what we saw in the scoring history plot: the GBM model is significantly overfit on the training set. (This is not unexpected.)
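
As a reminder of what the AUC measures: it is the probability that a randomly chosen bad loan receives a higher score than a randomly chosen good loan. The sketch below computes it directly from that definition on made-up labels and scores. H2O computes its metrics differently in detail, so treat this only as an aid to interpretation.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (made up): 1 = bad loan, 0 = good loan
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # prints 0.75
```

A training AUC that is much higher than the validation AUC is the numerical signature of the overfitting seen in the scoring history plot.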

### GBM model interpretation

The variable importance plot shows us which variables are most important to predicting `bad_loan`.  We can use partial dependence plots to learn more about how these variables affect the prediction.

In [None]:
gbm.varimp_plot(20)

The partial dependence plot of the `int_rate` predictor shows us that as the interest rate increases, the likelihood of the loan defaulting also increases.

In [None]:
pdp = gbm.partial_plot(cols=["int_rate"], data=train)

* One-dimensional partial dependence plots (PDPs) show us the average behavior of a complex response function with respect to a single input
* They allow us to compare this average behavior to domain knowledge and expected behavior
* **Note: _The average behavior of PDPs can be misleading in the presence of strong interactions or for highly nonlinear response functions_**
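
The averaging behind a one-dimensional PDP can be sketched as follows: for each grid value, overwrite the column of interest in every row, score the modified rows, and average the predictions. The toy model and data are made up, and `partial_dependence` is a hypothetical helper rather than the H2O implementation.

```python
def partial_dependence(model, rows, column, grid):
    """Average prediction over all rows with `column` forced to each grid value."""
    curve = []
    for value in grid:
        preds = [model({**row, column: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve

def toy_model(row):
    """Made-up additive scoring function standing in for a fitted model."""
    return 2 * row["int_rate"] + row["emp_length"]

rows = [
    {"int_rate": 5, "emp_length": 1},
    {"int_rate": 12, "emp_length": 5},
    {"int_rate": 18, "emp_length": 9},
]
print(partial_dependence(toy_model, rows, "int_rate", grid=[5, 10, 15]))
# prints [15.0, 25.0, 35.0]
```

Because the toy model is additive, the curve recovers the effect of `int_rate` exactly. With strong interactions, the same average can hide very different row-level behavior, which is the caveat in the note above.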

## 3.3. Baseline XGBoost cross-validated model

Build a baseline XGBoost model using 5-fold cross-validation on the 80% training split, with the 20% test split passed as the validation frame:

In [None]:
from h2o.estimators import H2OXGBoostEstimator

xgb_cv = H2OXGBoostEstimator(nfolds = 5, seed = 12345)
xgb_cv.train(x = predictors,
             y = response,
             training_frame = train_cv,
             validation_frame = test_cv,
             model_id = "xgb_cv_baseline"
             )

Get the scoring history using

In [None]:
xgb_cv.plot()

Our error rate stops improving at around 10 trees.

We can get a detailed model summary using

In [None]:
print(xgb_cv)

### XGBoost model performance

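The model performance metrics are retrieved in a similar manner below. Note the slight difference in syntax for cross-validated models: instead of passing a frame, we request the stored training, cross-validation, and validation metrics. Because `test_cv` was passed as the validation frame, `valid = True` returns the test-set metrics.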

In [None]:
print("Train")
xgb_cv.model_performance(train = True).plot()

print("Cross-Validation")
xgb_cv.model_performance(xval = True).plot()

print("Test")
xgb_cv.model_performance(valid = True).plot()

To get the AUC on the cross-validated data, we enter

In [None]:
print(xgb_cv.model_performance(xval = True).auc())

Like the GBM model, the results above indicate that the baseline XGBoost model is significantly overfit.

### XGBoost model interpretation

The variable importance plot shows us which variables are most important to predicting `bad_loan`.  We can use partial dependence plots to learn more about how these variables affect the prediction.

In [None]:
xgb_cv.varimp_plot(20)

The partial dependence plot of the `int_rate` predictor shows us that as the interest rate increases, the likelihood of the loan defaulting also increases.

In [None]:
pdp = xgb_cv.partial_plot(cols=["int_rate"], data=train_cv)

<div class="alert alert-block alert-info"><span style="color:black">
    
A comparison of the variable importance plots for GBM and XGBoost demonstrates some of the differences in these two algorithms. For instance, the H2O GBM implementation can handle high-cardinality categorical variables (e.g., `addr_state`) directly, while XGBoost opts for one-hot encoding. A choice between algorithms often comes down to performance.
    
In either case, we could opt to build models after target-encoding high-cardinality categorical variables. We save such feature engineering approaches for a future lesson.    
</span></div>

<div class="alert alert-block alert-warning"><span style="color:black">

## Completed Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray">
    Explain at a high level how Gradient Boosting Models work, and how the GBM and XGBoost algorithms differ
        </span></li><li><input type="checkbox" disabled checked>
        Build gradient boosting predictive models with H2O GBM and H2O XGBoost algorithms
    </li>
</ul>
</span>
</div>

## 3.4. Using H2O Flow for model evaluation

To list all models in H2O Flow, use the `Models` directory, or enter the `getModels` command directly in a cell. Your results should look something like

<img src="../src/img/flow_get_models_new.png" style="height:400px">

Note that this contains the GBM baseline model, the XGBoost baseline model, and the five XGBoost folds from our cross-validation. Clicking on a model name brings up

<img src="../src/img/flow_logloss_new.png" style="height:600px">

and other performance information such as variable importance

<img src="../src/img/flow_importance_new.png" style="height:600px">

Make sure you can find, at the very least,

- Scoring history plots
- AUC metrics and ROC plots
- Variable importances
- Confusion matrices
- Model parameters

<div class="alert alert-block alert-info"><span style="color:black">
H2O Flow is a very convenient tool for interactive model investigation.
</span></div>

<div class="alert alert-block alert-success"><span style="color:black">
    
### YOUR TURN: Investigate the cross-validated XGBoost model using H2O Flow.

Find

- Scoring history plots
- AUC metrics and ROC plots
- Variable importances
- Confusion matrices
- XGBoost parameters

</span></div>

<div class="alert alert-block alert-warning"><span style="color:black">

## Completed Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray">
    Explain at a high level how Gradient Boosting Models work, and how the GBM and XGBoost algorithms differ
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Build gradient boosting predictive models with H2O GBM and H2O XGBoost algorithms
    </span></li><li><input type="checkbox" disabled checked>    
    Use H2O Flow to investigate model builds and performance 
    </li>
</ul>
</span>
</div>

# 4. Assignment

<div class="alert alert-block alert-success"><span style="color:black">
    
## Part 1. Build and evaluate the following models:

###  1.1. Baseline GBM using 5-fold cross-validation 
    
- Name the model "gbm_cv"
- Add as many cells below as needed to complete.

    </span></div>

In [None]:
# Baseline GBM cross-validated model

<div class="alert alert-block alert-success"><span style="color:black">
    
### 1.2. Baseline XGBoost model using the 60% train, 20% validate, 20% test data
    
- Name the model "xgb"
- Add as many cells below as needed to complete.
    
</span></div>

In [None]:
# Baseline XGBoost train-validate-test model

<div class="alert alert-block alert-success"><span style="color:black">
    
## Part 2. Compare baseline cross-validated models

Compare the **performance** of the baseline cross-validated XGBoost model with the baseline cross-validated GBM model.

- Insert as many cells below as needed to complete.
</span></div>

In [None]:
# GBM vs. XGBoost cross-validated model

<div class="alert alert-block alert-success"><span style="color:black">
    
## Part 3. Use H2O Flow to compare variable importance
    
- What are the top 5 variables in the "gbm" model?
- What are the top 5 variables in the "xgb" model?

</span></div>

# 5. Shut down the Sparkling Water Cluster

In [None]:
h2o.cluster().shutdown()

Once your work is completed, shutting down the H2O cluster frees up the resources reserved by H2O.

<h1>CONGRATULATIONS! You have completed Lesson 2.</h1>