# POLI 175 - Lecture 14

## Tree-based models

# Tree-based methods II

## Tree-based methods

In [1]:
# Decision Tree Packages Install (take some time)
using Pkg
Pkg.add("MLJDecisionTreeInterface")
Pkg.add("DecisionTree")
Pkg.add("GraphViz")
Pkg.add("EvoTrees")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Manifest.toml`


## Tree-based methods

Tree-based methods consist of segmenting the predictors' space into many regions, then use these regions to predict the target variable.
- We use a heuristic prediction here, such as the variable's mean in the region for *regression*.
- Or the most frequent observation in the region for *classification*.
    
This approach is called the `decision tree method`.

By itself it is terrible. But we will discuss many methods that improve efficiency considerably.

## Class Examples

In [2]:
## Packages Here
using DataFrames
using MLJ, MLJIteration
import MLJLinearModels, MLJBase, MLJModels
import MultivariateStats, MLJMultivariateStatsInterface
import CSV, Plots, GLM, StatsBase, Random
import LaTeXStrings, StatsPlots, Lowess, Gadfly, RegressionTables
import CovarianceMatrices, Econometrics, LinearAlgebra, MixedModelsExtras
import Missings, StatsAPI, FreqTables, EvalMetrics
import NearestNeighborModels

# Decision tree stuff
import MLJDecisionTreeInterface, DecisionTree, GraphViz, EvoTrees

# Adapted from @xiaodaigh: https://github.com/xiaodaigh/DataConvenience.jl
function onehot!(df::AbstractDataFrame, 
        col, cate = sort(unique(df[!, col])); 
        outnames = Symbol.(col, :_, cate))
    transform!(df, @. col => ByRow(isequal(cate)) .=> outnames)
end

onehot! (generic function with 2 methods)

## Class Examples

In [3]:
## Loading the data
chile = CSV.read(
    download("https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/chilesurvey.csv"), 
    DataFrame,
    missingstring = ["NA"]
); dropmissing!(chile)
chile.voteyes = ifelse.(chile.vote .== "Y", 1, 0)

# One-hot encoding (we will learn a better way to do it later)
onehot!(chile, :region);
onehot!(chile, :education);
onehot!(chile, :sex);

# Drop reference categories
select!(chile, Not(:region, :income, :population, :sex, :education, :region_C, :education_P, :sex_M))

# Checking
first(chile, 3)

Row,age,statusquo,vote,voteyes,region_M,region_N,region_S,region_SA,education_PS,education_S,sex_F
Unnamed: 0_level_1,Int64,Float64,String1,Int64,Bool,Bool,Bool,Bool,Bool,Bool,Bool
1,65,1.0082,Y,1,False,True,False,False,False,False,False
2,29,-1.29617,N,0,False,True,False,False,True,False,False
3,38,1.23072,Y,1,False,True,False,False,False,False,True


## Class Examples

In [4]:
## Education Expenditure Dataset
educ = CSV.read(download("https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/educexp.csv"), DataFrame)

# Processing
educ.educ_log = log.(educ.education);
educ.income_log = log.(educ.income)
educ.urban_log = log.(educ.urban)
educ.young_log = log.(educ.young)

# Checking
first(educ, 3)

Row,education,income,young,urban,states,educ_log,income_log,urban_log,young_log
Unnamed: 0_level_1,Int64,Int64,Float64,Int64,String3,Float64,Float64,Float64,Float64
1,189,2824,350.7,508,ME,5.24175,7.94591,6.23048,5.85993
2,169,3259,345.9,564,NH,5.1299,8.08918,6.33505,5.84615
3,230,3072,348.5,322,VT,5.43808,8.03008,5.77455,5.85364


## Decision Tree Regression in MLJ

In [5]:
y, X = unpack(
    educ[:, ["education", "income", "young", "urban"]], 
    ==(:education);              ## Target (all else features...)
    :education => Continuous,    ## Var types
    :income    => Continuous,
    :young     => Continuous,
    :urban     => Continuous
);

## Decision Tree Regression in MLJ

Train-test split like a boss: Use the indexes to do the split.

```julia
train, test = partition(
    eachindex(y),   ## Index with the eachindex(.) method
    0.7,            ## Proportion in the training set
    shuffle = true, ## Shuffle the data,
    rng = 12345     ## Random seed (ensure same results; not necessary)
) 
```

In [6]:
train, test = partition(
    eachindex(y),   ## Index with the eachindex(.) method
    0.7,            ## Proportion in the training set
    shuffle = true, ## Shuffle the data,
    rng = 12345     ## Random seed (ensure same results; not necessary)
);

## Decision Tree Regression in MLJ

In [7]:
# Instantiate the regressor
dtreg = MLJDecisionTreeInterface.DecisionTreeRegressor(max_depth = 5);

# Create the machine
mach = machine(dtreg, X, y);

# Fit the machine for the training set (note the rows parameter)
fit!(mach, rows = train);

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mTraining machine(DecisionTreeRegressor(max_depth = 5, …), …).


## Decision Tree Regression in MLJ

In [8]:
# Compute the predicted values
yhat = predict(mach, X[test, :]);

# Compute the MLJ default root mean squared error
rmse = rms(y[test], yhat);

# Print (note the $(var ^ 2)...it prints the square of the variable.)
println("Mean Squared Error: $(rmse ^ 2)")

# Print (note the $var...it prints the variable.)
println("\nRoot-Mean Squared Error (residual SE): $rmse")

Mean Squared Error: 1191.2627962962965

Root-Mean Squared Error (residual SE): 34.51467508606008


## Decision Tree Regression in MLJ

Let us check our decision tree:

In [9]:
tree_model = fitted_params(mach).raw_tree
DecisionTree.print_tree(tree_model)

Feature 1 < 2911.0 ?
├─ Feature 1 < 2606.0 ?
    ├─ 132.4 : 0/5
    └─ 161.75 : 0/8
└─ Feature 1 < 3742.0 ?
    ├─ Feature 3 < 739.5 ?
        ├─ 213.0 : 0/9
        └─ 188.16666666666666 : 0/6
    └─ 245.0 : 0/8


## Decision Tree Regression in MLJ

**Your Turn**: Fit a Decision Tree Classifier using MLJ to predict the vote for pinochet.

In [11]:
## Answers here

## Ensemble Learning

## Introduction to Ensemble Learning

### Weak Learners

**Definition**: A model that is only slightly better than random guessing.
- Usually a very simple model (e.g., a classification tree!).
- Very low accuracy.

Why would we want to ever use something like that for prediction?
- *Computational efficiency*: Imagine how easy it may be to fit a one-leaf classification tree.
- *Lower chance of overfitting*: They will perform poorly, and this guarantees low chance of overfitting.

## Introduction to Ensemble Learning

### Weak Learners

Combining multiple weak learners can lead to a powerful prediction, as opposed to the individual weak learners which are not very useful.

The keywork here is **combine**. How does that work?

## Introduction to Ensemble Learning

We do that by using **ensembles**.

**Definition**: Ensemble Learning is a technique where multiple "weak learners" are trained and combined to solve a specific problem.

The goal is to improve the overall accuracy, robustness, and performance of the prediction.

## Detour: Bootstrap

- To understand some ensemble techniques, we must first learn what a bootstrap is.

- [**Bootstrap**](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)): Technique to fit models empirically, without deriving theoretically the parameters of interest.
    + We use it a lot to find standard errors and run things like [*exact tests*](https://en.wikipedia.org/wiki/Exact_test) and [randomization inference](https://dimewiki.worldbank.org/Randomization_Inference).

- Very empirical!

## Detour: Bootstrap

**Algorithm:** Start with the number of repetitions, N. For each $1, 2, \cdots, N$ step:

1. Draw a sample of the dataset [**with replacement**](https://en.wikipedia.org/wiki/Resampling_(statistics)) that has the same size of the dataset.

2. Fit the model (e.g., a regression) in the randomly drawn dataset.

3. Save the coefficient of interest.

In the end, take the mean of the coefficient as the `bootstrapped` coefficient and the standard deviation as the standard error.

## Bagging

- Bagging stands for Bootstrap Aggregation.

- The idea is to fit each tree on a bootstrapped dataset, then take the average of all trees.

- Each tree performs poorly. However, the average performance of all of them is better!

$$ \hat{f}_{bag}(x) \ = \ \dfrac{1}{B}\sum_{b = 1}^B \hat{f}^b(x) $$

## Bagging

- We let trees grow wildly: no pruning!
    + Averaging them out is what reduces the variance!
    
- For continuous variables, we take averages as the predicted value.

- How about classification problems?
    + Majority vote for all trees!

## Bagging

![bag1](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/bag1.png?raw=true)

## Bagging

- This straightforward technique decreases the variance of a tree significantly.

- But how to interpret what the average of the trees means?
    - Well, we lose in terms of interpretation...

- One positive thing is that we can still find the **importance of each variable for the *bagging***

## Bagging

![bag2](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/bag2.png?raw=true)

## Bagging

- Cross-validation here can be improved by using something called **out-of-bag** errors:
    + The bootstrap process usually leaves out 1/3 of the sample.
    + We can take advantage of this left-out (or *out-of-bag* sample), to estimate our models.

- And we fit RMSE for continuous or accuracy for discrete.

## Random Forests

- It is **not** a place where data scientists go camping.

- Random forests intend to improve the effectiveness of our bagging estimates.

- Each bagging tree in the ensemble can be highly correlated with each other.
    - This messes up the prediction because it reduces the contribution of each tree.

- To fix that, we tweak the bagging to *decorrelate* the trees.

## Random Forests

- A simple way to do that is only to consider a subset of the predictors at each tree.
    - Why would we even want to do that?
    
- Let a strong predictor with a bunch of other weak ones. Then:
    1. All bagging trees will rely on the stronger predictor more than the others.
    2. Subsetting the number of variables, considering subsets where the strong predictor is not there, improves the usage of the weak predictors.
        - This *decorrelates* the trees!

- Rule of thumb: Use $m = \sqrt{p}$ predictors at a time.

## Random Forests

- Choose a small(er) $m$ if the predictors are all highly correlated.

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/rf1.png?raw=true)

## Boosting

- Ensemble method that combines weak learners to form a stronger one.
    + Example: Regression tree that is only allowed to have one leaf!

- It builds on accumulation: Every predictor tries to improve the predecessor's job.
    - Work with the errors of the previous models, and update the fit slowly.

## Boosting

**Algorithm:** Start with a null model ($\hat{f}(x) = 0$), the residual equals to $r_i = y_i$, and a number $B$ of steps.

For each $b \in \{1, 2, \cdots, B\}$:

1. Fit a tree $\hat{f}^b(x)$ with $d$ splits (or d+1 terminal nodes).

2. Set:

$$ \hat{f}_{new}(x) = \hat{f}_{old}(x) + \lambda \hat{f}^b(x) $$

3. Set 

$$ r_{i_{new}} = r_{i_{old}} - \lambda \hat{f}^b(x) $$

At the end, you should define $\hat{f}(x)$ as:

$$ \hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x) $$

## Boosting

- You can overfit using boosting. But only if $B$ is too large.

- $\lambda$: Controls the rate that your boosting algorithm is learning.
    + Small $\lambda$s require large $B$s

- $d$: Controls the complexity of each step. $d=1$ tends to work well!

## Boosting

![imgb](https://upload.wikimedia.org/wikipedia/commons/b/b5/Ensemble_Boosting.svg)

## Next Lecture

Next lecture we will fit:

- Bagging
- Random Forests
- Boosting

Using MLJ.

# Questions?

# See you next class
