
**<span style="color:#448844">Note</span>** This notebook is meant to be interactive. Launch this notebook in Jupyter to see its full potential.


Name: Aaron Palpallatoc

Section: S11

# Ensemble Models Exercise
This exercise will guide you in implementing 3 ensemble models: random forest (bagging), XGBoost (boosting), and AdaBoost (boosting).


## Instructions
* Read each cell and implement the TODOs sequentially. The markdown/text cells also contain instructions which you need to follow to get the whole notebook working.
* Do not change the variable names unless the instructor allows you to.
* Answer all the markdown/text cells with "A: " on them. The answer must strictly consume one line only.
* You are expected to search how some functions work on the Internet or via the docs.
* There are commented markdown cells that have crumbs. Do not delete them or separate them from the cell originally directly below it.  
* You may add new cells for "scrap work" as long as the crumbs are not separated from the cell below it.
* When you are asked to tweak the parameters/code, make sure you bring it back to the originally requested code or place the tweaked code in a "scrap" cell.
* The notebooks will undergo a "Restart and Run All" command, so make sure that your code is working properly.
* You are expected to understand the data set loading and processing separately from this class.
* You may not reproduce this notebook or share it with anyone.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# %matplotlib inline
# plt.style.use('ggplot')

# plt.rcParams['figure.figsize'] = (12.0, 8.0) # set default size of plots
# plt.rcParams['image.interpolation'] = 'nearest'

# Fix the seed of the random number 
# generator so that your results will match ours
np.random.seed(1)

%load_ext autoreload
%autoreload 2

# Datasets

In this first section, we will load two datasets: <a href="https://archive-beta.ics.uci.edu/ml/datasets/186">the wine quality dataset</a> and <a href="https://archive-beta.ics.uci.edu/ml/datasets/20">the census income dataset</a>. Both datasets are available on animoopenspace, so please download those two files and make sure they are in the same directory as this notebook.

You can access both datasets and more in the UCI machine learning repository.

__Our regression dataset: wine quality__

In [None]:
df_wine_quality = pd.read_csv("wine_quality.csv", sep=";")

__Our classification dataset: census income dataset__

In [None]:
df_census_income = pd.read_csv("census_income.csv")

### Wine quality dataset
The wine dataset is composed of two files, one for red wine and another for white wine. We will only load the red dataset, but you can load both if you like. We want to know the quality of the wine and assign a score from 0-10. This can be treated as a classification or regression task. There are 1,599 instances in the red wine dataset, and 4,898 instances in the white dataset.

**Attribute Information:**
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
- quality: a score between 0-10 that we want to predict

In [None]:
df_wine_quality

### Census income dataset
This data was extracted from the US census bureau database found at https://archive-beta.ics.uci.edu/ml/datasets/census+income. The goal of the original study was to determine whether a person makes more than or less than USD 50,000 a year given some other info available in the census. There are technically two files, one each for the training data and the test data, but this notebook only loads the training data.

**Attribute Information:**
- age
- workclass
- fnlwgt (sampling weight -- will be removed)
- education
- educationnum
- maritalstatus
- occupation
- relationship
- race
- sex
- capitalgain
- capitalloss
- hoursperweek
- nativecountry
- label: either '<=50k' or '>50k'

This CSV file does not include the column names, so let's add them first:

In [None]:
df_census_income.columns = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship','race','sex','capitalgain','capitalloss', 'hoursperweek', 'nativecountry', 'label']
df_census_income
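As a side note on header-less files: by default, `pd.read_csv` treats the first line of a file as the header row, so for a file with no header that first record ends up as column names instead of data. The sketch below illustrates this on a tiny in-memory CSV; the three-column sample and the `toy_names` list are made up for illustration and are not the census file.

```python
import io

import pandas as pd

# Hypothetical two-row CSV with no header line (not the real census data)
raw = "39,State-gov,<=50K\n50,Private,>50K\n"
toy_names = ["age", "workclass", "label"]

# Default behaviour: the first line is consumed as the header
default_read = pd.read_csv(io.StringIO(raw))

# Passing header=None together with names= keeps every line as data
explicit_read = pd.read_csv(io.StringIO(raw), header=None, names=toy_names)

print(len(default_read), "row(s) with the default header handling")
print(len(explicit_read), "row(s) with header=None")
```

Whether this matters for `census_income.csv` depends on how the provided file was prepared, so check the first row of the loaded DataFrame against the raw file before changing the loading cell.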

In [None]:
df_census_income = df_census_income.drop(["fnlwgt"], axis=1)
df_census_income

Decision trees can handle both categorical and numerical features in theory, but sklearn's implementation cannot handle categorical features. We will fix this later.

<hr>

# Making the regression models
We will make 3 regression models: a simple decision tree, a random forest regressor, and an `xgboost` model. All will be trained on the wine quality dataset.

Let's prepare our `X` feature dataset and `y` label vector. Extract the feature columns for `X`, and the label column for `y`.

__Hint__ : For `X`, look up `pandas.DataFrame.drop()`. You can convert a DataFrame to a NumPy array using `your_dataframe.values`.
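To make the hint concrete, here is a minimal sketch on a made-up DataFrame. The names `df_toy`, `X_toy`, `y_toy` and the column `target` are placeholders for illustration only, not the wine dataset.

```python
import pandas as pd

# Toy frame: two feature columns and one label column (hypothetical names)
df_toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "target": [0, 1, 0]})

# drop() returns a new DataFrame; df_toy itself is left unchanged
X_toy = df_toy.drop(columns="target").values
y_toy = df_toy["target"].values

print(X_toy.shape)  # one row per sample, one column per feature
print(y_toy.shape)  # one label per sample
```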

In [None]:
# write code here
X_wine =  None
y_wine = None

print(X_wine.shape)
print(y_wine.shape)

Split `X_wine` and `y_wine` into training and test sets. Set the random state to `1`.
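The effect of fixing the random state can be seen in a small sketch on synthetic arrays (the `X_toy` and `y_toy` arrays are made up for illustration). With no `test_size` given, `train_test_split` holds out 25% of the samples, and calling it twice with the same `random_state` yields the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same random_state on both calls, so both calls shuffle identically
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
X_tr_again, X_te_again, y_tr_again, y_te_again = train_test_split(
    X_toy, y_toy, random_state=1
)

print(X_tr.shape, X_te.shape)
print("Identical test sets:", np.array_equal(X_te, X_te_again))
```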

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# write code here
X_train_wine, X_test_wine, y_train_wine, y_test_wine = None

print("wine train and test split")
print("X_train_wine: ", X_train_wine.shape)
print("y_train_wine: ", y_train_wine.shape)
print("X_test_wine: ", X_test_wine.shape)
print("y_test_wine: ", y_test_wine.shape)

### Training a simple decision tree

We'll train a decision tree regressor first, then compare its results to the ensemble models.

In [None]:
from sklearn.tree import DecisionTreeRegressor
# DecisionTreeRegressor?

Build a normal regression tree using the default hyperparameters, and train it with our training data

In [None]:
# write code here
dtr = None


Get the training predictions

In [None]:
# write code here
predictions_train = None

predictions_train

Note that the predictions above are whole numbers. This is not because the model treats the label as a categorical value: the regressor treats `quality` as a numerical variable, but every label in the training data is an integer score, so the values it reproduces on the training set are integers as well.

Calculate the root mean squared error (RMSE). We will write a function for this because we will be computing the RMSE multiple times in the notebook.

___

`compute_rmse()` computes the root mean squared error given two vectors of equal length.

__Inputs:__
- `predictions`: A numpy array of shape `(N,)` consisting of `N` samples representing the predicted values
- `actual`: A numpy array of shape `(N,)` consisting of `N` samples representing the actual (target) values

__Outputs:__
- `rmse`: A scalar representing the root mean squared error between `predictions` and `actual`
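For reference, the usual definition of the root mean squared error is given below, writing $\hat{y}_i$ for the $i$-th entry of `predictions` and $y_i$ for the $i$-th entry of `actual` (this notation is an assumption of this note, not something the notebook defines):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$

As a quick check by hand, take predictions $(3, 5, 6)$ and actual values $(3, 6, 8)$. The errors are $(0, -1, -2)$, their squares are $(0, 1, 4)$, the mean of the squares is $5/3$, and so

$$\mathrm{RMSE} = \sqrt{5/3} \approx 1.291$$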

In [None]:
def compute_rmse(predictions, actual):
    # write code here
    return None

Compute the train RMSE of the model's predictions vs the ground truth labels

In [None]:
rmse = compute_rmse(predictions_train, y_train_wine)

print("Decision tree regressor training RMSE:", rmse)

**Sanity Check**: You should get an RMSE of 0.

**Question #1**: Why are we getting an RMSE of 0?

<!--crumb;qna;Question: Why are we getting an RMSE of 0?-->

A: 

__Question #2:__ In what situation can decision tree regressors not get an RMSE of 0 despite overfitting?

<!--crumb;qna;Question: In what situation can decision tree regressors not get an RMSE of 0 despite overfitting?-->

A: 

Let's test our model on the test set. Run predictions on the test set.

In [None]:
# write code here
predictions_test = None

predictions_test

Let's get the test performance

In [None]:
rmse = compute_rmse(predictions_test, y_test_wine)

print("Decision tree regressor test RMSE:", rmse)

**Sanity Check**: The RMSE here should be higher than the training data RMSE.

__Visualizing our decision tree regressor__

In [None]:
from sklearn import tree

tree.plot_tree(dtr)
plt.show()

### Training a random forest regression model

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# RandomForestRegressor?

Train a random forest model with 300 base models. You can check out the other parameters in `RandomForestRegressor` and tweak it later. For now, create a random forest with `300` base models and train it. Set the random state to `42`.
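If the `RandomForestRegressor` API is unfamiliar, the following sketch shows the general pattern on synthetic data. Everything here is illustrative: the data, the names `X_toy`, `y_toy`, `toy_forest`, and the choice of `10` trees with seed `0` are assumptions for the example, not the settings requested above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: the target depends mostly on the first feature
rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(100, 3))
y_toy = 2 * X_toy[:, 0] + rng.normal(scale=0.1, size=100)

# n_estimators sets the number of base models (trees) in the forest
toy_forest = RandomForestRegressor(n_estimators=10, random_state=0)
toy_forest.fit(X_toy, y_toy)

# The fitted trees are stored in the estimators_ list
print("Number of base models:", len(toy_forest.estimators_))
print("First predictions:", toy_forest.predict(X_toy[:5]))
```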

In [None]:
# write code here
rfr = None


Run predictions on the train set

In [None]:
# write code here
predictions_train = None

predictions_train

**Question #2:** Why are we getting floats and not integers as our predictions?

A: 

Compute the train RMSE of the model's predictions vs the ground truth labels

In [None]:
rmse = compute_rmse(predictions_train, y_train_wine)

print("Random forest regressor train RMSE:", rmse)

**Sanity check:** The RMSE should be ~0.2152.

**Question #3:** Random forests are supposed to have a lower loss than decision trees. Why is our random forest's RMSE larger than our decision tree's RMSE?

<!--crumb;qna;Question: Random forests are supposed to have a lower loss than decision trees. Why is our random forest's RMSE larger than our decision tree's RMSE?-->

A: 

Let's now try our random forest's performance on our test set

In [None]:
# write code here
predictions_test = None

predictions_test

Get the test RMSE

In [None]:
rmse = compute_rmse(predictions_test, y_test_wine)

print("Random forest regressor test RMSE:", rmse)

Compare our random forest's RMSE to the decision tree's RMSE on the test set. The random forest should have a smaller test RMSE.

__Question #4:__ What is the test RMSE of our random forest regressor?

<!--crumb;qna;Question: What is the test RMSE of our random forest regressor?-->

A: 

**Question #5:** Why is our random forest's test RMSE smaller than the decision tree's RMSE?

<!--crumb;qna;Question: Why is our random forest's test RMSE smaller than the decision tree's RMSE?-->

A: 

**Let's visualize one random forest base model.** If you remember, in our lecture, we do not mind an overfit base model. Let's see the effect here now.

In the following cell, get the first base model from the random forest model.

In [None]:
# write code here
estimator = None

estimator

The code below will generate an image of one estimator/base model. Note that each model image will be around 12 MB.

```
from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='rf_regression_base_tree.dot', 
                feature_names = df_wine_quality.drop(columns="quality").columns,
                class_names = df_wine_quality["quality"].unique(),
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'rf_regression_base_tree.dot', '-o', 'rf_regression_base_tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'rf_regression_base_tree.png')
```

<font color="red"> __Note:__ You need to run the code above to answer the question below, __but make sure that you delete the code cell before you submit your notebook. Failure to delete it will result in major deductions__ </font>

__Question #6:__ How will you describe the figure shown above?

<!--crumb;qna;Question: How will you describe the figure shown above?-->

A: 

Applying what you have learned from the previous notebook, get the base estimator's number of nodes.

In [None]:
# write code here


Get the base estimator's max tree depth.

In [None]:
# write code here


__Question #7:__ How many nodes does this estimator have?

<!--crumb;qna;Question: How many nodes does this estimator have?-->

A: 

__Question #8:__ What is the max depth of this estimator?

<!--crumb;qna;Question: What is the max depth of this estimator?-->

A: 

<hr>

# Making the classifier models
We will make 3 models: a simple decision tree, a random forest classifier, and an AdaBoost model. All models will be trained on the census income dataset.

In [None]:
df_census_income

While decision trees can handle a mix of categorical and numerical feature data in theory, sklearn's implementation of decision trees and random forests can unfortunately only handle numerical features. 

To make our model still accept the entirety of our census income dataset, we will **label encode** our categorical data. **Label encoding** means that we will be assigning an integer to each possible class in one feature, and use these label-numbers as our new data.

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-1wig{font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig">fruit</th>
    <th class="tg-1wig">label_encoded_fruit</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">apple</td>
    <td class="tg-baqh">1</td>
  </tr>
  <tr>
    <td class="tg-baqh">banana</td>
    <td class="tg-baqh">2</td>
  </tr>
  <tr>
    <td class="tg-baqh">orange</td>
    <td class="tg-baqh">3</td>
  </tr>
  <tr>
    <td class="tg-baqh">apple</td>
    <td class="tg-baqh">1</td>
  </tr>
  <tr>
    <td class="tg-baqh">apple</td>
    <td class="tg-baqh">1</td>
  </tr>
  <tr>
    <td class="tg-baqh">orange</td>
    <td class="tg-baqh">3</td>
  </tr>
  <tr>
    <td class="tg-baqh">banana</td>
    <td class="tg-baqh">2</td>
  </tr>
  <tr>
    <td class="tg-baqh">banana</td>
    <td class="tg-baqh">2</td>
  </tr>
  <tr>
    <td class="tg-baqh">banana</td>
    <td class="tg-baqh">2</td>
  </tr>
  <tr>
    <td class="tg-baqh">apple</td>
    <td class="tg-baqh">1</td>
  </tr>
  <tr>
    <td class="tg-baqh">apple</td>
    <td class="tg-baqh">1</td>
  </tr>
  <tr>
    <td class="tg-baqh">orange</td>
    <td class="tg-baqh">3</td>
  </tr>
</tbody>
</table>

The table above shows a column called `fruit`. After label encoding the `fruit` column, we assign each fruit to the following integers:
<center> apple: 1 </center>
<center> banana: 2 </center>
<center> orange: 3</center>
This gives us the new column `label_encoded_fruit`. 

Fortunately, sklearn has a built-in class called `LabelEncoder` that does this mapping for us automatically.
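As a small illustration of the idea, the sketch below label encodes a short made-up list of fruits (the names `fruits`, `toy_encoder`, and `codes` are placeholders). Note one difference from the table above: `LabelEncoder` numbers the classes starting from `0`, in sorted order, rather than from `1`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy categorical values (hypothetical, mirrors the fruit table)
fruits = ["apple", "banana", "orange", "apple", "orange"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(fruits)

# classes_ holds the sorted unique categories; the index is the code
print("Classes:", list(toy_encoder.classes_))
print("Codes:  ", list(codes))

# inverse_transform maps the integer codes back to the original labels
print("Decoded:", list(toy_encoder.inverse_transform(codes)))
```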

### Preparing our dataset

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

Let's select the categorical features that will be transformed:

In [None]:
categorical_columns = df_census_income.select_dtypes(include=[object]).columns

Then, we'll call the `encoder.fit_transform()` function on each categorical column

In [None]:
encoded_columns = df_census_income[categorical_columns].apply(encoder.fit_transform)
encoded_columns

We can also call the encoder's `fit_transform` on each column manually. That way, we can see which categories of each feature are labelled `0, ..., n`.

Since we applied `fit_transform` through pandas' `apply` function with a single encoder, the encoder only keeps the classes of the last column it converted. The cell below shows what `0` and `1` mean for the `label` column.

In [None]:
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
mapping
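If you want the mapping for every categorical column and not just the last one, one option is to keep a separate encoder per column. The sketch below does this on a made-up DataFrame; `df_toy`, `encoders`, `encoded`, and `mappings` are hypothetical names, and this is only one possible approach, not the one the notebook requires.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two categorical columns (not the census data)
df_toy = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "L", "M"]})

encoders = {}
encoded = df_toy.copy()
for col in df_toy.columns:
    # One encoder per column, so each column's classes_ are preserved
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df_toy[col])

# Category -> integer mapping for every column
mappings = {
    col: dict(zip(enc.classes_.tolist(), range(len(enc.classes_))))
    for col, enc in encoders.items()
}
print(mappings)
```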

We will now replace all the categorical columns with this newly transformed data. Assign the encoded data to the categorical columns of `df_census_income`.

In [None]:
# write code here


df_census_income

Separate our features from our labels

In [None]:
# write code here
X_census = None
y_census = None

And, split our data into training and test data. Set the random state to `42`.

In [None]:
# write code here
X_train_census, X_test_census, y_train_census, y_test_census = None

print("census train and test split")
print("X_train_census: ", X_train_census.shape)
print("y_train_census: ", y_train_census.shape)
print("X_test_census: ", X_test_census.shape)
print("y_test_census: ", y_test_census.shape)

### Training a simple decision tree classifier

We'll train a decision tree first then compare its performance to the ensemble models.

In [None]:
from sklearn.tree import DecisionTreeClassifier

Train a decision tree with the default hyperparameters

In [None]:
# write code here
dtc = None


Run predictions on the train set

In [None]:
# write code here
predictions_train = None

predictions_train

We will be computing the accuracy multiple times in this notebook, so let's create a function for this.

`compute_accuracy()` computes the accuracy given two vectors of equal length.

__Inputs:__
- `predictions`: A numpy array of shape `(N,)` consisting of `N` samples representing the predicted values
- `actual`: A numpy array of shape `(N,)` consisting of `N` samples representing the actual (target) values

__Outputs:__
- `accuracy`: A scalar representing the percentage of elements where `predictions` and `actual` match out of the total number of elements
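Written out, with $\hat{y}_i$ for the $i$-th entry of `predictions`, $y_i$ for the $i$-th entry of `actual`, and $\mathbb{1}[\cdot]$ equal to 1 when its condition holds and 0 otherwise (this notation is an assumption of this note), the accuracy as a percentage is:

$$\text{accuracy} = \frac{100}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{y}_i = y_i\right]$$

For example, with predictions $(1, 0, 1, 1)$ and actual values $(1, 0, 0, 1)$, three of the four entries match, so the accuracy is $100 \times 3/4 = 75$.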

In [None]:
def compute_accuracy(predictions, actual):
    # write code here
    return None

Let's get the training accuracy

In [None]:
acc = compute_accuracy(predictions_train, y_train_census)

print("Decision tree classifier train accuracy:", acc, "%")

**Sanity check:** The accuracy should be ~98%.

Let's try our model on the test set

In [None]:
# write code here
predictions_test = None

predictions_test

Get the test accuracy

In [None]:
acc = compute_accuracy(predictions_test, y_test_census)

print("Decision tree classifier test accuracy:", acc, "%")

__Question #9:__ What is the decision tree classifier's test accuracy?

<!--crumb;qna;Question: What is the decision tree classifier's test accuracy?-->

A: 

### Training a random forest classifier
We will use the `sklearn.ensemble.RandomForestClassifier` class to make a random forest classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# RandomForestClassifier?

Train a random forest model with `300` base models. You can check out the other parameters in `RandomForestClassifier` and tweak it later. Set the random state to `42`.

In [None]:
# write code here
rfc = None


In [None]:
# write code here
predictions_train = None

predictions_train

Get the training performance

In [None]:
acc = compute_accuracy(predictions_train, y_train_census)

print("Random forest classifier train accuracy:", acc, "%")

The random forest classifier's train accuracy should be lower than the decision tree classifier's train accuracy.

**Question #10:** What will happen to the training accuracy if we have a lower number of base models?

<!--crumb;qna;Question: What will happen to the training accuracy if we have a lower number of base models?-->

A: 

Let's now try our random forest's performance on our test set

In [None]:
# write code here
predictions_test = None

predictions_test

Get the test performance

In [None]:
acc = compute_accuracy(predictions_test, y_test_census)

print("Random forest classifier test accuracy:", acc, "%")

Compare our random forest's accuracy to the decision tree's accuracy on the test set. The random forest should have a higher test accuracy.

__Feature importance.__ Get the feature importance detected by the random forest classifier

In [None]:
# write code here
feature_importance = None

feature_importance

__Sanity check:__ You should see a vector of length `13`, one for each of our features
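To get a feel for what these numbers mean, here is a sketch on synthetic data where only the first feature determines the label. The data and the names `X_toy`, `y_toy`, `toy_forest`, and `importances` are made up for illustration; the forest size and seed are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Three random features; the label depends only on feature 0
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_forest = RandomForestClassifier(n_estimators=25, random_state=0)
toy_forest.fit(X_toy, y_toy)

# One importance per feature; sklearn normalizes them to sum to 1
importances = toy_forest.feature_importances_
print(importances)
print("Most important feature index:", int(np.argmax(importances)))
```

The feature that actually drives the label should dominate, while the two noise features receive only small importances.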

The following code will allow us to see the feature importance next to the feature name

In [None]:
df_rfc_importance = pd.DataFrame(data=feature_importance, index=df_census_income.drop(["label"], axis=1).columns, columns=["importance"])
df_rfc_importance

__Question #11:__ What are the top 4 most discriminating features? Order them from most important to least important.

<!--crumb;qna;Question: What are the top 4 most discriminating features? Order them from most important to least important.-->

A: 

__Question #12:__ What are the top 2 least discriminating features? Least important first.

<!--crumb;qna;Question: What are the top 2 least discriminating features? Least important first.-->

A: 

__Question #13:__ What can you, as a modeller, do with this list of feature importance?

<!--crumb;qna;Question: What can you, as a modeller, do with this list of feature importance?-->

A: 

### Training an adaboost classifier
We can use the `sklearn.ensemble.AdaBoostClassifier` class to build our AdaBoost classifier.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
# AdaBoostClassifier?

Train an AdaBoost model with `300` base models. You can check out the other parameters in `AdaBoostClassifier` and tweak it later. Set the random state to `42`.
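The general usage pattern of `AdaBoostClassifier` can be sketched on synthetic data as follows. The dataset from `make_classification`, the names `X_toy`, `y_toy`, `toy_boost`, and the choice of `20` estimators with seed `0` are all assumptions for the example, not the settings requested above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification data (not the census data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# n_estimators is the maximum number of boosting rounds
toy_boost = AdaBoostClassifier(n_estimators=20, random_state=0)
toy_boost.fit(X_toy, y_toy)

# With no base estimator specified, each weak learner is a depth-1 decision tree
first_learner = toy_boost.estimators_[0]
print(type(first_learner).__name__, "with depth", first_learner.get_depth())
print("Fitted weak learners:", len(toy_boost.estimators_))
```

Boosting can stop early, so the number of fitted weak learners may be lower than `n_estimators`.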

In [None]:
# write code here
abc = None


Get the training predictions

In [None]:
# write code here
predictions_train = None

predictions_train

Get the training accuracy

In [None]:
acc = compute_accuracy(predictions_train, y_train_census)

print("Adaboost classifier train accuracy:", acc, "%")

**Sanity check:** The accuracy should be ~87%.

**Question #14:** What will happen to the train accuracy if our adaboost model has a lower number of base models?

<!--crumb;qna;Question: What will happen to the train accuracy if our adaboost model has a lower number of base models?-->

A: 

Let's now try our AdaBoost model's performance on our test set

In [None]:
# write code here
predictions_test = None

predictions_test

Get the test accuracy

In [None]:
acc = compute_accuracy(predictions_test, y_test_census)

print("Adaboost classifier test accuracy:", acc, "%")

__Question #15:__ What is the Adaboost classifier's test accuracy?

<!--crumb;qna;Question: What is the Adaboost classifier's test accuracy?-->

A: 

__Question #16:__ Why is it expected for the decision tree classifier to equal or outperform the ensemble models in terms of training performance?

<!--crumb;qna;Question: Why is it expected for the decision tree classifier to equal or outperform the ensemble models in terms of training performance?-->

A: 

__Question #17:__ Why did both ensemble models outperform the decision tree classifier in terms of test performance?

<!--crumb;qna;Question: Why did both ensemble models outperform the decision tree classifier in terms of test performance?-->

A: 

**Let's visualize one AdaBoost base model.** Get the first base model from the AdaBoost model

In [None]:
estimator = abc.estimators_[0]

estimator

In [None]:
from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='ab_classification_base_tree.dot', 
                feature_names = df_census_income.drop(columns="label").columns,
                class_names = encoder.classes_,
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'ab_classification_base_tree.dot', '-o', 'ab_classification_base_tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'ab_classification_base_tree.png')

**Note:** You can edit the .dot files to configure how the tree will be graphed.

**Note:** You can also change the estimator index to select the base model you want to visualize.

__Question #18:__ How will you describe the tree/chart above?

<!--crumb;qna;Question: How will you describe the tree/chart above?-->

A: 

__Question #19:__ What feature was used in the __first__ decision stump?

<!--crumb;qna;Question: What feature was used in the first decision stump?-->

A: 

<hr>

# Summary

* Ensemble models generally have a higher training error than their overfit single decision tree counterparts.

* The advantage of using ensemble models is their ability to lower the test (and validation) error through the way they "manipulate" the bias-variance decomposition.

* We did not make a new kind of model; we just used simple decision trees as our base models and built a meta-learning algorithm over them.

* The benefits of using ensemble models may seem small right now, but the effect is clearer with more complex datasets.

**References:**
- P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
  
- Ron Kohavi (1996). Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.

## <center>fin</center>


<!-- DO NOT MODIFY OR DELETE THIS -->

<sup>made/compiled by daniel stanley tan & courtney anne ngo 🐰 & thomas james tiam-lee</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> danieltan07@gmail.com & courtneyngo@gmail.com & thomasjamestiamlee@gmail.com</sup><br>
<sup>please cc your instructor, too</sup>
<!-- DO NOT MODIFY OR DELETE THIS -->