# POLI 175 - Quiz 05

In this quiz, you will fit a few ensemble models.

Due date: Mar 12, 2024

Again, the quiz is graded as follows:

$$ 0.7 \times \text{TRY} + 0.3 \times \text{CORRECT} $$

The points below refer to the correctness part.
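As a quick illustration of the formula above, here is a minimal sketch in Julia. The name `quiz_grade` is hypothetical, and the sketch assumes both components are scored on the same scale (0 to 10 here), which the quiz does not state explicitly.

```julia
# Hypothetical helper: combines the two grading components.
# Assumption: both scores are expressed on the same scale (0 to 10 here).
quiz_grade(try_score, correct_score) = 0.7 * try_score + 0.3 * correct_score

# Example: full marks for trying, 6 out of 10 for correctness
grade = quiz_grade(10, 6)
```

Under these assumptions, most of the grade comes from attempting every question.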

## Running Dataset

### [Chile Survey](https://en.wikipedia.org/wiki/Chile)

In 1988, the [Chilean dictator](https://en.wikipedia.org/wiki/Military_dictatorship_of_Chile) [Augusto Pinochet](https://en.wikipedia.org/wiki/Augusto_Pinochet) held a [referendum on whether he should stay in power](https://en.wikipedia.org/wiki/1988_Chilean_presidential_referendum).

[FLACSO](https://en.wikipedia.org/wiki/Latin_American_Faculty_of_Social_Sciences) in Chile conducted a survey of 2,700 respondents. We are going to build a model to predict their voting intentions.

| **Variable** | **Meaning** |
|:---:|---|
| region | A factor with levels:<br>- `C`, Central; <br>- `M`, Metropolitan Santiago area; <br>- `N`, North; <br>- `S`, South; <br>- `SA`, city of Santiago. |
| population | The population size of respondent's community. |
| sex | A factor with levels: <br>- `F`, female; <br>- `M`, male. |
| age | The respondent's age in years. |
| education | A factor with levels: <br>- `P`, Primary; <br>- `S`, Secondary; <br>- `PS`, Post-secondary. |
| income | The respondent's monthly income, in Pesos. |
| statusquo | A scale of support for the status-quo. |
| voteyes | A dummy variable equal to 1 for<br>a vote in favor of Pinochet, and 0 otherwise. |

Let me pre-process the data a bit for you.

In [1]:
## Loading the packages (make sure you have those installed)
using DataFrames
using MLJ, MLJIteration
using MLJModels
import MLJLinearModels, MLJBase
import MultivariateStats, MLJMultivariateStatsInterface
import CSV, Plots, GLM, StatsBase, Random
import LaTeXStrings, StatsPlots, Lowess, Gadfly, RegressionTables
import CovarianceMatrices, Econometrics, LinearAlgebra, MixedModelsExtras
import Missings, StatsAPI, FreqTables, EvalMetrics
import DecisionTree, MLJDecisionTreeInterface
import XGBoost, MLJXGBoostInterface

# Solver
solver = MLJLinearModels.NewtonCG()

## Loading the data
chile = CSV.read(
    download("https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/chilesurvey.csv"), 
    DataFrame,
    missingstring = ["NA"]
); dropmissing!(chile)

## Process target variable
chile.voteyes = ifelse.(chile.vote .== "Y", 1, 0)

# Pre-process numeric variables (log them)
chile.income_log = log.(chile.income);
chile.pop_log = log.(chile.population);

select!(chile, Not(:vote, :income, :population));

In [2]:
# Adapted from @xiaodaigh: https://github.com/xiaodaigh/DataConvenience.jl
function onehot!(df::AbstractDataFrame, 
        col, cate = sort(unique(df[!, col])); 
        outnames = Symbol.(col, :_, cate))
    transform!(df, @. col => ByRow(isequal(cate)) .=> outnames)
end

# One-hot encoding (we will learn a better way to do it later)
onehot!(chile, :region);
onehot!(chile, :education);
onehot!(chile, :sex);

# Drop reference categories
select!(chile, Not(:region, :sex, :education, :region_C, :education_P, :sex_M));

To make things easier, I unpack the target and the features for you. All questions use this full specification.

In [3]:
# Full Specification
y, X = unpack(
    chile,
    ==(:voteyes),
    c -> true;
    :voteyes      => Multiclass,
    :income_log   => Continuous,
    :statusquo    => Continuous,
    :pop_log      => Continuous,
    :age          => Continuous,
    :region_M     => Multiclass,
    :region_N     => Multiclass,
    :region_S     => Multiclass,
    :region_SA    => Multiclass,
    :sex_F        => Multiclass,
    :education_S  => Multiclass,
    :education_PS => Multiclass,
);

In [4]:
# Target
FreqTables.freqtable(y)

2-element Named Vector{Int64}
Dim1  │ 
──────┼─────
0     │ 1595
1     │  836

In [5]:
# Features
first(X, 3)

| Row | age | statusquo | income_log | pop_log | region_M | region_N | region_S | region_SA | education_PS | education_S | sex_F |
|---:|---:|---:|---:|---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|  | Float64 | Float64 | Float64 | Float64 | Cat… | Cat… | Cat… | Cat… | Cat… | Cat… | Cat… |
| 1 | 65.0 | 1.0082 | 10.4631 | 12.0725 | false | true | false | false | false | false | false |
| 2 | 29.0 | -1.29617 | 8.92266 | 12.0725 | false | true | false | false | true | false | false |
| 3 | 38.0 | 1.23072 | 9.61581 | 12.0725 | false | true | false | false | false | false | true |


### Helpers

To save you time, I am instantiating below:

1. A Decision Tree Classifier (`tree_model`).
1. An `AdaBoostStumpClassifier` (`adaboost_model`): an AdaBoost classifier whose base learner is a stump, i.e., a decision tree with a single split (one root node and two leaves).
1. An Extreme Gradient Boosting Classifier (`xgboost_model`).
1. A Random Forest model (`rf_model`).
1. A 70%-30% train-test split (`train`, `test`). The split is built from the indices of `y`.
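To make the idea of a stump concrete, here is a minimal sketch in plain Julia. The function `stump`, the cutoff, and the toy feature values are made up for illustration. This is not how `AdaBoostStumpClassifier` is implemented, only a picture of what a single split does.

```julia
# Toy feature values, made up for illustration
x = [-1.2, -0.4, 0.3, 1.1]

# Hypothetical stump: one split on one feature, two possible outputs (leaves)
stump(value; cutoff = 0.0) = value > cutoff ? 1 : 0

# Apply the stump to each observation
preds = stump.(x)
```

AdaBoost combines many such weak learners, giving more weight to the observations that earlier stumps misclassified.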

In [6]:
## Initialized for you

# Decision tree
tree_model = MLJDecisionTreeInterface.DecisionTreeClassifier();

# AdaBoostStump
adaboost_model = MLJDecisionTreeInterface.AdaBoostStumpClassifier();

# XGBoost
xgboost_model = MLJXGBoostInterface.XGBoostClassifier(num_round = 50);

# Random forest
rf_model = MLJDecisionTreeInterface.RandomForestClassifier();

# Train-test split
train, test = partition(eachindex(y), 0.7, shuffle = true, stratify = y, rng = 98765);

## Question 01: Run a Bagging Classifier (2 pts)

1. Instantiate the Bagging Ensemble Model with 50 bags (0.5 pts)

1. Fit the model on the training set (0.5 pts)

1. Compute the cross-validated (testing set) `accuracy`, `confusion_matrix`, and `f1score` (0.5 pts)

1. Plot the ROC curve (testing set) (0.5 pts)
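As a reminder of what these metrics measure, here is a minimal sketch in plain Julia that computes them by hand. The labels are made up for illustration, and this is not the MLJ solution: in your answer, use the MLJ measures named above.

```julia
# Toy labels, made up for illustration (1 = positive class, 0 = negative class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0]

# Cells of the confusion matrix
n_tp = count((y_true .== 1) .& (y_hat .== 1))  # true positives
n_tn = count((y_true .== 0) .& (y_hat .== 0))  # true negatives
n_fp = count((y_true .== 0) .& (y_hat .== 1))  # false positives
n_fn = count((y_true .== 1) .& (y_hat .== 0))  # false negatives

# Accuracy: share of correct predictions
acc = (n_tp + n_tn) / length(y_true)

# F1 score: harmonic mean of precision and recall
prec = n_tp / (n_tp + n_fp)
rec  = n_tp / (n_tp + n_fn)
f1   = 2 * prec * rec / (prec + rec)
```

Because the classes in this survey are unbalanced, accuracy and F1 can tell different stories, which is why the question asks for both.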

In [7]:
# Your answers here

## Question 02: Run a Random Forest Ensemble Classifier (2 pts)

1. The model is instantiated for you. You should use the one I provided. (0.5 pts)

1. Fit the model on the training set (0.5 pts)

1. Compute the cross-validated (testing set) `accuracy`, `confusion_matrix`, and `f1score` (0.5 pts)

1. Plot the ROC curve (testing set) (0.5 pts)
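A ROC curve traces the true-positive rate against the false-positive rate as the classification threshold varies. The sketch below computes a few points of such a curve by hand in plain Julia. The data and the helper `roc_point` are made up for illustration, and this is not a substitute for the plotting function you should use in your answer.

```julia
# Toy labels and predicted probabilities, made up for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
p_hat  = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

# Hypothetical helper: (false-positive rate, true-positive rate) at one threshold
function roc_point(y, p, threshold)
    pred = p .>= threshold                            # predicted positives
    tpr = count(pred .& (y .== 1)) / count(y .== 1)   # true-positive rate
    fpr = count(pred .& (y .== 0)) / count(y .== 0)   # false-positive rate
    return (fpr, tpr)
end

# Sweep the threshold from strict (1.0) to permissive (0.0)
thresholds = 1.0:-0.25:0.0
roc = [roc_point(y_true, p_hat, t) for t in thresholds]
```

The closer the curve gets to the top-left corner (low false-positive rate, high true-positive rate), the better the classifier separates the two classes.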

In [16]:
# Your answers here

## Question 03: Run an AdaBoost Classifier (2 pts)

1. I have instantiated the AdaBoost model for you. You should use the one I provided. (0.5 pts)

1. Fit the model on the training set (0.5 pts)

1. Compute the cross-validated (testing set) `accuracy`, `confusion_matrix`, and `f1score` (0.5 pts)

1. Plot the ROC curve (testing set) (0.5 pts)

In [24]:
# Your answers here

## Question 04: Run an eXtreme Gradient Boosting Classifier (2 pts)

1. I have instantiated the `xgboost_model` for you. You should use the one I provided. (0.5 pts)

1. Fit the model on the training set (0.5 pts)

1. Compute the cross-validated (testing set) `accuracy`, `confusion_matrix`, and `f1score` (0.5 pts)

1. Plot the ROC curve (testing set) (0.5 pts)

**Hint:** Fitting the XGBoost model prints the training history. No need to panic 🙂.

In [32]:
# Your answers here

## Question 05: Put all the classifiers in the same ROC curve (2 pts)

1. (0.5 pts) Save the predicted values as:
    1. `y_pred_prob_q1`: Bagging Classifier
    1. `y_pred_prob_q2`: Random Forest Classifier
    1. `y_pred_prob_q3`: AdaBoost Classifier
    1. `y_pred_prob_q4`: XGBoost Classifier

1. Plot the ROC curves in the same plot (1.0 pts)

1. Which model is the best? (0.5 pts)

**Hint**: Here is some code to get you started:

```julia
EvalMetrics.rocplot(
    [
        Vector{Int64}(y[test]), 
        Vector{Int64}(y[test]), 
        Vector{Int64}(y[test]), 
        Vector{Int64}(y[test])
    ], 
    [
        y_pred_prob_q1, 
        y_pred_prob_q2, 
        y_pred_prob_?, 
        y_pred_prob_?
    ], 
    label = ["?" "??" "???" "????";], 
    diagonal = true)
```

In [40]:
# Your answers here

**Great work!**