# POLI 175 - Quiz 04

In this quiz, you will fit a K-Nearest Neighbors classifier and use cross-validation to find the best $K$.

Due date: Feb 23, 2024

Again: The grading for the quiz is:

$$ 0.7 \times \text{TRY} + 0.3 \times \text{CORRECT} $$

The points below refer to the correctness part.

## Running Dataset

### [Chile Survey](https://en.wikipedia.org/wiki/Chile)

In 1988, the [Chilean dictator](https://en.wikipedia.org/wiki/Military_dictatorship_of_Chile) [Augusto Pinochet](https://en.wikipedia.org/wiki/Augusto_Pinochet) held a [referendum on whether he should remain in power](https://en.wikipedia.org/wiki/1988_Chilean_presidential_referendum).

[FLACSO](https://en.wikipedia.org/wiki/Latin_American_Faculty_of_Social_Sciences) in Chile conducted a survey of 2,700 respondents. We are going to build a model to predict their voting intentions.

| **Variable** | **Meaning** |
|:---:|---|
| region | A factor with levels:<br>- `C`, Central; <br>- `M`, Metropolitan Santiago area; <br>- `N`, North; <br>- `S`, South; <br>- `SA`, city of Santiago. |
| population | The population size of respondent's community. |
| sex | A factor with levels: <br>- `F`, female; <br>- `M`, male. |
| age | The respondent's age in years. |
| education | A factor with levels: <br>- `P`, Primary; <br>- `S`, Secondary; <br>- `PS`, Post-secondary. |
| income | The respondent's monthly income, in Pesos. |
| statusquo | A scale of support for the status-quo. |
| voteyes | A dummy variable equal to 1 if the respondent intends to vote in favor of Pinochet, and 0 otherwise. |

Let me pre-process the data a bit for you.

In [1]:
## Loading the packages (make sure you have those installed)
using DataFrames
using MLJ, MLJIteration
import MLJLinearModels, MLJBase
import MultivariateStats, MLJMultivariateStatsInterface
import CSV, Plots, GLM, StatsBase, Random
import LaTeXStrings, StatsPlots, Lowess, Gadfly, RegressionTables
import CovarianceMatrices, Econometrics, LinearAlgebra, MixedModelsExtras
import Missings, StatsAPI, FreqTables, EvalMetrics
import NearestNeighborModels

## Loading the data
chile = CSV.read(
    download("https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/chilesurvey.csv"), 
    DataFrame,
    missingstring = ["NA"]
); dropmissing!(chile)

## Process target variable
chile.voteyes = ifelse.(chile.vote .== "Y", 1, 0)

# Pre-process numeric variables (log them)
chile.income_log = log.(chile.income);
chile.pop_log = log.(chile.population);

select!(chile, Not(:vote, :income, :population))
first(chile, 3) |> pretty

┌─────────┬─────────┬───────┬───────────┬────────────┬─────────┬────────────┬────────────┐
│ region  │ sex     │ age   │ education │ statusquo  │ voteyes │ income_log │ pop_log    │
│ String3 │ String1 │ Int64 │ String3   │ Float64    │ Int64   │ Float64    │ Float64    │
│ Textual │ Textual │ Count │ Textual   │ Continuous │ Count   │ Continuous │ Continuous │
├─────────┼─────────┼───────┼───────────┼────────────┼─────────┼────────────┼────────────┤
│ N       │ M       │ 65    │ P         │ 1.0082     │ 1       │ 10.4631    │ 12.0725    │
│ N       │ M       │ 29    │ PS        │ -1.29617   │ 0       │ 8.92266    │ 12.0725    │
│ N       │ F       │ 38    │ P         │ 1.23072    │ 1       │ 9.61581    │ 12.0725    │
└─────────┴─────────┴───────┴───────────┴────────────┴─────────┴────────────┴────────────┘

In [2]:
# Adapted from @xiaodaigh: https://github.com/xiaodaigh/DataConvenience.jl
function onehot!(df::AbstractDataFrame, 
        col, cate = sort(unique(df[!, col])); 
        outnames = Symbol.(col, :_, cate))
    transform!(df, @. col => ByRow(isequal(cate)) .=> outnames)
end

# One-hot encoding (we will learn a better way to do it later)
onehot!(chile, :region);
onehot!(chile, :education);
onehot!(chile, :sex);

# Drop reference categories
select!(chile, Not(:region, :sex, :education, :region_C, :education_P, :sex_M))

# Checking
first(chile, 3)

| Row | age | statusquo | voteyes | income_log | pop_log | region_M | region_N | region_S | region_SA | education_PS | education_S | sex_F |
|---:|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Int64 | Float64 | Int64 | Float64 | Float64 | Bool | Bool | Bool | Bool | Bool | Bool | Bool |
| 1 | 65 | 1.0082 | 1 | 10.4631 | 12.0725 | false | true | false | false | false | false | false |
| 2 | 29 | -1.29617 | 0 | 8.92266 | 12.0725 | false | true | false | false | true | false | false |
| 3 | 38 | 1.23072 | 1 | 9.61581 | 12.0725 | false | true | false | false | false | false | true |


To make things easier, I have prepared the target `y` and the feature table `X` for you, using the full specification.

In [3]:
# Full Specification
y, X = unpack(
    chile,
    ==(:voteyes),
    c -> true;
    :voteyes      => Multiclass,
    :income_log   => Continuous,
    :pop_log      => Continuous,
    :age          => Continuous,
    :statusquo    => Continuous,
    :region_M     => Multiclass,
    :region_N     => Multiclass,
    :region_S     => Multiclass,
    :region_SA    => Multiclass,
    :sex_F        => Multiclass,
    :education_S  => Multiclass,
    :education_PS => Multiclass,
);

In [4]:
# Target
FreqTables.freqtable(y)

2-element Named Vector{Int64}
Dim1  │ 
──────┼─────
0     │ 1595
1     │  836

In [5]:
# Features
first(X, 3)

| Row | age | statusquo | income_log | pop_log | region_M | region_N | region_S | region_SA | education_PS | education_S | sex_F |
|---:|---|---|---|---|---|---|---|---|---|---|---|
|  | Float64 | Float64 | Float64 | Float64 | Cat… | Cat… | Cat… | Cat… | Cat… | Cat… | Cat… |
| 1 | 65.0 | 1.0082 | 10.4631 | 12.0725 | false | true | false | false | false | false | false |
| 2 | 29.0 | -1.29617 | 8.92266 | 12.0725 | false | true | false | false | true | false | false |
| 3 | 38.0 | 1.23072 | 9.61581 | 12.0725 | false | true | false | false | false | false | true |


## Question 01: Split sample (2 pts)

Split the sample, following the rules below:

1. (0.5 pts) Using the `partition` function, split the data into training and testing sets.
1. (0.5 pts) The training set must contain 75% of the data.
1. (0.5 pts) Stratify using the $y$ variable (voting for Pinochet). This step ensures that both sets have the same proportion of yes and no votes.
1. (0.5 pts) Save the objects with the names `X_train`, `X_test`, `y_train`, and `y_test`.

Use `12345` as the seed (not strictly necessary, but it helps keep your results close to mine).
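To see what stratification does, here is a minimal sketch in plain Julia on made-up labels. It is not the answer to the question: `stratified_split`, `share`, and `y_toy` are hypothetical names introduced only for illustration, and in the quiz you should let `partition` handle the split.

```julia
using Random

# Hypothetical helper: split row indices into train/test so that each class
# in `y` sends (approximately) a fraction `frac` of its rows to the training set.
function stratified_split(y::AbstractVector, frac::Real; rng = Random.default_rng())
    train = Int[]
    test = Int[]
    for class in unique(y)
        idx = shuffle(rng, findall(==(class), y))  # shuffled rows of this class
        ntrain = round(Int, frac * length(idx))
        append!(train, idx[1:ntrain])
        append!(test, idx[ntrain+1:end])
    end
    return sort(train), sort(test)
end

# Toy labels: 8 zeros and 4 ones (made-up data, not the Chile survey)
y_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
train_idx, test_idx = stratified_split(y_toy, 0.75; rng = MersenneTwister(12345))

# Share of ones in a vector of 0/1 labels
share(v) = count(==(1), v) / length(v)
println((share(y_toy[train_idx]), share(y_toy[test_idx])))
```

Because each class is split separately, the share of ones is the same in the training and testing parts, whatever the shuffle turns out to be.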

In [6]:
# Your answers here

## Question 02: 5-Nearest Neighbors (2 pts)

1. (0.5 pts) Instantiate a 5-nearest neighbors model.
1. (0.5 pts) Compute the cross-validated classification accuracy on the training set. Use 5-fold cross-validation (I have already built it for you).
1. (0.5 pts) Fit the model on the entire training set.
1. (0.5 pts) Evaluate its accuracy on the testing set.

**Hint:** The metrics you should use now are different from the ones we used for regression. Check this source: https://juliaai.github.io/StatisticalMeasures.jl/dev/auto_generated_list_of_measures/#aliases. Accuracy should be one of the first.
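As a reminder of what the model and the cross-validated accuracy actually compute, here is a rough sketch in plain Julia on simulated data. It is an illustration under simplifying assumptions (binary 0/1 labels, odd $K$, Euclidean distance), not the MLJ workflow the question asks for; `knn_predict`, `cv_accuracy`, and the toy data are hypothetical names that do not appear elsewhere in the quiz.

```julia
using Random, Statistics

# Squared Euclidean distance between two feature vectors
sqdist(a, b) = sum(abs2, a .- b)

# Predict the label of `x` by majority vote among its k nearest training rows.
# Assumes binary 0/1 labels and an odd k, so ties cannot occur.
function knn_predict(Xtrain, ytrain, x, k)
    d = [sqdist(Xtrain[i, :], x) for i in axes(Xtrain, 1)]
    nearest = partialsortperm(d, 1:k)  # indices of the k smallest distances
    return sum(ytrain[nearest]) > k / 2 ? 1 : 0
end

# Mean accuracy across `nfolds` held-out folds
function cv_accuracy(X, y, k; nfolds = 5, rng = Random.default_rng())
    rows = shuffle(rng, collect(axes(X, 1)))
    folds = [rows[f:nfolds:end] for f in 1:nfolds]
    accs = map(folds) do test
        train = setdiff(rows, test)
        preds = [knn_predict(X[train, :], y[train], X[i, :], k) for i in test]
        mean(preds .== y[test])
    end
    return mean(accs)
end

# Toy data: two well-separated clusters (simulated, not the Chile survey)
rng = MersenneTwister(54321)
X_toy = vcat(randn(rng, 20, 2) .- 5, randn(rng, 20, 2) .+ 5)
y_toy = vcat(zeros(Int, 20), ones(Int, 20))
acc = cv_accuracy(X_toy, y_toy, 5; rng = rng)
println(acc)
```

Each fold is held out once, the model votes using only the remaining rows, and the reported number is the average share of correct predictions across the five folds.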

In [8]:
# 5-Fold CV
cv5 = CV(
    nfolds = 5,
    rng = 54321
);

In [9]:
# Your answers here

## Question 03: ROC Curves (2 pts)

1. (1.0 pt) Compute the ROC curve on the training set.
1. (1.0 pt) Compute the ROC curve on the testing set.

What did you find?
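If the construction of a ROC curve is hazy, the sketch below builds one by hand in plain Julia from made-up scores. It only illustrates the definition (true-positive rate against false-positive rate as the threshold moves); `roc_points`, `auc`, and the toy vectors are hypothetical and are not a substitute for the plotting tools you are expected to use.

```julia
# Hypothetical helper: false- and true-positive rates at each threshold,
# given predicted probabilities of class 1 (`scores`) and 0/1 labels.
function roc_points(scores, labels)
    thresholds = sort(unique(scores); rev = true)
    npos = count(==(1), labels)
    nneg = length(labels) - npos
    fpr = [0.0]  # start at (0, 0): nothing is predicted positive
    tpr = [0.0]
    for t in thresholds
        predicted = scores .>= t
        push!(tpr, count(predicted .& (labels .== 1)) / npos)
        push!(fpr, count(predicted .& (labels .== 0)) / nneg)
    end
    return fpr, tpr
end

# Area under the curve by the trapezoidal rule
auc(fpr, tpr) = sum((fpr[i+1] - fpr[i]) * (tpr[i+1] + tpr[i]) / 2 for i in 1:length(fpr)-1)

# Made-up scores and labels (not model output)
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
fpr, tpr = roc_points(scores, labels)
println(auc(fpr, tpr))
```

A curve that hugs the top-left corner (area close to 1) separates the classes well; a curve near the diagonal (area close to 0.5) does no better than guessing.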

In [14]:
# Your answers here

## Question 04: 20-Nearest Neighbors (2 pts)

1. (0.5 pts) Fit a 20-nearest neighbors model to the data.
1. (0.5 pts) Compute the cross-validated classification accuracy.
1. (0.5 pts) Compute the ROC curves for the training and testing sets.
1. (0.5 pts) Compare your results with the results from the previous questions.

What did you learn from this exercise?

In [18]:
# Your answers here

## Question 05: Search for Best $K$ (2 pts)

In this exercise, we will search for the best $K$ for our $K$-Nearest Neighbors model.

1. (0.25 pts) Instantiate the model.
1. (0.25 pts) Instantiate the range from 1 to 101.
1. (0.5 pts) Create the self-tuning KNN using the function `TunedModel`. Use `accuracy` as the optimization measure.
1. (0.5 pts) Search for the $K$ between 1 and 101 that maximizes the cross-validated accuracy on the training set. Set the `Grid` resolution to 100.
1. (0.25 pts) Plot $K$ (x-axis) against the cross-validated accuracy (y-axis).
1. (0.25 pts) Use the best model to predict the testing set.

**Hint 1:** Check [Lecture 12](https://github.com/umbertomig/POLI175julia/blob/main/lectures/jupyternb/lecture12.ipynb)

**Hint 2:** I used some of the documentation [here](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/#Overview) to build this problem. It even includes a KNN tuning example.

**Hint 3:** Because of randomness, your best $K$ may be different from mine. This is fine. In all my tests, $K$ was between 5 and 25. For me, in most cases, it was 19.

In [26]:
# Your answers here

**Great work!**