# POLI 175 - Quiz 03

In this quiz, you will run classification models in Julia.

Due date: Feb 16, 2024

Again: The grading for the quiz is:

$$ 0.7 \times \text{TRY} + 0.3 \times \text{CORRECT} $$

The points below refer to the correctness part.
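
As a worked illustration of the formula, assume that both components are expressed as fractions between 0 and 1 (this scale is an assumption; it is not stated above). A hypothetical student who attempts every question (TRY = 1) and earns 6 of the 10 correctness points (CORRECT = 0.6) would then get:

$$ 0.7 \times 1 + 0.3 \times 0.6 = 0.88 $$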

## Running Dataset

### [Chile Survey](https://en.wikipedia.org/wiki/Chile)

In 1988, the [Chilean Dictator](https://en.wikipedia.org/wiki/Military_dictatorship_of_Chile) [Augusto Pinochet](https://en.wikipedia.org/wiki/Augusto_Pinochet) held a [referendum on whether he should step down](https://en.wikipedia.org/wiki/1988_Chilean_presidential_referendum).

The [FLACSO](https://en.wikipedia.org/wiki/Latin_American_Faculty_of_Social_Sciences) in Chile conducted a survey of 2,700 respondents. We are going to build a model to predict their voting intentions.

| **Variable** | **Meaning** |
|:---:|---|
| region | A factor with levels:<br>- `C`, Central; <br>- `M`, Metropolitan Santiago area; <br>- `N`, North; <br>- `S`, South; <br>- `SA`, city of Santiago. |
| population | The population size of respondent's community. |
| sex | A factor with levels: <br>- `F`, female; <br>- `M`, male. |
| age | The respondent's age in years. |
| education | A factor with levels: <br>- `P`, Primary; <br>- `S`, Secondary; <br>- `PS`, Post-secondary. |
| income | The respondent's monthly income, in Pesos. |
| statusquo | A scale of support for the status-quo. |
| voteyes | A dummy variable: <br>- `1`, a vote in favor of Pinochet; <br>- `0`, otherwise. |

Let me pre-process the data a bit for you.

In [1]:
## Loading the packages (make sure you have those installed)
using DataFrames
using MLJ, MLJIteration
import MLJLinearModels, MLJBase
import MultivariateStats, MLJMultivariateStatsInterface
import CSV, Plots, GLM, StatsBase, Random
import LaTeXStrings, StatsPlots, Lowess, Gadfly, RegressionTables
import CovarianceMatrices, Econometrics, LinearAlgebra, MixedModelsExtras
import Missings, StatsAPI, FreqTables, EvalMetrics
import NearestNeighborModels

## Loading the data
chile = CSV.read(
    download("https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/chilesurvey.csv"), 
    DataFrame,
    missingstring = ["NA"]
); dropmissing!(chile)

## Process target variable
chile.voteyes = ifelse.(chile.vote .== "Y", 1, 0)

# Pre-process numeric variables (log them)
chile.income_log = log.(chile.income);
chile.pop_log = log.(chile.population);

select!(chile, Not(:vote, :income, :population))
first(chile, 3) |> pretty

┌─────────┬─────────┬───────┬───────────┬────────────┬─────────┬────────────┬────────────┐
│ region  │ sex     │ age   │ education │ statusquo  │ voteyes │ income_log │ pop_log    │
│ String3 │ String1 │ Int64 │ String3   │ Float64    │ Int64   │ Float64    │ Float64    │
│ Textual │ Textual │ Count │ Textual   │ Continuous │ Count   │ Continuous │ Continuous │
├─────────┼─────────┼───────┼───────────┼────────────┼─────────┼────────────┼────────────┤
│ N       │ M       │ 65    │ P         │ 1.0082     │ 1       │ 10.4631    │ 12.0725    │
│ N       │ M       │ 29    │ PS        │ -1.29617   │ 0       │ 8.92266    │ 12.0725    │
│ N       │ F       │ 38    │ P         │ 1.23072    │ 1       │ 9.61581    │ 12.0725    │
└─────────┴─────────┴───────┴───────────┴────────────┴─────────┴────────────┴────────────┘

In [2]:
# Adapted from @xiaodaigh: https://github.com/xiaodaigh/DataConvenience.jl
function onehot!(df::AbstractDataFrame, 
        col, cate = sort(unique(df[!, col])); 
        outnames = Symbol.(col, :_, cate))
    transform!(df, @. col => ByRow(isequal(cate)) .=> outnames)
end

# One-hot encoding (we will learn a better way to do it later)
onehot!(chile, :region);
onehot!(chile, :education);
onehot!(chile, :sex);

# Drop reference categories
select!(chile, Not(:region, :sex, :education, :region_C, :education_P, :sex_M))

# Checking
first(chile, 3)

| Row | age | statusquo | voteyes | income_log | pop_log | region_M | region_N | region_S | region_SA | education_PS | education_S | sex_F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Int64 | Float64 | Int64 | Float64 | Float64 | Bool | Bool | Bool | Bool | Bool | Bool | Bool |
| 1 | 65 | 1.0082 | 1 | 10.4631 | 12.0725 | false | true | false | false | false | false | false |
| 2 | 29 | -1.29617 | 0 | 8.92266 | 12.0725 | false | true | false | false | true | false | false |
| 3 | 38 | 1.23072 | 1 | 9.61581 | 12.0725 | false | true | false | false | false | false | true |


To make things easier, I will create three feature sets for you: one for Question 1, one for Question 2, and one for Questions 3 to 5.

In [3]:
# Full Specification
y, X_full = unpack(
    chile,
    ==(:voteyes),
    c -> true;
    :voteyes      => Multiclass,
    :income_log   => Continuous,
    :pop_log      => Continuous,
    :age          => Continuous,
    :statusquo    => Continuous,
    :region_M     => Multiclass,
    :region_N     => Multiclass,
    :region_S     => Multiclass,
    :region_SA    => Multiclass,
    :sex_F        => Multiclass,
    :education_S  => Multiclass,
    :education_PS => Multiclass,
);

# Q1: statusquo only
X_q1 = select(X_full, :statusquo);

# Q2: continuous only
X_q2 = select(X_full, :income_log, :pop_log, :age, :statusquo);

In [4]:
# Target
FreqTables.freqtable(y)

2-element Named Vector{Int64}
Dim1  │ 
──────┼─────
0     │ 1595
1     │  836

In [5]:
# Question 1 feature
first(X_q1, 3)

| Row | statusquo |
|---|---|
|  | Float64 |
| 1 | 1.0082 |
| 2 | -1.29617 |
| 3 | 1.23072 |


In [6]:
# Question 2 features
first(X_q2, 3)

| Row | income_log | pop_log | age | statusquo |
|---|---|---|---|---|
|  | Float64 | Float64 | Float64 | Float64 |
| 1 | 10.4631 | 12.0725 | 65.0 | 1.0082 |
| 2 | 8.92266 | 12.0725 | 29.0 | -1.29617 |
| 3 | 9.61581 | 12.0725 | 38.0 | 1.23072 |


In [7]:
# Questions 3 to 5 features
first(X_full, 3)

| Row | age | statusquo | income_log | pop_log | region_M | region_N | region_S | region_SA | education_PS | education_S | sex_F |
|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Float64 | Float64 | Float64 | Float64 | Cat… | Cat… | Cat… | Cat… | Cat… | Cat… | Cat… |
| 1 | 65.0 | 1.0082 | 10.4631 | 12.0725 | false | true | false | false | false | false | false |
| 2 | 29.0 | -1.29617 | 8.92266 | 12.0725 | false | true | false | false | true | false | false |
| 3 | 38.0 | 1.23072 | 9.61581 | 12.0725 | false | true | false | false | false | false | true |


## Question 01: Logistic Regression with `statusquo` only (2pts)

Run a logistic regression with only the `statusquo` variable. Evaluate the predictive power of your regression.

**Hint 1:**

Remember the steps:

1. Instantiate the model
1. Build and fit the model
1. Make predictions and analyze them
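
The three steps above can be sketched on toy data as follows. This is a minimal illustration, not the quiz solution: the data and the `_toy` names are hypothetical, and it assumes the logistic classifier from `MLJLinearModels` with its default hyperparameters.

```julia
using MLJ
import MLJLinearModels

# Hypothetical toy data: one continuous feature and a binary target
X_toy = (x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0],)
y_toy = coerce([0, 0, 0, 1, 0, 1, 1, 1], Multiclass)

# 1. Instantiate the model (default hyperparameters, an assumption)
model_toy = MLJLinearModels.LogisticClassifier()

# 2. Build the machine (model + data) and fit it
mach_toy = machine(model_toy, X_toy, y_toy)
fit!(mach_toy, verbosity = 0)

# 3. Make predictions: probabilistic first, then hard class labels
prob_toy = predict(mach_toy, X_toy)       # one distribution over {0, 1} per row
yhat_toy = predict_mode(mach_toy, X_toy)  # most likely class per row

# Share of correct in-sample predictions
acc_toy = accuracy(yhat_toy, y_toy)
```

For the quiz, the same pattern would apply with the feature sets and target defined earlier in place of the toy objects.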

For predictions and evaluation, I suggest you check (see Lectures 08 and 09):

- Confusion Matrix
- Accuracy
- F1-Score
- ROC Curve
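
To make the first three metrics concrete, here is a dependency-free sketch that computes them by hand. The label vectors are hypothetical, and in practice you would likely use the measures that ship with MLJ rather than these manual formulas.

```julia
# Hypothetical observed and predicted labels (1 = positive class)
y_obs  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Cells of the 2x2 confusion matrix
tp = sum((y_pred .== 1) .& (y_obs .== 1))  # true positives
fp = sum((y_pred .== 1) .& (y_obs .== 0))  # false positives
fn = sum((y_pred .== 0) .& (y_obs .== 1))  # false negatives
tn = sum((y_pred .== 0) .& (y_obs .== 0))  # true negatives

# Accuracy: share of all predictions that are correct
acc = (tp + tn) / length(y_obs)

# F1-score: harmonic mean of precision and recall
prec = tp / (tp + fp)
rec  = tp / (tp + fn)
f1   = 2 * prec * rec / (prec + rec)
```

Accuracy treats both kinds of error equally, while the F1-score focuses on the positive class, which matters when the classes are unbalanced, as they are in the target above.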

**Hint 2:**

Save your machine and your predictions with `_q1` in their object names. This will come in handy later.

**Hint 3:**

Use `X_q1` as your feature dataset.

In [8]:
# Your answers here

## Question 02: Logistic Regression with Continuous Variables (2pts)

Run a logistic regression with all the continuous variables. Evaluate the predictive power of your regression. Any improvements compared with Question 01?

**Hint 1:**

Remember the steps:

1. Instantiate the model
1. Build and fit the model
1. Make predictions and analyze them

For predictions and evaluation, I suggest you check (see Lectures 08 and 09):

- Confusion Matrix
- Accuracy
- F1-Score
- ROC Curve

**Hint 2:**

Save your machine and your predictions with `_q2` in their object names. This will come in handy later.

**Hint 3:**

Use `X_q2` as your feature dataset.

In [15]:
# Your answers here

## Question 03: Logistic Regression with Full Model Specification (2pts)

Run a logistic regression with all variables. Evaluate the predictive power of your regression. Any improvements compared with Question 01?

**Hint 1:**

Remember the steps:

1. Instantiate the model
1. Build and fit the model
1. Make predictions and analyze them

For predictions and evaluation, I suggest you check (see Lectures 08 and 09):

- Confusion Matrix
- Accuracy
- F1-Score
- ROC Curve

**Hint 2:**

Save your machine and your predictions with `_q3` in their object names. This will come in handy later.

**Hint 3:**

Use `X_full` as your feature dataset.

In [22]:
# Your answers here

## Question 04: Linear Discriminant Analysis and 5-Nearest Neighbors Classifier (2pts)

Run an LDA and a 5-NN classifier with all variables. Evaluate the predictive power of your classifiers. Any improvements compared with Question 01?

**Hint 1:**

Remember the steps:

1. Instantiate the model
1. Build and fit the model
1. Make predictions and analyze them

For predictions and evaluation, I suggest you check (see Lectures 08 and 09):

- Confusion Matrix
- Accuracy
- F1-Score
- ROC Curve

**Hint 2:**

Save your machines and your predictions with `_q4_lda` (LDA) and `_q4_5nn` (5-NN) in their object names. This will come in handy later.

**Hint 3:**

Use `X_full` as your feature dataset.

In [29]:
# Your answers here

## Question 05: Model Selection (2pts)

Which model is best? Plot the ROC curves for all the models and use them to pick the best one.
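
One way to compare models is by the area under their ROC curves (AUC). The sketch below builds the ROC points by hand and integrates them with the trapezoid rule. The helper names `roc_points` and `auc_trapezoid`, the labels, and the two score vectors are all hypothetical, standing in for the predicted probabilities of two fitted models.

```julia
# False- and true-positive rates obtained by sweeping the threshold
# over every distinct score, from highest to lowest.
function roc_points(scores, labels)
    thresholds = sort(unique(scores), rev = true)
    n_pos = count(==(1), labels)
    n_neg = count(==(0), labels)
    fpr = [0.0]  # the curve starts at (0, 0): nothing predicted positive
    tpr = [0.0]
    for t in thresholds
        pred = scores .>= t
        push!(tpr, count(pred .& (labels .== 1)) / n_pos)
        push!(fpr, count(pred .& (labels .== 0)) / n_neg)
    end
    return fpr, tpr
end

# Area under the ROC curve by the trapezoid rule
function auc_trapezoid(fpr, tpr)
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
               for i in 1:length(fpr) - 1)
end

# Hypothetical labels and scores from two competing models
labels   = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
scores_b = [0.6, 0.4, 0.7, 0.5, 0.8, 0.3]

auc_a = auc_trapezoid(roc_points(scores_a, labels)...)
auc_b = auc_trapezoid(roc_points(scores_b, labels)...)
```

Here `auc_a` is larger than `auc_b`, so the first set of scores ranks positives above negatives more reliably. The `(fpr, tpr)` pairs can also be passed to a plotting function to draw the curves on the same axes.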

In [41]:
# Your answers here

**Great work!**