# POLI 175 - Lecture 10

## Resampling

Resampling methods involve repeatedly drawing `samples` from a `training dataset` and refitting the model on each one to obtain additional information about the fit.

`Samples`: Randomly selected subsets of the original data. Do not mistake them for new samples drawn from the population.
    
`Training`: Training the model means fitting the model to find its parameters. In a regression, this means estimating the $\beta$'s.

## Resampling

### Why not fit the model to the actual data?

We need a measure of how well a model is doing.

In the end, this matters! And it matters especially for the data that we did not use to train the model!

Resampling is a clever trick to see how the model would do in the `real world`, without actually deploying it in the real world.

## Resampling

Helps us to:

1. Evaluate the performance of the model (`Model assessment`).
2. Select the proper flexibility for our model (`Model selection`).

**Drawback:** Resampling methods are computationally intensive. Resampling involves refitting the model again and again.
    
We are going to discuss the following:

- `Cross-validation`: Measures predictive performance and selects the appropriate flexibility (bias-variance trade-off).
- `Bootstrap`: Measures the accuracy (uncertainty) of parameter estimates.
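
To make the bootstrap idea concrete, here is a minimal sketch that uses only Julia's standard library. The helper `bootstrap_se` and the toy data are hypothetical and are not part of the course code; the sketch assumes we want the standard error of a sample mean.

```julia
using Random, Statistics

# Hypothetical helper: bootstrap standard error of a statistic `stat`.
function bootstrap_se(x::AbstractVector, stat::Function;
        B::Int = 1000, rng = MersenneTwister(123))
    n = length(x)
    # Each replicate draws n observations *with replacement* from x.
    reps = [stat(x[rand(rng, 1:n, n)]) for _ in 1:B]
    return std(reps)
end

x = randn(MersenneTwister(1), 200)      # toy data, not a course dataset
se_boot = bootstrap_se(x, mean)         # bootstrap estimate
se_formula = std(x) / sqrt(length(x))   # textbook standard error of the mean
```

For the mean, the two numbers should be close. The bootstrap becomes useful when no simple formula like `se_formula` is available.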

## Class Examples

1. Education expenditure dataset

1. Pinochet voting dataset

Let us load them all:

In [1]:
## Packages Here
using DataFrames
using MLJ, MLJIteration
import MLJLinearModels, MLJBase
import MultivariateStats, MLJMultivariateStatsInterface
import CSV, Plots, GLM, StatsBase, Random
import LaTeXStrings, StatsPlots, Lowess, Gadfly, RegressionTables
import CovarianceMatrices, Econometrics, LinearAlgebra, MixedModelsExtras
import Missings, StatsAPI, FreqTables, EvalMetrics
import NearestNeighborModels

# Adapted from @xiaodaigh: https://github.com/xiaodaigh/DataConvenience.jl
function onehot!(df::AbstractDataFrame, 
        col, cate = sort(unique(df[!, col])); 
        outnames = Symbol.(col, :_, cate))
    transform!(df, @. col => ByRow(isequal(cate)) .=> outnames)
end

onehot! (generic function with 2 methods)
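
As a rough illustration of what one-hot encoding produces, the sketch below builds indicator vectors by hand for a made-up `region` variable. It does not use the `onehot!` helper or DataFrames; all names and values here are assumptions for illustration only.

```julia
# Toy categorical variable (made up for illustration).
region = ["N", "S", "C", "N", "SA"]

# One indicator vector per category, in sorted order.
cats = sort(unique(region))
dummies = Dict(c => (region .== c) for c in cats)

# Dropping one category (here "C") as the reference avoids perfect
# collinearity with the intercept in a regression.
delete!(dummies, "C")
```

Each remaining indicator is then read relative to the dropped reference category.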

## Class Examples

In [2]:
## Loading the data
chile = CSV.read(
    download("https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/chilesurvey.csv"), 
    DataFrame,
    missingstring = ["NA"]
); dropmissing!(chile)
chile.voteyes = ifelse.(chile.vote .== "Y", 1, 0)

# One-hot encoding (we will learn a better way to do it later)
onehot!(chile, :region);
onehot!(chile, :education);
onehot!(chile, :sex);

# Drop reference categories
select!(chile, Not(:region, :sex, :education, :region_C, :education_P, :sex_M))

# Checking
first(chile, 3)

| Row | population | age | income | statusquo | vote | voteyes | region_M | region_N | region_S | region_SA | education_PS | education_S | sex_F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Int64 | Int64 | Int64 | Float64 | String1 | Int64 | Bool | Bool | Bool | Bool | Bool | Bool | Bool |
| 1 | 175000 | 65 | 35000 | 1.0082 | Y | 1 | false | true | false | false | false | false | false |
| 2 | 175000 | 29 | 7500 | -1.29617 | N | 0 | false | true | false | false | true | false | false |
| 3 | 175000 | 38 | 15000 | 1.23072 | Y | 1 | false | true | false | false | false | false | true |


## Class Examples

In [65]:
## Education Expenditure Dataset
educ = CSV.read(download("https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/educexp.csv"), DataFrame)

# Processing
educ.educ_log = log.(educ.education);
educ.income_log = log.(educ.income)
educ.urban_log = log.(educ.urban)
educ.young_log = log.(educ.young)

# Drop the original (unlogged) variables
select!(educ, Not(:education, :income, :urban, :young))

# Little unpacking
educ_y, educ_X = unpack(
    educ[:, Not(:states)],
    ==(:educ_log);
    :educ_log => Continuous, 
    :income_log => Continuous,
    :urban_log => Continuous,
    :young_log => Continuous
);    

# Checking
first(educ, 3)

| Row | states | educ_log | income_log | urban_log | young_log |
|---|---|---|---|---|---|
|  | String3 | Float64 | Float64 | Float64 | Float64 |
| 1 | ME | 5.24175 | 7.94591 | 6.23048 | 5.85993 |
| 2 | NH | 5.1299 | 8.08918 | 6.33505 | 5.84615 |
| 3 | VT | 5.43808 | 8.03008 | 5.77455 | 5.85364 |


## Cross-Validation

In Class 02, we discussed two ideas derived from splitting the data into two portions:

1. `Training error rate`: The error the model makes on the same data that was used to estimate its parameters, and
1. `Testing error rate`: The error the model makes when predicting ***unseen*** data.

Intuitively, we can perform multiple types of cross-validation, depending on how we split our data.

## Cross-Validation

### Validation Set Approach

Randomly split the data into two sets:

- `Training set`: The data used to fit the model
- `Testing set`: The data used to test the performance of the fitted model.

One example is to split the sample half for training and half for testing, and then run the estimation:

![img vsa](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/cv1.png?raw=true)
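
The validation set approach can be sketched without any packages beyond Julia's standard library. The data below are simulated for illustration (they are not the education dataset), and the model is plain OLS solved with the backslash operator rather than an MLJ machine.

```julia
using Random, Statistics

# Toy data (made up): y depends linearly on x plus a little noise.
rng = MersenneTwister(123)
n = 100
x = randn(rng, n)
y = 1.0 .+ 2.0 .* x .+ 0.1 .* randn(rng, n)

# Randomly split the row indices in half.
idx = shuffle(rng, 1:n)
train_idx, test_idx = idx[1:n ÷ 2], idx[n ÷ 2 + 1:end]

# Fit OLS on the training half only (backslash solves least squares).
Xmat = hcat(ones(n), x)
beta = Xmat[train_idx, :] \ y[train_idx]

# Evaluate on the half that the model has never seen.
test_mse = mean((Xmat[test_idx, :] * beta .- y[test_idx]) .^ 2)
```

The key design choice is that `beta` never touches the rows in `test_idx`, so `test_mse` estimates the error on unseen data.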

## Cross-Validation

### Validation Set Approach

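In Julia (MLJ), we can use the `partition` function:

```julia
# One share gives (train, test); adding a second, optional share gives
# (train, validation, test). Passing rng shuffles the rows before splitting.
train, validation, test = partition(data_frame, 
    share_train, 
    share_validation,
    rng = random_seed)
```

We are going to do it the "boring way" first (i.e., like in Python's scikit-learn):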

In [13]:
educ_train, educ_test = partition(
    educ, 
    0.5, 
    rng = 123
);

## Cross-Validation

### Validation Set Approach

Now we `unpack` (*create the target and the features*):

In [26]:
educ_train_y, educ_train_X = unpack(
    educ_train[:, Not(:states)],
    ==(:educ_log);
    :educ_log => Continuous, 
    :income_log => Continuous,
    :urban_log => Continuous,
    :young_log => Continuous
);

## Cross-Validation

### Validation Set Approach

And we `unpack` the testing set in the same way:

In [27]:
educ_test_y, educ_test_X = unpack(
    educ_test[:, Not(:states)],
    ==(:educ_log);
    :educ_log => Continuous, 
    :income_log => Continuous,
    :urban_log => Continuous,
    :young_log => Continuous
);

## Cross-Validation

### Validation Set Approach

Now, let us check how the following model performs:

$$ \log (education_i) = \alpha + \beta_1 \log (income_i) + \beta_2 \log (young_i) + \varepsilon_i $$ 

In [42]:
## With 50% split (no urban_log)
reg = MLJLinearModels.LinearRegressor()
machine1 = machine(reg, 
    educ_train_X[:,Not(:urban_log)], 
    educ_train_y, 
    scitype_check_level = 0);
MLJ.fit!(machine1);
ypred1 = MLJ.predict(machine1, educ_test_X[:,Not(:urban_log)]);

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mTraining machine(LinearRegressor(fit_intercept = true, …), …).
[36m[1m┌ [22m[39m[36m[1mInfo: [22m[39mSolver: MLJLinearModels.Analytical
[36m[1m│ [22m[39m  iterative: Bool false
[36m[1m└ [22m[39m  max_inner: Int64 200


In [66]:
# Sum of squared test errors (what do we want from it?)
# Divide by length(educ_test_y) to get the mean squared error
sum((ypred1 .- educ_test_y) .^ 2)

0.3753283828349154

## Cross-Validation

### Validation Set Approach

Now, let us check how the full model performs:

$$ \log (education_i) = \alpha + \beta_1 \log (income_i) + \beta_2 \log (young_i) + \beta_3 \log (urban_i) + \varepsilon_i $$ 

In [67]:
## 50% split CV (full model)
machine2 = machine(reg, 
    educ_train_X, 
    educ_train_y, 
    scitype_check_level = 0);
MLJ.fit!(machine2);
ypred2 = MLJ.predict(machine2, educ_test_X);

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mTraining machine(LinearRegressor(fit_intercept = true, …), …).
[36m[1m┌ [22m[39m[36m[1mInfo: [22m[39mSolver: MLJLinearModels.Analytical
[36m[1m│ [22m[39m  iterative: Bool false
[36m[1m└ [22m[39m  max_inner: Int64 200


In [45]:
# Sum of squared test errors (what do we want from it?)
# Divide by length(educ_test_y) to get the mean squared error
sum((ypred2 .- educ_test_y) .^ 2)

0.30179359775968656

## Cross-Validation

### Validation Set Approach

What happened here?

To make things clear, it is your turn to try:

In [6]:
## Check in: Check the MSE when removing income_log. Is it better?

## Cross-Validation

### Leave-One-Out Cross-Validation

- It does what it says: it leaves one observation out and fits the model on the remaining $n-1$ cases.

- Then, it predicts the outcome for the case left out, and repeats this for each of the $n$ observations.

- **Great** for small datasets and when prediction is critical.

- **Bad** in terms of computational time: the model is fit $n$ times.

$$ CV_n \ = \ \dfrac{1}{n}\sum_{i=1}^{n} MSE_i, \qquad MSE_i = (y_i - \widehat{y}_i)^2 $$
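
The LOOCV recipe can be written out by hand as follows. This is a sketch under stated assumptions: `loocv_mse` is a hypothetical helper, the model is OLS solved with the backslash operator, and the data are made up. It is not how MLJ implements resampling.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: leave-one-out CV for OLS with design matrix X.
function loocv_mse(X::AbstractMatrix, y::AbstractVector)
    n = length(y)
    sq_errors = map(1:n) do i
        keep = setdiff(1:n, i)           # all rows except observation i
        beta = X[keep, :] \ y[keep]      # fit on the n - 1 remaining cases
        (y[i] - dot(X[i, :], beta))^2    # squared error on the held-out case
    end
    return mean(sq_errors)               # average of the n held-out errors
end

# Toy data (made up): an exact line, so held-out predictions are perfect.
x = collect(1.0:10.0)
X = hcat(ones(10), x)
y = 3.0 .+ 0.5 .* x
cv_exact = loocv_mse(X, y)               # numerically zero

# Adding alternating noise makes the held-out errors strictly positive.
y_noisy = y .+ 0.1 .* (-1.0) .^ (1:10)
cv_noisy = loocv_mse(X, y_noisy)
```

Note that the model is refit inside the loop, once per observation, which is where the computational cost comes from.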

## Cross-Validation

### Leave-One-Out Cross-Validation

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/cv2.png?raw=true)

## Cross-Validation

### Leave-One-Out Cross-Validation

One cool thing about Julia (MLJ) is that we can work with row indexes, instead of creating new training and testing sets.

This saves considerable memory.

For example, to do LOOCV, we can do:

```julia
loo = CV(
    nfolds = size(data_set, 1), # One fold per observation
    shuffle = true,
    rng = 123
)
```

In [59]:
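## LOOCV
## Note: with size(educ)[1]-1 folds, one fold holds two observations;
## use nfolds = size(educ)[1] for strict leave-one-out.
loo = CV(
    nfolds = size(educ)[1]-1,
    shuffle = true,
    rng = 123
);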
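
A quick way to see why the number of folds matters: if $n$ observations are split into $n-1$ folds, at least one fold must contain two observations. The sketch below uses a hypothetical helper, `fold_sizes`, that assumes folds are made as equal as possible. It is not MLJ's internal code.

```julia
# Hypothetical helper (not MLJ's internal code): sizes of k folds that are
# as equal as possible when splitting n observations.
fold_sizes(n::Int, k::Int) = [n ÷ k + (i <= n % k ? 1 : 0) for i in 1:k]

n = 10                            # toy sample size
sizes_loo = fold_sizes(n, n)      # ten folds of one observation each
sizes_near = fold_sizes(n, n - 1) # nine folds, one of them with two observations
```

Only the first setting leaves exactly one observation out at every step.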

## Cross-Validation

### Leave-One-Out Cross-Validation

The `evaluate` function provides the results right away, without the need for a machine (but you can use `evaluate!` with a machine if you want to...):

In [74]:
# No log income:
evaluate(reg, 
    educ_X[:, Not(:income_log)], 
    educ_y, 
    resampling = loo,
    measure = [rms, l2], 
    verbosity = 0)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┬─────────┬──────────────────
│ measure                │ operation │ measurement │ 1.96*SE │ per_fold        ⋯
├────────────────────────┼───────────┼─────────────┼─────────┼──────────────────
│ RootMeanSquaredError() │ predict   │ 0.23        │ 0.042   │ [0.049, 0.0921, ⋯
│ LPLoss(                │ predict   │ 0.0527      │ 0.0252  │ [0.0024, 0.0084 ⋯
│   p = 2)               │           │             │         │                 ⋯
└────────────────────────┴───────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted


## Cross-Validation

### Leave-One-Out Cross-Validation

In [75]:
# Full model
evaluate(reg, 
    educ_X, 
    educ_y, 
    resampling = loo,
    measure = [rms, l2], 
    verbosity = 0)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┬─────────┬──────────────────
│ measure                │ operation │ measurement │ 1.96*SE │ per_fold        ⋯
├────────────────────────┼───────────┼─────────────┼─────────┼──────────────────
│ RootMeanSquaredError() │ predict   │ 0.135       │ 0.0233  │ [0.217, 0.266,  ⋯
│ LPLoss(                │ predict   │ 0.0183      │ 0.00597 │ [0.047, 0.0706, ⋯
│   p = 2)               │           │             │         │                 ⋯
└────────────────────────┴───────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted


## Cross-Validation

### Leave-One-Out Cross-Validation

In [10]:
## Your turn: compare against the model whose features (X) are not in logs
## Note: the target has to be the same!

## Cross-Validation

### Metrics

To do the comparison, you need a metric.

Let's check the metrics we have:

In [78]:
## Lots of stats to compute the error:
measures()

LittleDict{Any, Any, Vector{Any}, Vector{Any}} with 55 entries:
  LPLoss                    => (aliases = ("l1", "l2", "mae", "mav", "mean_abso…
  MultitargetLPLoss         => (aliases = ("multitarget_l1", "multitarget_l2", …
  LPSumLoss                 => (aliases = ("l1_sum", "l2_sum"), consumes_multip…
  MultitargetLPSumLoss      => (aliases = ("multitarget_l1_sum", "multitarget_l…
  RootMeanSquaredError      => (aliases = ("rms", "rmse", "root_mean_squared_er…
  MultitargetRootMeanSquar… => (aliases = ("multitarget_rms", "multitarget_rmse…
  RootMeanSquaredLogError   => (aliases = ("rmsl", "rmsle", "root_mean_squared_…
  MultitargetRootMeanSquar… => (aliases = ("multitarget_rmsl", "multitarget_rms…
  RootMeanSquaredLogPropor… => (aliases = ("rmslp1",), consumes_multiple_observ…
  MultitargetRootMeanSquar… => (aliases = ("multitarget_rmslp1",), consumes_mul…
  RootMeanSquaredProportio… => (aliases = ("rmsp",), consumes_multiple_observat…
  MultitargetRootMeanSquar… => (aliases = ("m

## Cross-Validation

### Metrics

And the LP loss with $p=2$ is the mean squared error:

In [84]:
measures("LPLoss")

LittleDict{Any, Any, Vector{Any}, Vector{Any}} with 3 entries:
  LPLoss            => (aliases = ("l1", "l2", "mae", "mav", "mean_absolute_err…
  MultitargetLPLoss => (aliases = ("multitarget_l1", "multitarget_l2", "multita…
  LPSumLoss         => (aliases = ("l1_sum", "l2_sum"), consumes_multiple_obser…
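
To see what the $p$ in the LP loss does, here is a small sketch that mirrors the definition by hand. The helper `lp_loss` and the numbers are made up for illustration. It is not MLJ's implementation of `LPLoss`.

```julia
using Statistics

# Hypothetical helper that follows the *definition* of an L^p loss.
lp_loss(yhat, y; p = 2) = mean(abs.(yhat .- y) .^ p)

y    = [1.0, 2.0, 3.0, 4.0]      # made-up outcomes
yhat = [1.5, 2.0, 2.0, 4.5]      # made-up predictions

mae  = lp_loss(yhat, y; p = 1)   # mean absolute error: 0.5
mse  = lp_loss(yhat, y; p = 2)   # mean squared error: 0.375
rmse = sqrt(mse)                 # root mean squared error
```

The same relationship shows up in the evaluation tables in these notes: up to rounding, the reported `RootMeanSquaredError` is the square root of the `LPLoss(p = 2)` value (for example, $0.134^2 \approx 0.0180$ against a reported 0.0179).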

## Cross-Validation

### Metrics

In [12]:
## Your turn: Find and use R-squared as the measure for a
## LOOCV. What is the difference?

## Cross-Validation

### K-Fold Cross-Validation

- Splits the data into $k$ groups (folds). It leaves one group out at a time and fits the model with the observations outside that group.

- Then, it predicts the results for the cases left out.

- **Great** in most cases.

- **Bad**: *sometimes* computationally expensive.

$$ CV_k \ = \ \dfrac{1}{k}\sum_{i=1}^{k} MSE_i $$
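
The k-fold procedure can also be sketched by hand. As before, this is only an illustration under assumptions: `kfold_mse` is a hypothetical helper, folds are assigned round-robin after shuffling, the model is OLS, and the data are simulated.

```julia
using Random, Statistics

# Hypothetical helper: k-fold CV estimate of the MSE for OLS.
function kfold_mse(X::AbstractMatrix, y::AbstractVector, k::Int;
        rng = MersenneTwister(123))
    n = length(y)
    idx = shuffle(rng, 1:n)
    # Assign the shuffled rows to folds 1, 2, ..., k in round-robin fashion.
    folds = [idx[f:k:n] for f in 1:k]
    fold_mse = map(folds) do test
        train = setdiff(idx, test)            # every row outside this fold
        beta = X[train, :] \ y[train]         # fit on the other k - 1 folds
        mean((X[test, :] * beta .- y[test]) .^ 2)
    end
    return mean(fold_mse)                     # average of the k fold MSEs
end

# Toy data (made up): linear signal with noise variance 0.01.
rng = MersenneTwister(1)
x = randn(rng, 60)
X = hcat(ones(60), x)
y = 1.0 .+ 2.0 .* x .+ 0.1 .* randn(rng, 60)

cv5_mse = kfold_mse(X, y, 5)    # 5 model fits
loo_mse = kfold_mse(X, y, 60)   # k = n reduces to LOOCV: 60 model fits
```

With $k = 5$ the model is fit five times instead of sixty, which is the computational advantage over LOOCV.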

## Cross-Validation

### K-Fold Cross-Validation

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/cv3.png?raw=true)

## Cross-Validation

### K-Fold Cross-Validation

In [86]:
## 5-Fold CV
cv5 = CV(
    nfolds = 5,
    rng = 123
);

## Cross-Validation

### K-Fold Cross-Validation

In [88]:
# No log urban:
evaluate(reg, educ_X[:, Not(:urban_log)], educ_y, resampling = cv5, measure = [rms, l2], verbosity = 0)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┬─────────┬──────────────────
│ measure                │ operation │ measurement │ 1.96*SE │ per_fold        ⋯
├────────────────────────┼───────────┼─────────────┼─────────┼──────────────────
│ RootMeanSquaredError() │ predict   │ 0.144       │ 0.0271  │ [0.167, 0.151,  ⋯
│ LPLoss(                │ predict   │ 0.0208      │ 0.00728 │ [0.0277, 0.0227 ⋯
│   p = 2)               │           │             │         │                 ⋯
└────────────────────────┴───────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted


In [89]:
# Full Model
evaluate(reg, educ_X, educ_y, resampling = cv5, measure = [rms, l2], verbosity = 0)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┬─────────┬──────────────────
│ measure                │ operation │ measurement │ 1.96*SE │ per_fold        ⋯
├────────────────────────┼───────────┼─────────────┼─────────┼──────────────────
│ RootMeanSquaredError() │ predict   │ 0.134       │ 0.00863 │ [0.126, 0.13, 0 ⋯
│ LPLoss(                │ predict   │ 0.0179      │ 0.00233 │ [0.0158, 0.017, ⋯
│   p = 2)               │           │             │         │                 ⋯
└────────────────────────┴───────────┴─────────────┴─────────┴──────────────────
                                                                1 column omitted


## Cross-Validation

### K-Fold Cross-Validation

In [15]:
## Your turn: Run a 10-fold CV. Any differences?

## Cross-Validation

### Bias-Variance Trade-Off

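**K-Fold CV** is more computationally efficient than LOOCV. But how about the bias-variance trade-off?

In a two-way split, holding out a larger fraction of the data leaves fewer observations for training. This leads to high bias: the validation error over-estimates the test error of a model fit on the full dataset.

**LOOCV**: Trains on $n-1$ observations, so it gives an approximately unbiased estimate of the test error rate. Very good for bias reduction!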

## Cross-Validation

### Bias-Variance Trade-Off

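**LOOCV**: High variance. The training sets are almost identical at each run, so the fitted models are highly correlated, and an average of highly correlated quantities has high variance.
- Very bad for variance.
    
**K-Fold CV**:
- The training sets overlap less, so they differ *a bit more* from one another.
- This leads to less correlation between the fitted models across folds.
- A good balance is usually achieved with $k=5$ or $k=10$.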
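
One way to quantify "almost the same observations at each run" is the share of training rows that two different training sets have in common. The sketch below is simple arithmetic under the assumption that $k$ divides $n$ evenly. The helper `training_overlap` is hypothetical.

```julia
# Share of training observations that the training sets of two different
# folds have in common, assuming n observations in k equal-sized folds.
function training_overlap(n::Int, k::Int)
    fold = n ÷ k              # observations per fold (assumes k divides n)
    train = n - fold          # size of each training set
    shared = n - 2 * fold     # rows that are in neither held-out fold
    return shared / train
end

n = 100
overlap_5   = training_overlap(n, 5)    # 0.75
overlap_10  = training_overlap(n, 10)   # ≈ 0.889
overlap_loo = training_overlap(n, n)    # ≈ 0.990 (LOOCV)
```

The closer the overlap is to one, the more correlated the fitted models are, which is the intuition behind the higher variance of LOOCV.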

## Cross-Validation

### Bias-Variance Trade-off

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/cv4.png?raw=true)

## Cross-Validation

### CV on Classification Problems

When we have a classification problem, we must change how we evaluate the error.

With classification, the LOOCV would look like this:

$$ CV_n \ = \ \dfrac{1}{n} \sum_i I(y_i \neq \widehat{y}_i) $$

The `accuracy` measure is the average of $I(y_i = \widehat{y}_i)$, so the misclassification rate is one minus the accuracy.
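
The relationship between accuracy and the misclassification rate is easy to check by hand. The labels below are made up for illustration and do not come from the Chile survey.

```julia
using Statistics

y    = ["Y", "N", "Y", "Y", "N"]   # made-up observed votes
yhat = ["Y", "Y", "Y", "N", "N"]   # made-up predicted votes

accuracy   = mean(yhat .== y)      # share of correct predictions: 0.6
error_rate = mean(yhat .!= y)      # share of misclassified cases: 0.4
```

Here `error_rate` equals `1 - accuracy`, so minimizing one is the same as maximizing the other.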

## Cross-Validation

### CV on Classification Problems

![img](https://github.com/umbertomig/POLI175julia/blob/c9b0555e3e97778495bee72746aee43ddf3226d7/img/cv5.png?raw=true)

# Questions?

# See you next class
