# Unit 10: Introduction to supervised learning 

***
## Simple (univariate) linear regression

Simple linear model with single independent variable $x$:
$$
\begin{aligned}
\text{Econometrics:} \qquad y_i &= \alpha + \beta x_i + \epsilon_i \\
\text{ML:} \qquad y_i &= b + w x_i + \epsilon_i
\end{aligned}
$$

#### Terminology

- $y$: dependent variable, response variable, outcome, target
- $x$: independent variables, features, covariates, predictors
- $\epsilon$: error term
- $\alpha$, $b$: intercept or bias (ML)
- $\beta$, $w$: slope coefficient or weight (ML)

#### Linearity assumption

- Model is assumed to be linear **in coefficients**, not in $x$
- Linear models include the following:
    $$
    \begin{aligned}
     y_i &= \alpha + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \\
     \log y_i &= \alpha + \beta \log x_i + \epsilon_i
    \end{aligned}
    $$

***
### Estimation

- Goal: Minimise the loss function, given by the **sum of squared residuals** (scaled by $1/N$):
    $$
    L(\alpha, \beta) = \frac{1}{N}
        \sum_{i=1}^N \Bigl(y_i - \alpha - \beta x_i \Bigr)^2
    $$
- **Estimates:** Parameters that minimise $L$ are often denoted $\widehat{\alpha}$, $\widehat{\beta}$
- **Predicted values** for given $x_i$:
    $$
    \widehat{y}_i = \widehat{\alpha} + \widehat{\beta} x_i
    $$
- **Prediction error** for given $y_i$, $x_i$:
    $$
    \begin{aligned}
    \widehat{\epsilon}_i &= y_i - \widehat{y}_i \\
        &= y_i - \widehat{\alpha} - \widehat{\beta} x_i
    \end{aligned}
    $$
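
In the univariate case the minimiser has a well-known closed form, $\widehat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\widehat{\alpha} = \bar{y} - \widehat{\beta} \bar{x}$. A minimal sketch with made-up numbers (the sample values are illustrative assumptions, not course data):

```python
import numpy as np

# Small made-up sample (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.4, 2.1, 2.4, 3.0])

# Closed-form OLS estimates for the univariate model
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Predicted values and prediction errors (residuals)
y_hat = alpha_hat + beta_hat * x
resid = y - y_hat

# Loss function evaluated at the estimates
loss = np.mean(resid**2)
```

With an intercept in the model, the residuals sum to zero by construction.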

***
### Example: Univariate linear regression with scikit-learn

Assume true model is given as follows:
$$
\begin{aligned}
y_i &= 1 + \frac{1}{2} x_i + \epsilon_i \\
\epsilon_i &\stackrel{\text{iid}}{\sim} N(0, 0.7^2)
\end{aligned}
$$
![Simple model](images/unit10_simple.png)

#### Simulate data

- Create random sample of $N=30$ observations
- $x$ uniformly spaced on $[0, 10]$
- Use `seed=123`
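
One way to generate such a sample is sketched below. It assumes NumPy's `default_rng()` as the random number generator and an evenly spaced grid for $x$; the original notebook may draw the data differently.

```python
import numpy as np

# True parameters from the model above
alpha, beta, sigma = 1.0, 0.5, 0.7
N = 30

rng = np.random.default_rng(seed=123)

# Regressor on an evenly spaced grid on [0, 10]
x = np.linspace(0.0, 10.0, N)

# Normally distributed errors with standard deviation sigma
epsilon = rng.normal(loc=0.0, scale=sigma, size=N)

# Outcomes implied by the true model
y = alpha + beta * x + epsilon
```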

#### Estimate linear regression

- Estimate linear model with [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html):
    1. Create model instance
    2. Fit model with `fit()` method
        - Estimated intercept stored in `intercept_` attribute
        - Estimated coefficients stored in `coef_` attribute
    3. Compute predicted values with `predict()`
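
The three steps above can be sketched as follows. The data-generating lines are an assumption (same setup as the simulated model, using NumPy's `default_rng()`); the point is the `fit()` / `predict()` workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated sample (assumed setup, see true model above)
rng = np.random.default_rng(seed=123)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 0.5 * x + rng.normal(scale=0.7, size=x.size)

# scikit-learn expects a 2-dimensional feature matrix of shape (N, 1)
X = x[:, None]

lr = LinearRegression()      # 1. create model instance
lr.fit(X, y)                 # 2. fit model

alpha_hat = lr.intercept_    # scalar intercept
beta_hat = lr.coef_[0]       # coef_ holds one element per feature

y_hat = lr.predict(X)        # 3. predicted values
```

Note that `coef_` is an array even with a single feature, hence the `[0]`.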

#### Plot sample, true and estimated model

***
## Training and test samples

- ML mostly deals with prediction
- Estimate model on **training sample**
- Evaluate prediction on **test sample**
- Helps detect **overfitting** to the training sample

### Example: Ames housing data

- Target variable `SalePrice` (in USD)
- Explanatory variable `LivingArea` (in m²)
- Simple linear model:
    $$
    SalePrice_i = \alpha + \beta \cdot LivingArea_i + \epsilon_i
    $$

```python
# Uncomment this to use files in the local data/ directory
DATA_PATH = '../../data'

# Load data directly from GitHub (for Google Colab)
# DATA_PATH = 'https://raw.githubusercontent.com/richardfoltyn/MLFP-ECON5130/main/data'
```

#### Manually creating training and test samples

1. Randomly assign 90% of data to training sample
2. Rest assigned to test sample
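
A manual split can be sketched by shuffling observation indices. The arrays below are a hypothetical stand-in for the housing data (made-up values, not the Ames sample), and the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=123)

# Hypothetical stand-in for the data set: N observations of (x, y)
N = 200
x = rng.uniform(50.0, 300.0, size=N)
y = 20_000.0 + 1_000.0 * x + rng.normal(scale=30_000.0, size=N)

# Randomly permute the observation indices
idx = rng.permutation(N)

# First 90% of the shuffled indices form the training sample
N_train = int(0.9 * N)
idx_train, idx_test = idx[:N_train], idx[N_train:]

x_train, y_train = x[idx_train], y[idx_train]
x_test, y_test = x[idx_test], y[idx_test]
```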

#### Estimate univariate model

- Estimation performed on training sample
- Interpretation of coefficients?

#### Prediction errors

- Predicted sale price:
    $$
    \widehat{SalePrice}_i = \widehat{\alpha} + \widehat{\beta} \cdot LivingArea_i
    $$
- Prediction error:
    $$
    \widehat{\epsilon}_i = SalePrice_i - \widehat{SalePrice}_i
    $$
- Plot prediction error against dependent or independent variable

#### Automatically creating training and test samples

- Use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
    - Specify `test_size` or `train_size`
    - Set `random_state` for reproducibility
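
A short sketch of the same 90/10 split with `train_test_split()`. The data are again a made-up stand-in rather than the Ames sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=123)

# Hypothetical stand-in data: one feature column and a target
X = rng.uniform(50.0, 300.0, size=(200, 1))
y = 20_000.0 + 1_000.0 * X[:, 0] + rng.normal(scale=30_000.0, size=200)

# Keep 10% of observations for the test sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=123
)
```

A float `test_size` is interpreted as a fraction of the sample, an integer as the absolute number of test observations.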

***
## Evaluating the model fit

- Need metric (score) to evaluate model fit!
- **Mean squared error (MSE):**
    $$
    MSE = \frac{1}{N} \sum_{i=1}^N \bigl(y_i - \widehat{y}_i\bigr)^2
    $$
- **Root mean squared error (RMSE)**:
    $$
    RMSE = \sqrt{MSE} = \left(\frac{1}{N} \sum_{i=1}^N \bigl(y_i - \widehat{y}_i\bigr)^2 \right)^{\frac{1}{2}}
    $$
- **Coefficient of determination ($R^2$)**: bounded within $[0,1]$ on the training sample (for a linear model with intercept), but can be negative on the test sample
    $$
    R^2 = 1 - \frac{MSE}{\widehat{Var}(y)}
    $$

Convenience functions in `scikit-learn`:

- [`mean_squared_error()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)
- [`r2_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score)
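
The following sketch checks the convenience functions against the formulas above, using made-up outcomes and predictions. Note that $\widehat{Var}(y)$ must be computed with the $1/N$ normalisation (NumPy's default) to match `r2_score()`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up outcomes and predictions (illustrative values only)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Metrics computed by scikit-learn
mse = mean_squared_error(y, y_hat)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_hat)

# Same quantities computed from the formulas above
mse_manual = np.mean((y - y_hat) ** 2)
r2_manual = 1.0 - mse_manual / np.var(y)
```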

***
## Multivariate linear regression

### Data with several explanatory variables

$$
y_i = \mathbf{x}_i'\mathbf{\beta} + \epsilon_i
$$

- Regressors are now given as **vector** $\mathbf{x}$
- Coefficient **vector** $\mathbf{\beta}$ to be estimated

***
### Example: Ames housing data

- Target variable `SalePrice` (in USD)
- Explanatory variables `LivingArea` (in m²), `LotArea` (in m²)
$$
SalePrice_i = \alpha + \beta_0 LivingArea_i + \beta_1 LotArea_i + \epsilon_i
$$

#### Create train/test sample split

- [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with `test_size=123`, `random_state=123`

#### Estimate multivariate model

- Interpretation of coefficients?

#### Plot prediction errors

- Plot against response variable

***
### Polynomial features

- Example: Linear model with cubic polynomial in $x$ as explanatory variable:
    $$
    y_i = \alpha + \beta_0 x_i + \beta_1 x_i^2 + \beta_2 x_i^3 + \epsilon_i
    $$

#### Example: Ames housing data with interactions

- Polynomials with interactions of $x$ and $z$:
    $$
    p(x,z) = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 x^2 + \beta_4 x \cdot z + \beta_5 z^2
    $$
    - $x$: `LivingArea`
    - $z$: `LotArea`
- Model given by
    $$
    SalePrice_i = p(LivingArea_i, LotArea_i) + \epsilon_i
    $$

##### Create polynomial features

- Create with [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
- Pass `include_bias=True` to include intercept (constant)
- If polynomial has constant, fit linear model with `fit_intercept=False`
- Exponents stored in `powers_` attribute
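
A small sketch of what `PolynomialFeatures` produces for two variables and degree 2, using two made-up observations of $(x, z)$:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up observations of (x, z)
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

poly = PolynomialFeatures(degree=2, include_bias=True)
Xpoly = poly.fit_transform(X)

# Columns are ordered as 1, x, z, x^2, x*z, z^2;
# each row of powers_ lists the exponents of (x, z) for one column
powers = poly.powers_
```

The column order matches the order of terms in $p(x,z)$ above, so the estimated coefficients line up with $\beta_0, \dots, \beta_5$.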

##### Estimate model

##### Plot prediction errors

***
### Using scikit-learn pipelines

- Manually preprocessing variables is error prone 
- Can be automated using **pipelines**, which can be created in two ways:
    1. Create an instance of the 
        [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) class
    2. Use the 
        [`make_pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) convenience function

    Pipeline steps can be accessed by name using the `named_steps` attribute
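
Both variants are sketched below on made-up data with two features (a stand-in, not the Ames sample). The step names `'poly'` and `'lr'` are arbitrary choices; `make_pipeline()` generates names from the lowercased class names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(seed=123)

# Hypothetical stand-in data with two features
X = rng.uniform(0.0, 1.0, size=(100, 2))
y = 1.0 + X[:, 0] + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=100)

# Variant 1: explicit step names
pipe = Pipeline(steps=[
    ('poly', PolynomialFeatures(degree=2, include_bias=True)),
    ('lr', LinearRegression(fit_intercept=False)),
])

# Variant 2: step names are generated automatically
pipe2 = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=True),
    LinearRegression(fit_intercept=False),
)

pipe.fit(X, y)
pipe2.fit(X, y)

# Access the fitted estimators by their step names
coefs = pipe.named_steps['lr'].coef_
coefs2 = pipe2.named_steps['linearregression'].coef_
```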

#### Estimate polynomial model using pipelines

***
## Optimising hyperparameters with cross-validation

- Hyperparameters: additional parameters that are **not** estimated, e.g., the polynomial degree from earlier
- How do we find optimal values for such parameters?
    1. Estimate model for different values of hyperparameters
    2. Pick best-performing model

### Illustration of cross-validation

![CV split](../../lectures/images/cv_split.svg)

***
### Example: Tuning of the polynomial degree with manual cross-validation

- Same setting as earlier:
    $$
    p(x,z) = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 x^2 + \beta_4 x \cdot z + \beta_5 z^2
    $$
    - $x$: `LivingArea`
    - $z$: `LotArea`
- Find best polynomial degree $d = 0,\dots,4$ using K-fold CV with 10 folds
- Use [`KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) to split sample


#### Cross-validation

#### Plot RMSE against polynomial degree

***
### Automating cross-validation

- Helper function [`cross_val_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html):
    - Select number of folds: `cv=10`
    - Select metric: `scoring='neg_root_mean_squared_error'`
    - List of available metrics (scores): `sklearn.metrics.get_scorer_names()`
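
The manual loop collapses to a single call with `cross_val_score()`. The sketch below uses a made-up data set and a fixed polynomial degree of 2 for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(seed=123)

# Hypothetical stand-in data with two features
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = 1.0 + X[:, 0] + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=True),
    LinearRegression(fit_intercept=False),
)

# Scores follow the "higher is better" convention, hence the negative RMSE
scores = cross_val_score(
    pipe, X, y, cv=10, scoring='neg_root_mean_squared_error'
)

# Average RMSE across folds
rmse = -np.mean(scores)
```

Passing an integer `cv` splits the sample into consecutive folds without shuffling; pass a `KFold` instance instead if shuffling is needed.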

***
## Linear models with regularisation: Ridge regression

- Additional penalty term in loss function:
$$
L(\mu, \mathbf{\beta}) = 
    \underbrace{\sum_{i=1}^N \Bigl(
    y_i - \mu - \mathbf{x}_i'\mathbf{\beta}\Bigr)^2}_{\text{Sum of squared errors}}
    + 
    \underbrace{\alpha \sum_{k=1}^K\beta_k^2}_{\text{L2 penalty}}
$$

### Example: Polynomial approximation

- True model: trigonometric function
    $$
    \begin{aligned}
    y_i &= \cos\left( \frac{3}{2}\pi x_i \right) + \epsilon_i \\
        \epsilon_i &\stackrel{\text{iid}}{\sim} N(0, 0.25)
    \end{aligned}
    $$
- Goal: Approximate with polynomial:
    $$
    y_i \approx \mu + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_K x_i^K 
    $$
    $K$ is a hyperparameter, but let's fix it at $K=15$

#### Plot true relationship

- Sample of $N=100$ created with `seed=1234`

#### Estimating linear regression model

Steps:

1. Create the polynomial in $x$ with
[`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
2. (Optional) Demean and normalise features with [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
3. Estimate with [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
4. Combine steps in [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)


#### Estimating the Ridge model

Steps:

1. Create the polynomial in $x$ with
[`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
2. Demean and normalise features with [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
3. Estimate with [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) with $\alpha=3$ (will be cross-validated later)
4. Combine steps in [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)


#### Compare linear regression vs. Ridge
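
A sketch of both pipelines fitted to the same sample. Several details are assumptions: $x$ is drawn uniformly from $[0, 1]$, the value 0.25 is treated as the error variance (standard deviation 0.5), and NumPy's `default_rng()` is used for the draws.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(seed=1234)
N = 100

# Assumptions: x uniform on [0, 1], error std. dev. of 0.5
x = rng.uniform(0.0, 1.0, size=N)
y = np.cos(1.5 * np.pi * x) + rng.normal(scale=0.5, size=N)
X = x[:, None]

K = 15

# The intercept is estimated by the final step, so no constant column
pipe_lr = make_pipeline(
    PolynomialFeatures(degree=K, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
pipe_ridge = make_pipeline(
    PolynomialFeatures(degree=K, include_bias=False),
    StandardScaler(),
    Ridge(alpha=3.0),
)

pipe_lr.fit(X, y)
pipe_ridge.fit(X, y)

# Ridge shrinks coefficients towards zero: compare Euclidean norms
norm_lr = np.linalg.norm(pipe_lr.named_steps['linearregression'].coef_)
norm_ridge = np.linalg.norm(pipe_ridge.named_steps['ridge'].coef_)
```

With 15 highly collinear polynomial terms, the unpenalised coefficients are typically orders of magnitude larger than the Ridge coefficients.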

***
### Tuning the regularisation parameter via cross-validation

- Can be automated with [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html)
- Takes grid of candidate $\alpha$ as argument
    - Use `logspace()` to put more alphas at lower end of grid
    - Argument `store_cv_values=True` stores scores for all alphas and all folds (renamed to `store_cv_results` in scikit-learn 1.5)
- **Important:** `RidgeCV` does **not** support pipelines, need to apply transformation manually

#### Perform cross-validation

- Optimal $\alpha$ stored in `alpha_` attribute
- Best score stored in `best_score_` attribute
- All scores are stored in `cv_values_` attribute (only if `store_cv_values=True`; renamed to `cv_results_` and `store_cv_results` in scikit-learn 1.5)
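
A sketch of the cross-validation step with manually transformed features. The sample (uniform $x$ on $[0, 1]$, error standard deviation 0.5) and the grid bounds are assumptions; storing per-fold scores is left out because the argument name depends on the scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(seed=1234)
N = 100

# Assumptions: x uniform on [0, 1], error std. dev. of 0.5
x = rng.uniform(0.0, 1.0, size=N)
y = np.cos(1.5 * np.pi * x) + rng.normal(scale=0.5, size=N)

# Apply the transformations manually before calling RidgeCV
poly = PolynomialFeatures(degree=15, include_bias=False)
Xpoly = poly.fit_transform(x[:, None])
Xtrans = StandardScaler().fit_transform(Xpoly)

# Candidate alphas, uniformly spaced in logs (illustrative grid bounds)
alphas = np.logspace(-5, 1, 100)

# With the default cv=None, RidgeCV uses efficient leave-one-out CV
rcv = RidgeCV(alphas=alphas).fit(Xtrans, y)

alpha_best = rcv.alpha_         # optimal alpha from the grid
score_best = rcv.best_score_    # score at the optimal alpha
```

With the default scoring, `best_score_` is the negative mean squared error, so values closer to zero are better.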

#### Plot MSE against alphas

#### Re-estimate with optimal alpha

***
## Linear models with regularisation: Lasso

- Additional penalty term in loss function (but different from Ridge!):
$$
L(\mu, \mathbf{\beta}) = 
    \underbrace{\sum_{i=1}^N \Bigl(
    y_i - \mu - \mathbf{x}_i'\mathbf{\beta}\Bigr)^2}_{\text{Sum of squared errors}}
    + 
    \underbrace{\alpha \sum_{k=1}^K |\beta_k|}_{\text{L1 penalty}}
$$

### Example: Polynomial approximation

- Exact same setting as for Ridge

#### Estimating the Lasso

Steps:

1. Create the polynomial in $x$ with
[`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
2. Demean and normalise features with [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
3. Estimate with [`Lasso`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) with $\alpha=0.015$ (will be cross-validated later)
    - Might need to increase the `max_iter` parameter (from the default of 1,000).
4. Combine steps in [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
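
The steps above can be sketched as follows. The sample is an assumption (uniform $x$ on $[0, 1]$, error standard deviation 0.5, NumPy's `default_rng()`), and `max_iter=100_000` is an arbitrary generous limit.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(seed=1234)
N = 100

# Assumptions: x uniform on [0, 1], error std. dev. of 0.5
x = rng.uniform(0.0, 1.0, size=N)
y = np.cos(1.5 * np.pi * x) + rng.normal(scale=0.5, size=N)
X = x[:, None]

pipe_lasso = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False),
    StandardScaler(),
    Lasso(alpha=0.015, max_iter=100_000),
)
pipe_lasso.fit(X, y)

# Estimated coefficients on the standardised polynomial terms
coefs = pipe_lasso.named_steps['lasso'].coef_

# Number of coefficients that are not exactly zero
n_nonzero = np.count_nonzero(coefs)
```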

#### Number of non-zero coefficients 

- Lasso creates **sparse** models: only a few coefficients are non-zero
- Example: fit models for $\alpha$ on the interval $[5 \times 10^{-3}, 1]$
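
A sketch of how sparsity varies with $\alpha$ on this interval. The sample (uniform $x$ on $[0, 1]$, error standard deviation 0.5) and the log-spaced grid of 20 points are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(seed=1234)
N = 100

# Assumptions: x uniform on [0, 1], error std. dev. of 0.5
x = rng.uniform(0.0, 1.0, size=N)
y = np.cos(1.5 * np.pi * x) + rng.normal(scale=0.5, size=N)

# Standardised polynomial features of degree 15
poly = PolynomialFeatures(degree=15, include_bias=False)
Xtrans = StandardScaler().fit_transform(poly.fit_transform(x[:, None]))

# Grid of alphas on [5e-3, 1], uniformly spaced in logs
alphas = np.geomspace(5e-3, 1.0, 20)

n_nonzero = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=100_000).fit(Xtrans, y)
    # Count coefficients that are not exactly zero
    n_nonzero.append(int(np.count_nonzero(lasso.coef_)))
```

For a sufficiently large penalty all coefficients are set to zero and the model predicts the sample mean.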

### Tuning the regularisation parameter via cross-validation

#### Perform cross-validation

- Performed via [`LassoCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html)
- Need to specify either of the following:
    - Grid of candidate $\alpha$ (similar to `RidgeCV`)
    - Fraction $\epsilon = \frac{\alpha_{min}}{\alpha_{max}}$ (default: $10^{-3}$) and the grid size (default: 100)

        Resulting grid of $\alpha$ is stored in `alphas_` attribute.
- Might need to increase the `max_iter` parameter (from the default of 1,000).
- Results stored in attributes `alpha_` and `mse_path_`
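
A sketch using the second option, where the grid is built from `eps` and the default grid size of 100. The sample (uniform $x$ on $[0, 1]$, error standard deviation 0.5) is an assumption, the default of 5 folds is used, and `max_iter=10_000` is an arbitrary choice that keeps the run time modest.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(seed=1234)
N = 100

# Assumptions: x uniform on [0, 1], error std. dev. of 0.5
x = rng.uniform(0.0, 1.0, size=N)
y = np.cos(1.5 * np.pi * x) + rng.normal(scale=0.5, size=N)

# Standardised polynomial features of degree 15
poly = PolynomialFeatures(degree=15, include_bias=False)
Xtrans = StandardScaler().fit_transform(poly.fit_transform(x[:, None]))

# Grid of alphas is constructed from eps = alpha_min / alpha_max
lcv = LassoCV(eps=1e-3, max_iter=10_000).fit(Xtrans, y)

alpha_best = lcv.alpha_             # optimal alpha
alphas = lcv.alphas_                # grid of candidate alphas
# MSE for each alpha (rows) and fold (columns), averaged over folds
mse_mean = lcv.mse_path_.mean(axis=1)
```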

#### Plot MSE against alpha

#### Re-estimate with optimal alpha