# POLI 175

## Class 02 - What is Machine Learning?

Dr. Umberto Mignozzetti

UCSD

(Slides here follow ISL closely)

## Introduction

- Suppose you are hired as a consultant to help design campaign expenditures for a firm.

- And they ask you: Where should we spend our resources? The options are: `TV`, `radio`, and `newspaper`.

- They want to maximize the sales revenue.

- Where would you spend the money?

## Introduction

- Let me give you a bit more info: here are the previous advertising expenditures and their effects on sales:

![image](img/sales.png)

## Introduction

- Did it help?

- Some people would say yes, I'd say *not really*.

![image](img/sales.png)

## Introduction

Let's formalize the ideas:

- $X$: Matrix of predictors ($X_1$: TV expenditures, $X_2$: radio, $X_3$: newspaper)

- $Y$: Response variable

- $f(.)$: Unknown function that connects the predictors with the response variable.

- $\varepsilon$: Random error term

$$ Y \ = \ f(X) + \varepsilon $$
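A minimal sketch of what this equation says, assuming a made-up $f$ (in real life we never get to see it). The coefficients, ranges, and the helper names `f` and `simulate_sales` are hypothetical, chosen only for illustration:

```python
import random

def f(tv, radio, newspaper):
    # Hypothetical "true" function; in practice f is unknown.
    return 3.0 + 0.05 * tv + 0.2 * radio + 0.0 * newspaper

def simulate_sales(n, noise_sd=1.0, seed=0):
    """Draw n observations of (TV, radio, newspaper, sales) from Y = f(X) + eps."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        tv = rng.uniform(0, 300)
        radio = rng.uniform(0, 50)
        newspaper = rng.uniform(0, 100)
        eps = rng.gauss(0, noise_sd)  # random error term
        rows.append((tv, radio, newspaper, f(tv, radio, newspaper) + eps))
    return rows

data = simulate_sales(200)
print(data[0])
```

Machine learning runs this story backwards: we only see `data` and try to recover something close to `f`.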

## Introduction

Another example: Do you think your years of study will translate into a better salary in the future?

- $Y$: Future salary

- $X$: Years of study

![image](img/educ.png)

## Why estimate $f$?

- Our job when doing ML is to estimate $f$. But why do we do that?

1. **Prediction**: We want to predict the values of $Y$: $\hat{Y} = \hat{f}(X)$
    
$$ E(Y - \hat{Y})^2 \ = \ E[f(X) + \varepsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{Var(\varepsilon)}_{\text{Irreducible}}
$$
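A quick numerical check of this decomposition, as a sketch under stated assumptions: `f`, `f_hat`, the point `x = 10`, and the noise level are all hypothetical. We hold $X$ and $\hat{f}$ fixed and average the squared prediction error over many draws of $\varepsilon$:

```python
import random

def f(x):
    return 2.0 * x  # hypothetical true function

def f_hat(x):
    return 1.8 * x  # hypothetical (imperfect) estimate of f

def expected_squared_error(x, noise_sd=1.0, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E(Y - Y_hat)^2 at a fixed x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        y = f(x) + rng.gauss(0, noise_sd)
        total += (y - f_hat(x)) ** 2
    return total / n_draws

x = 10.0
reducible = (f(x) - f_hat(x)) ** 2  # shrinks if we improve f_hat
irreducible = 1.0 ** 2              # Var(eps); no model can remove it
simulated = expected_squared_error(x)
print(reducible + irreducible, simulated)
```

The two printed numbers should be close: a better `f_hat` lowers the first piece, but the second stays no matter what.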

## Why estimate $f$?

2. **Inference**: We want, as scientists, to understand how $Y$ is related to a set of $X$s.
    
    1. *Which predictors are associated with the response?*
    
    2. *What is the relationship between the response and each predictor?*
    
    3. *Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?*

## How do we estimate $f$?

- Suppose we have a set of $n$ observations, $(Y_1, X_1)$, ..., $(Y_n, X_n)$.

- We will call these observations the **training set**, since we will use these to estimate the function $f$.

- Broadly speaking we have two methods to estimate the $f$ function:

    1. Parametric

    2. Non-parametric

## How do we estimate $f$?

**Parametric**:

1. We make an assumption about the functional form (f.f.), e.g., that it is linear:

$$ f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p $$

2. After the f.f. is selected, we fit (train) the model using the data.
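The two steps can be sketched for a single predictor, where least squares has a closed form. This is an illustration, not the only way to fit a linear model; the simulated `education` and `income` values and the function `fit_simple_linear` are hypothetical:

```python
import random

def fit_simple_linear(x, y):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Step 1: assume a linear f.f. Step 2: train on (simulated) data.
rng = random.Random(1)
education = [rng.uniform(10, 22) for _ in range(500)]
income = [20 + 5 * e + rng.gauss(0, 3) for e in education]

beta0_hat, beta1_hat = fit_simple_linear(education, income)
print(beta0_hat, beta1_hat)
```

Estimating $f$ has been reduced to estimating two numbers, which is why the parametric approach is so convenient.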

## How do we estimate $f$?

**Parametric**:

$$ \text{income} \approx \beta_0 + \beta_1 \times \text{education} + \beta_2 \times \text{seniority} $$

![image](img/linreg.png)

## How do we estimate $f$?

**Parametric**:

- This parametric approach has advantages. The main one is that it is straightforward to estimate.

- However, it is not very flexible, and it does not capture more complex relationships.

- We can estimate more flexible relations, but we may *overfit* our estimates.

- We can always conjecture the wrong $f$!

- In any case, in the parametric models we need to make assumptions regarding the f.f. of $f$.

## How do we estimate $f$?

**Non-parametric**:

- Does not assume the f.f. of $f$.

- Seek an estimate of $f$ that gets as close to the data points as possible, without being too rough or wiggly.

- Requires lots of observations.

- *Overfitting* becomes a more salient problem.

**Overfitting:** The estimate does well in the training set but does poorly when applied to other observations.
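A small sketch of overfitting with one non-parametric method, $K$-nearest-neighbors regression (swapped in here because it fits in a few lines; it is not the spline method). The true function, noise level, and sample sizes are hypothetical:

```python
import math
import random

def true_f(x):
    return math.sin(x)  # hypothetical true function

def make_data(n, rng, noise_sd=0.5):
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def mse(x_eval, y_eval, x_train, y_train, k):
    errors = [(y - knn_predict(x, x_train, y_train, k)) ** 2
              for x, y in zip(x_eval, y_eval)]
    return sum(errors) / len(errors)

rng = random.Random(42)
x_train, y_train = make_data(200, rng)
x_test, y_test = make_data(1000, rng)

results = {}
for k in (1, 15):  # k = 1 is the most flexible fit
    results[k] = (mse(x_train, y_train, x_train, y_train, k),
                  mse(x_test, y_test, x_train, y_train, k))
    print(k, results[k])
```

With `k = 1` the fit passes through every training point, so the training error is zero, yet the error on new observations is worse than with the smoother `k = 15`.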

## How do we estimate $f$?

**Non-parametric**: Thin-plate splines

![spline](img/spline.png)

## Estimation of $f$

**Trade-offs:** Flexibility vs. Interpretability

- *Why would we ever choose to use a more restrictive method instead of a very flexible approach?*

- If you are a scientist, you may want to interpret the results more than have a flexible but hard-to-understand approach.

- Thus, when **inference** is the goal, we may choose a more restrictive model.

- When **prediction** is the goal, we may use a more flexible model. It captures more nuanced relationships.

- Think self-driving Teslas: the car needs to predict when to turn, not explain why.

- But the interpretability problem does not go away: think about why some people complain about self-driving Teslas.

## Estimation of $f$

**Trade-offs:** Flexibility vs. Interpretability

![flexint](img/flexint.png)

## Estimation of $f$

**Approaches:** Supervised vs. Unsupervised Machine Learning

- The machine learning techniques roughly divide into *Supervised* and *Unsupervised* methods

- **Supervised:** For each observation $i$, we have a target $Y_i$.

- **Unsupervised:** We have **no** target $Y_i$. Only $X_i$s, and we want to make sense of them.

- **Semi-Supervised:** We know a few $Y_i$, but we want to predict the $Y_i$s for the majority of the data.
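To make the unsupervised case concrete, here is a minimal sketch of one such method, $k$-means clustering in one dimension. It is only an illustration: the data are simulated, and `kmeans_1d` is a hypothetical helper, not a library function. Note that no $Y_i$ appears anywhere:

```python
import random

def kmeans_1d(xs, k=2, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centers."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)  # start from k randomly chosen observations
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for x in xs:
            closest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[closest].append(x)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sorted(centers)

# Only X: two hidden groups, but no labels telling us which is which.
rng = random.Random(3)
xs = ([rng.gauss(0, 1) for _ in range(100)]
      + [rng.gauss(10, 1) for _ in range(100)])

centers = kmeans_1d(xs)
print(centers)
```

The algorithm recovers two group centers from the $X_i$s alone; in the supervised case we would instead have a $Y_i$ to learn from.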

## Estimation of $f$

**Unsupervised approach:**

![unsup](img/unsup.png)

## Model Accuracy

- Too many methods... How to choose?

- *There is no free lunch in statistics*: **no one method dominates all others over all possible data sets.**

- We will spend some time choosing methods, and then, choosing the best *tuning* parameters for these methods.

- One criterion: 

**Mean Squared Error (MSE)**

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 $$

## Model Accuracy

**Mean Squared Error (MSE)**

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 $$

- We can compute the MSE on the *training* data, but what we really want to know is how the model performs on *unseen* data.

- That's why for most training purposes, we will split our dataset into two parts: *training* and *testing*.

- We want to compute the MSE on this *testing* data: it is our best shot at knowing how the model is going to behave in real-world applications!
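A minimal sketch of the split-then-evaluate routine, assuming simulated data and a deliberately naive model (predict the training mean for everyone). The helpers `mse` and `train_test_split` are hypothetical, written here from scratch:

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows, then hold out a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

rng = random.Random(7)
xs = [rng.uniform(0, 10) for _ in range(200)]
rows = [(x, 2.0 * x + rng.gauss(0, 1)) for x in xs]

train, test = train_test_split(rows)

# "Train": the only thing this naive model learns is the mean of y.
y_bar = sum(y for _, y in train) / len(train)

train_mse = mse([y for _, y in train], [y_bar] * len(train))
test_mse = mse([y for _, y in test], [y_bar] * len(test))
print(train_mse, test_mse)
```

The key design choice is that `y_bar` is computed from `train` only; the `test` rows are touched just once, to measure the error.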

## Model Accuracy

**Mean Squared Error (MSE)**

![bvt](img/bvt.png)

## Model Accuracy

**Mean Squared Error (MSE)**

![bvt](img/bvt2.png)

## Model Accuracy

- This trade-off is called **Bias-Variance Trade-off**.

- When we adopt a more flexible approach, we **decrease** the bias (distance between $f$ and $\hat{f}$).

    - This means that the training MSE decreases.

- However, when we adopt a more flexible approach, we **increase** the variance (think overfitting).

$$ E(y_0 - \hat{f}(x_0))^2 \ = \ Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\varepsilon) $$

- Our job is to fit a model that has **low bias** and **low variance**.
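The trade-off can be seen in a simulation sketch: redraw the training set many times, refit, and look at how the prediction at one point $x_0$ behaves. Everything here is an assumption made for illustration (the true function, $x_0 = \pi/2$, the noise level, and $K$-nearest neighbors as the model, where smaller $K$ means more flexible):

```python
import math
import random

def true_f(x):
    return math.sin(x)  # hypothetical true function

def knn_fit_predict(x0, k, rng, n=50, noise_sd=0.5):
    """Draw a fresh training set and return the k-NN prediction at x0."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def bias_variance(x0, k, n_sims=2000, seed=0):
    """Estimate squared bias and variance of f_hat(x0) over many training sets."""
    rng = random.Random(seed)
    preds = [knn_fit_predict(x0, k, rng) for _ in range(n_sims)]
    mean_pred = sum(preds) / n_sims
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_sims
    bias_sq = (mean_pred - true_f(x0)) ** 2
    return bias_sq, variance

x0 = math.pi / 2
results = {}
for k in (1, 10, 40):
    results[k] = bias_variance(x0, k)
    print(k, results[k])
```

The flexible fit (`k = 1`) has little bias but high variance; the rigid fit (`k = 40`) has the opposite problem. Adding $Var(\varepsilon)$ to each pair gives the expected test error at $x_0$.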

## Model Accuracy

**Bias-Variance Trade-off**

![bvt](img/bvt3.png)

## Questions?

## See you in the next class!