In [1]:
import numpy as np
import pandas as pd
import scipy.stats as scs

import matplotlib.pyplot as plt

# Predictive Linear Regression

Brandon Martin-Anderson June 2019, with great debt to Moses Marsh, Matt Drury

### Learning goals

* fill this out

### The task of predicting quantities: in story form

Let's start a business: we estimate the fuel economy of cars. People describe a car and pay us money, and we tell them the fuel economy (in miles per gallon) of their car.

* The first customer comes in, and says, "This is a very competitive field. Why should we pay you?"
* You say, "We are very good at what we do."
* They say, "Your competitor says the same thing. Do you have evidence?"

Note two things.
1. **You are competing against other models.**
2. **Your model will be evaluated quantitatively.**

So you say: "We have analyzed all of our competitors' predictions. All their predictions have some error; the mean of their predictive errors is 5 mpg. We can do better."

* Them: "Ah, interesting. Prove it."
* You: "Describe a car."
* Them: "A 2020 hybrid Ford F250"
* You: "Can't predict - **that car isn't in our records**"
* Them: "What good is that?"
* You: "If you ask about an old car, we can predict it perfectly."
* Them: "And...why would we pay for that?"
* You: "I don't know, but we are perfect."
* Them: "We'll go to your competitor instead."


3. **The only performance that matters is predictive performance on unseen examples.**



### The task of predicting quantities: in math form

There exists a quantitative response $Y$ (e.g., fuel economy) which is equal to a function of some properties $X$ (e.g., properties of cars) plus irreducible noise $\epsilon$ (e.g., manufacturing errors, vehicle wear, variations in vehicle usage).

$$Y = f(X) + \epsilon$$

The function $f(X)$ is generally known only to Nature - it is almost always hidden from us.

It is often possible to obtain a specific observation $y_i = f(x_i) + \epsilon_i$ through a relatively expensive process (e.g., building a car with certain properties and then driving it around and measuring its fuel economy). Instead of engaging in that expensive process, we're interested in finding an estimate $\hat{Y}$ using a _predictive model_ $\hat{f}$:

$$\hat{Y} = \hat{f}(X)$$

In contrast to $f$, $\hat{f}$ is known to us, because it's a function that we choose.
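
To make the distinction between $f$ and $\hat{f}$ concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the "true" `f`, the noise level, and the range of car weights are invented, and `fhat` is a deliberately crude guess. In a real problem we would never get to see `f`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" function, known only to Nature: mpg falls with weight.
# This formula is made up for illustration; it is not fit to any data.
def f(x):
    return 45.0 - 0.007 * x

# Made-up car weights (lbs) and irreducible noise epsilon.
x = rng.uniform(1500, 5000, size=200)
epsilon = rng.normal(0.0, 2.0, size=200)

# The response: Y = f(X) + epsilon
y = f(x) + epsilon

# Our predictive model: a function we choose. Here, a crude guess.
def fhat(x):
    return 40.0 - 0.005 * x

yhat = fhat(x)

# Compare a few true responses with our predictions.
print(y[:3])
print(yhat[:3])
```

Even a perfect $\hat{f} = f$ would not reproduce `y` exactly, because `epsilon` is not a function of `x`.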

The formal goal of prediction is to find $\hat{f}$ that **minimizes some error function $E(Y, \hat{Y})$ on a collection of samples $X$ hidden when $\hat{f}$ is chosen.**

In the example above, $E$ was simply the mean absolute difference:

$$E(Y, \hat{Y}) = \frac{\sum_{i=1}^{n}{|y_i - \hat{y_i}|}}{n}$$

The error function (often called the _loss_) should ultimately be chosen to model the business logic of the prediction task, but probably the most common choice is the _mean squared error_ (MSE):

$$E(Y, \hat{Y}) = \frac{\sum_{i=1}^{n}{(y_i - \hat{y_i})^2}}{n}$$

MSE is popular for a number of reasons, chiefly:
* It's convex.
* It's decomposable into bias, variance, and irreducible error terms.
* There exists a closed form solution to minimize it in the case of a linear $\hat{f}$.

That last one is _extremely convenient_. It's why we're talking about linear regression today.
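
As a sketch of what that closed form looks like: if we write a linear model as $\hat{f}(X) = X\beta$ (the symbol $\beta$ for the coefficient vector is notation introduced here, not defined above), the MSE-minimizing coefficients solve the normal equations $X^T X \beta = X^T y$, giving $\hat{\beta} = (X^T X)^{-1} X^T y$ when $X^T X$ is invertible. The example below uses synthetic data invented for illustration, not the cars dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: y = 3 + 2*x + noise.
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x])

# Closed-form solution of the normal equations (X^T X) beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem in a more
# numerically stable way; the two answers should agree.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)  # intercept and slope, close to 3 and 2 up to noise
print(beta_lstsq)
```

No iterative search is needed: one linear solve gives the coefficients with the smallest possible MSE on the data they were fit to.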



## A very simple predictive model

We wish to devise a predictive model $\hat{f}$ for our fuel-economy-estimation business. We start with a CSV we found on the internet.

In [2]:
cars = pd.read_csv("data/cars.csv")

In [3]:
cars.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model,origin,car_name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


Let's start with a very simple model: **always predict the average mpg**.

## Evaluating our model

Let's evaluate our model.

Because our metric for success is performance on unseen data, we need to keep some of our data unseen.

In [4]:
n = len(cars)
n_holdout = int(n*0.2)
print( f"{n} records total, holding out {n_holdout}" )

398 records total, holding out 79


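Shuffle the cars before splitting. The file is not in random order (the first rows shown above are all 8-cylinder cars from model year 70), so an unshuffled split would give training and test sets that look quite different from each other.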

In [5]:
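# Shuffle all rows (sampling n rows without replacement is a random permutation)
cars = cars.sample(len(cars))
# The first n_holdout shuffled rows form the test set; the rest are for training
cars_test = cars.iloc[:n_holdout]
cars_train = cars.iloc[n_holdout:]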

In [6]:
cars_train.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model,origin,car_name
139,14.0,8,302.0,140.0,4638.0,16.0,74,1,ford gran torino (sw)
290,15.5,8,351.0,142.0,4054.0,14.3,79,1,ford country squire (sw)
181,33.0,4,91.0,53.0,1795.0,17.5,75,3,honda civic cvcc
43,13.0,8,400.0,170.0,4746.0,12.0,71,1,ford country squire (sw)
391,36.0,4,135.0,84.0,2370.0,13.0,82,1,dodge charger 2.2
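
The shuffle above is unseeded, so each run produces a different split and slightly different numbers downstream. A sketch of a reproducible variant follows, assuming a fixed seed is acceptable for the task. The toy DataFrame and the value `random_state=0` are arbitrary choices for illustration and are not what this notebook ran.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cars DataFrame (made-up values).
toy = pd.DataFrame({"mpg": np.arange(10, 30, dtype=float)})

n_holdout = int(len(toy) * 0.2)

# frac=1 returns every row in random order; random_state fixes that order.
shuffled = toy.sample(frac=1, random_state=0)
toy_test = shuffled.iloc[:n_holdout]
toy_train = shuffled.iloc[n_holdout:]

# Every row lands in exactly one of the two sets.
print(len(toy_test), len(toy_train))
print(set(toy_test.index).isdisjoint(toy_train.index))
```

Re-running this cell gives the same `toy_test` and `toy_train` every time, which makes error metrics comparable across runs.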


Creating the "model" simply involves finding the mean, and returning it for every row in $X$.

In [7]:
mean_mpg = cars_train.mpg.mean()

In [8]:
mean_mpg

23.330721003134798

In [9]:
def fhat(X):
    # Predict the training-set mean mpg for every row of X
    return np.full(len(X), mean_mpg)

In [10]:
yhat = fhat(cars_test)

In [11]:
yhat

array([23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721, 23.330721, 23.330721, 23.330721, 23.330721, 23.330721,
       23.330721])

Find the MSE without loops using vectorized numpy-style operations:

In [12]:
mse = ((cars_test.mpg - yhat)**2).mean()
mse

60.235310670236366

This is not an impressive model, but its evaluation satisfies two important requirements:

* It is a quantitative, comparable error metric,
* computed on unseen data.

In a very simple way, we've done our job. Simple models like this are called **benchmarks**; they're very useful for setting a floor on acceptable performance. A more complex model that cannot beat the benchmark on unseen data is not worth its complexity.
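
A benchmark earns its keep when a candidate model is scored against it on the same held-out data. The following is a minimal sketch of that comparison on synthetic data; the weight-to-mpg relationship, the noise level, and the straight-line candidate fit with `np.polyfit` are all assumptions chosen for illustration, not results from the cars dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for (weight, mpg); the relationship is invented.
weight = rng.uniform(1500, 5000, size=300)
mpg = 45.0 - 0.007 * weight + rng.normal(0.0, 3.0, size=300)

# Hold out the first 20% (the synthetic rows are already in random order).
n_holdout = int(len(mpg) * 0.2)
w_test, w_train = weight[:n_holdout], weight[n_holdout:]
y_test, y_train = mpg[:n_holdout], mpg[n_holdout:]

def mse(y, yhat):
    return ((y - yhat) ** 2).mean()

# Benchmark: always predict the training mean.
mse_benchmark = mse(y_test, np.full(len(y_test), y_train.mean()))

# Candidate: a straight line fit to the training set only.
slope, intercept = np.polyfit(w_train, y_train, deg=1)
mse_line = mse(y_test, intercept + slope * w_test)

print(mse_benchmark, mse_line)
```

Both models are fit on the training rows only and scored on the same test rows, so the two MSE values are directly comparable.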

Though MSE is mathematically elegant, it can be difficult to interpret because its units are squared (here, mpg²). It is also common to look at the square root of the MSE (the RMSE), which has the same units as the target variable.

In [13]:
mse**0.5

7.761141067538739

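This can be loosely interpreted as the typical size of the model's error, in mpg. It is not quite the average error: because errors are squared before averaging, large misses count for more than they would in a mean absolute error.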
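
RMSE and mean absolute error are both measured in mpg, but they are different quantities. A small sketch with made-up errors (three small misses and one large one) shows how they diverge:

```python
import numpy as np

# Made-up prediction errors in mpg, for illustration only.
errors = np.array([1.0, 1.0, 1.0, 9.0])

# Mean absolute error: (1 + 1 + 1 + 9) / 4 = 3.0
mae = np.abs(errors).mean()

# Root mean squared error: sqrt((1 + 1 + 1 + 81) / 4) = sqrt(21)
rmse = np.sqrt((errors ** 2).mean())

print(mae, rmse)
```

The single 9 mpg miss pulls the RMSE well above the MAE. RMSE is never smaller than MAE on the same errors, and the two are equal only when every error has the same magnitude.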