# Inferential Statistics Vs. Predictive Statistics

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Agenda

SWBAT:

- Describe the hallmarks of inferential statistics and contrast them with the hallmarks of predictive statistics;
- Relate the goals of model-building to expected value, bias, and variance;
- Define error as a function of prediction error and irreducible error;
- Define prediction error as a combination of bias and variance.

## Inferential Statistics in a Nutshell

In Phase 1 we looked at *descriptive* statistics: starting with a dataset and making various observations (overall shape, histogram, outliers, etc.) as well as calculations of quantities that can characterize the dataset as a whole (mean, median, mode, variance, standard deviation, quartiles, percentiles, etc.).

At the beginning of Phase 2 we moved into inferential statistics. The main idea here is to imagine that *we don't have* (or anyway cannot *measure*) all the data of interest.

And this is, of course, the typical situation. Consider:

- A zoologist wanting to know the typical lifespan of a Siberian tiger
- A cosmologist wanting to know the mass of a normal white dwarf star
- A businesswoman wanting to know how many M&M's her customers should expect to find in their Party Size bags
- A botanist wanting to know how tall California redwoods usually grow

The zoologist could, in principle:

1. keep track of every currently existing Siberian tiger;
2. record their (more or less) exact ages at their moments of death;
3. add up those ages and divide by the number of tigers to calculate an average lifespan

But **only** in principle. In all of these situations, there is no realistic or practical opportunity to check each relevant data point.

What we can do, however, is to check *some* of the data points we want to check. That is, we'll draw a *sample* of data from our *population* of interest. We can then use the techniques of descriptive statistics to characterize our sample.

The hope, then, is that our sample will be *representative* of the population as a whole, which would justify our using facts about the sample to ***infer*** things about the population as a whole. Naturally we'll expect a certain amount of **error**: if I take the mean of a sample, $\bar{x}$, and use it as an estimate of the mean of the whole population, $\mu$, the estimate is bound to be imperfect. The same goes for any other sample statistic used to estimate a population parameter.

Inferential statistics makes all this precise. And that has been the bulk of the content of Phase 2.
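As a minimal sketch of the sample-versus-population idea, the cell below builds a made-up population (the normal distribution, its mean of 15, and the sample size of 200 are all assumptions chosen for illustration, not real tiger data), draws one sample, and compares $\bar{x}$ with $\mu$.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "population" of 100,000 lifespans in years (invented for illustration)
population = rng.normal(loc=15, scale=3, size=100_000)
mu = population.mean()

# Draw a sample of 200 without replacement and use its mean to estimate mu
sample = rng.choice(population, size=200, replace=False)
x_bar = sample.mean()

# Standard error of the mean, estimated from the sample alone
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

print(f"population mean (mu):     {mu:.3f}")
print(f"sample mean (x_bar):      {x_bar:.3f}")
print(f"estimation error:         {x_bar - mu:.3f}")
print(f"estimated standard error: {standard_error:.3f}")
```

In practice we never get to see `mu`; the standard error is what lets us say how far off `x_bar` is likely to be.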

Classically speaking, inference is a form of learning or of *increasing our knowledge*. So when conducting exercises in inferential statistics, the goal is ultimately **understanding**. If I am conducting a linear regression in an inferential mode, then:

- I will be very interested in the values of the coefficients, since these represent the effect of the associated factors on the target in question;
- the more data I use to build the regression the better;
- the fewer transformations of my data the better, since lots of transformations will impede transparency and comprehensibility;
- fewer predictors may be better than more;
- I will be very interested in respecting the assumptions of linear regression;
- I'll probably choose `statsmodels` if working in Python.

## Predictive Statistics in a Nutshell

The focus for predictive statistics is a bit different.

First, the emphasis is less on understanding and more (of course!) on making good *predictions* of future cases.

That means that I want the patterns I pick up on (in some dataset) to be patterns that will *recur* (in a similar dataset) in the future.

Needless to say, researchers may care *both* about understanding a process and about predicting the future. Science seems to involve both. If I am performing a linear regression on data about cigarette smoking paired with data about lifespans, then I may want both to understand exactly how smoking affects lifespan and to make predictions for the future about how long smokers can expect to live.

Nevertheless, the difference of emphasis can make for a difference in practice. If I am conducting a linear regression in a predictive mode, then:

- I won't particularly care about the values of the coefficients;
- I may want to have two different datasets: one on which to build ("train") the regression and another on which to **evaluate** ("test") the regression. (*Did* the patterns that I picked up on in the first dataset recur in the second?)
- I won't particularly care about whether or how the data has been modified or transformed before subjecting it to regression analysis;
- more predictors are probably better than fewer;
- I won't care as much about respecting the assumptions of linear regression;
- I'll probably choose `sklearn` if working in Python, since predictive statistics is at the heart of machine learning.
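The train/test idea from the list above might look like the following sketch. The data here are synthetic (generated with `make_regression`), and the 75/25 split and the `random_state` values are arbitrary choices rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset (an assumption for illustration)
X_syn, y_syn = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Did the patterns found in the training data recur in the test data?
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.2f}")
print(f"test MSE:  {test_mse:.2f}")
```

A test error that is much larger than the training error would suggest that the model picked up on patterns that did not recur.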

Of course, to the extent that we give up on actually trying to *understand* the phenomenon that we are modeling, to that extent we are happy to let our models be **black boxes**. As we move deeper into the course and our models get ever more sophisticated, they will also become ever more like black boxes, for better or for worse.

## Predictive Modeling Theory

![which model is better](img/which_model_is_better.png)

[Netflix example](https://towardsdatascience.com/cultural-overfitting-and-underfitting-or-why-the-netflix-culture-wont-work-in-your-company-af2a62e41288)

### What is a “Model”?

 - A “model” is a general specification of relationships among variables. 
     - e.g. a linear regression, such as: $Price = \beta_1 \cdot Time + \beta_0 (+ \epsilon)$

 - A “trained model” is a particular model that has been built using some training data.
     - If the model is **parametric** (like a linear regression), then it has parameters that have been calculated using the training data;
     - If the model is **non-parametric**, then it has (not parameters but) an algorithm that has been constructed using the training data.
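To make the parametric case concrete, here is a small sketch using invented `time` and `price` values (the true slope of 2 and intercept of 50 are assumptions built into the fake data). After training, the whole model is captured by two numbers.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up training data: price rises by about 2 per unit of time, starting near 50
time = np.arange(20, dtype=float)
price = 2.0 * time + 50.0 + rng.normal(scale=1.0, size=time.size)

# "Training" means estimating beta_1 (slope) and beta_0 (intercept) from the data.
# np.polyfit returns coefficients from the highest degree down.
beta_1, beta_0 = np.polyfit(time, price, deg=1)

# The trained model is now fully described by these two parameters
predicted_price_at_25 = beta_1 * 25 + beta_0

print(f"beta_1 (slope):     {beta_1:.3f}")
print(f"beta_0 (intercept): {beta_0:.3f}")
print(f"prediction at Time = 25: {predicted_price_at_25:.2f}")
```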

### What makes a model good?

- We don’t ultimately care about how well your model fits your data.

- What we really care about is how well your model describes the process that generated your data.

- Why? Because the data set you have is but one sample from a universe of possible data sets, and you want a model that would work for any data set from that universe.



## Return to Expected Value

- The expected value of a quantity is the probability-weighted average of that quantity across all of its possible values

![6 sided die](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/6sided_dice.jpg/600px-6sided_dice.jpg)

- For a 6-sided die, another way to think about the expected value is as the arithmetic mean of a very large number of independent rolls.

### The expected value of a 6-sided die is:

In [None]:
probs = 1/6
rolls = range(1,7)

expected_value = sum([probs * roll for roll in rolls])
expected_value
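The "very large number of rolls" view can be checked by simulation. This is a sketch only; the seed and the number of rolls are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulate 100,000 independent rolls of a fair six-sided die (high is exclusive)
simulated_rolls = rng.integers(low=1, high=7, size=100_000)

# The running mean should settle toward the expected value as n grows
for n in (10, 1_000, 100_000):
    print(f"mean of first {n:>7,} rolls: {simulated_rolls[:n].mean():.4f}")

simulated_mean = simulated_rolls.mean()
```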

## Defining model bias and variance

- Let's imagine we create a model that always predicts a roll of **3**.

    - **The *bias* is the difference between the average prediction of our model and the average roll of the die as we roll more and more times**.
        - What is the bias of a model that always predicts 3?
        <details>
    <summary> Answer below
    </summary>
    0.5 in magnitude: the model's average prediction (3) falls 0.5 below the die's expected value (3.5)
    </details>
    - **The *variance* is the average squared difference between each individual prediction and the average prediction of our model as we roll more and more times**.
        - What is the variance of that model?
        <details>
    <summary> Answer below
    </summary>
    0
    </details>
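Both answers can be checked numerically. The sketch below simulates rolls and a model that always predicts 3. Bias is computed here as the signed difference (average prediction minus average roll), which is one common convention, so it comes out negative.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Simulated die rolls and a "model" that always predicts 3
die_rolls = rng.integers(low=1, high=7, size=100_000)
predictions = np.full(die_rolls.shape, 3)

# Bias: average prediction minus average roll (signed, so negative here)
bias = predictions.mean() - die_rolls.mean()

# Variance: average squared distance of each prediction from the average prediction
variance = np.mean((predictions - predictions.mean()) ** 2)

print(f"bias:     {bias:.3f}")
print(f"variance: {variance:.3f}")
```

The constant model never changes its mind, so its variance is exactly zero; all of its error comes from bias and from the randomness of the die itself.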

## Defining Error: prediction error and irreducible error



### Regression fit statistics are often called “error”
 - Sum of Squared Errors (SSE)

 $\displaystyle \operatorname{SSE} = \sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}$

 - Mean Squared Error (MSE)

 $\displaystyle \operatorname{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}$

 - Root Mean Squared Error (RMSE)

 $\displaystyle \operatorname{RMSE} = \sqrt{\operatorname{MSE}}$

 All are calculated using residuals.

![residuals](img/residuals.png)


### Exercise

 - Fit a quick and dirty linear regression model
 - Store predictions in the `y_hat` variable using `predict()` from the fitted model
 - Hand-code SSE
 - Divide by the length of the array to find the Mean Squared Error
 - Take the square root to find RMSE, and check that it matches the value computed from sklearn's `mean_squared_error` function

In [None]:
df = pd.read_csv('data/king_county.csv', index_col='id')
df = df.iloc[:, :12]

In [None]:
df.head()

In [None]:
X = df.drop('price', axis=1)
y = df.price

# Build the regression
lr = LinearRegression()
lr.fit(X, y)

In [None]:
# Calculate error
y_hat = None
sse = None
mse = None
rmse = None

# Compare with sklearn
print(rmse)
print(np.sqrt(mean_squared_error(y, y_hat)))
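For reference, here is the same sequence of steps carried out on synthetic data rather than the King County data. The `_demo` names and the `make_regression` settings are assumptions for illustration; the exercise above still needs to be completed on `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (not the King County dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=15.0, random_state=0)

# Fit and predict on the same data, as in the exercise
demo_lr = LinearRegression().fit(X_demo, y_demo)
y_hat_demo = demo_lr.predict(X_demo)

# Hand-coded error statistics, built from the residuals
residuals = y_demo - y_hat_demo
sse_demo = np.sum(residuals ** 2)
mse_demo = sse_demo / len(y_demo)
rmse_demo = np.sqrt(mse_demo)

# Compare with sklearn
print(rmse_demo)
print(np.sqrt(mean_squared_error(y_demo, y_hat_demo)))
```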

## Defining prediction error as a combination of bias and variance

$\Large Total\ Error\ = Prediction\ Error+ Irreducible\ Error$

Our prediction error can be further broken down into error due to bias and error due to variance.

$\Large Total\ Error = Model\ Bias^2 + Model\ Variance + Irreducible\ Error$

**Model Bias** is the expected prediction error of the expected trained model.

> In other words, if you were to train multiple models on different samples, what would be the average difference between the prediction and the real value?

**Model Variance** is the expected variation in predictions, relative to your expected trained model.

> In other words, what would be the average difference between any one model's prediction and the average of all the predictions?
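Stated a little more formally, and under the common assumption that each observation is $y = f(x) + \epsilon$ with noise of mean zero and variance $\sigma^2$ (notation introduced here only for this illustration), the expected squared error of a trained model $\hat{f}$ at a point $x$ splits into three terms. The expectations are taken over possible training samples and over the noise in a new observation.

$$
E\left[\left(y - \hat{f}(x)\right)^2\right]
= \underbrace{\left(E[\hat{f}(x)] - f(x)\right)^2}_{\text{Model Bias}^2}
+ \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{\text{Model Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible Error}}
$$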

**Bias vs. variance refers ultimately to the *accuracy* vs. *consistency* of the models trained by your algorithm.**

![target_bias_variance](img/target.png)

http://scott.fortmann-roe.com/docs/BiasVariance.html
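The "train multiple models on different samples" thought experiment can be run as a simulation. In this sketch the true process (a sine curve), the noise level, the test point, and the two polynomial degrees are all assumptions chosen for illustration. It estimates bias squared and variance of the prediction at one point; the irreducible error is left out because predictions are compared with the noise-free true value.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def true_f(x):
    """The (normally unknown) process that generates the data."""
    return np.sin(x)

x_test = 1.0      # the single point at which predictions are compared
noise_sd = 0.3    # standard deviation of the irreducible noise
n_models = 500    # number of training samples, and so of trained models

def predictions_at_test_point(degree):
    """Fit one polynomial per fresh training sample; return each prediction at x_test."""
    preds = np.empty(n_models)
    for i in range(n_models):
        x_train = rng.uniform(0, 2 * np.pi, size=30)
        y_train = true_f(x_train) + rng.normal(scale=noise_sd, size=30)
        coefs = np.polyfit(x_train, y_train, deg=degree)
        preds[i] = np.polyval(coefs, x_test)
    return preds

results = {}
for degree in (1, 5):
    preds = predictions_at_test_point(degree)
    bias_sq = (preds.mean() - true_f(x_test)) ** 2     # accuracy of the average model
    variance = preds.var()                             # consistency across models
    mse_vs_truth = np.mean((preds - true_f(x_test)) ** 2)
    results[degree] = (bias_sq, variance, mse_vs_truth)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, "
          f"variance = {variance:.4f}, sum = {bias_sq + variance:.4f}, "
          f"MSE vs. truth = {mse_vs_truth:.4f}")
```

For each degree, the squared bias and the variance add up to the mean squared distance between the predictions and the true value. A straight line cannot follow a sine curve, so its squared bias should come out noticeably larger than that of the degree-5 fit.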

## Coming up next

It goes without saying that we would generally like our models to have both low bias and low variance. What is less obvious is that, unfortunately, as one goes down, the other tends to go up. Moreover, we shall often be able to tweak model **hyperparameters** in order to decrease the bias (even if that also means increasing the variance) or to decrease the variance (even if that also means increasing the bias). And so we shall soon come to appreciate the ***bias-variance tradeoff*** as it applies to machine learning models.