
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Inferential Statistics Vs. Predictive Statistics

## Learning Objectives

- Describe the hallmarks of inferential statistics and contrast them with the hallmarks of predictive statistics
- Relate the goals of model-building to expected value, bias and variance
- Define error as a function of prediction error and irreducible error
- Define prediction error as a combination of bias and variance

# Inferential Statistics in a Nutshell

In Phase 1 we looked at **descriptive** statistics: starting with a dataset and making various observations (overall shape, histogram, outliers, etc.) as well as calculations of quantities that can characterize the dataset as a whole (mean, median, mode, variance, standard deviation, quartiles, percentiles, etc.).

At the beginning of Phase 2 we moved into **inferential** statistics. The main idea here is to imagine that **we don't have** (or anyway cannot *measure*) all the data of interest.

And this is, of course, the typical situation. Consider:

- A zoologist wanting to know the typical lifespan of a Siberian tiger
- A cosmologist wanting to know the mass of a normal white dwarf star
- A businesswoman wanting to know how many M&M's her customers should expect to find in their Party Size bags
- A botanist wanting to know how tall California redwoods usually grow

![](images/tiger.jpg)


The zoologist could, in principle:

1. keep track of every currently existing Siberian tiger
2. record their (more or less) exact ages at their moments of death
3. add up those ages and divide by the number of tigers to calculate an average lifespan

But **only** in principle. In all of these situations, there is no realistic or practical opportunity to check each relevant data point.

![](images/sampling.png)

What we can do, however, is to check *some* of the data points we want to check. That is, we'll draw a *sample* of data from our *population* of interest. We can then use the techniques of descriptive statistics to characterize our sample.

The hope, then, is that
- our sample will be *representative* of the population as a whole,
- which would justify our using facts about the sample to ***infer*** things about the population.
- Naturally, we'll expect a certain amount of **error** in those inferences.

Inferential statistics makes all this precise. And that has been the bulk of the content of Phase 2.
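
As a minimal sketch of the sample-to-population idea, the snippet below simulates a made-up population (the "lifespans" are invented numbers, not real tiger data), draws a random sample, and uses the sample mean together with its standard error to estimate the population mean. The population size, sample size, and distribution are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: 100,000 simulated "lifespans" in years (invented numbers)
population = rng.normal(loc=15, scale=3, size=100_000)

# In practice we could never measure all of these, so we draw a sample
sample = rng.choice(population, size=200, replace=False)

sample_mean = sample.mean()
# Standard error of the mean: sample standard deviation / sqrt(sample size)
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

print(f"population mean: {population.mean():.2f}")
print(f"sample mean:     {sample_mean:.2f} +/- {standard_error:.2f}")
```

The sample mean will rarely match the population mean exactly; the standard error gives a rough sense of how far off we should expect it to be.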

Classically speaking, inference is a form of learning or of *increasing our knowledge*. So when conducting exercises in inferential statistics, the goal is ultimately **understanding**. If I am conducting a linear regression in an inferential mode, then:

- I will be very interested in the values of the coefficients, since these represent the effect of the associated factors on the target in question
- the more data I use to build the regression the better
- the fewer transformations of my data the better, since lots of transformations will impede transparency and comprehensibility
- fewer predictors may be better than more
- I will be very interested in respecting the assumptions of linear regression
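
To make the first point concrete, here is a small sketch on synthetic data, where (unlike in real life) the true coefficients are known. It assumes `make_regression` with `coef=True`, which returns the coefficients used to generate the target, so the fitted values can be compared against them. The sample size, noise level, and number of features are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic data whose true coefficients are known (returned because coef=True)
X, y, true_coefs = make_regression(
    n_samples=500, n_features=3, n_informative=3,
    noise=10.0, coef=True, random_state=0,
)

lr = LinearRegression().fit(X, y)

# In an inferential mode, these estimates are the main object of interest
for i, (est, true) in enumerate(zip(lr.coef_, true_coefs)):
    print(f"feature {i}: estimated {est:8.2f}   true {true:8.2f}")
```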


# Predictive Statistics in a Nutshell

The focus for predictive statistics is a bit different.

First, the focus is less on understanding and more (of course!) on making good *predictions* of future cases.

That means that I want the patterns I pick up on (in some dataset) to be patterns that will *recur* (in a similar dataset) in the future.

![](images/crystall_ball.png)

<a href="https://commons.wikimedia.org/wiki/File:743-crystal-ball-1.svg">Vincent Le Moign</a>, <a href="https://creativecommons.org/licenses/by/4.0">CC BY 4.0</a>, via Wikimedia Commons

## In Predictive Statistics
- I won't particularly care about the values of the coefficients
- I may want to have two different datasets: 
  - one to train the model
  - one to test the model 
- I won't particularly care about whether or how the data has been modified or transformed before subjecting it to regression analysis
- more predictors are probably better than fewer
- I won't care as much about respecting the assumptions of linear regression
- I'll probably choose `sklearn` if working in Python, since predictive statistics is at the heart of machine learning

Of course, to the extent that we give up on actually trying to *understand* the phenomenon that we are modeling, to that extent we are happy to let our models be **black boxes**.
- As we move deeper into the course and our models get ever more sophisticated, they will also become ever more like black boxes, for better or for worse.
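
The train/test idea from the list above can be sketched as follows. This is an illustration on synthetic data from `make_regression`, not the workflow used later in the course; the split proportion and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (stand-in for a real dataset)
X, y = make_regression(n_samples=1000, n_features=10, noise=20.0, random_state=42)

# Hold out a quarter of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Error on the held-out rows estimates how the model will do on future cases
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.1f}")
print(f"test MSE:  {test_mse:.1f}")
```

In a predictive mode, the test error is the number to watch, since it is measured on data that played no part in fitting the coefficients.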

**LESS MANUAL CODING**

# Predictive Modeling Theory

![which model is better](images/which_model_is_better.png)

## What Is a “Model”?

In [33]:
import random
random.seed(42)  # seed() must be called; assigning `random.seed = 42` replaces the function and seeds nothing
random.randint(1,1000)

655

- A “model” is a general specification of relationships among variables. 
    + e.g. a linear regression, such as: $ Price = \beta_1*Time +  \beta_0 (+ \epsilon)$
- A “trained model” is a particular model that has been built using some training data.
    + If the model is **parametric** (like a linear regression), then it has parameters that have been calculated using the training data;
    + If the model is **non-parametric**, then it has (not parameters but) an algorithm that has been constructed using the training data.
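
A rough sketch of the distinction, using made-up `time` and `price` values (the slope 3.0 and intercept 5.0 are invented for the example). `LinearRegression` stands in for a parametric model and `KNeighborsRegressor` for a non-parametric one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
time = rng.uniform(0, 10, size=50).reshape(-1, 1)              # hypothetical "Time" predictor
price = 3.0 * time.ravel() + 5.0 + rng.normal(0, 1, size=50)   # made-up "Price" values

# The "model": a specification with no learned values yet
linear = LinearRegression()
print(hasattr(linear, "coef_"))  # False before training

# The "trained model": parameters calculated from the training data
linear.fit(time, price)
print(linear.coef_[0], linear.intercept_)  # estimates of beta_1 and beta_0

# A non-parametric model has no fixed set of coefficients; it predicts
# from a structure built out of the training data itself
knn = KNeighborsRegressor(n_neighbors=5).fit(time, price)
print(knn.predict([[4.0]]))
```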

## What Makes a Model Good?

- We don’t ultimately care about how well your model fits your data.
- What we really care about is how well your model describes the process that generated your data.
- Why? Because the data set you have is but one sample from a universe of possible data sets, and you want a model that would work for any data set from that universe.

## Return to Expected Value

- The expected value of a quantity is the weighted average of that quantity across all possible outcomes, where each outcome is weighted by its probability

![6 sided die](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/6sided_dice.jpg/600px-6sided_dice.jpg)

- For a 6-sided die, another way to think about the expected value is as the arithmetic mean of a very large number of independent rolls.

### The expected value of a 6-sided die is:

In [1]:
probs = 1/6
rolls = range(1, 7)

expected_value = sum([probs * roll for roll in rolls])
expected_value

3.5
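
The "very large number of independent rolls" view can be checked with a quick simulation. This sketch assumes a fair die simulated with NumPy; the roll counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# The mean of the rolls should drift toward 3.5 as the number of rolls grows
for n_rolls in (10, 1_000, 100_000):
    rolls = rng.integers(1, 7, size=n_rolls)   # upper bound is exclusive, so faces are 1..6
    print(n_rolls, rolls.mean())
```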

#### Lets Model the Die Roll

Expected Value = 3.5<br>
Guess Value    = a random guess between 1 and 6<br>

Roll = Die Roll Value

Compute the absolute differences between the die roll and
   - the Expected Value
   - the Guess Value

Over a large sample size, which will be closer on average?


In [66]:
import random

expd = 3.5        # expected value of a fair 6-sided die

sumg = 0          # running sum of |guess - roll|
sume = 0          # running sum of |expected value - roll|
nrolls = 0
toroll = 100000
dice = {}         # tally of how often each face comes up

for xx in range(toroll):
    roll = random.randint(1, 6)
    guess = random.randint(1, 6)
    sumg += abs(guess - roll)
    sume += abs(expd - roll)
    dice[roll] = dice.get(roll, 0) + 1
    nrolls += 1

print("Expected Diffs : ", sume/nrolls)
print("Guess   Diffs : ", sumg/nrolls)
for dd in sorted(dice.keys()):
    print(dd, dice[dd])

Expected Diffs :  1.5026
Guess   Diffs :  1.9513
1 16644
2 16653
3 16620
4 16570
5 16707
6 16806


## Defining model bias and variance

- Let's imagine we create a model that always predicts a roll of **3**.

- **The *bias* is the difference between the average prediction of our model and the average roll of the die as we roll more and more times**.
    - What is the bias of a model that always predicts 3?
    
- **The *variance* is the average squared difference between each individual prediction and the average prediction of our model as we roll more and more times**.
    - What is the variance of that model?
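
One way to check your answers is to simulate it. The sketch below assumes a fair die and measures variance as the average squared deviation of the predictions from their own mean; the number of rolls is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
rolls = rng.integers(1, 7, size=100_000)   # simulated fair die rolls, faces 1..6

predictions = np.full(rolls.shape, 3.0)    # the model always predicts 3

# Bias: average prediction minus average roll
bias = predictions.mean() - rolls.mean()
# Variance: spread of the predictions around their own average
variance = np.mean((predictions - predictions.mean()) ** 2)

print("bias:    ", bias)
print("variance:", variance)
```

Because the model never changes its prediction, all of its error comes from bias (about half a pip below the long-run average roll) and none from variance.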
    

## Defining Error: prediction error and irreducible error

### Regression fit statistics are often called “error”

- Sum of Squared Errors (SSE)

 $ {\displaystyle \operatorname {SSE} =\sum _{i=1}^{n}(Y_{i}-{\hat {Y_{i}}})^{2}.} $
 
 
- Mean Squared Error (MSE) 
 
 $ {\displaystyle \operatorname {MSE} ={\frac {1}{n}}\sum _{i=1}^{n}(Y_{i}-{\hat {Y_{i}}})^{2}.} $
 
 
- Root Mean Squared Error (RMSE)  

 $ {\displaystyle \operatorname {RMSE} ={\sqrt {\operatorname {MSE} }}.} $

 All are calculated using residuals   $ (Y_{i}-{\hat {Y_{i}}}) $
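
A tiny worked example of the three formulas, using four made-up observations and predictions so the arithmetic can be followed by hand:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # made-up observed values
y_hat = np.array([2.0, 5.0, 8.0, 11.0])   # made-up predictions

residuals = y - y_hat          # [1, 0, -1, -2]
sse = np.sum(residuals ** 2)   # 1 + 0 + 1 + 4 = 6
mse = sse / len(y)             # 6 / 4 = 1.5
rmse = np.sqrt(mse)            # sqrt(1.5), about 1.22

print(sse, mse, rmse)
```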

![residuals](images/residuals.png)

### Exercise

 - Fit a quick and dirty linear regression model
 - Store predictions in the y_hat variable using predict() from the fit model
 - Handcode SSE
 - Divide by the length of array to find Mean Squared Error
 - Check that your MSE equals sklearn's mean_squared_error function 

In [37]:
df = pd.read_csv('data/king_county.csv', index_col='id')
df = df.iloc[:, :12]

In [59]:
df.describe()

| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 |
| mean | 540088.1 | 3.370842 | 2.114757 | 2079.899736 | 15106.97 | 1.494309 | 0.007542 | 0.234303 | 3.40943 | 7.656873 | 1788.390691 | 291.509045 |
| std | 367127.2 | 0.930062 | 0.770163 | 918.440897 | 41420.51 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 |
| min | 75000.0 | 0.0 | 0.0 | 290.0 | 520.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 290.0 | 0.0 |
| 25% | 321950.0 | 3.0 | 1.75 | 1427.0 | 5040.0 | 1.0 | 0.0 | 0.0 | 3.0 | 7.0 | 1190.0 | 0.0 |
| 50% | 450000.0 | 3.0 | 2.25 | 1910.0 | 7618.0 | 1.5 | 0.0 | 0.0 | 3.0 | 7.0 | 1560.0 | 0.0 |
| 75% | 645000.0 | 4.0 | 2.5 | 2550.0 | 10688.0 | 2.0 | 0.0 | 0.0 | 4.0 | 8.0 | 2210.0 | 560.0 |
| max | 7700000.0 | 33.0 | 8.0 | 13540.0 | 1651359.0 | 3.5 | 1.0 | 4.0 | 5.0 | 13.0 | 9410.0 | 4820.0 |


In [55]:
X = df.drop('price', axis=1)
y = df.price
n = len(y)
# Build the regression
lr = LinearRegression()
model = lr.fit(X, y)


In [62]:
# Calculate error
y_hat = model.predict(X)

# Hand-coded versions:
# sse = sum((y - y_hat)**2)
# mse = sum((y - y_hat)**2)/n
# rmse = (sum((y - y_hat)**2)/n)**0.5

# The `squared` argument of mean_squared_error is deprecated (and later removed)
# in recent scikit-learn releases, so derive SSE and RMSE from the MSE instead
mse = mean_squared_error(y, y_hat)
sse = mse * n
rmse = np.sqrt(mse)


sse, mse, rmse

(1149174268868301.2, 53170511676.69001, 230587.31898499973)

In [63]:
mean_squared_error(y,y_hat)

53170511676.69001

## Defining prediction error as a combination of bias and variance

$\Large Total\ Error\ = Prediction\ Error+ Irreducible\ Error$

Our prediction error can be further broken down into error due to bias and error due to variance.

$\Large Total\ Error = Model\ Bias^2 + Model\ Variance + Irreducible\ Error$

**Model Bias** is the expected difference between your model's predictions and the real values.

> In other words, if you were to train multiple models on different samples, what would be the average difference between the prediction and the real value?

**Model Variance** is the expected variation in predictions around the average prediction, across models trained on different samples.

> In other words, what would be the average squared difference between any one model's prediction and the average of all the predictions?
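
The decomposition can be checked numerically. The sketch below is a simulation under stated assumptions: the data-generating process (`true_f`, a quadratic), the noise level, and the evaluation point `x0` are all invented for illustration. It trains many straight-line models on different samples, then compares bias² + variance + irreducible noise against the squared error actually observed on fresh data at `x0`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def true_f(x):
    return x ** 2            # hypothetical data-generating process

noise_sd = 1.0               # standard deviation of the irreducible noise (assumed)
x0 = 3.0                     # the point at which we examine predictions
n_models, n_train = 2_000, 30

preds = np.empty(n_models)
sq_errors = np.empty(n_models)
for i in range(n_models):
    # A fresh training sample from the same process each time
    x = rng.uniform(0, 3, size=n_train)
    y = true_f(x) + rng.normal(0, noise_sd, size=n_train)
    slope, intercept = np.polyfit(x, y, deg=1)     # straight line: too simple on purpose
    preds[i] = slope * x0 + intercept
    y0 = true_f(x0) + rng.normal(0, noise_sd)      # a new observation at x0
    sq_errors[i] = (y0 - preds[i]) ** 2

bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared gap between average prediction and truth
variance = preds.var()                       # spread of predictions across trained models
total = sq_errors.mean()                     # average squared error on new observations

print(f"bias^2 + variance + noise = {bias_sq + variance + noise_sd**2:.3f}")
print(f"simulated total error     = {total:.3f}")
```

The two printed numbers should agree up to simulation noise. Here the straight line cannot follow the curve, so most of the reducible error shows up as bias rather than variance.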

**Bias vs. variance refers ultimately to the *accuracy* vs. *consistency* of the models trained by your algorithm.**

![target_bias_variance](images/target.png)

http://scott.fortmann-roe.com/docs/BiasVariance.html

![target_bias_variance](images/Bias-vs-Variance-v4.png)

https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

# Coming up next

It goes without saying that we would generally like our models to have both low bias and low variance. But what is not so obvious is that, unfortunately, as one tends to go down, the other tends to go up. Moreover, we shall often be able to tweak model **hyperparameters** with the purpose of decreasing the bias (even if that also means increasing the variance) or of decreasing the variance (even if that also means increasing the bias). And so we shall soon come to appreciate the ***bias-variance tradeoff*** as it applies to machine learning models.
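
As a small preview, the sketch below treats polynomial degree as a stand-in hyperparameter. Everything here is an assumption made for illustration: the data-generating function `true_f`, the noise level, the sample sizes, and the degrees tried. A low degree tends toward high bias, while a high degree tends toward high variance.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def true_f(x):
    return np.sin(3 * x)     # hypothetical data-generating process

# One small training set and one large test set from the same noisy process
x_train = rng.uniform(-1, 1, size=30)
y_train = true_f(x_train) + rng.normal(0, 0.3, size=30)
x_test = rng.uniform(-1, 1, size=1_000)
y_test = true_f(x_test) + rng.normal(0, 0.3, size=1_000)

results = {}
for degree in (1, 3, 9):     # polynomial degree plays the role of the hyperparameter
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error can only go down as the degree increases, since each polynomial contains the simpler ones as special cases. Test error is the quantity that reveals whether the extra flexibility helped or hurt.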