<h2 align="center">Machine Learning</h2> 
<h3 align="center">Travis Millburn<br>Fall 2020</h3> 

<center>
<img src="../images/logo.png" alt="drawing" style="width: 300px;"/>
</center>

<h3 align="center">Class 9: Bootstrapping + Optimization + Model Bias</h3> 


### Outline

1. Quick review of variance vs std-deviation

2. Bootstrapping

3. A Bit about Bias

4. Confusion Matrix

5. Lab

6. Time for projects / Discussion


### Variance Review

<img src="../images/population_and_sample_Nn.png" alt="drawing"  width="30%"  align="right"/>
    
#### Population variance: $\sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N}$ 

   ...As in _the whole population_

#### Sample variance: $ s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}$ 

   ...As in _just the sample_
    
Note that sneaky $n-1$ denominator. 

#### Standard deviation = $\sqrt{\text{Variance}}$ 

Variance is a kind of average too.
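As a quick check of the two formulas above, here is a minimal sketch using Python's standard `statistics` module. The data values are made up for illustration; the point is only to see the $N$ versus $n-1$ denominators side by side.

```python
import math
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
n = len(x)
mean = sum(x) / n
sum_sq = sum((xi - mean) ** 2 for xi in x)  # sum of squared deviations

pop_var = sum_sq / n         # treat x as the whole population: divide by N
samp_var = sum_sq / (n - 1)  # treat x as a sample: divide by n - 1
samp_std = math.sqrt(samp_var)

# The standard library implements the same two definitions.
print(pop_var, statistics.pvariance(x))
print(samp_var, statistics.variance(x))
print(samp_std, statistics.stdev(x))
```

The sample variance is always a bit larger than the population formula applied to the same numbers, and the gap shrinks as $n$ grows.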

#### Bagging
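Bagging, or bootstrap aggregation, is a model aggregation technique for reducing model variance. Multiple samples, called bootstrap samples, are drawn from the data with replacement(!), and one model is fit per sample. A bootstrap sample is often the same size as the original data or somewhat smaller (for example, 3/4 of it); because values are drawn with replacement, some are repeated and others are left out of each model run. The models' predictions are then combined, e.g. by majority vote.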

<center>
<img src="../images/bagging.jpg" alt="drawing" style="width: 1000px;"/>
</center>
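To make "with replacement" concrete, here is a small sketch that draws one bootstrap sample using only the standard library. The dataset is just a list of row indices, and the seed and size are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)       # fixed seed, chosen arbitrarily
data = list(range(10_000))   # stand-in for the row indices of a dataset
n = len(data)

# One bootstrap sample: n draws WITH replacement from the data.
sample = rng.choices(data, k=n)

unique_fraction = len(set(sample)) / n  # share of rows that appear at least once
out_of_bag = set(data) - set(sample)    # rows never drawn for this sample

print(f"sample size:        {len(sample)}")
print(f"unique fraction:    {unique_fraction:.3f}")
print(f"rows left out:      {len(out_of_bag)}")
```

When the sample is the same size as the data, the unique fraction settles near $1 - 1/e \approx 0.63$ for large $n$; the remaining rows are repeats.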

### Discussion: How is this different from K-Fold?

#### We will come back to this for today's "Micro-Lab."

### Optimization review

Find "input" that minimizes or maximizes some function

In machine learning, what functions do we want to minimize and why? For regression?

* Calculus approach

* Via numerical package ~ black box

  - Tricks and equivalent problems
  - logarithms
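The two approaches above can be compared on a toy regression problem. This is a minimal sketch, assuming mean squared error as the function to minimize (one common choice for regression) and made-up data; the learning rate and step count are illustrative, not tuned.

```python
# Made-up data, roughly y = 2x + 1 plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
m = len(xs)

def mse(a, b):
    """Mean squared error of the line y = a*x + b on the data."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / m

# Calculus approach: set the derivatives of the MSE to zero and solve.
x_bar = sum(xs) / m
y_bar = sum(ys) / m
a_exact = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
           / sum((x - x_bar) ** 2 for x in xs))
b_exact = y_bar - a_exact * x_bar

# Numerical approach: gradient descent, which only needs the gradient.
a, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(5000):
    grad_a = 2 * sum((a * x + b - y) * x for x, y in zip(xs, ys)) / m
    grad_b = 2 * sum((a * x + b - y) for x, y in zip(xs, ys)) / m
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print(f"closed form:      a={a_exact:.4f}, b={b_exact:.4f}")
print(f"gradient descent: a={a:.4f}, b={b:.4f}")
```

Both routes land on the same line here. A numerical package does something like the loop for you, usually with a smarter update rule.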

### Formal Machine Learning Framework & Jargon 1/3

1. Given training data $(\mathbf x_{(i)},y_i)$ for $i=1,...,m$.

2. Choose a model $f(\cdot)$ where $f(\mathbf x)\approx y$

3. Define a loss function $L(f(\mathbf x), y)$ to minimize.


### Formal Machine Learning Framework & Jargon 2/3

Have data $(\mathbf x_{(i)},y_i)$, model $f(\cdot)$, and loss function $L(f(\mathbf x), y)$.

Want to minimize _true loss_ which is the expected value of the loss.

We approximate it by minimizing the _Empirical loss_  (a.k.a "risk") 
$$
L_{emp}(f) = \frac{1}{m}\sum_{i=1}^m L\big(f(\mathbf x_{(i)}), y_i\big)
$$
This is empirical risk minimization.
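The empirical loss formula translates almost directly into code. The sketch below assumes squared error as the loss $L$ and uses a hypothetical model `f` and made-up data; none of these come from the lecture's datasets.

```python
def squared_loss(prediction, target):
    """L(f(x), y) for a single example; squared error is assumed here."""
    return (prediction - target) ** 2

def empirical_loss(f, xs, ys, loss=squared_loss):
    """Average loss of model f over the m training pairs."""
    m = len(xs)
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / m

def f(x):
    """Hypothetical model: predicts y = 2x."""
    return 2.0 * x

xs = [1.0, 2.0, 3.0]  # made-up inputs
ys = [2.0, 4.0, 7.0]  # made-up targets; only the last one is missed

risk = empirical_loss(f, xs, ys)
print(risk)
```

Swapping in a different `loss` function changes what "good fit" means without touching the rest of the framework.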

### Formal Machine Learning Framework & Jargon 3/3

Empirical risk minimization:
$$
f^* = \arg \min\limits_f L_{emp}(f) = \arg \min\limits_f\frac{1}{m}\sum_{i=1}^m L\big(f(\mathbf x_{(i)}), y_i\big)
$$
To trade off model risk and simplicity, we include a regularizer:
$$
f^* = \arg \min\limits_f \big( L_{emp}(f) + \lambda R(f) \big) = \arg \min\limits_f  L_{reg}(f)
$$

Just cryptic talk for the same optimization problem we've been doing. But can be extended to cover many different techniques.
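As one concrete instance of the regularized problem, here is a sketch for a one-parameter model $f(x) = wx$ with squared loss and $R(f) = w^2$ (a ridge-style penalty). These choices are assumptions made for illustration. Setting the derivative of $L_{reg}$ to zero gives $w = \sum x_i y_i / (\sum x_i^2 + m\lambda)$.

```python
def regularized_loss(w, xs, ys, lam):
    """L_emp(f) + lam * R(f) for f(x) = w*x, squared loss, R(f) = w**2."""
    m = len(xs)
    empirical = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / m
    return empirical + lam * w ** 2

def fit_slope(xs, ys, lam):
    """Minimizer of regularized_loss, from setting its derivative to zero."""
    m = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + m * lam)

xs = [1.0, 2.0, 3.0, 4.0]  # made-up data, roughly y = 2x
ys = [2.1, 3.9, 6.2, 7.8]

lambdas = [0.0, 0.1, 1.0, 10.0]
slopes = [fit_slope(xs, ys, lam) for lam in lambdas]
for lam, w in zip(lambdas, slopes):
    print(f"lambda={lam:5.1f}  w={w:.3f}")
```

With $\lambda = 0$ this is plain empirical risk minimization; as $\lambda$ grows, the fitted slope shrinks toward zero, which is the "simplicity" side of the trade-off.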



### Summary


Have data $(\mathbf x_{(i)},y_i)$, model $f(\cdot)$, and loss function $L(f(\mathbf x), y)$.

Regularized empirical risk minimization:
$$
f^* = \arg \min\limits_f \big( L_{emp}(f) + \lambda R(f) \big) = \arg \min\limits_f  L_{reg}(f)
$$

Can be extended to cover classification techniques. 

### Bias in 100 words or less

Bias is when your assumptions about the true predictor affect the estimated predictor. 

Example: you assume it is linear, use a linear model, get a linear result. If the true function is nonlinear, this linear model will have an error due to the bias.



<center>
<img src="../images/Under-fitting.png" alt="drawing" style="width: 400px;"/>
</center>


Formal definition: $$\text{Bias} = E(y - f(x))$$

### Variance even simpler

Variance is when your model changes when it is fit using a different sample from the same population.

I.e. it fits the noise $\varepsilon$ in addition to the true function

Out of fear of bias, we use a nonlinear model with way more parameters to fit. It fits the (noisy) training data too well, and is worse than necessary on test data with different noise.

<center>
<img src="../images/Overfitted_Data.png" alt="drawing" style="width: 400px;"/>
</center>

Formal definition: $$\text{Variance} = Var(f(x))$$
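Both definitions can be estimated by simulation: fit the same model many times on fresh noisy samples, then look at how far the average prediction is from the truth (bias) and how much the predictions move around (variance). This is a sketch under stated assumptions: the "true function" is taken to be $\sin(2\pi x)$, the noise level is 0.3, and only the noise is redrawn between samples. All of these are illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)        # arbitrary seed
x = np.linspace(0.0, 1.0, 20)         # fixed inputs
true_f = np.sin(2 * np.pi * x)        # assumed true function
sigma = 0.3                           # assumed noise standard deviation

def bias2_and_variance(degree, n_trials=500):
    """Fit a polynomial of the given degree to many noisy samples."""
    preds = np.empty((n_trials, x.size))
    for t in range(n_trials):
        y = true_f + rng.normal(0.0, sigma, size=x.size)  # new noise each time
        preds[t] = Polynomial.fit(x, y, degree)(x)
    bias = true_f - preds.mean(axis=0)   # average miss at each x
    variance = preds.var(axis=0)         # spread of the fits at each x
    return float(np.mean(bias ** 2)), float(np.mean(variance))

bias2_lin, var_lin = bias2_and_variance(degree=1)    # simple, rigid model
bias2_poly, var_poly = bias2_and_variance(degree=9)  # flexible model

print(f"degree 1: bias^2={bias2_lin:.4f}  variance={var_lin:.4f}")
print(f"degree 9: bias^2={bias2_poly:.4f}  variance={var_poly:.4f}")
```

The linear model should show the larger squared bias and the degree-9 model the larger variance, which is the trade-off in miniature.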

### Bias-Variance Trade-off


Life is hard.


<center>
<img src="../images/bias_var_tradeoff.JPG" alt="drawing" style="width: 700px;"/>
</center>


But then this is why Machine Learning experts make the big bucks.

Regularization as restriction on model complexity.

### Example  

Let's start by looking at an example. We're going to be using some NFL data. The x axis is the number of touchdowns scored by a team over a season, and the y axis is whether they lost or won the game, indicated by a value of 0 or 1 respectively.

<center><img src="../images/nfl.png" alt="drawing" style="width: 500px;"/></center>

So, how do we predict whether we have a win or a loss if we are given a score? Note that we are going to be predicting values between 0 and 1. Close to 0 means we're sure it's in class 0, close to 1 means we're sure it's in class 1, and closer to 0.5 means we don't know.
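One common way to get predictions between 0 and 1 is to pass a linear score through the logistic (sigmoid) function. The sketch below is only an illustration of that idea: the coefficients `w` and `b` are hypothetical and were not fitted to the NFL data.

```python
import math

def win_probability(touchdowns, w=0.2, b=-7.0):
    """Logistic curve; w and b are hypothetical, not fitted values."""
    return 1.0 / (1.0 + math.exp(-(w * touchdowns + b)))

for td in (20, 35, 50):
    p = win_probability(td)
    print(f"{td} touchdowns -> predicted probability of class 1: {p:.3f}")
```

With these made-up coefficients the curve crosses 0.5 at 35 touchdowns: below that the model leans toward class 0, above it toward class 1.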

### Measuring success 

So how do we measure how well our model does? 

1. How do we test _generalizability_?

2. What specific metrics might we compute?


### Accuracy (exam review)
The simplest measure is **accuracy**. This is the number of correct predictions over the total number of predictions. It's the percent you predicted correctly. In `sklearn`, this is what the `score` method calculates.

#### Shortcomings

Accuracy is a good first glance measure, but it has shortcomings. 

If the classes are unbalanced, accuracy will not measure how well you did at predicting. Say you are trying to predict whether or not an email is spam. Only 2% of emails are in fact spam emails. You could get 98% accuracy by always predicting not spam. This is a great accuracy but a horrible model!
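The spam example is easy to reproduce. This sketch builds 1000 made-up labels with 2% spam and scores a "model" that always predicts not spam.

```python
# 1 = spam, 0 = not spam; 2% of 1000 emails are spam.
y_true = [1] * 20 + [0] * 980
y_pred = [0] * len(y_true)  # always predict "not spam"

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the actual spam emails did we flag?
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
spam_caught_fraction = spam_caught / sum(y_true)

print(f"accuracy: {accuracy:.2f}")
print(f"fraction of spam caught: {spam_caught_fraction:.2f}")
```

The accuracy looks excellent while the model catches no spam at all.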

What additional measurements might we do to check for such failures?

### Confusion Matrix

We can get a better picture of our model by looking at the confusion matrix. It is built from the following four counts:

* **True Positives (TP)**: Correct positive predictions
* **False Positives (FP)**: Incorrect positive predictions (false alarm)
* **True Negatives (TN)**: Correct negative predictions
* **False Negatives (FN)**: Incorrect negative predictions (a miss)

<center><img src="../images/logistic.png" alt="drawing" style="width: 500px;"/></center>

Note what happens as you move the decision threshold.
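To see the effect of the threshold, the sketch below counts the four cells for a handful of made-up predicted probabilities at three different thresholds. A prediction is treated as positive when its probability is at or above the threshold.

```python
def confusion_counts(y_true, probs, threshold):
    """Return (TP, FP, TN, FN) when predicting 1 for prob >= threshold."""
    tp = fp = tn = fn = 0
    for truth, p in zip(y_true, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1  # false alarm
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1  # a miss
    return tp, fp, tn, fn

# Made-up labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
probs = [0.05, 0.2, 0.35, 0.45, 0.4, 0.6, 0.55, 0.7, 0.8, 0.95]

for threshold in (0.3, 0.5, 0.7):
    tp, fp, tn, fn = confusion_counts(y_true, probs, threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} TN={tn} FN={fn}")
```

Lowering the threshold trades misses for false alarms; raising it does the opposite.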

### Bagging!  Let's look at an example

(Bootstrap Aggregation)

In [1]:
# Some imports -- usual suspects
import sklearn
from sklearn import preprocessing
import pandas as pd

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [4]:
# Data wrangling
col_names = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]

df = pd.read_csv('adult.csv')
df.columns=col_names
df['Income'] = df['Income'].apply(lambda x: 0 if x == ' <=50K' else 1)

In [None]:
# Some (very) light feature engineering
def age_bucket(x):

    if 12 <= x < 18:
        return 1
    elif 18 <= x < 30:
        return 2
    elif 30 <= x < 40:
        return 3
    elif 40 <= x < 50:
        return 4
    elif 50 <= x < 66:
        return 5
    else:
        return 6
    
    
def education_bucket(x):
    if x in [' 1st-4th', ' 5th-6th', ' 7th-8th' , ' 9th',  ' 10th', ' 11th', ' 12th', ' Preschool']:
        return 1
    elif x in [ ' HS-grad']:
        return 2
    elif x in [' Some-college']:
        return 3
    elif x in [ ' Bachelors']:
        return 5
    elif x in [' Masters']:
        return 6
    elif x in [' Assoc-acdm', ' Assoc-voc']:
        return 4
    elif x in [' Doctorate', ' Prof-school']:
        return 7

In [3]:
df['CapitalNet'] = df['CapitalGain'] - df['CapitalLoss']
df['Age_class'] = df['Age'].apply(age_bucket)
df['Education_Class'] = df['Education'].apply(education_bucket)

In [5]:
for column in df.columns:
    if df[column].dtype == object:  # label-encode string-valued (object dtype) columns only
        le = preprocessing.LabelEncoder()
        df[column] = le.fit_transform(df[column])
df.tail()

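| | Age | WorkClass | fnlwgt | Education | EducationNum | MaritalStatus | Occupation | Relationship | Race | Gender | CapitalGain | CapitalLoss | HoursPerWeek | NativeCountry | Income | CapitalNet | Age_class | Education_Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32555 | 27 | 4 | 257302 | 7 | 12 | 2 | 13 | 5 | 4 | 0 | 0 | 0 | 38 | 39 | 0 | 0 | 2 | 4 |
| 32556 | 40 | 4 | 154374 | 11 | 9 | 2 | 7 | 0 | 4 | 1 | 0 | 0 | 40 | 39 | 1 | 0 | 4 | 2 |
| 32557 | 58 | 4 | 151910 | 11 | 9 | 6 | 1 | 4 | 4 | 0 | 0 | 0 | 40 | 39 | 0 | 0 | 5 | 2 |
| 32558 | 22 | 4 | 201490 | 11 | 9 | 4 | 1 | 3 | 4 | 1 | 0 | 0 | 20 | 39 | 0 | 0 | 2 | 2 |
| 32559 | 52 | 5 | 287927 | 11 | 9 | 2 | 4 | 5 | 4 | 0 | 15024 | 0 | 40 | 39 | 1 | 15024 | 5 | 2 |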


In [26]:
X = df.drop(columns=['Income'])
y = df['Income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=555, stratify=y)

In [7]:
# Let's remember from a few weeks ago how to build a Decision Tree

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()

In [8]:
# K-Fold
results = sklearn.model_selection.cross_val_score(dt, X_train,y_train, cv=5)
results

array([0.81470106, 0.81081081, 0.81101556, 0.80835381, 0.80487305])

In [9]:
# Frequently, we will want to average the scores across all the folds
results.mean(), results.std()

(0.8099508599508599, 0.003248760601195561)

In [10]:
# BAGGING
# Create a bag of 100 estimators
# (scikit-learn >= 1.2 renames `base_estimator` to `estimator`)
dt_bag = BaggingClassifier(base_estimator=dt, n_estimators=100, random_state=555, n_jobs=-1)

# Fit / Train model
dt_bag.fit(X_train,y_train)

#Results
results = dt_bag.score(X_test, y_test)
results

0.8520884520884521

### This comparison illustrates:
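* K-fold cross validation divides the data into K buckets and fits K models, each trained on K-1 buckets and evaluated on the one held out. Its purpose is to estimate performance.
* Bagging fits N models, each on a bootstrap sample (drawn with replacement) of the training data, and combines their predictions into a single model.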
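The difference is easiest to see in which rows each model gets. The sketch below works on row indices only, with small made-up sizes and the standard library; it does not reproduce how `sklearn` builds its splits internally.

```python
import random

n, k, n_models = 10, 5, 3
indices = list(range(n))

# K-fold: partition the rows into k buckets; each bucket is held out once.
folds = [indices[i::k] for i in range(k)]
kfold_splits = []
for held_out in folds:
    train = [i for i in indices if i not in held_out]
    kfold_splits.append((train, held_out))

# Bagging: each model draws n rows WITH replacement from the same data.
rng = random.Random(555)  # arbitrary seed
bootstrap_samples = [rng.choices(indices, k=n) for _ in range(n_models)]
out_of_bag = [sorted(set(indices) - set(s)) for s in bootstrap_samples]

for train, held_out in kfold_splits:
    print("k-fold  train:", train, " held out:", held_out)
for sample, oob in zip(bootstrap_samples, out_of_bag):
    print("bagging sample:", sorted(sample), " never drawn:", oob)
```

In K-fold every row is held out exactly once and never repeated within a training set; in bagging rows repeat within a sample and the left-out rows differ from model to model.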

#### We can use this bagging method with other (any) base model

In [28]:
knn = KNeighborsClassifier(n_neighbors=5)

nn_bag = BaggingClassifier(base_estimator=knn, n_estimators=100, random_state=555, n_jobs=-1)
nn_bag.fit(X_train,y_train)

#Results
results = nn_bag.score(X_test, y_test)
results

0.7878378378378378

#### Next Week: Boosting!!

#### Time for week 9 lab.