# Bias-Variance Tradeoff
---

$E\left[(y-\hat{f}(x))^2\right] = \left(Bias[\hat{f}(x)]\right)^2 + Var[\hat{f}(x)] + \sigma^2$

For the mathematical proof, see [Wikipedia](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
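
As a quick sketch of why the three terms appear: write $\hat{f}$ for the fitted model and assume $y = f(x) + \epsilon$, where the noise $\epsilon$ has mean zero and variance $\sigma^2$ and is independent of the training set. These assumptions are the usual ones for this result and are not stated in these notes. The expectation is taken over training sets and noise.

$$
\begin{aligned}
E\left[(y-\hat{f}(x))^2\right]
&= E\left[(f(x)-\hat{f}(x))^2\right] + 2\,E[\epsilon]\,E\left[f(x)-\hat{f}(x)\right] + E[\epsilon^2] \\
&= E\left[(f(x)-\hat{f}(x))^2\right] + \sigma^2 \\
&= \left(f(x)-E[\hat{f}(x)]\right)^2 + E\left[\left(\hat{f}(x)-E[\hat{f}(x)]\right)^2\right] + \sigma^2 \\
&= \left(Bias[\hat{f}(x)]\right)^2 + Var[\hat{f}(x)] + \sigma^2
\end{aligned}
$$

The cross term in the first line vanishes because $E[\epsilon] = 0$, and the third line follows by adding and subtracting $E[\hat{f}(x)]$ inside the square.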

## Generalization Error
- Supervised Learning: y = f(x), where f is unknown.

![image.png](attachment:image.png)

## Goals of Supervised Learning
- Find a model f^ that best approximates f, so that the predictions f^(x) are close to the actual values f(x).
- f^ can be Logistic Regression, Decision Tree, Neural Network ...
- Discard noise as much as possible.
- **End goal**: f^ should achieve a low predictive error on unseen datasets.

## Difficulties in Approximating f
### Overfitting:
- f^(x) fits the training set noise.
![image.png](attachment:image.png)

### Underfitting:

- f^ is not flexible enough to approximate f.
![image.png](attachment:image.png)

# Generalization Error
- Generalization Error of f^ : Does f^ generalize well on unseen data?
- It can be decomposed as follows:
- Generalization Error of f^ = bias² + variance + irreducible error

# Bias
- Bias: error term that tells you, on average, how much f^ differs from f.
- ![image.png](attachment:image.png)

# Variance
- Variance: tells you how much f^ varies across different training sets.
- ![image.png](attachment:image.png)
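
The definitions of bias and variance can be checked numerically. The sketch below is only an illustration under stated assumptions: the true function `f`, the noise level `sigma`, the query point `x0` and the straight-line model are all made up for the example. It refits the model on many fresh training sets, then compares the average squared error at `x0` with bias² + variance + σ².

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration
    return np.sin(2 * np.pi * x)

sigma = 0.3      # standard deviation of the irreducible noise (assumed)
x0 = 0.35        # fixed point at which the error is decomposed
n_train = 30     # size of each simulated training set
n_rounds = 4000  # number of simulated training sets
degree = 1       # straight line: a deliberately inflexible model

preds = np.empty(n_rounds)
sq_errors = np.empty(n_rounds)
for i in range(n_rounds):
    # Draw a fresh training set and fit the model to it
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    coefs = np.polyfit(x, y, degree)
    preds[i] = np.polyval(coefs, x0)
    # Draw an independent noisy observation at x0 and record the squared error
    y0 = f(x0) + rng.normal(0, sigma)
    sq_errors[i] = (y0 - preds[i]) ** 2

bias_sq = (preds.mean() - f(x0)) ** 2   # squared average deviation from f
variance = preds.var()                  # spread of predictions across training sets
expected_error = sq_errors.mean()       # Monte Carlo estimate of E[(y - f^(x0))^2]
decomposed = bias_sq + variance + sigma ** 2

print("bias^2   :", bias_sq)
print("variance :", variance)
print("sum + s^2:", decomposed)
print("avg error:", expected_error)
```

The last two printed numbers should agree up to simulation noise. Raising `degree` would be expected to shrink the bias term and grow the variance term.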

# Model Complexity
- Model Complexity: sets the flexibility of f^.
- Examples: maximum tree depth, minimum samples per leaf, etc.
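
To see how a complexity setting changes the fit, the sketch below sweeps `max_depth` for a `DecisionTreeRegressor`. The data is synthetic (a noisy sine curve, an assumption for illustration and not the auto-mpg data), and the depths tried are arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

results = {}
for depth in (1, 3, 6, 12):
    # A larger max_depth gives a more flexible tree
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1)
    tree.fit(X_tr, y_tr)
    rmse_tr = mean_squared_error(y_tr, tree.predict(X_tr)) ** 0.5
    rmse_te = mean_squared_error(y_te, tree.predict(X_te)) ** 0.5
    results[depth] = (rmse_tr, rmse_te)
    print("max_depth={:>2}  train RMSE={:.3f}  test RMSE={:.3f}".format(
        depth, rmse_tr, rmse_te))
```

The training RMSE keeps falling as the depth grows, while the gap between test and training RMSE widens. That widening gap is the overfitting pattern described above.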

# Bias-Variance Tradeoff
![image.png](attachment:image.png)

# Diagnosing Bias and Variance Problems
## Estimating the Generalization Error
### Solution:
- split the data into training and test sets,
- fit f^ to the training set,
- evaluate the error of f^ on the **unseen** test set.
- generalization error of f^ ≈ test set error of f^.

## Better Model Evaluation with Cross-Validation
- Test set should not be touched until we are confident about f^'s performance.
- Evaluating f^ on the training set gives a biased (optimistic) estimate, because f^ has already seen all training points.
- Solution → Cross-Validation (CV):
    - K-Fold CV,
    - Hold-Out CV.
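
As a sketch of what K-Fold CV does, the loop below splits the data into 5 folds by hand, fits on 4 and scores on the remaining one, and then checks the result against `cross_val_score`. The data is synthetic and the model settings are arbitrary; both are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.2, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=2)
model = DecisionTreeRegressor(max_depth=3, random_state=2)

# Manual K-Fold: each fold is held out once and used only for scoring
manual_mse = []
for train_idx, val_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    fold_pred = model.predict(X[val_idx])
    manual_mse.append(mean_squared_error(y[val_idx], fold_pred))
manual_mse = np.array(manual_mse)

# Same folds through cross_val_score; the scores come back as negative MSEs
auto_mse = -cross_val_score(model, X, y, cv=kf,
                            scoring="neg_mean_squared_error")

print("manual fold MSEs:", manual_mse)
print("CV RMSE         :", manual_mse.mean() ** 0.5)
```

Hold-Out CV corresponds to doing this with a single train/validation split instead of K rotating folds.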

## Diagnose Variance Problems

- If f^ suffers from **high variance**:

    - CV error of f^ > training set error of f^.
    
- f^ is said to overfit the training set. To remedy overfitting:

    - decrease model complexity,
    - e.g. decrease max depth, increase min samples per leaf, ...
    - gather more data, ...
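
The "gather more data" remedy can be illustrated with a small simulation. This is a sketch under assumptions: the target function, the noise level and the degree-8 polynomial (standing in for a flexible model) are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Hypothetical "true" function, chosen only for illustration
    return np.sin(3 * x)

def train_and_test_rmse(n_train, degree=8, sigma=0.3, n_test=2000):
    # Simulate a training set of the requested size and a large test set
    x_tr = rng.uniform(-1, 1, n_train)
    y_tr = f(x_tr) + rng.normal(0, sigma, n_train)
    x_te = rng.uniform(-1, 1, n_test)
    y_te = f(x_te) + rng.normal(0, sigma, n_test)
    # Fit a flexible polynomial and score it on both sets
    coefs = np.polyfit(x_tr, y_tr, degree)
    rmse_tr = np.sqrt(np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2))
    rmse_te = np.sqrt(np.mean((y_te - np.polyval(coefs, x_te)) ** 2))
    return rmse_tr, rmse_te

gaps = {}
for n in (15, 150, 1500):
    rmse_tr, rmse_te = train_and_test_rmse(n)
    gaps[n] = rmse_te - rmse_tr
    print("n={:>4}  train RMSE={:.3f}  test RMSE={:.3f}  gap={:.3f}".format(
        n, rmse_tr, rmse_te, gaps[n]))
```

With the model complexity held fixed, the gap between test and training error should shrink as the training set grows.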

## Diagnose Bias Problems
- If f^ suffers from **high bias**:
    - CV error of f^ ≈ training set error of f^ >> desired error.
- f^ is said to underfit the training set. To remedy underfitting:
    - increase model complexity,
    - e.g. increase max depth, decrease min samples per leaf, ...
    - gather more relevant features
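
The two diagnostic rules (variance and bias) can be written as a small helper. The function name `diagnose`, the 10% tolerance `gap_tol` and the simple "greater than desired error" check are hypothetical choices for this sketch. The notes do not say how large a gap counts as a problem.

```python
def diagnose(train_error, cv_error, desired_error, gap_tol=0.10):
    """Return rough labels for the problems suggested by the error pattern.

    gap_tol is a hypothetical threshold: CV error more than 10% above the
    training error is treated as "CV error > training error".
    """
    findings = []
    if cv_error > train_error * (1 + gap_tol):
        # CV error clearly exceeds training error: overfitting
        findings.append("high variance")
    elif train_error > desired_error:
        # CV error is close to training error, and both are too large: underfitting
        findings.append("high bias")
    return findings


print(diagnose(train_error=1.0, cv_error=3.0, desired_error=2.0))
print(diagnose(train_error=5.0, cv_error=5.1, desired_error=2.0))
print(diagnose(train_error=1.0, cv_error=1.05, desired_error=2.0))
```

A model can suffer from both problems at once. This sketch reports only the dominant pattern and returns an empty list when neither rule fires.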

## Exercise using auto mpg data

## Step 1: Import Required Modules

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE

## Step 2: Load Data

In [2]:
#os.chdir("C:\\Users\\Hi\\Google Drive\\01 Data Science Lab Copy\\02 Lab Data\\Python")
df = pd.read_csv("auto-mpg.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 6 columns):
mpg       392 non-null float64
displ     392 non-null float64
hp        392 non-null int64
weight    392 non-null int64
accel     392 non-null float64
size      392 non-null float64
dtypes: float64(4), int64(2)
memory usage: 18.5 KB


In [3]:
df[1:5]

| | mpg | displ | hp | weight | accel | size |
|---|---|---|---|---|---|---|
| 1 | 9.0 | 304.0 | 193 | 4732 | 18.5 | 20.0 |
| 2 | 36.1 | 91.0 | 60 | 1800 | 16.4 | 10.0 |
| 3 | 18.5 | 250.0 | 98 | 3525 | 19.0 | 15.0 |
| 4 | 34.3 | 97.0 | 78 | 2188 | 15.8 | 10.0 |


In [4]:
X = df.loc[:,df.columns !="mpg"]

In [5]:
X[1:5]

| | displ | hp | weight | accel | size |
|---|---|---|---|---|---|
| 1 | 304.0 | 193 | 4732 | 18.5 | 20.0 |
| 2 | 91.0 | 60 | 1800 | 16.4 | 10.0 |
| 3 | 250.0 | 98 | 3525 | 19.0 | 15.0 |
| 4 | 97.0 | 78 | 2188 | 15.8 | 10.0 |


In [6]:
type(X)

pandas.core.frame.DataFrame

In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 5 columns):
displ     392 non-null float64
hp        392 non-null int64
weight    392 non-null int64
accel     392 non-null float64
size      392 non-null float64
dtypes: float64(3), int64(2)
memory usage: 15.4 KB


In [8]:
y= df["mpg"]

In [9]:
type(y)

pandas.core.series.Series

In [10]:
y[1:5]

1     9.0
2    36.1
3    18.5
4    34.3
Name: mpg, dtype: float64

## Step 3: Split into Train and Test

In [11]:
# Set SEED for reproducibility
SEED = 1
# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=SEED)

In [12]:
# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4,
                           min_samples_leaf=0.26,
                           random_state=SEED)

In [13]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 274 entries, 332 to 37
Data columns (total 5 columns):
displ     274 non-null float64
hp        274 non-null int64
weight    274 non-null int64
accel     274 non-null float64
size      274 non-null float64
dtypes: float64(3), int64(2)
memory usage: 12.8 KB


## Step 4: Calculate CV RMSE

In [14]:
# Compute the array of 10-fold CV scores. With this scoring option,
# scikit-learn returns *negative* MSEs (one value per fold).
MSE_CV_scores = cross_val_score(dt,
                                X_train,
                                y_train,
                                cv=10,
                                scoring='neg_mean_squared_error',
                                n_jobs=-1)

In [15]:
MSE_CV_scores

array([-46.94808341, -18.78121305, -18.19914701, -18.06935431,
       -17.19546733, -28.91247609, -39.41410887, -21.30453162,
       -31.96443414, -23.74191199])

In [16]:
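# Compute the 10-fold CV RMSE. The scores are negative MSEs, so negate
# the mean before taking the square root (otherwise the result is nan).
RMSE_CV = (-MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14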


  


## Calculate Regular RMSE on train data

In [17]:
# Fit dt to the training set
dt.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=0.26,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

In [18]:
# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

In [19]:
# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15


## Calculate Regular RMSE on test data (baseline estimate of the generalization error)

In [20]:
y_pred_test = dt.predict(X_test)

In [21]:
RMSE_test = (MSE(y_test, y_pred_test))**(1/2)

In [22]:
RMSE_test

4.859775831709125
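
To tie the exercise back to the diagnosis rules, the sketch below recomputes the CV RMSE from the fold scores printed in Step 4 and compares it with the training RMSE. The numbers are copied from the outputs above, with the train and test RMSE rounded to two decimals. No desired error is given in this exercise, so the bias check is left as an open comparison.

```python
import math

# Negative-MSE fold scores copied from the cross_val_score output in Step 4
neg_mse_scores = [-46.94808341, -18.78121305, -18.19914701, -18.06935431,
                  -17.19546733, -28.91247609, -39.41410887, -21.30453162,
                  -31.96443414, -23.74191199]

# Negate the mean (the scores are negative MSEs) before taking the square root
rmse_cv = math.sqrt(-sum(neg_mse_scores) / len(neg_mse_scores))
rmse_train = 5.15  # training RMSE printed above
rmse_test = 4.86   # test RMSE printed above, rounded

# Relative gap between CV error and training error
relative_gap = (rmse_cv - rmse_train) / rmse_train

print("CV RMSE   : {:.2f}".format(rmse_cv))
print("Train RMSE: {:.2f}".format(rmse_train))
print("Test RMSE : {:.2f}".format(rmse_test))
print("Relative CV/train gap: {:.1%}".format(relative_gap))
```

The CV RMSE and the training RMSE are almost equal, so there is no sign of a variance problem for this tree. If an RMSE of about 5 mpg is well above the error you are aiming for, the pattern matches the high-bias case and a more flexible tree would be worth trying.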