## 1. Bias

- Bias is the error rate of your model on the training dataset.
- Bias is how much your model __under-fits__ the training data.

### How do you compute bias?

$$Bias = E[y_p - y_t]$$

Expected difference between the predicted values $y_p$ and the observed (true) values $y_t$.
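As a minimal sketch of the formula above, the expectation can be estimated by averaging the signed differences over a dataset. The function name `bias` and the numbers are made up for illustration.

```python
def bias(y_pred, y_true):
    """Estimate E[y_p - y_t] as the mean signed difference."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    return sum(p - t for p, t in zip(y_pred, y_true)) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]    # observed values (illustrative)
y_pred = [2.0, 4.0, 6.0, 8.0]    # every prediction is 1 too low

print(bias(y_pred, y_true))      # -1.0
```

Because the differences are signed, over-predictions and under-predictions can cancel each other out, so a value near zero does not by itself mean every prediction is close.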

Bias is a learner's tendency to consistently learn the wrong thing
-----    

Which of the following has higher bias?

1. $y = \theta _0$
1. $y = \theta _0 + \theta _1x $

Bias is bad
------

An algorithm that has a good ability to fit the training data has __low__ bias.

We want to minimize bias.

Algorithms with high bias 
-------

- Produce simple models
- Fail to capture meaningful patterns in the data
- Under-fit their training data (but do not over-fit it either)

## How to decrease bias?

 - Make the model more complex!  

You can add more parameters, for example higher-order terms:
$y = \theta _0 + \theta _1x $  
$y = \theta _0 + \theta _1x  + \theta _2x^2$  
$y = \theta _0 + \theta _1x  + \theta _2x^2 + \theta _3x^3$  
…

Or increase model complexity by picking a more flexible algorithm, or by giving the model more to work with:

- Larger set of features
- Better features

Both will increase your model's ability to fit the training dataset, thus lowering bias.
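A rough way to see this is to fit the constant model $y = \theta_0$ and the line $y = \theta_0 + \theta_1x$ to the same data and compare their training errors. The sketch below uses closed-form least squares in plain Python; the data and the helper names are assumptions made for illustration.

```python
def mse(y_pred, y_true):
    """Mean squared error between predictions and observations."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def fit_constant(x, y):
    """y = theta_0: the least-squares constant is the mean of y (x is unused)."""
    theta_0 = sum(y) / len(y)
    return lambda v: theta_0

def fit_line(x, y):
    """y = theta_0 + theta_1 * x via closed-form least squares."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    theta_1 = sxy / sxx
    theta_0 = mean_y - theta_1 * mean_x
    return lambda v: theta_0 + theta_1 * v

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]    # made-up data, roughly y = 1 + 2x

train_errors = {}
for name, fit in [("constant", fit_constant), ("line", fit_line)]:
    model = fit(x, y)
    train_errors[name] = mse([model(v) for v in x], y)
    print(name, round(train_errors[name], 4))
```

On this data the line has a much smaller training error than the constant, which is what "lower bias" means in the sense used here.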

## 2. Variance

- Variance is the amount by which the model's results change for a small change in the training data.
- If a small change in the training data changes the model's results a lot, then the model is said to have high variance.
- Variance is an algorithm's flexibility to learn patterns in the observed data.
- If variance is high, our model __over-fits__ the training data.

Model Complexity will Increase Variance 
------

The more complex the model is, the more data points it will "capture". 

However, complexity will make the model "move" more to "capture" the data points, and hence its variance will be larger.

#### Variance is how much worse you do on the test dataset compared to the training dataset.
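The gap between training and test error can be illustrated with a deliberately over-flexible model. The sketch below uses a 1-nearest-neighbour predictor, which memorizes its training set; the choice of model and the data are assumptions for illustration only.

```python
def mse(y_pred, y_true):
    """Mean squared error between predictions and observations."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def predict_1nn(x_train, y_train, x_new):
    """Return the target of the closest training point (memorizes the data)."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return y_train[nearest]

# Made-up data: the underlying relation is y = x, training targets are noisy.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [0.3, 0.8, 2.4, 2.7, 4.2]
x_test = [0.4, 1.6, 2.4, 3.6]
y_test = [0.4, 1.6, 2.4, 3.6]

train_mse = mse([predict_1nn(x_train, y_train, v) for v in x_train], y_train)
test_mse = mse([predict_1nn(x_train, y_train, v) for v in x_test], y_test)

print("train:", train_mse)            # 0.0, every training point is its own neighbour
print("test: ", round(test_mse, 4))
print("gap:  ", round(test_mse - train_mse, 4))
```

The model reproduces the training targets exactly, noise included, so its training error is zero while its test error is not.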

What should you do if you have high variance?
------

1. Feature Selection
1. Regularization
1. Dimensionality Reduction
1. Bagging methods (e.g., Random Forest)

## Bias Vs Variance

<center><img src="images/bias_var2.png" width="400"/></center>

## Bias-variance trade-off

The goal of Machine Learning:

1. Low bias (model the patterns in the observed data)
1. Low variance (not sensitive to specificities of the observed data)

<center><img src="images/bias_var.png" width="75%"/></center>

Bias-variance trade-off: A balancing act
------

<center><img src="images/abstract_better.png" width="75%"/></center>

## 3. Underfitting
- Underfitting means not capturing enough of the patterns in the data. The model performs poorly on both the training set and the test set.

## 4. Overfitting
- Overfitting means a) capturing noise and b) capturing patterns that do not generalize well to unseen data. The model performs extremely well on the training set but poorly on the test set.

## 5. Cross Validation

- Cross-validation is a model validation technique for assessing how the results of a statistical model will generalize to an independent data set.
- The goal of cross-validation is to set aside a data set for testing the model during the training phase (i.e. a validation data set), in order to limit problems like overfitting and underfitting and to gain insight into how the model will generalize to an independent data set.
- It is important that the validation set and the training set are drawn from the same distribution; otherwise the validation score is a misleading estimate of how the model will generalize.

## Approach 1 for Validation: Train/Test split or Holdout

- In this strategy, we simply split the data into two sets, a train set and a test set, so that no sample appears in both. If the two sets overlap, we simply can’t trust the model's test score.

<center><img src="images/cv1.png" width="75%"/></center>

BUT:
- What if the split we make isn’t random? What if one subset of our data has only people from a certain state, only employees at a certain income level, only women, or only people of a certain age? The model is then fit to an unrepresentative sample, which results in overfitting even though we’re trying to avoid it!
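One common guard against an ordered, non-random split is to shuffle the sample indices before cutting them. The helper below is a hypothetical plain-Python sketch, not a library function; `test_fraction` and `seed` are illustrative parameters.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into train and test parts."""
    indices = list(range(n_samples))
    # Shuffle so the split does not follow the order the data was stored in.
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_samples * test_fraction)))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(20, test_fraction=0.25, seed=42)

# No sample may appear in both sets.
assert not set(train_idx) & set(test_idx)
print(len(train_idx), len(test_idx))   # 15 5
```

Shuffling only makes an unlucky split less likely. When a group or class is rare, a stratified split that preserves group proportions is a common refinement.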

## Approach 2: K-fold

It can be viewed as repeated holdout: we run K different holdouts and average the scores. Every data point is in a validation set exactly once and in a training set K-1 times. This significantly reduces underfitting, as most of the data is used for fitting in each round, and also significantly reduces overfitting, as every data point is eventually used for validation.

<center><img src="images/cv2.png" width="75%"/></center>
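The fold bookkeeping can be sketched in a few lines of plain Python. The generator below is a hypothetical helper written for illustration; it splits indices in order, so in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("validate on", validation, "train on", train)
```

To score a model, fit it on each `train` part, evaluate it on the matching `validation` part, and average the K scores.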

## 6. Parameters & Hyperparameters - The difference

Model parameters are learned during training and learned for a specific model on specific data.

Hyperparameters are properties of the algorithm.

Hyperparameters are set before training starts.

-----

Model parameters are always learned; that is why it is called Machine Learning.

Hyperparameters can be picked or learned.
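The distinction can be made concrete with a one-feature ridge regression, used here only as an assumed example. The penalty strength `alpha` is a hyperparameter chosen before fitting, while `theta_0` and `theta_1` are parameters computed from the data. The function name and the data are illustrative.

```python
def fit_ridge_line(x, y, alpha):
    """Fit y = theta_0 + theta_1 * x, penalizing alpha * theta_1 ** 2.

    alpha is a hyperparameter (picked by us); theta_0 and theta_1 are
    parameters (learned from x and y). The intercept is not penalized.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    theta_1 = sxy / (sxx + alpha)
    theta_0 = mean_y - theta_1 * mean_x
    return theta_0, theta_1

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]    # made-up data, exactly y = 1 + 2x

for alpha in (0.0, 10.0):
    theta_0, theta_1 = fit_ridge_line(x, y, alpha)
    print("alpha =", alpha, "-> theta_0 =", theta_0, "theta_1 =", theta_1)
```

Changing `alpha` changes which parameters get learned from the same data: with `alpha = 0.0` the fit recovers the slope 2.0, and with `alpha = 10.0` the slope shrinks to 1.0.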

## 7. Evaluation metrics for classification

- Accuracy 
- Recall
- Precision
- F-score

<center><img src="images/pr_re.png" width="80%"/></center>

- True Positives (TP): correctly predicted a successful outcome / one label
- True Negatives (TN): correctly predicted the lack of an outcome / the other label
- False Positives (FP): incorrectly predicted a successful outcome (a "Type I error")
- False Negatives (FN): incorrectly predicted the lack of an outcome (a "Type II error")
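The four counts can be tallied directly from paired true and predicted labels. This is a minimal sketch for binary labels; the function name, the `positive=1` convention, and the labels are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # predicted positive, actually positive
        elif pred == positive:
            fp += 1          # predicted positive, actually negative (Type I)
        elif truth == positive:
            fn += 1          # predicted negative, actually positive (Type II)
        else:
            tn += 1          # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]    # made-up labels
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)    # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```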

<center><img src="images/pregnant.jpg" width="50%"/></center>

### Accuracy

$$Accuracy = \frac{All\ Correct}{Total}$$

- Fraction of observations classified correctly
- 1 - error rate

#### What is the biggest limitation of accuracy?

Accuracy is an overall measure (it ignores which classes were correctly predicted). It does not tell you what "types" of errors your classifier is making.

It is affected by class imbalance, when one group is much larger than another.
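A small made-up example shows the problem: with 5 positives among 100 observations, a classifier that always predicts the negative class still scores high accuracy while finding none of the positives.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Illustrative imbalanced dataset: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100                    # "always predict negative"

found_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy(y_true, y_pred))       # 0.95
print(found_positives)                # 0
```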

### Precision

$$Precision = \frac{Class\ Correct}{Class\ Total\ Predicted}$$

Fraction of labeled items assigned to a class that are actually members of that class

### Recall

$$Recall = \frac{Class\ Correct}{Class\ Total\ Actual}$$

Fraction of labeled items in a class that are classified correctly
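Both formulas share the same numerator (items of the class that were predicted correctly) and differ only in the denominator. A minimal sketch for one class follows; the function name and labels are made up, and returning 0.0 for an empty denominator is a convention assumed here.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labelled `positive`."""
    class_correct = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    total_predicted = sum(1 for p in y_pred if p == positive)
    total_actual = sum(1 for t in y_true if t == positive)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = class_correct / total_predicted if total_predicted else 0.0
    recall = class_correct / total_actual if total_actual else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]    # 4 actual positives
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]    # 3 predicted positives, 2 correct

precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 3), recall)          # 0.667 0.5
```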

Confusion Matrix
------

<center><img src="images/confu_matrix.png" width="40%"/></center>

<center><img src="images/statistical-classification-metrics.png" width="80%"/></center>

### Example

<center><img src="images/p_r.png" width="90%"/></center>

### F<sub>1</sub> score

The F1 score is a measure of a model’s performance. It is the harmonic mean of the model's precision and recall, with results tending to 1 being the best, and those tending to 0 being the worst.


$$F_1\ Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$

A single metric that combines precision and recall.

In Machine Learning, we want a single metric.
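The formula above is the harmonic mean of precision and recall, which stays low unless both are high. The sketch below compares it with a plain arithmetic mean on made-up values; treating the 0/0 case as 0.0 is an assumed convention.

```python
def f1_score(precision, recall):
    """F1 = 2 * P * R / (P + R); defined here as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: very precise, but most positives are missed.
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_score(precision, recall)

print(round(arithmetic_mean, 3), round(f1, 3))    # 0.55 0.182
```

The arithmetic mean hides the poor recall, while F1 is pulled toward the weaker of the two numbers.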

### Generalized F score

<center><img src="images/f_score_2.png" width="75%"/></center>

F<sub>1</sub> weighs recall and precision equally.

F<sub>0.5</sub> weighs recall lower than precision (by reducing the influence of false negatives).

F<sub>2</sub> weighs recall higher than precision (by placing more emphasis on false negatives).
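The effect of the weight can be checked with the standard definition $F_\beta = (1+\beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$. The sketch below is illustrative: the function name and the example values are assumptions, and the 0/0 case is defined as 0.0 by convention.

```python
def f_beta(precision, recall, beta=1.0):
    """Generalized F score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up classifier: low precision, high recall.
precision, recall = 0.5, 1.0

scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1.0, 2.0)}
for beta, score in scores.items():
    print("F" + str(beta), round(score, 4))
```

For this recall-heavy classifier the score rises as beta grows: F<sub>0.5</sub> is the lowest and F<sub>2</sub> the highest.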

ROC (receiver operating characteristic) curve 
----

The ROC (Receiver Operating Characteristic) curve is a graph of the true positive rate vs. the false positive rate at various classification thresholds.
It’s often used as a proxy for the trade-off between the sensitivity of the model (true positive rate) and the probability that it will trigger a false alarm (false positive rate).


<center><img src="images/roc_first.png" width="50%"/></center>
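Each point on such a curve comes from one threshold applied to the model's scores. The sketch below computes the points by hand for a tiny made-up example; the function name, the scores, and the threshold list are assumptions, and it expects both classes to be present.

```python
def roc_points(y_true, scores, thresholds):
    """Return (false_positive_rate, true_positive_rate) for each threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = []
    for threshold in thresholds:
        # Predict positive when the score reaches the threshold.
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        points.append((fp / n_neg, tp / n_pos))
    return points

y_true = [0, 0, 1, 1]                      # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]             # made-up model scores
thresholds = [1.0, 0.8, 0.4, 0.35, 0.1]    # from strict to lenient

points = roc_points(y_true, scores, thresholds)
for threshold, (fpr, tpr) in zip(thresholds, points):
    print("threshold", threshold, "FPR", fpr, "TPR", tpr)
```

Lowering the threshold moves the point from (0, 0) toward (1, 1): more true positives are caught, at the cost of more false alarms.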