## Variance, Bias, Noise

These three terms, and only these three, make up your expected test error.

Consider a regression that predicts housing prices: a feature the model does not account for can still affect the label.

We may have features such as "square footage", "crime rate", and "number of bathrooms", but a missing feature such as "did a celebrity live there?" can make the actual label differ from the expected label. That unexplained difference is noise.

### Minimizing error

Train a classifier h_D on a training data set D, then draw a new data point x with label y. How far off would the prediction be? In other words, what is the expected test error of the algorithm?

Ultimately, that is the quantity you want to minimize when you design a new algorithm.

Total Error = *Bias^2 + Variance + Noise*
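Written out for squared loss, the decomposition is commonly stated as below. The notation is an assumption layered on these notes: `h_D` is the classifier trained on data set `D`, `\bar{h}` is the average classifier, and `\bar{y}(x)` is the expected label for input `x`.

```latex
\underbrace{\mathbb{E}_{x,y,D}\!\left[(h_D(x) - y)^2\right]}_{\text{expected test error}}
=
\underbrace{\mathbb{E}_{x,D}\!\left[(h_D(x) - \bar{h}(x))^2\right]}_{\text{variance}}
+
\underbrace{\mathbb{E}_{x}\!\left[(\bar{h}(x) - \bar{y}(x))^2\right]}_{\text{bias}^2}
+
\underbrace{\mathbb{E}_{x,y}\!\left[(\bar{y}(x) - y)^2\right]}_{\text{noise}}
```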


## Noise

Noise is the expected squared difference between the label of a data point and its expected label.

You typically don't address noise through the machine learning algorithm. Instead, you reduce it by cleaning the data, adding features, or taking other data preparation steps.

How large is the data-intrinsic noise? This error measures ambiguity due to your data distribution and feature representation. You can never reduce it algorithmically; it is an inherent aspect of the data. You might, however, be able to add more features that capture this seemingly random variability.

To formally compute noise:

Average the squared difference between each actual label and its expected label.
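As a minimal sketch, the computation can be tried on synthetic data where the expected label is known by construction. The function `expected_label`, the noise level `sigma=0.5`, and the uniform inputs are all assumptions made for illustration. On real data the expected label is unknown and has to be estimated.

```python
import random

def expected_label(x):
    # Hypothetical ground truth y-bar(x); known only because the data is synthetic.
    return 2.0 * x

def draw_label(x, rng, sigma=0.5):
    # Observed label = expected label + variation the features do not explain.
    return expected_label(x) + rng.gauss(0.0, sigma)

def estimate_noise(n_points=20000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_points):
        x = rng.uniform(0.0, 1.0)
        y = draw_label(x, rng)
        # Squared difference between the actual label and the expected label.
        total += (y - expected_label(x)) ** 2
    return total / n_points

noise = estimate_noise()
```

With these assumed settings the estimate should land close to `sigma ** 2`, the variance of the label around its expected value.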


## Variance

Captures how much your hypothesis function changes if you train on a different training set. How "overspecialized" is your hypothesis to a particular training set? If we have the best possible model for our training data, how far off are we from the average hypothesis?

Variance refers to the sensitivity of the model to fluctuations in the training data. A model with high variance tends to overfit the data, meaning it learns the training data too well and struggles to generalize to new, unseen data.

How much do our classifiers vary if we train them on different data sets?

How different is that prediction from the average prediction, that is, the expected prediction we would get with an infinite amount of data?

The variance is the error you incur because you trained on one specific data set rather than on an infinite amount of data.

When trained on different data sets, classifiers can vary greatly from each other, and overfit classifiers tend to have the greatest spread.

The average classifier, h-bar(x), returns the average prediction of classifiers trained on all possible data sets.

A symptom of high variance: training error is much lower than test error.

To lower variance:

* Add more training data, if possible. 

* Reduce model complexity 

* Bagging
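Bagging (bootstrap aggregating), the last item above, can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the synthetic `sin(3x)` data, the 1-nearest-neighbor base model, and the helper names are all hypothetical choices. Each base model is trained on a bootstrap resample of the training set, and the bagged prediction is their average.

```python
import math
import random

def sample_dataset(rng, n=30, sigma=0.3):
    # Hypothetical data distribution: y = sin(3x) plus Gaussian noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(3.0 * x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

def fit_nearest_neighbor(xs, ys):
    # High-variance base model: predict the label of the closest training point.
    def predict(x):
        nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[nearest]
    return predict

def fit_single(xs, ys, rng):
    # Baseline: one base model trained on the full training set.
    return fit_nearest_neighbor(xs, ys)

def fit_bagged(xs, ys, rng, n_bags=25):
    # Train one base model per bootstrap resample (sampling with replacement).
    models = []
    for _ in range(n_bags):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_nearest_neighbor([xs[i] for i in idx],
                                           [ys[i] for i in idx]))
    # The bagged prediction is the average of the base predictions.
    return lambda x: sum(h(x) for h in models) / len(models)

def prediction_variance(make_model, n_datasets=100, seed=0):
    rng = random.Random(seed)
    test_xs = [i / 10 for i in range(11)]
    models = [make_model(*sample_dataset(rng), rng) for _ in range(n_datasets)]
    total = 0.0
    for x in test_xs:
        preds = [h(x) for h in models]
        avg = sum(preds) / len(preds)
        total += sum((p - avg) ** 2 for p in preds) / len(preds)
    return total / len(test_xs)

var_single = prediction_variance(fit_single)
var_bagged = prediction_variance(fit_bagged)
```

In this setup `var_bagged` should come out lower than `var_single`. How large the reduction is depends on the base model and the data.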

To formally compute variance: 

Draw many training sets (say, one hundred) and train a separate model, for example a regression tree, on each one. For every data point, compute the average prediction across all the models, then the average squared difference between each model's prediction and that average. Averaged over the data points, this is the variance of your classifier.
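The procedure can be sketched in plain Python. To keep the example short, two simple stand-in models replace the regression tree: a 1-nearest-neighbor regressor (flexible) and a constant mean predictor (rigid). The synthetic data distribution and all function names are assumptions for illustration.

```python
import math
import random

def sample_dataset(rng, n=30, sigma=0.3):
    # Hypothetical data distribution: y = sin(3x) plus Gaussian noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(3.0 * x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

def fit_nearest_neighbor(xs, ys):
    # Flexible model: predict the label of the closest training point.
    def predict(x):
        nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[nearest]
    return predict

def fit_mean(xs, ys):
    # Rigid model: always predict the mean training label.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def estimate_variance(fit, n_datasets=100, seed=0):
    rng = random.Random(seed)
    test_xs = [i / 20 for i in range(21)]
    # One model per independently drawn training set.
    models = [fit(*sample_dataset(rng)) for _ in range(n_datasets)]
    total = 0.0
    for x in test_xs:
        preds = [h(x) for h in models]
        avg = sum(preds) / len(preds)  # average prediction, h-bar(x)
        # Average squared difference from the average prediction.
        total += sum((p - avg) ** 2 for p in preds) / len(preds)
    return total / len(test_xs)

var_nn = estimate_variance(fit_nearest_neighbor)
var_mean = estimate_variance(fit_mean)
```

Here the flexible model's predictions change much more from one training set to the next, so `var_nn` should be clearly larger than `var_mean`.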



## Bias 

What is the inherent error that you obtain from your hypothesis function even with infinite training data, i.e., from your average hypothesis? This is due to your hypothesis function being "biased" to a particular kind of solution (e.g. linear classifier).  In other words, bias is inherent to your model.

Bias refers to the error introduced by approximating a real-world problem with a simplified model. A model with high bias tends to underfit the data, meaning it oversimplifies the underlying patterns and fails to capture the complexities in the data. 

A symptom of high bias: training error is higher than epsilon (the threshold for allowed error).

The squared bias is the squared difference between the prediction of the average classifier and the expected label, averaged over data points.

Bias is the error that you would get even if you had unlimited data (because the classifier itself introduces the error).
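A minimal sketch of estimating squared bias, assuming synthetic data whose expected label is a known linear function. The ground truth `expected_label`, the two models, and the helper names are hypothetical. The average classifier is approximated by averaging models trained on many independent training sets.

```python
import random

def expected_label(x):
    # Hypothetical ground truth y-bar(x); known only because the data is synthetic.
    return 2.0 * x

def sample_dataset(rng, n=30, sigma=0.3):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [expected_label(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

def fit_mean(xs, ys):
    # Constant model: cannot represent a slope, so it is biased on this data.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    # Ordinary least-squares line: can represent the true relationship.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def estimate_squared_bias(fit, n_datasets=200, seed=0):
    rng = random.Random(seed)
    test_xs = [i / 20 for i in range(21)]
    models = [fit(*sample_dataset(rng)) for _ in range(n_datasets)]
    total = 0.0
    for x in test_xs:
        avg = sum(h(x) for h in models) / len(models)  # approximates h-bar(x)
        total += (avg - expected_label(x)) ** 2
    return total / len(test_xs)

bias_sq_mean = estimate_squared_bias(fit_mean)
bias_sq_line = estimate_squared_bias(fit_line)
```

The constant model stays far from the expected label no matter how many training sets are averaged, while the line's squared bias should be close to zero.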

To lower bias: 

* Add features

* Increase model complexity: add depth, decrease regularization, or replace a linear model with a non-linear one

* Boosting 




### Expected Label

#### Y-bar

First, let us consider that for any given input x there might not exist a unique label y. For example, if your vector x describes features of a house (e.g. number of bedrooms, square footage) and the label y its price, you could imagine two houses with identical descriptions selling for different prices. So, for any given feature vector x, there is a distribution over possible labels.
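The house example can be sketched as follows: group samples that share the same feature vector and average their labels to get an empirical estimate of the expected label. The sales figures are made up for illustration, and `expected_labels` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical sales: (bedrooms, square_feet) -> sale price. Values are made up.
sales = [
    ((3, 1500), 300_000),
    ((3, 1500), 320_000),
    ((3, 1500), 310_000),
    ((2, 900), 180_000),
    ((2, 900), 200_000),
]

def expected_labels(samples):
    # Empirical estimate of y-bar(x): mean label over samples with identical x.
    grouped = defaultdict(list)
    for x, y in samples:
        grouped[x].append(y)
    return {x: sum(ys) / len(ys) for x, ys in grouped.items()}

y_bar = expected_labels(sales)
```

Identical descriptions map to several different prices, and `y_bar` holds one average price per description.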