# ML Fundamentals

Suggested readings: 
1. Bishop. p38-48
1. Schapire. Boosting: Foundations and Algorithms. p 23-43

Outline
1. Task formulation
1. Loss function
1. Empirical Risk minimization
1. No free lunch Theorem
1. Generalization bounds
1. Complexity of hypothesis space. VC dimension. Rademacher complexity
1. Bias Variance decomposition
1. Overfitting and Regularization
1. Quality Metrics
1. Validation: Leave One Out, Cross-Validation, Hold Out
1. Usual pipeline for model training
1. Optimal Bayesian classifier

## 1 Task Formulation

Here we will consider only classification or regression tasks.

Given a dataset $\{ (x_i, y_i) \}_{i=1}^N$ of i.i.d. objects

Or, equivalently, given:  
$X \in R^{N \times d}$ - feature matrix, where $d$ is the dimension of the feature space and $N$ is the number of objects.  
$Y \in R^{N}$ - target vector

For classification $Y \in \{0, 1, \dots, C-1\}^N$, where $C$ is the number of classes.

We want to find an algorithm (hypothesis) $h \in H$, where $H$ is the hypothesis space, that "assigns the right target value to each object".

Classification
<img src="images/classification.png" style="height:300px">

Regression
<img src="images/regression.png" style="height:300px">

## 2 Loss function

$Loss: Y \times Y \to R$ - loss function that evaluates how bad our prediction for a particular object is.

Some loss functions:
* $Loss(\hat y, y) = (\hat y - y)^2$
* $Loss(\hat y, y) = |\hat y - y|$
* $Loss(\hat y, y) = \frac {|\hat y - y|} {y}$
* $Loss(\hat y, y) = I[\hat y \neq y]$


For binary classification with $y \in \{-1,1\}$, the quantity $z = yh(x)$ is called the margin. A positive margin corresponds to a correct classification, a negative margin corresponds to an error. $|yh(x)| = |h(x)|$ grows with the distance to the decision boundary (for a linear $h$ it is proportional to that distance), so it can be interpreted as confidence in the classification of the object.  
<img src="images/margin1.png" style="height:300px">

Loss functions can also be defined in terms of the margin $z = yh(x)$:
* $Loss(\hat y, y) = \max(0, 1 - yh(x))$ Hinge loss
* $Loss(\hat y, y) = e^{-yh(x)}$ AdaBoost (exponential) loss
* $Loss(\hat y, y) = \log(1 + e^{-yh(x)})$ Logistic loss
* $Loss(\hat y, y) = I[yh(x) < 0]$ Classification error 

<img src="images/margin_loss.png" style="height:300px">
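
The margin-based losses listed above can be compared numerically. The following is a minimal sketch; the function names and the example (label, score) pairs are illustrative assumptions, not part of the original notes.

```python
import math

def hinge_loss(margin):
    """max(0, 1 - z): zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)

def exponential_loss(margin):
    """e^{-z}: the loss associated with AdaBoost."""
    return math.exp(-margin)

def logistic_loss(margin):
    """log(1 + e^{-z}), using the natural logarithm."""
    return math.log(1.0 + math.exp(-margin))

def zero_one_loss(margin):
    """I[z < 0]: classification error."""
    return 1.0 if margin < 0 else 0.0

# Hypothetical (label, score) pairs: y in {-1, +1}, h(x) is a real-valued score.
examples = [(+1, 2.0), (+1, 0.3), (-1, 0.5), (-1, -1.5)]

for y, score in examples:
    z = y * score  # margin
    print(f"z={z:+.1f}  hinge={hinge_loss(z):.3f}  exp={exponential_loss(z):.3f}  "
          f"logistic={logistic_loss(z):.3f}  0-1={zero_one_loss(z):.0f}")
```

Note that the hinge and exponential losses are upper bounds on the classification error, and that all three surrogate losses still penalize correctly classified objects with a small positive margin.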

## Empirical Risk Minimization

In general, we want to minimize the Expected Risk of a hypothesis $h$:  
$R = E_{(x, y) \sim D} [ Loss(h(x), y)] = \int Loss(h(x), y)dP(x,y)$  
For the 0-1 loss this is the probability of error: $R = Pr_{(x, y) \sim D} [h(x) \neq y]$

But since we don't know the joint distribution $P(x,y)$, we can only deal with the Empirical Risk (Loss functional), the average loss over the dataset:  
$\hat R = \frac 1 N \sum_{i=1}^{N}Loss(h(x_i), y_i) $
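
As an illustration of empirical risk minimization, here is a minimal sketch that picks the best hypothesis from a small finite set of 1-D threshold classifiers. The toy dataset, the threshold grid and the helper names are assumptions made for this example; the empirical risk is averaged over the sample.

```python
def zero_one(y_hat, y):
    """0-1 loss: 1 for a wrong prediction, 0 for a correct one."""
    return 1.0 if y_hat != y else 0.0

def make_threshold(t):
    """Hypothesis h_t(x) = +1 if x >= t else -1."""
    return lambda x: 1 if x >= t else -1

def empirical_risk(h, data, loss):
    """Average loss of hypothesis h over the dataset."""
    return sum(loss(h(x), y) for x, y in data) / len(data)

# Hypothetical 1-D dataset of (x, y) pairs with y in {-1, +1}.
data = [(0.1, -1), (0.4, -1), (0.5, 1), (0.55, 1), (0.7, -1), (0.8, 1), (0.9, 1)]

# Finite hypothesis space H: one classifier per threshold.
thresholds = [0.0, 0.25, 0.45, 0.6, 0.75, 1.0]
risks = {t: empirical_risk(make_threshold(t), data, zero_one) for t in thresholds}

best_t = min(risks, key=risks.get)  # empirical risk minimizer
print("best threshold:", best_t, "empirical risk:", risks[best_t])
```

The minimizer is chosen using the training sample only, so its empirical risk is an optimistic estimate of its expected risk.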

## Generalization bounds

Hoeffding’s inequality:
Let $x_1, \dots, x_m$ be independent random variables such that $x_i \in [0,1]$.
Denote their average value by $A_m = \frac 1 m \sum_{i=1}^m x_i$.
Then for any $\epsilon > 0$ we have 

$ Pr[A_m > E[A_m] + \epsilon] \leq e^{-2m\epsilon^2}$

Speaking about risk minimization, we can ask: how well does $\hat R$ approximate $R$?

Given $m$ random examples, a loss with values in $[0,1]$, and a single hypothesis $h$ fixed before seeing the data, for any $\delta > 0$ the following upper bound on the generalization error of $h$ holds with probability at least $1 - \delta$ (set $\delta = e^{-2m\epsilon^2}$ in Hoeffding's inequality and solve for $\epsilon$):  
$R \leq \hat R + \sqrt{ \frac {\ln(1 / \delta)} {2m} }$

For a finite hypothesis space $H$, under the same conditions, the bound holds simultaneously for all $h \in H$:  
$R \leq \hat R + \sqrt{ \frac {\ln|H| + \ln(1 / \delta)} {2m} }$
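
To get a feeling for the numbers, the square-root term of the two bounds above can be evaluated directly. This is a small sketch; the function name `generalization_gap` and the chosen values of $m$, $\delta$ and $|H|$ are assumptions made for illustration.

```python
import math

def generalization_gap(m, delta, num_hypotheses=1):
    """Square-root term of the bound on R - R_hat.

    num_hypotheses=1 gives the single-hypothesis bound; a larger value gives
    the bound for a finite hypothesis space with |H| = num_hypotheses.
    """
    return math.sqrt((math.log(num_hypotheses) + math.log(1.0 / delta)) / (2.0 * m))

for m in (100, 1000, 10000):
    single = generalization_gap(m, delta=0.05)
    finite = generalization_gap(m, delta=0.05, num_hypotheses=10**6)
    print(f"m={m:6d}  single h: {single:.4f}  |H|=1e6: {finite:.4f}")
```

The gap shrinks as $1/\sqrt{m}$ and grows only logarithmically with the size of the hypothesis space.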



VC dimension of a hypothesis space $H$ is defined as the cardinality of the largest set of points that $H$ can shatter, i.e. label in all possible ways.

<img src="images/vc_dim.png" style="height:300px">
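
Shattering can be checked by brute force for simple hypothesis spaces. The sketch below uses 1-D interval classifiers $h(x) = I[a \leq x \leq b]$, a hypothesis space chosen here purely for illustration (it is not taken from the figure), and assumes the points are distinct.

```python
def interval_labelings(points):
    """All labelings of `points` realizable by h(x) = 1 if a <= x <= b else 0."""
    ordered = sorted(points)
    # Endpoints at the points themselves, plus one value on each side of the
    # range, are enough to realize every labeling an interval can produce.
    candidates = [ordered[0] - 1.0] + ordered + [ordered[-1] + 1.0]
    labelings = set()
    for a in candidates:
        for b in candidates:
            labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings

def is_shattered(points):
    """True if intervals realize all 2^n labelings of the n distinct points."""
    return len(interval_labelings(points)) == 2 ** len(points)

print(is_shattered([1.0, 2.0]))       # two points
print(is_shattered([1.0, 2.0, 3.0]))  # three points: labeling (1, 0, 1) is impossible
```

Two points can be shattered by intervals while no three points can, so the VC dimension of this hypothesis space is 2.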

Let $H$ be a hypothesis space of VC-dimension $d < \infty$, and assume that a random training set of size $m$ is chosen where $m \geq d \geq 1$. Then for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ (up to a constant factor):

$R \leq \hat R + O\left(\sqrt{ \frac {d\ln(m / d) + \ln(1 / \delta)} {m} }\right)$

Rademacher complexity:
Suppose now that the labels $y_i$ are chosen at random without regard to the $x_i$. In other words, suppose we replace each $y_i$ by a random variable $\sigma_i$ that is $-1$ or $+1$ with equal probability, independent of everything else. Thus, the $\sigma_i$ represent labels that are pure noise. We can measure how well the space $H$ can fit this noise in expectation by  
$E_{\sigma} [\max_{h \in H} \frac 1 m \sum _{i=1}^m \sigma_i h(x_i)]$, which is called Rademacher complexity.
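
For a finite set of hypotheses the expectation over $\sigma$ can be approximated by Monte Carlo sampling. The following is a rough sketch: the function name, the number of draws and the two example hypothesis sets are assumptions made for illustration.

```python
import itertools
import random

def rademacher_estimate(predictions, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[max_h (1/m) sum_i sigma_i h(x_i)].

    predictions: one vector per hypothesis with entries h(x_i) in {-1, +1}.
    """
    rng = random.Random(seed)
    m = len(predictions[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]  # pure-noise labels
        total += max(sum(s * p for s, p in zip(sigma, preds)) / m
                     for preds in predictions)
    return total / num_draws

# Threshold classifiers h_t(x) = +1 if x >= t else -1 on 10 fixed points.
xs = [i / 10 for i in range(10)]
thresholds = [t / 10 - 0.05 for t in range(11)]
threshold_preds = [[1 if x >= t else -1 for x in xs] for t in thresholds]
print("thresholds:", rademacher_estimate(threshold_preds))

# A space that contains every labeling of 3 points fits any noise perfectly.
all_labelings = [list(p) for p in itertools.product((-1, 1), repeat=3)]
print("all labelings:", rademacher_estimate(all_labelings))
```

A value close to 1 means the hypothesis space can fit random labels almost perfectly; a value close to 0 means it cannot.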

## No free Lunch Theorem

The No Free Lunch Theorems state that any one algorithm that searches for an optimal cost or fitness solution is not universally superior to any other algorithm.

"If an algorithm performs better than random search on some class of problems then it must perform worse than random search on the remaining problems." (No Free Lunch Theorems for Optimisation)

How does that affect machine learning?
Every machine learning algorithm explicitly or implicitly makes some assumptions about the observed data. So for every algorithm we can construct a dataset that violates its assumptions and on which it achieves bad performance.

<img src="images/lunch.png" style="height:200px">

## Bias variance decomposition

Suppose our data is generated by:  
$y = f(x) + \epsilon$, where $\epsilon \sim N(0,\sigma^2)$ is white noise.  
We want to build an estimator such that:  
$\hat y = h(x)$ is our prediction  

Consider MSE regression at a fixed point $x$. The expectation is taken over the noise $\epsilon$ and over random training sets, so $h(x)$ is random while $f(x)$ is deterministic.  

$MSE = E[(y - h(x))^2] $  
$ = E[(y - f(x) + f(x) - h(x))^2]$  
$ = E[(y - f(x))^2] + E[(f(x) - h(x))^2] + 2E[(y - f(x))(f(x) - h(x))]$  
$ = E[\epsilon^2] + E[(f(x) - h(x))^2] + 2(E[yf(x)] - E[yh(x)] - E[f^2(x)] + E[f(x)h(x)] ) $  

Notes:  
$E[f^2(x)] = f^2(x)$ since $f$ is deterministic  
$E[yf(x)] = f^2(x)$ since $E[y] = f(x)$    
$E[yh(x)] = E[f(x)h(x)] + E[\epsilon h(x)] = E[f(x)h(x)] + 0$, since $\epsilon$ has zero mean and is independent of $h(x)$  


$ = E[\epsilon^2] + E[(f(x) - h(x))^2] + 2(f^2(x) - E[f(x)h(x)] - 0 - f^2(x) + E[f(x)h(x)]) $  
$ = E[\epsilon^2] + E[(f(x) - h(x))^2]$  
$ = E[\epsilon^2] + E[(f(x) - E[h(x)] + E[h(x)] -  h(x))^2]$  
$ = E[\epsilon^2] + E[(f(x) - E[h(x)])^2 ] + E[(E[h(x)] -  h(x))^2] + 2E[(E[h(x)] - h(x))(f(x) - E[h(x)])]$  

$ = E[\epsilon^2] + E[(f(x) - E[h(x)])^2 ] + E[(E[h(x)] -  h(x))^2] + 2(E[f(x)E[h(x)]] -E[E[h(x)]^2]  - E[h(x)f(x)] + E[h(x)E[h(x)]])$  

Notes:  
$E[f(x)E[h(x)]] = f(x)E[h(x)]$    
$E[E[h(x)]^2] = E[h(x)]^2$  
$E[f(x)h(x)] = f(x)E[h(x)]$    
$E[h(x)E[h(x)]] = E[h(x)]^2$   

$ = E[\epsilon^2] + E[(f(x) - E[h(x)])^2 ] + E[(E[h(x)] -  h(x))^2] + 2(f(x)E[h(x)] -E[h(x)]^2  - f(x)E[h(x)] + E[h(x)]^2)$  

$ = E[\epsilon^2] + E[(f(x) - E[h(x)])^2 ] + E[(E[h(x)] -  h(x))^2]$  

$ = Var[\epsilon] + E[(f(x) - E[h(x)])^2 ] + Var[h(x)]$   
$ = Var[\epsilon] + bias^2 + Var[h(x)]$   
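
The decomposition can be checked numerically by repeatedly drawing training sets and looking at the prediction at one fixed point. The sketch below is only an illustration under stated assumptions: the true function $f(x) = \sin(2\pi x)$, the k-nearest-neighbour averaging estimator, and all parameter values are hypothetical choices, not part of the derivation above.

```python
import math
import random
import statistics

def f(x):
    """Hypothetical true function."""
    return math.sin(2.0 * math.pi * x)

def sample_train(rng, n, sigma):
    """Draw n points with x ~ U[0, 1] and y = f(x) + Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    return [(x, f(x) + rng.gauss(0.0, sigma)) for x in xs]

def knn_predict(train, x0, k):
    """h(x0): average target of the k training points closest to x0."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def decompose(k, x0=0.3, sigma=0.3, n_train=30, n_trials=3000, seed=0):
    """Monte Carlo estimates of (MSE, noise, bias^2, variance) at the point x0."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_trials):
        h_x0 = knn_predict(sample_train(rng, n_train, sigma), x0, k)
        y0 = f(x0) + rng.gauss(0.0, sigma)  # fresh noisy target at x0
        preds.append(h_x0)
        sq_errors.append((y0 - h_x0) ** 2)
    bias_sq = (f(x0) - statistics.fmean(preds)) ** 2
    variance = statistics.pvariance(preds)
    return statistics.fmean(sq_errors), sigma ** 2, bias_sq, variance

for k in (1, 15):
    mse, noise, bias_sq, variance = decompose(k)
    print(f"k={k:2d}  MSE={mse:.3f}  noise+bias^2+var={noise + bias_sq + variance:.3f}  "
          f"bias^2={bias_sq:.3f}  var={variance:.3f}")
```

Up to sampling error the estimated MSE matches the sum of the three terms. In this setup the flexible estimator (k=1) has low bias and high variance, and the smoother one (k=15) has the opposite profile.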

Training and test error
<img src="images/dec2.png" style="height:400px">

Generalization error
<img src="images/dec1.png" style="height:400px">

## Learning curves

High bias
<img src="images/lc_bias.png" style="height:400px">

High variance
<img src="images/lc_var.png" style="height:400px">

## Overfitting and Regularization
Overfitting is a situation when a model fitted on a train dataset shows noticeably worse performance on a test dataset than on the train dataset.  
It corresponds to the fact that the model learns the given dataset but does not generalize to unseen data from the same distribution.  
Every model overfits to some degree!  

<img src="images/overfitting.png" style="height:300px">


In response to overfitting, there are regularization techniques.  
In general, they correspond to setting restrictions on the hypothesis space, making the generalization bound tighter. 
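
One common way to restrict the hypothesis space is an L2 penalty on the weights (ridge regression). The following minimal sketch uses a 1-D linear model without intercept, for which the penalized least-squares solution has a closed form; the toy data and the function name are assumptions made for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum_i (y_i - w * x_i)^2 + lam * w^2 over w (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical toy data, roughly y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.5]

for lam in (0.0, 1.0, 14.0, 100.0):
    print(f"lambda={lam:6.1f}  w={ridge_slope(xs, ys, lam):.4f}")
```

With `lam=0` this is ordinary least squares; larger values shrink the weight towards zero, which effectively limits the model to a smaller set of hypotheses.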

## Quality Metrics

To effectively optimize our Loss functional by SGD it must be:
1. differentiable almost everywhere
2. represented as a sum of losses on each object

However, sometimes we may want to evaluate the performance of our algorithm with a measure that does not have these properties. Here are some examples:
* AUC
<img src="images/auc.png" style="height:300px">

* F1 score
<img src="images/f1.png" style="height:300px">

<img src="images/f2.png" style="height:200px">

In that case, we usually have 2 approaches:
1. Create some differentiable approximation of the quality metric
1. Or use a classic Loss functional such that its optimum coincides with the optimum of the quality metric
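
Both example metrics depend on the whole set of predictions rather than on a sum of per-object terms. A minimal sketch of how they can be computed is given below; it assumes binary labels in {0, 1}, uses the standard definitions (F1 as the harmonic mean of precision and recall, AUC as the fraction of correctly ordered positive-negative pairs), and the example data are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

print("F1: ", f1_score(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

F1 depends on the chosen threshold, while AUC depends only on the ordering of the scores.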

## Common model training pipeline

1. split the dataset into train, validation and test parts
1. train the model on the train dataset without regularization, try to achieve zero training loss
1. add regularization, watch performance on the validation dataset
1. test the final model performance on the test dataset. Choose between different model families.

On the test dataset the model must be evaluated by the chosen quality metric, not by the loss function used in model optimization.
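
The steps above can be sketched end to end on a toy problem. Everything in this example is an assumption made for illustration: the synthetic data, the 60:20:20 split, the 1-D ridge model, the grid of regularization strengths, and MAE as the quality metric.

```python
import random

def split_dataset(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle and split into train, validation and test parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_val + n_test:], shuffled[:n_val], shuffled[n_val:n_val + n_test]

def fit(train, lam):
    """1-D ridge regression without intercept (lam=0 means no regularization)."""
    return sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + lam)

def mse(w, part):
    """Loss function used for optimization and model selection."""
    return sum((y - w * x) ** 2 for x, y in part) / len(part)

def mae(w, part):
    """Quality metric reported on the test part."""
    return sum(abs(y - w * x) for x, y in part) / len(part)

# Hypothetical data: y = 2x + Gaussian noise.
rng = random.Random(1)
data = []
for _ in range(100):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

train, val, test = split_dataset(data)                                  # step 1
print("train loss without regularization:", mse(fit(train, 0.0), train))  # step 2
lambdas = [0.0, 0.1, 1.0, 10.0]
best_lam = min(lambdas, key=lambda lam: mse(fit(train, lam), val))      # step 3
w = fit(train, best_lam)
print("chosen lambda:", best_lam, " test MAE:", mae(w, test))           # step 4
```

The test part is touched only once, after the regularization strength has been fixed on the validation part.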

Regression
<img src="images/pipeline.png" style="height:600px">

## Validation

Given a dataset for training our model, we usually want to optimize its hyperparameters by measuring model performance on validation data for different choices of hyperparameters. 

## 1 Leave One out
Given a dataset of m objects, create m experiments:  
1. create split (m-1):1
1. train on (m-1) objects
1. evaluate performance on the held-out object
1. change split

Average performance over all experiments

Properties:
1. High variance of estimate
2. Low bias of estimate
3. O(m) complexity
4. Usually done when we have very small dataset

Regression
<img src="images/loo.png" style="height:200px">

## 2 Hold out
Given a dataset of m objects, create a single experiment:  
1. create split train:val, usually in proportion 80:20
1. train on the train subset
1. evaluate performance on the val subset

Properties:
1. Moderate variance of estimate
1. High bias of estimate
1. O(1) complexity
1. Usually done when we have a large dataset and/or a very heavy model


## 3 Cross validation
k = number of folds  
folds = non-intersecting subsets of the dataset  
Given a dataset of m objects, create k experiments:  
1. create split (k-1):1 over folds
1. train on (k-1) folds
1. evaluate performance on the held-out fold
1. change split

Average performance over all experiments

Properties:
1. Low variance of estimate
2. Moderate bias of estimate
3. O(k) complexity
4. Usually done with k=5 or k=10

Regression
<img src="images/cv.png" style="height:600px">
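
The cross-validation procedure can be written generically over a `fit` and an `evaluate` function. The sketch below is a minimal illustration: the helper names, the trivial "predict the mean" model and the toy targets are assumptions. Setting k equal to the number of objects gives leave-one-out.

```python
def k_fold_indices(m, k):
    """Partition indices 0..m-1 into k non-intersecting folds of near-equal size."""
    return [list(range(i, m, k)) for i in range(k)]

def cross_validate(data, k, fit, evaluate):
    """Average validation score over k experiments, each holding out one fold."""
    scores = []
    for val_idx in k_fold_indices(len(data), k):
        held_out = set(val_idx)
        train = [obj for i, obj in enumerate(data) if i not in held_out]
        val = [data[i] for i in val_idx]
        scores.append(evaluate(fit(train), val))
    return sum(scores) / k

def fit_mean(train):
    """Trivial model: always predict the mean of the training targets."""
    return sum(train) / len(train)

def mse(prediction, val):
    return sum((y - prediction) ** 2 for y in val) / len(val)

# Hypothetical targets (no features needed for the mean predictor).
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print("3-fold CV MSE:    ", cross_validate(targets, 3, fit_mean, mse))
print("leave-one-out MSE:", cross_validate(targets, len(targets), fit_mean, mse))
```

In practice the data are usually shuffled before being split into folds; the sketch skips this to stay deterministic.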

## Optimal Bayesian classifier
Suppose we have 2 classes.

<img src="images/bayes.png" style="height:400px">

Bayesian risk:
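$ R = \sum_{x,y} I[h(x) \neq y] P(x,y)c_y$, 
where $c_y$ is the cost of misclassifying an object of class $y$

By applying Bayes rule:  
$P(y |X) = \frac {P(X|y) P(y)} {P(X)}$

The risk is minimized by predicting, for each $X$, the class with the largest cost-weighted posterior $P(y|X)c_y$. Since $P(X)$ does not depend on $y$:

$h(x) = \arg \max_y P(X | y) P(y) c_y$ is the optimal Bayesian classifier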


We assign y = 1 iff:  
$ P(X | y=1) P(y=1)c_1 > P(X | y=0) P(y=0)c_0 $  

$ \frac {P(X | y=1)}{P(X | y=0)} > \frac {P(y=0)c_0} {P(y=1)c_1} $  

Cost function and prior class probabilities are interchangeable: only the product $P(y)c_y$ enters the decision rule.
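
The decision rule can be illustrated when the class-conditional densities are known. In the sketch below they are assumed to be 1-D Gaussians; the means, standard deviations, priors and costs are hypothetical values chosen for the example.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def bayes_classify(x, priors, costs, means, stds):
    """argmax_y P(x | y) P(y) c_y with Gaussian class-conditional densities."""
    scores = {y: gaussian_pdf(x, means[y], stds[y]) * priors[y] * costs[y]
              for y in priors}
    return max(scores, key=scores.get)

# Assumed class-conditional distributions: N(0, 1) for class 0, N(2, 1) for class 1.
means = {0: 0.0, 1: 2.0}
stds = {0: 1.0, 1: 1.0}

equal_priors = {0: 0.5, 1: 0.5}
equal_costs = {0: 1.0, 1: 1.0}
costly_class_1 = {0: 1.0, 1: 5.0}   # missing class 1 is 5 times more expensive
skewed_priors = {0: 1 / 6, 1: 5 / 6}

for x in (0.1, 0.9, 1.1):
    print(x,
          bayes_classify(x, equal_priors, equal_costs, means, stds),
          bayes_classify(x, equal_priors, costly_class_1, means, stds),
          bayes_classify(x, skewed_priors, equal_costs, means, stds))
```

The last two columns coincide: raising the cost of class 1 by a factor of 5 moves the decision boundary exactly as much as making class 1 five times more probable a priori.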