# Learning

## Introduction
Say we have tabular data: a set of variables and the values observed for them. Our objective is to estimate the relation between a *target* ($Y$) and a set of *predictors* ($X_1, X_2, X_3, \ldots, X_p$).

We assume that the real-world relationship between $Y$ and $X = (X_1, X_2, \ldots, X_p)$ can be written in the form
$$ Y = f(X) + \epsilon $$
where $f$ is some fixed but unknown function of $X$ and $\epsilon$ is a random *error term* that is independent of $X$ and has mean zero.
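
To make the setup concrete, here is a minimal simulation sketch of $Y = f(X) + \epsilon$. The linear `f`, the noise scale and the sample size are assumptions chosen only for illustration; in practice $f$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" relation, chosen only for illustration.
    return 2.0 + 3.0 * x

n = 100_000
X = rng.uniform(0.0, 1.0, size=n)
# Mean-zero noise, drawn independently of X.
eps = rng.normal(loc=0.0, scale=0.5, size=n)
Y = f(X) + eps

# Both values should be close to zero for a large sample.
print("sample mean of eps:", eps.mean())
print("sample correlation of X and eps:", np.corrcoef(X, eps)[0, 1])
```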

## Motivation for Estimation of f
We are interested in estimating $f$ for two reasons: prediction and inference.
### Prediction
Assume that we have come up with an estimate $\hat{f}$ of $f$. The resulting prediction $\hat{Y}$ is then
$$\hat{Y} = \hat{f}(X)$$
Here we are not concerned with the exact form of $\hat{f}$; we mainly care about the accuracy of $\hat{Y}$.

The accuracy of $\hat{Y}$ depends on two quantities: *reducible error* and *irreducible error*. The reducible error comes from our choice of $\hat{f}$ and can be lowered with a better estimate of $f$. But even if we had a perfect estimate of $f$, there would still be some error ($\epsilon$), which is independent of $X$ and cannot be predicted from it. This is called the irreducible error. The quantity $\epsilon$ may also contain the effect of unmeasured variables that are not part of $X$.

Assume that the estimate $\hat{f}$ and the input $X$ are fixed. Then $\hat{Y}$ does not change, but $Y$ still varies because it also depends on the error term. The expected squared error can then be written as

\begin{align*}
E[(Y - \hat{Y})^2] &= E[(f(X) + \epsilon - \hat{f}(X))^2] \\
    &= E[(f(X) - \hat{f}(X))^2] + E[\epsilon^2] + 2(f(X) - \hat{f}(X))E[\epsilon] \\
    &= E[(f(X) - \hat{f}(X))^2] + E[\epsilon^2] \text{ ( } \because E[\epsilon] = 0)
\end{align*}

The first term is the squared difference between $f$ and its estimate $\hat{f}$; this is the *reducible error*. The second term is the *variance of* $\epsilon$, because $Var(\epsilon) = E[\epsilon^2] - (E[\epsilon])^2$ and $E[\epsilon] = 0$; this is the *irreducible error*.
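
The split into reducible and irreducible error can be checked numerically. The sketch below fixes one input `x` and one deliberately imperfect estimate `f_hat`; both functions and the noise level `sigma` are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical true relation.
    return np.sin(x)

def f_hat(x):
    # Hypothetical fixed estimate, deliberately imperfect.
    return 0.5 * x

x = 1.0        # fixed input
sigma = 0.3    # standard deviation of the error term

# Only the error term varies; f(x) and f_hat(x) stay fixed.
eps = rng.normal(0.0, sigma, size=1_000_000)
y = f(x) + eps
y_hat = f_hat(x)

expected_sq_error = np.mean((y - y_hat) ** 2)
reducible = (f(x) - f_hat(x)) ** 2
irreducible = sigma ** 2

print("simulated E[(Y - Y_hat)^2]:", expected_sq_error)
print("reducible + irreducible   :", reducible + irreducible)
```

The two printed numbers should agree up to sampling noise. Improving `f_hat` shrinks only the first component; the `sigma ** 2` part stays.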

### Inference
In this case we want to understand the relationship between $X$ and $Y$, for example:
* Which predictors are associated with the response?
* What is the relationship between the response and each predictor?
* Can these relationships be represented linearly, or do they need a more complex curve?

## Estimation of f
### Parametric Estimation
We assume a form for the model and estimate its parameters from the training data. For a linear model we assume that
$$ f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p$$
Then we use the training data to estimate the values of $\beta_i$ so that the function fits the data as closely as possible.
$$ Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p$$

This can be very useful for inferring the relationship between each $X_i$ and $Y$. But the accuracy of the estimate depends largely on the model we chose: if the assumed form is far from the true $f$, the estimate will be poor.
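
As a minimal sketch of the parametric approach, the $\beta_i$ of a linear model can be estimated by ordinary least squares. Least squares is one common fitting criterion and is assumed here; the data and the coefficients in `beta_true` are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 200, 3
# Hypothetical coefficients beta_0 .. beta_3 used to generate the data.
beta_true = np.array([1.0, 2.0, -1.0, 0.5])

X = rng.normal(size=(n, p))
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that beta_0 acts as the intercept.
A = np.column_stack([np.ones(n), X])

# Least-squares estimate of (beta_0, ..., beta_p).
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("true coefficients     :", beta_true)
print("estimated coefficients:", beta_hat)
```

Because the data here really are linear, the estimates land close to `beta_true`. With a non-linear true $f$ the same code would still run, but the fit would be systematically off.
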
### Non Parametric Estimation 
In non-parametric estimation we make no explicit assumption about the form of $f$. Instead, the value of the target at a point is decided from the closest data points. This approach can capture very complex patterns, but it needs a large amount of training data to give accurate results.
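
One simple way to predict from the closest data points is k-nearest-neighbours averaging. The sketch below is an illustrative choice, not the only non-parametric method; `knn_predict` is a hypothetical helper for a single predictor.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=5):
    """Predict each query point as the mean target of its k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    # Distance from every query point to every training point (1-D predictor).
    dists = np.abs(x_new[:, None] - x_train[None, :])
    # Indices of the k closest training points for each query point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

# The two nearest neighbours of 2.1 are x = 2 and x = 3, so the
# prediction is the average of their targets, 4 and 9.
pred = knn_predict(x_train, y_train, [2.1], k=2)
print(pred)
```

No functional form is assumed anywhere. The price is that predictions are only as good as the local coverage of the training data.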

As we have seen, there is a trade-off between the accuracy and the interpretability of models. Accuracy mainly depends on the flexibility of the model, but the more flexible the model, the harder it is to infer the relationship between the predictors and the target.

## Assessing Model Accuracy
Generally, the quality of fit is measured by the mean squared error
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
There are two scenarios in which we use $MSE$: *training* $MSE$ is computed on the training data and is used to fit the model, while *test* $MSE$ is computed on data not used for fitting and is used to validate the model once it is trained.

We are looking for the model with the lowest *test* $MSE$, not the lowest *training* $MSE$. It is natural to assume that the model with the lowest *training* $MSE$ also has the lowest *test* $MSE$, but unfortunately that is not the case.
The commonly observed trend is that as we increase the complexity of the model, *training* $MSE$ keeps dropping, while *test* $MSE$ drops up to a certain point and then increases again.
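
The following sketch compares training and test $MSE$ for polynomial fits of increasing degree. The true relation `f`, the noise level and the sample sizes are assumptions made for the example, and polynomial degree stands in for model complexity.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Hypothetical true relation, used only to generate example data.
    return np.sin(3.0 * x)

def mse(y, y_hat):
    # Mean squared error between observed and predicted values.
    return np.mean((y - y_hat) ** 2)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = f(x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

degrees = list(range(1, 9))
train_mse, test_mse = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, d)  # least-squares polynomial fit
    train_mse.append(mse(y_train, np.polyval(coeffs, x_train)))
    test_mse.append(mse(y_test, np.polyval(coeffs, x_test)))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d}: train MSE = {tr:.4f}, test MSE = {te:.4f}")
```

Training $MSE$ cannot increase with the degree here, since each polynomial family contains the previous one. The test $MSE$ from a single random sample is noisier, so the U-shape may be less clean than the idealized picture.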

## Bias-Variance Trade-Off

Let's assume that we repeat the process of choosing a data set, building a model and making a prediction for some input $x$ a large number of times.

To formalize the above statement:
<br>
$$ y = f(x) + \epsilon$$<br>
$$ y_i = f_i(x)$$<br>

Here $y$ is the observed target value for predictor $x$.<br>
$f$ is the real relation between $y$ and $x$.<br>
$f_i$ and $y_i$ are the $i^{th}$ model and its prediction for $x$.



We need to find the expectation of the squared error of the prediction at $x$.

We are going to use the following rules:<br>

1. $f(x)$ is constant, so $E[f(x)] = f(x)$. Similarly $E[f(x)^2] = f(x)^2$.<br>
2. The mean of $\epsilon$ is zero, i.e. $E[\epsilon] = 0$.<br>
3. $f_i$ and $\epsilon$ can be assumed independent, i.e. $E[f_i(x)\epsilon] = E[f_i(x)]E[\epsilon]$.<br>
4. $E[z^2] = Var[z] + E[z]^2$<br>

Now to the derivation:

\begin{align*}
 E[(y - y_i)^2] &= E[(f(x) + \epsilon - f_i(x))^2] \\
      &= E[(f(x) - f_i(x))^2] + E[\epsilon^2] + 2 E[(f(x) - f_i(x))\epsilon] \\
      &= E[(f(x) - f_i(x))^2] + E[\epsilon^2] + 2 E[f(x)\epsilon] - 2 E[f_i(x)\epsilon] \\
      &= E[(f(x) - f_i(x))^2] + E[\epsilon^2] + 2 f(x) E[\epsilon] - 2 E[f_i(x)] E[\epsilon] \text{ ( }\because 1 \text{ and } 3) \\
      &= E[(f(x) - f_i(x))^2] + E[\epsilon^2] \text{ ( } \because 2) \\
      &= E[(f(x) - f_i(x))^2] + Var(\epsilon) + (E[\epsilon])^2 \text{ ( } \because 4) \\
      &= E[(f(x) - f_i(x))^2] + Var(\epsilon) \text{ ( } \because 2) \\
      &= E[f(x)^2] + E[f_i(x)^2] - 2 f(x) E[f_i(x)] + Var(\epsilon) \text{ ( } \because 1)\\
      &= f(x)^2 + E[f_i(x)]^2 - 2 f(x) E[f_i(x)] + Var(f_i(x)) + Var(\epsilon) \text{ ( } \because 1 \text{ and } 4)\\
      &= (f(x) - E[f_i(x)])^2 + Var(f_i(x)) + Var(\epsilon)
\end{align*}

The term $(f(x) - E[f_i(x)])^2$ is the squared *bias*. It shows how close the average predicted relation is to the true relation. As the complexity of the model increases, the *bias* decreases because the model can approximate the true relation more accurately. So simpler models have high bias and complex models have low bias.

The term $Var(f_i(x))$ is called the *variance*. It shows how much the prediction varies when the training data set changes. The more complex the model, the more its predictions vary with changes in the training data. So simpler models have lower variance, and the variance increases as the model gets more complex.

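Together with the irreducible error $Var(\epsilon)$, these two quantities determine the expected error, which is why we see the behaviour of *test* $MSE$ with respect to complexity described above.
This can also be used to explain the concepts of overfitting and underfitting: a model that is too simple underfits (high bias), while a model that is too complex overfits (high variance).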
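
The decomposition can be illustrated by simulation: draw many training sets, fit a model $f_i$ on each, and look at the predictions $f_i(x)$ at one fixed point. In this sketch the true relation `f`, the noise level `sigma`, the point `x0` and the use of polynomial fits are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    # Hypothetical true relation.
    return np.sin(3.0 * x)

sigma = 0.3        # standard deviation of the error term
n_train = 30       # size of each training set
n_repeats = 2000   # number of data sets / models f_i
x0 = 0.5           # fixed input at which we predict

def simulate(degree):
    """Return (squared bias, variance, expected squared error) at x0."""
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        # A fresh training set gives a fresh model f_i.
        x = rng.uniform(-1.0, 1.0, size=n_train)
        y = f(x) + rng.normal(0.0, sigma, size=n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x0)
    # Fresh noisy targets at x0, independent of the fitted models.
    y0 = f(x0) + rng.normal(0.0, sigma, size=n_repeats)
    bias_sq = (f(x0) - preds.mean()) ** 2
    variance = preds.var()
    total = np.mean((y0 - preds) ** 2)
    return bias_sq, variance, total

for degree in (1, 3, 7):
    b, v, t = simulate(degree)
    print(f"degree {degree}: bias^2 = {b:.4f}, variance = {v:.4f}, "
          f"bias^2 + variance + sigma^2 = {b + v + sigma ** 2:.4f}, "
          f"simulated error = {t:.4f}")
```

For each degree, the last two printed numbers should agree up to sampling noise, matching the final line of the derivation.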