# Model Validation (ESL)
This notebook expands on the concepts presented in Chapter 7, *Model Assessment and Selection*, of *The Elements of Statistical Learning* (ESL). The topics are not necessarily examined in the order in which they appear in the chapter; they follow a convenient line of thought instead.

## Extra-sample and In-sample errors
It is worth elaborating on the concepts of *extra-sample* and *in-sample* error covered in Section 7.4 and the following sections of the chapter. Using the same notation as the book, let
$$ \mathcal{T} = \{(x_1, y_1), \ldots, (x_N, y_N)\} $$
be the training set,
and consider a resampling of the responses at the same points $x_i$ (with a small change in notation: my $y_i'$ corresponds to the $Y^0_i$ of the book):
$$ \mathcal{T}' = \{ (x_1, y_1'), \ldots, (x_N, y_N') \} $$
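For a given loss function $L(y_i, \hat{f}(x_i))$ we can calculate the *training error*
$$ \bar{err} = \frac{1}{N}\sum_{i=1}^{N}{L(y_i, \hat{f}(x_i))} $$
the *in-sample error* (an expectation over the resampled responses $y_i'$, with the fit $\hat{f}$ held fixed)
$$ Err_{in} = \frac{1}{N}\sum_{i=1}^{N}{E_{y_i'}[L(y_i', \hat{f}(x_i))]} $$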
and the *optimism*
$$ Err_{in} - \bar{err} $$


In [100]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def linear_model(beta, X, noise=0):
    '''Returns X @ beta plus Gaussian noise (standard deviation `noise`) as a column vector'''
    return (np.matmul(X,beta) + noise*np.random.randn(np.shape(X)[0])).reshape(-1,1)

n_samples = 50
n_features = 1
noise = 30
X = np.random.randn(n_samples, n_features)*5
beta  = np.random.randn(n_features)*40
y = linear_model(beta,X,noise)

model = LinearRegression()
model.fit(X,y)
tilde_y = model.predict(X)

n_resamples = 30
Y_res = np.zeros([n_samples, n_resamples])
err_in_sample = np.zeros([n_resamples,1])
# resample
for i in range(n_resamples):
    ys = linear_model(beta,X,noise)
    Y_res[:,i] = ys.reshape(n_samples)
    err_in_sample[i] = mean_squared_error(ys, model.predict(X))

exp_in_sample = np.mean(err_in_sample)
train_err = mean_squared_error(y, model.predict(X))
print("Training error {0}".format(train_err))
print("In sample mean error {0}".format(exp_in_sample))
print("Average optimism {0}".format(exp_in_sample-train_err))

Training error 663.2259363096656
In sample mean error 899.9026352085289
Average optimism 236.67669889886326


The analysis above gives a partial confirmation: for this one training set, the in-sample error averaged over the resamples exceeds the training error, which is consistent with the claim that the optimism is, on average, positive and thus that the training error is too good (*i.e.*, too optimistic) as a predictor of the actual error. This is quite intuitive, but the analytical argument for it is not obvious to me.
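A single training set only gives one draw of the optimism. As a rough check of the "on average" part, the sketch below repeats the experiment over many training sets that share the same inputs but have freshly drawn responses. It is a minimal sketch under assumptions that are not in the cell above: a fixed seed, an intercept fitted with `np.linalg.lstsq` instead of `LinearRegression`, and the illustrative names `n_train_sets`, `signal`, `mean_optimism`, and `frac_positive`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only

n_samples, noise = 50, 30.0
n_train_sets, n_resamples = 2000, 100  # illustrative sizes, not from the notebook

X = rng.normal(size=(n_samples, 1)) * 5
beta = rng.normal(size=1) * 40
A = np.hstack([np.ones((n_samples, 1)), X])  # design matrix with an intercept column
signal = X @ beta                            # noiseless responses at the fixed inputs

optimism = np.empty(n_train_sets)
for t in range(n_train_sets):
    # Draw a new training response vector at the same inputs and fit by least squares.
    y = signal + noise * rng.normal(size=n_samples)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    train_err = np.mean((y - y_hat) ** 2)

    # Approximate the in-sample error by averaging over resampled responses.
    y_new = signal[:, None] + noise * rng.normal(size=(n_samples, n_resamples))
    err_in = np.mean((y_new - y_hat[:, None]) ** 2)

    optimism[t] = err_in - train_err

mean_optimism = optimism.mean()
frac_positive = np.mean(optimism > 0)
print("Mean optimism over training sets: {0}".format(mean_optimism))
print("Fraction of training sets with positive optimism: {0}".format(frac_positive))
```

The mean should come out positive, while the fraction of positive draws is typically well below one: individual training sets can have negative optimism even though the average does not.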

It seems to me that the same phenomenon that biases the training error downward should also be at work when we use the training error as a predictor of the test error. In that case the effect should be even larger, because we are no longer predicting at the same $x_i$ used to construct the fit, but at points $x'$ that (presumably) we have not yet seen.
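This intuition can be probed numerically. The sketch below compares three quantities averaged over many training sets: the training error, the in-sample error (new responses at the training inputs), and an extra-sample error (new inputs and new responses). Everything specific here is an assumption made for illustration: the small training size `n_samples = 10` (chosen so the gap is visible), the fixed slope `beta = 40.0`, the inputs being redrawn for every training set, and the use of `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed: an assumption, for reproducibility only

n_samples, noise, x_scale = 10, 30.0, 5.0  # small N is a deliberate assumption
n_train_sets, n_eval = 4000, 200           # illustrative Monte Carlo sizes
beta = 40.0                                # hypothetical true slope

train_err = np.empty(n_train_sets)
err_in = np.empty(n_train_sets)
err_extra = np.empty(n_train_sets)

for t in range(n_train_sets):
    # New training set: both inputs and responses are redrawn.
    x = x_scale * rng.normal(size=n_samples)
    y = beta * x + noise * rng.normal(size=n_samples)
    A = np.column_stack([np.ones(n_samples), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    train_err[t] = np.mean((y - y_hat) ** 2)

    # In-sample error: new responses at the SAME inputs.
    y_same_x = beta * x[:, None] + noise * rng.normal(size=(n_samples, n_eval))
    err_in[t] = np.mean((y_same_x - y_hat[:, None]) ** 2)

    # Extra-sample error: new inputs from the same distribution, with new responses.
    x_new = x_scale * rng.normal(size=n_samples * n_eval)
    y_new = beta * x_new + noise * rng.normal(size=x_new.size)
    err_extra[t] = np.mean((y_new - (coef[0] + coef[1] * x_new)) ** 2)

print("Mean training error     {0}".format(train_err.mean()))
print("Mean in-sample error    {0}".format(err_in.mean()))
print("Mean extra-sample error {0}".format(err_extra.mean()))
```

For a correctly specified linear model the gap between the extra-sample and in-sample errors tends to be much smaller than the gap between the in-sample and training errors, so the ordering only shows up reliably after averaging over many training sets.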

The claim made in the text is that
$$ E[Err_{in}-\bar{err}] = \frac{2}{N}\sum_{i=1}^{N}{Cov(\hat{y}_i, y_i)} $$
The proof of this result (outlined in Exercise 7.4) could shed some light on the nature of this phenomenon, so it is worth pursuing in order to understand it better.
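
For squared-error loss, one short route to the identity is a direct expansion. This is a sketch under stated assumptions, and not necessarily the route suggested in the exercise: the $x_i$ are held fixed, $\hat{y}_i = \hat{f}(x_i)$ depends only on the training responses, each $y_i'$ is drawn independently of the training set from the same distribution as $y_i$, and the expectation is taken over both the training responses and the $y_i'$.

$$
\begin{aligned}
E[(y_i' - \hat{y}_i)^2] - E[(y_i - \hat{y}_i)^2]
&= E[(y_i')^2] - E[y_i^2] - 2E[y_i'\hat{y}_i] + 2E[y_i\hat{y}_i] \\
&= 2\left(E[y_i\hat{y}_i] - E[y_i]E[\hat{y}_i]\right) \\
&= 2\,Cov(\hat{y}_i, y_i)
\end{aligned}
$$

The second line uses $E[(y_i')^2] = E[y_i^2]$ (same distribution) and $E[y_i'\hat{y}_i] = E[y_i']E[\hat{y}_i] = E[y_i]E[\hat{y}_i]$ (independence of $y_i'$ from the training set). Averaging over $i = 1, \ldots, N$ gives $\frac{2}{N}\sum_{i=1}^{N} Cov(\hat{y}_i, y_i)$: the more strongly the fit at $x_i$ depends on the observed $y_i$, the larger the optimism.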