## Model Accuracy

## Links

- https://www.pythonforengineers.com/cross-validation-and-model-selection/
- https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

## 1. The basics

Suppose we have to fit a model to a set of data. We would like to know how well the model describes the data, that is, the **model accuracy**. Knowing the model accuracy helps us decide whether the model is useful at all and helps us choose among multiple competing models.

The **first thing** to do before getting into any details is to look at your data and your model fit. This will give you a sense of the data and the model and may reveal anomalies (or bugs) in your data or model that you should fix up front.

To quantify model accuracy, we could use the same metric of squared error (or, alternatively, absolute error) that we discussed in the context of model fitting. However, this metric is difficult to interpret because its magnitude and range depend on the units of the data and the number of data points.
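
As a minimal sketch of why raw squared error is hard to interpret, the example below uses made-up measurements and a made-up model fit (both are assumptions for illustration only). The same fit yields a different squared error when the units change or when the dataset is simply repeated.

```python
import numpy as np

# Hypothetical measurements (in meters) and a hypothetical model fit.
data_m = np.array([1.0, 2.0, 3.0, 4.0])
fit_m = np.array([1.1, 1.9, 3.2, 3.8])


def squared_error(data, fit):
    """Sum of squared residuals between the data and the model fit."""
    return float(np.sum((data - fit) ** 2))


# Squared error in the original units.
sse_m = squared_error(data_m, fit_m)

# Same data and fit expressed in centimeters: the error grows by 100**2.
sse_cm = squared_error(100 * data_m, 100 * fit_m)

# Same data and fit repeated twice: the error doubles.
sse_doubled = squared_error(np.tile(data_m, 2), np.tile(fit_m, 2))

print(sse_m, sse_cm, sse_doubled)
```

The quality of the fit is identical in all three cases, yet the metric changes, which motivates a normalized metric.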

## 2. Coefficient of determination ($R^2$)

A useful metric for model accuracy is the **coefficient of determination ($R^2$)**. To define it, we first need the concept of variance.

**Variance** - technically, variance is the square of the standard deviation

$$\text{variance} = \frac{\sum_{i=1}^{n}(x_i-\overline{x})^2}{n-1}$$

where:
- $n$ - the number of data points
- $x_i$ - the i-th data point
- $\overline{x}$ - mean of the data points

For intuition, we can ignore the denominator and think of variance as the total squared deviation of a set of data points from their mean.
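
The variance formula can be checked numerically. The sketch below uses an arbitrary example array (an assumption, not data from this notebook) and compares the hand-computed value with NumPy's `np.var` using `ddof=1`, which applies the same $n-1$ denominator.

```python
import numpy as np

# Arbitrary example data points.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# Numerator of the formula: total squared deviation from the mean.
total_squared_deviation = np.sum((x - x.mean()) ** 2)

# Variance with the n - 1 denominator.
variance = total_squared_deviation / (n - 1)

# The variance is the square of the standard deviation.
standard_deviation = np.std(x, ddof=1)

print(total_squared_deviation, variance, standard_deviation ** 2)
```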

**$R^2$** is the percentage of variance explained by a model:

$$R^2 = 100 \times \left(1 - \frac{\text{unexplained variance}}{\text{total variance}}\right) = 100 \times \left(1 - \frac{\sum_{i=1}^{n}(d_i-m_i)^2}{\sum_{i=1}^{n}(d_i-\overline{d})^2}\right)$$

where:
- $n$ - number of data points
- $d_i$ - the i-th data point
- $m_i$ - model fit for the i-th data point
- $\overline{d}$ - the mean of the data points

![%D0%B8%D0%B7%D0%BE%D0%B1%D1%80%D0%B0%D0%B6%D0%B5%D0%BD%D0%B8%D0%B5.png](attachment:%D0%B8%D0%B7%D0%BE%D0%B1%D1%80%D0%B0%D0%B6%D0%B5%D0%BD%D0%B8%D0%B5.png)

There are two main components of the formula. 

The first component is in the numerator and is the sum of the squares of the residuals (which is just the usual concept of squared error).

The second component is in the denominator and is the sum of the squares of the deviations of the data points from their mean. 

The way the formula works is to quantify the variance that *is not* explained by the model (the numerator), express that as a fraction of the total variance (the denominator), subtract the result from one so that we get the variance that *is* explained, and then multiply by 100 to obtain the percentage. (The denominator of the variance formula, $n-1$, cancels out in the computation of the ratio.)

The $R^2$ metric has an upper bound of 100% (corresponding to the case where a model matches the data exactly) and does not have a lower bound (since a model can be arbitrarily bad). An $R^2$ of 0% is achieved by a model that gets the mean of the data correct and nothing else.
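
A minimal sketch of the $R^2$ formula follows. The function name `r_squared_percent` and the example data are assumptions for illustration. The three cases correspond to the bounds described above: a perfect model, a model that only gets the mean right, and a model that is worse than the mean.

```python
import numpy as np


def r_squared_percent(data, model):
    """Percentage of variance in `data` explained by `model`."""
    data = np.asarray(data, dtype=float)
    model = np.asarray(model, dtype=float)
    unexplained = np.sum((data - model) ** 2)     # sum of squared residuals
    total = np.sum((data - data.mean()) ** 2)     # squared deviations from the mean
    return 100.0 * (1.0 - unexplained / total)


d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_perfect = r_squared_percent(d, d)                             # 100.0
r2_mean_only = r_squared_percent(d, np.full_like(d, d.mean()))   # 0.0
r2_bad = r_squared_percent(d, -d)                                # -2100.0

print(r2_perfect, r2_mean_only, r2_bad)
```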

## 3. Accuracy on the sample VS accuracy on the population

Now that we understand the $R^2$ metric, we could calculate it for the data and the model fit that we have. This approach is fine if characterizing the set of data that we have collected is the goal of the modeling effort.

However, the observed set of data is just one sample from the population, that is, the distribution that underlies the data-collection process. We are probably less interested in how well the fitted model describes the sample than in how well it describes the population. The problem is that the accuracy of a model evaluated on the data used to fit the model will, on average, overestimate the true accuracy of the model.

For example, suppose we knew the true model $t$ (which can be thought of as the underlying function that generates the observed data). Imagine that we measure the accuracy of model $t$ by obtaining a dataset and calculating how well the model describes the data. The resulting accuracy level $R^2_{true}$  reflects the true accuracy of model $t$. Now suppose we allowed the parameters of the model to vary in order to fit the dataset; this produces fitted model $f$.

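If we were to calculate how well the fitted model describes the dataset, the resulting accuracy level $R^2_{fitted}$ would be larger than the original accuracy level, because the fitted parameters are chosen to minimize the error on this particular dataset, noise included: $R^2_{fitted} > R^2_{true}$.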

Furthermore, since the parameters of fitted model $f$ differ from the parameters of true model $t$, the fitted model will not perform as well as the true model in describing new data. Thus, if we were to obtain a new dataset and calculate how well the fitted model describes the data, the resulting accuracy level $R^2_{generalize}$ would be less than the accuracy level of the true model, $R^2_{true}$. We can summarize the various relationships as follows:

 $$R^2_{fitted} > R^2_{true} > R^2_{generalize}$$

To keep things simpler, what this means is that after fitting a model to a set of data, the $R^2$ value of the fitted model on the observed sample will be larger than the $R^2$ value of the true model on the population, which in turn will be larger than the $R^2$ value of the fitted model on the population. In practice, we do not know the true model, so the bottom line is that the $R^2$ on the observed sample will, on average, be an overestimate of the $R^2$ on the population.
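
The ordering $R^2_{fitted} > R^2_{true} > R^2_{generalize}$ can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true model is taken to be a line with made-up parameters, the noise is Gaussian, and the results are averaged over many simulated datasets because the ordering holds on average rather than for every single dataset.

```python
import numpy as np

rng = np.random.default_rng(0)


def r_squared_percent(data, model):
    """Percentage of variance in `data` explained by `model`."""
    unexplained = np.sum((data - model) ** 2)
    total = np.sum((data - data.mean()) ** 2)
    return 100.0 * (1.0 - unexplained / total)


# Hypothetical true model t: a line with known slope and intercept.
x = np.linspace(0.0, 1.0, 10)
true_values = 2.0 * x + 1.0
noise_sd = 0.5
n_repeats = 5000

r2_true, r2_fitted, r2_generalize = [], [], []
for _ in range(n_repeats):
    # One observed sample and one independent new sample from the population.
    sample = true_values + rng.normal(0.0, noise_sd, size=x.size)
    new_sample = true_values + rng.normal(0.0, noise_sd, size=x.size)

    # Fitted model f: let the line's parameters vary to fit the observed sample.
    slope, intercept = np.polyfit(x, sample, deg=1)
    fitted_values = slope * x + intercept

    r2_true.append(r_squared_percent(sample, true_values))
    r2_fitted.append(r_squared_percent(sample, fitted_values))
    r2_generalize.append(r_squared_percent(new_sample, fitted_values))

mean_true = float(np.mean(r2_true))
mean_fitted = float(np.mean(r2_fitted))
mean_generalize = float(np.mean(r2_generalize))

print("fitted:", mean_fitted, "true:", mean_true, "generalize:", mean_generalize)
```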

## 4. Cross-validation

![%D0%B8%D0%B7%D0%BE%D0%B1%D1%80%D0%B0%D0%B6%D0%B5%D0%BD%D0%B8%D0%B5.png](attachment:%D0%B8%D0%B7%D0%BE%D0%B1%D1%80%D0%B0%D0%B6%D0%B5%D0%BD%D0%B8%D0%B5.png)

To quantify the accuracy of a fitted model, we can use the technique of **cross-validation**. The idea is simple:
- First, use a set of *training* data to fit the parameters of a model
- Then, use an independent set of *testing* data to evaluate the accuracy of the fitted model

There are various flavors of cross-validation. 

In **leave-one-out cross-validation**, a single data point is omitted from the fitting process and the fitted model is used to predict that data point. The process is repeated for each of the remaining data points. Finally, the model predictions of the data points are aggregated and then compared to the data using some metric (such as **$R^2$**).

In [28]:
# Leave-one-out cross-validation (LOOCV)

import numpy as np
from sklearn.model_selection import LeaveOneOut

# Two samples with two features each, and one target value per sample
x = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])

loo = LeaveOneOut()

# Each iteration holds out exactly one sample for testing
for train_index, test_index in loo.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(x_train, x_test, y_train, y_test)

TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]
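
The cell above only shows how the indices are split. As a sketch of the full leave-one-out procedure described earlier, the example below fits a line to simulated data (the data, the linear model, and the helper `r_squared_percent` are assumptions for illustration), predicts each held-out point, aggregates the predictions, and then computes $R^2$.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)

# Simulated data: a noisy line (assumed for illustration).
x = np.linspace(0.0, 1.0, 12)
y = 3.0 * x + 0.5 + rng.normal(0.0, 0.3, size=x.size)


def r_squared_percent(data, model):
    """Percentage of variance in `data` explained by `model`."""
    unexplained = np.sum((data - model) ** 2)
    total = np.sum((data - data.mean()) ** 2)
    return 100.0 * (1.0 - unexplained / total)


# Predict each data point from a model fitted to all the other points.
predictions = np.full(y.shape, np.nan)
for train_index, test_index in LeaveOneOut().split(x):
    slope, intercept = np.polyfit(x[train_index], y[train_index], deg=1)
    predictions[test_index] = slope * x[test_index] + intercept

# Aggregate the held-out predictions and compare them to the data.
r2_loo = r_squared_percent(y, predictions)

# For comparison: accuracy measured on the same data used for fitting.
slope_all, intercept_all = np.polyfit(x, y, deg=1)
r2_in_sample = r_squared_percent(y, slope_all * x + intercept_all)

print("in-sample R^2:", r2_in_sample, "leave-one-out R^2:", r2_loo)
```

For a least-squares fit like this one, the leave-one-out $R^2$ cannot exceed the in-sample $R^2$.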


In **K-fold cross-validation**, the dataset is randomly divided into *k* parts, and then the process proceeds as usual (omit one part from the fitting process, use the fitted model to predict that part, omit the next part, etc.). Finally, there can be simple forms of cross-validation, such as collecting two sets of data and using one for training and the other for testing.

![%D0%B8%D0%B7%D0%BE%D0%B1%D1%80%D0%B0%D0%B6%D0%B5%D0%BD%D0%B8%D0%B5.png](attachment:%D0%B8%D0%B7%D0%BE%D0%B1%D1%80%D0%B0%D0%B6%D0%B5%D0%BD%D0%B8%D0%B5.png)

In K-fold cross-validation we split the data into *k* different subsets (or folds). We use *k-1* subsets to train the model and hold out the remaining fold as test data, repeating the process so that each fold serves as the test data exactly once. We then average the accuracy across the folds to obtain the cross-validated estimate. If a separate test set was set aside beforehand, the finalized model can then be evaluated against it.

In [3]:
import numpy as np

from sklearn.model_selection import KFold

# Feature matrix: four samples with two features each
x = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

# Target values, one per sample
y = np.array([1, 2, 3, 4])

# Create two folds (two separate subsets); by default KFold does not shuffle
kf = KFold(n_splits=2)

In [19]:
# Show created slices
for train_index, test_index in kf.split(x):
    
    print('TRAIN:', train_index, 'TEST:', test_index)
    
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
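
As a sketch of K-fold cross-validation with an actual model fit, the example below runs 5-fold cross-validation on simulated data (the data and the linear model are assumptions for illustration). `shuffle=True` is passed so that the division into folds is random; `random_state` only makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)

# Simulated data: a noisy line (assumed for illustration).
x = np.linspace(0.0, 1.0, 20)
y = 3.0 * x + 0.5 + rng.normal(0.0, 0.3, size=x.size)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

predictions = np.full(y.shape, np.nan)
train_sizes = []
for train_index, test_index in kf.split(x):
    train_sizes.append(len(train_index))
    # Fit on k - 1 folds, then predict the held-out fold.
    slope, intercept = np.polyfit(x[train_index], y[train_index], deg=1)
    predictions[test_index] = slope * x[test_index] + intercept

# R^2 (in percent) between the aggregated held-out predictions and the data.
unexplained = np.sum((y - predictions) ** 2)
total = np.sum((y - y.mean()) ** 2)
r2_cv = 100.0 * (1.0 - unexplained / total)

print("training-set sizes:", train_sizes)
print("5-fold cross-validated R^2:", r2_cv)
```

With 20 data points and 5 folds, each model is fitted on 16 points, that is, 80% of the dataset.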


Which cross-validation scheme to use depends on balancing computational time against model performance:

1. The larger the number of cross-validation iterations, the more computational time is needed.
2. The larger the amount of data used to fit the model, the better the estimates of the model parameters and the higher the model accuracy.
3. The larger the amount of data used in the accuracy calculation, the more reliable the accuracy estimate.

When performing cross-validation with more than one iteration, a tricky issue is that a different fitted model is obtained on each iteration. The accuracy level that is obtained can be interpreted as the expected accuracy of the model when fitted with a particular amount of data.

For example, suppose we perform 5-fold cross-validation. The $R^2$ between the model predictions and the data can be interpreted as an estimate of the accuracy of the model when fitted on a dataset whose size is equal to 80% of the number of data points in the observed dataset.

When dividing a dataset into different parts for training and testing, it is important to ensure that the division is as strict as possible. It is all too easy to slip and allow some dependencies between the training and testing data, which will invalidate the idea that the performance on the testing data is an unbiased estimate of model accuracy.
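
One common way such a dependency creeps in is preprocessing that uses statistics computed from the whole dataset. The sketch below is an illustrative assumption, not a step taken elsewhere in this notebook: it standardizes simulated values using the mean and standard deviation of the training part only, and computes the whole-dataset mean just to show that it differs.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated measurements, split into a training part and a testing part.
values = rng.normal(10.0, 2.0, size=20)
train, test = values[:15], values[15:]

# Strict version: statistics come from the training data only.
train_mean = train.mean()
train_sd = train.std(ddof=1)
train_scaled = (train - train_mean) / train_sd
test_scaled = (test - train_mean) / train_sd

# Leaky version (for contrast only): the mean also depends on the test data.
leaky_mean = values.mean()

print("training-only mean:", train_mean, "whole-dataset mean:", leaky_mean)
```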

## 5. Overfitting

The concept of **overfitting** is useful for thinking about why cross-validation is important.

For any given set of data, some of the data is signal and some of the data is noise. When we fit a model to data, we should be careful not to fit too much of the data, i.e. overfit the data. This is because if we were to fit all of the data, we would have fit not only the signal but also the noise in the data.

Fitting the noise in the data is undesirable since the resulting model will deviate from the true model. In practice, we of course do not know which part of the data is signal and which part is noise. But what we can do is use cross-validation to help us determine when overfitting is occurring.

For example, suppose the true model underlying a set of data is quadratic, and suppose we are fitting a model consisting of polynomials of increasing degree (a constant regressor, a linear regressor, a quadratic regressor, a cubic regressor, etc.). As we increase the maximum degree of the polynomials in the model, we will invariably fit the data better and better. However, beyond a certain point, the improvements in fit will mostly reflect fitting the noise in the data instead of the signal. To determine when this overfitting occurs, we can examine the cross-validation performance of the model.
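
The polynomial example can be sketched as follows. The quadratic coefficients, the noise level, and the number of data points are assumptions chosen for illustration. For each maximum degree, the code computes the in-sample $R^2$ and the leave-one-out cross-validated $R^2$ so that the two can be compared.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data from a hypothetical quadratic true model plus noise.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.5, size=x.size)


def r_squared_percent(data, model):
    """Percentage of variance in `data` explained by `model`."""
    unexplained = np.sum((data - model) ** 2)
    total = np.sum((data - data.mean()) ** 2)
    return 100.0 * (1.0 - unexplained / total)


degrees = list(range(0, 7))
r2_in_sample, r2_cross_validated = [], []
for degree in degrees:
    # Accuracy on the same data used for fitting.
    coefficients = np.polyfit(x, y, deg=degree)
    r2_in_sample.append(r_squared_percent(y, np.polyval(coefficients, x)))

    # Leave-one-out cross-validation: predict each point from the others.
    predictions = np.empty_like(y)
    for i in range(x.size):
        keep = np.arange(x.size) != i
        coefficients_i = np.polyfit(x[keep], y[keep], deg=degree)
        predictions[i] = np.polyval(coefficients_i, x[i])
    r2_cross_validated.append(r_squared_percent(y, predictions))

for degree, r2_in, r2_cv in zip(degrees, r2_in_sample, r2_cross_validated):
    print(f"degree {degree}: in-sample {r2_in:.1f}%, cross-validated {r2_cv:.1f}%")
```

The in-sample $R^2$ never decreases as the degree grows, whereas the cross-validated $R^2$ is free to drop once the extra terms start fitting noise.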

## 6. Simple models VS complex models

It is useful to think of model complexity as a dimension along which models vary.

Complex, flexible models have the potential to describe many different types of functions. The advantage of such models is that the true model (i.e. the model that most accurately describes the population from which the data are sampled) may in fact be contained in the set of models that can be described. The disadvantage of such models is that they have many free parameters and it may be difficult to obtain good parameter estimates with limited or noisy data. 

On the other hand, simple, less flexible models describe fewer types of functions compared to complex models. The advantage of simple models is that they have fewer free parameters and so it becomes feasible to obtain good parameter estimates with limited or noisy data. The disadvantage of simple models is that the types of functions that can be described may be poor approximations to the true underlying functions.

Suppose that for some reason you really want to fit a complex model to some data, but you find that the model is overfit. There really isn't any magical solution: you have to either collect more data, reduce the noise level, change how the data are sampled, or use some combination of these approaches.
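
The effect of collecting more data can be sketched with a simulation. Everything here is an assumption for illustration: a hypothetical quadratic true function, Gaussian noise, and a deliberately over-flexible degree-6 polynomial. The code compares how well the fitted model describes fresh data when it is fitted on 15 versus 200 data points, averaged over repeated simulations.

```python
import numpy as np

rng = np.random.default_rng(5)


def true_function(x):
    """Hypothetical true model used to simulate data."""
    return 1.0 + 2.0 * x + 3.0 * x**2


def r_squared_percent(data, model):
    """Percentage of variance in `data` explained by `model`."""
    unexplained = np.sum((data - model) ** 2)
    total = np.sum((data - data.mean()) ** 2)
    return 100.0 * (1.0 - unexplained / total)


def mean_generalization_r2(n_train, degree=6, noise_sd=0.5, n_repeats=200):
    """Average R^2 on fresh data for a polynomial fitted to n_train points."""
    x_train = np.linspace(-1.0, 1.0, n_train)
    x_test = np.linspace(-1.0, 1.0, 500)
    scores = []
    for _ in range(n_repeats):
        y_train = true_function(x_train) + rng.normal(0.0, noise_sd, size=n_train)
        y_test = true_function(x_test) + rng.normal(0.0, noise_sd, size=x_test.size)
        coefficients = np.polyfit(x_train, y_train, deg=degree)
        scores.append(r_squared_percent(y_test, np.polyval(coefficients, x_test)))
    return float(np.mean(scores))


r2_small = mean_generalization_r2(n_train=15)
r2_large = mean_generalization_r2(n_train=200)

print("fitted on 15 points:", r2_small)
print("fitted on 200 points:", r2_large)
```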

*A priori*, it is impossible to say whether a simple or complex model will yield the most accurate fitted model for a given dataset, since this depends on the amount of data available, the nature of the underlying effect, etc. We just have to try different models and see which one works the best.