# How to compute evaluation metrics in Python
This template shows two ways to compute the metrics used for performance evaluation and feature selection: by hand with NumPy and sklearn, and automatically with statsmodels.

### Fisher's F-statistic

A statistical test is a procedure for deciding whether the data at our disposal support or contradict a hypothesis. This hypothesis, called the null hypothesis and noted $H_{0}$, has consequences for the properties of the observed data if it is true. These properties are summarized by a test statistic, whose value indicates how compatible the observed data are with $H_{0}$.

Fisher's F-statistic can be used to test the following hypotheses:


* When the Fisher test is applied to the model as a whole, the null hypothesis, noted $H_{0}$, is "the variables chosen to construct the model are not jointly significant in describing the target variable". If the hypothesis is true, the F-statistic follows a Fisher distribution (F-distribution) with parameters $(p, n - p - 1)$, where $p$ is the number of explanatory variables and $n$ is the number of observations used to train the model. However, if the value of the F-statistic, noted $F$, falls outside the most probable region of this distribution, we can reject the null hypothesis and conclude that the chosen model has real explanatory power for the target variable.

This may seem a little far-fetched, but all statistical tests work this way. We make an assumption which, if it held, would cause the test statistic to follow a given distribution. If the observed value of the statistic lands too far from the probable range of that distribution, we are allowed to reject the null hypothesis.

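Mathematically, the F-statistic is written:

$$
F = \frac{SSE / p}{SSR / (n - p - 1)}
$$

where $SSE$ is the explained sum of squares, $SSR$ is the residual sum of squares, $p$ is the number of explanatory variables and $n$ is the number of observations.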
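As an illustration, the model-level F-statistic can be computed from the sums of squares, dividing each one by its degrees of freedom. This is a minimal sketch, assuming the small example dataset used later in this template, an ordinary least squares fit with an intercept, and `scipy.stats.f` for the tail probability; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Example data (the same values as in the hands-on section of this template)
X = np.array([
    [1, 3, 5, 6, 7],
    [4.6, 3.7, 3.4, 3.0, 3.1],
]).transpose()
Y = np.array([2.1, 3.5, 4.4, 5.6, 5.9])

n, p = X.shape  # n observations, p explanatory variables

# Ordinary least squares with an explicit intercept column of ones
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
Y_hat = X_design @ beta

SSE = np.sum((Y_hat - Y.mean()) ** 2)  # explained sum of squares
SSR = np.sum((Y - Y_hat) ** 2)         # residual sum of squares

# Each sum of squares is divided by its degrees of freedom
F = (SSE / p) / (SSR / (n - p - 1))

# Probability of an F value at least this large if H0 were true
p_value = stats.f.sf(F, p, n - p - 1)

print(f"F = {F:.2f}, p-value = {p_value:.4f}")
```

On this dataset the result should agree with the `F-statistic` and `Prob (F-statistic)` entries of the statsmodels summary shown at the end of this template.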

The F-test can also compare two nested models: model 1, which includes "model_1_variables", and model 2, which includes "model_1_variables + $X_d$". In this case, the null hypothesis is that the simpler model (model 1) describes the target variable as well as model 2. If it holds, the F-statistic follows an F-distribution with parameters $(p_{2} - p_{1}, n - p_{2} - 1)$, where $p_{1}$ and $p_{2}$ are the numbers of explanatory variables in each model and $SSR_{1}$ and $SSR_{2}$ are their residual sums of squares. The formula for F is then:

$$
F = \left(\frac{SSR_{1}-SSR_{2}}{p_{2}-p_{1}}\right)\left(\frac{n-p_{2}-1}{SSR_{2}}\right)
$$

If the value of the F-statistic falls in an unlikely region of the F-distribution, the null hypothesis is rejected and the test suggests that the more complex model 2 provides significant additional information compared to the simpler model 1.
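The nested comparison can be sketched in a few lines. The example below is an assumption-laden illustration rather than a reference implementation: it takes SSR to mean the residual sum of squares, counts $p_1$ and $p_2$ as numbers of explanatory variables (intercept excluded), uses the first column of the example data as model 1 and both columns as model 2, and relies on a hypothetical helper `residual_sum_of_squares`.

```python
import numpy as np
from scipy import stats


def residual_sum_of_squares(X, y):
    """Fit OLS with an intercept and return the residual sum of squares."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))


# Example data (the same values as in the hands-on section of this template)
X = np.array([
    [1, 3, 5, 6, 7],
    [4.6, 3.7, 3.4, 3.0, 3.1],
]).transpose()
Y = np.array([2.1, 3.5, 4.4, 5.6, 5.9])

n = len(Y)
X_model_1 = X[:, :1]  # model 1: first variable only
X_model_2 = X         # model 2: model 1 plus one extra variable
p1, p2 = X_model_1.shape[1], X_model_2.shape[1]

ssr_1 = residual_sum_of_squares(X_model_1, Y)
ssr_2 = residual_sum_of_squares(X_model_2, Y)

# Drop in residual error per added variable, relative to the
# residual variance of the larger model
F_nested = ((ssr_1 - ssr_2) / (p2 - p1)) / (ssr_2 / (n - p2 - 1))
p_value_nested = stats.f.sf(F_nested, p2 - p1, n - p2 - 1)

print(f"F = {F_nested:.3f}, p-value = {p_value_nested:.3f}")
```

A small p-value would suggest that the extra variable adds information; with only five observations, the test has very little power, so the result should be read with caution.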

Graphically the F-test can be illustrated as follows:

![F-statistic](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/curve.png)

The figure shows the probability density of the F-distribution. As in any test, we define a level $\alpha$ between 0 and 1 that determines the size of the rejection region. Very often, we choose $\alpha = 5\%$ when no specific knowledge helps us adjust our standards. The F-test is one-sided: only large values of F allow us to reject the hypothesis. More precisely, if the value of F lies in the upper tail of the expected distribution, which carries a probability of 5%, then the hypothesis is rejected at the $1 - \alpha = 95\%$ confidence level.
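The boundary of the rejection region can be computed directly. The sketch below assumes degrees of freedom of 2 and 2, matching the `Df Model` and `Df Residuals` entries of the statsmodels summary at the end of this template, and uses the F value reported there as the observed statistic.

```python
from scipy import stats

alpha = 0.05
dfn, dfd = 2, 2  # degrees of freedom: Df Model and Df Residuals

# Value above which only a fraction alpha of the F-distribution lies
critical_value = stats.f.ppf(1 - alpha, dfn, dfd)

F_observed = 75.30  # F-statistic reported in the statsmodels summary
reject_h0 = F_observed > critical_value

print(f"critical value = {critical_value:.2f}, reject H0: {reject_h0}")
# prints: critical value = 19.00, reject H0: True
```

Because the test is one-sided, the whole 5% is placed in the upper tail, which is why `ppf(1 - alpha)` is used rather than a two-sided split.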

This first metric lets us test the hypothesis that the explanatory variables have no influence on the target variable. We will now look at metrics that indicate the performance level of the model.

* $R^2_{adjusted}$

$R^2_{adjusted}$ is a modified version of $R^2$ that penalizes the number of explanatory variables selected to build the model. Its mathematical formula is:

$$
R^2_{adjusted} = 1-\frac{n-1}{n-p-1}(1-R^2)
$$

Where $p$ is the number of explanatory variables used and $n$ is the number of observations used. The growth of $R^2$ as a function of $p$ is offset by the growth of the factor $\frac{n-1}{n-p-1}$, which multiplies $(1-R^2)$. Consequently, if the information contributed by an additional explanatory variable is not significant enough, $R^2_{adjusted}$ decreases. This makes it possible to use this indicator to compare the performance of models that do not have the same number of explanatory variables.
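The penalty can be seen with a small helper. This is a sketch: `adjusted_r2` is a hypothetical function name, the first call uses the $R^2$ value from the worked example in this template, and the second pair of calls uses made-up numbers chosen only to show the effect of adding a weak variable.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R squared for n observations and p explanatory variables."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R squared requires n > p + 1")
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)


# R squared from the worked example in this template (n = 5, p = 2)
r2_adj_example = adjusted_r2(0.9868946261331196, n=5, p=2)
print(r2_adj_example)

# Hypothetical numbers: a third variable raises R squared only slightly
before = adjusted_r2(0.900, n=20, p=2)
after = adjusted_r2(0.901, n=20, p=3)
print(after < before)  # True: the small gain does not offset the penalty
```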

* P-values

P-values make it possible to evaluate the contribution of each explanatory variable individually, as opposed to evaluating the model as a whole. Unfortunately, they cannot easily be computed with sklearn, so we will introduce the statsmodels library, which calculates all the important metrics automatically. The p-value of a coefficient is the probability of observing an estimate at least as extreme as the one obtained if the parameter's true value were 0, in other words if the variable brought no information to the linear model. Usually, a p-value below 5% means that the variable is considered significant; otherwise it is considered not significant. However, the threshold depends on the context and the standards of the industry you are working in: for example, web marketing agencies typically accept less strict thresholds than pharmaceutical companies because their goals and constraints are fundamentally different.
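To make the definition concrete, coefficient p-values can be reproduced by hand with a two-sided t-test on each coefficient. The sketch below assumes the usual ordinary least squares setting (independent, normally distributed errors with constant variance) and the example dataset of this template; the labels `const`, `x1` and `x2` are illustrative.

```python
import numpy as np
from scipy import stats

# Example data (the same values as in the hands-on section of this template)
X = np.array([
    [1, 3, 5, 6, 7],
    [4.6, 3.7, 3.4, 3.0, 3.1],
]).transpose()
Y = np.array([2.1, 3.5, 4.4, 5.6, 5.9])

n, p = X.shape
design = np.column_stack([np.ones(n), X])  # intercept plus variables

# Least squares estimate of the coefficients
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
residuals = Y - design @ beta
dof = n - p - 1  # residual degrees of freedom

# Estimated error variance and standard error of each coefficient
sigma2 = residuals @ residuals / dof
cov_beta = sigma2 * np.linalg.inv(design.T @ design)
std_err = np.sqrt(np.diag(cov_beta))

# Two-sided test of "true coefficient equals 0" for each coefficient
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, coef, pv in zip(["const", "x1", "x2"], beta, p_values):
    print(f"{name}: coef = {coef:.4f}, p-value = {pv:.4f}")
```

These values should match the `P>|t|` column that statsmodels prints in its coefficient table.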


## Computation by hand

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression

In [2]:
# SST, SSE and SSR have to be calculated manually
# generate some example data
X = np.array([
    [1,3,5,6,7],
    [4.6, 3.7, 3.4, 3.0, 3.1]
]).transpose()
Y = np.array([2.1, 3.5, 4.4, 5.6, 5.9])

model = LinearRegression() # create an instance of the model

model.fit(X,Y) # fit the model

# calculate evaluation metrics
SST = np.sum(np.square(Y - np.mean(Y)))
print("Sum of Square Total {}".format(SST))

SSE = np.sum(np.square(model.predict(X) - np.mean(Y)))
print("Sum of Square Explained {}".format(SSE))

SSR = np.sum(np.square(Y - model.predict(X)))
print("Sum of Square Residual {}".format(SSR))
print("\n")

# calculate R square and adjusted R-square
R_2 = 1 - SSR/SST
print("R square {}".format(R_2))
R_2_alt = model.score(X,Y) # alternative method to calculate R square
print("R square {}".format(R_2_alt))
n = X.shape[0]
p = X.shape[1]
R_2_adj = 1 - (n-1)/(n-p-1)*(1-R_2)
print("R square adjusted {}".format(R_2_adj))


Sum of Square Total 9.74
Sum of Square Explained 9.612353658536577
Sum of Square Residual 0.1276463414634147


R square 0.9868946261331196
R square 0.9868946261331196
R square adjusted 0.9737892522662392


## Computation with statsmodels

In [3]:
# alternative solution with library statsmodels (useful mainly for linear models)
import statsmodels.api as sm

# the intercept (coefficient beta_0) is not included automatically,
# so we manually add a constant variable equal to one
X2 = sm.add_constant(X)
est = sm.OLS(Y, X2)
est2 = est.fit()
print("\n")
print("-----------------------------------------------------------------------------------------")
print("------------------------Results from statsmodels-----------------------------------------")
print("-----------------------------------------------------------------------------------------")
print("\n")
print(est2.summary())





-----------------------------------------------------------------------------------------
------------------------Results from statsmodels-----------------------------------------
-----------------------------------------------------------------------------------------


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.987
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     75.30
Date:                Thu, 23 Jun 2022   Prob (F-statistic):             0.0131
Time:                        17:06:35   Log-Likelihood:                 2.0751
No. Observations:                   5   AIC:                             1.850
Df Residuals:                       2   BIC:                            0.6781
Df Model:                           2                                         
Covariance Type

