## F-test for Linear Regression

The F-test for linear regression tests whether any of the independent variables in a multiple linear regression model are significant. The F-test of overall significance indicates whether our linear regression model provides a better fit to the data than a model that contains no independent variables.

The hypotheses for the F-test of overall significance are as follows:
- Null hypothesis: The fit of the intercept-only model and our model are equal
- Alternative hypothesis: The intercept-only model fits the data significantly worse than our model

If the p-value for the F-test of overall significance is less than the significance level, we can reject the null hypothesis and conclude that our model provides a better fit than the intercept-only model.

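To test whether there is a regression relationship between the response variable Y and the set of predictor variables X1, X2, ..., Xk, we test the hypotheses:

- H0: β₁ = β₂ = ... = βₖ = 0
- Ha: at least one βⱼ ≠ 0

With a single predictor, as in the example that follows, these reduce to H0: β₁ = 0 versus Ha: β₁ ≠ 0.

We use the test statistic:

- F = MSR/MSE, where MSR is the regression mean square and MSE is the error mean square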
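
The statistic can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: the function name `f_statistic` and its parameter `n_predictors` are hypothetical, and the fit is obtained with `np.polyfit` purely for convenience. The data are the bills and tips values used in the cells that follow.

```python
import numpy as np

def f_statistic(y, y_hat, n_predictors):
    """Return F = MSR / MSE for a fitted linear model that includes an intercept."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
    sse = np.sum((y - y_hat) ** 2)         # error sum of squares
    msr = ssr / n_predictors               # regression df = number of predictors
    mse = sse / (n - n_predictors - 1)     # error df = n - k - 1
    return msr / mse

# Bills (x) and tips (y) from this notebook
x_demo = np.array([34, 108, 64, 88, 99, 51], dtype=float)
y_demo = np.array([5, 17, 11, 8, 14, 5], dtype=float)

# np.polyfit returns coefficients from the highest power down: [slope, intercept]
slope, intercept = np.polyfit(x_demo, y_demo, 1)
F_demo = f_statistic(y_demo, intercept + slope * x_demo, n_predictors=1)
print(f'F = {F_demo:.2f}')
```

With one predictor the regression degrees of freedom are 1 and the error degrees of freedom are n - 2, which matches the divisors used in the hand calculation.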

#### Import library

In [2]:
# import library

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# import library for linear regression

import statsmodels.api as sm

%matplotlib inline

#### Create Dataset

In [3]:
# create the dataset and assign x and y
# artificial data: bills (independent variable) and tips given (dependent variable)

x = [34, 108, 64, 88, 99, 51]
y = [5, 17, 11, 8, 14, 5]
df = pd.DataFrame({"bills" : x, "tips" : y})
x = df.bills
y = df.tips
df

|   | bills | tips |
|---|---|---|
| 0 | 34 | 5 |
| 1 | 108 | 17 |
| 2 | 64 | 11 |
| 3 | 88 | 8 |
| 4 | 99 | 14 |
| 5 | 51 | 5 |


#### Parameter of linear regression

In [4]:
# calculate parameters

x = [34, 108, 64, 88, 99 ,51] 
y = [5, 17, 11, 8, 14, 5]
df = pd.DataFrame({"x" : x, "y" : y})
df['x-x̄'] = df['x'] - np.mean(df['x'])
df['y-ȳ'] = df['y'] - np.mean(df['y'])
df['(x-x̄)(y-ȳ)'] = df['x-x̄'] * df['y-ȳ']
df['(x-x̄)^2'] = df['x-x̄']**2
df['(y-ȳ)^2'] = df['y-ȳ']**2
df

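|   | x | y | x-x̄ | y-ȳ | (x-x̄)(y-ȳ) | (x-x̄)^2 | (y-ȳ)^2 |
|---|---|---|---|---|---|---|---|
| 0 | 34 | 5 | -40.0 | -5.0 | 200.0 | 1600.0 | 25.0 |
| 1 | 108 | 17 | 34.0 | 7.0 | 238.0 | 1156.0 | 49.0 |
| 2 | 64 | 11 | -10.0 | 1.0 | -10.0 | 100.0 | 1.0 |
| 3 | 88 | 8 | 14.0 | -2.0 | -28.0 | 196.0 | 4.0 |
| 4 | 99 | 14 | 25.0 | 4.0 | 100.0 | 625.0 | 16.0 |
| 5 | 51 | 5 | -23.0 | -5.0 | 115.0 | 529.0 | 25.0 |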


In [5]:
# calculate the slope of the regression line: b1 = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)^2

b1 = np.round((df['(x-x̄)(y-ȳ)'].sum() / df['(x-x̄)^2'].sum()),4)
b1

0.1462

In [6]:
# calculate the y-intercept of the regression line: b0 = ȳ - b1 * x̄

b0 = np.round(np.mean(df['y']) - b1 * np.mean(df['x']), 4)
b0

-0.8188

In [48]:
# Print linear regression equation
print(f'The linear regression is y = {b0} + {b1}x')

The linear regression is y = -0.8188 + 0.1462x
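
As a cross-check on the hand calculation, the same line can be fitted with `np.polyfit`. This is a sketch added for illustration, not a cell from the original notebook. Because the notebook rounds the slope to four decimals before computing the intercept, the unrounded fit gives a slightly different intercept (about -0.8203 rather than -0.8188).

```python
import numpy as np

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)

# Degree-1 least-squares fit; coefficients come back as [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(f'slope = {slope:.4f}, intercept = {intercept:.4f}')
```

The difference is small, but it carries into the fitted values and sums of squares computed from the rounded coefficients.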


In [8]:
# the calculation needed for F-test

x = [34, 108, 64, 88, 99 ,51] 
y = [5, 17, 11, 8, 14, 5]
df = pd.DataFrame({"x" : x, "y" : y})
df['x-x̄'] = df['x'] - np.mean(df['x'])
df['y-ȳ'] = df['y'] - np.mean(df['y'])
df['(x-x̄)(y-ȳ)'] = df['x-x̄'] * df['y-ȳ']
df['(x-x̄)^2'] = df['x-x̄']**2
df['(y-ȳ)^2'] = df['y-ȳ']**2
df['ŷ'] = b0 + b1 * df['x']
df['(y-ŷ)^2'] = (df['y'] - df['ŷ']) ** 2
df['(ŷ-ȳ)^2'] = (df['ŷ'] - np.mean(df['y'])) ** 2
df

|   | x | y | x-x̄ | y-ȳ | (x-x̄)(y-ȳ) | (x-x̄)^2 | (y-ȳ)^2 | ŷ | (y-ŷ)^2 | (ŷ-ȳ)^2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34 | 5 | -40.0 | -5.0 | 200.0 | 1600.0 | 25.0 | 4.152 | 0.719104 | 34.199104 |
| 1 | 108 | 17 | 34.0 | 7.0 | 238.0 | 1156.0 | 49.0 | 14.9708 | 4.117653 | 24.708853 |
| 2 | 64 | 11 | -10.0 | 1.0 | -10.0 | 100.0 | 1.0 | 8.538 | 6.061444 | 2.137444 |
| 3 | 88 | 8 | 14.0 | -2.0 | -28.0 | 196.0 | 4.0 | 12.0468 | 16.37659 | 4.18939 |
| 4 | 99 | 14 | 25.0 | 4.0 | 100.0 | 625.0 | 16.0 | 13.655 | 0.119025 | 13.359025 |
| 5 | 51 | 5 | -23.0 | -5.0 | 115.0 | 529.0 | 25.0 | 6.6374 | 2.681079 | 11.307079 |


#### Calculate the F statistic

In [47]:
# SSR: regression sum of squares, Σ(ŷ-ȳ)^2
SSR = df['(ŷ-ȳ)^2'].sum()
# SSE: error (residual) sum of squares, Σ(y-ŷ)^2
SSE = df['(y-ŷ)^2'].sum()

# Print SSR and SSE
print(f'Regression sum of squares is {SSR.round(2)}, error sum of squares is {SSE.round(2)}')

Regression sum of squares is 89.9, error sum of squares is 30.07
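
The two sums of squares above should add up to the total sum of squares, Σ(y-ȳ)^2 = 120. With the rounded coefficients they sum to about 119.98, slightly short of 120. The sketch below, which is an added illustration and not an original cell, checks the identity with unrounded coefficients from `np.polyfit`; the names `SST`, `SSR_exact`, and `SSE_exact` are introduced here only for this check.

```python
import numpy as np

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)

# Unrounded least-squares coefficients: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

SST = np.sum((y - y.mean()) ** 2)            # total sum of squares
SSR_exact = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
SSE_exact = np.sum((y - y_hat) ** 2)         # error sum of squares

print(f'SST = {SST:.2f}, SSR + SSE = {SSR_exact + SSE_exact:.2f}')
```

The identity holds exactly only for the least-squares coefficients, which is why rounding the slope and intercept introduces a small gap.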


In [45]:
# MSR: mean square regression (SSR / regression degrees of freedom, here 1)
MSR = SSR / 1

# MSE: mean square error (SSE / error degrees of freedom, n - 2)
MSE = SSE / (len(x)-2)

# Print MSR and MSE
print(f'Mean square regression is {MSR.round(2)}, mean square error is {MSE.round(2)}')

Mean square regression is 89.9, mean square error is 7.52


In [40]:
# Calculate the F-value
F_value = MSR / MSE
print(f'F-value is {F_value.round(2)}')

F-value is 11.96


In [32]:
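# degrees of freedom and significance level
alpha = 0.05
df_regression = 2-1  # number of estimated parameters (b0, b1) minus 1
df_error = len(x)-2  # number of observations minus number of estimated parameters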

In [39]:
# Calculate the critical F-value using the percent point function (inverse of the cdf)
import scipy.stats
f_critical = scipy.stats.f.ppf(1- alpha, df_regression, df_error)
print(f'F_critical is {f_critical.round(2)}')

F_critical is 7.71


In [38]:
# Check the hypothesis
if F_value > f_critical:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Reject the null hypothesis
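
The same decision can be reached through the p-value instead of the critical value. The sketch below is an added illustration, assuming the rounded F-value of 11.96 and the degrees of freedom (1 and 4) from the cells above. It uses the survival function `scipy.stats.f.sf`, which is 1 minus the cdf.

```python
from scipy import stats

F_value = 11.96     # F statistic from the calculation above (rounded)
df_regression = 1   # regression degrees of freedom
df_error = 4        # error degrees of freedom (n - 2 with n = 6)
alpha = 0.05

# P(F > F_value) under the null hypothesis
p_value = stats.f.sf(F_value, df_regression, df_error)
print(f'p-value is {p_value:.4f}')

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Comparing the p-value with alpha is equivalent to comparing the F-value with the critical value, so both approaches lead to the same conclusion.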


#### Conclusion

- Our model y = -0.8188 + 0.1462x is significant
- It fits the data significantly better than the intercept-only model at the 5% significance level (alpha = 0.05)

- The null hypothesis is rejected.
- F-value (11.96) > F-critical (7.71): there is a statistically significant linear relationship between the independent variable x and the dependent variable y, one that is unlikely to be due to chance alone.