# <b>F-Statistics</b>
In linear regression analysis, the F-statistic is the test statistic used to determine whether the overall linear regression model is significant. Specifically, it tests the <span style="color:blue">Null hypothesis:</span> <span style='color:red'>All of the regression coefficients in the model (other than the intercept) are equal to zero,</span> which would mean that there is no linear relationship between the independent variables and the dependent variable.

The F-statistic compares the variation in the dependent variable that is explained by the model (i.e., the "sum of squares due to regression") with the variation that is not explained by the model (i.e., the "sum of squares due to error"), each divided by its degrees of freedom. If the F-statistic is large enough, the explained variation is significantly greater than the unexplained variation, which is evidence that at least one independent variable has a linear relationship with the dependent variable.

Note that the F-statistic does not test whether the true relationship is linear in form. It tests whether the fitted linear model explains significantly more variation than a model containing only an intercept. If the F-statistic is significant, there is evidence of a linear association between the independent variables and the dependent variable, and the regression coefficients and other diagnostic measures can then be used to examine the nature and strength of this relationship.
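Written out, the comparison described above is usually expressed as the ratio below. The symbols are notation chosen for this note rather than taken from the original text: $k$ is the number of predictors, $n$ the number of observations, $SS_{reg}$ the sum of squares due to regression, and $SS_{err}$ the sum of squares due to error.

```latex
F = \frac{SS_{reg} / k}{SS_{err} / (n - k - 1)},
\qquad
H_0:\ \beta_1 = \beta_2 = \dots = \beta_k = 0
```

Under the null hypothesis, and assuming independent, normally distributed errors with constant variance, this ratio follows an F-distribution with $k$ and $n - k - 1$ degrees of freedom, which is where the p-value comes from.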

In [1]:
import numpy as np
import statsmodels.api as sm

# Generate some random data
np.random.seed(123)
x = np.random.normal(0, 1, 100)
y = 2*x + np.random.normal(0, 1, 100)

# Fit a linear regression model
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Calculate the F-statistic and p-value
f_statistic = model.fvalue
p_value = model.f_pvalue

print("F-statistic:", f_statistic)
print(model.summary())

F-statistic: 521.7033489265004
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.842
Model:                            OLS   Adj. R-squared:                  0.840
Method:                 Least Squares   F-statistic:                     521.7
Date:                Tue, 16 May 2023   Prob (F-statistic):           4.95e-41
Time:                        10:20:40   Log-Likelihood:                -138.83
No. Observations:                 100   AIC:                             281.7
Df Residuals:                      98   BIC:                             286.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0191

In [2]:
from sklearn.linear_model import LinearRegression
import scipy.stats 

In [10]:
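# Refit the same data with scikit-learn and compute the F-statistic by hand
lr=LinearRegression().fit(x.reshape(-1, 1),y)
Y_hat=lr.predict(x.reshape(-1, 1))
k=1          # number of predictors
n=len(x)     # number of observations
Ess=np.sum((Y_hat - np.mean(y))**2)   # explained sum of squares
Rrs=np.sum((y - Y_hat)**2)            # residual sum of squares

# F = (explained SS / k) / (residual SS / (n - k - 1))
f_statistics=(Ess/k)/(Rrs/(n-k-1))
print(f_statistics)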

521.7033489265004


In [27]:
scipy.stats.f.sf(f_statistics,k,n-k-1)

4.952865753448413e-41

In [19]:
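# Caution: this is not a valid p-value. The F-test is one-sided, so the factor of 2
# does not belong here, and 1 - cdf rounds to machine epsilon; prefer f.sf as above.
2*(1-scipy.stats.f.cdf(f_statistics,k,n-k-1))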

2.220446049250313e-16
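The two p-value cells above give very different numbers, and the difference is numerical rather than statistical. The following is a minimal sketch of why, using illustrative values close to the ones in this notebook (the hard-coded `f_stat`, `dfn`, and `dfd` are assumptions made so the snippet runs on its own).

```python
from scipy import stats

# Illustrative values only: an F-statistic near 521.7 with (1, 98) degrees of freedom
f_stat, dfn, dfd = 521.7, 1, 98

# Survival function: the upper-tail probability computed directly
p_sf = stats.f.sf(f_stat, dfn, dfd)

# Subtracting from 1 cannot resolve anything smaller than about 1e-16
p_naive = 1.0 - stats.f.cdf(f_stat, dfn, dfd)

print("sf      :", p_sf)
print("1 - cdf :", p_naive)
```

For very small p-values, `sf` keeps the full precision, while `1 - cdf` bottoms out near floating-point epsilon.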

# The R-squared (R2) Score

The R-squared (R2) score is a statistical measure used to evaluate the goodness of fit of a linear regression model. It provides an indication of how well the observed data fits the regression model and how well the independent variables explain the variation in the dependent variable.

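For a least squares model with an intercept, evaluated on the data it was fitted to, the R2 score is a number between 0 and 1, with higher values indicating a better fit of the model. Evaluated on new data, or for a model without an intercept, it can be negative.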

### Mathematical Derivation
Mathematically, the R2 score is defined as the proportion of the total variance in the dependent variable that is explained by the independent variables in the regression model. It is calculated using the following formula:

$R^2 = \frac{SS_{exp}}{SS_{tot}}$

where $SS_{exp}$ is the sum of squares explained, $SS_{res}$ is the sum of squared residuals (i.e., the sum of the squared differences between the observed values and the predicted values), and $SS_{tot}$ is the total sum of squares (i.e., the sum of the squared differences between the observed values and the mean value of the dependent variable).

The total sum of squares can be expressed as the sum of the explained sum of squares and the residual sum of squares:

$SS_{tot} = SS_{exp} + SS_{res}$

Therefore, the R2 score can also be calculated using the following formula:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

### Interpretation of R2 Score
The R2 score ranges from 0 to 1, where an R2 of 0 means that none of the variation in the dependent variable is explained by the independent variables in the model, and an R2 of 1 means that all of the variation in the dependent variable is explained by the independent variables in the model.

Generally, an R2 score above 0.7 is considered a good fit for a regression model, but the threshold may vary depending on the context and the field of study. It is important to note that the R2 score alone does not provide information about the statistical significance of the regression coefficients or the overall significance of the regression model. Therefore, it should be used in conjunction with other statistical tests and diagnostic measures to evaluate the fit of the regression model and the significance of the results.
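The two formulas for R2 can be checked against each other numerically. The sketch below uses synthetic data and NumPy only; the variable names and the deliberately bad "shifted mean" predictor are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Least squares line with an intercept (polyfit returns the highest power first)
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
ss_exp = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ss_res = np.sum((y - y_hat) ** 2)         # residual sum of squares

r2_from_explained = ss_exp / ss_tot
r2_from_residual = 1.0 - ss_res / ss_tot
print(r2_from_explained, r2_from_residual)

# A predictor that is worse than the mean gives a negative score under the
# second formula, so the 0-to-1 range is not guaranteed for arbitrary predictions
bad_prediction = np.full_like(y, y.mean() + 5.0)
r2_bad = 1.0 - np.sum((y - bad_prediction) ** 2) / ss_tot
print(r2_bad)
```

The two expressions agree here because the fit is ordinary least squares with an intercept, which is the case in which $SS_{tot} = SS_{exp} + SS_{res}$ holds.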

In [41]:
from sklearn.metrics import r2_score 


In [44]:
r2_score(y,Y_hat)

0.8418598186216624

### Where the R2 score fails

- One situation is when the model is overfitted to the data. This happens when the model is too complex and fits too closely to the specific data points in the sample. In this case, the R2 score may be high, but the model may not generalize well to new data outside of the sample.

- Another situation is when the relationship between the independent variables and the dependent variable is not linear. The R2 score only measures the goodness of fit of a linear regression model, and does not account for other types of relationships between the variables. This means that the R2 score may be low even if there is a strong relationship between the variables.
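The second situation can be illustrated with a small sketch: a noise-free quadratic relationship that a straight line cannot capture. The helper `r2` and the data are illustrative assumptions.

```python
import numpy as np

def r2(y_true, y_pred):
    """R2 computed as 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

x = np.linspace(-3.0, 3.0, 61)
y = x ** 2   # y is fully determined by x, but not linearly

line = np.polyval(np.polyfit(x, y, deg=1), x)    # straight-line fit
curve = np.polyval(np.polyfit(x, y, deg=2), x)   # quadratic fit

r2_line = r2(y, line)
r2_curve = r2(y, curve)
print("linear fit R2   :", r2_line)
print("quadratic fit R2:", r2_curve)
```

Because `x` is symmetric around zero, the best straight line is essentially flat, so the linear R2 is close to zero even though `y` depends entirely on `x`.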

#### ``Understanding by example``

<span style='color:pink'>When an irrelevant column (also known as a "feature" or "predictor variable") is added to a linear regression model, it can artificially inflate the R2 score, making the model appear to fit the data better. The irrelevant column does not genuinely help predict the dependent variable, but it gives the model an extra parameter to fit. On the training data an extra parameter can never lower the R2 score, and it usually raises it slightly by fitting noise.</span>

<span style='color:pink'>In reality, the irrelevant column is not useful for predicting the dependent variable, and including it in the model can reduce the performance of the model on new data. This is because it introduces noise and makes it more difficult for the model to identify the relevant patterns in the data.</span>

<span style='color:pink'>So, even though the R2 score does not drop (and often rises) when an irrelevant column is added to the model, it is important to carefully evaluate the relevance of each feature to ensure that it is contributing to the prediction of the dependent variable. This can be done using techniques such as feature selection or regularization, which aim to identify and remove irrelevant or redundant features from the model.</span>

In [102]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.feature_selection import f_regression

# Load the California housing dataset
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target

# Create a copy of the dataset with 1003 irrelevant (uniform random) features appended
X_with_irrelevant_feature = np.hstack((X, np.random.rand(X.shape[0], 1003)))

# Train a linear regression model on the original dataset
lr = LinearRegression().fit(X, y)
y_pred_orig = lr.predict(X)
r2_orig = r2_score(y, y_pred_orig)

# Train a linear regression model on the dataset with the irrelevant features added
lr_with_irrelevant = LinearRegression().fit(X_with_irrelevant_feature, y)
y_pred_irrelevant = lr_with_irrelevant.predict(X_with_irrelevant_feature)
r2_with_irrelevant = r2_score(y, y_pred_irrelevant)

# Perform feature selection using F-test to identify the relevant features
f_values, p_values = f_regression(X, y)
relevant_features = np.argwhere(p_values < 0.05).flatten()

# Train a linear regression model using only the relevant features
X_relevant = X.iloc[:, relevant_features]
lr_relevant = LinearRegression().fit(X_relevant, y)
y_pred_relevant = lr_relevant.predict(X_relevant)
r2_relevant = r2_score(y, y_pred_relevant)




In [98]:
print(r2_orig)
print(r2_with_irrelevant)

0.606232685199805
0.6260654099664034


# Adjusted R2

Adjusted R2 can be a good alternative to regular R2 when dealing with models that may contain irrelevant features. It is a modified version of R2 that takes into account the number of predictor variables in the model. Unlike R2, adjusted R2 penalizes every added predictor, so it increases only when a new feature improves the fit by more than would be expected by chance. This helps to mitigate the problem of R2 overstating the performance of a model that contains irrelevant features.

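$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$

where $n$ is the number of observations and $p$ is the number of predictors (not counting the intercept).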
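The formula can be wrapped in a small helper. This is a sketch: the function name `adjusted_r2` is an assumption, and the inputs are the R2 values printed earlier in this notebook for the original model (8 predictors) and the model with 1003 extra random columns (1011 predictors), both with 20640 observations.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for a model with p predictors (intercept not counted) and n observations."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R2 needs n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

adj_small = adjusted_r2(0.606232685199805, n=20640, p=8)
adj_large = adjusted_r2(0.6260654099664034, n=20640, p=1011)

print(adj_small)
print(adj_large)
```

The penalty grows with `p`: the plain R2 values differ by about 0.02, while the adjusted values are almost identical.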

In [99]:
n = X.shape[0]
p = X.shape[1]  # the original model uses every column of X as a predictor
adj_r2_orig = 1 - ((1 - r2_orig) * (n - 1) / (n - p - 1))
print(adj_r2_orig)

0.606079995629818


In [101]:
p_irr=X_with_irrelevant_feature.shape[1]
adj_r2_irr = 1 - ((1 - r2_with_irrelevant) * (n - 1) / (n - p_irr - 1))
print(adj_r2_irr)

0.6068047685090991


#### T-Test (relationship of an individual independent variable with the dependent variable)

The t-test is a statistical test used in linear regression to determine whether a particular predictor variable is significantly associated with the target variable. In other words, it helps us determine whether the coefficient of a predictor variable is statistically different from zero.

<span style="color:pink">The null hypothesis of the t-test is that the coefficient of the predictor variable is zero, meaning that the predictor has no linear relationship with the target variable once the other predictors in the model are accounted for. The alternative hypothesis is that the coefficient is non-zero, indicating that there is such a relationship.</span>

To conduct a t-test in linear regression, we calculate the t-statistic by dividing the coefficient of the predictor variable by its standard error. The standard error measures the variability of the estimate of the coefficient, and the t-statistic measures how many standard errors the estimate is away from zero.

If the t-statistic is large in absolute value (strongly positive or strongly negative), it indicates that the coefficient is significantly different from zero, and we reject the null hypothesis. If the t-statistic is close to zero, the coefficient is not significantly different from zero, and we fail to reject the null hypothesis.

The t-test is useful for judging which predictor variables contribute to explaining the variability of the target variable, and can help us decide which variables to include in our regression model. A large t-statistic reflects strong statistical evidence, not necessarily a large practical effect.

# Mathematical explanation

In linear regression, we model the relationship between a dependent variable Y and one or more independent variables X. For a simple linear regression model with one independent variable, the model is given by:

Y = β0 + β1X + ε

where Y is the dependent variable, X is the independent variable, β0 and β1 are the intercept and slope coefficients, and ε is the error term.

To determine if the slope coefficient β1 is significantly different from zero, we conduct a t-test. The t-test is based on the t-statistic, which is calculated as:

t = (β1 - 0) / SE(β1)

where β1 is the estimated slope coefficient, 0 is the null hypothesis value (which is zero in this case), and SE(β1) is the standard error of the slope coefficient.

The standard error of the slope coefficient is calculated as:

SE(β1) = sqrt[Σ(yi - ŷi)^2 / (n - 2)] / sqrt[Σ(xi - x̄)^2]

where yi is the observed value of Y, ŷi is the predicted value of Y, xi is the observed value of X, x̄ is the mean value of X, and n is the sample size.

The t-statistic follows a t-distribution with n - 2 degrees of freedom. We can use this distribution to calculate the p-value of the t-test, which represents the probability of observing a t-statistic as extreme or more extreme than the one we calculated, assuming the null hypothesis is true.
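The formulas above translate directly into code. The sketch below applies them to the small ten-point data set used in this notebook's F-statistic cells and cross-checks the result against `scipy.stats.linregress`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

x = np.array([2, 4, 6, 8, 10, 3, 5, 7, 9, 11], dtype=float)
y = np.array([70, 85, 90, 95, 100, 75, 80, 92, 94, 105], dtype=float)
n = len(x)

# Least squares slope and intercept
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
y_hat = beta_0 + beta_1 * x

# SE(beta_1) = sqrt(RSS / (n - 2)) / sqrt(sum((x - mean(x))^2))
se_beta_1 = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))

t_stat = beta_1 / se_beta_1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value

check = stats.linregress(x, y)
print("t-statistic:", t_stat, "p-value:", p_value)
print("linregress stderr:", check.stderr, "p-value:", check.pvalue)
```

With a single predictor, the square of this t-statistic equals the model's F-statistic, so the t-test on the slope and the overall F-test reach the same conclusion in simple linear regression.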

# Interpretation

If the p-value is less than our chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that the slope coefficient is significantly different from zero, indicating that there is a significant relationship between the independent variable X and the dependent variable Y. If the p-value is greater than our significance level, we fail to reject the null hypothesis and conclude that there is insufficient evidence to suggest a significant relationship between X and Y.

In [104]:
Coef_=lr.coef_

In [200]:
import numpy as np
from scipy.stats import t

# Load the California housing dataset
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target

# Train a linear regression model on the original dataset
lr = LinearRegression().fit(X, y)
y_pred_orig = lr.predict(X)
n = len(y)
p = X.shape[1]
df = n - p - 1
r2_orig = r2_score(y, y_pred_orig)

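# Compute the standard error and t-test for each slope coefficient.
# Centering X accounts for the intercept, so inv(X_diff' X_diff) equals the
# slope block of inv(X' X) for the design matrix that includes a constant column.
X_mean = np.mean(X, axis=0)
X_diff = X - X_mean
# Despite its name, `sse` holds the residual variance estimate RSS / df
sse = np.sum((y - y_pred_orig) ** 2) / df
se = np.sqrt(np.diag(sse * np.linalg.inv(np.dot(X_diff.T, X_diff))))
t_stat = lr.coef_ / se
p_values = 2 * t.sf(np.abs(t_stat), df)  # two-sided p-values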

# Print the results
for i, col_name in enumerate(X.columns):
    print(f'{col_name}:')
    print(f'Coefficient = {lr.coef_[i]:.4f}')
    print(f'Standard Error = {se[i]:.4f}')
    print(f't-Statistic = {t_stat[i]:.4f}')
    print(f'p-value = {p_values[i]:.4f}')



MedInc:
Coefficient = 0.4367
Standard Error = 0.0042
t-Statistic = 104.0538
p-value = 0.0000
HouseAge:
Coefficient = 0.0094
Standard Error = 0.0004
t-Statistic = 21.1432
p-value = 0.0000
AveRooms:
Coefficient = -0.1073
Standard Error = 0.0059
t-Statistic = -18.2354
p-value = 0.0000
AveBedrms:
Coefficient = 0.6451
Standard Error = 0.0281
t-Statistic = 22.9276
p-value = 0.0000
Population:
Coefficient = -0.0000
Standard Error = 0.0000
t-Statistic = -0.8373
p-value = 0.4024
AveOccup:
Coefficient = -0.0038
Standard Error = 0.0005
t-Statistic = -7.7686
p-value = 0.0000
Latitude:
Coefficient = -0.4213
Standard Error = 0.0072
t-Statistic = -58.5414
p-value = 0.0000
Longitude:
Coefficient = -0.4345
Standard Error = 0.0075
t-Statistic = -57.6822
p-value = 0.0000


In [150]:
# Fit the same model with statsmodels OLS and print its summary, to compare the t-statistics from the manual calculation with the library's values

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            MedHouseVal   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     3970.
Date:                Wed, 10 May 2023   Prob (F-statistic):               0.00
Time:                        22:31:02   Log-Likelihood:                -22624.
No. Observations:               20640   AIC:                         4.527e+04
Df Residuals:                   20631   BIC:                         4.534e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -36.9419      0.659    -56.067      0.0

In [140]:
4.36693293e-01/0.002654

164.54155727204218

In [167]:
lr.intercept_

-36.94192020718445

In [198]:
np.diag(sse * np.linalg.inv(np.dot(X_diff.T, X_diff)))

array([1.76131542e-05, 1.99165349e-07, 3.46376893e-05, 7.91575022e-04,
       2.25548856e-11, 2.37573240e-07, 5.17948835e-05, 5.67444334e-05])

# F-statistics 

In [None]:
import numpy as np

# Input data
X = np.array([2, 4, 6, 8, 10, 3, 5, 7, 9, 11])
Y = np.array([70, 85, 90, 95, 100, 75, 80, 92, 94, 105])

# Calculate mean of X and Y
mean_X = np.mean(X)
mean_Y = np.mean(Y)

# Calculate the regression coefficients
beta_1 = np.sum((X - mean_X) * (Y - mean_Y)) / np.sum((X - mean_X) ** 2)
beta_0 = mean_Y - beta_1 * mean_X

# Calculate predicted Y values
Y_pred = beta_0 + beta_1 * X

# Calculate explained sum of squares (ESS)
ESS = np.sum((Y_pred - mean_Y) ** 2)

# Calculate residual sum of squares (RSS)
RSS = np.sum((Y - Y_pred) ** 2)

# Calculate degrees of freedom
df1 = 1  # Number of predictors
df2 = len(X) - 1 - df1  # Total number of observations - Number of predictors - 1

# Calculate mean squares
MS1 = ESS / df1
MS2 = RSS / df2

# Calculate F-statistic
F = MS1 / MS2

# Output the results
print("Regression Coefficients:")
print("beta_0:", beta_0)
print("beta_1:", beta_1)
print("")

print("Sum of Squares:")
print("ESS:", ESS)
print("RSS:", RSS)
print("")

print("Degrees of Freedom:")
print("df1:", df1)
print("df2:", df2)
print("")

print("Mean Squares:")
print("MS1:", MS1)
print("MS2:", MS2)
print("")

print("F-statistic:", F)

import numpy as np
from scipy.stats import f

# Input data
method_A_scores = np.array([70, 85, 75])
method_B_scores = np.array([90, 95, 80])
method_C_scores = np.array([100, 92, 94, 105])

# Calculate the overall mean
overall_mean = np.mean(np.concatenate((method_A_scores, method_B_scores, method_C_scores)))

# Calculate the group means
group_A_mean = np.mean(method_A_scores)
group_B_mean = np.mean(method_B_scores)
group_C_mean = np.mean(method_C_scores)

# Calculate the total sum of squares (SST)
SST = np.sum((np.concatenate((method_A_scores, method_B_scores, method_C_scores)) - overall_mean) ** 2)

# Calculate the between-group sum of squares (SSB)
SSB = np.sum((np.array([group_A_mean] * len(method_A_scores)) - overall_mean) ** 2) + \
      np.sum((np.array([group_B_mean] * len(method_B_scores)) - overall_mean) ** 2) + \
      np.sum((np.array([group_C_mean] * len(method_C_scores)) - overall_mean) ** 2)

# Calculate the within-group sum of squares (SSW)
SSW = np.sum((method_A_scores - group_A_mean) ** 2) + \
      np.sum((method_B_scores - group_B_mean) ** 2) + \
      np.sum((method_C_scores - group_C_mean) ** 2)

#Calculate degrees of freedom
df1 = 2 # Number of groups - 1
df2 = len(method_A_scores) + len(method_B_scores) + len(method_C_scores) - df1 - 1

#Calculate mean squares
MSB = SSB / df1
MSW = SSW / df2

#Calculate F-statistic
F = MSB / MSW

#Calculate p-value (upper tail of the F-distribution; sf avoids the rounding in 1 - cdf)
p_value = f.sf(F, df1, df2)

#Output the results
print("Between-Group Variation:")
print("SSB:", SSB)
print("df1:", df1)
print("MSB:", MSB)
print("")

print("Within-Group Variation:")
print("SSW:", SSW)
print("df2:", df2)
print("MSW:", MSW)
print("")

print("F-statistic:", F)
print("p-value:", p_value)

In [14]:
import numpy as np
from scipy.stats import f

# Input data
X = np.array([2, 4, 6, 8, 10, 3, 5, 7, 9, 11])
Y = np.array([70, 85, 90, 95, 100, 75, 80, 92, 94, 105])
method = np.array(['A', 'A', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'C'])

# Linear Regression
# Calculate the regression coefficients
mean_X = np.mean(X)
mean_Y = np.mean(Y)
beta_1 = np.sum((X - mean_X) * (Y - mean_Y)) / np.sum((X - mean_X) ** 2)
beta_0 = mean_Y - beta_1 * mean_X

# Calculate predicted Y values
Y_pred = beta_0 + beta_1 * X

# Calculate the explained sum of squares (ESS) and residual sum of squares (RSS)
ESS = np.sum((Y_pred - np.mean(Y)) ** 2)
RSS = np.sum((Y - Y_pred) ** 2)

# Calculate degrees of freedom
df1_lr = 1  # Number of predictors (slope)
df2_lr = len(X) - 1 - df1_lr  # Total number of observations - Number of predictors - 1

# Calculate mean squares
MS1_lr = ESS / df1_lr
MS2_lr = RSS / df2_lr

# Calculate F-statistic
F_lr = MS1_lr / MS2_lr

# ANOVA
# Convert method to integers
method_int = np.unique(method, return_inverse=True)[1]

# Calculate group means
group_means = []
for group in np.unique(method_int):
    group_means.append(np.mean(Y[method_int == group]))
group_means = np.array(group_means)

# Calculate overall mean
overall_mean = np.mean(Y)

# Calculate the total sum of squares (SST)
SST = np.sum((Y - overall_mean) ** 2)

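# Calculate the between-group sum of squares (SSB)
# NOTE: this line multiplies the unweighted sum of squared deviations by (groups - 1).
# The standard definition weights each group's squared deviation by its group size,
# so the SSB, MSB, F-statistic and p-value printed below are not the usual one-way ANOVA values.
SSB = np.sum((group_means - overall_mean) ** 2) * (len(np.unique(method_int)) - 1)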

# Calculate the within-group sum of squares (SSW)
SSW = np.sum((Y - group_means[method_int]) ** 2)

# Calculate degrees of freedom
df1_anova = len(np.unique(method_int)) - 1  # Number of groups - 1
df2_anova = len(X) - len(np.unique(method_int))  # Total number of observations - Number of groups

# Calculate mean squares
MSB = SSB / df1_anova
MSW = SSW / df2_anova

# Calculate F-statistic
F_anova = MSB / MSW

# Calculate p-value
p_value_anova = 1 - f.cdf(F_anova, df1_anova, df2_anova)

# Output the results
print("Linear Regression:")
print("F-statistic:", F_lr)
print("")

print("ANOVA:")
print("Between-Group Variation:")
print("SSB:", SSB)
print("df1:", df1_anova)
print("MSB:", MSB)
print("")

print("Within-Group Variation:")
print("SSW:", SSW)
print("df2:", df2_anova)
print("MSW:", MSW)
print("F-statistic:", F_anova)
print("p-value:", p_value_anova)


Linear Regression:
F-statistic: 111.02064896755141

ANOVA:
Between-Group Variation:
SSB: 452.3961111111109
df1: 2
MSB: 226.19805555555544

Within-Group Variation:
SSW: 338.08333333333337
df2: 7
MSW: 48.29761904761905
F-statistic: 4.683420425601838
p-value: 0.05116419342251366
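In the cell above, SSB (452.40) and SSW (338.08) do not add up to the total sum of squares of the same `Y`, which they would in a standard one-way ANOVA. The sketch below recomputes the decomposition with each group's squared deviation weighted by its group size and cross-checks the F-statistic against `scipy.stats.f_oneway`. The data are the same as above; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

Y = np.array([70, 85, 90, 95, 100, 75, 80, 92, 94, 105], dtype=float)
method = np.array(['A', 'A', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'C'])

groups = [Y[method == g] for g in np.unique(method)]
overall_mean = Y.mean()

# Between-group SS: squared deviation of each group mean, weighted by group size
SSB = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
# Within-group SS: squared deviations from each group's own mean
SSW = sum(np.sum((g - g.mean()) ** 2) for g in groups)
SST = np.sum((Y - overall_mean) ** 2)

df1 = len(groups) - 1        # number of groups - 1
df2 = len(Y) - len(groups)   # observations - number of groups
F = (SSB / df1) / (SSW / df2)
p_value = stats.f.sf(F, df1, df2)

F_check, p_check = stats.f_oneway(*groups)

print("SSB + SSW:", SSB + SSW, "SST:", SST)
print("F-statistic:", F, "p-value:", p_value)
print("f_oneway   :", F_check, p_check)
```

With the weighting in place, `SSB + SSW` matches `SST` and the hand-computed F-statistic agrees with `f_oneway`.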


