## Hypothesis testing and confidence intervals (for simple regression)

In [1]:
import pandas as pd
import statsmodels.api as sm

In [2]:
#create dataset
df = pd.DataFrame({'hours': [1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 11, 11, 12, 12, 14],
                   'score': [64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 84, 82, 91, 93, 89]})

In [3]:
#view first six rows of dataset
df[0:6]

   hours  score
0      1     64
1      2     66
2      4     76
3      5     73
4      5     74
5      6     81


In [4]:
#define response variable
y = df['score']

#define explanatory variable
x = df[['hours']]

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()

#view model summary
print(model.summary())

#confidence interval (95%)
model.conf_int(alpha=0.05)


                            OLS Regression Results                            
Dep. Variable:                  score   R-squared:                       0.831
Model:                            OLS   Adj. R-squared:                  0.818
Method:                 Least Squares   F-statistic:                     63.91
Date:                Fri, 08 Mar 2024   Prob (F-statistic):           2.25e-06
Time:                        11:11:33   Log-Likelihood:                -39.594
No. Observations:                  15   AIC:                             83.19
Df Residuals:                      13   BIC:                             84.60
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         65.3340      2.106     31.023      0.0



               0          1
const  60.784238  69.883665
hours   1.446682   2.518068


# Conclusion
Score = 65.334 + 1.982(hours)

This means each additional hour studied is associated with an average increase of 1.982 points in exam score, and the intercept
value 65.334 is the expected average exam score for a student who studies 0 hours.

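The p-value for hours rounds to 0.000 in the summary, which is far below 0.05, so we reject the null hypothesis that the slope is zero and conclude that there is a statistically significant association between hours and score. Consistent with this, the 95% confidence interval for the slope (1.447 to 2.518) does not contain zero.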

The R^2 value tells us that 83.1% of the variation in exam scores can be explained by hours studied.

The F-statistic and its p-value test the overall significance of the regression model, that is, whether the explanatory variables in the model are useful for explaining the variation in the response variable. Since the p-value here (2.25e-06) is less than 0.05, the model is statistically significant and hours is deemed useful for explaining the variation in score.

In [5]:
# Load the R 'cars' dataset (downloaded on first use, so this needs internet access)
data = sm.datasets.get_rdataset('cars', 'datasets').data
# Fit the linear regression model
model = sm.formula.ols('dist ~ speed', data=data).fit()
# Print the model summary (includes the t-test for each coefficient)
print(model.summary())

#confidence interval (95%)
model.conf_int(alpha=0.05)

                            OLS Regression Results                            
Dep. Variable:                   dist   R-squared:                       0.651
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     89.57
Date:                Fri, 08 Mar 2024   Prob (F-statistic):           1.49e-12
Time:                        11:11:40   Log-Likelihood:                -206.58
No. Observations:                  50   AIC:                             417.2
Df Residuals:                      48   BIC:                             421.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -17.5791      6.758     -2.601      0.0

                   0         1
Intercept  -31.16785  -3.99034
speed       3.096964  4.767853
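The 95% interval for `speed` lies entirely above zero, so the slope is significant at the 5% level. Because the interval is symmetric around the estimate, the slope, its standard error and its t statistic can be recovered from the printed bounds and the 48 residual degrees of freedom. The sketch below does this using only numbers shown in the output above, so it does not need to download the dataset; `scipy` and the variable names are assumptions that are not in the original notebook.

```python
from scipy import stats

# Values copied from the output above
ci_low, ci_high = 3.096964, 4.767853   # 95% confidence interval for speed
df_resid = 48                          # "Df Residuals" in the summary

# The interval is estimate +/- t_crit * se, so the estimate is the midpoint
slope = (ci_low + ci_high) / 2
t_crit = stats.t.ppf(0.975, df=df_resid)
se = (ci_high - ci_low) / 2 / t_crit

# t statistic and two-sided p-value for H0: slope = 0
t_stat = slope / se
p_value = 2 * stats.t.sf(abs(t_stat), df=df_resid)

print(slope, se, t_stat, p_value)
print(t_stat ** 2)  # should be close to the reported F-statistic
```

Squaring the recovered t statistic should land close to the reported F-statistic of 89.57, up to rounding in the printed interval.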
