
In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import t

In [4]:
# read data
df = pd.read_csv('/content/Housing.csv')
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB
None


      price  area  bedrooms  bathrooms  stories  mainroad  guestroom  basement  hotwaterheating  airconditioning  parking  prefarea  furnishingstatus
0  13300000  7420         4          2        3       yes         no        no               no              yes        2       yes         furnished
1  12250000  8960         4          4        4       yes         no        no               no              yes        3        no         furnished
2  12250000  9960         3          2        2       yes         no       yes               no               no        2       yes    semi-furnished
3  12215000  7500         4          2        2       yes         no       yes               no              yes        3       yes         furnished
4  11410000  7420         4          1        2       yes        yes       yes               no              yes        2        no         furnished


For simplicity, let's use only the numerical variables as independent variables (predictors); `price` is the dependent variable.

In [5]:
# Select only the numerical columns
# np.number matches every dtype that NumPy treats as numeric (integers, floats, ...)
numerical_cols = df.select_dtypes(include=[np.number]).columns
print(numerical_cols)
df_numeric = df[numerical_cols]

Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'parking'], dtype='object')


In [6]:
# Define independent and dependent variables
X = df_numeric.drop(columns=["price"])  # Independent variables
y = df_numeric["price"]  # Dependent variable (house price)


In [7]:
# Build regression model and calculate beta, p-value, etc.
print(X)
X = sm.add_constant(X)  # Add intercept
print(X)
model = sm.OLS(y, X).fit()

     area  bedrooms  bathrooms  stories  parking
0    7420         4          2        3        2
1    8960         4          4        4        3
2    9960         3          2        2        2
3    7500         4          2        2        3
4    7420         4          1        2        2
..    ...       ...        ...      ...      ...
540  3000         2          1        1        2
541  2400         3          1        1        0
542  3620         2          1        1        0
543  2910         3          1        1        0
544  3850         3          1        2        0

[545 rows x 5 columns]
     const  area  bedrooms  bathrooms  stories  parking
0      1.0  7420         4          2        3        2
1      1.0  8960         4          4        4        3
2      1.0  9960         3          2        2        2
3      1.0  7500         4          2        2        3
4      1.0  7420         4          1        2        2
..     ...   ...       ...        ...      ...      

In [9]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.562
Model:                            OLS   Adj. R-squared:                  0.558
Method:                 Least Squares   F-statistic:                     138.1
Date:                Thu, 20 Feb 2025   Prob (F-statistic):           4.37e-94
Time:                        14:11:40   Log-Likelihood:                -8418.8
No. Observations:                 545   AIC:                         1.685e+04
Df Residuals:                     539   BIC:                         1.688e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.457e+05   2.47e+05     -0.591      0.5
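To make the `coef` and `std err` columns less of a black box, here is a minimal sketch of what an ordinary least squares fit computes, written with NumPy only. The data are synthetic and every name in it (`X_design`, `y_toy`, `true_beta`, and so on) is made up for illustration; it is not the housing data, and it is not how statsmodels is implemented internally.

```python
import numpy as np

# Synthetic toy data (an assumption for illustration, not the housing data)
rng = np.random.default_rng(0)
n = 200
x_toy = rng.normal(size=(n, 2))
true_beta = np.array([1.5, -2.0, 0.5])  # intercept, slope 1, slope 2

# Column of ones = intercept, the same role sm.add_constant plays above
X_design = np.column_stack([np.ones(n), x_toy])
y_toy = X_design @ true_beta + rng.normal(scale=0.1, size=n)

# Least-squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

# Standard errors: sqrt of the diagonal of sigma^2 * (X'X)^-1,
# with sigma^2 estimated from the residuals and n - p degrees of freedom
residuals = y_toy - X_design @ beta_hat
dof = n - X_design.shape[1]
sigma2_hat = residuals @ residuals / dof
se_hat = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X_design.T @ X_design)))

print("beta:", beta_hat)
print("SE(beta):", se_hat)
```

Note that the degrees of freedom are `n` minus the number of columns in the design matrix, intercept included, which is why the summary above reports `Df Residuals: 539` for 545 rows and 6 columns.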

Let us compute the p-value of the `bedrooms` coefficient ourselves, starting from its t-statistic.

In [16]:
# you get the beta and the SE(beta) directly from the fitted model
print(model.params, model.bse)
# so the t-values are basically their ratio: beta/SE(beta)
t_values = model.params/model.bse
print(t_values)
# X already contains the constant column, so residual degrees of freedom = n - (number of columns in X)
print(f"t-value of bedrooms: {t_values['bedrooms']}, degrees of freedom {len(X)-X.shape[1]}")

const       -1.457345e+05
area         3.311155e+02
bedrooms     1.678098e+05
bathrooms    1.133740e+06
stories      5.479398e+05
parking      3.775963e+05
dtype: float64 const        246634.480772
area             26.599666
bedrooms      82932.687437
bathrooms    118828.331193
stories       68894.469456
parking       66804.137568
dtype: float64
const        -0.590893
area         12.448107
bedrooms      2.023446
bathrooms     9.540992
stories       7.953321
parking       5.652289
dtype: float64
t-value of bedrooms: 2.023445679189467, degrees of freedom 539


Go to https://datatab.net/tutorial/t-distribution and enter the t-value and degrees of freedom from above to look up the p-value.
- What is the null hypothesis being tested?
- Based on this p-value, would you reject it or fail to reject it?
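The same lookup can be done in code with `scipy.stats.t`, which is already imported at the top of the notebook. The sketch below hard-codes the t-value printed above and assumes the residual degrees of freedom are 545 - 6 = 539, matching `Df Residuals` in the model summary; the variable names are made up for illustration.

```python
from scipy.stats import t

t_bedrooms = 2.023445679189467  # t-value of 'bedrooms' printed above
df_resid = 545 - 6              # assumption: n rows minus columns of X (const included)

# Two-sided p-value: probability of a |t| at least this large under the t-distribution.
# t.sf is the survival function, 1 - cdf.
p_two_sided = 2 * t.sf(abs(t_bedrooms), df_resid)
print(f"two-sided p-value for bedrooms: {p_two_sided:.4f}")
```

Inside the notebook, the hard-coded numbers could be replaced with `t_values['bedrooms']` and `model.df_resid`, and the result compared with `model.pvalues['bedrooms']`.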

#### [Variance inflation factor](https://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html#statsmodels.stats.outliers_influence.variance_inflation_factor)


In [18]:
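# Compute VIF to check for multicollinearity.
# Rule of thumb: VIF < 3 is fine; VIF > 5 suggests the feature may need to be dropped.
# The value reported for 'const' is usually ignored; only the feature rows are interpreted.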
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)

     Feature        VIF
0      const  21.415032
1       area   1.170959
2   bedrooms   1.316597
3  bathrooms   1.252775
4    stories   1.255202
5    parking   1.164172
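To see where these numbers come from, here is a minimal NumPy sketch of the textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing feature j on all the other features plus an intercept. The function name `vif_from_definition` and the toy data are hypothetical and only for illustration; this is not the statsmodels implementation.

```python
import numpy as np

def vif_from_definition(features):
    """Return VIF_j = 1 / (1 - R_j^2) for every column of `features`.

    R_j^2 comes from regressing column j on the remaining columns
    plus an intercept.
    """
    features = np.asarray(features, dtype=float)
    n, k = features.shape
    vifs = np.empty(k)
    for j in range(k):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Toy data (an assumption): c is built from a, so a and c are strongly collinear
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.3 * rng.normal(size=300)
toy = np.column_stack([a, b, c])

toy_vif = vif_from_definition(toy)
print(toy_vif)
```

In this toy example the collinear columns `a` and `c` get large VIFs while the independent column `b` stays close to 1. In the housing output above, all five features are well below 3, so multicollinearity does not look like a concern there.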


#### Do it yourself!
Given `y_pred = model.fittedvalues`, write code to compute the adjusted R-squared, and check your result against the `Adj. R-squared` value of 0.558 in the model summary.