# Multiple Linear Regression using Statsmodels
## CPE 490 590 
### Author: Rahul Bhadani

In [2]:
!pip install statsmodels

Collecting statsmodels
  Downloading statsmodels-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
Collecting patsy>=0.5.4
  Downloading patsy-0.5.6-py2.py3-none-any.whl (233 kB)
Installing collected packages: patsy, statsmodels
Successfully installed patsy-0.5.6 statsmodels-0.14.1


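Note: `sm.OLS` does not add a bias (intercept) term $w_0$ on its own, i.e. $w_0 = 0$ unless a constant column is supplied. This notebook adds that column with `sm.add_constant`, so the fitted model below does include $w_0$.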
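To make the role of the bias term concrete, here is a minimal sketch on synthetic, noise-free data (hypothetical values, not the Advertising dataset). It uses NumPy's `np.linalg.lstsq` instead of statsmodels, and compares a least-squares fit with and without a leading column of ones, which is the column that `sm.add_constant` prepends.

```python
import numpy as np

# Synthetic, noise-free data (hypothetical): y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Without a column of ones the fitted plane is forced through the origin (w_0 = 0)
w_no_bias, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prepending a column of ones lets the first coefficient act as the bias w_0
X_with_ones = np.column_stack([np.ones(len(X)), X])
w_with_bias, *_ = np.linalg.lstsq(X_with_ones, y, rcond=None)

print("no bias:  ", w_no_bias)
print("with bias:", w_with_bias)
```

With the column of ones the fit recovers the generating coefficients; without it, the slopes are distorted because they have to absorb the missing offset.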

In [7]:
import statsmodels.api as sm 
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'Serif'
plt.rcParams['font.size'] = 15

## Read the data

In [5]:
df = pd.read_csv('Dataset/Advertising/Advertising.csv', index_col=0)
df.columns

Index(['TV', 'Radio', 'Newspaper', 'Sales'], dtype='object')

The dataset contains the TV, radio, and newspaper advertising budgets for a company's product, together with the resulting sales.

Our goal is to predict sales from the TV, radio, and newspaper budgets.

## Split the Dataset into Training and Testing

In [8]:
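# Separate features and labels
df_filtered = df[['TV', 'Radio', 'Newspaper']]
y = df[['Sales']]
# Convert to float64 NumPy arrays; y becomes a column vector of shape (n, 1)
x = df_filtered.values.astype(np.float64)
y = y.values.reshape(-1, 1).astype(np.float64)

# Hold out one third of the rows for testing; random_state fixes the split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(x, y, test_size=1/3, random_state=0)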
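As a quick check on what `test_size = 1/3` means in rows, the sketch below reproduces the split arithmetic by hand. It assumes the dataset has 200 rows (an assumption, not stated in this notebook) and that scikit-learn rounds the test share up and assigns the remainder to training.

```python
import math

n_samples = 200      # assumption: number of rows in the Advertising dataset
test_size = 1 / 3

# Test share is rounded up; the training set gets whatever is left
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under these assumptions the training set has 133 rows, which matches the `No. Observations` entry reported by the fitted model.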


# Fitting Multiple Linear Regression to the training set

In [11]:
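# Prepend a column of ones so the model also estimates an intercept (const)
xt = sm.add_constant(X_Train)
# Fit ordinary least squares on the training data
est = sm.OLS(Y_Train, xt).fit()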

# Print the Summary of the Result

In [12]:
est.summary()

OLS Regression Results

| Item | Value | Item | Value |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.907 |
| Model: | OLS | Adj. R-squared: | 0.905 |
| Method: | Least Squares | F-statistic: | 418.1 |
| Date: | Tue, 20 Feb 2024 | Prob (F-statistic): | 3.02e-66 |
| Time: | 16:18:18 | Log-Likelihood: | -250.83 |
| No. Observations: | 133 | AIC: | 509.7 |
| Df Residuals: | 129 | BIC: | 521.2 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9038 | 0.368 | 7.886 | 0.000 | 2.175 | 3.632 |
| x1 | 0.0443 | 0.002 | 26.445 | 0.000 | 0.041 | 0.048 |
| x2 | 0.1966 | 0.010 | 20.420 | 0.000 | 0.178 | 0.216 |
| x3 | 0.0026 | 0.007 | 0.371 | 0.712 | -0.011 | 0.017 |

| Item | Value | Item | Value |
|---|---|---|---|
| Omnibus: | 10.854 | Durbin-Watson: | 2.117 |
| Prob(Omnibus): | 0.004 | Jarque-Bera (JB): | 11.895 |
| Skew: | -0.731 | Prob(JB): | 0.00261 |
| Kurtosis: | 2.909 | Cond. No. | 462.0 |


# Summary
In the result above, the coefficient of determination is $R^2 = 0.907$, and the estimated coefficients are $w_0 = 2.9038$ (intercept), $w_1 = 0.0443$ (TV), $w_2 = 0.1966$ (Radio), and $w_3 = 0.0026$ (Newspaper).
Their respective 95% confidence intervals are [2.175, 3.632], [0.041, 0.048], [0.178, 0.216], and [-0.011, 0.017]. The interval for $w_3$ contains zero, which is consistent with its large p-value of 0.712.

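Note: because `random_state = 0` is fixed in `train_test_split`, rerunning the notebook reproduces the same split and the same results. If `random_state` is changed or removed, the split is drawn randomly on every run and the numbers will differ.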
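Several entries of the summary table can be cross-checked by hand from the others. The sketch below recomputes the adjusted $R^2$, AIC, and BIC from the reported $R^2$, log-likelihood, and sample size, using the standard textbook formulas; the variable names are illustrative, and the inputs are the rounded values printed in the summary, so agreement is only expected up to rounding.

```python
import math

# Values copied from the summary table above
r2 = 0.907         # R-squared
n = 133            # No. Observations
p = 3              # Df Model (number of predictors)
log_lik = -250.83  # Log-Likelihood
k = p + 1          # estimated parameters, including the intercept

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Information criteria: lower is better when comparing models on the same data
aic = 2 * k - 2 * log_lik
bic = k * math.log(n) - 2 * log_lik

print(round(adj_r2, 3), round(aic, 1), round(bic, 1))
```

The rounded results agree with the `Adj. R-squared`, `AIC`, and `BIC` entries reported in the table.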

# Prediction on the test data

In [13]:
xtest = sm.add_constant(X_Test)

# Make a prediction
y_pred = est.predict(xtest)
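For a single observation, the prediction is just the dot product of the coefficient vector with the feature row, with a leading 1 for the constant. A minimal sketch using the rounded coefficients from the summary and a hypothetical set of budgets (the budget values are made up for illustration):

```python
# Rounded coefficients from the summary: [const, TV, Radio, Newspaper]
w = [2.9038, 0.0443, 0.1966, 0.0026]

# Hypothetical budgets for one observation: TV, Radio, Newspaper
budgets = [100.0, 20.0, 10.0]

# The leading 1.0 plays the role of the constant column added by sm.add_constant
features = [1.0] + budgets

# y_hat = w_0 + w_1 * TV + w_2 * Radio + w_3 * Newspaper
y_hat = sum(wi * xi for wi, xi in zip(w, features))
print(round(y_hat, 4))
```

`est.predict(xtest)` applies the same computation to every row of `xtest`, using the unrounded coefficients.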

# Mean Squared Error

In [16]:
from sklearn.metrics import mean_squared_error, r2_score

# Calculate the Mean Squared Error
mse = mean_squared_error(Y_Test, y_pred)
print(mse)

3.3864786054959777
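The mean squared error is the average of the squared residuals, $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$, so it is expressed in squared units of the target. The sketch below spells out the formula on a few hypothetical toy values and takes the square root of the MSE reported above to obtain the RMSE, which is in the same units as `Sales`. The helper name `mse_manual` is illustrative only.

```python
import math

def mse_manual(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy values, only to illustrate the formula
toy_true = [10.0, 12.0, 9.0]
toy_pred = [11.0, 12.0, 7.0]
toy_mse = mse_manual(toy_true, toy_pred)  # (1 + 0 + 4) / 3

# RMSE for the test-set MSE printed above
reported_mse = 3.3864786054959777
rmse = math.sqrt(reported_mse)

print(toy_mse, round(rmse, 4))
```

An RMSE of roughly 1.84 means the test-set predictions are off by about 1.84 sales units in the root-mean-square sense.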
