## Introduction to Statsmodels API: Part B


## Step 1: Importing Libraries


- In this part, we fit regression models to a dataset, change which variables are fed into the model, and compare the results to choose the set of predictors that fits best.

- First, import the pandas library for data manipulation and the statsmodels formula API for statistical modeling.

- Then read the CSV file and build the model.


In [2]:
import pandas as pd
import statsmodels.formula.api as smf

## Step 2: Loading the Dataset

- Load the dataset from the **Advertising.csv** file, using the first column as the index
- Display the first few rows of the data


In [3]:
data = pd.read_csv('Advertising.csv',index_col = 0)

In [4]:
data.head()

,TV,Radio,Newspaper,Sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [5]:
data.shape

(200, 4)

## Step 3: Building a Linear Regression Model with TV as the Independent Variable

- Build a linear regression model using the ordinary least squares (OLS) method
- The dependent variable is **Sales**, and the independent variable is **TV**
- Fit the model to the data and print a summary


In [13]:
model = smf.ols(formula = ' Sales ~ TV ',data = data).fit()
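To make concrete what "ordinary least squares" computes for a single predictor, here is a minimal pure-Python sketch of the closed-form solution. The helper `fit_simple_ols` and the toy data are hypothetical and only for illustration; statsmodels uses its own, more general implementation.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x (not the Advertising data)
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_simple_ols(x, y)
print(intercept, slope)
```

Because the toy points lie exactly on a line, the fit recovers that line; with real data such as Advertising.csv the residuals are non-zero and the summary statistics describe how large they are.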

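In [30]:
# `params` (plural) exists only on the fitted results object returned by .fit();
# an unfitted OLS model raises the error below
smf.ols(formula = ' Sales ~ TV ',data = data).params

AttributeError: 'OLS' object has no attribute 'params'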

In [21]:
# The equivalent model built with the array-based (non-formula) API
import statsmodels.regression.linear_model as lin_model
import statsmodels.api as sm

X = sm.add_constant(data["TV"])  # add an intercept column named "const"
model_non_r_refactor = lin_model.OLS(data["Sales"], X).fit()

In [28]:
type(lin_model)

module

Let's display the summary of the OLS regression model.

In [18]:
#Print a summary of the model
model.summary()

Dep. Variable:,Sales,R-squared:,0.612
Model:,OLS,Adj. R-squared:,0.61
Method:,Least Squares,F-statistic:,312.1
Date:,"Tue, 26 Sep 2023",Prob (F-statistic):,1.47e-42
Time:,03:04:16,Log-Likelihood:,-519.05
No. Observations:,200,AIC:,1042.0
Df Residuals:,198,BIC:,1049.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7.0326,0.458,15.360,0.000,6.130,7.935
TV,0.0475,0.003,17.668,0.000,0.042,0.053

Omnibus:,0.531,Durbin-Watson:,1.935
Prob(Omnibus):,0.767,Jarque-Bera (JB):,0.669
Skew:,-0.089,Prob(JB):,0.716
Kurtosis:,2.779,Cond. No.,338.0


In [22]:
model_non_r_refactor.summary()

Dep. Variable:,Sales,R-squared:,0.612
Model:,OLS,Adj. R-squared:,0.61
Method:,Least Squares,F-statistic:,312.1
Date:,"Tue, 26 Sep 2023",Prob (F-statistic):,1.47e-42
Time:,03:06:20,Log-Likelihood:,-519.05
No. Observations:,200,AIC:,1042.0
Df Residuals:,198,BIC:,1049.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,7.0326,0.458,15.360,0.000,6.130,7.935
TV,0.0475,0.003,17.668,0.000,0.042,0.053

Omnibus:,0.531,Durbin-Watson:,1.935
Prob(Omnibus):,0.767,Jarque-Bera (JB):,0.669
Skew:,-0.089,Prob(JB):,0.716
Kurtosis:,2.779,Cond. No.,338.0


**Observation**
* The R-squared value is 0.612.
* The adjusted R-squared value is 0.610.
* **Sales** depends strongly on **TV**: the TV coefficient (0.0475) is positive and statistically significant (p < 0.001).
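The adjusted R-squared in the summary can be reproduced by hand from the plain R-squared. The sketch below assumes the standard adjustment formula and takes R-squared at the full precision that `model.rsquared` reports for this model; `n` and `p` come from the "No. Observations" and "Df Model" rows of the summary.

```python
# Values taken from the Sales ~ TV model in this notebook
r_squared = 0.6118750508500712  # model.rsquared
n = 200                         # No. Observations
p = 1                           # Df Model (number of predictors, excluding the intercept)

# Adjusted R-squared penalizes R-squared for the number of predictors used
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(adj_r_squared, 3))
```

With one predictor and 200 observations the penalty is small, which is why the two values are nearly identical here.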

## Step 4: Analyzing Model Parameters

- Extract the coefficients (m and c) of the fitted linear equation y = mx + c
- Calculate the confidence intervals and p-values for the model parameters
- Calculate the R-squared value of the model


Now, let's find the model parameters.

In [23]:
# Fitted line: y = m*x + c, where m is the slope and c is the intercept

# We use statsmodels.formula.api.ols to fit a linear regression model
# to the Advertising.csv data.

# The fitted model is:
# Sales = 0.0475 * TV + 7.0326

model.params

Intercept    7.032594
TV           0.047537
dtype: float64
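As a quick illustration of what these two parameters mean, the fitted equation can be applied by hand to the five rows shown by `data.head()`. This is only a sketch: `predict_sales` is a hypothetical helper, and with the fitted results object `model.predict(data)` does the same job.

```python
# Coefficients reported by model.params for Sales ~ TV
intercept = 7.032594
slope = 0.047537


def predict_sales(tv):
    """Apply the fitted line Sales = intercept + slope * TV."""
    return intercept + slope * tv


# (TV, actual Sales) pairs copied from the data.head() output
head_rows = [(230.1, 22.1), (44.5, 10.4), (17.2, 9.3), (151.5, 18.5), (180.8, 12.9)]

for tv, actual in head_rows:
    predicted = predict_sales(tv)
    residual = actual - predicted  # positive: the line under-predicts
    print(f"TV={tv:6.1f}  predicted={predicted:6.2f}  actual={actual:5.1f}  residual={residual:6.2f}")
```

The residuals of several units on these rows are consistent with an R-squared of about 0.61: the line captures the trend but leaves a good deal of variation unexplained.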

Let's calculate the confidence intervals for the model parameters.

In [24]:
# Confidence intervals (95% by default, alpha=0.05), not p-values
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.conf_int.html
# Column 0 is the lower bound and column 1 is the upper bound for each parameter
model.conf_int()

,0,1
Intercept,6.129719,7.935468
TV,0.042231,0.052843
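The interval for the TV coefficient can be rebuilt from the summary table as coefficient ± critical value × standard error. In the sketch below the standard error is backed out from the reported t statistic, and the critical value 1.972 is an assumed, hard-coded approximation of the 97.5th percentile of a t distribution with 198 degrees of freedom (in practice it could be obtained with `scipy.stats.t.ppf(0.975, 198)`).

```python
# Values for the TV coefficient in the Sales ~ TV model
coef = 0.047537      # model.params["TV"]
t_stat = 17.668      # t column of the summary
t_crit = 1.972       # assumed approximation of t(0.975, df=198)

# The summary's t statistic is coef / std err, so the standard error is:
std_err = coef / t_stat

lower = coef - t_crit * std_err
upper = coef + t_crit * std_err

print(round(lower, 6), round(upper, 6))
```

The result agrees with the TV row of `model.conf_int()` up to rounding of the inputs. Because the interval excludes 0, it tells the same story as the small p-value for TV.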


In [25]:
type(model)

statsmodels.regression.linear_model.RegressionResultsWrapper

Let's look at the p-values of the model parameters.

In [26]:
# A p-value is the probability of observing an estimate at least this extreme
# if the null hypothesis (intercept == 0 ; slope == 0) were true.
# It is not the probability that the null hypothesis is true.

# Both p-values are far below 0.05, so we reject the null hypotheses that the
# intercept is 0 and that the slope is 0, i.e. that there is no linear
# relationship between Sales and TV.
model.pvalues

Intercept    1.406300e-35
TV           1.467390e-42
dtype: float64

Let's calculate the R-squared value of the model.

In [27]:
model.rsquared

0.6118750508500712

**Observation**

The R-squared value is 0.61, meaning TV explains about 61% of the variance in Sales. This is not treated as a satisfactory value here.

What does it mean to have a satisfactory value?
1. It depends on the use case.
   - "Is there a relationship between Sales and TV that I, as an analyst, should investigate further?" For this kind of question, an R-squared above 0.1 is satisfactory for me (that bar will depend on your organization or your academic area of research).
   - "Is the formula Sales = 0.0475 * TV + 7.03 good enough to predict Sales such that my organization can confidently bet 1 million dollars of ad spend to optimize sales?"
         - The stakes are very high for 1 million dollars of spend, so the bar is much higher.

## Step 5: Building a Multiple Regression Model with TV, Radio and Newspaper as Independent Variables

- Build another model using the independent variables TV, Radio and Newspaper
- Fit the model to the data, and print a summary


In [35]:
model = smf.ols(formula = ' Sales ~ TV + Radio + Newspaper ',data = data).fit()

Now, let's display the summary of the model.

In [36]:
model.summary()

Dep. Variable:,Sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,570.3
Date:,"Tue, 26 Sep 2023",Prob (F-statistic):,1.58e-96
Time:,03:41:40,Log-Likelihood:,-386.18
No. Observations:,200,AIC:,780.4
Df Residuals:,196,BIC:,793.6
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.9389,0.312,9.422,0.000,2.324,3.554
TV,0.0458,0.001,32.809,0.000,0.043,0.049
Radio,0.1885,0.009,21.893,0.000,0.172,0.206
Newspaper,-0.0010,0.006,-0.177,0.860,-0.013,0.011

Omnibus:,60.414,Durbin-Watson:,2.084
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151.241
Skew:,-1.327,Prob(JB):,1.44e-33
Kurtosis:,6.332,Cond. No.,454.0


**Observation**

This new model fits better than the model with just **TV** as a feature: R-squared rises from 0.612 to 0.897, and adjusted R-squared from 0.610 to 0.896.

## Step 6: Analyzing the New Model Parameters

- Extract the parameters of the new model
- Calculate the p-value for the new model's parameters
- Calculate the R-squared value of the new model


In [37]:
# Sales = 0.0458 * TV + 0.1885 * Radio - 0.0010 * Newspaper + 2.9389

model.params

Intercept    2.938889
TV           0.045765
Radio        0.188530
Newspaper   -0.001037
dtype: float64

In [39]:
model.pvalues

Intercept    1.267295e-17
TV           1.509960e-81
Radio        1.505339e-54
Newspaper    8.599151e-01
dtype: float64

**Observations**

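1. The p-values for TV and Radio are far below 0.05, so both coefficients are statistically significant.
2. The p-value for Newspaper is 0.86, so there is no evidence that Newspaper contributes to Sales once TV and Radio are in the model.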
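To see where the Newspaper p-value comes from, it can be approximated from the t statistic in the summary table. This sketch swaps in a normal approximation using only the standard library; statsmodels itself uses a t distribution with the residual degrees of freedom (196 here), so the result is close to, but not exactly, the reported value.

```python
from statistics import NormalDist

# t statistic for Newspaper, copied from the summary table
t_stat = -0.177

# Two-sided p-value: probability of a statistic at least this far from 0
# in either direction, if the true coefficient were 0.
# Assumption: normal approximation instead of the exact t distribution.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(round(p_approx, 2))
```

A t statistic this close to zero is entirely unremarkable under the null hypothesis, which is why the p-value is so large.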



In [40]:
model.rsquared

0.8972106381789521

**Observation**

The R-squared value is 0.897, which is satisfactory now.
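The F-statistic shown in each summary is tied directly to R-squared. The sketch below assumes the standard overall F-test formula for a model with an intercept and plugs in the R-squared values reported by `model.rsquared` for the two models fitted so far; `f_statistic` is a hypothetical helper, not a statsmodels function.

```python
def f_statistic(r_squared, n, p):
    """Overall F-statistic for a linear model with p predictors and n observations."""
    explained = r_squared / p                   # explained variance per predictor
    unexplained = (1 - r_squared) / (n - p - 1) # unexplained variance per residual df
    return explained / unexplained


n = 200

# Sales ~ TV
f_tv = f_statistic(0.6118750508500712, n, p=1)

# Sales ~ TV + Radio + Newspaper
f_all = f_statistic(0.8972106381789521, n, p=3)

print(round(f_tv, 1), round(f_all, 1))
```

Both values agree with the "F-statistic" rows of the corresponding summaries, which is a useful sanity check when reading the table.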

## Step 7: Refitting the Model Without Newspaper

In [41]:
# Refit with statsmodels.formula.api, dropping Newspaper from the formula
model = smf.ols(formula = ' Sales ~ TV + Radio ',data = data).fit()

In [42]:
model.summary()

Dep. Variable:,Sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,859.6
Date:,"Tue, 26 Sep 2023",Prob (F-statistic):,4.83e-98
Time:,03:46:16,Log-Likelihood:,-386.2
No. Observations:,200,AIC:,778.4
Df Residuals:,197,BIC:,788.3
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.9211,0.294,9.919,0.000,2.340,3.502
TV,0.0458,0.001,32.909,0.000,0.043,0.048
Radio,0.1880,0.008,23.382,0.000,0.172,0.204

Omnibus:,60.022,Durbin-Watson:,2.081
Prob(Omnibus):,0.0,Jarque-Bera (JB):,148.679
Skew:,-1.323,Prob(JB):,5.19e-33
Kurtosis:,6.292,Cond. No.,425.0


In [43]:
# Parameters of the previous model (with Newspaper), for comparison:
# Intercept    2.938889
# TV           0.045765
# Radio        0.188530
# Newspaper   -0.001037

model.params

Intercept    2.921100
TV           0.045755
Radio        0.187994
dtype: float64
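One way to compare the three models fitted in this notebook is through the AIC and BIC values in their summaries, where lower is better. The sketch below recomputes both from the reported log-likelihoods, assuming the usual definitions with `k` counting the intercept plus the predictors; the helper names are hypothetical.

```python
import math

n = 200  # number of observations

# (log-likelihood, number of estimated parameters including the intercept),
# copied from the three summaries above
models = {
    "Sales ~ TV": (-519.05, 2),
    "Sales ~ TV + Radio + Newspaper": (-386.18, 4),
    "Sales ~ TV + Radio": (-386.2, 3),
}


def aic(log_likelihood, k):
    """Akaike information criterion: penalty of 2 per parameter."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n_obs):
    """Bayesian information criterion: penalty of ln(n) per parameter."""
    return k * math.log(n_obs) - 2 * log_likelihood


for name, (llf, k) in models.items():
    print(f"{name:32s} AIC={aic(llf, k):7.1f}  BIC={bic(llf, k, n):7.1f}")

best = min(models, key=lambda name: aic(*models[name]))
print("Lowest AIC:", best)
```

Dropping Newspaper leaves the log-likelihood essentially unchanged while removing one parameter, so the two-predictor model comes out with the lowest AIC and BIC of the three.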