The owner of the Bangalore Pizza restaurant chain wants to predict the sales of his specialty Thin Crust Masala Pizza. He gathered data on monthly sales and potentially relevant variables for 15 outlets across Karnataka.

1) Estimate the MLR model coefficients<br>
2) Is there any evidence of a violation of any key assumption of regression analysis?<br>
3) Which variable among these would you choose to remove, and why?<br>
4) Will removing that variable increase the overall explanatory power of the model?

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

In [2]:
# load data
df = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Regression-Models-main/MLR_Q14_PizzaSales.csv')
df.head()

Unnamed: 0,Outlet Number,Quantity Sold,Average Price,Monthly Advertising Expenditures,Disposable Income per Household
0,1,85300,$10.14,"$64,800","$42,100"
1,2,40500,$10.88,"$42,800","$38,300"
2,3,61800,$12.33,"$58,600","$41,000"
3,4,50800,$12.70,"$46,500","$43,300"
4,5,60600,$12.29,"$50,700","$44,000"


In [3]:
# Check shape
df.shape

(15, 5)

There are 15 observations and 5 columns: an outlet identifier, the target (Quantity Sold), and 3 candidate predictors.

In [4]:
# Check data type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 5 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   Outlet Number                     15 non-null     int64 
 1   Quantity Sold                     15 non-null     object
 2   Average Price                     15 non-null     object
 3   Monthly Advertising Expenditures  15 non-null     object
 4   Disposable Income per Household   15 non-null     object
dtypes: int64(1), object(4)
memory usage: 728.0+ bytes


The four columns with the `object` dtype hold numbers stored as strings (with `$` and `,`), so they need to be cleaned and converted to numeric types.

In [5]:
# Check missing values
df.isnull().sum()

Outlet Number                       0
Quantity Sold                       0
Average Price                       0
Monthly Advertising Expenditures    0
Disposable Income per Household     0
dtype: int64

In [6]:
# Let's convert the object-typed columns to numbers by stripping '$' and ','
df['Quantity Sold'] = df['Quantity Sold'].apply(lambda x: int(x.replace(',','')))
df['Average Price'] = df['Average Price'].apply(lambda x: float(x.replace('$','')))
df['Monthly Advertising Expenditures'] = df['Monthly Advertising Expenditures'].apply(lambda x: int(x.replace('$','').replace(',','')))
df['Disposable Income per Household'] = df['Disposable Income per Household'].apply(lambda x: int(x.replace('$','').replace(',','')))
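
The per-column `apply` calls above can also be written as one vectorized helper. This is a minimal sketch, assuming the raw columns are strings such as `"$64,800"`. The helper name `to_number` and the two-row sample frame are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative sample in the same raw format as the CSV (first two outlets only)
raw = pd.DataFrame({
    'Average Price': ['$10.14', '$10.88'],
    'Monthly Advertising Expenditures': ['$64,800', '$42,800'],
})

def to_number(col: pd.Series) -> pd.Series:
    """Strip '$' and ',' from a string column and convert it to a numeric dtype."""
    return pd.to_numeric(col.str.replace(r'[$,]', '', regex=True))

# DataFrame.apply passes each column to the helper in turn
clean = raw.apply(to_number)
print(clean.dtypes)
```

`pd.to_numeric` picks an integer or float dtype per column, so no separate `int`/`float` casts are needed.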

In [7]:
# Check data
df.head()

Unnamed: 0,Outlet Number,Quantity Sold,Average Price,Monthly Advertising Expenditures,Disposable Income per Household
0,1,85300,10.14,64800,42100
1,2,40500,10.88,42800,38300
2,3,61800,12.33,58600,41000
3,4,50800,12.7,46500,43300
4,5,60600,12.29,50700,44000


In [8]:
# Build the predictor matrix: drop the outlet ID and the target variable
X = df.drop(['Outlet Number', 'Quantity Sold'], axis=1)
X.head()

Unnamed: 0,Average Price,Monthly Advertising Expenditures,Disposable Income per Household
0,10.14,64800,42100
1,10.88,42800,38300
2,12.33,58600,41000
3,12.7,46500,43300
4,12.29,50700,44000


In [9]:
# Check for correlation
X.corr()

Unnamed: 0,Average Price,Monthly Advertising Expenditures,Disposable Income per Household
Average Price,1.0,-0.172954,0.33416
Monthly Advertising Expenditures,-0.172954,1.0,0.429599
Disposable Income per Household,0.33416,0.429599,1.0


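No pair of predictors is **strongly correlated**: the largest absolute pairwise correlation is about 0.43 (Monthly Advertising Expenditures vs. Disposable Income per Household).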

In [10]:
# Let's fit the model
Y = df['Quantity Sold']

X1 = sm.add_constant(X)
reg_model1 = sm.OLS(Y,X1).fit()
reg_model1.summary()



Dep. Variable:,Quantity Sold,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.936
Method:,Least Squares,F-statistic:,69.17
Date:,"Fri, 20 May 2022",Prob (F-statistic):,2e-07
Time:,01:05:57,Log-Likelihood:,-140.63
No. Observations:,15,AIC:,289.3
Df Residuals:,11,BIC:,292.1
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-3.33e+04,1.79e+04,-1.861,0.090,-7.27e+04,6092.416
Average Price,-4041.5338,1040.640,-3.884,0.003,-6331.968,-1751.100
Monthly Advertising Expenditures,1.4544,0.152,9.593,0.000,1.121,1.788
Disposable Income per Household,1.5279,0.513,2.979,0.013,0.399,2.657

Omnibus:,0.919,Durbin-Watson:,1.542
Prob(Omnibus):,0.632,Jarque-Bera (JB):,0.815
Skew:,0.353,Prob(JB):,0.665
Kurtosis:,2.103,Cond. No.,1440000.0


1) Estimate the MLR model coefficients<br>

All three predictors are significant at the 5% level (p-values 0.003, 0.000 and 0.013). The intercept is not (p = 0.090).

**Regression Eq:**<br>
Quantity Sold = -33300 - 4041.5338 * Average Price + 1.4544 * Monthly Advertising Expenditures + 1.5279 * Disposable Income per Household

The model coefficients are:
- intercept(const) = -33300 : The predicted quantity sold when all explanatory variables are zero. This point lies far outside the observed data, so it has no practical interpretation.
- beta (Average Price) = -4041.5338 : The average decrease in quantity sold for a one-unit ($1) increase in Average Price, holding the other variables constant.
- beta (Monthly Advertising Expenditures) = 1.4544 : The average increase in quantity sold for a one-unit ($1) increase in Monthly Advertising Expenditures, holding the other variables constant.
- beta (Disposable Income per Household) = 1.5279 : The average increase in quantity sold for a one-unit ($1) increase in Disposable Income per Household, holding the other variables constant.
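
As a worked example, the fitted equation can be applied to outlet 1 from the data preview. This is a sketch using the coefficients as printed in the summary table. The intercept is only shown rounded (-3.33e+04), so the result is approximate.

```python
# Coefficients as reported in the summary table (intercept rounded to -33300)
coef = {
    'const': -33300.0,
    'Average Price': -4041.5338,
    'Monthly Advertising Expenditures': 1.4544,
    'Disposable Income per Household': 1.5279,
}

# Outlet 1 from the data preview
outlet_1 = {
    'Average Price': 10.14,
    'Monthly Advertising Expenditures': 64800,
    'Disposable Income per Household': 42100,
}

# Intercept plus the sum of coefficient * value over the three predictors
predicted = coef['const'] + sum(coef[name] * value for name, value in outlet_1.items())
print(round(predicted))
```

The prediction comes out at roughly 84,300 units, against an observed 85,300 for that outlet.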

2) Is there any evidence of a violation of any key assumption of regression analysis?

In [11]:
# Let's check multicollinearity (note: X has no constant column, so these are uncentered VIFs)
from statsmodels.stats.outliers_influence import variance_inflation_factor

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

Average Price                       176.795369
Monthly Advertising Expenditures     94.344057
Disposable Income per Household     362.798199
dtype: float64

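Taken at face value, these VIFs are very high and would indicate **strong multicollinearity**. However, `variance_inflation_factor` does not add an intercept, and `X` has no constant column, so these are uncentered VIFs, which are inflated when the variables have large means. The pairwise correlations above (at most 0.43 in absolute value) do not support strong multicollinearity, so this output alone is not reliable evidence that a regression assumption is violated. The residual diagnostics in the summary (Prob(Omnibus) = 0.632, Prob(JB) = 0.665) do not reject normality either, although with only 15 observations these tests have little power.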
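
As a cross-check, the usual (centered) VIFs equal the diagonal of the inverse of the predictor correlation matrix, so they can be recomputed from the `X.corr()` output shown earlier. This is a minimal sketch using only NumPy, with the correlations copied from that output. It is not the notebook's original method.

```python
import numpy as np

names = [
    'Average Price',
    'Monthly Advertising Expenditures',
    'Disposable Income per Household',
]

# Pairwise correlations copied from the X.corr() output
corr = np.array([
    [1.0, -0.172954, 0.33416],
    [-0.172954, 1.0, 0.429599],
    [0.33416, 0.429599, 1.0],
])

# Centered VIF_j is the j-th diagonal element of the inverse correlation matrix
vif = np.diag(np.linalg.inv(corr))
for name, value in zip(names, vif):
    print(f'{name}: {value:.2f}')

# In the notebook, the same values should result from calling
# variance_inflation_factor on X1 (which includes 'const') for columns 1 to 3.
```

All three values come out below 2, far from the common rule-of-thumb thresholds of 5 or 10.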

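3) Which variable among these would you choose to remove, and why?<br>

Average Price is removed here on the basis of its very high VIF. Note, however, that in the VIF output above Disposable Income per Household (362.8) has the highest value, not Average Price (176.8), and Average Price is highly significant in the full model (p = 0.003), so the case for removing it is weak.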

4) Will removing that variable increase the overall explanatory power of the model?

In [12]:
# Check X1
X1.head()

Unnamed: 0,const,Average Price,Monthly Advertising Expenditures,Disposable Income per Household
0,1.0,10.14,64800,42100
1,1.0,10.88,42800,38300
2,1.0,12.33,58600,41000
3,1.0,12.7,46500,43300
4,1.0,12.29,50700,44000


In [13]:
# Let's refit the model after removing Average Price
Y = df['Quantity Sold']
X_reduced = X.drop('Average Price', axis=1)  # leaves X and X1 unchanged

X2 = sm.add_constant(X_reduced)
reg_model2 = sm.OLS(Y, X2).fit()
reg_model2.summary()



Dep. Variable:,Quantity Sold,R-squared:,0.881
Model:,OLS,Adj. R-squared:,0.861
Method:,Least Squares,F-statistic:,44.27
Date:,"Fri, 20 May 2022",Prob (F-statistic):,2.89e-06
Time:,01:05:57,Log-Likelihood:,-147.11
No. Observations:,15,AIC:,300.2
Df Residuals:,12,BIC:,302.3
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-5.366e+04,2.52e+04,-2.127,0.055,-1.09e+05,1314.915
Monthly Advertising Expenditures,1.6734,0.207,8.064,0.000,1.221,2.125
Disposable Income per Household,0.6133,0.672,0.913,0.379,-0.850,2.077

Omnibus:,0.962,Durbin-Watson:,1.393
Prob(Omnibus):,0.618,Jarque-Bera (JB):,0.752
Skew:,0.214,Prob(JB):,0.686
Kurtosis:,1.989,Cond. No.,1370000.0


Removing Average Price did not increase the explanatory power of the model. R-squared fell from 0.950 to 0.881, adjusted R-squared fell from 0.936 to 0.861, and AIC rose from 289.3 to 300.2.
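
The comparison can be checked by hand from the reported R-squared values. The sketch below recomputes adjusted R-squared for both models and a partial F-statistic for the dropped variable. The inputs are the rounded figures from the two summary tables, so the results are approximate.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 15
r2_full, p_full = 0.950, 3        # model with Average Price
r2_reduced, p_reduced = 0.881, 2  # model without Average Price

adj_full = adjusted_r2(r2_full, n, p_full)
adj_reduced = adjusted_r2(r2_reduced, n, p_reduced)
print(round(adj_full, 3), round(adj_reduced, 3))

# Partial F-statistic for the single dropped variable:
# (gain in R-squared per dropped predictor) / (unexplained share per residual df)
f_partial = ((r2_full - r2_reduced) / (p_full - p_reduced)) / ((1 - r2_full) / (n - p_full - 1))
print(round(f_partial, 2))
```

The adjusted values match the summaries (0.936 and 0.861). The partial F of about 15.2 is close to the squared t-statistic of Average Price in the full model (3.884² ≈ 15.1), which confirms that the variable carries real explanatory power.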

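In fact, all three predictors were significant in the full model (p < 0.05), and after dropping Average Price, Disposable Income per Household is no longer significant (p = 0.379). The very high VIFs therefore did not point to a variable whose removal improves the model.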

The very small sample (only 15 observations for 3 predictors) also limits how much can be concluded from these diagnostics. The model can be re-estimated on a larger sample.