500 households were surveyed on their monthly expenses. The data is in the file MLR_Q16_MonthlyExpense.csv (location: https://drive.google.com/drive/folders/1rRbSnLml_iqwC8EeFOrEsetoov2yyHrF). Use the monthly payment as the dependent variable.

1) Begin with family size and iteratively add one variable and estimate the resulting regression equation.<br>
2) Does adding any explanatory variable lead to a fall in adjusted R-Squared?<br>
3) Which variables are added in the final model?<br>
4) Interpret the coefficients, R-Squared and standard error of estimate for the final model.<br>
5) What result do you get if you use "mlxtend.feature_selection" stepwise regression?

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

In [2]:
# Load data
survey_df = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Regression-Models-main/MLR_Q16_MonthlyExpense.csv')
survey_df.head()

| | Household | Monthly Payment | Family Size | Sector No | Rent | Own | Income | Utilities | Debt |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | $1,585 | 2 | 2 | 0 | 1 | $96,709 | $252 | $5,692 |
| 1 | 2 | $1,314 | 6 | 2 | 1 | 0 | $77,470 | $216 | $4,267 |
| 2 | 3 | $383 | 3 | 4 | 1 | 0 | $65,746 | $207 | $2,903 |
| 3 | 4 | $1,002 | 1 | 1 | 0 | 1 | $56,610 | $249 | $3,896 |
| 4 | 5 | $743 | 3 | 3 | 1 | 0 | $59,185 | $217 | $3,011 |


In [3]:
# Check data type
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Household        500 non-null    int64 
 1   Monthly Payment  500 non-null    object
 2   Family Size      500 non-null    int64 
 3   Sector No        500 non-null    int64 
 4   Rent             500 non-null    int64 
 5   Own              500 non-null    int64 
 6   Income           500 non-null    object
 7   Utilities        500 non-null    object
 8   Debt             500 non-null    object
dtypes: int64(5), object(4)
memory usage: 35.3+ KB


In [4]:
# Data cleaning: strip "$" and "," from the currency columns and convert to int
def clean(col):
    # Raw string avoids the invalid "\$" escape warning
    return col.str.replace(r"[$,]", "", regex=True).astype(int)

money_cols = ['Monthly Payment', 'Income', 'Utilities', 'Debt']
survey_df[money_cols] = survey_df[money_cols].apply(clean)

# Convert Sector No to dummy variables (Sector 1 is the baseline).
# dtype=int keeps the dummies as 0/1 instead of booleans in newer pandas.
sector = pd.get_dummies(survey_df['Sector No'], prefix='Sector', drop_first=True, dtype=int)
survey_df = pd.concat([survey_df, sector], axis=1)

# Drop Household and Sector No; Rent and Own are complementary dummies, so drop Rent
survey_df.drop(['Household', 'Sector No', 'Rent'], axis=1, inplace=True)

# Check data
survey_df.head()

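| | Monthly Payment | Family Size | Own | Income | Utilities | Debt | Sector_2 | Sector_3 | Sector_4 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1585 | 2 | 1 | 96709 | 252 | 5692 | 1 | 0 | 0 |
| 1 | 1314 | 6 | 0 | 77470 | 216 | 4267 | 1 | 0 | 0 |
| 2 | 383 | 3 | 0 | 65746 | 207 | 2903 | 0 | 0 | 1 |
| 3 | 1002 | 1 | 1 | 56610 | 249 | 3896 | 0 | 0 | 0 |
| 4 | 743 | 3 | 0 | 59185 | 217 | 3011 | 0 | 1 | 0 |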
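
The two cleaning steps in this cell can be checked in isolation. The following is a minimal sketch on a few hand-typed values rather than the survey file; the names `raw`, `cleaned`, `sectors`, and `dummies` are illustrative only.

```python
import pandas as pd

# Currency strings in the same style as the raw survey columns
raw = pd.Series(["$1,585", "$383", "$96,709"])

# Remove "$" and "," with a character class, then cast to int
cleaned = raw.str.replace(r"[$,]", "", regex=True).astype(int)
print(cleaned.tolist())

# Dummy-encode a sector column; drop_first=True makes Sector 1 the baseline
sectors = pd.Series([2, 4, 1, 3], name="Sector No")
dummies = pd.get_dummies(sectors, prefix="Sector", drop_first=True, dtype=int)
print(dummies)
```

A household in Sector 1 shows up as a row of zeros across `Sector_2`, `Sector_3`, and `Sector_4`, which is why the sector coefficients later in the notebook are read relative to Sector 1.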


In [5]:
# Check for correlation
survey_df[['Family Size','Income','Utilities','Debt']].corr()

| | Family Size | Income | Utilities | Debt |
|---|---|---|---|---|
| Family Size | 1.0 | 0.200114 | 0.256233 | 0.293883 |
| Income | 0.200114 | 1.0 | 0.281003 | 0.528817 |
| Utilities | 0.256233 | 0.281003 | 1.0 | 0.777548 |
| Debt | 0.293883 | 0.528817 | 0.777548 | 1.0 |


**Debt and Utilities are strongly correlated (r ≈ 0.78); Debt and Income are moderately correlated (r ≈ 0.53).**

In [6]:
# Check multi-collinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = survey_df[["Family Size", "Income", "Utilities", "Debt"]]

pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

Family Size     5.149581
Income          7.989059
Utilities      14.542619
Debt           13.912665
dtype: float64

In [7]:
# Drop Utilities and check for multi-collinearity
X = X.drop('Utilities', axis=1)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

Family Size    4.408813
Income         7.484651
Debt           8.942531
dtype: float64

In [8]:
# Drop Debt and check for multi-collinearity
X = X.drop('Debt', axis=1)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

Family Size    3.541786
Income         3.541786
dtype: float64

In [9]:
# Define the target and the predictors (Utilities and Debt dropped after the VIF check)
Y = survey_df['Monthly Payment']
X = survey_df.drop(['Monthly Payment','Utilities','Debt'], axis=1)

In [10]:
# Check X
X.head()

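| | Family Size | Own | Income | Sector_2 | Sector_3 | Sector_4 |
|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 96709 | 1 | 0 | 0 |
| 1 | 6 | 0 | 77470 | 1 | 0 | 0 |
| 2 | 3 | 0 | 65746 | 0 | 0 | 1 |
| 3 | 1 | 1 | 56610 | 0 | 0 | 0 |
| 4 | 3 | 0 | 59185 | 0 | 1 | 0 |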


In [11]:
# Fit nested models, adding one variable at a time in column order
col_list = range(len(X.columns))
X = sm.add_constant(X)

for numVar in col_list:
    # Columns 0..numVar+1: the constant plus the first numVar+1 predictors
    X_model = X.iloc[:, 0:numVar+2]
    ols = sm.OLS(Y, X_model).fit()
    print("____________________________")
    print("R-SQ", ols.rsquared.round(2), "/Adj_R-SQ:", ols.rsquared_adj.round(2))
    print(ols.pvalues.round(2))
    print()

# After the loop, ols holds the full six-variable model
ols.summary()

____________________________
R-SQ 0.01 /Adj_R-SQ: 0.0
const          0.00
Family Size    0.09
dtype: float64

____________________________
R-SQ 0.31 /Adj_R-SQ: 0.31
const          0.00
Family Size    0.09
Own            0.00
dtype: float64

____________________________
R-SQ 0.39 /Adj_R-SQ: 0.39
const          0.0
Family Size    0.0
Own            0.0
Income         0.0
dtype: float64

____________________________
R-SQ 0.42 /Adj_R-SQ: 0.42
const          0.0
Family Size    0.0
Own            0.0
Income         0.0
Sector_2       0.0
dtype: float64

____________________________
R-SQ 0.43 /Adj_R-SQ: 0.42
const          0.00
Family Size    0.00
Own            0.00
Income         0.00
Sector_2       0.00
Sector_3       0.04
dtype: float64

____________________________
R-SQ 0.49 /Adj_R-SQ: 0.49
const          0.00
Family Size    0.01
Own            0.00
Income         0.01
Sector_2       0.00
Sector_3       0.01
Sector_4       0.00
dtype: float64



| | | | |
|---|---|---|---|
| Dep. Variable: | Monthly Payment | R-squared: | 0.492 |
| Model: | OLS | Adj. R-squared: | 0.486 |
| Method: | Least Squares | F-statistic: | 79.49 |
| Date: | Mon, 23 May 2022 | Prob (F-statistic): | 2.7e-69 |
| Time: | 17:36:25 | Log-Likelihood: | -3475.5 |
| No. Observations: | 500 | AIC: | 6965.0 |
| Df Residuals: | 493 | BIC: | 6994.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 885.6756 | 47.355 | 18.703 | 0.000 | 792.634 | 978.717 |
| Family Size | -18.4527 | 7.543 | -2.446 | 0.015 | -33.273 | -3.633 |
| Own | 239.1337 | 25.917 | 9.227 | 0.000 | 188.213 | 290.055 |
| Income | 0.0012 | 0.000 | 2.454 | 0.014 | 0.000 | 0.002 |
| Sector_2 | 113.2689 | 33.534 | 3.378 | 0.001 | 47.382 | 179.156 |
| Sector_3 | -83.9968 | 33.755 | -2.488 | 0.013 | -150.317 | -17.676 |
| Sector_4 | -288.2482 | 36.464 | -7.905 | 0.000 | -359.891 | -216.605 |

| | | | |
|---|---|---|---|
| Omnibus: | 23.881 | Durbin-Watson: | 2.059 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 26.471 |
| Skew: | 0.507 | Prob(JB): | 1.79e-06 |
| Kurtosis: | 3.491 | Cond. No. | 388000.0 |


In [12]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

Y = survey_df['Monthly Payment']
X = survey_df.drop(['Monthly Payment','Utilities','Debt'], axis=1)

lr = LinearRegression()
sfs_forward = sfs(lr,
                 k_features=(1,6),
                 forward=True,
                 floating=True,
                 scoring='neg_mean_squared_error',
                 cv=10)
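# Use a separate name for the fitted selector so the imported sfs alias is not overwritten
forward_fit = sfs_forward.fit(X, Y)
print('Forward Selection Subset:', forward_fit.k_feature_names_)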

Forward Selection Subset: ('Family Size', 'Own', 'Income', 'Sector_2', 'Sector_3', 'Sector_4')


In [13]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

sfs_back = sfs(lr,
               k_features=5,
               forward=False,
               floating=True,
               scoring='neg_mean_squared_error',
               cv=10)

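# Use a separate name for the fitted selector so the imported sfs alias is not overwritten
backward_fit = sfs_back.fit(X, Y)
print('Backward Elimination Subset:', backward_fit.k_feature_names_)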

Backward Elimination Subset: ('Family Size', 'Own', 'Sector_2', 'Sector_3', 'Sector_4')
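
Both selectors rank candidate subsets by 10-fold cross-validated negative mean squared error. To see what that score looks like for a specific subset, the same scoring can be done by hand with scikit-learn's `cross_val_score`. This is a sketch on synthetic data, not a re-run of the survey models; the `strong` and `weak` columns and the `cv_mse` helper are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: one strong predictor and one that barely matters
rng = np.random.default_rng(42)
n = 500
X_demo = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
})
y_demo = 3.0 * X_demo["strong"] + 0.05 * X_demo["weak"] + rng.normal(size=n)

def cv_mse(columns):
    """Mean 10-fold cross-validated MSE for a linear model on the given columns."""
    scores = cross_val_score(
        LinearRegression(), X_demo[columns], y_demo,
        scoring="neg_mean_squared_error", cv=10,
    )
    return -scores.mean()  # flip the sign back to a positive MSE

mse_both = cv_mse(["strong", "weak"])
mse_strong_only = cv_mse(["strong"])
mse_weak_only = cv_mse(["weak"])

print(f"both: {mse_both:.3f}  strong only: {mse_strong_only:.3f}  weak only: {mse_weak_only:.3f}")
```

When a variable contributes little, the cross-validated MSE with and without it is nearly the same, so whether it is kept can depend on the selector's settings more than on the data.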


**Answers:**

1) Final Model:<br>
Monthly Payment = 885.7 - 18.45 * Family Size + 239.1 * Own + 0.0012 * Income + 113.3 * Sector_2 - 84.0 * Sector_3 - 288.2 * Sector_4

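2) Does adding any explanatory variable lead to a fall in adjusted R-Squared?<br>
No. At two decimal places, adjusted R-squared never falls as variables are added: 0.00, 0.31, 0.39, 0.42, 0.42, 0.49. Adding Sector_3 leaves it unchanged at 0.42.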

3) Which variables are added in the final model?<br>
(Family Size , Own, Income, Sector_2, Sector_3, Sector_4)

4) Interpret the coefficients, R-squared and standard error of estimate for the final model.

Coefficients:

- Family Size: -18.45 | Holding the other variables fixed, each additional family member is associated with a monthly payment about $18 lower.
- Own: 239 | Holding the other variables fixed, owners pay about $239 more per month than renters.
- Income: 0.0012 | Each additional $1,000 of income is associated with about $1.20 more in monthly payment.
- Sector: | Relative to Sector 1 (the baseline), households in Sector 2 spend the most (about $113 more). Households in Sector 3 and Sector 4 spend about $84 and $288 less, respectively.
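
To make these interpretations concrete, the fitted equation can be evaluated for a few households. The sketch below copies the coefficients from the OLS summary as reported there (so the Income coefficient is rounded to 0.0012); the `predict_payment` helper and the example household (family of 3, income 60,000) are hypothetical.

```python
# Coefficients copied from the OLS summary (rounded as reported there)
COEF = {
    "const": 885.6756,
    "Family Size": -18.4527,
    "Own": 239.1337,
    "Income": 0.0012,
    "Sector_2": 113.2689,
    "Sector_3": -83.9968,
    "Sector_4": -288.2482,
}

def predict_payment(family_size, own, income, sector):
    """Predicted monthly payment for one household (hypothetical helper)."""
    pred = (COEF["const"]
            + COEF["Family Size"] * family_size
            + COEF["Own"] * own
            + COEF["Income"] * income)
    if sector in (2, 3, 4):          # Sector 1 is the baseline: no adjustment
        pred += COEF[f"Sector_{sector}"]
    return pred

# Same hypothetical household, varying one attribute at a time
owner_s1 = predict_payment(3, 1, 60_000, 1)
renter_s1 = predict_payment(3, 0, 60_000, 1)
owner_s4 = predict_payment(3, 1, 60_000, 4)

print(round(owner_s1, 2))              # 1141.45
print(round(owner_s1 - renter_s1, 2))  # 239.13, the Own coefficient
print(round(owner_s1 - owner_s4, 2))   # 288.25, the Sector_4 gap to Sector 1
```

Changing a single input while holding the rest fixed reproduces the corresponding coefficient, which is exactly what "holding the other variables fixed" means in the bullet points.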

R-Squared is 0.49, which is quite low: the model explains about 49% of the variation in monthly payment.

Standard error of estimate:

In [14]:
np.sqrt((np.sum((ols.resid)**2))/(ols.df_resid)).round(2)

254.44

The standard error of estimate is high: predictions from the final model typically miss the actual monthly payment by about $254.

5) What result do you get if you use "mlxtend.feature_selection" stepwise regression?

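- Forward Selection Subset: ('Family Size', 'Own', 'Income', 'Sector_2', 'Sector_3', 'Sector_4')
- Backward Elimination Subset: ('Family Size', 'Own', 'Sector_2', 'Sector_3', 'Sector_4')

The backward run was configured with `k_features=5`, so it had to return exactly five variables; Income is the one it dropped. The forward run was allowed to return anywhere from 1 to 6 variables and kept all six, which matches the final model from the manual procedure.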