Melon Musk, the owner of SooperMall, has asked a data scientist to help him develop a predictive model for managing his revenue.
The data collected for 1000 customers is in the attached Excel sheet.

1) Fit a regression model with amount spent as the target, and identify the relevant explanatory variables.<br>
2) Which age group spends the most on purchases?<br>
3) How much additional spend is an extra catalog expected to generate?<br>
4) What is the impact of the number of children on amount spent? Does this make sense? Explain.

In [1]:
# import relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

In [2]:
# load data
df = pd.read_excel('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Regression-Models-main/SooperMall Catalog Marketing.xlsx', sheet_name='Data')
df.head()

Unnamed: 0,Person,Age,Gender,Own Home,Married,Close,Salary,Children,Catalogs,Amount Spent
0,1.0,1.0,0.0,0.0,0.0,1.0,16400.0,1.0,12.0,217.691
1,2.0,2.0,0.0,1.0,1.0,0.0,108100.0,3.0,18.0,2632.462
2,3.0,2.0,1.0,1.0,1.0,1.0,97300.0,1.0,12.0,3047.563
3,4.0,3.0,1.0,1.0,1.0,1.0,26800.0,0.0,12.0,434.606
4,5.0,1.0,1.0,0.0,0.0,1.0,11200.0,0.0,6.0,105.624


In [3]:
# Check the info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Person        1000 non-null   float64
 1   Age           1000 non-null   float64
 2   Gender        1000 non-null   float64
 3   Own Home      1000 non-null   float64
 4   Married       1000 non-null   float64
 5   Close         1000 non-null   float64
 6   Salary        1000 non-null   float64
 7   Children      1000 non-null   float64
 8   Catalogs      1000 non-null   float64
 9   Amount Spent  1000 non-null   float64
dtypes: float64(10)
memory usage: 78.2 KB


In [4]:
# Check for any missing values
df.isnull().sum()

Person          0
Age             0
Gender          0
Own Home        0
Married         0
Close           0
Salary          0
Children        0
Catalogs        0
Amount Spent    0
dtype: int64

There are no missing values in the dataframe.

In [5]:
# Drop Person as this is only a unique identifier
df.drop('Person', axis=1, inplace=True)
df.columns

Index(['Age', 'Gender', 'Own Home', 'Married', 'Close', 'Salary', 'Children',
       'Catalogs', 'Amount Spent'],
      dtype='object')

In [6]:
# Check value counts for Age
df.Age.value_counts()

2.0    508
1.0    287
3.0    205
Name: Age, dtype: int64

In [7]:
# Use data dictionary and replace the encoded variable Age with correct values
df['Age'] = df['Age'].apply(lambda x: str(x).replace('1.0', '30 or younger'))
df['Age'] = df['Age'].apply(lambda x: str(x).replace('2.0', '31 to 55'))
df['Age'] = df['Age'].apply(lambda x: str(x).replace('3.0', '56 or older'))

In [8]:
# Check the count after replacing
df.Age.value_counts()

31 to 55         508
30 or younger    287
56 or older      205
Name: Age, dtype: int64
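
The decoding step above applies one string replacement per code. A more compact alternative is to pass the whole data dictionary to `Series.map` in a single call. The sketch below runs on a small hypothetical frame, not the SooperMall data; the names `toy` and `age_labels` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy column using the same numeric codes as the Age variable
toy = pd.DataFrame({"Age": [1.0, 2.0, 3.0, 2.0]})

# Data dictionary for Age, using the labels that appear in this notebook
age_labels = {1.0: "30 or younger", 2.0: "31 to 55", 3.0: "56 or older"}

# map() looks each code up in the dictionary; a code missing from the
# dictionary would become NaN, which makes unexpected values easy to spot
toy["Age"] = toy["Age"].map(age_labels)
print(toy["Age"].value_counts())
```

The same pattern would apply to Gender, Own Home, Married and Close, each with its own two-entry dictionary.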

In [9]:
# Check value count for Gender
df.Gender.value_counts()

0.0    506
1.0    494
Name: Gender, dtype: int64

In [10]:
# Use data dictionary and replace the encoded variable Gender with correct values
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('0.0', 'Female'))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('1.0', 'Male'))

In [11]:
# Check value count after replacing
df.Gender.value_counts()

Female    506
Male      494
Name: Gender, dtype: int64

In [12]:
# Check value count for Own Home
df['Own Home'].value_counts()

1.0    516
0.0    484
Name: Own Home, dtype: int64

In [13]:
# Use data dictionary and replace the encoded variable Own Home with correct values
df['Own Home'] = df['Own Home'].apply(lambda x: str(x).replace('1.0', 'owns a home'))
df['Own Home'] = df['Own Home'].apply(lambda x: str(x).replace('0.0', 'does not own a home'))

In [14]:
# Check count after replacing
df['Own Home'].value_counts()

owns a home            516
does not own a home    484
Name: Own Home, dtype: int64

In [15]:
# Check value count for Married
df['Married'].value_counts()

1.0    502
0.0    498
Name: Married, dtype: int64

In [16]:
# Use data dictionary and replace the encoded variable Married with correct values
df['Married'] = df['Married'].apply(lambda x: str(x).replace('1.0', 'Married'))
df['Married'] = df['Married'].apply(lambda x: str(x).replace('0.0', 'Not married'))

In [17]:
# Check value counts after replacing
df['Married'].value_counts()

Married        502
Not married    498
Name: Married, dtype: int64

In [18]:
# Check the count for Close
df['Close'].value_counts()

1.0    710
0.0    290
Name: Close, dtype: int64

In [19]:
# Use data dictionary and replace the encoded variable Close with correct values
df['Close'] = df['Close'].apply(lambda x: str(x).replace('1.0', 'closer to shopping area'))
df['Close'] = df['Close'].apply(lambda x: str(x).replace('0.0', 'farther to shopping area'))

In [20]:
# Check the count after replacing
df['Close'].value_counts()

closer to shopping area     710
farther to shopping area    290
Name: Close, dtype: int64

In [21]:
# Check data
df.head()

Unnamed: 0,Age,Gender,Own Home,Married,Close,Salary,Children,Catalogs,Amount Spent
0,30 or younger,Female,does not own a home,Not married,closer to shopping area,16400.0,1.0,12.0,217.691
1,31 to 55,Female,owns a home,Married,farther to shopping area,108100.0,3.0,18.0,2632.462
2,31 to 55,Male,owns a home,Married,closer to shopping area,97300.0,1.0,12.0,3047.563
3,56 or older,Male,owns a home,Married,closer to shopping area,26800.0,0.0,12.0,434.606
4,30 or younger,Male,does not own a home,Not married,closer to shopping area,11200.0,0.0,6.0,105.624


In [22]:
# Create dummy variables
df_dummy_age = pd.get_dummies(df.Age, drop_first=True)
df_dummy_gender = pd.get_dummies(df.Gender, drop_first=True)
df_dummy_ownhome = pd.get_dummies(df['Own Home'], drop_first=True)
df_dummy_married = pd.get_dummies(df['Married'], drop_first=True)
df_dummy_close = pd.get_dummies(df['Close'], drop_first=True)

In [23]:
# Let's merge the dataframes
df = pd.concat([df, df_dummy_age, df_dummy_gender, df_dummy_ownhome, df_dummy_married, df_dummy_close], axis=1)
df.head()

Unnamed: 0,Age,Gender,Own Home,Married,Close,Salary,Children,Catalogs,Amount Spent,31 to 55,56 or older,Male,owns a home,Not married,farther to shopping area
0,30 or younger,Female,does not own a home,Not married,closer to shopping area,16400.0,1.0,12.0,217.691,0,0,0,0,1,0
1,31 to 55,Female,owns a home,Married,farther to shopping area,108100.0,3.0,18.0,2632.462,1,0,0,1,0,1
2,31 to 55,Male,owns a home,Married,closer to shopping area,97300.0,1.0,12.0,3047.563,1,0,1,1,0,0
3,56 or older,Male,owns a home,Married,closer to shopping area,26800.0,0.0,12.0,434.606,0,1,1,1,0,0
4,30 or younger,Male,does not own a home,Not married,closer to shopping area,11200.0,0.0,6.0,105.624,0,0,1,0,1,0


In [24]:
# Let's drop the variables from which we derived dummies
df.drop(['Age', 'Gender', 'Own Home', 'Married', 'Close'], axis=1, inplace=True)

In [25]:
# Check the data again
df.head()

Unnamed: 0,Salary,Children,Catalogs,Amount Spent,31 to 55,56 or older,Male,owns a home,Not married,farther to shopping area
0,16400.0,1.0,12.0,217.691,0,0,0,0,1,0
1,108100.0,3.0,18.0,2632.462,1,0,0,1,0,1
2,97300.0,1.0,12.0,3047.563,1,0,1,1,0,0
3,26800.0,0.0,12.0,434.606,0,1,1,1,0,0
4,11200.0,0.0,6.0,105.624,0,0,1,0,1,0
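
The three steps above (create the dummies, concatenate them, drop the original columns) can also be done in one `pd.get_dummies` call on the whole frame. This sketch reuses three columns from the five rows displayed earlier. The names `toy` and `encoded` are illustrative. Passing `dtype=int` is a choice made here so the dummies stay 0/1 integers, because recent pandas versions return booleans by default.

```python
import pandas as pd

# First five rows shown earlier, restricted to three columns for brevity
toy = pd.DataFrame({
    "Age": ["30 or younger", "31 to 55", "31 to 55", "56 or older", "30 or younger"],
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Amount Spent": [217.691, 2632.462, 3047.563, 434.606, 105.624],
})

# columns= encodes the listed columns and removes the originals;
# drop_first=True drops the first (alphabetical) level as the baseline;
# dtype=int keeps the dummies as 0/1 integers
encoded = pd.get_dummies(toy, columns=["Age", "Gender"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With this form the dummy columns carry a prefix (for example `Age_31 to 55`), so any later cell that refers to the unprefixed names used in this notebook would need adjusting.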


### 1) Fit a regression model with amount spent as target, and identify the relevant explanatory variables.

In [26]:
# Let's train the model
Y = df['Amount Spent']
X = df.drop('Amount Spent', axis=1)

X = sm.add_constant(X)  # add a constant (intercept) column
reg_model = sm.OLS(Y,X).fit()
reg_model.summary()

0,1,2,3
Dep. Variable:,Amount Spent,R-squared:,0.717
Model:,OLS,Adj. R-squared:,0.714
Method:,Least Squares,F-statistic:,278.5
Date:,"Mon, 16 May 2022",Prob (F-statistic):,4.26e-264
Time:,17:14:37,Log-Likelihood:,-7655.6
No. Observations:,1000,AIC:,15330.0
Df Residuals:,990,BIC:,15380.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-601.6465,75.067,-8.015,0.000,-748.956,-454.337
Salary,0.0222,0.001,22.604,0.000,0.020,0.024
Children,-201.3627,17.244,-11.677,0.000,-235.201,-167.524
Catalogs,43.1413,2.548,16.929,0.000,38.140,48.142
31 to 55,-93.3095,51.419,-1.815,0.070,-194.212,7.593
56 or older,-41.1596,56.568,-0.728,0.467,-152.166,69.847
Male,-41.5490,34.663,-1.199,0.231,-109.571,26.473
owns a home,47.5096,38.632,1.230,0.219,-28.300,123.319
Not married,63.9201,46.858,1.364,0.173,-28.032,155.872

0,1,2,3
Omnibus:,239.891,Durbin-Watson:,1.966
Prob(Omnibus):,0.0,Jarque-Bera (JB):,832.064
Skew:,1.133,Prob(JB):,2.09e-181
Kurtosis:,6.852,Cond. No.,333000.0
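
For intuition about what the `coef` column contains: the OLS coefficients are the least-squares solution for a design matrix that includes a column of ones for the constant. The sketch below shows this with NumPy on synthetic data. Every value in it is hypothetical (generated from known coefficients plus noise), and it is not a re-run of the SooperMall model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors: salary in thousands and number of catalogs sent
salary_k = rng.uniform(10, 100, size=n)
catalogs = rng.choice([6, 12, 18, 24], size=n)

# Synthetic target built from known coefficients plus random noise
true_beta = np.array([-500.0, 20.0, 40.0])  # intercept, salary_k, catalogs
noise = rng.normal(0, 50, size=n)
X = np.column_stack([np.ones(n), salary_k, catalogs])  # ones column = constant
y = X @ true_beta + noise

# Least-squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

The estimates should land close to the coefficients used to generate the data, with the gap shrinking as the sample grows or the noise falls.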


In [27]:
# Keep only the significant variables, i.e. those with p-values below 0.05
reg_model.pvalues[reg_model.pvalues < 0.05]

const                       3.091348e-15
Salary                      1.509545e-91
Children                    1.312927e-29
Catalogs                    1.184329e-56
farther to shopping area    6.803851e-41
dtype: float64
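
The list of significant predictors can be derived from the p-values instead of being typed by hand. The sketch below rebuilds the p-value series from the numbers reported in this notebook (the values taken from the summary table are rounded to three decimals), so it runs without the fitted model. The names `pvalues`, `alpha` and `significant` are illustrative.

```python
import pandas as pd

# p-values as reported in this notebook's outputs
pvalues = pd.Series({
    "const": 3.091348e-15,
    "Salary": 1.509545e-91,
    "Children": 1.312927e-29,
    "Catalogs": 1.184329e-56,
    "31 to 55": 0.070,
    "56 or older": 0.467,
    "Male": 0.231,
    "owns a home": 0.219,
    "Not married": 0.173,
    "farther to shopping area": 6.803851e-41,
})

alpha = 0.05  # significance threshold used in this notebook

# Keep the names below the threshold; the constant is excluded here
# because sm.add_constant adds it back when the model is refit
significant = pvalues[pvalues < alpha].index.drop("const").tolist()
print(significant)
```

With the fitted model in scope, `reg_model.pvalues` would take the place of the hand-built series, and `df[significant]` would give the reduced set of predictors.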

In [28]:
# Let's consider only the significant variables
Y = df['Amount Spent']
X = df[['farther to shopping area', 'Salary', 'Children', 'Catalogs']]

X = sm.add_constant(X)
reg_model = sm.OLS(Y,X).fit()
reg_model.summary()

0,1,2,3
Dep. Variable:,Amount Spent,R-squared:,0.715
Model:,OLS,Adj. R-squared:,0.714
Method:,Least Squares,F-statistic:,623.6
Date:,"Mon, 16 May 2022",Prob (F-statistic):,2.89e-269
Time:,17:14:37,Log-Likelihood:,-7659.1
No. Observations:,1000,AIC:,15330.0
Df Residuals:,995,BIC:,15350.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-539.8203,49.592,-10.885,0.000,-637.138,-442.503
farther to shopping area,508.0951,36.217,14.029,0.000,437.024,579.166
Salary,0.0209,0.001,38.466,0.000,0.020,0.022
Children,-203.4757,15.625,-13.023,0.000,-234.137,-172.815
Catalogs,42.7183,2.544,16.794,0.000,37.727,47.710

0,1,2,3
Omnibus:,251.537,Durbin-Watson:,1.96
Prob(Omnibus):,0.0,Jarque-Bera (JB):,883.443
Skew:,1.185,Prob(JB):,1.45e-192
Kurtosis:,6.948,Cond. No.,198000.0


**Regression Equation:**<br>
Amount Spent = -539.8203 + 508.0951 * farther to shopping area + 0.0209 * Salary - 203.4757 * Children + 42.7183 * Catalogs

The relevant explanatory variables are: farther to shopping area, Salary, Children and Catalogs.

- intercept (const) = -539.8203: The predicted mean amount spent when every explanatory variable is zero (i.e. the customer lives close to the shopping area, earns no salary, has no children and receives no catalogs). Spend cannot be negative, so this value is an extrapolation rather than a realistic prediction.
- Beta(farther to shopping area) = 508.0951: The average additional amount spent by customers who live farther from the shopping area compared with those who live closer, holding the other variables constant.
- Beta(Salary) = 0.0209: The average increase in amount spent for a one-unit increase in the customer's annual salary, holding the other variables constant.
- Beta(Children) = -203.4757: The average decrease in amount spent for each additional child, holding the other variables constant.
- Beta(Catalogs) = 42.7183: The average increase in amount spent for each additional catalog sent to the customer, holding the other variables constant.
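
As a worked example, the regression equation above can be evaluated for a single customer. The sketch below uses the coefficients exactly as printed in the summary table (rounded there, so results are approximate). The function name `predict_spend` and its argument names are assumptions made for illustration.

```python
# Coefficients copied from the reduced model's summary table (rounded there)
COEF = {
    "const": -539.8203,
    "farther to shopping area": 508.0951,
    "Salary": 0.0209,
    "Children": -203.4757,
    "Catalogs": 42.7183,
}

def predict_spend(far, salary, children, catalogs):
    """Evaluate the regression equation for one customer (far is 0 or 1)."""
    return (COEF["const"]
            + COEF["farther to shopping area"] * far
            + COEF["Salary"] * salary
            + COEF["Children"] * children
            + COEF["Catalogs"] * catalogs)

# First customer shown earlier: close to shops, salary 16400, 1 child, 12 catalogs
pred = predict_spend(far=0, salary=16400, children=1, catalogs=12)
print(round(pred, 2))
```

For this customer the equation gives about 112.08, while the observed spend is 217.691. Individual predictions can therefore be well off even though the model explains roughly 71.5% of the variance overall.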

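### 2) Which age group spends the most on purchases?

This cannot be determined from the model. The age-related variables '31 to 55' and '56 or older' are not statistically significant (p-values of 0.070 and 0.467, both above 0.05), so the model shows no reliable difference in spend between age groups once the other variables are accounted for.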
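
A descriptive complement to the regression result is the raw average spend per age group. The sketch below assumes the decoded Age column is still available (the notebook drops it after creating the dummies). It runs only on the five rows displayed earlier, so its output illustrates the calculation and says nothing about the full set of 1000 customers.

```python
import pandas as pd

# Only the five rows displayed earlier in this notebook, not the full data
sample = pd.DataFrame({
    "Age": ["30 or younger", "31 to 55", "31 to 55", "56 or older", "30 or younger"],
    "Amount Spent": [217.691, 2632.462, 3047.563, 434.606, 105.624],
})

# Raw mean spend per age group, with no adjustment for salary or other factors
mean_spend = (
    sample.groupby("Age")["Amount Spent"].mean().sort_values(ascending=False)
)
print(mean_spend)
```

Unlike the regression coefficients, these averages do not hold salary, children or catalogs constant, so a gap between groups here may simply reflect differences in those other variables.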

### 3) How much additional spend is an extra catalog expected to generate?

This can be read from the coefficient for Catalogs. Each additional catalog is associated with an additional spend of about 42.72 on average, holding the other variables constant (95% confidence interval: 37.727 to 47.710).
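
Because the model is linear, the expected extra spend scales with the number of extra catalogs, and so do the bounds of the coefficient's 95% confidence interval. The sketch below uses the Catalogs row of the reduced model's table. The function name `extra_spend` and the six-catalog scenario are hypothetical.

```python
# Catalogs coefficient and its 95% interval from the reduced model's table
coef, ci_low, ci_high = 42.7183, 37.727, 47.710

def extra_spend(extra_catalogs):
    """Return (point estimate, lower bound, upper bound) for the added spend."""
    return tuple(round(b * extra_catalogs, 2) for b in (coef, ci_low, ci_high))

# Hypothetical scenario: six more catalogs sent to one customer
estimate = extra_spend(6)
print(estimate)
```

This extrapolation is only as good as the linearity assumption. It may not hold far outside the range of catalog counts seen in the data.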

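### 4) What is the impact of the number of children on amount spent? Does this make sense? Explain.

Holding the other variables constant, each additional child is associated with a decrease of 203.4757 in the average amount spent. This is plausible: at a given salary, a household with more children likely has less disposable income left for discretionary purchases.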