Consider a dataset that records whether or not a customer purchased an automobile. It contains the following variables:
* User ID
* Gender of customer
* Age of customer
* Estimated salary of customer
* Purchased (1 or 0): 1 - Auto purchased and 0 - Auto not purchased

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.api import Logit, add_constant

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load data
df = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Classification-Models-main/data/LR3.csv')
df.sample(5)

,UserID,Gender,Age,EstimatedSalary,Purchased
369,15624755,Female,54,26000,1
123,15574305,Male,35,53000,0
84,15798659,Female,30,62000,0
361,15778830,Female,53,34000,1
54,15654901,Female,27,58000,0


In [3]:
# Let's check info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   UserID           400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


1. There are no missing values.
2. Data types of variables appear to be appropriate.

In [4]:
# Let's check correlation
X = df[['Age','EstimatedSalary']]
X.corr()

,Age,EstimatedSalary
Age,1.0,0.155238
EstimatedSalary,0.155238,1.0


**No strong correlation between Age and Salary**

In [5]:
# Check correlation with y
y = df['Purchased']
X.corrwith(y)

Age                0.622454
EstimatedSalary    0.362083
dtype: float64

**Age of the customer is more strongly correlated with y (0.62) than EstimatedSalary is (0.36)**

In [6]:
# Check multi-collinearity (no intercept column is added to X here, so these are uncentered VIFs)
from statsmodels.stats.outliers_influence import variance_inflation_factor
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

Age                4.575819
EstimatedSalary    4.575819
dtype: float64

**Both VIFs are below 5, indicating no strong multi-collinearity between Age and EstimatedSalary**
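
With exactly two predictors and an intercept in the auxiliary regression, the VIF reduces to 1 / (1 - r²), where r is the Pearson correlation between the two predictors. statsmodels' `variance_inflation_factor` uses the design matrix as given, so an intercept has to be added explicitly (for example with `sm.add_constant`) to obtain this centered version. The sketch below only plugs in the correlation reported earlier; the variable names are illustrative.

```python
# Centered VIF for two predictors: 1 / (1 - r^2), where r is the
# Pearson correlation between Age and EstimatedSalary reported earlier.
r_age_salary = 0.155238

vif_centered = 1.0 / (1.0 - r_age_salary ** 2)
print(round(vif_centered, 4))
```

A value this close to 1 is consistent with the weak Age-Salary correlation and supports the same conclusion: multi-collinearity is not a concern here.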

In [7]:
# Let's check Gender
df['Gender'].value_counts()

Female    204
Male      196
Name: Gender, dtype: int64

In [8]:
# Create dummies (dtype=int keeps the 0/1 encoding on newer pandas versions, which default to bool)
gender_dummy = pd.get_dummies(df['Gender'], drop_first=True, dtype=int)
df = pd.concat([df, gender_dummy], axis=1)
df.head()

,UserID,Gender,Age,EstimatedSalary,Purchased,Male
0,15624510,Male,19,19000,0,1
1,15810944,Male,35,20000,0,1
2,15668575,Female,26,43000,0,0
3,15603246,Female,27,57000,0,0
4,15804002,Male,19,76000,0,1


In [9]:
# Drop UserID and Gender
df.drop(['UserID','Gender'], axis=1, inplace=True)
df.columns

Index(['Age', 'EstimatedSalary', 'Purchased', 'Male'], dtype='object')

In [10]:
# Fit the model
y = df['Purchased']
X = df.drop('Purchased', axis=1)

X = sm.add_constant(X)
model1 = sm.Logit(y,X).fit()
model1.summary()

Optimization terminated successfully.
         Current function value: 0.344804
         Iterations 8


Dep. Variable:,Purchased,No. Observations:,400
Model:,Logit,Df Residuals:,396
Method:,MLE,Df Model:,3
Date:,"Tue, 07 Jun 2022",Pseudo R-squ.:,0.4711
Time:,18:39:14,Log-Likelihood:,-137.92
converged:,True,LL-Null:,-260.79
Covariance Type:,nonrobust,LLR p-value:,5.488e-53

,coef,std err,z,P>|z|,[0.025,0.975]
const,-12.7836,1.359,-9.405,0.000,-15.448,-10.120
Age,0.2370,0.026,8.984,0.000,0.185,0.289
EstimatedSalary,3.644e-05,5.47e-06,6.659,0.000,2.57e-05,4.72e-05
Male,0.3338,0.305,1.094,0.274,-0.264,0.932


**Since Male is not statistically significant (p = 0.274), let's refit the model with only Age and EstimatedSalary**

In [11]:
# Fit the model
y = df['Purchased']
X = df[['Age','EstimatedSalary']]

X = sm.add_constant(X)
model2 = sm.Logit(y,X).fit()
model2.summary()

Optimization terminated successfully.
         Current function value: 0.346314
         Iterations 8


Dep. Variable:,Purchased,No. Observations:,400
Model:,Logit,Df Residuals:,397
Method:,MLE,Df Model:,2
Date:,"Tue, 07 Jun 2022",Pseudo R-squ.:,0.4688
Time:,18:39:14,Log-Likelihood:,-138.53
converged:,True,LL-Null:,-260.79
Covariance Type:,nonrobust,LLR p-value:,7.995e-54

,coef,std err,z,P>|z|,[0.025,0.975]
const,-12.4340,1.300,-9.566,0.000,-14.982,-9.886
Age,0.2335,0.026,9.013,0.000,0.183,0.284
EstimatedSalary,3.59e-05,5.43e-06,6.613,0.000,2.53e-05,4.65e-05
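
One way to check that dropping Male costs little in fit is a likelihood-ratio test between the two models. The sketch below uses only the log-likelihoods printed in the two summaries and the fact that the models differ by a single parameter; the variable names are illustrative, and the chi-square tail probability is computed with the standard library rather than a statistics package.

```python
import math

# Log-likelihoods copied from the two summaries above.
ll_full = -137.92      # model1: Age, EstimatedSalary, Male
ll_reduced = -138.53   # model2: Age, EstimatedSalary

# Likelihood-ratio statistic; the models differ by one parameter (Male).
lr_stat = 2 * (ll_full - ll_reduced)

# Survival function of a chi-square with 1 degree of freedom:
# P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(round(lr_stat, 2), round(p_value, 3))
```

A p-value well above 0.05 would agree with the Wald result for Male (p = 0.274): the smaller model fits about as well as the full one.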


1. Fit a Logit model and explain the significance of predictors on the 'Purchase' decision

**Logit Equation:**<br>
log(p / (1 - p)) = -12.434 + 0.2335 * Age + 0.0000359 * EstimatedSalary, where p = P(Purchased = 1)
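
The right-hand side of the equation is a log-odds value, so turning it into a purchase probability requires the logistic function. A minimal sketch using the fitted coefficients; the function name and the example customer (age 40, salary 80,000) are hypothetical and chosen only for illustration.

```python
import math

# Coefficients from the model2 summary above.
b0, b_age, b_salary = -12.4340, 0.2335, 3.59e-05

def purchase_probability(age, salary):
    """Convert the fitted log-odds into a probability via the logistic function."""
    log_odds = b0 + b_age * age + b_salary * salary
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical customer, chosen only for illustration.
p = purchase_probability(age=40, salary=80_000)
print(round(p, 3))
```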

The p-values for both variables, Age and EstimatedSalary, are reported as 0.000 (that is, below 0.0005), indicating that both are statistically significant.

Interpretation of betas:<br>
* The log odds of a purchase increase by 0.2335 for each additional year of 'Age', holding 'EstimatedSalary' constant
* The log odds of a purchase increase by 0.0000359 for each additional unit of 'EstimatedSalary', holding 'Age' constant
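
Log-odds changes are often easier to read as odds ratios, obtained by exponentiating the coefficients. The sketch below does this for the two betas; rescaling salary to steps of 10,000 is an illustrative choice made here, not something taken from the analysis above.

```python
import math

# Coefficients from the model2 summary above.
b_age, b_salary = 0.2335, 3.59e-05

# Multiplicative change in the odds of purchase.
or_age = math.exp(b_age)                     # per additional year of age
or_salary_10k = math.exp(b_salary * 10_000)  # per additional 10,000 of salary

print(round(or_age, 3), round(or_salary_10k, 3))
```

An odds ratio above 1 means the odds of purchase rise as the predictor increases, with the other predictor held constant.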

2. Calculate Wald test statistic for Age

In [12]:
# Wald z-statistic = coefficient / standard error (rounded values from the summary above)
wald_test = round((0.2335/0.026),2)
wald_test

8.98
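
The value 8.98 differs slightly from the z of 9.013 in the summary table, which is consistent with the standard error having been rounded to 0.026 before dividing. The Wald statistic is also often quoted in its chi-square form, which is the square of z with 1 degree of freedom. A minimal sketch of both forms and the two-sided p-value, using only the standard library; the variable names are illustrative.

```python
import math

coef_age, se_age = 0.2335, 0.026   # rounded values from the summary

z = coef_age / se_age              # Wald z-statistic
wald_chi2 = z ** 2                 # equivalent chi-square form, 1 degree of freedom

# Two-sided p-value under the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 2), round(wald_chi2, 2), p_value)
```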