This project aims to answer the questions posed at https://www.kaggle.com/jackdaoud/marketing-data/tasks?taskId=2986 using statistical tests, mainly **regression** and, in some cases, hypothesis testing.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import mstats
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import warnings
warnings.filterwarnings("ignore")
sns.set_style('whitegrid')

In [None]:
data = pd.read_csv('../input/marketing-data/marketing_data.csv')

# Exploratory Data Analysis

In [None]:
data.info()

In [None]:
data.head()

In [None]:
data = data[data['Year_Birth'] >= 1940]

In [None]:
data.columns = data.columns.str.replace(' ','')

In [None]:
data['children_total'] = data['Kidhome'] + data['Teenhome']
data.drop(['Kidhome','Teenhome'], axis = 1, inplace = True)

In [None]:
data.drop(['ID','Year_Birth','Recency','Dt_Customer','Response','Complain','NumWebVisitsMonth'], axis = 1, inplace = True)

In [None]:
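# regex = False treats '$' as a literal character rather than a regex end-of-string anchor.
data['Income'] = data['Income'].str.replace('$','', regex = False).str.replace(',','', regex = False).astype('float')
# Assign the result back instead of using inplace on a column selection.
data['Income'] = data['Income'].fillna(data['Income'].median())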

In [None]:
data['total_purchase'] = data['NumDealsPurchases'] + data['NumWebPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases']
data['total_spent'] = data['MntWines'] + data['MntFruits'] + data['MntMeatProducts'] + data['MntFishProducts'] + data['MntSweetProducts'] + data['MntGoldProds']

In [None]:
data[data['Country'] == 'IND'].describe()

In [None]:
data[data['Country'] == 'US'].describe()

In [None]:
data[data['Country'] == 'SP'].describe()

In [None]:
data[data['Country'] == 'SA'].describe()

In [None]:
data[data['Country'] == 'AUS'].describe()

In [None]:
data[data['Country'] == 'CA'].describe()

In [None]:
data[data['Country'] == 'GER'].describe()

In [None]:
data.loc[data.Income < 34000, 'Income'] = 34000
data = data[data['Income'] <= 200000]
data = data[data['Country'] != 'ME']
data['Marital_Status'] = data['Marital_Status'].replace('Absurd','Single').replace('YOLO', 'Single').replace('Alone','Single')

In [None]:
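# dtype = float keeps the dummy columns numeric so statsmodels can consume them directly.
dummies = pd.get_dummies(data[['Marital_Status','Education','Country']], drop_first = True, dtype = float)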
data_dummies = pd.concat([data,dummies], axis = 1)
data_dummies.drop(['Country','Marital_Status','Education'], axis = 1, inplace = True)

In [None]:
cols = ['Income', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases','children_total']

In [None]:
data[cols].describe()

In [None]:
fig, ax = plt.subplots(4,3, figsize = (20,15))
for i, col in enumerate(data_dummies[cols]):
    sns.boxplot(data = data_dummies[col], ax = ax[i//3,i%3]).set_title(col)

In [None]:
fig, ax = plt.subplots(4,3, figsize = (20,15))
for i, col in enumerate(data_dummies[cols]):
    sns.histplot(data_dummies[col],ax = ax[i//3,i%3])


Many rows had an `Income` of less than 10,000 dollars, which is most likely an error. Incomes below 10,000 were therefore replaced with 34,000, which is close to the 25th percentile for each country (as written, the cleaning cell applies this floor to every income below 34,000). There are outliers, but they are not errors. They are plausible given differences in preferences and demographics, so they will be kept.
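
The replacement rule described above can be sketched in plain Python. This is only an illustration: `floor_income`, its parameters, and the sample incomes are hypothetical names and values, not part of the notebook. The threshold is a parameter so that the 10,000 cutoff described in the text and the 34,000 cutoff used in the cleaning cell can be compared side by side.

```python
def floor_income(incomes, threshold=10_000, replacement=34_000):
    """Replace incomes below `threshold` with `replacement`; leave the rest unchanged."""
    return [replacement if income < threshold else income for income in incomes]

# Hypothetical incomes, used only to show how the two cutoffs differ.
sample_incomes = [2_500, 9_999, 10_000, 34_000, 52_000]

described_rule = floor_income(sample_incomes, threshold=10_000)
coded_rule = floor_income(sample_incomes, threshold=34_000)

print("cutoff 10,000:", described_rule)
print("cutoff 34,000:", coded_rule)
```

With the higher cutoff, every income between 10,000 and 34,000 is also raised to 34,000, so the choice of threshold changes the lower tail of the `Income` distribution.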

# Statistical Analysis

Before drawing inferences from regression models, we must check the assumptions of the chosen regression model. This validates the significance tests by ensuring that the coefficients, p-values and standard errors are reliable.

In [None]:
data_dummies['NumStorePurchases'].describe()

`NumStorePurchases` is a count variable, for which Poisson regression would be the natural model. However, the Poisson assumption that the mean equals the variance does not hold: the variance is greater than the mean, which indicates overdispersion. Therefore, we will use negative binomial regression instead. The dataset does not state the time period covered by the number of store purchases. I am assuming it is the same 2-year period as the variables for amount spent on products.
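
The overdispersion check amounts to comparing the sample variance with the sample mean. The sketch below uses only the standard library. `dispersion_ratio` and the example counts are hypothetical and are not taken from the dataset. A ratio well above 1 suggests overdispersion relative to a Poisson model.

```python
import statistics

def dispersion_ratio(counts):
    """Return the sample variance divided by the sample mean of a count series."""
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)  # sample variance (n - 1 denominator)
    return variance / mean

# Hypothetical purchase counts, used only to illustrate the check.
example_counts = [2, 3, 3, 4, 5, 6, 8, 10, 12, 13]
ratio = dispersion_ratio(example_counts)

print(f"variance / mean = {ratio:.2f}")
if ratio > 1:
    print("Variance exceeds the mean: consider a negative binomial model.")
```

In the notebook the same comparison can be read from the `mean` and `std` rows of `describe()` by squaring the standard deviation.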

In [None]:
x = data_dummies.drop(['total_purchase','total_spent','NumStorePurchases'], axis = 1)
y = data_dummies['NumStorePurchases']

In [None]:
nb_constant = sm.add_constant(x)
nb = sm.GLM(y,nb_constant, family = sm.families.NegativeBinomial()).fit()

nb_df = pd.DataFrame()
nb_df['coeff'] = nb.params[nb.pvalues <= 0.05]
nb_df['p-value'] = round(nb.pvalues[nb.pvalues <= 0.05],3)

nb_df[1:]

The number of deal purchases and the number of web purchases both have a significant relationship with the number of store purchases. Amount spent on wine and income are also significant factors in how often people shop in store.
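
Since the negative binomial family in statsmodels uses a log link by default, each coefficient is on the log scale, and exponentiating it gives a multiplicative rate ratio. The sketch below shows that conversion. The helper names and the coefficient value are hypothetical and are not results from the fitted model.

```python
import math

def rate_ratio(coefficient):
    """Convert a log-link coefficient into a multiplicative rate ratio."""
    return math.exp(coefficient)

def percent_change(coefficient):
    """Express the rate ratio as a percent change in the expected count."""
    return (rate_ratio(coefficient) - 1.0) * 100.0

# Hypothetical coefficient for a one-unit increase in a predictor.
hypothetical_coeff = 0.05

print(f"rate ratio: {rate_ratio(hypothetical_coeff):.3f}")
print(f"percent change in expected store purchases: {percent_change(hypothetical_coeff):.1f}%")
```

A negative coefficient gives a rate ratio below 1, meaning the expected count decreases as the predictor increases.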

In [None]:
fig, ax = plt.subplots(1,2, figsize = (10,3))
sns.barplot(x = data['Country'], y = data['total_spent'], ax = ax[0], estimator = np.mean, palette = 'Spectral',ci = None).set_title('Average Spent By Country')
sns.barplot(x = data['Country'], y = data['total_purchase'], ax = ax[1], estimator = np.mean, palette = 'Spectral', ci = None).set_title('Average Purchases By Country')
plt.show()

On average, the US did not do particularly better than other countries. Mexico is excluded because it has only 3 data points. The US had more purchases on average, but not by a significant amount, and average spending was roughly the same across countries. Although there are outliers in the data, the mean was used rather than the median, because it is plausible for some people to spend more than others on certain products due to demographics, preferences, etc.

In [None]:
# Copy the slice so that adding a column does not modify (or warn about) data_dummies.
store_purchases_based_on_gold = data_dummies[['NumStorePurchases','MntGoldProds']].copy()
store_purchases_based_on_gold['Abv_Avg_Gold'] = np.where(store_purchases_based_on_gold['MntGoldProds'] > np.mean(store_purchases_based_on_gold['MntGoldProds']),'Above AVG', 'Below or AVG')

fig, ax = plt.subplots(1,2,figsize = (10,5))

sns.scatterplot(x = 'MntGoldProds',
          y = 'NumStorePurchases',
          data = store_purchases_based_on_gold,
          hue = 'Abv_Avg_Gold',
          legend = True,
            ax = ax[0])

sns.boxplot(x = 'Abv_Avg_Gold', 
            y = 'NumStorePurchases',
            data = store_purchases_based_on_gold, 
            ax = ax[1])

plt.show()

We can see that people who spend an above-average amount on gold products tend to make more in-store purchases. We can confirm this with hypothesis testing.

In [None]:
# Checking equal variance.
# H0: Variances are equal.
# H1: Variances are not equal.
above_avg = store_purchases_based_on_gold[store_purchases_based_on_gold['Abv_Avg_Gold'] == 'Above AVG']['NumStorePurchases']
below_avg = store_purchases_based_on_gold[store_purchases_based_on_gold['Abv_Avg_Gold'] == 'Below or AVG']['NumStorePurchases']

ts, pval = stats.levene(above_avg, below_avg, center = 'median')

if pval < 0.05:
    print('Reject the null, variances between the two groups are not equal.')
else:
    print('Fail to reject the null, variances between the two groups are equal.')


In [None]:
# The distribution is non-normal, but the t-test is robust; normality is not a strict requirement.
# Right-tailed test.
# H0: On average, people who spend above or below average on gold have the same in-store purchases.
# H1: People who spent an above-average amount on gold have more in-store purchases.
ts, pval = stats.ttest_ind(above_avg, below_avg, equal_var = False)

# Halve the two-sided p-value for a one-sided test; the statistic must also be positive to support H1.
if ts > 0 and pval/2 < 0.05:
    print('Reject the null. People who spent an above average amount on gold have more in store purchases.')
else:
    print('Fail to reject the null. On average, people who spend above or below average on gold have the same in store purchases.')
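
For reference, the statistic behind an unequal-variance (Welch) t-test can be computed by hand. The sketch below uses only the standard library. `welch_t` and the two small samples are hypothetical illustrations, not the notebook's data. A right-tailed test only supports H1 when the statistic is positive, which is why the sign matters in addition to the p-value.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df

# Hypothetical store-purchase counts for the two spending groups.
above_demo = [2, 4, 6, 8, 10]
below_demo = [1, 2, 3, 4, 5]

t_demo, df_demo = welch_t(above_demo, below_demo)
print(f"t = {t_demo:.3f}, df = {df_demo:.2f}")
```

Swapping the two samples flips the sign of the statistic while leaving the degrees of freedom unchanged.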


In [None]:
# Checking linearity.
cols = ['Income', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntSweetProducts',
       'MntGoldProds','NumDealsPurchases', 'NumWebPurchases',
            'NumCatalogPurchases', 'NumStorePurchases','children_total']

fig, ax = plt.subplots(3,4, figsize = (20,15))
for i, col in enumerate(data_dummies[cols]):
    if col in ['children_total', 'NumStorePurchases', 'NumWebPurchases', 'NumDealsPurchases', 'NumCatalogPurchases']:
        sns.boxplot(x = data_dummies[col], y = data_dummies['MntFishProducts'],
                   showmeans = True,
                   meanprops = {'marker':'o', 'markerfacecolor':'red'},
                   ax = ax[i//4,i%4])
        sns.scatterplot(x = data_dummies[col], y = data_dummies['MntFishProducts'],
                   ax = ax[i//4,i%4])
    else:
        sns.scatterplot(x = data_dummies[col], y = data_dummies['MntFishProducts'],
                   ax = ax[i//4,i%4])
ax[2,3].set_visible(False)     

The amount-spent variables will need to be transformed, since their plots do not show a linear relationship with the amount spent on fish products. I suspect that catalog, deals, and web purchases have a quadratic relationship with it.

In [None]:

fig, ax = plt.subplots(3,4, figsize = (20,15))
for i, col in enumerate(data_dummies[cols]):
    if col in ['NumWebPurchases', 'NumCatalogPurchases', 'NumDealsPurchases']:
        sns.boxplot(x = data_dummies[col], y = np.log1p(data_dummies['MntFishProducts']),
                   showmeans = True,
                   meanprops = {'marker':'o', 'markerfacecolor':'red'},
                   ax = ax[i//4,i%4])
        sns.regplot(x = data_dummies[col], y = np.log1p(data_dummies['MntFishProducts']),
                   ci = False,
                   line_kws = {'color':'black'},
                   ax = ax[i//4,i%4],
                   order = 2,
                   scatter = False)
    elif col == 'NumStorePurchases' or col == 'children_total':
        sns.boxplot(x = data_dummies[col], y = np.log1p(data_dummies['MntFishProducts']),
                   showmeans = True,
                   meanprops = {'marker':'o', 'markerfacecolor':'red'},
                   ax = ax[i//4,i%4])
        sns.regplot(x = data_dummies[col], y = np.log1p(data_dummies['MntFishProducts']),
                   ci = False,
                   line_kws = {'color':'black'},
                   ax = ax[i//4,i%4],
                   scatter = False)
        
    else:
        sns.regplot(x = np.log1p(data_dummies[col]), y= np.log1p(data_dummies['MntFishProducts']),
                   ci = False,
                   line_kws = {'color':'black'},
                   ax = ax[i//4,i%4])
        
ax[2,3].set_visible(False)

In [None]:

# Log transforming and adding quadratic terms.
cont_cols = ['Income', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntSweetProducts',
       'MntGoldProds']


data_dummies1 = data_dummies.copy()

for col in cont_cols:
    data_dummies1[col] = np.log1p(data_dummies1[col])

# Centering technique to reduce multicollinearity given quadratic terms.
for col in cols:
    data_dummies1[col] = data_dummies1[col] - np.mean(data_dummies1[col])
data_dummies1['NumDealsPurchases^2'] = data_dummies1['NumDealsPurchases']**2
data_dummies1['NumWebPurchases^2'] = data_dummies1['NumWebPurchases']**2
data_dummies1['NumCatalogPurchases^2'] = data_dummies1['NumCatalogPurchases']**2

#interaction term
data_dummies1['Married_PhD'] = data_dummies1['Marital_Status_Married']*data_dummies1['Education_PhD']
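
To see why centering helps when quadratic terms are added, the sketch below compares the correlation between a variable and its square before and after subtracting the mean. It uses only the standard library. `pearson` and the example values are hypothetical and are not taken from the dataset. The example is symmetric around its mean, so the correlation drops to zero. With skewed data it would typically shrink rather than vanish.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical purchase counts 0..10.
raw = list(range(11))
raw_corr = pearson(raw, [x ** 2 for x in raw])

# Centre first, then square, as in the cell above.
mean_raw = sum(raw) / len(raw)
centered = [x - mean_raw for x in raw]
centered_corr = pearson(centered, [x ** 2 for x in centered])

print(f"corr(x, x^2) before centering: {raw_corr:.3f}")
print(f"corr(x, x^2) after centering:  {centered_corr:.3f}")
```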

In [None]:
x1 = data_dummies1.drop(['MntFishProducts','total_purchase','total_spent'], axis = 1)
y1 = data_dummies1['MntFishProducts']

In [None]:
ols_constant = sm.add_constant(x1)
lm = sm.OLS(y1,ols_constant).fit()

In [None]:
fig, ax = plt.subplots(1,3, figsize = (20,5))
sns.histplot(lm.resid, ax = ax[0]).set_title('Residual Histogram')
sm.qqplot(lm.resid,line = 'r', ax = ax[1])
sns.residplot(x = lm.fittedvalues, y = lm.resid, ax = ax[2]).set_title('Residuals VS Predicted')
plt.show()

In [None]:
dw = durbin_watson(lm.resid)
_,jbpval,_,_ =  jarque_bera(lm.resid)
_,hppval,_,_ = het_breuschpagan(lm.resid, lm.model.exog)

# Durbin-Watson ranges from 0 to 4; values near 2 indicate no first-order autocorrelation.
if 1.5 < dw < 2.5:
    print('No autocorrelation.')
else:
    print('Autocorrelation is present.')
if jbpval >= 0.05:
    print('Residuals are normal.')
elif round(np.mean(lm.resid)) == 0:
    print('Residuals are not completely normal, but mean of residuals is approximately zero.')
else:
    print('Residuals are not normal.')
if hppval < 0.05:
    print('Variances are not equal.')
else:
    print('Variances are equal.')
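
As a reminder of what the Durbin-Watson check measures, the statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below computes it by hand. `durbin_watson_stat` and the residual lists are hypothetical. To my understanding, the statsmodels `durbin_watson` function used above computes the same ratio. Values near 0 suggest positive autocorrelation and values near 4 suggest negative autocorrelation.

```python
def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(r ** 2 for r in residuals)
    return numerator / denominator

# Hypothetical residual sequences at the two extremes.
alternating = [1, -1, 1, -1]   # each residual flips sign: negative autocorrelation
persistent = [1, 1, 1, 1]      # each residual repeats: positive autocorrelation

dw_alternating = durbin_watson_stat(alternating)
dw_persistent = durbin_watson_stat(persistent)
print(dw_alternating, dw_persistent)
```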

In [None]:
vif_data = pd.DataFrame()
vif_data["feature"] = x1.columns
  
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(x1.values, i)
                          for i in range(len(x1.columns))]

#VIF over 10 is problematic
vif_data[vif_data['VIF'] > 10]

In [None]:

#VIF over 5
vif_data[vif_data['VIF'] >= 5]

There are no variables with VIF over 10, but there is some multicollinearity present. The OLS model assumes little to no multicollinearity.

The OLS model has difficulty meeting its assumptions. The residuals are not completely normal, but their mean is approximately zero, which we treat as good enough for the normality-of-residuals check. Linearity has been achieved by log transforming and adding quadratic terms. The variables with higher-order terms are assumed to have a quadratic relationship. The data shows no autocorrelation, which indicates independence. Heteroskedasticity is present despite the transformations. Quantile regression at 0.5 (the median) will therefore be used instead of OLS, because it makes no distributional assumptions about the residuals and is robust to outliers. For more information on quantile regression, see http://people.ku.edu/~chkim/soc910/note/Soc910_Note_08_Qreg.pdf
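
The robustness claim can be illustrated with the loss that quantile regression minimises, often called the pinball loss. At quantile 0.5 it is half the mean absolute error, and the constant that minimises it is the median. The sketch below is a standard-library illustration. `pinball_loss`, `best_constant` and the spending values are hypothetical and are not part of the notebook.

```python
import statistics

def pinball_loss(y_true, prediction, q=0.5):
    """Average quantile (pinball) loss of a constant prediction at quantile q."""
    total = 0.0
    for y in y_true:
        error = y - prediction
        # Under-predictions are weighted by q, over-predictions by (1 - q).
        total += q * error if error >= 0 else (q - 1.0) * error
    return total / len(y_true)

def best_constant(y_true, q=0.5):
    """Pick the observed value that minimises the pinball loss."""
    return min(y_true, key=lambda candidate: pinball_loss(y_true, candidate, q))

# Hypothetical amounts spent, with one large outlier.
spend = [1, 2, 3, 4, 500]

median_fit = best_constant(spend, q=0.5)
print("pinball minimiser at q = 0.5:", median_fit)
print("median:", statistics.median(spend))
print("mean:", statistics.mean(spend))
```

The outlier pulls the mean far from the bulk of the data, while the minimiser of the 0.5 pinball loss stays at the median.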

In [None]:
quant_constant = sm.add_constant(x1)
quant_model = sm.QuantReg(y1,quant_constant).fit()
quant_df = pd.DataFrame()
quant_df['coeffs'] = quant_model.params
quant_df['p-values'] = round(quant_model.pvalues,2)
quant_df[quant_df.index == 'Married_PhD']

Being married with a PhD does not have a significant relationship with the amount spent on fish products. The variables that are significant are the following:

In [None]:
quant_df[(quant_df['p-values'] < 0.05) & (quant_df.index != 'const')]

In [None]:
total_cmp = data_dummies['AcceptedCmp1'] + data_dummies['AcceptedCmp2'] + data_dummies['AcceptedCmp3']+ data_dummies['AcceptedCmp4'] + data_dummies['AcceptedCmp5']
total_cmp.value_counts()

Some customers accepted multiple campaigns and some did not accept any. To determine whether geographical region has a significant relationship with the success or failure of a campaign, we first need to create a new variable indicating whether a customer accepted at least one campaign (regardless of the number of campaigns).

In [None]:
data_dummies2 = data_dummies.copy()
data_dummies2['AcceptedCmp'] = np.where(total_cmp > 0,1,0)
data_dummies2.drop(['AcceptedCmp1','AcceptedCmp2',
                   'AcceptedCmp3','AcceptedCmp4',
                   'AcceptedCmp5'], axis = 1, inplace = True)
data_dummies2.reset_index(inplace = True, drop = True)

In [None]:
cmp_x = data_dummies2.drop(['AcceptedCmp','total_purchase','total_spent'], axis = 1)
cmp_y = data_dummies2['AcceptedCmp']
cmp_x_const = sm.add_constant(cmp_x)
logit_m = sm.Logit(cmp_y,cmp_x_const).fit()

In [None]:
# Checking linearity assumption of logistic regression.
cols = ['Income', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
       'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
       'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
       'children_total']
fig, ax = plt.subplots(3,4,figsize = (20,15))
for i, col in enumerate(cmp_x[cols]):
    sns.scatterplot(x = cmp_x[col], y = cmp_x[col]*logit_m.params[col],
                   ax = ax[i//4,i%4])
    ax[i//4,i%4].set(ylabel = 'Log Odds')

In [None]:
vif_data = pd.DataFrame()
vif_data["feature"] = cmp_x.columns
  
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(cmp_x.values, i)
                          for i in range(len(cmp_x.columns))]

#VIF over 10 is problematic
vif_data[vif_data['VIF'] > 10]

The plots above show a linear relationship between each independent variable and its log odds. Some multicollinearity is present: the `Income` variable has a VIF over 10, so we will drop it. The linear regression diagnostics earlier indicated that the data is not autocorrelated.
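
Because each plot above multiplies a predictor by its own fitted coefficient, the points fall on a straight line by construction. A complementary check, sketched below under stated assumptions, is to bin a predictor, compute the empirical log odds of acceptance within each bin, and see whether they change roughly linearly with the bin mean. `empirical_log_odds` and the demo data are hypothetical. The 0.5 added to each count is a common continuity correction that avoids taking the log of zero.

```python
import math

def empirical_log_odds(x_values, outcomes, n_bins=4):
    """Return (mean of x, empirical log odds) for equal-sized bins sorted by x."""
    pairs = sorted(zip(x_values, outcomes))
    bin_size = len(pairs) // n_bins
    results = []
    for b in range(n_bins):
        start = b * bin_size
        # The last bin absorbs any leftover observations.
        end = len(pairs) if b == n_bins - 1 else start + bin_size
        chunk = pairs[start:end]
        successes = sum(outcome for _, outcome in chunk)
        failures = len(chunk) - successes
        # Adding 0.5 to each count avoids log(0) when a bin has no successes or no failures.
        log_odds = math.log((successes + 0.5) / (failures + 0.5))
        mean_x = sum(x for x, _ in chunk) / len(chunk)
        results.append((mean_x, log_odds))
    return results

# Hypothetical data: acceptance becomes more common as x grows.
x_demo = list(range(1, 21))
y_demo = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1]

binned = empirical_log_odds(x_demo, y_demo)
for mean_x, log_odds in binned:
    print(f"mean x = {mean_x:5.1f}, empirical log odds = {log_odds:+.3f}")
```

Plotting the bin means against the empirical log odds for each continuous predictor would give a view of linearity that does not depend on the fitted coefficients.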

In [None]:
cmp_x = data_dummies2.drop(['AcceptedCmp','total_purchase','total_spent','Income'], axis = 1)
cmp_y = data_dummies2['AcceptedCmp']
cmp_x_const = sm.add_constant(cmp_x)
logit_m = sm.Logit(cmp_y,cmp_x_const).fit()
logit_df = pd.DataFrame()
logit_df['coeffs'] = logit_m.params
logit_df['p-values'] = round(logit_m.pvalues,2)


In [None]:
logit_df[logit_df.index.str.contains('Country')]

Overall, region does not have a significant relationship with the success or failure of a campaign. One country, Spain, is significant, but this is most likely because a large share of the customers in this dataset are from Spain.

# Data Visualization


In [None]:
data_melted = pd.melt(data_dummies,
                     value_vars = ['AcceptedCmp1','AcceptedCmp2',
                                    'AcceptedCmp3','AcceptedCmp4',
                                      'AcceptedCmp5'],
                     var_name = 'cmp',
                     value_name = 'success',
                     ignore_index = True)

cmp_success = data_melted[data_melted['success'] == 1]


barplot = sns.barplot(y = 'cmp',
                     x = 'success',
                     data = cmp_success,
                     ci = None,
                     estimator = np.sum,
                     palette = 'Spectral')
    
plt.title('Campaign Success')
plt.ylabel('Campaign')
plt.xlabel('# of Success')

barplot.bar_label(barplot.containers[0])
plt.show()

Campaign 4 performed best, although not by an overwhelming margin compared to campaigns 1, 3 and 5.

In [None]:
data_melted2 = pd.melt(data_dummies,
                      value_vars = ['MntWines', 'MntFruits',
                                   'MntMeatProducts', 'MntFishProducts', 
                                    'MntSweetProducts', 'MntGoldProds'],
                      value_name = 'amount',
                      var_name = 'products',
                      ignore_index = True)

plt.figure(figsize = (10,5))
barplot1 = sns.barplot(y = 'products',
                     x = 'amount',
                     data = data_melted2,
                     ci = None,
                     estimator = np.sum,
                     palette = 'Spectral')
    
plt.title('Total Spent By Product')
plt.ylabel('Products')
plt.xlabel('$')

barplot1.bar_label(barplot1.containers[0])
plt.show()


Wine and meat products tend to do the best in sales.

In [None]:
data_melted3 = pd.melt(data_dummies,
                      value_vars = ['NumWebPurchases','NumCatalogPurchases',
                                   'NumStorePurchases',
                                   'NumDealsPurchases'],
                      var_name = 'channel(s)',
                      value_name = 'purchases',
                      ignore_index = True)

barplot2 = sns.barplot(y = 'channel(s)',
                     x = 'purchases',
                     data = data_melted3,
                     ci = None,
                     estimator = np.sum,
                     palette = 'Spectral')


barplot2.bar_label(barplot2.containers[0])
plt.show()


# Conclusion


From our analysis, the campaigns did not have a significant relationship with the number of store purchases. Campaign 3 was significant, but it was associated with a decrease in the number of store purchases. Of 2333 customers in total, 1772 did not accept any campaign. The deals and web channels have a significant relationship with the number of store purchases. We should drop the campaigns, as they have no effect on the store channel, and instead market through the deals and web channels. To increase sales of underperforming products such as fish, gold, sweets and fruits, we should create deals that pair them with wine or meat and promote them on the web, which in turn should promote in-store purchasing.