## Introduction
Given a dataset with the following information, we will perform a marketing 
analysis on the XYZ company. Below are the following variables/features in 
the dataset:  

- ID _(int)_: customer's unique identifier
- Year_Birth _(int)_: customer's birth year
- Education _(str)_: customer's education level
- Marital_Status _(str)_: marital status (divorced, single, window, married 
etc)
- Income _(str)_: yearly household income
- Kidhome _(int)_: number of children in household
- Teenhome _(int)_: number of teenagers in household
- Dt_Customer _(datetime)_: date of enrollment
- Recency _(int)_: number of days since last purchase
- MntWines _(int)_: amount spent on wine in the last 2 years
- MntFruits _(int)_: amount spent on fruits in the last 2 years
- MntMeatProducts _(int)_: amount spent on meat in the last 2 years
- MntFishProducts _(int)_: amount spent on fish in the last 2 years
- MntSweetProducts _(int)_: amount spent on sweets in the last 2 years
- MntGoldProds _(int)_: amount spent on gold in the last 2 years
- NumDealsPurchases _(int)_: number of purchases made with a discount
- NumWebPurchases _(int)_: number of purchases made through web site
- NumCatalogPurchases _(int)_: number of purchases made using a catalogue
- NumStorePurchases _(int)_: number of purchases made directly in stores
- NumWebVisitsMonth _(int)_: number of visits to web site in the last month
- AcceptedCmp1 _(int)_: 1 if customer accepted the offer in the 1st campaign, 0 
otherwise
- AcceptedCmp2 _(int)_: 1 if customer accepted the offer in the 2nd campaign, 0 
otherwise
- AcceptedCmp3 _(int)_: 1 if customer accepted the offer in the 3rd campaign, 0 
otherwise
- AcceptedCmp4 _(int)_: 1 if customer accepted the offer in the 4th campaign, 0 
otherwise
- AcceptedCmp5 _(int)_: 1 if customer accepted the offer in the 5th campaign, 0 
otherwise
- Response _(int)_: 1 if customer accepted the offer in the last campaign, 0 otherwise
- Complain _(int)_: 1 if customer complained in the last 2 years, 0 otherwise
- Country _(str)_: location of customer

In [3]:
# Prior to analysis, we will clean the data and load modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", 1000)

df = pd.read_csv("marketing_data.csv", sep=",")

# View data
print(df.head(1))

# View data-types
print(df.dtypes)

# Convert categorical data
df.columns = df.columns.str.strip()
df["Education"] = df["Education"].replace(["2n Cycle"], "2n_Cycle")
df["Education"] = df["Education"].astype("category")
df["Marital_Status"] = df["Marital_Status"].astype("category")
df["Country"] = df["Country"].astype("object")
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"])

# Encode categorical predictor variables
categorical_columns = ["Education", "Marital_Status"]
for cc in categorical_columns:
    dummies = pd.get_dummies(df[cc])
    dummies = dummies.add_prefix("{}#".format(cc))
    df = df.join(dummies)

# Convert income to int
df["Income"] = df["Income"].replace({"\$": "", ",": ""}, regex=True)
df["Income"] = df["Income"].astype("float")

# Enrollment date
df["Dt_Year"] = df["Dt_Customer"].dt.year
df["Dt_Month"] = df["Dt_Customer"].dt.month
df["Dt_Day"] = df["Dt_Customer"].dt.month

# View updated dataset
print(df.head(1))

     ID  Year_Birth   Education Marital_Status      Income   Kidhome  Teenhome Dt_Customer  Recency  MntWines  MntFruits  MntMeatProducts  MntFishProducts  MntSweetProducts  MntGoldProds  NumDealsPurchases  NumWebPurchases  NumCatalogPurchases  NumStorePurchases  NumWebVisitsMonth  AcceptedCmp3  AcceptedCmp4  AcceptedCmp5  AcceptedCmp1  AcceptedCmp2  Response  Complain Country
0  1826        1970  Graduation       Divorced  $84,835.00         0         0     6/16/14        0       189        104              379              111               189           218                  1                4                    4                  6                  1             0             0             0             0             0         1         0      SP
ID                      int64
Year_Birth              int64
Education              object
Marital_Status         object
 Income                object
Kidhome                 int64
Teenhome                int64
Dt_Customer            object


## A/B Testing
This section performs A/B testing on the amount of wine purchased and Income
<br> ----<br>
Hypothesis <br>
H0: A consumer's income has no impact on money spent on wine.
<br>
H1: When a consumer's income is increased from an average of $45,488 to \$70,515, they will
purchase more wine.

Split data into 3 groups based on the MntWine column.

In [20]:
from math import sqrt
df_wine = df["MntWines"]
first_third = df_wine.max() * 0.33
second_third = df_wine.max() * 0.66
max_wine = df_wine.max()
min_wine = df_wine.min()
df_first = df[df['MntWines'] < first_third]
df_first = df_first['Income']
df_second = df[(df['MntWines'] > first_third) & (df['MntWines'] < second_third)]
df_second = df_second['Income']
df_third = df[df['MntWines'] > second_third]
df_third = df_third['Income']

Calculate the Cohen d value. The result of this is how many sample we need to calculate the p value.

In [21]:
cohen = ((df_first.mean() - second_third.mean()) / (sqrt((df_first.std() ** 2 + df_second.std() **2) / 2)))
from statsmodels.stats.power import TTestIndPower
effect = cohen # Obtained from previous step.
alpha = 0.05  # Enable 95% confidence for two tail test.
power = 0.95  # One minus the probability of a type II error.
# Limits possibility of type II error to 20%.
analysis = TTestIndPower()
numSamplesNeeded = analysis.solve_power(effect, power=power, alpha=alpha)
print(numSamplesNeeded)

6.0865228316807505


Calculate P value with the required sample size (cohen d value)

In [25]:

from scipy import stats
old_menu_list = []
new_menu_list = []
old_menu_sales = [list(df_first.sample(n=6))]
for i in (old_menu_sales[0]):
    old_menu_list.append(i)
new_menu_sales = [df_third.sample(n=6).tolist()]
for i in new_menu_sales[0]:
    new_menu_list.append(i)

testResult = stats.ttest_ind(new_menu_list, old_menu_list, equal_var=False)

import numpy as np
print("Hypothesis test p-value: " + str(testResult))
print("New mean: " + str(np.mean(new_menu_sales)))
print("New std: " + str(np.std(new_menu_sales)))

Hypothesis test p-value: Ttest_indResult(statistic=5.103145340348087, pvalue=0.0011907866689955069)
New sales mean: 77763.83333333333
New sales std: 7179.426658089689


With the p value being less than 0.05, we can reject the null hypothesis which says wine
purchased is not related to consumer's income.