# AB Testing Marketing Campaign

The three questions we want to answer:

Q 1: Can the promotion be identified as a significant factor influencing the sales of the new menu item?

Q 2: Is there a notable contrast in sales among the three distinct promotions that were examined?

Q 3: Do sales exhibit a considerable variance between stores of varying ages?

In [17]:
import pandas as pd
import numpy as np
import seaborn as sns
import itertools
from pingouin import pairwise_ttests
from scipy.stats import shapiro, normaltest, anderson, mannwhitneyu, kruskal, pearsonr

In [18]:
def description_data(df):
    """Summarize each column: dtype, null count and percentage, cardinality, sample values."""
    df_describe = pd.DataFrame({
        'dataFeatures': df.columns,
        'dataType': df.dtypes.values,
        'null': [df[i].isna().sum() for i in df.columns],
        'nullPct': [round(df[i].isna().mean() * 100, 2) for i in df.columns],
        'Nunique': [df[i].nunique() for i in df.columns],
        # Sample at most two distinct values so single-valued columns do not raise
        'uniqueSample': [list(pd.Series(df[i].unique()).sample(min(2, len(df[i].unique()))))
                         for i in df.columns],
    }).reset_index(drop=True)
    return df_describe

In [19]:
file_path_csv = "data/WA_Marketing-Campaign.csv"
df = pd.read_csv(file_path_csv)

In [20]:
df.head()

Unnamed: 0,MarketID,MarketSize,LocationID,AgeOfStore,Promotion,week,SalesInThousands
0,1,Medium,1,4,3,1,33.73
1,1,Medium,1,4,3,2,35.67
2,1,Medium,1,4,3,3,29.03
3,1,Medium,1,4,3,4,39.25
4,1,Medium,2,5,2,1,27.81


In [21]:
description_data(df)

Unnamed: 0,dataFeatures,dataType,null,nullPct,Nunique,uniqueSample
0,MarketID,int64,0,0.0,10,"[1, 2]"
1,MarketSize,object,0,0.0,3,"[Medium, Large]"
2,LocationID,int64,0,0.0,137,"[907, 403]"
3,AgeOfStore,int64,0,0.0,25,"[12, 6]"
4,Promotion,int64,0,0.0,3,"[2, 3]"
5,week,int64,0,0.0,4,"[3, 4]"
6,SalesInThousands,float64,0,0.0,517,"[51.26, 93.63]"


# Normality Test

$H_0$: the data has a Normal distribution.

$H_1$: the data does not have a Normal distribution.

In [22]:
promotion_1 = df[df["Promotion"] == 1]["SalesInThousands"]
promotion_2 = df[df["Promotion"] == 2]["SalesInThousands"]
promotion_3 = df[df["Promotion"] == 3]["SalesInThousands"]

In [23]:
def try_normal(data):
    # Run Anderson-Darling once and reuse the result
    ad = anderson(data)
    result = {
        # Map each significance level (in percent) to its critical value
        'Anderson': dict(zip(ad.significance_level, ad.critical_values)),
        'Shapiro (P-value)': shapiro(data)[1],
        'K^2 (P-value)': normaltest(data)[1],
    }
    result['Anderson']['Test statistic'] = ad.statistic
    return result

In [24]:
try_normal(promotion_1)

{'Anderson': {15.0: 0.563,
  10.0: 0.642,
  5.0: 0.77,
  2.5: 0.898,
  1.0: 1.068,
  'Test statistic': 4.811285489370562},
 'Shapiro (P-value)': 1.977244323825289e-08,
 'K^2 (P-value)': 0.0001272552757774151}

In [25]:
try_normal(promotion_2)

{'Anderson': {15.0: 0.564,
  10.0: 0.643,
  5.0: 0.771,
  2.5: 0.899,
  1.0: 1.07,
  'Test statistic': 5.56041600226439},
 'Shapiro (P-value)': 5.456262108793908e-09,
 'K^2 (P-value)': 4.348396213594891e-06}

In [26]:
try_normal(promotion_3)

{'Anderson': {15.0: 0.564,
  10.0: 0.643,
  5.0: 0.771,
  2.5: 0.899,
  1.0: 1.07,
  'Test statistic': 6.11407995911344},
 'Shapiro (P-value)': 1.499518376135711e-08,
 'K^2 (P-value)': 0.0003212129387460442}
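
Read against these outputs, every Shapiro-Wilk and K² p-value is far below 0.05, and each Anderson-Darling statistic exceeds its 1% critical value, so normality is rejected for all three promotions. This is what motivates the non-parametric tests used in the following sections. A minimal sketch of turning the p-values into an explicit decision is shown here; the helper name `rejects_normality` and the 0.05 threshold are assumptions, not part of the original notebook.

```python
import numpy as np
from scipy.stats import shapiro, normaltest

def rejects_normality(data, alpha=0.05):
    """Return True if either test rejects H0 (normality) at level alpha.

    Hypothetical helper; alpha=0.05 is an assumed threshold.
    """
    shapiro_p = shapiro(data)[1]
    k2_p = normaltest(data)[1]
    return bool(shapiro_p < alpha or k2_p < alpha)

# Illustrative check on a strongly right-skewed, deterministic sample
skewed = np.exp(np.linspace(0, 5, 200))
print(rejects_normality(skewed))
```

In the notebook the same helper could be called as `rejects_normality(promotion_1)`, and likewise for the other two groups.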

# Can the promotion be identified as a significant factor influencing the sales of the new menu item?

$H_0$: There is no significant difference in the sales of the new menu item between different promotions.

$H_1$: There is a significant difference in the sales of the new menu item between different promotions.

In [27]:
kruskal(promotion_1, promotion_2, promotion_3)

KruskalResult(statistic=53.29475169322799, pvalue=2.6741866266697816e-12)

In [28]:
pairwise_ttests(data = df, dv = 'SalesInThousands', between = 'Promotion', parametric = False)

Unnamed: 0,Contrast,A,B,Paired,Parametric,U-val,Tail,p-unc,hedges
0,Promotion,1,2,False,False,22957.5,two-sided,5.845935e-12,0.679522
1,Promotion,1,3,False,False,18247.0,two-sided,0.0350841,0.163744
2,Promotion,2,3,False,False,12093.0,two-sided,1.197008e-07,-0.502467


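Null hypothesis rejected. The Kruskal-Wallis test gives H ≈ 53.29 with p ≈ 2.7e-12, far below any conventional significance level, so the promotion has a significant impact on the sales of the new menu item.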
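
The Kruskal-Wallis test indicates that at least one promotion differs, but not which one sells more. One quick way to see the direction is to compare per-promotion medians. The sketch below assumes the notebook's column names; the function name `sales_by_promotion` is hypothetical, and the small frame `toy` contains made-up values for illustration only, not the real data.

```python
import pandas as pd

def sales_by_promotion(df):
    """Median, mean, and row count of sales per promotion, highest median first."""
    return (
        df.groupby("Promotion")["SalesInThousands"]
          .agg(["median", "mean", "count"])
          .sort_values("median", ascending=False)
    )

# Made-up values for illustration only
toy = pd.DataFrame({
    "Promotion": [1, 1, 2, 2, 3, 3],
    "SalesInThousands": [50.0, 60.0, 30.0, 40.0, 45.0, 55.0],
})
summary = sales_by_promotion(toy)
print(summary)
```

Applied to the notebook's `df`, the same call would show which promotion has the highest typical sales.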

# Is there a notable contrast in sales among the three distinct promotions that were examined?

$H_0$: For each pair of promotions, sales of the new menu item follow the same distribution.

$H_1$: For at least one pair of promotions, the sales distributions differ.

In [29]:
list_of_promotions = [promotion_1, promotion_2, promotion_3]
name_of_promotion = ['promotion 1', 'promotion 2', 'promotion 3']

lst_of_df = []

for (name_pr_1, pr_1), (name_pr_2, pr_2) in itertools.combinations(zip(name_of_promotion, list_of_promotions), 2):
    mannwhitneyu_p_val = mannwhitneyu(pr_1, pr_2)[1]
    lst_of_df.append([name_pr_1, name_pr_2, mannwhitneyu_p_val])

df_mannwhitneyu = pd.DataFrame(lst_of_df, columns=["group1", "group2", "p_val"])
df_mannwhitneyu

Unnamed: 0,group1,group2,p_val
0,promotion 1,promotion 2,2.922968e-12
1,promotion 1,promotion 3,0.01754205
2,promotion 2,promotion 3,5.985042e-08


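Null hypothesis rejected for every pair at the conventional 0.05 level. Promotion 2 differs strongly from promotions 1 and 3, while the difference between promotions 1 and 3 is much weaker (p ≈ 0.018). These p-values are exactly half the two-sided `p-unc` values reported by `pairwise_ttests` above, which is consistent with a one-sided test, and they are not corrected for multiple comparisons. Under a Bonferroni threshold of 0.05 / 3 ≈ 0.0167, the promotion 1 versus promotion 3 comparison would no longer be significant.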
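
Because three pairwise tests are run, the p-values are usually adjusted for multiple comparisons before drawing conclusions. The sketch below implements the Bonferroni and Holm adjustments in plain Python and applies them to the two-sided `p-unc` values from the `pairwise_ttests` table; the function names are hypothetical, and the 0.05 threshold mentioned afterwards is an assumption.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Two-sided p-unc values from the table: (1 vs 2), (1 vs 3), (2 vs 3)
p_unc = [5.845935e-12, 0.0350841, 1.197008e-07]
print(bonferroni(p_unc))
print(holm(p_unc))
```

With these inputs the promotion 1 versus promotion 3 comparison rises to about 0.105 under Bonferroni but stays at about 0.035 under Holm, so the conclusion for that pair depends on the correction chosen. The other two comparisons remain far below 0.05 either way.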

# Do sales exhibit a considerable variance between stores of varying ages?

In [40]:
import statsmodels.api as sm

model = sm.OLS( df["AgeOfStore"], df["Promotion"])
results = model.fit()

In [41]:
results.tvalues

Promotion    26.196811
dtype: float64

In [43]:
results.pvalues

Promotion    1.242358e-98
dtype: float64

## Correlation

In [46]:
correlation, p_value = pearsonr(df["AgeOfStore"], df["Promotion"])

print(f'correlation: {correlation}')
print(f'p-value: {p_value}')

correlation: 0.05976484020286552
p-value: 0.16237945528693515


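These cells relate `AgeOfStore` to `Promotion`, not to `SalesInThousands`, so they do not test the question in the heading. The OLS model is fitted without an intercept, so its large t-value (26.2) largely reflects that store age has a nonzero mean rather than any dependence on promotion. The Pearson correlation between store age and promotion is weak and not significant (r ≈ 0.06, p ≈ 0.16), which suggests that store age is roughly balanced across the promotions. Whether sales differ between stores of different ages is therefore not established by this analysis.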
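
To address the question in the heading directly, sales rather than promotion would need to be related to store age. Because the sales data failed the normality tests, a rank-based measure such as Spearman's correlation is one reasonable choice. The sketch below is an assumption about how this could be done, not the notebook's method; the function name `age_sales_association` is hypothetical, and the frame `toy` holds made-up values, so no result for the real data is implied.

```python
import pandas as pd
from scipy.stats import spearmanr

def age_sales_association(df):
    """Spearman rank correlation between store age and sales."""
    rho, p_value = spearmanr(df["AgeOfStore"], df["SalesInThousands"])
    return rho, p_value

# Made-up values for illustration only: sales fall steadily as age rises
toy = pd.DataFrame({
    "AgeOfStore": [1, 3, 5, 8, 12, 20],
    "SalesInThousands": [62.0, 58.5, 55.1, 49.9, 47.3, 40.2],
})
rho, p_value = age_sales_association(toy)
print(f"rho: {rho:.3f}, p-value: {p_value:.4f}")
```

In the notebook this would be called as `age_sales_association(df)`. Each store appears in several weekly rows, so aggregating sales per `LocationID` first may be preferable to treating the rows as independent.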