<a href="https://colab.research.google.com/github/wahyunh10/Advertising-AB-Testing/blob/main/Advertising_A_B_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Advertising A/B Testing**

The dataset, which you can find [here](https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing), comes from an advertising agency that wants to test whether showing a different type of ad affects the response rate of a shared questionnaire. A/B testing is used to answer this question: a hypothesis is tested and a result is reported. This notebook also includes some exploratory data analysis, including a look at trends over time.

**Data Preprocessing**

Import the CSV file as a pandas DataFrame and examine it.

In [None]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats.api as sms
import scipy.stats
from math import ceil

A company wants to test whether showing a creative, interactive ad increases the response rate of the questionnaire it shares.

In [None]:
#Import the CSV file and examine it
df = pd.read_csv('AdSmartABdata - AdSmartABdata.csv')
df.head()

Unnamed: 0,auction_id,experiment,date,hour,device_make,platform_os,browser,yes,no
0,0008ef63-77a7-448b-bd1e-075f42c55e39,exposed,2020-07-10,8,Generic Smartphone,6,Chrome Mobile,0,0
1,000eabc5-17ce-4137-8efe-44734d914446,exposed,2020-07-07,10,Generic Smartphone,6,Chrome Mobile,0,0
2,0016d14a-ae18-4a02-a204-6ba53b52f2ed,exposed,2020-07-05,2,E5823,6,Chrome Mobile WebView,0,1
3,00187412-2932-4542-a8ef-3633901c98d9,control,2020-07-03,15,Samsung SM-A705FN,6,Facebook,0,0
4,001a7785-d3fe-4e11-a344-c8735acacc2c,control,2020-07-03,15,Generic Smartphone,6,Chrome Mobile,0,0


In [None]:
df.shape

(8077, 9)

**Cleaning the Dataset:**

Deal with missing values, convert the date column to the proper data type, and check for duplicates.

In [None]:
#Investigate missing value
df.isna().sum()

auction_id     0
experiment     0
date           0
hour           0
device_make    0
platform_os    0
browser        0
yes            0
no             0
dtype: int64

In [None]:
df.dtypes

auction_id     object
experiment     object
date           object
hour            int64
device_make    object
platform_os     int64
browser        object
yes             int64
no              int64
dtype: object

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8077 entries, 0 to 8076
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   auction_id   8077 non-null   object
 1   experiment   8077 non-null   object
 2   date         8077 non-null   object
 3   hour         8077 non-null   int64 
 4   device_make  8077 non-null   object
 5   platform_os  8077 non-null   int64 
 6   browser      8077 non-null   object
 7   yes          8077 non-null   int64 
 8   no           8077 non-null   int64 
dtypes: int64(4), object(5)
memory usage: 568.0+ KB


In [None]:
# change date datatype
df['date']=pd.to_datetime(df['date'])

In [None]:
# check if there's any duplicate record
df.duplicated().sum()

0

In [None]:
#final DataFrame
df.head()

Unnamed: 0,auction_id,experiment,date,hour,device_make,platform_os,browser,yes,no
0,0008ef63-77a7-448b-bd1e-075f42c55e39,exposed,2020-07-10,8,Generic Smartphone,6,Chrome Mobile,0,0
1,000eabc5-17ce-4137-8efe-44734d914446,exposed,2020-07-07,10,Generic Smartphone,6,Chrome Mobile,0,0
2,0016d14a-ae18-4a02-a204-6ba53b52f2ed,exposed,2020-07-05,2,E5823,6,Chrome Mobile WebView,0,1
3,00187412-2932-4542-a8ef-3633901c98d9,control,2020-07-03,15,Samsung SM-A705FN,6,Facebook,0,0
4,001a7785-d3fe-4e11-a344-c8735acacc2c,control,2020-07-03,15,Generic Smartphone,6,Chrome Mobile,0,0


**Exploratory Data Analysis**

In [None]:
df['experiment'].value_counts()/len(df)*100

control    50.402377
exposed    49.597623
Name: experiment, dtype: float64

The control group is slightly larger than the exposed group, but the gap is small (about 0.8 percentage points).

In [None]:
df_control_subset=df[df['experiment']=='control']
df_exposed_subset=df[df['experiment']=='exposed']

In [None]:
df_control_subset[['device_make', 'browser']].describe().transpose()

In the control group, the most common device is Generic Smartphone and the most common browser is Chrome Mobile.

In [None]:
df_exposed_subset[['device_make', 'browser']].describe().transpose()

The same holds for the exposed group.

In [None]:
df.groupby(by='experiment', as_index=False)[['yes', 'no']].sum()

In the control group, 264 people responded yes and 322 responded no. In the exposed group, 308 responded yes and 349 responded no. The rest did not respond to the questionnaire.

In [None]:
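# count rows where both yes and no are 0; each comparison needs its own parentheses
len(df[(df['yes']==0) & (df['no']==0)])

To be exact, 6834 people (8077 rows minus the 1243 who answered yes or no) did not respond to the questionnaire across both experiment groups.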
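
When combining conditions like this, operator precedence matters: in Python `&` binds more tightly than `==`, so `(a == 0) & b == 0` is parsed as `((a == 0) & b) == 0`. The following minimal sketch uses plain Python integers and three made-up (yes, no) rows, not the real dataset, to show how the two forms can give different counts. The same parsing rule applies to pandas boolean masks.

```python
# Toy (yes, no) pairs, invented for illustration only
rows = [(0, 0), (1, 0), (0, 1)]

# Without parentheses: parsed as ((yes == 0) & no) == 0
unparenthesized = sum(1 for yes, no in rows if (yes == 0) & no == 0)

# With parentheses: true only when both answers are 0
parenthesized = sum(1 for yes, no in rows if (yes == 0) & (no == 0))

print(unparenthesized, parenthesized)  # 2 1
```

Only the parenthesized form counts rows where neither answer was given.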

**Exploratory Data Analysis : Identifying Trends**

In [None]:
# break out the value of yes column by time and browser used
by_browser=pd.pivot_table(df,
                               values=['yes'], 
                               index=['date'],
                               columns=['browser'],
                               aggfunc='sum',
                               fill_value=0)
by_browser=by_browser.reset_index()
by_browser

In [None]:
by_browser.plot(x='date', y='yes', figsize=(8,8))
plt.legend(loc=1)
plt.title('Respondent Who Said Yes : in Date')
plt.show()

As seen earlier, respondents who answered yes mostly used Chrome Mobile. Their number peaked on 8 July 2020.

In [None]:
# break out the value of yes column by hour and browser used
by_browser1=pd.pivot_table(df,
                               values=['yes'], 
                               index=['hour'],
                               columns=['browser'],
                               aggfunc='sum',
                               fill_value=0)
by_browser1=by_browser1.reset_index()
by_browser1

In [None]:
by_browser1.plot(x='hour', y='yes', figsize=(8,8))
plt.legend(loc=1)
plt.title('Respondent Who Said Yes : in Hours')
plt.show()

For several browsers, yes responses peak at 15:00.

In [None]:
# break out the value of no column by date and browser used
by_browserno=pd.pivot_table(df,
                               values=['no'], 
                               index=['date'],
                               columns=['browser'],
                               aggfunc='sum',
                               fill_value=0)
by_browserno=by_browserno.reset_index()
by_browserno

In [None]:
by_browserno.plot(x='date', y='no', figsize=(8,8))
plt.legend(loc=1)
plt.title('Respondent Who Said No : in Date')
plt.show()

Respondents who said no also mostly used Chrome Mobile, with a peak on 9 July 2020.

In [None]:
# break out the value of no column by hour and browser used
by_browserno1=pd.pivot_table(df,
                               values=['no'], 
                               index=['hour'],
                               columns=['browser'],
                               aggfunc='sum',
                               fill_value=0)
by_browserno1=by_browserno1.reset_index()
by_browserno1

In [None]:
by_browserno1.plot(x='hour', y='no', figsize=(8,8))
plt.legend(loc=1)
plt.title('Respondent Who Said No : in Hours')
plt.show()

As with respondents who said yes, no responses for several browsers peak at 15:00.

In [None]:
# break out the value of yes column by time and device used
by_device=pd.pivot_table(df,
                        values=['yes'], 
                        index=['date'],
                        columns=['device_make'],
                        aggfunc='sum',
                        fill_value=0)
by_device=by_device.reset_index()
by_device

In [None]:
by_device.plot(x='date', y='yes', figsize=(8,8))
plt.legend(loc=1)
plt.title('Respondent Who Said Yes : in Date')
plt.show()

There is a huge gap between the number of people who use a Generic Smartphone and those who use other devices. Most yes responses came on 8 July 2020.

In [None]:
# break out the value of yes column by hour and device used
by_device1=pd.pivot_table(df,
                               values=['yes'], 
                               index=['hour'],
                               columns=['device_make'],
                               aggfunc='sum',
                               fill_value=0)
by_device1=by_device1.reset_index()
by_device1

In [None]:
by_device1.plot(x='hour', y='yes', figsize=(8,8))
plt.legend(loc=1)
plt.title('Respondent Who Said Yes : in Hours')
plt.show()

Yes responses peak at different hours for different devices, but most peaks occur at 15:00.

In [None]:
# break out the value of no column by date and device used
by_deviceno=pd.pivot_table(df,
                               values=['no'], 
                               index=['date'],
                               columns=['device_make'],
                               aggfunc='sum',
                               fill_value=0)
by_deviceno=by_deviceno.reset_index()
by_deviceno

In [None]:
by_deviceno.plot(x='date', y='no', figsize=(8,8))
plt.legend(loc=1)
plt.title('Respondent Who Said No : in Date')
plt.show()

No responses from the most common device, Generic Smartphone, peaked on 9 July 2020.

In [None]:
# break out the value of no column by hour and device used
by_deviceno1=pd.pivot_table(df,
                               values=['no'], 
                               index=['hour'],
                               columns=['device_make'],
                               aggfunc='sum',
                               fill_value=0)
by_deviceno1=by_deviceno1.reset_index()
by_deviceno1

In [None]:
by_deviceno1.plot(x='hour', y='no', figsize=(8,8))
plt.legend(loc=1)
plt.title('Respondent Who Said No : in Hours')
plt.show()

Respondents who said no and used a Generic Smartphone responded mostly at 15:00.

# **A/B Testing**

**Which experiment group has a higher response rate?**

Hypothesis:

- H0: the control and exposed groups have the same or a similar response rate
- H1: the exposed group has a higher response rate than the control group

The hypotheses are stated up front so that the result can be interpreted correctly.
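
Written formally, with $p_c$ and $p_e$ denoting the response rates of the control and exposed groups (symbols introduced here for convenience, not taken from the dataset), the hypotheses can be read as:

$$H_0: p_e = p_c \qquad\qquad H_1: p_e > p_c$$

Here H0 is written as strict equality, which is the usual way to express "same or similar" in a test.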

**Choosing Sample Size**

In [None]:
control_respondents=df[df['experiment']=='control']
conversion_control=control_respondents['yes'].sum()+control_respondents['no'].sum()
total_respondents_control=len(control_respondents)

exposed_respondents=df[df['experiment']=='exposed']
conversion_exposed=exposed_respondents['yes'].sum()+exposed_respondents['no'].sum()
total_respondents_exposed=len(exposed_respondents)

#count number of respondents who converted in each group
print('Number of control respondents who have been shown a dummy ad: ', conversion_control)
print('Percentage of control group respond rate: ', round((conversion_control / total_respondents_control) * 100, 2), '%')

print()

print('Number of exposed respondents who have been shown a creative, an online interactive ad, with the SmartAd brand: ', conversion_exposed)
print('Percentage of exposed group respond rate: ', round((conversion_exposed / total_respondents_exposed) * 100, 2), '%')

Number of control respondents who have been shown a dummy ad:  586
Percentage of control group respond rate:  14.39 %

Number of exposed respondents who have been shown a creative, an online interactive ad, with the SmartAd brand:  657
Percentage of exposed group respond rate:  16.4 %

In [None]:
baseline_rate=round(conversion_control/total_respondents_control,2)
baseline_rate

Suppose the marketing team wants to increase the response rate from **14%** to **16%**.

The required sample size is estimated through **power analysis**, and it depends on a few factors:

- Power of the test / sensitivity (1 − β): the probability of detecting a statistically significant difference between the groups when a difference is actually present. It is set to 0.8 by convention.
- Alpha value (α): the significance level, set to 0.05.
- Effect size: how large a difference we expect between the response rates.
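
To make these three factors concrete, here is a minimal sketch of the textbook normal-approximation formula for two equally sized groups, using only the standard library. It assumes Cohen's h as the effect size for two proportions and ignores the negligible opposite tail of the two-sided test, so it is an approximation rather than a reproduction of what a power-analysis library does internally. The variable names are illustrative.

```python
from math import asin, sqrt, ceil
from statistics import NormalDist

p1, p2 = 0.14, 0.16        # baseline and target response rates from the text
alpha, power = 0.05, 0.8   # significance level and desired power

# Cohen's h: difference of arcsine-transformed proportions
# (the sign depends on argument order; only h**2 is used below)
h = 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

# Standard normal quantiles for a two-sided test and for the desired power
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
z_power = NormalDist().inv_cdf(power)

# Approximate per-group sample size for two equally sized groups
n_per_group = ceil(2 * (z_alpha + z_power) ** 2 / h ** 2)
print(n_per_group)
```

A small effect size drives the result: because h is squared in the denominator, halving the expected difference roughly quadruples the required sample.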

In [None]:
effect_size = sms.proportion_effectsize(0.14, 0.16)
sample_size=sms.NormalIndPower().solve_power(
    effect_size, 
    power=0.8, # user defined
    alpha=0.05, # user defined, for a 95% confidence interval 
    ratio=1
    )

sample_size=ceil(sample_size)

print('Required sample size: ', round(sample_size), ' per group')

Required sample size:  4999  per group

We need at least 4999 respondents in each group.

**Sampling**

In [None]:
len(df)

In [None]:
control_group = df[df['experiment'] == 'control']
exposed_group = df[df['experiment'] == 'exposed']

In [None]:
print('length of control group:' , len(control_group))
print('length of exposed group' , len(exposed_group))

length of control group: 4071
length of exposed group 4006

Since both groups have fewer observations than the required sample size, no subsampling is performed and all available data is used. Collecting more data would make the analysis more robust.
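
To get a feel for what the smaller groups cost, the sketch below estimates the power actually available with 4071 and 4006 observations for a 14% to 16% change. It relies on the same normal approximation with Cohen's h and is a rough guide under those assumptions, not an exact figure.

```python
from math import asin, sqrt
from statistics import NormalDist

n_control, n_exposed = 4071, 4006   # group sizes reported above
alpha = 0.05

# Cohen's h for a change from 14% to 16%
h = 2 * asin(sqrt(0.16)) - 2 * asin(sqrt(0.14))

# Effective sample size for two groups of unequal size
n_eff = 1 / (1 / n_control + 1 / n_exposed)

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
# Normal approximation, ignoring the negligible opposite tail
approx_power = NormalDist().cdf(h * sqrt(n_eff) - z_alpha)
print(round(approx_power, 2))
```

The estimate comes out somewhat below the conventional 0.8 target, which is consistent with the remark that more data would help.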

**Testing hypothesis**

Since the dataset is quite small, an independent two-sample t-test (Welch's version, with `equal_var = False`) is used to test the hypothesis.

In [None]:
# 1 if the user answered the questionnaire (yes or no), 0 otherwise; the group mean is the response rate
df['respond_rate']=df['yes']+df['no']


In [None]:
from scipy.stats import ttest_ind
# Welch's t-test (unequal variances); the default alternative is two-sided
result = list(ttest_ind(
    df[df['experiment'] == 'control']['respond_rate'],
    df[df['experiment'] == 'exposed']['respond_rate'],
    equal_var = False
))

In [None]:
result

In [None]:
print('T-Statistic:', result[0])
print('P-Value:', result[1])

In [None]:
result[1]<0.05
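
As a cross-check, the same question can be asked with a two-proportion z-test computed directly from the counts reported earlier (586 of 4071 in control, 657 of 4006 in exposed). This is a different technique from the t-test used in this notebook, sketched here with the standard library only; the one-sided p-value matches H1 as stated above.

```python
from math import sqrt
from statistics import NormalDist

# Counts reported earlier in the notebook
resp_control, n_control = 586, 4071
resp_exposed, n_exposed = 657, 4006

p_control = resp_control / n_control
p_exposed = resp_exposed / n_exposed

# Pooled response rate under H0 and the resulting standard error
p_pool = (resp_control + resp_exposed) / (n_control + n_exposed)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_exposed))

z = (p_exposed - p_control) / se
# One-sided p-value for H1: exposed rate > control rate
p_value = 1 - NormalDist().cdf(z)
print('z =', round(z, 3), 'p =', round(p_value, 4))
```

With samples this large, the z-test and the t-test should lead to the same conclusion.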

# **Summary**
The p-value is lower than our significance level of 0.05, so we have enough evidence to reject the null hypothesis. Respondents in the exposed group, who were shown the creative, interactive online ad with the SmartAd brand, have a higher response rate than respondents in the control group, who were shown the dummy ad.