## Pandas A/B Testing for Shoefly.com
A project for my professional certification in Data Science and Machine Learning Engineering on Codecademy

Robert Hall 01/09/2024

In [1]:
# import necessary libraries
import pandas as pd 

Step 1: 

Import the CSV file and examine the first few rows of the data.

In [3]:
ad_clicks = pd.read_csv('ad_clicks.csv')

ad_clicks.head()

,user_id,utm_source,day,ad_click_timestamp,experimental_group
0,008b7c6c-7272-471e-b90e-930d548bd8d7,google,6 - Saturday,7:18,A
1,009abb94-5e14-4b6c-bb1c-4f4df7aa7557,facebook,7 - Sunday,,B
2,00f5d532-ed58-4570-b6d2-768df5f41aed,twitter,2 - Tuesday,,A
3,011adc64-0f44-4fd9-a0bb-f1506d2ad439,google,2 - Tuesday,,B
4,012137e6-7ae7-4649-af68-205b4702169c,facebook,7 - Sunday,,B
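
The blank `ad_click_timestamp` cells above are what the later steps rely on, so it can help to see how pandas loads them. The sketch below rebuilds three of the displayed rows as an in-memory CSV and counts missing values per column. The layout of the real `ad_clicks.csv` is an assumption inferred from the displayed columns, and `sample_csv`, `sample`, and `missing_per_column` are illustrative names only.

```python
import io
import pandas as pd

# Three rows copied from the head() output above; the file layout is assumed.
sample_csv = """user_id,utm_source,day,ad_click_timestamp,experimental_group
008b7c6c-7272-471e-b90e-930d548bd8d7,google,6 - Saturday,7:18,A
009abb94-5e14-4b6c-bb1c-4f4df7aa7557,facebook,7 - Sunday,,B
00f5d532-ed58-4570-b6d2-768df5f41aed,twitter,2 - Tuesday,,A
"""

sample = pd.read_csv(io.StringIO(sample_csv))

# Empty CSV fields are read as NaN, so they show up in the null counts.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

On the full dataset, `ad_clicks.isnull().sum()` would give the same kind of summary.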


Step 2: 

"Your manager wants to know which ad platform is getting you the most views.

How many views (i.e., rows of the table) came from each utm_source?"

In [7]:
most_traffic = ad_clicks.groupby('utm_source').user_id.count().reset_index()
most_traffic

,utm_source,user_id
0,email,255
1,facebook,504
2,google,680
3,twitter,215
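
To answer the manager's question directly, the counts can be ranked and expressed as a share of all views. This is a minimal sketch that rebuilds `most_traffic` by hand from the output above; `ranked`, `share_of_views`, and `top_source` are names introduced here for illustration.

```python
import pandas as pd

# Counts copied from the output above.
most_traffic = pd.DataFrame({
    'utm_source': ['email', 'facebook', 'google', 'twitter'],
    'user_id': [255, 504, 680, 215],
})

# Sort so the platform with the most views comes first.
ranked = most_traffic.sort_values('user_id', ascending=False).reset_index(drop=True)

# Each platform's views as a percentage of all views.
ranked['share_of_views'] = (ranked.user_id / ranked.user_id.sum() * 100).round(2)

top_source = ranked.utm_source.iloc[0]
print(ranked)
```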


Step 3:

"If the column ad_click_timestamp is not null, then someone actually clicked on the ad that was displayed.

Create a new column called is_click, which is True if ad_click_timestamp is not null and False otherwise."

In [9]:
ad_clicks['is_click'] = ~ad_clicks.ad_click_timestamp.isnull()
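
As a small illustration of what this line does, the sketch below applies the same expression to a toy column and compares it with `notnull()`, which should give the same result without the `~` negation. The `toy` frame and the `15:04` timestamp are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy column: two timestamps and two missing values (hypothetical data).
toy = pd.DataFrame({'ad_click_timestamp': ['7:18', np.nan, np.nan, '15:04']})

# Same expression as in the notebook.
toy['is_click'] = ~toy.ad_click_timestamp.isnull()

# Equivalent form that avoids the negation.
toy['is_click_alt'] = toy.ad_click_timestamp.notnull()

print(toy)
```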


Step 4:

"We want to know the percent of people who clicked on ads from each utm_source.

Start by grouping by utm_source and is_click and counting the number of user_id’s in each of those groups. Save your answer to the variable clicks_by_source."

In [10]:
clicks_by_source = ad_clicks.groupby(['utm_source', 'is_click']).user_id.count().reset_index()
clicks_by_source

,utm_source,is_click,user_id
0,email,False,175
1,email,True,80
2,facebook,False,324
3,facebook,True,180
4,google,False,441
5,google,True,239
6,twitter,False,149
7,twitter,True,66


Step 5:

"Now let’s pivot the data so that the columns are is_click (either True or False), the index is utm_source, and the values are user_id.

Save your results to the variable clicks_pivot."

In [12]:
clicks_pivot = clicks_by_source.pivot(
  index='utm_source',
  columns='is_click',
  values='user_id'
).reset_index()

clicks_pivot

is_click,utm_source,False,True
0,email,175,80
1,facebook,324,180
2,google,441,239
3,twitter,149,66
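
The group-then-pivot pattern used in Steps 4 and 5 can also be done in one call with `pd.crosstab`. The sketch below compares the two on a handful of made-up rows; it is an alternative route, not what the exercise asks for, and the `toy`, `via_pivot`, and `via_crosstab` names are hypothetical.

```python
import pandas as pd

# Hypothetical rows, just enough to cover every source/is_click combination.
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4', 'u5', 'u6', 'u7'],
    'utm_source': ['email', 'email', 'email', 'google', 'google', 'google', 'google'],
    'is_click': [True, False, False, True, True, False, True],
})

# Two-step approach from the notebook: count, then pivot to wide form.
via_pivot = toy.groupby(['utm_source', 'is_click']).user_id.count().reset_index().pivot(
    index='utm_source',
    columns='is_click',
    values='user_id'
)

# One-step alternative: cross-tabulate the two columns directly.
via_crosstab = pd.crosstab(toy.utm_source, toy.is_click)

print(via_crosstab)
```

One practical difference to be aware of: if a source had no rows for one of the two `is_click` values, the pivot would contain a missing value in that cell, while `crosstab` reports a count of 0.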


Step 6:

"Create a new column in clicks_pivot called percent_clicked which is equal to the percent of users who clicked on the ad from each utm_source.

Was there a difference in click rates for each source?"

In [14]:
clicks_pivot['percent_clicked'] = (
    clicks_pivot[True] / (clicks_pivot[True] + clicks_pivot[False]) * 100
).round(2)
clicks_pivot

is_click,utm_source,False,True,percent_clicked
0,email,175,80,31.37
1,facebook,324,180,35.71
2,google,441,239,35.15
3,twitter,149,66,30.7
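
The percentages above can be cross-checked another way: the mean of a boolean column is the fraction of `True` values, so grouping by source and averaging `is_click` gives the click rate directly. The sketch below rebuilds one row per view from the printed counts (the real notebook would use `ad_clicks` itself); `counts`, `rebuilt`, and `percent_clicked` are illustrative names.

```python
import pandas as pd

# (not clicked, clicked) counts per source, copied from the table above.
counts = {
    'email': (175, 80),
    'facebook': (324, 180),
    'google': (441, 239),
    'twitter': (149, 66),
}

# Rebuild one row per view so the example is self-contained.
rows = []
for source, (not_clicked, clicked) in counts.items():
    rows += [{'utm_source': source, 'is_click': False}] * not_clicked
    rows += [{'utm_source': source, 'is_click': True}] * clicked
rebuilt = pd.DataFrame(rows)

# Mean of a boolean column = share of True values.
percent_clicked = (rebuilt.groupby('utm_source').is_click.mean() * 100).round(2)
print(percent_clicked)
```

Both routes agree: Facebook and Google sit at roughly 35 to 36 percent, while email and Twitter are closer to 31 percent.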


Step 7:

"The column experimental_group tells us whether the user was shown Ad A or Ad B.

Were approximately the same number of people shown both ads?"

In [15]:
group_counts = ad_clicks\
    .groupby('experimental_group')\
    .user_id.count().reset_index()
print(group_counts)

  experimental_group  user_id
0                  A      827
1                  B      827


Step 8:

"Using the column is_click that we defined earlier, check to see if a greater percentage of users clicked on Ad A or Ad B."

In [19]:
# count clicks and non-clicks for each experimental group
percentages_click = ad_clicks\
    .groupby(['experimental_group', 'is_click'])\
    .user_id.count().reset_index().pivot(
        index='experimental_group',
        columns='is_click',
        values='user_id'
    ).reset_index()
percentages_click

is_click,experimental_group,False,True
0,A,517,310
1,B,572,255
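
The table above holds counts, while the question asks about percentages. A minimal sketch of the remaining arithmetic, using the counts as printed (the frame is rebuilt by hand here, and the `percent_clicked` column and `total` variable are additions for illustration):

```python
import pandas as pd

# Counts copied from the output above.
percentages_click = pd.DataFrame({
    'experimental_group': ['A', 'B'],
    False: [517, 572],
    True: [310, 255],
})

# Clicks as a percentage of everyone shown each ad.
total = percentages_click[True] + percentages_click[False]
percentages_click['percent_clicked'] = (percentages_click[True] / total * 100).round(2)

print(percentages_click)
```

With these counts, Ad A comes out at about 37.5 percent and Ad B at about 30.8 percent, so a greater percentage of users clicked on Ad A.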


Step 9:

"The Product Manager for the A/B test thinks that the clicks might have changed by day of the week.

Start by creating two DataFrames: a_clicks and b_clicks, which contain only the results for A group and B group, respectively."

In [20]:
a_clicks = ad_clicks[ad_clicks.experimental_group == 'A']
b_clicks = ad_clicks[ad_clicks.experimental_group == 'B']

Step 10:

"For each group (a_clicks and b_clicks), calculate the percent of users who clicked on the ad \[grouped] by day."

In [21]:
# first, create the new pivot tables / dataframes

a_clicks_pivot = a_clicks.groupby(['is_click', 'day']).user_id.count().reset_index().pivot(
  index='day',
  columns='is_click',
  values='user_id'
).reset_index()

b_clicks_pivot = b_clicks.groupby(['is_click', 'day']).user_id.count().reset_index().pivot(
  index='day',
  columns='is_click',
  values='user_id'
).reset_index()

In [22]:
# next, create the percent_clicked columns in each

a_clicks_pivot['percent_clicked'] = (
    a_clicks_pivot[True] / (a_clicks_pivot[True] + a_clicks_pivot[False]) * 100
).round(2)

b_clicks_pivot['percent_clicked'] = (
    b_clicks_pivot[True] / (b_clicks_pivot[True] + b_clicks_pivot[False]) * 100
).round(2)
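
Since the same pivot and percentage steps are applied to both groups, they could be wrapped in a small helper. This is a sketch of one possible refactoring, not part of the original exercise; `click_rate_by_day` is a hypothetical function name and the `toy` frame is made-up data.

```python
import pandas as pd

def click_rate_by_day(clicks):
    """Return per-day click/non-click counts plus a percent_clicked column."""
    pivot = clicks.groupby(['is_click', 'day']).user_id.count().reset_index().pivot(
        index='day',
        columns='is_click',
        values='user_id'
    ).reset_index()
    pivot['percent_clicked'] = (
        pivot[True] / (pivot[True] + pivot[False]) * 100
    ).round(2)
    return pivot

# Hypothetical rows: Monday has 1 click out of 4 views, Tuesday 1 out of 2.
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4', 'u5', 'u6'],
    'day': ['1 - Monday'] * 4 + ['2 - Tuesday'] * 2,
    'is_click': [True, False, False, False, True, False],
})

toy_rates = click_rate_by_day(toy)
print(toy_rates)
```

In the notebook this would be called as `click_rate_by_day(a_clicks)` and `click_rate_by_day(b_clicks)`. The helper assumes every day has at least one click and one non-click; otherwise the pivot would contain missing values.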

In [25]:
# view the number of clicks group A had for each day
a_clicks_pivot

is_click,day,False,True,percent_clicked
0,1 - Monday,70,43,38.05
1,2 - Tuesday,76,43,36.13
2,3 - Wednesday,86,38,30.65
3,4 - Thursday,69,47,40.52
4,5 - Friday,77,51,39.84
5,6 - Saturday,73,45,38.14
6,7 - Sunday,66,43,39.45


In [24]:
# view the number of clicks group B had for each day
b_clicks_pivot

is_click,day,False,True,percent_clicked
0,1 - Monday,81,32,28.32
1,2 - Tuesday,74,45,37.82
2,3 - Wednesday,89,35,28.23
3,4 - Thursday,87,29,25.0
4,5 - Friday,90,38,29.69
5,6 - Saturday,76,42,35.59
6,7 - Sunday,75,34,31.19


Step 11:

"Compare the results for A and B. What happened over the course of the week?

Do you recommend that your company use Ad A or Ad B?"

*Ad A has a higher click rate than Ad B on every day of the week except Tuesday, and its daily click rate also varies less (roughly 31% to 41%, compared with 25% to 38% for Ad B). Because Ad A performs both better and more predictably, I recommend going with Ad A.*
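
The comparison behind this recommendation can be sketched by putting the two `percent_clicked` columns side by side. The values below are copied from the two tables in Step 10; the column names `a_percent`, `b_percent`, and `a_minus_b`, and the use of max minus min as a rough measure of consistency, are choices made here for illustration.

```python
import pandas as pd

# Daily click rates copied from the a_clicks_pivot and b_clicks_pivot outputs.
comparison = pd.DataFrame({
    'day': ['1 - Monday', '2 - Tuesday', '3 - Wednesday', '4 - Thursday',
            '5 - Friday', '6 - Saturday', '7 - Sunday'],
    'a_percent': [38.05, 36.13, 30.65, 40.52, 39.84, 38.14, 39.45],
    'b_percent': [28.32, 37.82, 28.23, 25.0, 29.69, 35.59, 31.19],
})

# Positive values mean Ad A did better that day.
comparison['a_minus_b'] = (comparison.a_percent - comparison.b_percent).round(2)

# Days on which Ad B had the higher click rate.
days_b_ahead = comparison.loc[comparison.a_minus_b < 0, 'day'].tolist()

# Spread between the best and worst day for each ad.
a_range = round(comparison.a_percent.max() - comparison.a_percent.min(), 2)
b_range = round(comparison.b_percent.max() - comparison.b_percent.min(), 2)

print(comparison)
print(days_b_ahead, a_range, b_range)
```

In the notebook itself, the same table could be built by merging `a_clicks_pivot` and `b_clicks_pivot` on `day` rather than typing the values in.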