# Analyzing FarmBurg’s A/B test
***

Brian ran an A/B test with three different groups: A, B, and C. He has provided us with a CSV file of his results named *clicks.csv*. It has the following columns:

* *user_id* - a unique id for each visitor to the FarmBurg site
* *group* - either 'A', 'B', or 'C' depending on which group the visitor was assigned to
* *is_purchase* - either 'Yes' if the visitor made a purchase or 'No' if they did not.

In [1]:
# Load necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [2]:
# Read the file into a variable 
abdata = pd.read_csv("../input/clicks/clicks.csv")

# Preview the dataframe to see what data we are working with
abdata.head()

,user_id,group,is_purchase
0,8e27bf9a,A,No
1,eb89e6f0,A,No
2,7119106a,A,No
3,e53781ff,A,No
4,02d48cf1,A,Yes


Note that we have two categorical variables: group and is_purchase. We are interested in whether visitors are more likely to make a purchase if they are in any one group compared to the others. Because we want to know if there is an association between two categorical variables, we’ll start by using a Chi-Square test to address our question.

In order to run a Chi-Square test, we first need to create a contingency table of the variables group and is_purchase. Use pd.crosstab() to create this table and name the result Xtab, then print it out. Which group appears to have the highest number of purchases? 

We learned that Group A with 316 purchases has the highest number of purchases.

In [3]:
# Create a contingency table with pd.crosstab and print
Xtab = pd.crosstab(abdata.group, abdata.is_purchase)
Xtab

is_purchase,No,Yes
group,,
A,1350,316
B,1483,183
C,1583,83
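
Raw counts are easy to compare here because every group has the same number of visitors, but purchase rates are the safer comparison in general. The following is a minimal sketch, assuming the counts from the contingency table are typed in by hand; the `counts` DataFrame is a hypothetical stand-in for Xtab and is not loaded from clicks.csv.

```python
import pandas as pd

# Counts copied from the contingency table (stand-in for Xtab, not read from the CSV)
counts = pd.DataFrame(
    {"No": [1350, 1483, 1583], "Yes": [316, 183, 83]},
    index=pd.Index(["A", "B", "C"], name="group"),
)

# Divide purchases by each row's total to get the share of visitors who purchased
purchase_rate = counts["Yes"] / counts.sum(axis=1)
print(purchase_rate.round(4))
```

On the real data, `pd.crosstab(abdata.group, abdata.is_purchase, normalize='index')` should give the same row-wise proportions directly.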


To conduct the Chi-Square Test, import chi2_contingency from scipy.stats.

Then, use the function chi2_contingency with the data in Xtab to calculate the p-value. Remember that of the four values returned by chi2_contingency, the p-value is the second value.

Save the p-value to a variable named pval and print the result. Using a significance threshold of 0.05, is there a significant difference in the purchase rate for groups A, B, and C? 

The p-value is less than 0.05 and we can conclude that there is a significant difference in the purchase rate for groups A, B, and C.

In [4]:
# Import the chi2_contingency function
from scipy.stats import chi2_contingency

# Calculate the p-value
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Print the p-value
print(pval)

2.4126213546684264e-35
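
To see where this p-value comes from, it can help to rebuild the Chi-Square statistic by hand. The sketch below assumes the counts from the contingency table are entered as a NumPy array (`observed`, `expected_manual`, and `chi2_manual` are illustrative names, not part of the original notebook) and compares the manual result with what chi2_contingency returns.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table (rows: A, B, C; columns: No, Yes)
observed = np.array([[1350, 316], [1483, 183], [1583, 83]])

# Expected count for each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-Square statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Compare with SciPy's result for the same table
chi2, pval, dof, expected = chi2_contingency(observed)
print(chi2_manual, chi2, dof, pval)
```

The table has 3 rows and 2 columns, so the test has (3 - 1) * (2 - 1) = 2 degrees of freedom.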


In [5]:
# Determine if the p-value is significant at the 0.05 threshold
is_significant = pval < 0.05

Our day is a little less busy than expected, so we decide to ask Brian about his test.

Us: Hey Brian! What was that test you were running anyway?

Brian: We are trying to get users to purchase a small FarmBurg upgrade package. It’s called a microtransaction. We’re not sure how much to charge for it, so we tested three different price points: $0.99 (group A), $1.99 (group B), and $4.99 (group C). It looks like significantly more people bought the upgrade package for $0.99, so I guess that’s what we’ll charge.

Us: Oh no! We should have asked you this before we did that Chi-Square test. That wasn’t the right test at all. It’s true that more people wanted to purchase the upgrade at $0.99; you probably expected that. What we really want to know is whether each price point allows us to make enough money that we can exceed some target goal. Brian, how much do you think it cost to build this feature?

Brian: Hmm. I guess that we need to generate a minimum of $1,000 in revenue per week in order to justify this project.

Us: We have some work to do!

In order to justify this feature, we will need to calculate the necessary purchase rate for each price point. Let’s start by calculating the number of visitors to the site this week.

It turns out that Brian ran his original test over the course of a week, so the number of visitors in abdata is equal to the number of visitors in a typical week. Calculate the number of visitors in the data and save the value in a variable named num_visits. Make sure to print the value.

In [6]:
# Calculate the number of visits
num_visits = len(abdata)

# Print the number of visits
num_visits

4998

Now that we know how many visitors we generally get each week (num_visits), we need to calculate the number of visitors who would need to purchase the upgrade package at each price point ($0.99, $1.99, $4.99) in order to generate Brian’s minimum revenue target of $1,000 per week.

To start, calculate the number of sales that would be needed to reach $1,000 of revenue at a price point of $0.99. Save the result as num_sales_needed_099 and print it out.

Now that we know how many sales we need at a $0.99 price point, calculate the proportion of weekly visitors who would need to make a purchase in order to meet that goal. Remember that the number of weekly visitors is saved as num_visits. Save the result as p_sales_needed_099 and print it out.

Repeat both calculations for the $1.99 and $4.99 price points, then print out the proportions. Note that for higher price points, you’ll need to sell fewer upgrade packages in order to meet your minimum revenue target, so the proportions should decrease as the price points increase.

In [7]:
# Calculate the purchase rate needed at 0.99
num_sales_needed_099 = 1000/0.99
p_sales_needed_099 = num_sales_needed_099/num_visits

# Print the purchase rate needed at 0.99
print(p_sales_needed_099)

# Calculate the purchase rate needed at 1.99
num_sales_needed_199 = 1000/1.99
p_sales_needed_199 = num_sales_needed_199/num_visits

# Print the purchase rate needed at 1.99
print(p_sales_needed_199)

# Calculate the purchase rate needed at 4.99
num_sales_needed_499 = 1000/4.99
p_sales_needed_499 = num_sales_needed_499/num_visits

# Print the purchase rate needed at 4.99
print(p_sales_needed_499)

0.20210104243717691
0.10054272965467594
0.040096198800161346
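
The three calculations follow the same pattern, so they can also be written as a loop. This is a sketch, assuming the visitor count printed earlier (4998) and Brian's $1,000 target; the names `revenue_target`, `p_sales_needed`, and `whole_sales_needed` are illustrative and do not appear in the original notebook. Rounding the sales count up with math.ceil is an added assumption, since a fraction of a sale cannot be made.

```python
import math

num_visits = 4998       # weekly visitors, taken from the notebook output
revenue_target = 1000   # Brian's minimum weekly revenue, in dollars

p_sales_needed = {}       # required purchase rate per price point
whole_sales_needed = {}   # required sales rounded up to whole purchases

for price in [0.99, 1.99, 4.99]:
    num_sales_needed = revenue_target / price
    p_sales_needed[price] = num_sales_needed / num_visits
    whole_sales_needed[price] = math.ceil(num_sales_needed)
    print(price, whole_sales_needed[price], p_sales_needed[price])
```

The proportions keep the unrounded sales count so that they match the values printed by the notebook.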


Now let’s return to Brian’s question. To start, we want to know if the percent of Group A (the $0.99 price point) that purchased an upgrade package is significantly greater than p_sales_needed_099 (the percent of visitors who need to buy an upgrade package at $0.99 in order to make our minimum revenue target of $1,000).

To answer this question, we want to focus on just the visitors in group A. Then, we want to compare the purchase rate in that group to p_sales_needed_099.

Since we have a single sample of categorical data and want to compare it to a hypothetical population value, a binomial test is appropriate. In order to run a binomial test for group A, we need to know two pieces of information:

* The number of visitors in Group A (the number of visitors who were offered the $0.99 price point)
* The number of visitors in Group A who made a purchase

Calculate these two numbers and save them as samp_size_099 and sales_099, respectively. Note that you can use the contingency table that you printed earlier to get these numbers OR you can use Python syntax.

In [8]:
# Calculate samp size & sales for 0.99 price point
samp_size_099 = np.sum(abdata.group == 'A')
sales_099 = np.sum((abdata.group == 'A') & (abdata.is_purchase == 'Yes'))

# Print samp size & sales for 0.99 price point
print(samp_size_099)
print(sales_099)

# Calculate samp size & sales for 1.99 price point
samp_size_199 = np.sum(abdata.group == 'B')
sales_199 = np.sum((abdata.group == 'B') & (abdata.is_purchase == 'Yes'))

# Print samp size & sales for 1.99 price point
print(samp_size_199)
print(sales_199)

# Calculate samp size & sales for 4.99 price point
samp_size_499 = np.sum(abdata.group == 'C')
sales_499 = np.sum((abdata.group == 'C') & (abdata.is_purchase == 'Yes'))

# Print samp size & sales for 4.99 price point
print(samp_size_499)
print(sales_499)

1666
316
1666
183
1666
83
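
The same six numbers can be obtained in one pass with a groupby. This is a sketch under the assumption that the data look like the contingency table: `abdata_sketch` is a hypothetical stand-in for abdata, rebuilt from the printed counts rather than read from clicks.csv.

```python
import pandas as pd

# Rebuild one row per visitor from the printed counts (stand-in for abdata)
rows = []
for group, n_no, n_yes in [("A", 1350, 316), ("B", 1483, 183), ("C", 1583, 83)]:
    rows += [(group, "No")] * n_no + [(group, "Yes")] * n_yes
abdata_sketch = pd.DataFrame(rows, columns=["group", "is_purchase"])

# Per group: total visitors (size) and number of purchases (sum of True values)
summary = (
    abdata_sketch.assign(purchased=abdata_sketch.is_purchase == "Yes")
    .groupby("group")["purchased"]
    .agg(samp_size="size", sales="sum")
)
print(summary)
```

With the real dataframe, replacing `abdata_sketch` with `abdata` should produce the same table.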


For Group A ($0.99 price point), perform a binomial test using binom_test() to see if the observed purchase rate is significantly greater than p_sales_needed_099. Remember that there are four inputs to binom_test():

* x will be the number of purchases for Group A
* n will be the total number of visitors assigned to Group A
* p will be the target proportion of purchases for the $0.99 price point
* alternative will indicate the alternative hypothesis for this test; in this case, we want to know if the observed purchase rate is significantly 'greater' than the purchase rate that results in the minimum revenue target.

Save the result to pvalueA and print its value, then repeat the test for groups B and C. Note that you’ll first need to import the binom_test() function from scipy.stats using the following line of code:

In [9]:
# Import the binom_test function
from scipy.stats import binom_test

# Calculate the p-value for Group A
pvalueA = binom_test(sales_099, n=samp_size_099, p=p_sales_needed_099, alternative='greater')

# Print the p-value for Group A
print(pvalueA)

# Calculate the p-value for Group B
pvalueB = binom_test(sales_199, n=samp_size_199, p=p_sales_needed_199, alternative='greater')

# Print the p-value for Group B
print(pvalueB)

# Calculate the p-value for Group C
pvalueC = binom_test(sales_499, n=samp_size_499, p=p_sales_needed_499, alternative='greater')

# Print the p-value for Group C
print(pvalueC)

0.9028081076188554
0.11184562623740596
0.02794482665983064
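
Recent SciPy releases deprecated binom_test() and later removed it in favor of binomtest(), which returns a result object with a `pvalue` attribute. The sketch below assumes a SciPy version that provides binomtest() and re-enters the counts from the earlier cells by hand; the `groups` dictionary and the `by_hand` check are illustrative additions, not part of the original notebook. The one-sided p-value is also computed directly as P(X >= sales) using the binomial survival function.

```python
from scipy.stats import binom, binomtest

num_visits = 4998  # weekly visitors, from the notebook output

# Price, purchases, and visitors per group, copied from the earlier outputs
groups = {
    "A": {"price": 0.99, "sales": 316, "samp_size": 1666},
    "B": {"price": 1.99, "sales": 183, "samp_size": 1666},
    "C": {"price": 4.99, "sales": 83, "samp_size": 1666},
}

pvalues = {}
by_hand = {}
for name, g in groups.items():
    # Purchase rate needed to reach $1,000 per week at this price
    p_needed = 1000 / g["price"] / num_visits
    result = binomtest(g["sales"], n=g["samp_size"], p=p_needed, alternative="greater")
    pvalues[name] = result.pvalue
    # P(X >= sales) for X ~ Binomial(samp_size, p_needed)
    by_hand[name] = binom.sf(g["sales"] - 1, g["samp_size"], p_needed)
    print(name, pvalues[name], by_hand[name])
```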


In [10]:
# Wrap the one-sided binomial test in a lambda (an anonymous function) so it can be reused
run_binom_test = lambda sales, samp_size, p_sales_needed: binom_test(sales, n=samp_size, p=p_sales_needed, alternative='greater')
print(run_binom_test(sales_099, samp_size_099, p_sales_needed_099))

0.9028081076188554


Based on the three p-values you calculated for the binomial tests in each group and a significance threshold of 0.05, were there any groups where the purchase rate was significantly higher than the target? Based on this information, what price should Brian charge for the upgrade package?

pvalueC is the only p-value below the threshold of 0.05, so group C is the only group where we would conclude that the purchase rate is significantly higher than the target needed to reach $1,000 in revenue per week. Brian should therefore charge $4.99 for the upgrade.
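
As a rough sanity check on this conclusion, we can project weekly revenue at each price from the observed purchase rates. This is a sketch of point estimates only, assuming every weekly visitor is shown the same price and buys at the rate seen in the test; the `observed` and `projected_revenue` names are illustrative.

```python
num_visits = 4998  # weekly visitors, from the notebook output

# (price, purchases, visitors) for each group, copied from the results above
observed = {"A": (0.99, 316, 1666), "B": (1.99, 183, 1666), "C": (4.99, 83, 1666)}

projected_revenue = {}
for group, (price, sales, samp_size) in observed.items():
    observed_rate = sales / samp_size
    # Expected weekly purchases times the price gives projected weekly revenue
    projected_revenue[group] = observed_rate * num_visits * price
    print(group, round(projected_revenue[group], 2))
```

Under these assumptions, group B's point estimate also lands above $1,000, but its binomial test was not significant at 0.05, so the data do not rule out that its true purchase rate falls short of the target.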

In [11]:
# Set the correct value for the final answer variable
final_answer = '4.99'

# Print the chosen price group
print(final_answer)

4.99


In this project, we first performed a Chi-Square test to determine whether two categorical variables (i.e., group and is_purchase) are associated. We were interested in whether visitors are more likely to make a purchase in any one group compared to the others. Given a p-value less than 0.05, we concluded that there is a significant difference in the purchase rate for groups A, B, and C. We then ran a one-sided binomial test for each group to check whether its purchase rate is significantly greater than the rate needed to reach $1,000 in weekly revenue. Only group C ($4.99) passed at the 0.05 threshold, so $4.99 is the price we recommended.