## Analyze A/B Test Results


## Table of Contents
- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)


<a id='intro'></a>
### Introduction
This A/B Test will determine if a new version of a webpage converts more users than the existing (old) version. The test will utilize data from two groups, the treatment group, which was shown the new webpage, and the control group, which was shown the old webpage. We will use this data to compare the performance of each webpage and make a final recommendation.


<a id='probability'></a>
### Part I - Probability

To get started, let's import our libraries.

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
#We are setting the seed to assure you get the same answers on quizzes as we set up
random.seed(42)

`1.` Now, read in the `ab_data.csv` data. Store it in `df`.  **Use your dataframe to answer the questions in Quiz 1 of the classroom.**

a. Read in the dataset and take a look at the top few rows here:

In [2]:
df = pd.read_csv('ab_data.csv')
df.head()

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


b. Use the below cell to find the number of rows in the dataset.

In [3]:
len(df)

294478

c. The number of unique users in the dataset.

In [4]:
df.user_id.nunique()

290584

d. The proportion of users converted.

In [5]:
df['converted'].sum() / len(df)

0.11965919355605512

e. The number of times the `new_page` and `treatment` don't line up.

In [6]:
len(df.query('group == "treatment" and landing_page != "new_page"')) + len(df.query('group == "control" and landing_page == "new_page"'))

3893
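
The same count can be obtained with a single boolean comparison: a row is misaligned exactly when "is in the treatment group" and "saw the new page" disagree. The sketch below assumes `landing_page` takes only the two values `new_page` and `old_page`, and it uses a small hypothetical DataFrame (`toy`) rather than `ab_data.csv`.

```python
import pandas as pd

# Hypothetical toy rows for illustration; not taken from ab_data.csv
toy = pd.DataFrame({
    "group": ["treatment", "treatment", "control", "control"],
    "landing_page": ["new_page", "old_page", "new_page", "old_page"],
})

# A row is misaligned when exactly one of the two conditions holds
is_treatment = toy["group"] == "treatment"
is_new_page = toy["landing_page"] == "new_page"
mismatches = int((is_treatment != is_new_page).sum())
print(mismatches)
```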

f. Do any of the rows have missing values?

In [7]:
# Check for rows with null values-- no missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


`2.` For the rows where **treatment** is not aligned with **new_page** or **control** is not aligned with **old_page**, we cannot be sure if this row truly received the new or old page.  Use **Quiz 2** in the classroom to provide how we should handle these rows.  

a. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz.  Store your new dataframe in **df2**.

In [8]:
# Create new dataset with correctly aligned rows
df2 = pd.concat([df.query('group == "treatment" and landing_page == "new_page"'), df.query('group == "control" and landing_page == "old_page"')])

In [9]:
# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

0

`3.` Use **df2** and the cells below to answer questions for **Quiz3** in the classroom.

a. How many unique **user_id**s are in **df2**?

In [10]:
df2.user_id.nunique()

290584

b. There is one **user_id** repeated in **df2**.  What is it?

In [11]:
df2[df2.duplicated('user_id')]['user_id']

2893    773192
Name: user_id, dtype: int64

c. What is the row information for the repeat **user_id**? 

In [12]:
df2[df2.user_id == 773192]

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 1899 | 773192 | 2017-01-09 05:37:58.781806 | treatment | new_page | 0 |
| 2893 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |


d. Remove **one** of the rows with a duplicate **user_id**, but keep your dataframe as **df2**.

In [13]:
# Drop one of the duplicates rows
df2.drop_duplicates(subset = ['user_id'], inplace=True)
df2[df2.user_id == 773192]

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 1899 | 773192 | 2017-01-09 05:37:58.781806 | treatment | new_page | 0 |


`4.` Use **df2** in the below cells to answer the quiz questions related to **Quiz 4** in the classroom.

a. What is the probability of an individual converting regardless of the page they receive?

In [14]:
len(df2.query('converted == 1')) / len(df2)

0.11959708724499628

b. Given that an individual was in the `control` group, what is the probability they converted?

In [15]:
control_converted = len(df2.query('converted == 1 and group == "control"')) / len(df2.query('group == "control"'))
control_converted

0.1203863045004612

c. Given that an individual was in the `treatment` group, what is the probability they converted?

In [16]:
treatment_converted = len(df2.query('converted == 1 and group == "treatment"')) / len(df2.query('group == "treatment"'))
treatment_converted

0.11880806551510564

d. What is the probability that an individual received the new page?

In [17]:
len(df2.query('landing_page == "new_page"')) / len(df2)

0.5000619442226688

e. Consider your results from a. through d. above, and explain below whether you think there is sufficient evidence to say that the new treatment page leads to more conversions.

**There is not sufficient evidence that the new treatment page leads to more conversions. The observed conversion rate for the control group (about 12.04%) is slightly higher than for the treatment group (about 11.88%), a difference of approximately 0.0016. These raw proportions alone cannot establish statistical significance, so a formal hypothesis test follows in Part II.**

<a id='ab_test'></a>
### Part II - A/B Test



Null hypothesis: $$ H_{0}: p_{new} - p_{old} \leq 0 $$

Alternative hypothesis: $$ H_{1}: p_{new} - p_{old} > 0 $$



`2.` Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the **converted** success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the **converted** rate in **ab_data.csv** regardless of the page. <br><br>

Use a sample size for each page equal to the ones in **ab_data.csv**.  <br><br>

Perform the sampling distribution for the difference in **converted** between the two pages over 10,000 iterations of calculating an estimate from the null.  <br><br>

<br><br>

In [18]:
# Create separate dataframes for treatment and control group
treatment_group = df2.query('group == "treatment"')
control_group = df2.query('group == "control"')

a. What is the **convert rate** for $p_{new}$ under the null? 

In [31]:
p_new = df2.converted.mean()
p_new

0.11959708724499628

b. What is the **convert rate** for $p_{old}$ under the null? <br><br>

In [32]:
p_old = df2.converted.mean()
p_old

0.11959708724499628

c. What is $n_{new}$?

In [21]:
n_new = len(treatment_group)
n_new

145310

d. What is $n_{old}$?

In [22]:
n_old = len(control_group)
n_old

145274

e. Simulate $n_{new}$ transactions with a convert rate of $p_{new}$ under the null.  Store these $n_{new}$ 1's and 0's in **new_page_converted**.

In [23]:
# Draw n_new Bernoulli outcomes (1 = converted) at the null rate p_new
new_page_converted = np.random.binomial(1, p_new, n_new)

f. Simulate $n_{old}$ transactions with a convert rate of $p_{old}$ under the null.  Store these $n_{old}$ 1's and 0's in **old_page_converted**.

In [24]:
# Draw n_old (not n_new) Bernoulli outcomes at the null rate p_old
old_page_converted = np.random.binomial(1, p_old, n_old)

g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f).

In [25]:
new_page_converted.mean() - old_page_converted.mean()

-0.0008602298534168273

h. Simulate 10,000 $p_{new}$ - $p_{old}$ values using the same process you used in parts **a. through g.** above.  Store all 10,000 values in a numpy array called **p_diffs**.

In [33]:
# Perform sampling distribution

new_converted_simulation = np.random.binomial(n_new, p_new, 10000)/n_new
old_converted_simulation = np.random.binomial(n_old, p_old, 10000)/n_old
p_diffs = new_converted_simulation - old_converted_simulation

i. Plot a histogram of the **p_diffs**.  Does this plot look like what you expected?  Use the matching problem in the classroom to assure you fully understand what was computed here.

In [None]:
plt.hist(p_diffs);

j. What proportion of the **p_diffs** are greater than the actual difference observed in **ab_data.csv**?

In [35]:
obs_diff = df2.query('group == "treatment"').converted.mean() - df2.query('group == "control"').converted.mean()
p_diffs = np.array(p_diffs)

(p_diffs > obs_diff).mean()

0.9039

k. In words, explain what you just computed in part **j.**  What is this value called in scientific studies?  What does this value mean in terms of whether or not there is a difference between the new and old pages?

**We calculated the p-value: the probability of observing a statistic at least as extreme as ours, in the direction of the alternative, if the null hypothesis is true. In other words, how likely is a difference this large or larger if the conversion rate for the new page is less than or equal to that of the old page? This p-value is 0.9039, or 90.39%, which exceeds any reasonable significance threshold such as 0.05. Therefore, we fail to reject the null.**
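
The whole simulation can be condensed into a few lines. The following is a minimal sketch, assuming the pooled conversion rate, group sizes, and group conversion rates reported earlier in this notebook; the seeded `rng` generator is an assumption added for reproducibility, so the exact proportion will differ slightly from the 0.9039 shown above.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator (an assumption, for reproducibility)

# Values reported earlier in this notebook
p_null = 0.11959708724499628          # pooled conversion rate under the null
n_new, n_old = 145310, 145274         # group sizes
obs_diff = 0.11880806551510564 - 0.1203863045004612  # treatment minus control

# Each binomial draw is the number of conversions in one simulated experiment
new_rates = rng.binomial(n_new, p_null, 10_000) / n_new
old_rates = rng.binomial(n_old, p_null, 10_000) / n_old
p_diffs = new_rates - old_rates

# Share of simulated differences larger than the observed one
p_value = float((p_diffs > obs_diff).mean())
print(round(p_value, 3))
```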

l. We could also use a built-in to achieve similar results.  Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Fill in the below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let `n_old` and `n_new` refer to the number of rows associated with the old page and new pages, respectively.

In [37]:
import statsmodels.api as sm

convert_old = len(df2.query('converted == 1 and group == "control"')) 
convert_new = len(df2.query('converted == 1 and group == "treatment"')) 
n_old = len(df2.query('group == "control"'))
n_new = len(df2.query('group == "treatment"'))

m. Now use `stats.proportions_ztest` to compute your test statistic and p-value.  [Here](http://knowledgetack.com/python/statsmodels/proportions_ztest/) is a helpful link on using the built in.

In [38]:
z_score, p_val = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative = 'smaller')

In [39]:
z_score, p_val

(1.3109241984234394, 0.9050583127590245)

n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages?  Do they agree with the findings in parts **j.** and **k.**?

**The p-value is the probability of observing a statistic at least as extreme as ours under the null hypothesis. At 0.905 it remains well above the conventional threshold of 0.05. Therefore, we fail to reject the null.
The z-score measures how many standard errors the observed difference in conversion rates (old minus new, given the order of the arguments) lies from zero. For this one-sided test at an alpha of 0.05, we would reject the null only if the z-score fell below about -1.645, meaning the new page converted significantly better. Our z-score of 1.31 is far from that region, which agrees with the findings in parts j. and k.**
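
The z-score and p-value can also be reproduced by hand from numbers already reported in this notebook. This is a sketch of the pooled two-proportion z-test, assuming the pooled rate and group rates computed in Part I; `norm_cdf` is a small helper written here for illustration, not part of statsmodels.

```python
import math

# Values reported earlier in this notebook
p_old_hat = 0.1203863045004612    # control conversion rate
p_new_hat = 0.11880806551510564   # treatment conversion rate
p_pool = 0.11959708724499628      # overall conversion rate
n_old, n_new = 145274, 145310

# Standard error of the difference under the null (pooled proportion)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
z = (p_old_hat - p_new_hat) / se

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Lower-tail probability P(Z <= z), the tail used by alternative='smaller'
p_value = norm_cdf(z)
print(round(z, 4), round(p_value, 4))
```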

<a id='regression'></a>
### Part III - A regression approach

`1.` In this final part, you will see that the result you achieved in the previous A/B test can also be achieved by performing regression.<br><br>

a. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?

**Logistic regression**

b. The goal is to use **statsmodels** to fit the regression model you specified in part **a.** to see if there is a significant difference in conversion based on which page a customer receives.  However, you first need to create a column for the intercept, and create a dummy variable column for which page each user received.  Add an **intercept** column, as well as an **ab_page** column, which is 1 when an individual receives the **treatment** and 0 if **control**.

In [44]:
df2['intercept'] = 1

data = df2['group'] == 'treatment'
df2['ab_page'] = data *1

c. Use **statsmodels** to import your regression model.  Instantiate the model, and fit the model using the two columns you created in part **b.** to predict whether or not an individual converts.

In [45]:
logit_mod = sm.Logit(df2['converted'], df2[['intercept', 'ab_page']])
results = logit_mod.fit()

Optimization terminated successfully.
         Current function value: 0.366118
         Iterations 6


d. Provide the summary of your model below, and use it as necessary to answer the following questions.

In [46]:
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | converted | No. Observations: | 290584.0 |
| Model: | Logit | Df Residuals: | 290582.0 |
| Method: | MLE | Df Model: | 1.0 |
| Date: | Tue, 16 Nov 2021 | Pseudo R-squ.: | 8.077e-06 |
| Time: | 12:43:45 | Log-Likelihood: | -106390.0 |
| converged: | True | LL-Null: | -106390.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.1899 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | -1.9888 | 0.008 | -246.669 | 0.000 | -2.005 | -1.973 |
| ab_page | -0.0150 | 0.011 | -1.311 | 0.190 | -0.037 | 0.007 |
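
The coefficients are on the log-odds scale, so applying the inverse logit recovers the conversion rates from Part I. A minimal sketch, using the rounded coefficients from the summary above (`inv_logit` is a helper defined here for illustration):

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1 / (1 + math.exp(-x))

# Rounded coefficients from the summary table
intercept, ab_page = -1.9888, -0.0150

p_control = inv_logit(intercept)              # ab_page = 0
p_treatment = inv_logit(intercept + ab_page)  # ab_page = 1
print(round(p_control, 4), round(p_treatment, 4))
```

Up to rounding of the coefficients, these agree with the control and treatment conversion rates computed in Part I (about 0.1204 and 0.1188).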


e. What is the p-value associated with **ab_page**? Why does it differ from the value you found in **Part II**?<br><br>  

**The p-value is 0.190, which is above a reasonable threshold such as 0.05 and therefore indicates that we fail to reject the null hypothesis.**

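**This p-value differs from the one in Part II because the two tests use different hypotheses. In Part II, the test was one-sided, with null hypothesis** $$ H_{0}: p_{new} - p_{old} \leq 0 $$ **The logistic regression instead performs a two-sided test: its null is that the old and new webpages have the same probability of converting, and its alternative is that they differ in either direction. Halving the two-sided value and taking the upper tail, 1 - 0.190/2 = 0.905, recovers the one-sided p-value from Part II.**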
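
The relationship between the two p-values can be checked numerically from the `ab_page` z statistic. This sketch assumes the rounded value -1.311 from the summary table; `norm_cdf` is an illustrative helper, not a statsmodels function.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z_ab_page = -1.311  # z statistic for ab_page (rounded, from the summary table)

# Two-sided test (regression): H1 is p_new != p_old
two_sided = 2 * (1 - norm_cdf(abs(z_ab_page)))

# One-sided test (Part II): H1 is p_new > p_old, so take the upper tail
one_sided = 1 - norm_cdf(z_ab_page)

print(round(two_sided, 3), round(one_sided, 3))
```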

f. Now, you are considering other things that might influence whether or not an individual converts.  Discuss why it is a good idea to consider other factors to add into your regression model.  Are there any disadvantages to adding additional terms into your regression model?

**It would be advantageous to consider other factors in our regression model so that we can be more confident in our conclusion. Factors such as change aversion or the novelty effect can influence the actions of existing users. Additionally, other unrelated factors could be unexpectedly affecting our results, such as the gender of the user.**

**However, there can also be disadvantages to adding additional terms to the model. If predictors are strongly correlated with each other (multicollinearity), the coefficient estimates become unstable and hard to interpret. Each additional term also makes the model harder to interpret and increases the chance of introducing multicollinearity or influential outliers, which can again reduce the reliability of the model.**

g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. You will need to read in the **countries.csv** dataset and merge together your datasets on the appropriate rows.  [Here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) are the docs for joining tables. 

Does it appear that country had an impact on conversion?  Don't forget to create dummy variables for these country columns - **Hint: You will need two columns for the three dummy variables.** Provide the statistical output as well as a written response to answer this question.

In [47]:
countries_df = pd.read_csv('./countries.csv')
df_new = countries_df.set_index('user_id').join(df2.set_index('user_id'), how='inner')

In [48]:
### Create the necessary dummy variables
# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
df_new[['CA', 'UK', 'US']] = pd.get_dummies(df_new['country'], dtype=int)
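
To see why only two of the three dummy columns enter the model, note that the three columns always sum to 1, which duplicates the intercept. A minimal sketch on hypothetical rows (`toy` and `design` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from countries.csv
toy = pd.DataFrame({"country": ["US", "UK", "CA", "US"]})

# One indicator column per country, in alphabetical order
dummies = pd.get_dummies(toy["country"], dtype=int)

# Every row has exactly one 1, so the columns are redundant with an intercept
row_sums = dummies.sum(axis=1)

# Dropping the baseline (US here) leaves two columns for three categories
design = dummies.drop(columns="US")
print(design)
```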

In [50]:
df_new['CA_ab_page'] = df_new['CA'] * df_new['ab_page']
df_new['UK_ab_page'] = df_new['UK'] * df_new['ab_page']


h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion.  Create the necessary additional columns, and fit the new model.  


In [51]:
### Fit the logistic regression model and obtain the results
df_new['intercept'] = 1

# Baseline country is US
logit_mod = sm.Logit(df_new['converted'], df_new[['intercept', 'ab_page', 'CA', 'UK', 'CA_ab_page', 'UK_ab_page']])
results_new = logit_mod.fit()
results_new.summary()

Optimization terminated successfully.
         Current function value: 0.366109
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | converted | No. Observations: | 290584.0 |
| Model: | Logit | Df Residuals: | 290578.0 |
| Method: | MLE | Df Model: | 5.0 |
| Date: | Tue, 16 Nov 2021 | Pseudo R-squ.: | 3.482e-05 |
| Time: | 12:44:31 | Log-Likelihood: | -106390.0 |
| converged: | True | LL-Null: | -106390.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.192 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | -1.9865 | 0.010 | -206.344 | 0.000 | -2.005 | -1.968 |
| ab_page | -0.0206 | 0.014 | -1.505 | 0.132 | -0.047 | 0.006 |
| CA | -0.0175 | 0.038 | -0.465 | 0.642 | -0.091 | 0.056 |
| UK | -0.0057 | 0.019 | -0.306 | 0.760 | -0.043 | 0.031 |
| CA_ab_page | -0.0469 | 0.054 | -0.872 | 0.383 | -0.152 | 0.059 |
| UK_ab_page | 0.0314 | 0.027 | 1.181 | 0.238 | -0.021 | 0.084 |


In [59]:
# By what factor are the odds of converting lower on the new page (US users, the baseline country)?
np.reciprocal(np.exp(-0.0206))

1.020813644503746

In [52]:
# By what factor are the odds of converting lower in Canada than in the US (old page)?
np.reciprocal(np.exp(-0.0175))

1.0176540221507617

In [54]:
# By what factor are the odds of converting lower in the UK than in the US (old page)?
np.reciprocal(np.exp(-0.0057))

1.0057162759095335

In [55]:
# Interaction: by what factor is the new-page odds ratio lower in Canada than in the US?
np.reciprocal(np.exp(-0.0469))

1.048017202119183

In [57]:
# Interaction: by what factor is the new-page odds ratio higher in the UK than in the US?
np.exp(0.0314)

1.0318981806179213
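
Because the model contains interaction terms, the effect of the new page differs by country: it is the `ab_page` coefficient alone for the US baseline, plus the matching interaction coefficient for Canada or the UK. A minimal sketch, using the rounded coefficients from the summary table above (the dictionary and variable names are illustrative, not from the original notebook):

```python
import math

# Rounded coefficients from the interaction model summary
coef = {
    "ab_page": -0.0206,
    "CA_ab_page": -0.0469,
    "UK_ab_page": 0.0314,
}

# Odds ratio of converting on the new page vs. the old page, within each country
new_vs_old = {
    "US": math.exp(coef["ab_page"]),
    "CA": math.exp(coef["ab_page"] + coef["CA_ab_page"]),
    "UK": math.exp(coef["ab_page"] + coef["UK_ab_page"]),
}

for country, odds_ratio in new_vs_old.items():
    print(country, round(odds_ratio, 4))
```

An odds ratio below 1 means lower odds of converting on the new page. None of the underlying coefficients is statistically significant at the 0.05 level, so these differences should not be read as real effects.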

<a id='conclusions'></a>
## Conclusions



**The company should not transition to the new web page. We fail to reject the null hypothesis. There is no significant difference in conversion rates based on which page is shown, holding all other variables constant. There is also no significant difference in conversion rate based on user country. Because of this, I do not advise the company to invest the resources needed to convert the website to the new page.**

First, our A/B test resulted in a p-value of 0.9039. 

Then, we drew the same conclusion using a two-proportion z-test, which gave a p-value of 0.9051 and a z-score of 1.31, and using a logistic regression model, whose two-sided p-value for the page effect was 0.190. All of these support the conclusion that we fail to reject the null, because results like ours are highly likely under the null hypothesis.

Lastly, we performed another logistic regression including the country of the user. Overall, the country of the user did not have a significant impact on conversion rates.

In conclusion, all of the evidence suggests that the new webpage did not lead to higher user conversion as compared to the old webpage. 

### Resources


https://stackoverflow.com/questions/17383094/how-can-i-map-true-false-to-1-0-in-a-pandas-dataframe

https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_hypothesistest-means-proportions/bs704_hypothesistest-means-proportions3.html