## Analyze A/B Test Results

You may either submit your notebook through the workspace here, or you may work from your local machine and submit through the next page.  Either way, make sure that your code passes the project [RUBRIC](https://review.udacity.com/#!/projects/37e27304-ad47-4eb0-a1ab-8c12f60e43d0/rubric).  **Please save regularly.**

This project will ensure you have mastered the subjects covered in the statistics lessons.  The hope is for this project to cover these topics as comprehensively as possible.  Good luck!

## Table of Contents
- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)


<a id='intro'></a>
### Introduction

A/B tests are very commonly performed by data analysts and data scientists.  It is important that you get some practice working with the difficulties of these tests.

For this project, you will be working to understand the results of an A/B test run by an e-commerce website.  Your goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.

**As you work through this notebook, follow along in the classroom and answer the corresponding quiz questions associated with each question.** The labels for each classroom concept are provided for each question.  This will ensure you are on the right track as you work through the project, and you can feel more confident in your final submission meeting the criteria.  As a final check, make sure you meet all the criteria on the [RUBRIC](https://review.udacity.com/#!/projects/37e27304-ad47-4eb0-a1ab-8c12f60e43d0/rubric).

<a id='probability'></a>
### Part I - Probability

To get started, let's import our libraries.

In [None]:
import pandas as pd
import numpy as np
import random
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats
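# Older statsmodels releases call stats.chisqprob, which newer SciPy versions no longer provide; restore it here
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
%matplotlib inline
# We are setting the seed to ensure you get the same answers on quizzes as we set up
random.seed(42)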

`1.` Now, read in the `ab_data.csv` data. Store it in `df`.  **Use your dataframe to answer the questions in Quiz 1 of the classroom.**

a. Read in the dataset and take a look at the top few rows here:

In [None]:
# Read ab_data.csv as df and view first few rows
df = pd.read_csv('ab_data.csv')
df.head()

b. Use the cell below to find the number of rows in the dataset.

In [None]:
# Use shape to get number of rows in df
df.shape[0]

c. The number of unique users in the dataset.

In [None]:
# Get number of unique users in df
df['user_id'].nunique()

d. The proportion of users converted.

In [None]:
# Calculate proportion of users converted
converted_prop = df.converted.mean()
converted_prop

e. The number of times the `new_page` and `treatment` don't match.

In [None]:
# Count treatment rows that did not land on new_page
no_match = df.query('group == "treatment" and landing_page != "new_page"').shape[0]
no_match

In [None]:
# Count control rows that landed on new_page
no_match2 = df.query('group == "control" and landing_page == "new_page"').shape[0]
no_match2

In [None]:
# Total number of times new_page and treatment don't match
total_no_match = no_match + no_match2
total_no_match
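
The two counts above can also be obtained in one step by comparing two boolean columns. The following is a minimal sketch on a small made-up DataFrame (the name `toy` and its rows are hypothetical and only mirror the column names of `ab_data.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for ab_data.csv (same column names)
toy = pd.DataFrame({
    "group": ["treatment", "treatment", "control", "control", "treatment"],
    "landing_page": ["new_page", "old_page", "old_page", "new_page", "new_page"],
})

# A row is consistent when "is treatment" and "is new_page" agree
is_treatment = toy["group"] == "treatment"
is_new_page = toy["landing_page"] == "new_page"

# Rows where the two flags disagree are the mismatches
mismatches = int((is_treatment != is_new_page).sum())
print(mismatches)
```

Comparing the flags directly covers both mismatch directions at once, so no second query is needed.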

f. Do any of the rows have missing values?

In [None]:
# Finds missing values for each column
df.isnull().sum()

`2.` For the rows where **treatment** does not match with **new_page** or **control** does not match with **old_page**, we cannot be sure if this row truly received the new or old page.  Use **Quiz 2** in the classroom to figure out how we should handle these rows.  

a. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz.  Store your new dataframe in **df2**.

In [None]:
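# Create df2 from the rows where group and landing_page agree (copy so later edits do not touch a view of df)
df2 = df.query('(group == "treatment" and landing_page == "new_page") or (group == "control" and landing_page == "old_page")').copy()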
df2.head()

In [None]:
# Double check that all of the mismatched rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

`3.` Use **df2** and the cells below to answer questions for **Quiz3** in the classroom.

a. How many unique **user_id**s are in **df2**?

In [None]:
# Find the number of unique user_ids in the new dataframe df2
unique_ids = df2['user_id'].nunique()
unique_ids

b. There is one **user_id** repeated in **df2**.  What is it?

In [None]:
# Find the repeated user_id in df2
duplicated_user = df2.loc[df2.duplicated('user_id'), 'user_id']
duplicated_user

c. What is the row information for the repeat **user_id**? 

In [None]:
# Get row information for duplicate user in df2
duplicated_user = df2[df2.duplicated('user_id')]
duplicated_user

d. Remove **one** of the rows with a duplicate **user_id**, but keep your dataframe as **df2**.

In [None]:
# Drop the repeated user_id row, keeping its first occurrence
df2 = df2.drop_duplicates(subset='user_id', keep='first')

In [None]:
# Check for duplicate user_ids to confirm the repeat has been removed - this should be 0
df2.duplicated('user_id').sum()
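
As an illustration of how `duplicated` and `drop_duplicates` behave when keyed on a single column, here is a small sketch on made-up data (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one repeated user_id
toy = pd.DataFrame({
    "user_id": [101, 102, 102, 103],
    "converted": [0, 1, 0, 0],
})

# duplicated() flags the second and later occurrences by default
repeat_ids = toy.loc[toy["user_id"].duplicated(), "user_id"].tolist()

# keep=False flags every row that shares a repeated user_id
all_repeat_rows = toy[toy["user_id"].duplicated(keep=False)]

# keep='first' retains the first occurrence and drops the later one
deduped = toy.drop_duplicates(subset="user_id", keep="first")
remaining = int(deduped["user_id"].duplicated().sum())
print(repeat_ids, len(all_repeat_rows), len(deduped), remaining)
```

Checking duplicates on `user_id` rather than on whole rows matters here, because two rows for the same user can differ in other columns such as the timestamp.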

`4.` Use **df2** in the cells below to answer the quiz questions related to **Quiz 4** in the classroom.

a. What is the probability of an individual converting regardless of the page they receive?

In [None]:
# Get probability of conversion for all pages
converted_prob = df2.converted.mean()
converted_prob

b. Given that an individual was in the `control` group, what is the probability they converted?

In [None]:
# Get probability of conversion in control group
converted_control = df2.query('group == "control"').converted.mean()
converted_control

c. Given that an individual was in the `treatment` group, what is the probability they converted?

In [None]:
# Get probability of conversion in treatment group
converted_treatment = df2.query('group == "treatment"').converted.mean()
converted_treatment

In [None]:
# Calculate observed difference
obs_diff = converted_treatment - converted_control

d. What is the probability that an individual received the new page?

In [None]:
# Get probability an individual received new_page
new_page_prob = (df2['landing_page'] == 'new_page').mean()
new_page_prob
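
The conditional conversion rates in parts (b) and (c) can also be computed in one pass with `groupby`. A minimal sketch, assuming a made-up frame `toy` with the same `group` and `converted` columns:

```python
import pandas as pd

# Hypothetical toy data: 4 control users and 4 treatment users
toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column within each group is that group's conversion rate
rates = toy.groupby("group")["converted"].mean()

# Difference in conversion rates, treatment minus control
toy_diff = rates["treatment"] - rates["control"]
print(rates)
print(toy_diff)
```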

e. Consider your results from parts (a) through (d) above, and explain below whether you think there is sufficient evidence to conclude that the new treatment page leads to more conversions.


<div class="alert alert-block alert-info">
<b>The following were observed from the results:</b>
    
1. The probability of conversion for the control group, which was shown the old page, is 0.1204.
2. The probability of conversion for the treatment group, which was shown the new page, is 0.1188.
3. The overall probability of conversion is approximately 0.1197.

There isn't sufficient evidence to conclude that the new page leads to more conversions. The old page shows a slightly higher conversion probability than the new one (a difference of about 0.0016), and these raw proportions alone do not tell us whether that difference is statistically significant.</div> 

<a id='ab_test'></a>
### Part II - A/B Test

Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each event was observed.  

However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time?  How long do you run to render a decision that neither page is better than another?  

These questions are the difficult parts associated with A/B tests in general.  


`1.` For now, consider you need to make the decision just based on all the data provided.  If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be?  You can state your hypothesis in terms of words or in terms of **$p_{old}$** and **$p_{new}$**, which are the converted rates for the old and new pages.

**Answer:**

- Null hypothesis: $H_0: p_{new} - p_{old} \leq 0$ (the new page converts no better than the old page)
- Alternative hypothesis: $H_1: p_{new} - p_{old} > 0$ (the new page converts better than the old page)

`2.` Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the **converted** success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the **converted** rate in **ab_data.csv** regardless of the page. <br><br>

Use a sample size for each page equal to the ones in **ab_data.csv**.  <br><br>

Simulate the sampling distribution for the difference in **converted** between the two pages over 10,000 iterations of calculating an estimate from the null.  <br><br>

Use the cells below to provide the necessary parts of this simulation.  If this doesn't make complete sense right now, don't worry - you are going to work through the problems below to complete this problem.  You can use **Quiz 5** in the classroom to make sure you are on the right track.<br><br>

In [None]:
# Since p.new and p.old are equal
prop = df.converted.mean()
prop

a. What is the **conversion rate** for $p_{new}$ under the null? 

In [None]:
# Get conversion rate under the null
p_new = df2.converted.mean()
p_new

In [None]:
df2.head()

b. What is the **conversion rate** for $p_{old}$ under the null? <br><br>

In [None]:
#  Get conversion rate under the null
p_old = df2.converted.mean()
p_old

c. What is $n_{new}$, the number of individuals in the treatment group?

In [None]:
# Get the number of individuals in the treatment group
n_new = df2.query('group == "treatment"').shape[0]
n_new

d. What is $n_{old}$, the number of individuals in the control group?

In [None]:
# Get the number of individuals in the control group
n_old = df2.query('group == "control"').shape[0]
n_old

e. Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null.  Store these $n_{new}$ 1's and 0's in **new_page_converted**.

In [None]:
# Simulate n.new transactions and get mean
new_page_converted = np.random.binomial(1, p_new, n_new)
npc_mean = new_page_converted.mean()
npc_mean

f. Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null.  Store these $n_{old}$ 1's and 0's in **old_page_converted**.

In [None]:
# Simulate n.old transactions and get mean
old_page_converted = np.random.binomial(1, p_old, n_old)
opc_mean = old_page_converted.mean()
opc_mean

g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f).

In [None]:
# Get simulated difference
s_diff = npc_mean - opc_mean
s_diff

h. Create 10,000 $p_{new}$ - $p_{old}$ values using the same simulation process you used in parts (a) through (g) above. Store all 10,000 values in a NumPy array called **p_diffs**.

In [None]:
# Simulate 10000 differences between p_new and p_old under the null
p_diffs = []
for _ in range(10000):
    new_page_converted = np.random.binomial(1, p_new, n_new)
    npc_mean = new_page_converted.mean()
    old_page_converted = np.random.binomial(1, p_old, n_old)
    opc_mean = old_page_converted.mean()
    p_diffs.append(npc_mean - opc_mean)

In [None]:
p_diffs = np.array(p_diffs)
p_diffs.mean()

In [None]:
p_diffs.std()
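
The loop above draws every individual 0/1 outcome. Because only the group means are used, the same null distribution can be sketched by drawing the number of conversions per group directly. The values `p_null`, `n_new_toy`, and `n_old_toy` below are hypothetical placeholders; in the notebook they would come from **df2**:

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed, not the notebook's

# Hypothetical values standing in for p_new/p_old, n_new and n_old
p_null = 0.12
n_new_toy, n_old_toy = 5000, 5000

# One binomial draw per iteration gives the number of conversions in a group,
# so dividing by the group size yields a simulated conversion rate
new_rates = rng.binomial(n_new_toy, p_null, size=10000) / n_new_toy
old_rates = rng.binomial(n_old_toy, p_null, size=10000) / n_old_toy

# 10,000 simulated differences under the null
toy_p_diffs = new_rates - old_rates
print(toy_p_diffs.mean(), toy_p_diffs.std())
```

This is a design choice about speed, not about statistics: the sum of `n` independent 0/1 draws with rate `p` follows a binomial distribution with parameters `n` and `p`, so both approaches sample from the same distribution.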

i. Plot a histogram of the **p_diffs**.  Does this plot look like what you expected?  Use the matching problem in the classroom to ensure you fully understand what was computed here.

In [None]:
plt.hist(p_diffs);

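The histogram is approximately normal and centered near zero. I expected this because, by the central limit theorem, the sampling distribution of a difference in sample proportions is approximately normal when the samples are large, and under the null the true difference is zero.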
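
The expected shape can also be written down directly. As a sketch, assuming both pages share a common conversion rate $p$ under the null and that users convert independently (the symbol $p$ is introduced here for illustration):

$$
\hat{p}_{new} - \hat{p}_{old} \;\approx\; \mathcal{N}\!\left(0,\; p(1-p)\left(\frac{1}{n_{new}} + \frac{1}{n_{old}}\right)\right)
$$

Under these assumptions the standard deviation of **p_diffs** should be close to $\sqrt{p(1-p)\left(1/n_{new} + 1/n_{old}\right)}$.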

j. What proportion of the **p_diffs** are greater than the actual difference observed in **ab_data.csv**?

In [None]:
obs_diff = converted_treatment - converted_control
obs_diff

In [None]:
# compute the p value
pvalue = (p_diffs > obs_diff).mean()
pvalue

In [None]:
# plotting  line for observed statistic
plt.hist(p_diffs, alpha=.5)
plt.axvline(x=obs_diff, color='red');

k. Please explain using the vocabulary you've learned in this course what you just computed in part **j.**  What is this value called in scientific studies?  What does this value mean in terms of whether or not there is a difference between the new and old pages?

<div class="alert alert-block alert-info">
<b>The following were observed from the results:</b>

1. The value computed in part j is the p-value: the probability of observing our statistic, or one more extreme in favor of the alternative, given that the null hypothesis is true.
2. The p-value is 0.9068.
3. It is the proportion of simulated differences under the null that are greater than the observed difference, that is, the actual difference in the dataset between the treatment group conversion rate and the control group conversion rate.
4. The p-value of 0.9068 is greater than the Type I error threshold of 0.05, so the result is not significant and we fail to reject the null hypothesis.
5. My conclusion is that there is no evidence that the new page converts better than the old page.</div>
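
To illustrate how a one-sided p-value is read off a null distribution, the sketch below uses made-up numbers: the null spread `0.0012` and the observed difference `-0.0016` are hypothetical stand-ins, not values computed from the dataset. It compares the simulated proportion with the normal-approximation tail probability:

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

# Hypothetical null distribution of differences, centered at zero
null_sd = 0.0012
null_diffs = rng.normal(loc=0.0, scale=null_sd, size=10000)

# Hypothetical observed difference (new rate minus old rate)
observed = -0.0016

# Simulated p-value: share of null differences greater than the observed one
sim_p = float((null_diffs > observed).mean())

# Normal approximation: P(Z > z) with z = observed / null_sd
z = observed / null_sd
normal_p = 0.5 * math.erfc(z / math.sqrt(2))
print(sim_p, normal_p)
```

When the observed difference is negative and the alternative is "greater", most of the null distribution lies to its right, which is why the p-value is large.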

l. We could also use a built-in to achieve similar results.  Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Fill in the below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let `n_old` and `n_new` refer to the number of rows associated with the old page and new pages, respectively.

m. Now use `sm.stats.proportions_ztest` to compute your test statistic and p-value.  [Here](https://docs.w3cub.com/statsmodels/generated/statsmodels.stats.proportion.proportions_ztest/) is a helpful link on using the built-in.

In [None]:
treatment_df = df2.query('group == "treatment"')


In [None]:
control_df = df2.query('group == "control"')


In [None]:
old_conversion = df2.query('group == "control" and converted == 1').shape[0]
new_conversion = df2.query('group == "treatment" and converted == 1').shape[0]
count_old = control_df.shape[0]
count_new = treatment_df.shape[0]
old_conversion, new_conversion, count_old, count_new

In [None]:
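# One-sided z-test: with the old page listed first, alternative='smaller' tests p_old < p_new, i.e. p_new > p_old
z_score, p_value = sm.stats.proportions_ztest([old_conversion, new_conversion], [n_old, n_new], alternative='smaller')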
z_score, p_value
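
For intuition, the calculation behind a pooled two-proportion z-test with a "smaller" alternative can be sketched by hand. The function name `two_proportion_ztest_smaller` and the counts passed to it are hypothetical; this is an illustration of the formula, not a replacement for the statsmodels call:

```python
import math

def two_proportion_ztest_smaller(count1, n1, count2, n2):
    """Pooled z-test of H1: p1 < p2 (first sample minus second)."""
    p1, p2 = count1 / n1, count2 / n2
    # Pooled conversion rate under the null that both rates are equal
    pooled = (count1 + count2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Lower-tail probability P(Z <= z) of the standard normal
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 1200 of 10000 (old page) vs 1180 of 10000 (new page)
z_toy, p_toy = two_proportion_ztest_smaller(1200, 10000, 1180, 10000)
# Identical samples give z = 0 and a p-value of 0.5
z_equal, p_equal = two_proportion_ztest_smaller(1200, 10000, 1200, 10000)
print(z_toy, p_toy)
```

A positive z-score here means the first sample converted at a higher rate than the second, which is evidence against the "smaller" alternative and therefore gives a p-value above 0.5.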

n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages?  Do they agree with the findings in parts **j.** and **k.**?

<div class="alert alert-block alert-info">
<b>The following were observed from the results:</b>

1. A z-score measures how far a value lies from the mean of a distribution, in units of standard deviations (https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/z-score/).
2. The z-score here is 1.3109: the observed difference in conversion rates (old page minus new page) lies 1.3109 standard errors above zero, the difference expected under the null.
3. For this one-sided test at the 5% level, we would reject the null only if the z-score fell below about -1.645, so a z-score of 1.3109 is well within the range of values expected under the null.
4. The p-value of 0.9051 is greater than the Type I error rate of 0.05, so we fail to reject the null hypothesis.
5. The p-value from the preceding simulated sampling distribution is 0.9068, while the p-value in this section is 0.9051; they are almost the same. The z-test agrees with the findings in parts j and k.</div>

<a id='regression'></a>
### Part III - A regression approach

`1.` In this final part, you will see that the result you achieved in the A/B test in Part II above can also be achieved by performing regression.<br><br> 

a. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?

**Answer:** Logistic regression, because the response (converted or not converted) is binary.

b. The goal is to use **statsmodels** to fit the regression model you specified in part **a.** to see if there is a significant difference in conversion based on which page a customer receives. However, you first need to create in df2 a column for the intercept, and create a dummy variable column for which page each user received.  Add an **intercept** column, as well as an **ab_page** column, which is 1 when an individual receives the **treatment** and 0 if **control**.

In [None]:
# Get first few rows of df2
df2.head()

In [None]:
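# Create dummy variables for landing page and group; get_dummies orders the new columns alphabetically,
# so ab_page (from new_page) is 1 when the user received the new page
df2[['ab_page', 'ab_page2']] = pd.get_dummies(df2['landing_page'], dtype=int)
df2[['control', 'treatment']] = pd.get_dummies(df2['group'], dtype=int)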
df2.head()

In [None]:
# Drop the columns that are not needed for the regression
df3 = df2.drop(['timestamp', 'group', 'landing_page', 'ab_page2', 'treatment', 'control'], axis=1)
df3.head()

c. Use **statsmodels** to instantiate your regression model on the two columns you created in part b., then fit the model using the two columns you created in part **b.** to predict whether or not an individual converts. 

In [None]:
# Instantiate regression model and fit to predict conversion
df3['intercept'] = 1
log_mod1 = sm.Logit(df3['converted'], df3[['intercept', 'ab_page']])
results1 = log_mod1.fit()


d. Provide the summary of your model below, and use it as necessary to answer the following questions.

In [None]:
# Get summary of regression model
results1.summary()
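
With an intercept and a single 0/1 predictor, the fitted logistic coefficient equals the log of the odds ratio between the two groups. As a rough cross-check, the sketch below computes that log odds ratio from the rounded conversion rates reported in Part I (0.1204 for the old page, 0.1188 for the new page); because those rates are rounded, the result is only expected to be close to the **ab_page** coefficient in the summary, not identical:

```python
import math

# Rounded conversion rates reported in Part I
p_control = 0.1204    # old page
p_treatment = 0.1188  # new page

# Odds = probability of converting divided by probability of not converting
odds_control = p_control / (1 - p_control)
odds_treatment = p_treatment / (1 - p_treatment)

# Odds ratio (treatment vs control) and its natural log
odds_ratio = odds_treatment / odds_control
log_odds_ratio = math.log(odds_ratio)
print(round(odds_ratio, 4), round(log_odds_ratio, 4))
```

An odds ratio slightly below 1 corresponds to a small negative coefficient, meaning the odds of converting are marginally lower on the new page.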

e. What is the p-value associated with **ab_page**? Why does it differ from the value you found in **Part II**?<br><br>  **Hint**: What are the null and alternative hypotheses associated with your regression model, and how do they compare to the null and alternative hypotheses in **Part II**?

1. The p-value associated with ab_page is 0.190.
2. The p-values found in Part II are 0.9068 and 0.9051.
3. The p-values from Part II differ from the one in the regression model because Part II used a one-tailed test, while the regression model uses a two-tailed test (https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-the-differences-between-one-tailed-and-two-tailed-tests/).
4. A one-tailed test looks for a relationship in one direction only and disregards the possibility of a relationship in the other direction. In Part II the alternative hypothesis was that the new page has more conversions than the old page: $H_0: p_{new} - p_{old} \leq 0$ versus $H_1: p_{new} - p_{old} > 0$.
5. In the regression model, the test on the ab_page coefficient has the null hypothesis that the coefficient is zero (no difference in conversion between the pages) and the alternative that it is not zero. A two-tailed test allots half of alpha to each direction, so it tests both for the new page having more conversions and for it having fewer conversions than the old page.
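
The two kinds of p-value can be related with simple arithmetic. As a sketch, assuming the test statistic is symmetric around zero (which holds approximately for the z-statistics used here), the two-sided p-value is twice the smaller tail of the one-sided test:

```python
# One-sided p-value reported by the z-test in Part II
one_sided_p = 0.9051

# Two-sided p-value: twice the smaller of the two tail probabilities
two_sided_p = 2 * min(one_sided_p, 1 - one_sided_p)
print(round(two_sided_p, 4))
```

The result is close to the 0.190 reported for **ab_page**, which suggests the two analyses are measuring the same effect and differ mainly in how the hypotheses are framed.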


f. Now, you are considering other things that might influence whether or not an individual converts.  Discuss why it is a good idea to consider other factors to add into your regression model.  Are there any disadvantages to adding additional terms into your regression model?

<div class="alert alert-block alert-info">
<b></b>
Regression models serve two main purposes:
    
- planning and control 
- prediction or forecasting. 

The principal advantage of adding other factors to a regression model is that it uses more of the available information to estimate the dependent variable. In reality, a variable is usually affected by many other variables. For example, it was mentioned earlier that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each observation was observed. By looking at the time stamp of each visit and whether a conversion occurred, we can look for a possible correlation.

Looking at only a section of the data without considering the other parts can also lead to skewed or unfair conclusions. This is illustrated by Simpson's paradox, a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined (https://en.wikipedia.org/wiki/Simpson%27s_paradox). For example, when we are able to study a variety of variables in the data, we can uncover the role timing and country play in the conversion of either page. More people might convert at different times of the day for one page than for the other.

Simpson’s paradox is important for three critical reasons. First, people often expect statistical relationships to be immutable. They often are not. The relationship between two variables might increase, decrease, or even change direction depending on the set of variables being controlled. Second, Simpson’s paradox is not simply an obscure phenomenon of interest only to a small group of statisticians. Simpson’s paradox is actually one of a large class of association paradoxes. Third, Simpson’s paradox reminds researchers that causal inferences, particularly in nonexperimental studies, can be hazardous. Uncontrolled and even unobserved variables that would eliminate or reverse the association observed between two variables might exist (https://www.britannica.com/topic/Simpsons-paradox).

Likely disadvantages of adding variables:
1. Adding variables might introduce multicollinearity, which occurs when there are high correlations among predictor variables, leading to unreliable and unstable estimates of regression coefficients (https://statisticalhorizons.com/multicollinearity).
2. Another likely disadvantage is covariance among the added predictors. Covariance is a measure of how changes in one variable are associated with changes in a second variable; specifically, it measures the degree to which two variables are linearly associated (https://medium.com/@thecodingcookie/covariance-correlation-def860c4d4ab). When predictors vary together, it becomes difficult to separate their individual effects, which makes the results harder to interpret.</div>




g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. You will need to read in the **countries.csv** dataset and merge together your datasets on the appropriate rows.  [Here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) are the docs for joining tables. 

Does it appear that country had an impact on conversion?  Don't forget to create dummy variables for these country columns - **Hint: You will need two columns for the three dummy variables.** Provide the statistical output as well as a written response to answer this question.

In [None]:
# Read countries dataframe and view first few rows
countries_df = pd.read_csv('countries.csv')
countries_df.head()

In [None]:
# Merge df2 and countries_df on user_id
merged_df = df2.merge(countries_df, on='user_id', how='inner')
merged_df.head()
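
When joining two tables that should have exactly one row per user, it can help to state the key explicitly and let pandas verify the relationship. A minimal sketch on made-up frames (`users` and `countries` are hypothetical names and values):

```python
import pandas as pd

# Hypothetical toy tables keyed on user_id
users = pd.DataFrame({"user_id": [1, 2, 3], "converted": [0, 1, 0]})
countries = pd.DataFrame({"user_id": [3, 1, 2], "country": ["UK", "US", "CA"]})

# validate="one_to_one" raises an error if either side repeats a user_id
merged = users.merge(countries, on="user_id", how="inner", validate="one_to_one")
print(merged)
```

The row order of the right-hand table does not matter, since rows are matched on the key rather than on position.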

In [None]:
# Create dummies for country column (get_dummies orders the columns alphabetically: CA, UK, US)
merged_df[['CA', 'UK', 'US']] = pd.get_dummies(merged_df['country'], dtype=int)
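
The hint about needing only two columns for three countries can be seen on a tiny example. In the sketch below (the frame `toy` is made up), dropping one dummy column makes that country the baseline, represented by zeros in the remaining columns:

```python
import pandas as pd

# Hypothetical toy data with the three country codes
toy = pd.DataFrame({"country": ["US", "UK", "CA", "US"]})

# One 0/1 column per country, ordered alphabetically
dummies = pd.get_dummies(toy["country"], dtype=int)

# Dropping CA makes it the baseline: a CA row is all zeros in what remains
reduced = dummies.drop(columns="CA")
print(reduced)
```

Keeping all three columns alongside an intercept would make the columns sum to the intercept column, so the model could not separate their effects.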

In [None]:
# View first few rows of merged df
merged_df.head(1)

In [None]:
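# Drop columns not needed for the regression; CA is dropped so that it serves as the baseline country
merged_df = merged_df.drop(['group', 'landing_page', 'ab_page2', 'treatment', 'control', 'country', 'CA'], axis=1)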


In [None]:
merged_df.head(1)

In [None]:
# Instantiate regression model and fit to predict conversion
merged_df['intercept'] = 1
log_mod2 = sm.Logit(merged_df['converted'], merged_df[['intercept', 'UK', 'US']])
results2 = log_mod2.fit()

In [None]:
# Get summary of results
results2.summary()

h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion.  Create the necessary additional columns, and fit the new model.  

Provide the summary results, and your conclusions based on the results.

In [None]:
merged_df.head()

In [None]:
# Fit a model with page and country as separate (additive) effects
merged_df['intercept'] = 1
log_mod3 = sm.Logit(merged_df['converted'], merged_df[['intercept', 'ab_page', 'UK', 'US']])
results3 = log_mod3.fit()

In [None]:
# Get results summary
results3.summary()

In [None]:
# Exponentiate the coefficients from the summary above so they can be read as multiplicative changes in the odds of converting
np.exp(-0.0149), np.exp(0.0506), np.exp(0.0408)

In [None]:
# For the negative coefficient the exponentiated value is below 1; its reciprocal is easier to explain
1/np.exp(-0.0149)

In [None]:
# Create interaction columns between country and page
merged_df['UK_ab_page'] = merged_df['UK'] * merged_df['ab_page']
merged_df['US_ab_page'] = merged_df['US'] * merged_df['ab_page']
merged_df = merged_df.drop('intercept', axis=1)
merged_df.head()

In [None]:
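# Spearman correlations between the columns, leaving out the non-numeric timestamp
merged_df.drop(columns='timestamp').corr(method='spearman')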
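
Part (h) asks for a model that includes the interaction columns. The sketch below shows what such a fit estimates, using made-up conversion counts (100 hypothetical users per country and page) and a plain NumPy Newton-Raphson loop as a stand-in for `sm.Logit(...).fit()`. It is an illustration of the technique under these assumptions, not the notebook's result:

```python
import numpy as np
import pandas as pd

# Hypothetical conversions out of 100 users per (country, ab_page) cell
cells = [("CA", 0, 12), ("CA", 1, 11), ("UK", 0, 13),
         ("UK", 1, 12), ("US", 0, 10), ("US", 1, 14)]
rows = []
for country, ab_page, n_converted in cells:
    for i in range(100):
        rows.append({"country": country, "ab_page": ab_page,
                     "converted": int(i < n_converted)})
toy = pd.DataFrame(rows)

# Same column construction as in the notebook, with CA as the baseline
toy["intercept"] = 1
toy["UK"] = (toy["country"] == "UK").astype(int)
toy["US"] = (toy["country"] == "US").astype(int)
toy["UK_ab_page"] = toy["UK"] * toy["ab_page"]
toy["US_ab_page"] = toy["US"] * toy["ab_page"]

features = ["intercept", "ab_page", "UK", "US", "UK_ab_page", "US_ab_page"]
X = toy[features].to_numpy(dtype=float)
y = toy["converted"].to_numpy(dtype=float)

# Newton-Raphson on the logistic log-likelihood
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    gradient = X.T @ (y - p)
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    beta = beta + np.linalg.solve(hessian, gradient)

coefs = pd.Series(beta, index=features)
print(coefs)
```

Each interaction coefficient measures how much the page effect (on the log-odds scale) in that country differs from the page effect in the baseline country. In the notebook, the same list of columns could presumably be passed to `sm.Logit` in the way `log_mod3` was built, after adding the `intercept` column back.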

<a id='conclusions'></a>
## Finishing Up

> Congratulations!  You have reached the end of the A/B Test Results project!  You should be very proud of all you have accomplished!

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it satisfies all the areas of the rubric (found on the project submission page at the end of the lesson). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.


## Directions to Submit

> Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [None]:
#from subprocess import call
#call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])