

## Project description
    
This project applies the subjects covered in the statistics lessons. The hope is to make it as comprehensive of these topics as possible.
    
## Table of Contents
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)
    
## Project purpose
   
We will be working to understand the results of an A/B test run by an e-commerce website.  Our goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.
 

In [None]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
# Set the seeds so that the results are reproducible
random.seed(42)
np.random.seed(42)  # the simulations below draw from np.random, not random

## Part I - Probability


    
#### 1.A. Read in the dataset and take a look at the top few rows here:

In [None]:
user=pd.read_csv('ab_data.csv')
display(user.head())


    
#### 1.B. Use the below cell to find the number of rows in the dataset.

In [None]:
print('number of rows:',user.shape[0])


    
#### 1.C. The number of unique users in the dataset.

In [None]:
print('number of unique users:',user['user_id'].nunique())


    
#### 1.D. The proportion of users converted.

In [None]:
proportion = (user.query('converted ==1')['user_id'].nunique())/(user['user_id'].nunique())
print(proportion)


    
#### 1.E. The number of times the new_page and treatment don't line up.

In [None]:
mismatch= user.query('(group== "treatment") != (landing_page== "new_page")')
print('number of times the new_page and treatment do not match:',mismatch.shape[0])


    
#### 1.F. Do any of the rows have missing values?

In [None]:
display(user.isnull().sum())


#### 2. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz. Store your new dataframe in user_2.

In [None]:
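# Keep only rows where group and landing page line up; copy so later column assignments are safe
user_2 = user.query('((group == "control") & (landing_page == "old_page")) | '
                    '((group == "treatment") & (landing_page == "new_page"))').copy()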
print(user_2.shape[0])


    
#### 3.a. How many unique user_ids are in user_2?

In [None]:
print('number of unique users:',user_2['user_id'].nunique())



#### 3.b. There is one user_id repeated in user_2. What is it?

In [None]:
user_2['is_duplicated'] = user_2.duplicated(['user_id'])
user_2['is_duplicated'].value_counts()
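
The value counts above only confirm that one row is flagged. To read off the repeated ID itself, one option is to filter on the duplicated mask. The following is a minimal sketch on a made-up frame; `toy` and its IDs are hypothetical and not taken from `ab_data.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for user_2; the IDs are made up.
toy = pd.DataFrame({
    "user_id": [101, 102, 103, 102],
    "converted": [0, 1, 0, 1],
})

# keep=False flags every occurrence of a repeated user_id, not just the later ones.
repeated_rows = toy[toy["user_id"].duplicated(keep=False)]
repeated_ids = repeated_rows["user_id"].unique().tolist()
print(repeated_ids)
```

Applied to `user_2`, the same filter would show both rows that share the repeated `user_id`.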



#### 3.c. What is the row information for the repeat user_id?

In [None]:
user_2_dup = user_2.loc[user_2['is_duplicated'] == True]
display(user_2_dup)



#### 3.d. Remove one of the rows with a duplicate user_id, but keep your dataframe as user_2.

In [None]:
user_2.drop_duplicates("user_id", inplace=True)
user_2.head()



#### 4.a. What is the probability of an individual converting regardless of the page they receive?

In [None]:
# since values are 1 and 0, we can calculate mean to get probability of an individual converting 
individual_probability = user_2['converted'].mean()
print('individual_probability:', individual_probability)


#### 4.b. Given that an individual was in the control group, what is the probability they converted?
#### 4.c. Given that an individual was in the treatment group, what is the probability they converted?

In [None]:
user_2_grp = user_2.groupby('group')
display(user_2_grp.describe())

1. Given that an individual was in the control group, the probability they converted is 0.120399
1. Given that an individual was in the treatment group, the probability they converted is 0.118920
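
The two figures above are the `mean` entries for `converted` in the `describe()` table. A more direct way to get the same conditional probabilities is to take the mean of `converted` within each group. This is a small sketch on hypothetical data (the values in `toy` are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data; group labels mirror the notebook, values are made up.
toy = pd.DataFrame({
    "group": ["control", "control", "control", "control",
              "treatment", "treatment", "treatment", "treatment"],
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column within each group is P(converted | group).
conv_by_group = toy.groupby("group")["converted"].mean()
print(conv_by_group)
```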



#### 4.d. What is the probability that an individual received the new page?

In [None]:
print((user_2['landing_page'].value_counts())/(user_2.shape[0]))



#### 4.e. Consider your results from a. through d. above, and explain below whether you think there is sufficient evidence to say that the new treatment page leads to more conversions.

##### No. The treatment group has a lower conversion probability than the control group, so there is no evidence that the new treatment page leads to more conversions.



### Part II - A/B Test



#### 1. For now, consider we need to make the decision just based on all the data provided. If we want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should our null and alternative hypotheses be? You can state your hypothesis in terms of words or in terms of $p_{old}$ and $p_{new}$, which are the converted rates for the old and new pages.

-  **Hypothesis**

  $$H_0: p_{new} \leq p_{old}$$

  $$H_1: p_{new} > p_{old}$$



#### 2. Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the converted success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the converted rate in ab_data.csv regardless of the page.


#### Use a sample size for each page equal to the ones in ab_data.csv.


#### Perform the sampling distribution for the difference in converted between the two pages over 10,000 iterations of calculating an estimate from the null.



#### 2. a.  What is the **conversion rate** for $p_{new}$ under the null? 

In [None]:
ab_df=pd.read_csv('ab_data.csv')
display(ab_df.head())

p_new = ab_df['converted'].mean()
print(p_new)



#### 2.b. What is the **conversion rate** for $p_{old}$ under the null? 

In [None]:
p_old = ab_df['converted'].mean()
print(p_old)



#### 2.c. What is $n_{new}$, the number of individuals in the treatment group?

In [None]:
n_new = len(ab_df.query("group == 'treatment'"))
print(n_new)


#### 2.d. What is $n_{old}$, the number of individuals in the control group?

In [None]:
n_old = len(ab_df.query("group == 'control'"))
print(n_old)



#### 2.e. Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null.  Store these $n_{new}$ 1's and 0's in **new_page_converted**.

In [None]:
# Note: this stores the observed new-page conversions (NaN on other rows), not simulated draws
ab_df['new_page_converted'] = ab_df.query('landing_page == "new_page"').converted


#### 2.f. Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null.  Store these $n_{old}$ 1's and 0's in **old_page_converted**.

In [None]:
# Note: this stores the observed old-page conversions (NaN on other rows), not simulated draws
ab_df['old_page_converted'] = ab_df.query('landing_page == "old_page"').converted

In [None]:
display(ab_df.head())



#### 2.g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f).

In [None]:
diff_new = ab_df['new_page_converted'].mean() - ab_df['old_page_converted'].mean()
display(diff_new)
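
Parts 2.e to 2.g ask for draws simulated under the null, whereas the cells above store the observed conversions (their difference is used later as the observed statistic). For comparison, a single simulated draw could look like the sketch below. The rate and sample sizes are assumed, illustrative values rather than the ones from `ab_data.csv`, and the `_demo` and `_sim` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator so the draw is reproducible

# Assumed, illustrative values; in the notebook these would be p_new, p_old, n_new, n_old.
p_null = 0.12
n_new_demo, n_old_demo = 5000, 5000

# One Bernoulli (0/1) outcome per simulated visitor, both at the shared null rate.
new_page_sim = rng.binomial(1, p_null, size=n_new_demo)
old_page_sim = rng.binomial(1, p_null, size=n_old_demo)

# Difference in simulated conversion rates for this single draw.
sim_diff = new_page_sim.mean() - old_page_sim.mean()
print(sim_diff)
```

Repeating this draw many times gives the sampling distribution built in part 2.h.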



#### 2.h. Simulate 10,000 $p_{new}$ - $p_{old}$ values using this same process similarly to the one you calculated in parts **a. through g.** above.  Store all 10,000 values in a numpy array called **p_diffs**.

In [None]:
# Draw 10,000 simulated conversion rates per page under the null, then take their differences
new_page_converted = np.random.binomial(n_new, p_new, 10000)/n_new
old_page_converted = np.random.binomial(n_old, p_old, 10000)/n_old
p_diffs = new_page_converted - old_page_converted

In [None]:
p_diffs = np.array(p_diffs)



#### 2.i. Plot a histogram of the p_diffs. Does this plot look like what you expected? Use the matching problem in the classroom to assure you fully understand what was computed here.

In [None]:
plt.hist(p_diffs)

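The histogram is approximately normal and centered near 0, as expected from the central limit theorem: each simulated difference is a difference of two sample proportions drawn with the same underlying rate.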
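
One way to check the normal shape is to compare the spread of the simulated differences with the standard error that the normal approximation predicts, $\sqrt{p(1-p)(1/n_{new} + 1/n_{old})}$. The sketch below uses assumed, illustrative values for the rate and sample sizes, not the ones from `ab_data.csv`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative values (not taken from ab_data.csv).
p_null, n_a, n_b = 0.12, 5000, 5000

# 10,000 simulated differences in conversion rate under the null.
diffs = rng.binomial(n_a, p_null, 10000) / n_a - rng.binomial(n_b, p_null, 10000) / n_b

# Standard error predicted by the normal approximation.
analytic_se = np.sqrt(p_null * (1 - p_null) * (1 / n_a + 1 / n_b))
print(diffs.mean(), diffs.std(), analytic_se)
```

If the approximation holds, the simulated mean should sit close to 0 and the simulated standard deviation close to `analytic_se`.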



#### 2.j. What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?

In [None]:
(p_diffs > diff_new).mean()

In [None]:
null_mean = 0
null_vals = np.random.normal(null_mean, p_diffs.std(), 10000)
plt.hist(null_vals);

plt.axvline(x=diff_new, color = 'red');

In [None]:
p_val = (null_vals > diff_new).mean()
p_val



#### 2.k. In words, explain what you just computed in part j.. What is this value called in scientific studies? What does this value mean in terms of whether or not there is a difference between the new and old pages?

1. The red vertical line marks where our observed statistic falls. The value computed in part j is the p-value.
1. This p-value is greater than 0.05, so we cannot reject the null hypothesis. We have no evidence that the new page converts better than the old page.



#### 2.l. We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Fill in the below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let n_old and n_new refer to the number of rows associated with the old page and new pages, respectively.

In [None]:
import statsmodels.api as sm

convert_old = ab_df.query('landing_page == "old_page"').converted.sum()
convert_new = ab_df.query('landing_page == "new_page"').converted.sum()
n_old = ab_df.query('landing_page == "old_page"').user_id.count()
n_new = ab_df.query('landing_page == "new_page"').user_id.count()
print(convert_old)
print(convert_new)
print(n_old)
print(n_new)

In [None]:
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
print(z_score, p_value)
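
To see what the built-in is doing, the z-score can also be computed by hand. The sketch below assumes the test uses a pooled standard error (which, as far as we understand, is the default behavior of `proportions_ztest`) and a lower-tail p-value for `alternative='smaller'`. The function name `pooled_ztest_smaller` and the counts are hypothetical, made up for illustration.

```python
import math
from statistics import NormalDist

def pooled_ztest_smaller(count1, n1, count2, n2):
    """Two-proportion z-test of H1: p1 < p2, using the pooled standard error."""
    p1, p2 = count1 / n1, count2 / n2
    p_pool = (count1 + count2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Lower-tail probability, matching the 'smaller' alternative.
    return z, NormalDist().cdf(z)

# Made-up counts for illustration only (old page first, then new page).
z_demo, p_demo = pooled_ztest_smaller(120, 1000, 110, 1000)
print(z_demo, p_demo)
```

With the old page listed first, a positive z-score means the old page converted better in the sample, which pushes the one-sided p-value above 0.5.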



#### 2.n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts j. and k.?

1. The z-score and p-value communicate the same message as parts j and k: the p-value is very large, which suggests that our observed statistic is consistent with the null hypothesis.
1. Hence, we fail to reject the null hypothesis and find no evidence that the new page is better than the old page.



### Part III - A regression approach




#### 1.a. In this final part, we will see that the result we achieved in the A/B test in Part II above can also be achieved by performing regression. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?

**Logistic regression**



#### 1.b. The goal is to use statsmodels to fit the regression model we specified in part a. to see if there is a significant difference in conversion based on which page a customer receives. However, we first need to create a column for the intercept, and create a dummy variable column for which page each user received. Add an intercept column, as well as an ab_page column, which is 1 when an individual receives the treatment and 0 if control.

In [None]:
display(ab_df.head())

In [None]:
ab_df['intercept'] = 1
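# Dummy columns come out in alphabetical order (new_page, old_page); dtype=int keeps them numeric for Logit
ab_df[['ab_page', 'ab_page_temp']] = pd.get_dummies(ab_df['landing_page'], dtype=int)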
ab_df.head()


In [None]:
ab_df.drop('ab_page_temp', axis=1, inplace=True)
ab_df.head()



#### 1.c. Use statsmodels to instantiate our regression model on the two columns we created in part b., then fit the model using the two columns we created in part b. to predict whether or not an individual converts.

In [None]:
import statsmodels.api as sm
logitmod = sm.Logit(ab_df['converted'], ab_df[['intercept', 'ab_page']])



#### 1.d. Provide the summary of our model below, and use it as necessary to answer the following questions.

In [None]:
results = logitmod.fit()
results.summary()



#### 1.e. What is the p-value associated with **ab_page**? Why does it differ from the value we found in **Part II**?<br><br>  **Hint**: What are the null and alternative hypotheses associated with our regression model, and how do they compare to the null and alternative hypotheses in the **Part II**?

- **Hypothesis**

  $$H_0: p_{new} - p_{old} = 0$$

  $$H_1: p_{new} - p_{old} \neq 0$$

The p-value associated with ab_page is 0.171. It differs from the value in Part II because the two tests use different hypotheses. In Part II we ran a one-sided test: the p-value was the probability of observing a statistic at least as favorable to the new page as ours, assuming the null hypothesis is true. The regression p-value for ab_page comes from a two-sided test, which asks only whether the conversion rates of the new page and the old page differ at all.

Based on this p-value, conversion does not depend significantly on the page.
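
For a symmetric test statistic, the two kinds of p-value are linked by simple arithmetic, which the sketch below illustrates. The helper `one_sided_from_two_sided` is hypothetical, and the direction of the effect is an assumption based on Part I, where the treatment group converted slightly less than the control group.

```python
def one_sided_from_two_sided(p_two_sided, effect_in_h1_direction):
    """Convert a two-sided p-value to a one-sided one for a symmetric test statistic."""
    half = p_two_sided / 2
    # If the observed effect points the way H1 predicts, the one-sided p is half;
    # otherwise it is the complement of that half.
    return half if effect_in_h1_direction else 1 - half

# 0.171 is the ab_page p-value quoted above. Assumption: the new page converted
# slightly less, so the effect points away from H1: p_new > p_old.
p_one_sided = one_sided_from_two_sided(0.171, effect_in_h1_direction=False)
print(round(p_one_sided, 4))
```

This is only a rough consistency check; the simulated p-value from Part II will not match it exactly.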



#### 1.f. Now, we are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add into our regression model. Are there any disadvantages to adding additional terms into our regression model?

1. It is a good idea to consider other factors in our regression model, for example the day of the week (which could be extracted from the timestamp) or gender and income (if that data were available). This could lead to more precise results and higher accuracy.

2. One disadvantage of adding terms is that, even with additional factors, we can never account for every influence on conversion.

3. Multicollinearity is another risk. It emerges when several highly correlated variables are included in the same model, and it can be troublesome to detect.
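
As an example of deriving an extra factor, day-of-week features could be built from a timestamp column. The sketch below assumes a column named `timestamp` holding parseable date strings; the two rows in `toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical timestamps; the column name is an assumption and the values are made up.
toy = pd.DataFrame({"timestamp": ["2017-01-02 10:15:00", "2017-01-07 22:40:00"]})

# Parse the strings, then derive day-of-week features.
ts = pd.to_datetime(toy["timestamp"])
toy["day_name"] = ts.dt.day_name()
toy["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)  # Monday=0 ... Sunday=6
print(toy)
```

A column such as `is_weekend` could then be added to the model alongside `ab_page`, keeping in mind the multicollinearity caveat above.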



#### 1.g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. You will need to read in the countries.csv dataset and merge together your datasets on the appropriate rows. Here are the docs for joining tables.

#### Does it appear that country had an impact on conversion? Don't forget to create dummy variables for these country columns - Hint: You will need two columns for the three dummy variables. Provide the statistical output as well as a written response to answer this question.

In [None]:
ab_df_countries = pd.read_csv("countries.csv")
display(ab_df_countries.head())


In [None]:
#merge the dataframes together
ab_df_log_country = ab_df_countries.merge(ab_df, on="user_id", how = "left")
display(ab_df_log_country.head())

In [None]:
display(ab_df_log_country['country'].value_counts())

In [None]:
### Create the necessary dummy variables
ab_df_log_country[['CA', 'UK', 'US']] = pd.get_dummies(ab_df_log_country['country'], dtype=int)
display(ab_df_log_country.head(5))

In [None]:
ab_df_log_country['intercept'] = 1

logitmod = sm.Logit(ab_df_log_country['converted'], ab_df_log_country[['intercept','ab_page', 'UK', 'US']])
results = logitmod.fit()
results.summary()

**We test for the effects of country and page on conversion above. The p-values for "US" and "UK" are 0.181 and 0.111; both are larger than 0.05, so we fail to reject the null hypothesis. In other words, country does not appear to have a significant effect on the conversion rate.**



#### 1.h. Though we have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model.

#### Provide the summary results, and our conclusions based on the results.

In [None]:
#Create a new interaction variable between new page and country US and UK
ab_df_log_country['UK_new_page'] = ab_df_log_country['ab_page'] * ab_df_log_country['UK']
ab_df_log_country['US_new_page'] = ab_df_log_country['ab_page'] * ab_df_log_country['US']

In [None]:
lm3 = sm.Logit(ab_df_log_country['converted'], ab_df_log_country[['intercept', 'ab_page', 'UK' , 'US', 'UK_new_page', 'US_new_page']])
results = lm3.fit()
results.summary()

In [None]:
# Exponentiate the coefficients to interpret them as odds ratios
np.exp(results.params)

#### Interpretations:

1. In the logit regression results above, which test for interactions between page and country, only the intercept has a p-value below 0.05. None of the other variables is statistically significant.
1. The country a user lives in does not have a statistically significant effect on the conversion rate, even when the page the user lands on is taken into account.
1. The exponentiated interaction coefficient for UK and the new page is 1.08: the new page's multiplicative effect on the odds of conversion is 1.08 times as large for UK users as for CA users (the baseline), holding all other variables constant.
1. The exponentiated interaction coefficient for US and the new page is 1.04: the new page's multiplicative effect on the odds of conversion is 1.04 times as large for US users as for CA users, holding all other variables constant.
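
The arithmetic behind reading an interaction term can be sketched with a pair of coefficients. The values below are hypothetical log-odds coefficients chosen only to illustrate the calculation; they are not the fitted values from the summary above.

```python
import math

# Hypothetical log-odds coefficients, chosen only to illustrate the arithmetic.
coef_ab_page = -0.07
coef_uk_new_page = 0.08

# Odds ratio for the new page among baseline (CA) users.
or_new_page_ca = math.exp(coef_ab_page)
# Odds ratio for the new page among UK users: main effect plus interaction.
or_new_page_uk = math.exp(coef_ab_page + coef_uk_new_page)

# The exponentiated interaction term is the ratio of those two odds ratios.
ratio = or_new_page_uk / or_new_page_ca
print(round(ratio, 4), round(math.exp(coef_uk_new_page), 4))
```

In other words, an exponentiated interaction coefficient compares the page effect between two countries rather than giving a conversion likelihood directly.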


### Overall Conclusions and recommendation:

1. Across the different techniques, we found no evidence that the new page converts better than the old page.
1. Since the new page does not bring a higher conversion rate, the company should keep the old page.