# Imports & Initialization

In [1]:
%reload_ext nb_black

In [2]:
#  data libs
import numpy as np
import pandas as pd

#  stats imports
from scipy import stats
import statsmodels.stats.proportion as proportion
import statsmodels.api as sm

#  plotting imports
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


def get_95_ci(x1, x2):
    """Calculate a 95% CI for the difference in means (x1 - x2) of two 1-D samples"""
    signal = x1.mean() - x2.mean()
    noise = np.sqrt(x1.var() / x1.size + x2.var() / x2.size)

    ci_lo = signal - 1.96 * noise
    ci_hi = signal + 1.96 * noise

    return ci_lo, ci_hi

## The setup

* You are a data scientist at a luxury jelly e-commerce retailer. In hopes of generating a recurring revenue stream, the retailer has decided to roll out a “Jelly of the Month Club.”
* The retailer is testing whether introducing a testimonial from Clark Griswold on the sidebar of the webpage will increase conversions to the Jelly of the Month Club.
* We will also test to see if, during the experiment, purchase amount increased due to the sidebar testimonial.

<img src='https://i0.wp.com/scng-dash.digitalfirstmedia.com/wp-content/uploads/2019/11/LDN-L-CHEVYCHASE-1124-02.jpg?fit=620%2C9999px&ssl=1' width='20%'>
<center>"I love this luxury jelly" - Clark Griswold... prolly</center>

In [3]:
data_url = "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/a-b-testing-drill-start-06-14-19.xlsx"

* Read the data into pandas

# Data Exploration

In [4]:
# Note it is an Excel file
df = pd.read_excel(data_url)
df.head()

Unnamed: 0,trans_id,group,cart_amount,convert
0,176890,treatment,49.46,1
1,443058,treatment,49.34,1
2,827633,treatment,44.08,1
3,277331,treatment,47.16,1
4,843324,treatment,44.19,1


In [5]:
#  find out all the unique values in the group column
df["group"].unique()

array(['treatment', 'control'], dtype=object)

In [None]:
#  find out how many records are in the group column
df["group"].size

In [None]:
#  find out the total number of each value type in the group column
df["group"].value_counts()

In [None]:
#  find the sum of null records in each field
df.isna().sum()

In [None]:
#  look at the standard descriptive stats for the amount field
df.cart_amount.describe()

In [None]:
#  look at the table grouped by the group field
df.groupby("group").mean()
#  trans id isn't a useful field value for this

In [None]:
sns.violinplot(y="group", x="cart_amount", data=df)
plt.show()

# Test for treatment's effect on `'cart_amount'`

* Separate the 2 groups into 2 separate DataFrames.

In [6]:
#  separate the df into control and treatment groups
control = df[df.group == "control"]
treatment = df[df.group == "treatment"]

## Assumptions...
1. Continuous?
2. Independent?
3. Random sample?
4. Normal?

### Continuous?

In [None]:
# Continuous? yes
control.cart_amount.hist()

### Independent?

Independent? yes

[  ] dependent if the same person saw both the control and the treatment version of the website

[X] independent if a separate group of visitors is used for each version

### Random Sample?

Assume yes (we cannot verify how the sample was drawn from the data itself)

### Normal?

In [None]:
#  check qqplot for visualization of normality of the data
sm.qqplot(control.cart_amount, line="s")
plt.show()

sm.qqplot(treatment.cart_amount, line="s")
plt.show()
#  both plots suggest that the distributions are approximately normal
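
A Q-Q plot is judged by eye, so a numeric summary can be a useful companion. The sketch below runs on synthetic data, not the experiment's. It computes the correlation between the sorted sample and the matching standard-normal quantiles; values near 1 mean the points hug a straight line. The helper name `qq_correlation`, the plotting positions `(i + 0.5) / n`, and the sample parameters are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from statistics import NormalDist


def qq_correlation(x):
    """Correlation between sorted data and standard-normal quantiles (hypothetical helper)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # plotting positions (i + 0.5) / n are one common convention among several
    probs = (np.arange(n) + 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(theoretical, x)[0, 1]


rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
normal_sample = rng.normal(loc=45, scale=2, size=500)  # synthetic, roughly bell-shaped
skewed_sample = rng.exponential(scale=2, size=500)  # synthetic, clearly non-normal

r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"normal sample: r = {r_normal:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

On the notebook's data the same helper could be called as `qq_correlation(control.cart_amount)` and `qq_correlation(treatment.cart_amount)`.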

## T-Test

* If we meet the assumptions, perform the analysis with a t-test.
* Does the group have an effect on `'cart_amount'`?
* If the group does have an effect, how big of an effect is it?

In [None]:
t, p = stats.ttest_ind(control.cart_amount, treatment.cart_amount)
print("t = " + str(t))
print("p = " + str(p))

In [None]:
p < 0.05
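
The p-value says whether a difference is detectable, not how large it is. One common way to express size, alongside the raw difference in means, is Cohen's d (the difference in means divided by the pooled standard deviation). This is an additional measure, not one the notebook itself uses. The sketch below runs on synthetic data; the helper name `cohens_d` and the sample parameters are assumptions for illustration.

```python
import numpy as np


def cohens_d(x1, x2):
    """Standardized mean difference (x1 - x2) using the pooled sample SD (hypothetical helper)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.size, x2.size
    # ddof=1 gives the unbiased sample variance for each group
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)


rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
group_a = rng.normal(loc=46.0, scale=2.0, size=800)  # synthetic "treatment-like" sample
group_b = rng.normal(loc=45.0, scale=2.0, size=800)  # synthetic "control-like" sample

d = cohens_d(group_a, group_b)
print(f"raw difference in means: {group_a.mean() - group_b.mean():.3f}")
print(f"Cohen's d: {d:.3f}")
```

On the notebook's data this could be called as `cohens_d(treatment.cart_amount, control.cart_amount)`.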

## Create a bootstrap confidence interval
* Write a `for` loop to go through a process 1000 times
* The process:
    * Create a bootstrap sample of each group
    * Find the difference of means between these 2 bootstrapped samples
    * Store these bootstrapped differences of means to a list
* Create a histogram of the bootstrapped differences of means
* Calculate a 95% confidence interval using `np.percentile()` and the bootstrapped differences of means

How does this compare to the formula-based confidence interval from `get_95_ci()`, defined above?

In [None]:
#  function for finding the confidence interval between 2 sample sets
get_95_ci(control.cart_amount, treatment.cart_amount)

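The 95% confidence interval suggests that the treatment increases the mean `cart_amount` by somewhere between 0.10 and 0.54. Note that `get_95_ci(control.cart_amount, treatment.cart_amount)` computes control minus treatment, so the sign of the returned bounds should be read accordingly.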
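
For reference, `get_95_ci()` implements the usual large-sample (normal approximation) interval for a difference in means, where $\bar{x}_i$, $s_i^2$ and $n_i$ are the mean, variance and size of sample $i$:

$$
(\bar{x}_1 - \bar{x}_2) \pm 1.96 \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
$$

The constant 1.96 is approximately the 97.5th percentile of the standard normal distribution, which leaves 2.5% in each tail and so gives 95% coverage.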

In [None]:
# bootstrap confidence interval loop
# empty list to store mean differences
mean_diffs = []
for i in range(1000):
    # bootstrap sample for each group
    control_sample = control.cart_amount.sample(frac=1.0, replace=True)
    treatment_sample = treatment.cart_amount.sample(frac=1.0, replace=True)
    #  take the difference in the means
    mean_diff = control_sample.mean() - treatment_sample.mean()
    # add the mean_diff to the list of differences
    mean_diffs.append(mean_diff)

print(len(mean_diffs))

In [None]:
# calculate the 95% confidence interval of the bootstrapped differences of mean with np.percentile()
ci_lo = np.percentile(mean_diffs, 2.5)
ci_hi = np.percentile(mean_diffs, 97.5)
print(f"Bootstrapped confidence interval: {ci_lo}, {ci_hi}")

In [None]:
# create a histogram of the bootstrapped mean differences
# (sns.distplot is deprecated; histplot with kde=True draws a comparable plot)
sns.histplot(mean_diffs, kde=True)
plt.axvline(ci_lo, c="r")
plt.axvline(ci_hi, c="r")
plt.show()

The bootstrapped confidence interval was within a hundredth of the formula-based confidence interval calculated above.
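
The same comparison can be sketched without an explicit Python loop. The version below is a vectorized variant run on synthetic stand-ins for the two groups. The seed, sample parameters, and number of resamples are assumptions chosen for illustration. It draws all resamples at once with NumPy and then prints the percentile interval next to the formula-based one.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed makes the resampling repeatable

# synthetic stand-ins for control.cart_amount and treatment.cart_amount
control_x = rng.normal(loc=45.0, scale=2.0, size=800)
treatment_x = rng.normal(loc=45.5, scale=2.0, size=800)

n_boot = 2000
# each row is one bootstrap resample (same size as the original, drawn with replacement)
control_means = rng.choice(control_x, size=(n_boot, control_x.size), replace=True).mean(axis=1)
treatment_means = rng.choice(treatment_x, size=(n_boot, treatment_x.size), replace=True).mean(axis=1)
boot_diffs = control_means - treatment_means

boot_lo, boot_hi = np.percentile(boot_diffs, [2.5, 97.5])

# formula-based interval, same form as get_95_ci()
signal = control_x.mean() - treatment_x.mean()
noise = np.sqrt(control_x.var(ddof=1) / control_x.size + treatment_x.var(ddof=1) / treatment_x.size)
form_lo, form_hi = signal - 1.96 * noise, signal + 1.96 * noise

print(f"bootstrap: ({boot_lo:.3f}, {boot_hi:.3f})")
print(f"formula:   ({form_lo:.3f}, {form_hi:.3f})")
```

With samples this large and roughly normal, the two intervals should land close together, which matches what the notebook observed on the real data.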

# Test for treatment's effect on `'convert'`

* [x] Separate the 2 groups into 2 separate DataFrames (already did this in last step)
* [x] Do we meet the assumptions for a t-test?


Now we check the other outcome of the experiment: whether the testimonial changed the rate at which visitors converted to the Jelly of the Month Club. We look at the same assumptions as before:


In [36]:
# Continuous?
df.convert.value_counts()

0    1688
1      92
Name: convert, dtype: int64

The values in the `convert` column are only 1s and 0s, so the variable is binary rather than continuous.

In [None]:
# independent? Yes, same as before
# normal? No: the data are binary (only 0s and 1s), so they cannot be normally distributed

Doesn't meet the assumptions for a t-test. Can't look at the difference of means.

Hypotheses:

$H_0$: no difference in proportion/rate between the groups

$H_a$: a difference in proportion/rate between the groups


> Because the sample is a proportion, we know more about their distributions than the t-test assumes. Specifically, the distribution of the mean is normal, meaning we could use something called a two sample proportional z-test. We haven't covered this test yet, but you can read about it [here](https://newonlinecourses.science.psu.edu/stat414/node/268/). Find a python implementation for this test and go back and revise our testing. What difference does our new test make?

The quote above is from the reading. We'll use a different test that gives the same result for 2 groups but also allows testing more than 2 groups (not needed here, but nice to know about moving forward).

More on proportions z-test:

* [Nice intro walkthrough](https://www.youtube.com/watch?v=_58qBy9Uxks)
* ["Use and misuse"](https://influentialpoints.com/Training/z-test_for_independent_proportions_use_and_misuse.htm)
* [Python walkthrough (including confidence interval)](http://ethen8181.github.io/machine-learning/ab_tests/frequentist_ab_test.html#Comparing-Two-Proportions)

Another potential test is [`proportions_chisquare()`](https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportions_chisquare.html) from `statsmodels`.  It takes the same arguments.  Below is an excerpt of documentation from `proportions_ztest`.

> In the one and two sample cases with two-sided alternative, this test produces the same p-value as proportions_chisquare, since the chisquare is the distribution of the square of a standard normal distribution.

Translation:

The proportions z-test works for 2 samples; the chi-square test extends it to more than 2 samples. When the number of samples is 2, the chi-square test gives the same p-value as the proportions z-test.

So in this case, the 2 functions lead to the same conclusion.

What are the conclusions of the test?

In [37]:
# n_converts_control, n_converts_treatment
# n_control, n_treatment
pd.crosstab(df.group, df.convert)

group,convert=0,convert=1
control,804,42
treatment,884,50

In [39]:
count = [control.convert.sum(), treatment.convert.sum()]
nobs = [control.convert.size, treatment.convert.size]
proportion.proportions_chisquare(count, nobs)

(0.13689415839203997,
 0.7113883723113994,
 (array([[ 42, 804],
         [ 50, 884]], dtype=int64),
  array([[ 43.7258427, 802.2741573],
         [ 48.2741573, 885.7258427]])))
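
To see why the z-test and the chi-square test agree for 2 groups, the two-proportion z statistic can be computed by hand from the same counts. This is a minimal sketch using only the standard library; in practice `proportion.proportions_ztest(count, nobs)` from `statsmodels` would normally be used. The variable names are illustrative.

```python
import math

# counts copied from the crosstab: conversions and group sizes
count = [42, 50]
nobs = [846, 934]

p_control = count[0] / nobs[0]
p_treatment = count[1] / nobs[1]
p_pooled = sum(count) / sum(nobs)  # shared rate under the null hypothesis

se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / nobs[0] + 1 / nobs[1]))
z = (p_control - p_treatment) / se

# two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.4f}, z squared = {z ** 2:.4f}")
print(f"p = {p_value:.4f}")
```

The squared z statistic reproduces the chi-square statistic reported above (about 0.137), and the p-value matches as well (about 0.711). Since that is well above 0.05, the test gives no evidence that the testimonial changed the conversion rate.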
