In [1]:
%reload_ext nb_black

# Kidney Treatment Analysis

You have data collected about an experimental kidney treatment, and you want to decide which treatment is more effective: A or B.

## Imports

In [2]:
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

data_url = "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/kidney_stone_data.csv"

## Data Inspection

* Read and inspect the data.  Do we have any missing values to deal with?

In [3]:
df = pd.read_csv(data_url)
df.head()

|   | treatment | stone_size | success |
|---|-----------|------------|---------|
| 0 | B         | large      | 1       |
| 1 | A         | large      | 1       |
| 2 | A         | large      | 0       |
| 3 | A         | large      | 1       |
| 4 | A         | large      | 1       |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   treatment   700 non-null    object
 1   stone_size  700 non-null    object
 2   success     700 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 16.5+ KB


In [5]:
# number of each treatment type
df.treatment.value_counts()

A    350
B    350
Name: treatment, dtype: int64

In [6]:
# number of successes
df.success.value_counts(normalize=True)

1    0.802857
0    0.197143
Name: success, dtype: float64

## More successful treatment?

Which treatment is more successful? How do we go about investigating this?

* Investigate the `pd.crosstab()` function and use it as a way to assess treatment A vs B.
* What do you conclude?

Rules of thumb for comparing two variables:

* numeric x numeric - scatter
* numeric x categorical - boxplot
* categorical x categorical - crosstab

In [7]:
# compare the treatments and successes to see if you can get an idea of which treatment was more successful
pd.crosstab(df.treatment, df.success, normalize="index")
# normalize by index to see the success rate within each row (treatment)
# treatment B had a success rate of about 83%, while A was 78%

| treatment | success = 0 | success = 1 |
|-----------|-------------|-------------|
| A         | 0.22        | 0.78        |
| B         | 0.174286    | 0.825714    |


We could more formally analyze these numbers with a $\chi^2$ ("chi square") test of independence.  See more on what this procedure is doing in this video from [Khan Academy](https://www.khanacademy.org/math/ap-statistics/chi-square-tests/chi-square-tests-two-way-tables/v/chi-square-test-association-independence).

What do you conclude from this test?

In [8]:
# Input your crosstab here w/o normalizing or row/col totals
# crosstab with raw counts instead of normalized
crosstab = pd.crosstab(df.treatment, df.success)

chi2, p, dof, expected = stats.chi2_contingency(crosstab)
expected
# the chi-squared contingency test compares the counts we observed with the counts expected under independence

array([[ 69., 281.],
       [ 69., 281.]])

In [9]:
p < 0.05

False

The p value is above 0.05, so the difference in overall success rates between treatments A and B is not statistically significant at the 5% level.
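
The same test can be reproduced from raw counts alone. The sketch below hard-codes the 2x2 table implied by 350 patients per treatment and the success rates above (an assumption made so the example runs without downloading the data). It also shows that `stats.chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, which makes the p value slightly larger than the uncorrected test.

```python
import numpy as np
from scipy import stats

# Rows: treatment A, B. Columns: failure (0), success (1).
# Counts implied by 350 patients per treatment and success rates 0.78 / 0.825714.
observed = np.array([[77, 273],
                     [61, 289]])

# Default behaviour: Yates' continuity correction is applied because dof == 1.
chi2_yates, p_yates, dof, expected = stats.chi2_contingency(observed)

# Same test without the correction, for comparison.
chi2_raw, p_raw, _, _ = stats.chi2_contingency(observed, correction=False)

print(f"dof={dof}, p (Yates)={p_yates:.3f}, p (uncorrected)={p_raw:.3f}")
print(expected)
```

Either way the conclusion is the same here: the pooled difference between A and B is not significant at the 5% level.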

## Include stone_size in crosstab

Now, include the `'stone_size'` column in your crosstab analysis.

What do you conclude?

In [10]:
pd.crosstab([df.treatment, df.stone_size], df.success, normalize="index")

| treatment | stone_size | success = 0 | success = 1 |
|-----------|------------|-------------|-------------|
| A         | large      | 0.269962    | 0.730038    |
| A         | small      | 0.068966    | 0.931034    |
| B         | large      | 0.3125      | 0.6875      |
| B         | small      | 0.133333    | 0.866667    |


The small effect seen in the success rates has reversed! For all stone sizes, treatment A has a higher success rate than treatment B. This is an example of Simpson's paradox:

> Simpson's paradox (or Simpson's reversal, Yule–Simpson effect, amalgamation paradox, or reversal paradox) is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.

from [Wikipedia](https://en.wikipedia.org/wiki/Simpson%27s_paradox)
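
The reversal can be checked directly from the raw counts for this dataset. The sketch below rebuilds a small summary table by hand (the hard-coded counts stand in for `df` so the example is self-contained) and compares pooled and per-stone-size success rates. It also computes the share of each treatment's patients who had large stones, which is what drives the paradox.

```python
import pandas as pd

# (treatment, stone_size, failures, successes) for the kidney stone data
counts = pd.DataFrame(
    [
        ("A", "large", 71, 192),
        ("A", "small", 6, 81),
        ("B", "large", 25, 55),
        ("B", "small", 36, 234),
    ],
    columns=["treatment", "stone_size", "failures", "successes"],
)
counts["n"] = counts["failures"] + counts["successes"]

# Success rate within each treatment x stone_size cell
by_size = counts.set_index(["treatment", "stone_size"])
stratified = (by_size["successes"] / by_size["n"]).unstack("stone_size")

# Success rate per treatment after pooling the stone sizes
totals = counts.groupby("treatment")[["successes", "n"]].sum()
pooled = totals["successes"] / totals["n"]

# Share of each treatment's patients that had large stones
large_share = by_size["n"].unstack("stone_size")["large"] / totals["n"]

print(stratified)
print(pooled)
print(large_share)
```

Treatment A was given mostly to patients with large stones, which are harder to treat regardless of treatment, so its pooled success rate is pulled down even though it does better within each stone size.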

----

If we were to run a $\chi^2$ test of independence:

In [11]:
# Input your crosstab here w/o normalizing or row/col totals
crosstab = pd.crosstab([df.treatment, df.stone_size], df.success)
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
p

6.626702248891721e-07

In [15]:
crosstab = pd.crosstab([df.treatment, df.stone_size], df.success)
crosstab

| treatment | stone_size | success = 0 | success = 1 |
|-----------|------------|-------------|-------------|
| A         | large      | 71          | 192         |
| A         | small      | 6           | 81          |
| B         | large      | 25          | 55          |
| B         | small      | 36          | 234         |


In [13]:
expected

array([[ 51.84857143, 211.15142857],
       [ 17.15142857,  69.84857143],
       [ 15.77142857,  64.22857143],
       [ 53.22857143, 216.77142857]])
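
Note that the 4x2 test asks whether success is independent of the treatment and stone size combination, so a small p value can be driven by stone size rather than by treatment. One possible follow-up (an assumption on my part, not part of the original analysis) is to test A against B separately within each stone size. The sketch below hard-codes the counts from the crosstab so it runs on its own.

```python
import numpy as np
from scipy import stats

# One 2x2 table per stone size.
# Rows: treatment A, B. Columns: failure (0), success (1).
tables = {
    "large": np.array([[71, 192], [25, 55]]),
    "small": np.array([[6, 81], [36, 234]]),
}

p_values = {}
for size, table in tables.items():
    # Yates' continuity correction is applied by default for 2x2 tables.
    chi2, p, dof, expected = stats.chi2_contingency(table)
    p_values[size] = p
    print(f"{size} stones: chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
```

With these counts, neither within-stratum comparison falls below 0.05 on its own, even though A has the higher success rate in both strata.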

## Revisit online jelly sales for effect size and power

Reminder of context:

> You are a data scientist at a luxury jelly e-commerce retailer. In hopes of generating a recurring revenue stream, the retailer has decided to roll out a "Jelly of the Month Club".  Test to see if, during the experiment, purchase amount increased due to the sidebar testimonial.

<img src='https://i0.wp.com/scng-dash.digitalfirstmedia.com/wp-content/uploads/2019/11/LDN-L-CHEVYCHASE-1124-02.jpg?fit=620%2C9999px&ssl=1' width='20%'>
<center>"I love this luxury jelly" - Clark Griswold... prolly</center>

In [16]:
def get_95_ci(x1, x2):
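    """Calculate a 95% CI for the difference in means of two 1d arrays (NumPy or pandas)"""
    signal = x1.mean() - x2.mean()
    # ddof=1 gives the sample variance for both NumPy arrays and pandas Series
    noise = np.sqrt(x1.var(ddof=1) / x1.size + x2.var(ddof=1) / x2.size)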

    ci_lo = signal - 1.96 * noise
    ci_hi = signal + 1.96 * noise

    return ci_lo, ci_hi

In [17]:
# pandas needs an Excel engine to read .xlsx files
# (openpyxl in recent pandas versions; older setups used xlrd)
# !pip install openpyxl
data_url = "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/a-b-testing-drill-start-06-14-19.xlsx"

df = pd.read_excel(data_url)

treatment = df[df["group"] == "treatment"]
control = df[df["group"] == "control"]

# Two-sample t-test: did cart amounts differ between treatment and control?
t, p = stats.ttest_ind(treatment["cart_amount"], control["cart_amount"])
p

0.004419762574245763

In [18]:
get_95_ci(treatment["cart_amount"], control["cart_amount"])

(0.0985761443332894, 0.5415951112541858)

Based on this analysis, we are seeing a difference in sales based on whether or not customers were shown the sidebar testimonial.  The customers who were shown the testimonial spent about \\$0.10 - \\$0.54 more than customers in the control group.

### Effect size

To calculate a point estimate of the size of the effect, we might calculate Cohen's D, a measure known as the 'effect size'.  Like our confidence interval, it describes how large the difference is, but as a single standardized value (the difference in means in units of standard deviation) rather than a range in dollars.  The formula for this calculation is:

$$d = \frac{\overline{x_1} - \overline{x_2}}{s_{pooled}}$$

Where $\overline{x_1}$ and $\overline{x_2}$ are the sample means & $s_{pooled}$ is a 'pooled standard deviation' (i.e. a measure of variability that includes both samples).  We can calculate this pooled standard deviation with:

$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

* Calculate Cohen's D for our `'cart_amount'` example

In [19]:
x1 = treatment["cart_amount"]
x2 = control["cart_amount"]

n1 = x1.size
n2 = x2.size

s1 = x1.std()
s2 = x2.std()

In [20]:
s_pooled_numerator = ((n1 - 1) * s1 ** 2) + ((n2 - 1) * s2 ** 2)
s_pooled_denominator = n1 + n2 - 2

s_pooled = np.sqrt(s_pooled_numerator / s_pooled_denominator)
s_pooled

2.3661475919303294

In [22]:
effect_size = (x1.mean() - x2.mean()) / s_pooled
effect_size

0.13527711833588885

Cohen's D is unitless: an effect size of about 0.135 means the treatment group's mean cart amount is roughly 0.135 pooled standard deviations above the control group's.  Converted back to dollars, that is about 0.135 × \\$2.37 ≈ \\$0.32 per cart, which is the observed difference in means and the center of our confidence interval.  By Cohen's conventional benchmarks, where 0.2 is considered 'small', this is a small effect.

In practice, I think you should lean more towards confidence intervals and keep this calculation in your back pocket if asked for a standardized point estimate.  Business people are typically pretty understanding of a confidence interval style interpretation, and it allows us to more explicitly state the uncertainty in the effect size.
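
The steps above can be wrapped into a reusable helper. This is a minimal sketch: the name `cohens_d` and the toy arrays are made up for illustration. It passes `ddof=1` explicitly because NumPy's `.var()` defaults to the population variance (`ddof=0`), while pandas' `.var()` and `.std()` default to the sample version (`ddof=1`).

```python
import numpy as np


def cohens_d(x1, x2):
    """Cohen's D for two independent samples, using the pooled standard deviation."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.size, x2.size

    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)

    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)


# Toy data: means 4 and 3, both sample variances equal 4, so s_pooled = 2
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d)  # 0.5
```

Calling `cohens_d(treatment["cart_amount"], control["cart_amount"])` should reproduce the `effect_size` value computed above.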

### Power

Statistical power is the probability of rejecting the null hypothesis when it is in fact false.

A great resource for going deeper: [A Gentle Introduction to Statistical Power and Power Analysis in Python](https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/).  A quote from the resource:

> A power analysis can be used to estimate the minimum sample size required for an experiment, given a desired significance level, effect size, and statistical power.


Note, given any three of these four quantities (effect size, significance level, power, and sample size), we can work out the fourth.

* What is the power of our test on `'cart_amount'`?
  * We'll be using the [`.solve_power()`](https://www.statsmodels.org/stable/generated/statsmodels.stats.power.TTestIndPower.solve_power.html#statsmodels.stats.power.TTestIndPower.solve_power) method of the `TTestIndPower()` object.  View its documentation to figure out how to use it.

In [None]:
analysis = TTestIndPower()
analysis.solve_power(
    ____, ____, ____, ____
)
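
One way the blanks might be filled in is sketched below. `solve_power` takes `effect_size`, `nobs1`, `alpha`, `power`, and `ratio`; exactly one of the first four is left as `None` and solved for. The group size of 500 is a hypothetical placeholder (the real value is `x1.size`), and `ratio=1.0` assumes both groups are the same size.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

effect_size = 0.135  # Cohen's D from the cart_amount example (rounded)
n_per_group = 500    # HYPOTHETICAL group size; replace with x1.size

# Solve for power: leave power=None and supply the other three quantities.
power = analysis.solve_power(
    effect_size=effect_size, nobs1=n_per_group, alpha=0.05, power=None, ratio=1.0
)

# Solve for sample size instead: observations per group needed for 80% power.
n_needed = analysis.solve_power(
    effect_size=effect_size, nobs1=None, alpha=0.05, power=0.8, ratio=1.0
)
n_needed = float(np.squeeze(n_needed))  # normalise to a plain float

print(f"power with n={n_per_group} per group: {power:.2f}")
print(f"n per group for 80% power: {n_needed:.0f}")
```

`ratio` is the size of the second group divided by the first, so it should be set to `x2.size / x1.size` if the groups are unbalanced.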

In [None]:
# Renaming just to reuse without a lot of extra typing
n = x1.size
es = effect_size

# Making up potential sample/effect sizes that relate to what we observe
sample_sizes = np.array([n * 0.1, n * 0.5, n, n * 2])
effect_sizes = np.array([es * 0.1, es * 0.5, es, es * 2])

# Plot the power if we had these sample/effect sizes
analysis.plot_power(
    dep_var="nobs", nobs=sample_sizes, alpha=0.05, effect_size=effect_sizes
)
plt.show()

# Let's p-hack!

First, let's go over the theory behind it

### Sample size and the t statistic

In a t-test, the p value is directly related to the t statistic.  As the absolute value of t increases, p decreases.  The definition of t is below.

$$t = \frac{signal}{noise} = \frac{\overline{x}_{1}-\overline{x}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}$$

The denominator (aka $noise$) is the component that is affected by sample size.  The intuition behind this is that as your sample increases you should be drowning out the 'noisy' observations and the result is less noise overall.

This means that as `n` increases, our denominator decreases.  In fractions/division, when we hold the numerator constant and the denominator gets smaller, the result gets larger (e.g. $\frac{1}{4} = 0.25$ & $\frac{1}{2} = 0.5$).

All of this builds up to... our t statistic will get larger as `n` increases (assuming everything else stays relatively the same).

----

Enough with the theory, prove it.  We have 2 means and standard deviations defined below.

In [None]:
mean_x1 = 11
mean_x2 = 10
std_x1 = 2
std_x2 = 2

* Write a `for` loop that loops over the different values in the `ns` list
* In each iteration, calculate a `t` and `p` value for the given means, standard deviations, and value of `n` (assume both groups had `n` observations).
* Store the p values in a list to print/plot the relationship between p and n

In [None]:
ns = [10, 50, 100, 500, 1000, 5000]
ps = []
for ____:
    signal = ____
    noise = ____
    t = ____

    # Look up p value for given value of t and sample size
    p = stats.t.sf(np.abs(t), 2 * n - 2) * 2
    ____
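
One possible way to complete the loop is sketched below; it is one reading of the exercise, not the only valid answer. It redefines the means and standard deviations so it runs on its own and assumes both groups have `n` observations.

```python
import numpy as np
from scipy import stats

mean_x1, mean_x2 = 11, 10
std_x1, std_x2 = 2, 2

ns = [10, 50, 100, 500, 1000, 5000]
ps = []
for n in ns:
    signal = mean_x1 - mean_x2
    noise = np.sqrt(std_x1 ** 2 / n + std_x2 ** 2 / n)
    t = signal / noise

    # Two-sided p value for this t with 2n - 2 degrees of freedom
    p = stats.t.sf(np.abs(t), 2 * n - 2) * 2
    ps.append(p)

for n, p in zip(ns, ps):
    print(f"n={n:>4}: p={p:.3g}")
```

The means and standard deviations never change, yet p shrinks steadily as `n` grows: the difference is not significant at `n = 10` but is by `n = 50`.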

### Sample size and the confidence interval

The formula we've been using for a 95% confidence interval for a t-test is shown below.  Reason out what will happen to our confidence interval as sample size increases.

$$\overline{X}_{1}-\overline{X}_{2} \pm 1.96 * {\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}$$

Write a for loop similar to the one above but this time with a focus on confidence intervals.  What happens to the confidence interval as n increases?
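
A sketch of that loop, under the same assumptions as before (both groups have `n` observations, means of 11 and 10, standard deviations of 2):

```python
import numpy as np

mean_x1, mean_x2 = 11, 10
std_x1, std_x2 = 2, 2

ns = [10, 50, 100, 500, 1000, 5000]
intervals = []
for n in ns:
    signal = mean_x1 - mean_x2
    noise = np.sqrt(std_x1 ** 2 / n + std_x2 ** 2 / n)

    # 95% CI for the difference in means (normal approximation, 1.96)
    intervals.append((signal - 1.96 * noise, signal + 1.96 * noise))

widths = [hi - lo for lo, hi in intervals]

for n, (lo, hi) in zip(ns, intervals):
    print(f"n={n:>4}: ({lo:.3f}, {hi:.3f})")
```

The interval stays centered on the same difference of 1 but gets narrower as `n` increases. At `n = 10` it still includes zero; at larger sample sizes it no longer does.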