You have data collected from an experiment on kidney stone treatments, and you want to decide which treatment is more effective: A or B.

In [1]:
%load_ext nb_black

In [2]:
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/kidney_stone_data.csv"
kidney_df = pd.read_csv(data_url)
kidney_df.head()

| | treatment | stone_size | success |
|---|---|---|---|
| 0 | B | large | 1 |
| 1 | A | large | 1 |
| 2 | A | large | 0 |
| 3 | A | large | 1 |
| 4 | A | large | 1 |


Use [`pandas.crosstab`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) to analyze the `'treatment'` and `'success'` columns together. Use the `normalize` argument to show proportions per row, so that each row sums to 1.

What do you conclude?

In [3]:
pd.crosstab(kidney_df["treatment"], kidney_df["success"], normalize="index")

| treatment | success = 0 | success = 1 |
|---|---|---|
| A | 0.22 | 0.78 |
| B | 0.174286 | 0.825714 |


Based on these proportions, we might conclude that treatment B is more effective: patients who received treatment B had a success rate of about 83%, compared with 78% for treatment A.
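
To make the `normalize` argument concrete, here is a minimal sketch on made-up data. The `toy` frame and its `group` and `outcome` columns are hypothetical and are not part of the kidney stone data set; they only illustrate how the three normalization modes differ.

```python
import pandas as pd

# Made-up toy data (hypothetical; not the kidney stone data set)
toy = pd.DataFrame(
    {
        "group": ["x", "x", "x", "y", "y"],
        "outcome": [1, 0, 1, 1, 1],
    }
)

# Raw counts of each group/outcome combination
raw = pd.crosstab(toy["group"], toy["outcome"])

# normalize="index": each row is divided by its row total, so rows sum to 1
by_row = pd.crosstab(toy["group"], toy["outcome"], normalize="index")

# normalize="columns": each column is divided by its column total
by_col = pd.crosstab(toy["group"], toy["outcome"], normalize="columns")

# normalize="all": every cell is divided by the grand total
overall = pd.crosstab(toy["group"], toy["outcome"], normalize="all")

print(raw)
print(by_row)
```

Row normalization is the natural choice here because the question is "given a treatment, how often does it succeed?", which conditions on the row variable.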

----

We could more formally analyze these numbers with a $\chi^2$ ("chi square") test of independence.  See more on what this procedure is doing in this video from [Khan Academy](https://www.khanacademy.org/math/ap-statistics/chi-square-tests/chi-square-tests-two-way-tables/v/chi-square-test-association-independence).

In [4]:
crosstab = pd.crosstab(kidney_df["treatment"], kidney_df["success"])
crosstab

| treatment | success = 0 | success = 1 |
|---|---|---|
| A | 77 | 273 |
| B | 61 | 289 |


In [5]:
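# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the expected counts under independence
chi2, p, dof, expected = stats.chi2_contingency(crosstab)

# Is the result significant at the 5% level?
p < 0.05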

False

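The p-value is above 0.05, so we fail to reject the null hypothesis that treatment and success are independent. In other words, the difference between the 78% and ~83% success rates is not statistically significant at the 5% level.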
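
To see what the test computes, the following sketch reproduces it by hand from the counts in the crosstab above (77/273 for treatment A, 61/289 for treatment B). The variable names are illustrative. Note that `stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so the sketch passes `correction=False` to match the uncorrected hand calculation.

```python
import numpy as np
from scipy import stats

# Observed counts from the treatment/success crosstab above
# rows: treatment A, B; columns: success 0, 1
observed = np.array([[77, 273], [61, 289]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts if treatment and success were independent
expected_by_hand = row_totals * col_totals / n

# Pearson chi-square statistic (no continuity correction)
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: probability of a statistic at least this large under independence
p_by_hand = stats.chi2.sf(chi2_by_hand, dof)

# Same calculation via scipy, with the default Yates correction turned off
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)

print(expected_by_hand)
print(round(chi2_by_hand, 3), round(p_by_hand, 3))
```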

----

Now, include the `'stone_size'` column in your crosstab analysis.

What do you conclude?

In [6]:
pd.crosstab(
    [kidney_df["stone_size"], kidney_df["treatment"]],
    kidney_df["success"],
    normalize="index",
)

| stone_size | treatment | success = 0 | success = 1 |
|---|---|---|---|
| large | A | 0.269962 | 0.730038 |
| large | B | 0.3125 | 0.6875 |
| small | A | 0.068966 | 0.931034 |
| small | B | 0.133333 | 0.866667 |


The small effect seen in the overall success rates has reversed! Within each stone size, treatment A has a higher success rate than treatment B. This is an example of Simpson's paradox:

> Simpson's paradox (or Simpson's reversal, Yule–Simpson effect, amalgamation paradox, or reversal paradox) is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.

from [Wikipedia](https://en.wikipedia.org/wiki/Simpson%27s_paradox)
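
One way to see why the reversal happens is to check how patients were split between treatments and stone sizes. The sketch below rebuilds a patient-level frame from the counts reported in this notebook (the `counts` dictionary and `demo_df` name are illustrative stand-ins for `kidney_df`, so the block runs without downloading the CSV). Treatment A was given mostly to large stones (263 of 350 patients), while treatment B was given mostly to small stones (270 of 350), and large stones have a lower success rate regardless of treatment.

```python
import pandas as pd

# (stone_size, treatment, success) -> number of patients,
# copied from the count crosstab in this notebook
counts = {
    ("large", "A", 0): 71, ("large", "A", 1): 192,
    ("large", "B", 0): 25, ("large", "B", 1): 55,
    ("small", "A", 0): 6, ("small", "A", 1): 81,
    ("small", "B", 0): 36, ("small", "B", 1): 234,
}

# Expand the counts into one row per patient
rows = [key for key, n in counts.items() for _ in range(n)]
demo_df = pd.DataFrame(rows, columns=["stone_size", "treatment", "success"])

# Which stone sizes did each treatment get? (row proportions)
allocation = pd.crosstab(
    demo_df["treatment"], demo_df["stone_size"], normalize="index"
)

# How often does treatment succeed for each stone size, ignoring treatment?
success_by_size = demo_df.groupby("stone_size")["success"].mean()

print(allocation.round(3))
print(success_by_size.round(3))
```

Because stone size influences both which treatment a patient received and how likely the treatment was to succeed, it acts as a confounder when the groups are combined.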

----

If we were to run a $\chi^2$ test of independence:

In [7]:
crosstab = pd.crosstab(
    [kidney_df["stone_size"], kidney_df["treatment"]], kidney_df["success"]
)

crosstab

| stone_size | treatment | success = 0 | success = 1 |
|---|---|---|---|
| large | A | 71 | 192 |
| large | B | 25 | 55 |
| small | A | 6 | 81 |
| small | B | 36 | 234 |


In [8]:
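# Test independence between the four stone_size/treatment groups and success
chi2, p, dof, expected = stats.chi2_contingency(crosstab)

# Is the result significant at the 5% level?
p < 0.05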

True

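This test indicates that success is not independent of the combined stone-size and treatment grouping (p < 0.05). Note that the test treats the four stone-size/treatment combinations as a single categorical variable, so it does not tell us how much of the relationship comes from treatment and how much comes from stone size.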
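
A possible follow-up, sketched here as one option rather than as the definitive analysis, is to test treatment against success separately within each stone size. The 2x2 tables below are copied from the count crosstab above; the `tables` and `p_values` names are illustrative. Since each table is 2x2, `stats.chi2_contingency` applies Yates' continuity correction by default.

```python
import numpy as np
from scipy import stats

# One 2x2 table per stone size
# rows: treatment A, B; columns: success 0, 1
tables = {
    "large": np.array([[71, 192], [25, 55]]),
    "small": np.array([[6, 81], [36, 234]]),
}

p_values = {}
for stone_size, table in tables.items():
    # Independence test of treatment vs. success within this stone size
    chi2, p, dof, expected = stats.chi2_contingency(table)
    p_values[stone_size] = p
    print(stone_size, round(p, 3), p < 0.05)
```

Comparing these per-stratum p-values with the pooled result helps separate the effect of treatment from the effect of stone size.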

----

##### What is `aggfunc` in `pd.crosstab()`?

By default, `pd.crosstab` counts the co-occurrences of the values passed as its first two arguments. If we instead want a calculation based on another column of values, we can pass that column as `values` and use the `aggfunc` argument. In the example below, we take the mean of `success` for each combination of `treatment` and `stone_size`. You can confirm this matches the proportions shown in the second normalized crosstab above.

In [9]:
crosstab = pd.crosstab(
    kidney_df["treatment"],
    kidney_df["stone_size"],
    values=kidney_df["success"],
    aggfunc="mean",
)
crosstab

| treatment | stone_size = large | stone_size = small |
|---|---|---|
| A | 0.730038 | 0.931034 |
| B | 0.6875 | 0.866667 |
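
As a small check on what `aggfunc` does, the sketch below uses `aggfunc="sum"` to count successes and the default crosstab to count patients, then divides the two to recover the means. It rebuilds a patient-level frame from the counts reported earlier in this notebook (`counts` and `demo_df` are illustrative stand-ins for `kidney_df`).

```python
import pandas as pd

# (stone_size, treatment, success) -> number of patients,
# copied from the count crosstab in this notebook
counts = {
    ("large", "A", 0): 71, ("large", "A", 1): 192,
    ("large", "B", 0): 25, ("large", "B", 1): 55,
    ("small", "A", 0): 6, ("small", "A", 1): 81,
    ("small", "B", 0): 36, ("small", "B", 1): 234,
}
rows = [key for key, n in counts.items() for _ in range(n)]
demo_df = pd.DataFrame(rows, columns=["stone_size", "treatment", "success"])

# Number of successes per cell: sum of the 0/1 success column
successes = pd.crosstab(
    demo_df["treatment"],
    demo_df["stone_size"],
    values=demo_df["success"],
    aggfunc="sum",
)

# Number of patients per cell: the default crosstab just counts rows
totals = pd.crosstab(demo_df["treatment"], demo_df["stone_size"])

# Mean of a 0/1 column is successes divided by patients
rates = pd.crosstab(
    demo_df["treatment"],
    demo_df["stone_size"],
    values=demo_df["success"],
    aggfunc="mean",
)

print((successes / totals).round(6))
```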


An alternative method to arrive at these numbers might be to use `groupby`:

In [10]:
kidney_df.groupby(["treatment", "stone_size"]).agg("mean")

| treatment | stone_size | success |
|---|---|---|
| A | large | 0.730038 |
| A | small | 0.931034 |
| B | large | 0.6875 |
| B | small | 0.866667 |

