## Answering Questions with Messy Data

### Basic Statistical Testing

Topics:

* Hypothesis testing
* Statistical significance
* `scipy` module for Student's t-tests

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

**Hypothesis Testing**:

We have two statements of interest:

1. Our actual explanation: *alternative* hypothesis
2. Our explanation is not sufficient: *null* hypothesis

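The testing method asks whether the data are inconsistent with **the null hypothesis**: if we find a difference between the groups that would be unlikely under the null, then we <u>reject</u> the null in favor of our alternative. If we don't, we fail to reject the null, which is not the same as proving it.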

In [2]:
# Let's get the grades dataset
df = pd.read_csv('../resources/week-4/datasets/grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [3]:
# Let's take a look at the shape
df.shape

(2315, 13)

Let's study the difference in grades for students in two segments: those who finished the first assignment by the end of December 2015 (`early_finishers`) and those who finished it sometime after that (`late_finishers`).

In [5]:
# Get the early finishers
early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


Use the tilde (`~`) to find those students who are in the original dataframe (`df`) and are not in the `early_finishers`.

Other options to get the `late_finishers` data:

* Copy and paste the first projection for `early_finishers` and change the comparison to `>= '2016'` (the cutoff date would then be written in two places)
* Do a `left` join of the dataframe `df` with `early_finishers` and keep only the rows that found no match. A left join keeps every row of the left dataframe, so the unmatched rows are the late finishers.
* Write a function to determine if someone is early or late, and then call `apply()` on the dataframe, add a new column and filter.
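
The three alternatives listed above could look roughly like the sketch below. It uses a tiny made-up dataframe (four hypothetical students) rather than `grades.csv`, and the names `cutoff`, `late_direct`, `late_join`, `late_apply` and `is_late` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Tiny made-up stand-in for the grades data (hypothetical values)
df = pd.DataFrame({
    'student_id': ['a', 'b', 'c', 'd'],
    'assignment1_submission': [
        '2015-11-02 06:55:34', '2016-01-09 05:36:02',
        '2015-12-13 17:06:10', '2016-04-30 06:50:39',
    ],
})

cutoff = pd.Timestamp('2016-01-01')
submitted = pd.to_datetime(df['assignment1_submission'])
early_finishers = df[submitted < cutoff]

# Option 1: repeat the projection with the comparison flipped
late_direct = df[submitted >= cutoff]

# Option 2: left join against the early finishers, then keep unmatched rows
merged = df.merge(early_finishers[['student_id']], on='student_id',
                  how='left', indicator=True)
late_join = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Option 3: a row-wise function used with apply()
def is_late(row):
    return pd.to_datetime(row['assignment1_submission']) >= cutoff

late_apply = df[df.apply(is_late, axis=1)]

print(late_direct['student_id'].tolist())
print(late_join['student_id'].tolist())
print(late_apply['student_id'].tolist())
```

All three select the same students here. The index-based `~` approach used in the notebook avoids both repeating the cutoff and the extra merge.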

In [6]:
# Get the late finishers
late_finishers = df[~df.index.isin(early_finishers.index)] 
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


Now, let's compare the means for our two populations:

In [7]:
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024304
74.0450648477065


They look very similar, but are they *different*? $\rightarrow$ This is where the **Student's t-test** comes in: it lets us state an alternative hypothesis (the group means are *different*) and a null hypothesis (they are *the same*) and then test that null hypothesis.

**Significance level** ($\alpha$): The threshold for how much risk of a false positive (rejecting a null hypothesis that is actually true) we are willing to accept. The usual threshold value is 0.05.

**`scipy`**: Contains a number of different statistical tests and forms a basis for hypothesis testing in Python. For this case, we use the `ttest_ind()` function, which does an independent t-test (the two samples are not related to one another). The results are the t-statistic and the *p-value*. The *p-value* is the probability (between 0 and 1) of seeing a difference at least as extreme as the one observed if the null hypothesis were true. It is not the probability that the null hypothesis itself is true.
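
To make the two returned numbers less of a black box, here is a minimal sketch that recomputes them by hand on synthetic data and compares against `ttest_ind()`. It assumes the default equal-variance (pooled) form of the test. The arrays `a` and `b` are made-up samples, not the grades data.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumed values, for illustration only)
rng = np.random.default_rng(0)
a = rng.normal(loc=75.0, scale=10.0, size=200)
b = rng.normal(loc=74.0, scale=10.0, size=300)

na, nb = len(a), len(b)
dof = na + nb - 2

# Pooled sample variance, as used by the equal-variance t-test
pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / dof

# t-statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / na + 1 / nb))

# Two-sided p-value: probability of a |t| at least this large under the null
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

If the two groups may have different variances, `ttest_ind(a, b, equal_var=False)` runs Welch's t-test instead, which this hand calculation does not reproduce.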

In [8]:
from scipy.stats import ttest_ind

In [9]:
# Use our two populations:
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.3223540853721598, pvalue=0.18618101101713855)

Since the *p-value* is about 0.19, which is above our $\alpha$ value of 0.05, we cannot reject the null hypothesis (the population means are the same). We don't have enough certainty in our evidence (p-value > $\alpha$) to come to the conclusion that they are different. However, this doesn't mean that we've proven that the populations are the same.

In [10]:
# Let's check the same for the other assignments
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


These results show that we don't have enough evidence to suggest the populations differ with respect to the grades.

However, if we pay attention to the p-values, one of the assignments (assignment 3) has a p-value around 0.1. This means that if we had accepted a significance level of about 11%, this result would have been considered statistically significant.

For a researcher, this result suggests that there's something worth considering: did we use too few participants (we didn't), or is there something unique about this assignment related to our experiment (whatever it was)?
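
The per-assignment comparisons can also be collected into one table, which makes borderline p-values easier to spot. The sketch below is one possible way to do it. `compare_groups` is a hypothetical helper, and `early` and `late` are synthetic stand-ins for `early_finishers` and `late_finishers`, so the numbers it prints are not the results shown above.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two groups (made-up grades)
rng = np.random.default_rng(1)
cols = [f'assignment{i}_grade' for i in range(1, 7)]
early = pd.DataFrame(rng.normal(70, 12, size=(150, 6)), columns=cols)
late = pd.DataFrame(rng.normal(70, 12, size=(200, 6)), columns=cols)

def compare_groups(group_a, group_b, columns, alpha=0.05):
    """Run an independent t-test per column and tabulate the results."""
    rows = []
    for col in columns:
        result = ttest_ind(group_a[col], group_b[col])
        rows.append({
            'column': col,
            'statistic': result.statistic,
            'pvalue': result.pvalue,
            'reject_null': bool(result.pvalue <= alpha),
        })
    return pd.DataFrame(rows).set_index('column')

summary = compare_groups(early, late, cols)
print(summary)
```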

**Confidence intervals and Bayesian analysis**:

p-values have come under fire recently for not telling us enough about the interactions that are happening. That is why two other techniques are being used more regularly: confidence intervals and Bayesian analysis.
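
As a rough illustration of the first of these, a 95% confidence interval for a difference in means can be sketched as follows. This is a minimal example under the same equal-variance assumption as the default `ttest_ind()`. The samples `group_a` and `group_b` are synthetic, not the grades data.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumed values, for illustration only)
rng = np.random.default_rng(42)
group_a = rng.normal(75, 10, size=200)
group_b = rng.normal(74, 10, size=300)

na, nb = len(group_a), len(group_b)
dof = na + nb - 2
diff = group_a.mean() - group_b.mean()

# Standard error of the difference, using the pooled variance
pooled_var = ((na - 1) * group_a.var(ddof=1)
              + (nb - 1) * group_b.var(ddof=1)) / dof
se = np.sqrt(pooled_var * (1 / na + 1 / nb))

# 95% interval: estimate +/- critical t value times the standard error
t_crit = stats.t.ppf(0.975, dof)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

pvalue = stats.ttest_ind(group_a, group_b).pvalue
print(f'difference = {diff:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})')
print(f'p-value = {pvalue:.3f}')
```

Unlike a bare p-value, the interval shows how large the difference could plausibly be. For this pooled test, the interval contains 0 exactly when the p-value is above 0.05.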

Let's simulate why a p-value on its own can mislead when we run many tests:

In [13]:
# Create simulated data (100 columns, each with 100 numbers)
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.163571,0.365726,0.411266,0.123422,0.55611,0.784355,0.54564,0.686337,0.733955,0.968123,...,0.447513,0.204671,0.735513,0.831833,0.579715,0.856231,0.231974,0.555843,0.979681,0.81594
1,0.835632,0.482475,0.703072,0.536885,0.488249,0.723917,0.626563,0.051416,0.235521,0.841789,...,0.334187,0.902896,0.884229,0.784013,0.250457,0.256754,0.869744,0.545257,0.696169,0.066163
2,0.092397,0.217959,0.38825,0.060404,0.287974,0.302382,0.385264,0.205651,0.806872,0.622533,...,0.950586,0.698979,0.09779,0.583563,0.681155,0.551906,0.886455,0.200823,0.033642,0.173773
3,0.773214,0.489361,0.129913,0.340423,0.419901,0.259364,0.854313,0.457287,0.853844,0.047256,...,0.917362,0.540446,0.502658,0.827349,0.382888,0.998082,0.065695,0.920507,0.428359,0.478957
4,0.276483,0.785509,0.974894,0.132643,0.750661,0.653784,0.960673,0.687084,0.884599,0.769525,...,0.925676,0.076365,0.95544,0.117876,0.1103,0.757564,0.205094,0.954541,0.058184,0.943536


In [14]:
df2 = pd.DataFrame([np.random.random(100) for x in range(100)])
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.075358,0.638018,0.209206,0.680245,0.398664,0.556085,0.471237,0.092393,0.707842,0.796096,...,0.585898,0.558261,0.508041,0.282257,0.223826,0.74934,0.700302,0.832724,0.561529,0.756064
1,0.984683,0.985797,0.060895,0.271848,0.057927,0.574468,0.823762,0.058096,0.955872,0.379235,...,0.261881,0.097812,0.850568,0.697333,0.574599,0.485456,0.127233,0.778078,0.383247,0.073238
2,0.03611,0.569395,0.440126,0.313319,0.479847,0.509938,0.355181,0.475811,0.483853,0.150733,...,0.037448,0.856385,0.422207,0.055823,0.618974,0.535741,0.146912,0.870322,0.647512,0.588142
3,0.227503,0.001536,0.995498,0.186349,0.260511,0.350396,0.798061,0.720863,0.584783,0.046794,...,0.523995,0.502905,0.30005,0.980419,0.449239,0.128309,0.20865,0.557964,0.414863,0.224747
4,0.923375,0.685151,0.630665,0.618244,0.827742,0.397917,0.544592,0.508272,0.190995,0.949335,...,0.92222,0.743756,0.203772,0.172519,0.958681,0.911481,0.790906,0.674732,0.143005,0.21201


Are these two dataframes the same?

For a given column inside `df1`, is it the same as the matching column inside `df2`? $\rightarrow$ Our significance level is 0.1 now ($\alpha=10\%$). If the p-value for a column is at or below this value, we'll report it.

In [15]:
# Let's create a custom function for this
def test_columns(alpha=0.1):

    # how many differ
    num_diff = 0

    # iterate over the columns
    for col in df1.columns:
        teststat, pval = ttest_ind(df1[col], df2[col])

        if pval <= alpha:
            print(f'Column {col} is statistically significantly different at alpha = {alpha}, pval = {pval}')
            num_diff += 1

    print(f'Total number of differing columns was {num_diff}, which is {num_diff / len(df1.columns) * 100}%')
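
The same column-by-column comparison can be done in one call, because `ttest_ind()` accepts 2-D arrays and an `axis` argument. The sketch below is self-contained and uses its own seeded arrays (`sim1`, `sim2`) instead of `df1` and `df2`, so its counts will differ from the notebook's output.

```python
import numpy as np
from scipy.stats import ttest_ind

# Two independent draws from the same uniform distribution (seeded)
rng = np.random.default_rng(0)
sim1 = rng.random((100, 100))
sim2 = rng.random((100, 100))
alpha = 0.1

# axis=0 runs one independent t-test per column
result = ttest_ind(sim1, sim2, axis=0)
pvals = result.pvalue

# Columns flagged as "different" even though both come from the same source
num_diff = int((pvals <= alpha).sum())
print(f'{num_diff} of {pvals.size} columns flagged ({num_diff / pvals.size:.0%})')
```

Because both arrays come from the same distribution, every flagged column is a false positive. With $\alpha = 0.1$ we should expect roughly 10% of the columns to be flagged by chance alone.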

In [16]:
test_columns()

Column 2 is statistically significantly different at alpha = 0.1, pval = 0.02703307201080291
Column 18 is statistically significantly different at alpha = 0.1, pval = 0.05437518179739224
Column 23 is statistically significantly different at alpha = 0.1, pval = 0.08821273092482569
Column 29 is statistically significantly different at alpha = 0.1, pval = 0.08088652973798749
Column 30 is statistically significantly different at alpha = 0.1, pval = 0.0993229155258371
Column 40 is statistically significantly different at alpha = 0.1, pval = 0.05997769818346789
Column 52 is statistically significantly different at alpha = 0.1, pval = 0.0854097664243104
Column 59 is statistically significantly different at alpha = 0.1, pval = 0.056464865879298276
Column 64 is statistically significantly different at alpha = 0.1, pval = 0.0926647559806292
Column 72 is statistically significantly different at alpha = 0.1, pval = 0.039818788225454115
Column 86 is statistically significantly different at alpha = 