### Basics of Statistical Testing in Python
**Hypothesis Testing**

When we do hypothesis testing, we hold two competing statements: the first is our actual explanation, which we call the alternative hypothesis, and the second is that the explanation we have is not sufficient, which we call the null hypothesis. The test asks whether the data give us enough evidence to reject the null hypothesis. If we find a difference between groups that is unlikely to arise by chance alone, we reject the null hypothesis in favor of our alternative.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

In [2]:
df = pd.read_csv('grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [3]:
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))

There are 2315 rows and 13 columns


In [5]:
early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


In [12]:
early_finishers.shape

(1259, 13)

In [13]:
late_finishers = df[~(pd.to_datetime(df['assignment1_submission']) < '2016')]
# late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


In [14]:
late_finishers.shape

(1056, 13)

In [16]:
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024303
74.0450648477065


OK, these look pretty similar. But are they the same? What do we mean by similar? This is where Student's t-test comes in. It allows us to form the alternative hypothesis ("These are different") as well as the null hypothesis ("These are the same") and then test that null hypothesis. When doing hypothesis testing, we have to choose a significance level as a threshold for how much of a chance we're willing to accept. This significance level is typically called alpha. For this example, let's use a threshold of 0.05, or 5%, for our alpha. This is a commonly used number, but it's really quite arbitrary. The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing in Python. We're going to use the ttest_ind() function, which does an independent t-test (meaning the populations are not related to one another). The result of ttest_ind() is a t-statistic and a p-value. It's this latter value, the probability, which is most important to us: it indicates the chance (between 0 and 1) of seeing a difference at least as large as the one we observed if the null hypothesis were true.

In [17]:
# Let's bring in our ttest_ind function
from scipy.stats import ttest_ind

# Let's run this function with our two populations, looking at the assignment 1 grades
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.322354085372139, pvalue=0.1861810110171455)

So here we see that the p-value is about 0.19, and this is above our alpha value of 0.05. This means that we cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and our evidence is not strong enough (the p-value is greater than alpha) to come to a conclusion to the contrary. This doesn't mean that we have proven the populations are the same.
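
The decision rule described here can be written down in a few lines. The following is only a sketch: `decide` is a hypothetical helper name, and the two groups are synthetic stand-ins drawn with NumPy rather than the grades data.

```python
import numpy as np
from scipy.stats import ttest_ind


def decide(sample_a, sample_b, alpha=0.05):
    """Hypothetical helper: run an independent t-test and compare the p-value to alpha."""
    statistic, pvalue = ttest_ind(sample_a, sample_b)
    return {
        "statistic": float(statistic),
        "pvalue": float(pvalue),
        # True means "reject the null hypothesis at this alpha"
        "reject_null": bool(pvalue <= alpha),
    }


# Synthetic stand-ins for the two groups (assumption: not the grades.csv data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=75, scale=10, size=500)
group_b = rng.normal(loc=75, scale=10, size=500)

result = decide(group_a, group_b)
print(result)

# Shifting one group by 5 points gives a difference the test should pick up
shifted = decide(group_a, group_a + 5)
print(shifted)
```

Note that `reject_null` being False only says the evidence was not strong enough at the chosen alpha; it is not a statement that the groups are identical.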

In [18]:
# Run the same test for assignments 2 through 6
for i in range(2, 7):
    col = 'assignment{}_grade'.format(i)
    print(ttest_ind(early_finishers[col], late_finishers[col]))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


OK, so it looks like in this data we do not have enough evidence to suggest the populations differ with respect to grade. Let's take a look at those p-values for a moment though, because they can inform experimental design down the road. For instance, one of the assignments, assignment 3, has a p-value around 0.1. This means that if we had chosen an alpha of 11%, this result would have been considered statistically significant. As a researcher, this would suggest to me that there is something here worth following up on. For instance, if we had a small number of participants (we don't), or if there was something unique about this assignment as it relates to our experiment (whatever it was), then there may be follow-up experiments we could run.
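
When scanning several p-values like this, it can help to collect the per-assignment tests into one table and sort it. A minimal sketch, assuming synthetic stand-in data: `compare_columns`, `fake_early`, and `fake_late` are hypothetical names, and the numbers are randomly generated rather than taken from grades.csv.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_columns(group_a, group_b, columns):
    """Hypothetical helper: one independent t-test per column, returned as a DataFrame."""
    rows = []
    for col in columns:
        statistic, pvalue = ttest_ind(group_a[col], group_b[col])
        rows.append({"column": col, "statistic": statistic, "pvalue": pvalue})
    return pd.DataFrame(rows).set_index("column")


# Synthetic stand-ins for early_finishers and late_finishers (assumption)
rng = np.random.default_rng(1)
cols = ["assignment{}_grade".format(i) for i in range(1, 7)]
fake_early = pd.DataFrame(rng.normal(70, 12, size=(200, 6)), columns=cols)
fake_late = pd.DataFrame(rng.normal(70, 12, size=(180, 6)), columns=cols)

summary = compare_columns(fake_early, fake_late, cols)
# Smallest p-values first, so candidates for follow-up appear at the top
print(summary.sort_values("pvalue"))
```

With the real data, the same helper could be called as `compare_columns(early_finishers, late_finishers, cols)`.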

In [20]:
print(ttest_ind(early_finishers['assignment1_grade'], early_finishers['assignment1_grade']))

Ttest_indResult(statistic=0.0, pvalue=1.0)
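
Comparing a sample with itself gives a statistic of 0 and a p-value of 1 because the numerator of the t-statistic is the difference between the two sample means. The sketch below recomputes the statistic by hand using the pooled-variance formula, which is what ttest_ind() uses with its default `equal_var=True`. The name `pooled_t_statistic` is hypothetical and the samples are synthetic.

```python
import numpy as np
from scipy.stats import ttest_ind


def pooled_t_statistic(a, b):
    """Hypothetical helper: two-sample t-statistic with a pooled variance estimate."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance: per-group sample variances (ddof=1) weighted by degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Standard error of the difference between the two sample means
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / standard_error


# Synthetic samples (assumption: not the grades data)
rng = np.random.default_rng(2)
x = rng.normal(75, 10, size=300)
y = rng.normal(74, 10, size=250)

# The hand-computed value should agree with SciPy's statistic
print(pooled_t_statistic(x, y), ttest_ind(x, y).statistic)

# Identical samples: the difference in means is zero, so the statistic is zero
print(pooled_t_statistic(x, x))
```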


P-values have come under fire recently for being insufficient for telling us enough about the interactions which are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used more regularly. One issue with p-values is that as you run more tests you are likely to get a value which is statistically significant just by chance. Let's see a simulation of this. First, let's create a DataFrame of 100 columns, each with 100 numbers.

In [32]:
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.221452,0.478651,0.521055,0.653165,0.479053,0.435263,0.311085,0.403081,0.418804,0.086557,...,0.788648,0.705975,0.842279,0.350025,0.609571,0.793392,0.894431,0.664283,0.789036,0.186932
1,0.359523,0.417974,0.700005,0.641604,0.158618,0.418617,0.045775,0.577285,0.015873,0.276847,...,0.996327,0.124228,0.30412,0.852934,0.960475,0.481115,0.730303,0.907904,0.374842,0.162419
2,0.773406,0.830779,0.47713,0.811728,0.447763,0.650372,0.33146,0.598467,0.357043,0.754844,...,0.400488,0.293846,0.795132,0.819577,0.604662,0.865844,0.218583,0.731883,0.14393,0.169845
3,0.682083,0.300571,0.434676,0.673405,0.249988,0.520477,0.033481,0.055523,0.316269,0.330418,...,0.311228,0.180681,0.36777,0.22208,0.102069,0.559394,0.920476,0.191557,0.300281,0.179856
4,0.779109,0.609656,0.584169,0.970761,0.881193,0.952702,0.397944,0.705289,0.567753,0.76193,...,0.615958,0.889344,0.076961,0.239032,0.352013,0.715696,0.173988,0.320731,0.212873,0.740818


In [33]:
df2 = pd.DataFrame([np.random.random(100) for x in range(100)])
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.635657,0.182528,0.596988,0.039955,0.119214,0.132137,0.883967,0.736176,0.863172,0.352351,...,0.283568,0.860838,0.669368,0.157269,0.180762,0.554046,0.035545,0.569552,0.588088,0.214956
1,0.199336,0.767676,0.576099,0.654235,0.157193,0.626929,0.598582,0.083822,0.270944,0.688214,...,0.440801,0.308682,0.159191,0.019996,0.466058,0.091275,0.75151,0.976436,0.324459,0.474164
2,0.655089,0.363204,0.639567,0.806922,0.071888,0.32227,0.406999,0.232438,0.419471,0.718685,...,0.127344,0.664537,0.821137,0.625833,0.53702,0.547613,0.457943,0.875424,0.776509,0.509397
3,0.783878,0.124703,0.038246,0.797945,0.476767,0.491464,0.672934,0.010226,0.510626,0.534995,...,0.909539,0.092671,0.071289,0.893724,0.567123,0.209523,0.77377,0.530695,0.054186,0.165794
4,0.043826,0.239061,0.694119,0.440358,0.065531,0.764984,0.354549,0.164276,0.488101,0.443035,...,0.557233,0.006086,0.455456,0.08519,0.904183,0.701973,0.581135,0.691418,0.81114,0.470397


Are these two DataFrames the same? Maybe a better question is, for a given column inside of df1, is it the same as the corresponding column inside df2? Let's take a look. Let's say our critical value is 0.1, or an alpha of 10%. We're going to compare each column in df1 to the same-numbered column in df2, and we'll report when the p-value is at or below 10%, which means that we have sufficient evidence to say that the columns are different.

In [39]:
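def test_column(alpha=0.1):
    # Compare each column of df1 with the same-numbered column of df2
    # (both are the DataFrames defined in the cells above)
    num_diff = 0
    for col in df1.columns:
        teststat, pval = ttest_ind(df1[col], df2[col])
        # Report and count the columns whose p-value is at or below alpha
        if pval <= alpha:
            print('Col {} is statistically significantly different at alpha={}, pval={}'.format(col, alpha, pval))
            num_diff += 1
    print("Total number different was {}, which is {}%".format(num_diff, float(num_diff) / len(df1.columns) * 100))

test_column()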

Col 6 is statistically significantly different at alpha=0.1, pval=0.025563776953398405
Col 37 is statistically significantly different at alpha=0.1, pval=0.004908796955795252
Col 50 is statistically significantly different at alpha=0.1, pval=0.02959994333249068
Col 52 is statistically significantly different at alpha=0.1, pval=0.09164451908183323
Col 71 is statistically significantly different at alpha=0.1, pval=0.04117891560016706
Col 73 is statistically significantly different at alpha=0.1, pval=0.01337862913404817
Col 75 is statistically significantly different at alpha=0.1, pval=0.0892066302888323
Col 87 is statistically significantly different at alpha=0.1, pval=0.03240907632852115
Col 94 is statistically significantly different at alpha=0.1, pval=0.022876513594439905
Col 95 is statistically significantly different at alpha=0.1, pval=0.015850342273716577
Total number different was 10, which is 10.0%


In [40]:
test_column(0.05)

Col 6 is statistically significantly different at alpha=0.05, pval=0.025563776953398405
Col 37 is statistically significantly different at alpha=0.05, pval=0.004908796955795252
Col 50 is statistically significantly different at alpha=0.05, pval=0.02959994333249068
Col 71 is statistically significantly different at alpha=0.05, pval=0.04117891560016706
Col 73 is statistically significantly different at alpha=0.05, pval=0.01337862913404817
Col 87 is statistically significantly different at alpha=0.05, pval=0.03240907632852115
Col 94 is statistically significantly different at alpha=0.05, pval=0.022876513594439905
Col 95 is statistically significantly different at alpha=0.05, pval=0.015850342273716577
Total number different was 8, which is 8.0%
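
These counts are close to what simple arithmetic predicts. If every null hypothesis is true and the tests are independent (both are assumptions of this sketch), then out of m tests we expect about m × alpha false positives, and the chance of at least one is 1 − (1 − alpha)^m. The function names below are hypothetical. The last helper shows the Bonferroni correction, one common adjustment that the text above does not use, included only for illustration.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of 'significant' results when every null hypothesis is true."""
    return num_tests * alpha


def familywise_error_rate(num_tests, alpha):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(num_tests, alpha):
    """Per-test threshold under the Bonferroni correction (illustrative only)."""
    return alpha / num_tests


for alpha in (0.1, 0.05):
    print(
        "alpha={}: expected false positives={}, P(at least one)={}, Bonferroni threshold={}".format(
            alpha,
            expected_false_positives(100, alpha),
            familywise_error_rate(100, alpha),
            bonferroni_alpha(100, alpha),
        )
    )
```

For 100 columns this predicts roughly 10 flagged columns at alpha=0.1 and roughly 5 at alpha=0.05, which is in line with the 10 and 8 reported above for purely random data.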


In [42]:
df2 = pd.DataFrame([np.random.chisquare(df=1, size=100) for x in range(100)])
test_column()

Col 0 is statistically significantly different at alpha=0.1, pval=0.0001443169356067365
Col 1 is statistically significantly different at alpha=0.1, pval=3.0938073900894635e-05
Col 2 is statistically significantly different at alpha=0.1, pval=4.601049725028687e-05
Col 3 is statistically significantly different at alpha=0.1, pval=0.0005972141191740876
Col 4 is statistically significantly different at alpha=0.1, pval=0.02911415657091566
Col 5 is statistically significantly different at alpha=0.1, pval=0.00019419192631704848
Col 6 is statistically significantly different at alpha=0.1, pval=0.0005903490154332418
Col 7 is statistically significantly different at alpha=0.1, pval=7.97431225508124e-05
Col 8 is statistically significantly different at alpha=0.1, pval=0.00015426472178808678
Col 9 is statistically significantly different at alpha=0.1, pval=0.003251586291974789
Col 10 is statistically significantly different at alpha=0.1, pval=0.005242898835195929
Col 11 is statistically significa