# Basic Statistical Testing
Using the scipy package for hypothesis testing, t-tests, and statistical significance.

In [1]:
import pandas as pd
import numpy as np

from scipy import stats

## Hypothesis Testing
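Null hypothesis, H0: there is no real effect or difference between groups (any difference we see is due to chance)
Alternative hypothesis, Ha: our actual explanation, i.e. there is a real effect or difference

The test asks whether the data are consistent with H0. If the difference we observe between groups would be unlikely under H0, we reject H0 in favor of Ha; otherwise we fail to reject H0.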

In [2]:
df = pd.read_csv('datasets/grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [3]:
# summary stats
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns")

There are 2315 rows and 13 columns


Example: segment this population into two groups: those who submitted assignment1 by the end of December 2015 (early finishers) and those who submitted after that (late finishers).

In [9]:
early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


In [10]:
# '~' inverts a boolean mask element-wise, so this keeps the rows of df whose index is not in early_finishers
late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000
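
The same split can also be built from a single boolean mask and its complement. The sketch below uses a small made-up DataFrame (the `toy` values are assumptions for illustration, not rows from `grades.csv`) to show that the two groups partition the data with no row lost or counted twice.

```python
import pandas as pd

# Toy data standing in for grades.csv (values are made up for illustration)
toy = pd.DataFrame({
    "assignment1_grade": [90.0, 80.0, 70.0, 60.0],
    "assignment1_submission": [
        "2015-11-02 06:55:34", "2015-12-31 23:59:59",
        "2016-01-01 00:00:00", "2016-03-05 11:05:25",
    ],
})

# One boolean mask defines both groups
is_early = pd.to_datetime(toy["assignment1_submission"]) < pd.Timestamp("2016-01-01")
early = toy[is_early]
late = toy[~is_early]

# Every row lands in exactly one group
print(len(early), len(late), len(early) + len(late) == len(toy))
```

Keeping the mask in a variable avoids the extra `isin` lookup and makes the cutoff date explicit.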


Let's compare the average assignment1 grade for both groups:

In [11]:
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024304
74.0450648477065


These average grades look similar, but are they the same? This is when we do a t-test.
H0: means are the same
Ha: means are different

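We need to choose a significance level (alpha) before testing: the probability of rejecting H0 when it is actually true that we are willing to accept. Common choices are 0.05 and 0.1.

Independent t-test in scipy: `ttest_ind()`. 'Independent' means the two samples are not related to each other (as opposed to 'dependent', or paired, samples)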
* returns a t-statistic and a p-value

**p-value**: "p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct" -Wikipedia; "measures how likely it is that any observed difference between groups is due to chance" -NIH

**Significance is the strength of the evidence in the data against H0 (summarized by the p-value); alpha is the threshold the p-value must fall below before we call a result statistically significant.**

In [12]:
from scipy.stats import ttest_ind

ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.3223540853721596, pvalue=0.18618101101713855)

Since the p-value (about 0.19) is greater than common alpha levels such as 0.05 or 0.1, we **fail to reject H0**. We don't have enough evidence to say that the two group means are different. This doesn't mean, though, that we have proven the populations are the same.
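
To see where the t-statistic and p-value come from, here is a minimal sketch that computes both by hand and compares them with `ttest_ind()`. It assumes the default equal-variance (pooled) form of the test and uses synthetic normal samples; the sample sizes, means, and seed are made up for illustration and are not the grades data.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumed values, not the grades data)
rng = np.random.default_rng(0)
a = rng.normal(loc=75.0, scale=10.0, size=200)
b = rng.normal(loc=74.0, scale=10.0, size=300)

na, nb = len(a), len(b)

# Pooled sample variance, as used by the equal-variance t-test
pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)

# t-statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / na + 1 / nb))

# Two-sided p-value from the t distribution with na + nb - 2 degrees of freedom
dof = na + nb - 2
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

The p-value is the area in both tails of the t distribution beyond the observed statistic, which is the "at least as extreme" part of the definition quoted above.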

Let's check the other grades:

In [14]:
# run the same test for assignments 2 through 6
for n in range(2, 7):
    col = f'assignment{n}_grade'
    print(ttest_ind(early_finishers[col], late_finishers[col]))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


None of these p-values are less than alpha, so we do not have enough evidence to suggest the populations differ with respect to grades.

Assignment 3 has the smallest p-value (about 0.107); it would have counted as statistically significant only if we had set alpha at 0.11 or higher. We would want to follow up to see whether the sample was small or whether something about Assignment 3 was unique as it relates to our experiment.

One issue with p-values is that the more tests you run, the more likely you are to get a statistically significant result just by chance...

Let's do a simulation:

In [18]:
# create a dataframe 100x100
## 'np.random.random(100)' generates an array of 100 random numbers drawn uniformly from [0, 1)
## 'for x in range(100)' repeats this 100 times, giving 100 rows
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.733228,0.366594,0.964721,0.961578,0.484063,0.13157,0.171321,0.841881,0.351869,0.54092,...,0.378652,0.053672,0.657333,0.881828,0.715329,0.944076,0.798475,0.737232,0.084891,0.837215
1,0.268086,0.264791,0.030123,0.816405,0.766911,0.792177,0.524347,0.181601,0.605199,0.29952,...,0.55218,0.31381,0.088114,0.523216,0.801485,0.178042,0.833063,0.688618,0.295871,0.211357
2,0.299918,0.564006,0.444239,0.504171,0.009856,0.935987,0.703496,0.695048,0.580404,0.368667,...,0.685697,0.252339,0.234802,0.929907,0.28403,0.499385,0.091226,0.010809,0.990139,0.320404
3,0.79739,0.313888,0.770587,0.325242,0.979485,0.367983,0.697654,0.54394,0.395214,0.890159,...,0.512297,0.851374,0.612137,0.273717,0.835546,0.468106,0.251364,0.641091,0.179557,0.122328
4,0.262453,0.280336,0.742288,0.973137,0.052465,0.755196,0.348542,0.501298,0.011842,0.842244,...,0.695845,0.201221,0.566184,0.423939,0.169587,0.978372,0.136212,0.132963,0.272891,0.725905


In [19]:
df2 = pd.DataFrame([np.random.random(100) for x in range(100)])

# are these two DataFrames the same? Is a given column in df1 the same as the corresponding column in df2?

# let alpha = 0.1

def test_columns(alpha=0.1):
    # counter for how many columns differ
    num_diff = 0
    
    for col in df1.columns:
        teststat, pval = ttest_ind(df1[col], df2[col])
        
        if pval <= alpha:
            print(f"Col {col} is statistically significantly different at alpha = {alpha}, pval = {pval}")
            num_diff += 1
            
    print(f"Total number different was {num_diff}, which is {float(num_diff)/len(df1.columns)*100}%")
    
test_columns()

Col 14 is statistically significantly different at alpha = 0.1, pval = 0.011209425411090351
Col 15 is statistically significantly different at alpha = 0.1, pval = 0.055545987065756766
Col 21 is statistically significantly different at alpha = 0.1, pval = 0.018123811373954943
Col 33 is statistically significantly different at alpha = 0.1, pval = 0.04084286921750802
Col 35 is statistically significantly different at alpha = 0.1, pval = 0.02683828641914503
Col 44 is statistically significantly different at alpha = 0.1, pval = 0.06867635716500986
Col 48 is statistically significantly different at alpha = 0.1, pval = 0.017587687263516637
Col 57 is statistically significantly different at alpha = 0.1, pval = 0.0698253477108304
Col 64 is statistically significantly different at alpha = 0.1, pval = 0.057929613022293165
Col 74 is statistically significantly different at alpha = 0.1, pval = 0.05258193216497991
Col 85 is statistically significantly different at alpha = 0.1, pval = 0.0857711189535

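Of the 100 columns we tested, 12 come out as different, even though both DataFrames were drawn from the same distribution. That checks out: with alpha = 0.1 we expect roughly 10% of the tests to be false positives just by chance.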
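
The arithmetic behind that expectation can be sketched as follows. It assumes the 100 tests are independent and that H0 is true for every column, which holds here because both DataFrames come from the same random generator.

```python
alpha = 0.1
num_tests = 100

# Expected number of false positives when H0 is true for every test
expected_false_positives = alpha * num_tests

# Probability that at least one of the tests is a false positive
prob_at_least_one = 1 - (1 - alpha) ** num_tests

print(expected_false_positives)
print(prob_at_least_one)
```

With this many tests, at least one "significant" column is nearly guaranteed, so a single low p-value among many tests is weak evidence on its own.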

In [20]:
test_columns(0.05)

Col 14 is statistically significantly different at alpha = 0.05, pval = 0.011209425411090351
Col 21 is statistically significantly different at alpha = 0.05, pval = 0.018123811373954943
Col 33 is statistically significantly different at alpha = 0.05, pval = 0.04084286921750802
Col 35 is statistically significantly different at alpha = 0.05, pval = 0.02683828641914503
Col 48 is statistically significantly different at alpha = 0.05, pval = 0.017587687263516637
Total number different was 5, which is 5.0%


Let's recreate `df2` using a different distribution, chi-squared, while `df1` stays uniform:

In [21]:
df2 = pd.DataFrame([np.random.chisquare(df=1, size=100) for x in range(100)])
test_columns()

Col 0 is statistically significantly different at alpha = 0.1, pval = 0.0011197540560035028
Col 1 is statistically significantly different at alpha = 0.1, pval = 7.31568276199611e-05
Col 2 is statistically significantly different at alpha = 0.1, pval = 0.0010869864207870951
Col 3 is statistically significantly different at alpha = 0.1, pval = 2.2025680021330184e-05
Col 4 is statistically significantly different at alpha = 0.1, pval = 7.667183904667369e-05
Col 5 is statistically significantly different at alpha = 0.1, pval = 0.0034584731889979073
Col 6 is statistically significantly different at alpha = 0.1, pval = 0.00547519027951676
Col 7 is statistically significantly different at alpha = 0.1, pval = 0.000300942295659288
Col 8 is statistically significantly different at alpha = 0.1, pval = 0.0005331152071709387
Col 9 is statistically significantly different at alpha = 0.1, pval = 0.03764011678302359
Col 10 is statistically significantly different at alpha = 0.1, pval = 4.122816132530

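Almost all of the columns test as statistically significantly different at the 10% level. That is expected: the uniform values in `df1` have a mean of 0.5, while a chi-squared distribution with 1 degree of freedom has a mean of 1.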
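
A quick way to check why the two sources differ is to draw a large sample from each and compare the sample means. This is a sketch with an arbitrary seed and sample size (both assumptions), using NumPy's `default_rng` generator rather than the legacy `np.random` functions used above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Large samples so the sample means sit close to the theoretical means
uniform_sample = rng.random(100_000)               # uniform on [0, 1), mean 0.5
chisq_sample = rng.chisquare(df=1, size=100_000)   # chi-squared, 1 d.o.f., mean 1

print(round(uniform_sample.mean(), 2), round(chisq_sample.mean(), 2))
```

Because the underlying means really are different, the t-test is correctly rejecting H0 here, unlike the false positives in the earlier simulation.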

## Other Forms of Structured Data
Tabular: rows and columns; in pandas this is the DataFrame

Network: individual users have attributes and can be connected to other individuals who have their own attributes; made up of nodes (people, teams, planets) that are connected by edges 

Tree: a hierarchical network of parent, child, and sibling nodes; root and leaf nodes
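
As a rough illustration, both structures can be represented with plain Python dictionaries. The node names, attributes, and the `leaves` helper below are hypothetical, chosen only to show the shape of the data.

```python
# Network: nodes carry attributes, edges are stored as adjacency sets
nodes = {
    "ana": {"team": "data"},
    "ben": {"team": "web"},
    "cy": {"team": "data"},
}
edges = {
    "ana": {"ben", "cy"},
    "ben": {"ana"},
    "cy": {"ana"},
}

# Tree: each node holds its children; the top node is the root
tree = {
    "name": "root",
    "children": [
        {"name": "a", "children": [{"name": "a1", "children": []}]},
        {"name": "b", "children": []},
    ],
}


def leaves(node):
    """Return the names of all leaf nodes (nodes with no children)."""
    if not node["children"]:
        return [node["name"]]
    found = []
    for child in node["children"]:
        found.extend(leaves(child))
    return found


print(sorted(edges["ana"]))  # neighbors of one node in the network
print(leaves(tree))          # leaf nodes of the tree
```

In the tree, "a" and "b" are siblings, "a1" is a child of "a", and the leaves are the nodes with an empty `children` list.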