# Basic Statistical Testing

In this lecture, we're going to review some basic statistical testing in Python. We'll talk about hypothesis testing and statistical significance, and use SciPy to run Student's t-test. We use statistics in many different ways in data science, and in this lecture I want to refresh your knowledge of hypothesis testing, a core data analysis activity behind experimentation. The goal of hypothesis testing is to determine whether, for instance, two different conditions in an experiment have produced different outcomes.

In [10]:
import pandas as pd
import numpy as np

SciPy is an interesting collection of libraries for data science. The broader SciPy ecosystem includes NumPy and pandas, as well as plotting libraries such as Matplotlib. The `scipy` package itself provides a number of other scientific functions, including the `scipy.stats` module we use here.

In [19]:
from scipy import stats
df=pd.read_csv('Downloads/grades1.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,02-11-2015 06:55,83.030552,09-11-2015 02:22,67.164441,12-11-2015 08:58,53.011553,16-11-2015 01:21,47.710398,20-11-2015 13:25,38.168318,22-11-2015 18:31
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,29-11-2015 14:57,86.290821,06-12-2015 17:41,69.772657,10-12-2015 08:54,55.098125,13-12-2015 17:32,49.588313,19-12-2015 23:26,44.629482,21-12-2015 17:07
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,09-01-2016 05:36,85.512541,09-01-2016 06:39,68.410033,15-01-2016 20:22,54.728026,11-01-2016 12:41,49.255224,11-01-2016 17:31,44.329701,17-01-2016 16:24
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,30-04-2016 06:50,68.824532,30-04-2016 17:20,61.942079,12-05-2016 07:47,49.553663,07-05-2016 16:09,49.553663,24-05-2016 12:51,44.598297,26-05-2016 08:09
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,13-12-2015 17:06,51.49104,14-12-2015 12:25,41.932832,29-12-2015 14:25,36.929549,28-12-2015 01:29,33.236594,29-12-2015 14:46,33.236594,05-01-2016 01:07


In [20]:
print("there are {} rows and {} columns".format(df.shape[0], df.shape[1]))

there are 2315 rows and 13 columns


In [40]:
# Let's call those who finished the first assignment by the end of December 2015 early finishers,
# and those who finished some time after that late finishers.
# The sample timestamps look day-first (dd-mm-yyyy), so we pass dayfirst=True when parsing.

early = df[pd.to_datetime(df['assignment1_submission'], dayfirst=True) < '2016']
early.head()
print(early.index)

Int64Index([   0,    1,    4,    5,    8,   11,   15,   17,   19,   22,
            ...
            2294, 2296, 2299, 2304, 2305, 2308, 2309, 2311, 2312, 2314],
           dtype='int64', length=1259)


In [42]:
# Everyone who is not an early finisher is a late finisher.
# We select on the index rather than on student_id, because student_id values are not all unique:
# boolean_series = ~df.student_id.isin(early['student_id'])
# late1 = df[boolean_series]
late=df[~df.index.isin(early.index)]
late.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,09-01-2016 05:36,85.512541,09-01-2016 06:39,68.410033,15-01-2016 20:22,54.728026,11-01-2016 12:41,49.255224,11-01-2016 17:31,44.329701,17-01-2016 16:24
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,30-04-2016 06:50,68.824532,30-04-2016 17:20,61.942079,12-05-2016 07:47,49.553663,07-05-2016 16:09,49.553663,24-05-2016 12:51,44.598297,26-05-2016 08:09
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,05-03-2016 11:05,69.998995,09-03-2016 07:29,55.999196,16-03-2016 22:31,50.399276,18-03-2016 07:19,45.359349,19-03-2016 10:35,45.359349,23-03-2016 14:02
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,24-01-2016 18:24,72.518481,27-01-2016 13:37,65.266633,30-01-2016 14:34,65.266633,03-02-2016 22:08,65.266633,16-02-2016 14:22,65.266633,18-02-2016 08:35
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,06-03-2016 12:06,59.270882,13-03-2016 02:07,53.343794,17-03-2016 07:30,53.343794,20-03-2016 21:45,42.675035,27-03-2016 15:55,38.407532,30-03-2016 20:33
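
The early/late split above can also be written with a single boolean mask. Here's a minimal sketch on a tiny made-up DataFrame (`toy` and its rows are illustrative, not read from `grades1.csv`), assuming the timestamps are day-first in the form `dd-mm-yyyy hh:mm`, as the sample rows suggest.

```python
import pandas as pd

# Tiny illustrative frame; these rows are made up for the sketch.
toy = pd.DataFrame({
    "assignment1_grade": [92.7, 86.8, 85.5, 86.0],
    "assignment1_submission": [
        "02-11-2015 06:55", "29-11-2015 14:57",
        "09-01-2016 05:36", "30-04-2016 06:50",
    ],
})

# Parse with an explicit day-first format so 02-11-2015 is read as 2 November.
submitted = pd.to_datetime(toy["assignment1_submission"], format="%d-%m-%Y %H:%M")

# One mask defines both groups, so every row lands in exactly one of them.
is_early = submitted < pd.Timestamp("2016-01-01")
toy_early = toy[is_early]
toy_late = toy[~is_early]

print(len(toy_early), len(toy_late))
```

Because both groups come from the same mask, there is no need to look rows up by index or by a possibly non-unique `student_id`.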


In [43]:
#comparing means for two populations
print(early['assignment1_grade'].mean())
print(late['assignment1_grade'].mean())

74.94728457018276
74.04506484775571


In [44]:
# When doing hypothesis testing, we have to choose a significance level as a threshold for how much
# chance of a false positive we're willing to accept. This significance level is typically called alpha.
# For this example, let's use a threshold of 0.05 for our alpha, which is five percent. This is a
# commonly used number, but it's really quite arbitrary.
# We're going to use the ttest_ind() function, which does an independent t-test, meaning that the
# populations in the two groups are not related to one another. The results of ttest_ind() are the
# t statistic and the p-value. The p-value matters most to us: it is the probability, between zero
# and one, of seeing a difference at least this large if the null hypothesis were true.

from scipy.stats import ttest_ind
ttest_ind(early['assignment1_grade'], late['assignment1_grade'])

Ttest_indResult(statistic=1.3223540852266782, pvalue=0.1861810110655527)
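
To see where the t statistic and p-value come from, here's a minimal sketch that computes Student's t-test by hand with a pooled variance and compares it to `ttest_ind()`. The two samples `a` and `b` are synthetic stand-ins generated with a seeded random number generator; they are not the course grades.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in samples (assumed values, not the real grades).
rng = np.random.default_rng(0)
a = rng.normal(loc=75.0, scale=10.0, size=200)
b = rng.normal(loc=74.0, scale=10.0, size=180)

n1, n2 = len(a), len(b)
dof = n1 + n2 - 2

# Pooled variance: a weighted average of the two sample variances.
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof

# t statistic: difference in means divided by its standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)

# ttest_ind() assumes equal variances by default, so it should agree.
t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```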

In [45]:
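# So here we see that the p-value is about 0.19. This is above our alpha value of 0.05,
# which means that we cannot reject the null hypothesis.
# The null hypothesis was that the two populations have the same mean grade.
# Failing to reject it does not prove the groups are the same; we just lack evidence of a difference.

# Checking the other assignments: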
print(ttest_ind(early['assignment2_grade'], late['assignment2_grade']))
print(ttest_ind(early['assignment3_grade'], late['assignment3_grade']))
print(ttest_ind(early['assignment4_grade'], late['assignment4_grade']))
print(ttest_ind(early['assignment5_grade'], late['assignment5_grade']))
print(ttest_ind(early['assignment6_grade'], late['assignment6_grade']))


Ttest_indResult(statistic=1.2514717609598656, pvalue=0.21088896265004953)
Ttest_indResult(statistic=1.6133726556911747, pvalue=0.1067999810612257)
Ttest_indResult(statistic=0.049671157436252035, pvalue=0.9603887297496588)
Ttest_indResult(statistic=-0.05279315520758355, pvalue=0.9579012741710038)
Ttest_indResult(statistic=-0.11609743331021136, pvalue=0.9075854013700602)


In [56]:
# We can see that the p-values for all the assignments are above our alpha value of 0.05, so we cannot
# reject the null hypothesis in any of these cases.

# Now, p-values have come under fire recently for not telling us enough about the interactions which
# are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used
# more regularly. One issue with p-values is that as you run more tests you're likely to get a value
# which is statistically significant just by chance. So let's see a little simulation of this,
# using 100 rows of 100 uniformly distributed random numbers.

df1 = pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()
df1.shape

(100, 100)

In [58]:
# Let's create another DataFrame in the same way
df2 = pd.DataFrame([np.random.random(100) for x in range(100)])


So are these two DataFrames the same? Maybe a better question is whether a given column inside df1 is the same as that same column inside df2. So let's take a look. Let's say our critical value here is 0.1, or an alpha of 10 percent. We're going to compare each column in df1 to the same numbered column in df2, and we'll report when the p-value is less than 10 percent, which means that we have sufficient evidence to say that the columns are different.

In [67]:
def test_col(alpha):
    # Count of columns flagged as different
    col_diff = 0

    # Iterate over the columns of df1 and compare each with the same column in df2
    for col in df1:
        statistic, pvalue = ttest_ind(df1[col], df2[col])
        if pvalue < alpha:
            col_diff = col_diff + 1
            print("there is significant difference between column {} with p-value of {} and statistics of {}".format(col, pvalue, statistic))

    print("there are {} number of columns with significant difference".format(col_diff))
    
test_col(0.1)

there is significant difference between column 0 with p-value of 0.08653497356846837 and statistics of 1.722530089416434
there is significant difference between column 17 with p-value of 0.016754877407367596 and statistics of 2.412509968809497
there is significant difference between column 18 with p-value of 0.06562215895465943 and statistics of 1.8512550306454334
there is significant difference between column 20 with p-value of 0.026016928676264144 and statistics of -2.2428174796044087
there is significant difference between column 24 with p-value of 0.005784827951891145 and statistics of -2.7900645562113175
there is significant difference between column 31 with p-value of 0.09797092119659528 and statistics of 1.6626189596984808
there is significant difference between column 36 with p-value of 0.048846807313423675 and statistics of -1.9821396584972646
there is significant difference between column 60 with p-value of 0.0797106336388247 and statistics of 1.7614208938445446
there is sign

In [69]:
# Let's try an alpha of 0.05
test_col(0.05)

there is significant difference between column 17 with p-value of 0.016754877407367596 and statistics of 2.412509968809497
there is significant difference between column 20 with p-value of 0.026016928676264144 and statistics of -2.2428174796044087
there is significant difference between column 24 with p-value of 0.005784827951891145 and statistics of -2.7900645562113175
there is significant difference between column 36 with p-value of 0.048846807313423675 and statistics of -1.9821396584972646
there is significant difference between column 61 with p-value of 0.006319751443677037 and statistics of 2.7601634242413415
there is significant difference between column 70 with p-value of 0.006581966950547492 and statistics of 2.7463361900433343
there is significant difference between column 97 with p-value of 0.019283094808979988 and statistics of 2.359295668703548
there are 7 number of columns with significant difference
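
A handful of "significant" columns is what we should expect here: if the null hypothesis is true in every column, roughly alpha times the number of tests will fall below the threshold by chance. Here's a minimal sketch of that arithmetic, together with a Bonferroni correction, which is one common way to adjust for multiple tests (it is a technique I'm adding for illustration, not something used elsewhere in this lecture). The arrays `x` and `y` are freshly generated with a seed, so the counts will not match the output above.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_tests, alpha = 100, 0.05

# Both arrays come from the same uniform distribution,
# so the null hypothesis is true in every column.
x = rng.random((100, n_tests))
y = rng.random((100, n_tests))

# With axis=0, ttest_ind() compares the arrays column by column
# and returns one p-value per column.
pvalues = ttest_ind(x, y, axis=0).pvalue

# Number of false positives we expect by chance alone.
expected_false_positives = alpha * n_tests

# Columns flagged at the plain threshold versus the Bonferroni threshold.
naive_hits = int((pvalues < alpha).sum())
bonferroni_hits = int((pvalues < alpha / n_tests).sum())

print("expected by chance:", expected_false_positives)
print("flagged at alpha:", naive_hits)
print("flagged after Bonferroni:", bonferroni_hits)
```

The Bonferroni threshold is stricter, so it can never flag more columns than the plain alpha does.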


Understand that the p-value isn't magic: it is a number you compare against a threshold when reporting results and trying to answer your hypothesis. What's a reasonable threshold? That depends on your question, and you need to engage domain experts to better understand what they would consider significant.

In [72]:
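# Now let's replace df2 with data from a different distribution. Let's choose the chi-square distribution.
# Note that df1 is still uniform on [0, 1) with a mean of 0.5, while chi-square with df=1 has a mean of 1,
# so this time the columns really do come from different populations.
df2 = pd.DataFrame([np.random.chisquare(df=1, size=100) for x in range(100)])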
test_col(0.05)

there is significant difference between column 0 with p-value of 0.017332111744844395 and statistics of -2.3997646911479746
there is significant difference between column 1 with p-value of 0.0003518055131263055 and statistics of -3.636952735569471
there is significant difference between column 2 with p-value of 2.420601313697458e-05 and statistics of -4.324576636672871
there is significant difference between column 3 with p-value of 0.005325185778819307 and statistics of -2.8178368417754234
there is significant difference between column 4 with p-value of 0.0016921614073877832 and statistics of -3.1831160794771822
there is significant difference between column 5 with p-value of 0.00024360381055035438 and statistics of -3.736990467896558
there is significant difference between column 6 with p-value of 0.0019635114930055177 and statistics of -3.137574829988468
there is significant difference between column 7 with p-value of 0.0003495055853094491 and statistics of -3.638756304938291
there 
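
Almost every column is flagged here because the two distributions have different means, not just different shapes. Here's a minimal sketch that checks the sample means and compares three tests on one pair of samples: the default Student's t-test, Welch's t-test (`equal_var=False`), and the rank-based Mann-Whitney U test. The last two are alternatives I'm adding for illustration; the samples are generated with a seed and are not the columns from the cell above.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)
uniform_sample = rng.random(100)              # theoretical mean 0.5
chisq_sample = rng.chisquare(df=1, size=100)  # theoretical mean 1, right-skewed

print(uniform_sample.mean(), chisq_sample.mean())

# Student's t-test assumes both groups share the same variance.
student = ttest_ind(uniform_sample, chisq_sample)

# Welch's t-test drops the equal-variance assumption.
welch = ttest_ind(uniform_sample, chisq_sample, equal_var=False)

# Mann-Whitney U compares ranks and does not assume normality.
mwu = mannwhitneyu(uniform_sample, chisq_sample, alternative="two-sided")

print(student.pvalue, welch.pvalue, mwu.pvalue)
```

Which test is appropriate depends on the shape of your data and on the question you are asking.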

In this lecture, we've discussed just some of the basics of hypothesis testing in Python. I introduced you to the SciPy library, which you can use for Student's t-test. We've discussed some of the practical issues which arise from looking at statistical significance. Now, there's much more to learn about hypothesis testing. For instance, there are different tests to use depending on the shape of your data, and different ways to report results instead of just p-values, such as confidence intervals or Bayesian analysis.
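
As one example of reporting more than a p-value, here's a minimal sketch of a 95 percent confidence interval for the difference between two group means, using the same pooled-variance assumption as Student's t-test. The samples `a` and `b` are synthetic and seeded; they are assumptions for illustration, not the course grades.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in samples (assumed values, not the real grades).
rng = np.random.default_rng(7)
a = rng.normal(loc=75.0, scale=10.0, size=200)
b = rng.normal(loc=74.0, scale=10.0, size=180)

n1, n2 = len(a), len(b)
dof = n1 + n2 - 2
diff = a.mean() - b.mean()

# Standard error of the difference, using the pooled variance.
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Critical t value for a two-sided 95 percent interval.
t_crit = stats.t.ppf(0.975, dof)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

p_value = stats.ttest_ind(a, b).pvalue
print("difference in means:", diff)
print("95% confidence interval:", (ci_low, ci_high))
print("p-value:", p_value)
```

The interval carries the same decision as the test (it excludes zero exactly when the p-value is below 0.05), but it also shows how large the difference plausibly is.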

# Other Forms of Structured Data

Structures like tables, graphs, and trees are representations that we impose on the underlying data. The meaning we derive from the data can change depending on which representation we choose to apply. A data scientist needs to be able to interact with a broad array of other stakeholders, and as such needs to be flexible in how they conceive of and represent data.