**Index**
- Statistical testing in Python
- Hypothesis testing
- Statistical significance
****

Statistics underpins much of data science.<br>**Hypothesis testing** is a core data analysis activity behind experimentation. Its goal is to determine whether, for instance, two different conditions in an experiment have produced different outcomes.

In [2]:
import pandas as pd
import numpy as np

from scipy import stats

The `scipy` package is one part of the wider SciPy ecosystem of libraries for scientific computing in Python, and you'll likely use most or all of them. That ecosystem also includes NumPy and pandas, plotting libraries such as Matplotlib, and a number of other scientific tools.

When we run a hypothesis test, we have two statements of interest. The first is our actual explanation, which we call the alternative hypothesis; the second is that our explanation is not sufficient, which we call the null hypothesis.<br>
The test asks whether the data give us enough evidence to reject the null hypothesis. If we find a significant difference between groups, we reject the null hypothesis in favour of the alternative.

**Example :**

In [5]:
df = pd.read_csv('./datasets/grades.csv')     # importing grades.csv dataset
df


Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.813800,2015-12-13 17:06:10.750000000,51.491040,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2310,DE88902E-C7A7-E37A-CFA7-F2C8F2D219F2,77.684611,2016-03-07 02:52:24.378000000,69.916150,2016-03-11 22:02:39.161000000,69.916150,2016-03-17 07:30:09.261000000,69.916150,2016-03-18 18:01:24.525000000,55.932920,2016-03-20 06:38:12.120000000,50.339628,2016-03-25 11:00:06.923000000
2311,DE88902E-C7A7-E37A-CFA7-F2C8F2D219F2,75.367870,2015-11-29 02:43:27.932000000,59.934296,2015-12-03 05:30:39.218000000,48.687437,2015-12-09 15:56:44.895000000,43.008693,2015-12-13 06:18:01.342000000,38.707824,2015-12-20 02:39:39.248000000,38.707824,2015-12-22 13:34:42.931000000
2312,EFDA9F93-D0C3-864F-B0F6-2E9AA3E05E31,73.269463,2015-10-20 08:09:27.418000000,58.255570,2015-11-18 19:07:06.930000000,58.955570,2015-12-10 08:54:54.871000000,52.250013,2015-11-23 19:40:00.434000000,41.800010,2015-11-29 14:23:43.659000000,41.800010,2015-12-04 09:56:07.156000000
2313,1F51E050-78F7-F270-1B90-ED1BC0376763,87.268366,2016-04-03 09:04:51.646000000,87.268366,2016-04-08 19:24:29.095000000,87.268366,2016-04-12 05:43:33.853000000,69.814693,2016-04-14 10:43:58.104000000,55.851754,2016-04-19 05:37:19.322000000,55.851754,2016-04-23 03:44:06.813000000


In [7]:
print("There are {} rows and {} columns in above dataframe".format(df.shape[0], df.shape[1]))


There are 2315 rows and 13 columns in above dataframe


Let's segment this population into two groups.<br>Those who finished the first assignment by the end of December 2015 are the **early finishers**, and<br>those who finished some time later than that are the **late finishers**.

In [9]:
early_finishers = df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers


Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.813800,2015-12-13 17:06:10.750000000,51.491040,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.052550,2016-01-03 21:05:38.392000000,64.752550,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2308,EFDA9F93-D0C3-864F-B0F6-2E9AA3E05E31,71.481182,2015-10-03 09:04:46.358000000,70.981182,2015-10-06 03:57:28.420000000,64.603064,2015-10-12 07:58:25.081000000,63.703064,2015-10-17 07:59:49.005000000,50.962451,2015-10-18 02:29:34.374000000,45.866206,2015-10-27 00:21:47.208000000
2309,6D2AB78F-44F4-2E8B-5C5E-B79119BC7EAC,82.640274,2015-10-01 23:25:20.529000000,65.752219,2015-10-05 02:06:11.522000000,53.341775,2015-10-22 23:58:36.426000000,47.197598,2015-10-16 12:32:56.809000000,47.197598,2015-10-24 12:16:54.993000000,37.758078,2015-10-26 10:34:41.293000000
2311,DE88902E-C7A7-E37A-CFA7-F2C8F2D219F2,75.367870,2015-11-29 02:43:27.932000000,59.934296,2015-12-03 05:30:39.218000000,48.687437,2015-12-09 15:56:44.895000000,43.008693,2015-12-13 06:18:01.342000000,38.707824,2015-12-20 02:39:39.248000000,38.707824,2015-12-22 13:34:42.931000000
2312,EFDA9F93-D0C3-864F-B0F6-2E9AA3E05E31,73.269463,2015-10-20 08:09:27.418000000,58.255570,2015-11-18 19:07:06.930000000,58.955570,2015-12-10 08:54:54.871000000,52.250013,2015-11-23 19:40:00.434000000,41.800010,2015-11-29 14:23:43.659000000,41.800010,2015-12-04 09:56:07.156000000


There are several ways to build the **late_finishers** dataframe.
1. You could copy and paste the first projection and change the comparison from less than to greater than or equal to. But if you later decide to change the date, you have to remember to change it in two places.

2. You could join the DataFrame df with early_finishers.<br>A left join keeps every row of the left DataFrame, so you would then keep only the rows of df that found no match in early_finishers.

3. You could write a function that determines whether someone is early or late, call .apply( ) on the DataFrame, and add the result as a new column.

4. Since the dataframes df and early_finishers share index values, we want everything in df which is not in early_finishers.<br>Take the inverse of df.index.isin(early_finishers.index).
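The first three options can be sketched on a small stand-in frame. The data below is made up for illustration, and the names `toy`, `CUTOFF`, `is_early`, and `finished_early` are assumptions rather than anything from the grades dataset.

```python
import pandas as pd

# Hypothetical stand-in for the grades data: three submissions around the cutoff.
toy = pd.DataFrame({
    "student_id": ["a", "b", "c"],
    "assignment1_submission": [
        "2015-11-02 06:55:34",
        "2016-01-09 05:36:02",
        "2015-12-28 04:35:32",
    ],
})
submitted = pd.to_datetime(toy["assignment1_submission"])

# Option 1: keep the cutoff in one variable so it only needs changing once.
CUTOFF = pd.Timestamp("2016-01-01")
early = toy[submitted < CUTOFF]
late_by_mask = toy[submitted >= CUTOFF]

# Option 2: a left merge with indicator=True marks rows with no match in `early`.
merged = toy.merge(early[["student_id"]], on="student_id", how="left", indicator=True)
late_by_merge = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Option 3: label each row with a function, then filter on the new column.
def is_early(row):
    return pd.to_datetime(row["assignment1_submission"]) < CUTOFF

labelled = toy.assign(finished_early=toy.apply(is_early, axis=1))
late_by_apply = labelled[~labelled["finished_early"]].drop(columns="finished_early")

print(late_by_mask["student_id"].tolist())
```

The merge in option 2 only works here because `student_id` is unique in the toy frame. In the grades data the same `student_id` appears on more than one row (rows 2310 and 2311, for example), so a merge on that column alone would not separate the groups cleanly, which is one reason the index-based approach is convenient.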

In [12]:
# implementation of 4th method...

late_finishers = df[~df.index.isin(early_finishers.index)]
late_finishers


Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.576090,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2303,DDE0526B-7DA4-80E8-C2A6-D097F3826029,97.215052,2016-01-23 09:19:40.494000000,77.772041,2016-01-26 08:38:13.085000000,69.994837,2016-01-29 15:04:34.705000000,62.995353,2016-02-06 12:51:00.647000000,50.396283,2016-02-11 15:44:08.113000000,50.396283,2016-02-14 09:03:33.466000000
2306,DDE0526B-7DA4-80E8-C2A6-D097F3826029,47.696703,2016-06-22 20:21:58.182000000,38.157363,2016-06-23 18:22:45.622000000,38.157363,2016-07-02 22:18:59.529000000,30.525890,2016-06-28 22:05:38.100000000,30.525890,2016-07-09 20:08:14.734000000,30.525890,2016-07-17 15:17:58.502000000
2307,1F51E050-78F7-F270-1B90-ED1BC0376763,94.595758,2016-01-20 23:22:16.592000000,85.136182,2016-01-27 22:30:29.914000000,76.622564,2016-01-31 15:39:45.088000000,68.960307,2016-02-06 21:43:05.836000000,62.064277,2016-02-14 18:52:48.594000000,49.651421,2016-02-19 20:36:16.121000000
2310,DE88902E-C7A7-E37A-CFA7-F2C8F2D219F2,77.684611,2016-03-07 02:52:24.378000000,69.916150,2016-03-11 22:02:39.161000000,69.916150,2016-03-17 07:30:09.261000000,69.916150,2016-03-18 18:01:24.525000000,55.932920,2016-03-20 06:38:12.120000000,50.339628,2016-03-25 11:00:06.923000000


****
The pandas DataFrame object has a variety of statistical functions associated with it. If we call the mean function directly on the DataFrame, the mean of each assignment grade is calculated. Let's compare the means for our two populations:

In [20]:
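# numeric_only=True restricts the mean to the grade columns (IDs and timestamps are skipped)
print("Early Finishers :\n",early_finishers.mean(numeric_only=True))

print("_______________\n")

print("Late Finishers :\n",late_finishers.mean(numeric_only=True))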

Early Finishers :
 assignment1_grade    74.947285
assignment2_grade    67.229129
assignment3_grade    61.098805
assignment4_grade    54.126001
assignment5_grade    48.604524
assignment6_grade    43.812144
dtype: float64
_______________

Late Finishers :
 assignment1_grade    74.045065
assignment2_grade    66.395813
assignment3_grade    60.056162
assignment4_grade    54.095552
assignment5_grade    48.635211
assignment6_grade    43.876394
dtype: float64


These results look pretty similar.<br>But are they the same? What do we mean by similar?<br>This is where Student's t-test comes in. It lets us form the alternative hypothesis ("These are different") as well as the null hypothesis ("These are the same") and then test that null hypothesis.

When doing hypothesis testing, we have to choose a significance level as a threshold for how much risk of a false positive we're willing to accept. This significance level is typically called **alpha**.<br>
For this example, let's use a threshold of 0.05 or 5% for our alpha.<br>
(This is a commonly used number but it's really quite arbitrary.)

The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing in Python.<br>
The **ttest_ind( )** function performs an independent t-test (meaning the populations are not related to one another). The result of ttest_ind( ) is a tuple of the form (t-statistic, p-value).<br>
It's this latter value, the p-value, which is most important to us: it is the probability (between 0 and 1) of seeing a difference at least as large as the one observed if the null hypothesis were true.
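As a minimal sketch of what the call returns, the example below runs the test on synthetic data. The arrays `group_a` and `group_b`, their parameters, and the seed are assumptions made up for illustration; they are not the grades data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary fixed seed so the run is repeatable
group_a = rng.normal(loc=75.0, scale=10.0, size=200)
group_b = rng.normal(loc=75.0, scale=10.0, size=200)

result = stats.ttest_ind(group_a, group_b)

# The result can be unpacked like a tuple or read through named attributes.
t_stat, p_value = result
print("t-statistic:", result.statistic)
print("p-value:", result.pvalue)

# equal_var=False selects Welch's t-test, which does not assume equal variances.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print("Welch p-value:", welch.pvalue)
```

By default `ttest_ind` assumes the two populations have equal variances; when that assumption is doubtful, the Welch variant is a common alternative.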

Now, let's import the ttest_ind function from SciPy and run it on our two populations - early_finishers & late_finishers - for now using only the assignment 1 grades.

In [21]:
from scipy.stats import ttest_ind

ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.3223540853721596, pvalue=0.18618101101713855)

Notice that the p-value comes out at about 0.19, which is well above our alpha of 0.05.<br>This means that we cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and we do not have enough evidence (the p-value is greater than alpha) to conclude otherwise. However, this does not prove that the populations are the same.

Let's run ttest_ind for the other assignments...

In [22]:
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))



Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
(assignment 4 output not recorded)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


From the above analysis, it looks like we do not have enough evidence in this data to suggest the populations differ with respect to grade.<br>Let's take a look at those p-values for a moment though, because they can inform experimental design down the road. For instance, one of the assignments, assignment 3, has a p-value of around 0.107. This means that had we chosen an alpha of 0.11, this result would have been considered statistically significant.

**P-values** have come under fire recently for being insufficient to tell us enough about the interactions which are happening, and two other techniques, **confidence intervals** and **Bayesian analyses**, are being used more regularly.
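To make the confidence-interval idea concrete, here is a rough sketch of an interval for the difference between two group means, using the Welch approximation. The helper `welch_confidence_interval`, the synthetic samples, and the seed are all assumptions introduced for illustration; this is one common construction, not a method taken from this notebook.

```python
import numpy as np
from scipy import stats

def welch_confidence_interval(a, b, confidence=0.95):
    """Interval for mean(a) - mean(b) without assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error contributed by each sample
    var_a = a.var(ddof=1) / a.size
    var_b = b.var(ddof=1) / b.size
    diff = a.mean() - b.mean()
    std_err = np.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (a.size - 1) + var_b ** 2 / (b.size - 1))
    t_crit = stats.t.ppf(0.5 + confidence / 2, dof)
    return diff - t_crit * std_err, diff + t_crit * std_err

rng = np.random.default_rng(1)  # arbitrary seed
sample_a = rng.normal(75.0, 10.0, size=300)  # made-up "grades"
sample_b = rng.normal(74.0, 10.0, size=300)

low, high = welch_confidence_interval(sample_a, sample_b)
print("95% interval for the difference in means: [{:.2f}, {:.2f}]".format(low, high))
```

Unlike a bare p-value, the interval reports a plausible range for the size of the difference. If the 95% interval contains zero, the matching Welch t-test would not reject the null hypothesis at alpha = 0.05.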

One issue with p-values is that the more tests you run, the more likely you are to get a statistically significant value just by chance.
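A quick calculation shows how fast this risk grows. Assuming the tests are independent and every null hypothesis is true, the chance of at least one false positive across m tests is 1 - (1 - alpha)^m. The function name `familywise_error_rate` is an assumption used only for this sketch.

```python
def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** num_tests

# How the risk grows with the number of tests at alpha = 0.05
for num_tests in (1, 5, 20, 100):
    rate = familywise_error_rate(0.05, num_tests)
    print("{:>3} tests -> {:.1%} chance of a false positive".format(num_tests, rate))
```

With a single test the risk equals alpha, but it rises quickly as more tests are added.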

Let's see a simulation of this:<br>
first, let's create two dataframes, each holding 100 × 100 random numbers drawn uniformly from [0, 1)...

In [28]:
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])

df2 = pd.DataFrame([np.random.random(100) for x in range(100)])

A natural question is: are these two DataFrames the same? Or, a better question,<br>
is a given column in df1 the same as the matching column in df2?

Let's take a look...<br>Let's say our critical value is 0.1, or an alpha of 10%. We're going to compare each column in df1 to the same-numbered column in df2, and we'll report when the p-value is at or below 10%, which we take as sufficient evidence to say that the columns are different.


In [31]:
# compare matching columns of df1 and df2 and report the ones that differ
def test_columns(alpha = 0.1):
    num_diff=0
    # iterate over the columns
    for col in df1.columns:
        # run an independent t-test on the matching columns of the two dataframes
        teststat, pval = ttest_ind(df1[col], df2[col])
        # compare the p-value against alpha
        if pval <= alpha:
            # report the column and count it as different
            print("Col {} is statistically significantly different at alpha={}, pval={}".format(col, alpha, pval))
            num_diff = num_diff + 1
    # print summary statistics
    print("Total number different was {}, which is {}%".format(num_diff,float(num_diff)/len(df1.columns)*100))


test_columns()

Col 42 is statistically significantly different at alpha=0.1, pval=0.0949427444411192
Col 48 is statistically significantly different at alpha=0.1, pval=0.08813910749540192
Col 53 is statistically significantly different at alpha=0.1, pval=0.060525737856357426
Col 58 is statistically significantly different at alpha=0.1, pval=0.06408238245095281
Col 60 is statistically significantly different at alpha=0.1, pval=0.04820293132210308
Col 87 is statistically significantly different at alpha=0.1, pval=0.017457887904518066
Total number different was 6, which is 6.0%
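The same column-by-column comparison can also be done in a single call, since `ttest_ind` accepts 2-D arrays and tests along `axis=0` by default. The sketch below regenerates its own random frames with a seeded generator, so the flagged columns will not match the output above; `left`, `right`, and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)  # arbitrary seed
left = pd.DataFrame(rng.random((100, 100)))
right = pd.DataFrame(rng.random((100, 100)))

# axis=0 runs one independent t-test per column and returns arrays of results
_, pvals = ttest_ind(left.to_numpy(), right.to_numpy(), axis=0)

alpha = 0.1
flagged = np.flatnonzero(pvals <= alpha)
print("Flagged columns:", flagged.tolist())
print("Share flagged: {:.1f}%".format(100 * flagged.size / pvals.size))
```

Passing the frames in explicitly, rather than reading module-level `df1` and `df2`, also makes the comparison easier to reuse on other data.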


Interesting! Notice that a handful of columns are flagged as different, even though both dataframes were filled by the same random generator. In fact, the share flagged is in the same ballpark as the alpha value we chose. So what's going on - shouldn't all of the columns be the same?

Remember that all the t-test does is check whether two sets look different at some significance level, which in the above example was 10%.
The more random comparisons you make, the more of them will appear different just by chance. In this example we checked 100 columns, so with an alpha of 0.1 we would expect roughly 10 of them to be flagged as different.
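The arithmetic behind that expectation is simply alpha times the number of tests. One common guard against chance findings, which this notebook does not use, is the Bonferroni correction: divide alpha by the number of tests. The sketch below assumes synthetic uniform data and an arbitrary seed.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)  # arbitrary seed
num_tests, alpha = 100, 0.1

# Both arrays come from the same distribution, so every null hypothesis is true.
a = rng.random((100, num_tests))
b = rng.random((100, num_tests))
pvals = ttest_ind(a, b, axis=0).pvalue

expected_by_chance = alpha * num_tests  # average number of false positives
flagged_raw = int((pvals <= alpha).sum())
flagged_bonferroni = int((pvals <= alpha / num_tests).sum())

print("Expected by chance:", expected_by_chance)
print("Flagged at alpha:", flagged_raw)
print("Flagged at alpha / num_tests:", flagged_bonferroni)
```

The corrected threshold is deliberately strict: it trades fewer chance findings for a higher risk of missing real differences.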

In [33]:
test_columns(0.05)

Col 60 is statistically significantly different at alpha=0.05, pval=0.04820293132210308
Col 87 is statistically significantly different at alpha=0.05, pval=0.017457887904518066
Total number different was 2, which is 2.0%


> Remember, when running a statistical test like the t-test, which produces a p-value, that this p-value isn't magic: you compare it against a threshold of your choosing when reporting results and trying to answer your hypothesis.

> What's a reasonable threshold?<br>It depends on your question, and you need to engage domain experts to better understand what they would consider significant.

****

Let's recreate the second dataframe by drawing from a different distribution, chi-squared, and then run the test_columns( ) function again.

_Notice in the output below that all or most columns test as statistically significant at the 10% level._

In [34]:
df2 = pd.DataFrame([np.random.chisquare(df=1, size=100) for x in range(100)])

test_columns()

Col 0 is statistically significantly different at alpha=0.1, pval=6.503724893393038e-05
Col 1 is statistically significantly different at alpha=0.1, pval=0.0019213882910425405
Col 2 is statistically significantly different at alpha=0.1, pval=0.0008014763740405239
Col 3 is statistically significantly different at alpha=0.1, pval=0.0008537927719469816
Col 4 is statistically significantly different at alpha=0.1, pval=0.00036202730314051416
Col 5 is statistically significantly different at alpha=0.1, pval=0.00027660656642037066
Col 6 is statistically significantly different at alpha=0.1, pval=0.00039663848787916523
Col 7 is statistically significantly different at alpha=0.1, pval=0.04873345787113976
Col 8 is statistically significantly different at alpha=0.1, pval=0.0029824036138885028
Col 9 is statistically significantly different at alpha=0.1, pval=0.0009382154296872901
Col 10 is statistically significantly different at alpha=0.1, pval=0.00045838285169394325
Col 11 is statistically signi
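A likely reason nearly every column is flagged is that the two distributions have different means, and the t-test compares means. Values from `np.random.random` are uniform on [0, 1) with a mean of 0.5, while a chi-squared distribution with 1 degree of freedom has a mean of 1. The sketch below checks this on large samples; the seed and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
uniform_sample = rng.random(100_000)
chisq_sample = rng.chisquare(df=1, size=100_000)

# The sample means should land close to the theoretical values 0.5 and 1.
print("Uniform mean:", uniform_sample.mean())
print("Chi-squared mean:", chisq_sample.mean())
```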

****
<u>**Certain key points about Statistical Testing :**</u>
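- Significance refers to how strong the evidence in the data is against the null hypothesis.
- The p-value measures that evidence: the smaller it is, the stronger the evidence against the null hypothesis.
- alpha is the threshold we choose in advance: we reject the null hypothesis when the p-value falls at or below it.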

****