# Some statistical testing in Python
Hypothesis testing, statistical significance, and t-tests

### Hypothesis testing
Hypothesis testing is a statistical methodology that helps us make decisions about one or more populations based on information obtained from a sample.

It lets us check whether the sample data provide evidence for or against a stated statistical hypothesis.

When making decisions, it is convenient to formulate assumptions or conjectures about the populations of interest. These generally take the form of statements about the population parameters (μ, σ², p).

These assumptions, which may or may not be true, are called statistical hypotheses.

In many practical situations the researcher wants to verify a claim about one or more population parameters (μ, σ², p) or about the distribution of a random variable.

Hypothesis tests provide tools that let us reject or fail to reject a statistical hypothesis based on the evidence provided by the sample.

In [None]:
# A hypothesis test involves two statements of interest: the first is our actual explanation, which we call the
# alternative hypothesis, and the second is that our explanation is not sufficient, which we call the null hypothesis.

# The test determines whether the data give enough evidence to reject the null hypothesis.

# The null hypothesis is what we are trying to refute.


In [5]:
# The goal of a hypothesis test is to determine whether, for example, two different conditions led to different outcomes

import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv('datasets/grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [6]:
# Let's split the data into two groups. Students who submitted assignment 1 ('assignment1_submission') before 2016 are the
# 'early finishers', and everyone else is the 'late finishers'.

In [7]:
early_finishers=df[pd.to_datetime(df['assignment1_submission']) < '2016']
late_finishers=df[~df.index.isin(early_finishers.index)]
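
The split above can also be written with a single boolean mask. The sketch below is only an illustration: it assumes a miniature table `toy` built from the five rows displayed earlier (grades rounded, timestamps truncated to seconds), and the names `toy`, `submitted` and `is_early` are not part of the original notebook.

```python
import pandas as pd

# Miniature of the grades table: the five rows shown by df.head(), rounded
toy = pd.DataFrame({
    "assignment1_grade": [92.7, 86.8, 85.5, 86.0, 64.8],
    "assignment1_submission": [
        "2015-11-02 06:55:34",
        "2015-11-29 14:57:44",
        "2016-01-09 05:36:02",
        "2016-04-30 06:50:39",
        "2015-12-13 17:06:10",
    ],
})

# Parse the dates once, then compare against an explicit cutoff Timestamp
submitted = pd.to_datetime(toy["assignment1_submission"])
is_early = submitted < pd.Timestamp("2016-01-01")

# The mask and its negation partition the rows into two disjoint groups
early = toy[is_early]
late = toy[~is_early]

print(len(early), len(late))  # 3 2
```

Using the mask and its negation guarantees that every row lands in exactly one group, which is the same property the `isin` approach relies on.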

In [8]:
# Compare the mean grade of each group
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024304
74.0450648477065


In [9]:
# They look about the same. But are they?
# To find out we will use Student's t-test.

# The null hypothesis is what we want to refute, i.e. "the means are equal".
# The alternative hypothesis is "the means are different".

# Before running a hypothesis test we need to choose a significance level: a threshold for how much
# chance we are willing to accept. We call this significance level alpha.

In [10]:
# In this example we will use alpha = 0.05, or 5%

# We will use the function ttest_ind().
# It returns the t-statistic and the p-value.

# The p-value is the probability of seeing a difference at least as extreme as the one observed if the null hypothesis were true.
# So if the p-value is smaller than alpha, we can reject the null hypothesis.
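
One way to build intuition for alpha is to simulate many experiments in which the null hypothesis really is true and count how often the test rejects it anyway. The following is a rough sketch on synthetic data. The sample size, the normal distribution, its parameters and the seed are all assumptions chosen for illustration, not values from the grades dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_trials = 2000

# Each column is one experiment: two samples of 50 drawn from the SAME distribution,
# so the null hypothesis "the means are equal" holds by construction
a = rng.normal(loc=75, scale=10, size=(50, n_trials))
b = rng.normal(loc=75, scale=10, size=(50, n_trials))

# With axis=0, ttest_ind runs one test per column
_, pvals = ttest_ind(a, b, axis=0)

# Fraction of experiments where we would wrongly reject the null hypothesis
false_positive_rate = np.mean(pvals < alpha)
print(false_positive_rate)
```

The printed fraction should land close to alpha: a 5% significance level means accepting that roughly 1 in 20 tests will reject a true null hypothesis purely by chance.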

In [11]:
from scipy.stats import ttest_ind

# Call the function with our two samples
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.3223540853721598, pvalue=0.18618101101713855)
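
To see where the two numbers come from, the statistic and p-value can be reproduced by hand. This is a minimal sketch on synthetic data (the arrays `x` and `y` are invented stand-ins for the two groups), using the pooled-variance formula that `ttest_ind` applies with its default settings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(75, 10, size=40)  # synthetic stand-in for one group
y = rng.normal(74, 10, size=60)  # synthetic stand-in for the other group

n1, n2 = len(x), len(y)

# Pooled variance: Student's t-test assumes both groups share one variance
sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)

# t-statistic: difference of means divided by its standard error
t_manual = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
dof = n1 + n2 - 2
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(x, y)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))  # True True
```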

In [12]:
# Here the p-value is about 0.19, which is above our alpha of 0.05, so we cannot reject
# the null hypothesis.
# The null hypothesis is that the means are equal, and we do not have enough evidence to conclude otherwise.
# This does not prove that the means are equal.
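
Failing to reject the null hypothesis is not proof that the means are equal, because a small sample can easily miss a real difference. The sketch below illustrates this under assumed conditions: two synthetic populations whose true means differ by one point (75 versus 76, standard deviation 10), tested once with 20 observations per group and once with 20,000. All of these numbers are invented for the illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)

def p_value_for(n):
    """Draw n observations from each population and return the t-test p-value."""
    group_a = rng.normal(loc=75, scale=10, size=n)
    group_b = rng.normal(loc=76, scale=10, size=n)  # true mean is 1 point higher
    return ttest_ind(group_a, group_b).pvalue

p_small = p_value_for(20)      # little data: the real difference is hard to detect
p_large = p_value_for(20_000)  # lots of data: the same difference becomes clear

print(p_small, p_large)
```

With 20 observations per group the test will usually fail to reject even though the populations really differ, while the large sample rejects decisively. The data, not the truth of the null hypothesis, limit what the small test can conclude.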

In [13]:
# Why not check the other columns as well?
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)
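
The five near-identical calls can be collapsed into a loop that gathers the results into one table. The sketch below runs on synthetic stand-ins for the two groups, since the grades file is not available here: the frames `early` and `late`, their sizes and their values are assumptions, and only the column names mirror the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

# Synthetic stand-ins for the two groups, with columns named like the real ones
cols = [f"assignment{i}_grade" for i in range(1, 7)]
early = pd.DataFrame(rng.normal(75, 10, size=(80, 6)), columns=cols)
late = pd.DataFrame(rng.normal(75, 10, size=(120, 6)), columns=cols)

rows = []
for col in cols:
    result = ttest_ind(early[col], late[col])
    rows.append({"column": col, "statistic": result.statistic, "pvalue": result.pvalue})

# One row per assignment, plus the decision at alpha = 0.05
summary = pd.DataFrame(rows).set_index("column")
summary["reject_at_0.05"] = summary["pvalue"] < 0.05
print(summary)
```

Swapping in `early_finishers` and `late_finishers` for the synthetic frames would give the same statistics as the individual calls, side by side.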


In [14]:
# For assignment 3 the p-value is about 0.11. If we had set alpha at 11% or higher, this result would have counted as
# statistically significant and we could have claimed that the means differ. Alpha has to be fixed before running the
# test, though, not chosen afterwards to fit the result.

### Another example

In [15]:
df1=pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.175414,0.638637,0.387504,0.211096,0.646412,0.265845,0.760867,0.34512,0.635507,0.766383,...,0.842805,0.724981,0.230463,0.897827,0.237607,0.593423,0.268959,0.883596,0.648013,0.370324
1,0.409084,0.199019,0.954848,0.054873,0.039002,0.215423,0.47613,0.025203,0.435739,0.82896,...,0.190348,0.997028,0.861174,0.90658,0.502881,0.575197,0.836858,0.104808,0.582507,0.400891
2,0.005919,0.346017,0.024536,0.489169,0.751951,0.449292,0.020725,0.207849,0.307627,0.241126,...,0.387569,0.7389,0.296571,0.480121,0.764587,0.486982,0.863845,0.533797,0.724763,0.430512
3,0.665987,0.772057,0.242747,0.463461,0.576869,0.34647,0.620944,0.4697,0.56871,0.410164,...,0.00456,0.766592,0.941964,0.215359,0.427303,0.806292,0.606227,0.769542,0.235947,0.945458
4,0.275996,0.548883,0.952487,0.039108,0.259874,0.343433,0.577427,0.263102,0.478811,0.551655,...,0.888561,0.689381,0.730141,0.012433,0.037136,0.507065,0.746489,0.1392,0.545619,0.656339


In [16]:


df2=pd.DataFrame([np.random.random(100) for x in range(100)])
df2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.052153,0.711411,0.201364,0.867931,0.792641,0.379028,0.600188,0.390803,0.920085,0.857920,...,0.953354,0.097404,0.652310,0.902435,0.881329,0.890629,0.868611,0.425853,0.686675,0.150973
1,0.225433,0.764236,0.836200,0.030097,0.505519,0.292427,0.287547,0.692872,0.820059,0.659970,...,0.753942,0.316999,0.687975,0.645751,0.453070,0.546404,0.061463,0.503309,0.551731,0.135108
2,0.229334,0.166366,0.866248,0.121861,0.352192,0.053254,0.714748,0.375303,0.634152,0.886253,...,0.303814,0.261505,0.557772,0.745781,0.319780,0.574198,0.574208,0.896106,0.588634,0.549988
3,0.680387,0.848159,0.800199,0.648091,0.361729,0.845423,0.912884,0.462486,0.085399,0.378834,...,0.490975,0.185243,0.481163,0.829650,0.672107,0.818176,0.079886,0.683397,0.196773,0.551874
4,0.431063,0.501681,0.708114,0.929857,0.255354,0.728563,0.296990,0.757409,0.896836,0.363791,...,0.713669,0.786927,0.018224,0.714136,0.106263,0.598108,0.703949,0.940075,0.833153,0.851878
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.132620,0.585956,0.928765,0.770617,0.382717,0.148053,0.261160,0.141157,0.835474,0.769967,...,0.488052,0.671933,0.047087,0.229213,0.835982,0.012183,0.475377,0.019024,0.144194,0.655126
96,0.637832,0.752506,0.948742,0.281078,0.794968,0.800514,0.218750,0.763843,0.480052,0.804712,...,0.881077,0.305948,0.146009,0.026699,0.555614,0.700282,0.157826,0.285461,0.680099,0.856943
97,0.438417,0.768558,0.501437,0.213934,0.836146,0.981801,0.293473,0.617777,0.245427,0.206635,...,0.918112,0.412387,0.233645,0.016386,0.723763,0.139088,0.724840,0.059124,0.104629,0.546500
98,0.674817,0.874976,0.505236,0.487392,0.412766,0.560658,0.687612,0.655264,0.191222,0.088278,...,0.783297,0.034972,0.709907,0.290237,0.555737,0.782675,0.957897,0.997998,0.988232,0.704151


In [17]:
# Do these two DataFrames differ?
# Compare them column by column and report the columns whose p-value is at or below alpha

def test_columns(alpha=0.1):
    num_diff = 0

    for col in df1.columns:
        teststat, pval = ttest_ind(df1[col], df2[col])
        # Compare the p-value against alpha
        if pval <= alpha:
            # Report the column and count it as different
            print("Col {} is statistically significantly different at alpha={}, pval={}".format(col, alpha, pval))
            num_diff = num_diff + 1
    # Print a summary
    print("Total number different was {}, which is {}%".format(num_diff, float(num_diff) / len(df1.columns) * 100))

# Now run it
test_columns()

Col 12 is statistically significantly different at alpha=0.1, pval=0.015962071026245216
Col 20 is statistically significantly different at alpha=0.1, pval=0.03585749837895578
Col 31 is statistically significantly different at alpha=0.1, pval=0.04727008885223079
Col 37 is statistically significantly different at alpha=0.1, pval=0.08555995194378013
Col 38 is statistically significantly different at alpha=0.1, pval=0.06432440643873708
Col 39 is statistically significantly different at alpha=0.1, pval=0.0027551213400418256
Col 50 is statistically significantly different at alpha=0.1, pval=0.056883085699394334
Col 53 is statistically significantly different at alpha=0.1, pval=0.03460949162098226
Col 67 is statistically significantly different at alpha=0.1, pval=0.01087002305140466
Col 76 is statistically significantly different at alpha=0.1, pval=0.06069209480875069
Col 79 is statistically significantly different at alpha=0.1, pval=0.0696498197200743
Col 82 is statistically significantly di