# Hypothesis testing


In [9]:
import numpy as np
import pandas as pd

## Chi-squared test
Here is a short summary from [link](https://stattrek.com/chi-square-test/independence?tutorial=ap) <br>

The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.
For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference.<br>

<b>Assumptions</b>:
* The sampling method is simple random sampling.
* The variables under study are each categorical.
* If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least 5.



<b>How to perform a chi-squared test</b> [link](https://www.jmp.com/en_be/statistics-knowledge-portal/chi-square-test.html)
1. State the hypotheses. <br>
    H0: Variable A and Variable B are independent.<br>
    Ha: Variable A and Variable B are not independent.<br>
2. Choose a significance level (commonly $\alpha=0.05$).
3. Analyse the sample data.
    * Check the assumptions for the test.
    * Calculate the degrees of freedom, $DF=(r-1)(c-1)$, where $r$ and $c$ are the numbers of levels of the two categorical variables.
4. Perform the test.
    * Calculate the expected frequency of each cell: (row total × column total) / grand total.
    * Calculate the test statistic.
    * Calculate the p-value.
5. Interpret the results: reject H0 if the p-value is below the significance level.

<b>Test statistic</b><br>
$\chi^2=\sum_{i}\frac{(O_{i}-E_{i})^2}{E_{i}}$,
where $O_{i}$ is the observed count and $E_{i}$ is the expected count in cell $i$.
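
The formula can be checked by hand on a small table. The sketch below uses made-up counts (the `observed` array is purely illustrative and is not data from this notebook), builds the expected counts from the row and column totals, and compares the result with `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 contingency table (illustrative counts only)
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / n

# Test statistic and degrees of freedom
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution
p_value = chi2.sf(stat, dof)

# Cross-check against SciPy (no continuity correction)
ref_stat, ref_p, ref_dof, ref_expected = chi2_contingency(observed, correction=False)
print(stat, p_value, dof)
```

The manual values should agree with the ones SciPy returns, which is a quick way to confirm that the expected counts were built correctly.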

In [10]:
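# Calculate the p-value from a chi-squared statistic
from scipy.stats import chi2
chi2_stat=16.2   # example value of the test statistic
dof=2            # degrees of freedom
p_value=chi2.sf(chi2_stat,dof)   # survival function, equal to 1 - cdf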

In [11]:
from scipy.stats import chi2_contingency
# Usage: chi2_stat, p, dof, expected = chi2_contingency(obs, correction=False)


### Chi-square independence test

H0: Living arrangement and exercise are independent <br>
H1: Living arrangement and exercise are not independent

In [12]:
# Columns: no regular exercise, sporadic exercise, regular exercise
# Rows: living arrangement

Dormitory=[32,30,28]
On_Campus=[74,64,42]
Off_Campus=[110,25,15]
At_Home=[39,6,5]

Survey=[Dormitory,On_Campus,Off_Campus,At_Home]
Survey_array=np.array(Survey)
Survey_array

from scipy.stats import chi2_contingency
# chi2_stat avoids shadowing the chi2 distribution imported earlier
chi2_stat, p, dof, ex = chi2_contingency(Survey_array, correction=False)
print(chi2_stat,p,dof)

# Conclusion: p is far below 0.05, so we reject H0;
# living arrangement and exercise are not independent

60.43944691358026 3.6644965577536217e-11 6
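
The expected-count assumption listed earlier can be checked from the fourth value that `chi2_contingency` returns. A minimal sketch, repeating the survey table so that it runs on its own (the threshold of 5 is the rule of thumb quoted in the assumptions):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: dormitory, on campus, off campus, at home
# Columns: no regular exercise, sporadic exercise, regular exercise
survey = np.array([[32, 30, 28],
                   [74, 64, 42],
                   [110, 25, 15],
                   [39, 6, 5]])

stat, p, dof, expected = chi2_contingency(survey, correction=False)

# The chi-squared approximation is usually considered adequate
# when every expected count is at least 5
min_expected = expected.min()
print(np.round(expected, 1))
print("smallest expected count:", round(min_expected, 2))
print("assumption satisfied:", bool(min_expected >= 5))
```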


### Chi-square goodness of fit
https://www.jmp.com/en_be/statistics-knowledge-portal/chi-square-test/chi-square-goodness-of-fit-test.html

A genetics engineer was attempting to cross a tiger and a cheetah. She predicted that the phenotypes of the traits she was observing would occur in the ratio 4 stripes only : 3 spots only : 9 both stripes and spots. When the cross was performed and she counted the individuals, she found 50 with stripes only, 41 with spots only and 85 with both. According to the chi-square test, did she get the predicted outcome?

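H0: the observed counts follow the predicted 4:3:9 ratio <br>
H1: the observed counts do not follow the predicted ratio <br>
Significance level: $\alpha=0.05$

Result: p_value=0.09>0.05, so we cannot reject H0; the observed counts are consistent with the predicted ratio.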

In [13]:
Experiment=[50,41,85]
N_animals=sum(Experiment)

Expected_ratio=[4,3,9]
Percent=np.array(Expected_ratio)/sum(Expected_ratio)
Expected=N_animals*Percent

from scipy.stats import chisquare
chisquare(Experiment,Expected)

Power_divergenceResult(statistic=4.737373737373738, pvalue=0.09360355937725263)
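
As a cross-check, the same statistic and p-value can be reproduced directly from the test-statistic formula. This is a sketch that assumes the usual $k-1$ degrees of freedom for a goodness-of-fit test with $k$ categories and no estimated parameters.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([50, 41, 85])
ratio = np.array([4, 3, 9])

# Expected counts: total number of animals split according to the predicted ratio
expected = observed.sum() * ratio / ratio.sum()

# Chi-squared statistic, summed over the three phenotype categories
stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: number of categories minus one
dof = len(observed) - 1

# Upper-tail probability
p_value = chi2.sf(stat, dof)
print(expected, stat, p_value)
```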

## Fisher exact test
https://towardsdatascience.com/fishers-exact-fb49432e55b5

Fisher's exact test is used to determine whether there is a significant association between two categorical variables. It is typically used when one or more of the cell counts in a 2×2 table is less than 5. [link](https://www.statology.org/fishers-exact-test/)

Fisher's exact test of independence is used if you want to see whether the proportions of one categorical variable are different depending on the value of the other variable. Use it when the sample size is small.[link](https://stats.libretexts.org/Bookshelves/Applied_Statistics/Book%3A_Biological_Statistics_(McDonald)/02%3A_Tests_for_Nominal_Variables/2.07%3A_Fisher's_Exact_Test)

H0: The two variables are independent <br>
HA: The two variables are not independent


### Assumptions
* Individual observations are independent.
* The row and column totals are fixed.


### Example

29 patients were split into two groups of 16 and 13. One group received drug_1, the other drug_2. In the drug_1 group, 13 out of 16 were cured (81%); in the drug_2 group, 4 out of 13 were cured (31%).

H0: The probability of being cured is the same for both drugs. In other words, the proportion of cured patients does not depend on the drug received.

HA: Otherwise.

`scipy.stats.fisher_exact` documentation: [link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html)

In [17]:
table=pd.DataFrame({"outcome":["cured","not_cured"],"drug_1":[13,3],"drug_2":[4,9]})
table

| | outcome | drug_1 | drug_2 |
|---|---|---|---|
| 0 | cured | 13 | 4 |
| 1 | not_cured | 3 | 9 |


In [20]:
table=np.array([[13,3],[4,9]])
table
total=np.sum(table)
print(total)

29


In [27]:
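from scipy.stats import hypergeom

# hypergeom.pmf(k, M, n, N): probability of the observed table with the margins fixed
# k=3 not-cured patients in the drug_1 group, M=29 patients in total,
# n=16 patients on drug_1, N=12 not-cured patients overall
rv = hypergeom.pmf(3,total,16,12)
print(rv)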

0.007715440525351361
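
The probability above can also be written out with binomial coefficients, which makes the counting argument behind the `hypergeom.pmf` call explicit. A small sketch; the variable names are ours and only label the margins of the table.

```python
from math import comb
from scipy.stats import hypergeom

total = 29        # all patients
drug_1 = 16       # patients who received drug_1
not_cured = 12    # patients not cured overall (3 + 9)
k = 3             # not-cured patients observed in the drug_1 group

# Ways to choose k not-cured patients from the drug_1 group, times ways to
# choose the remaining not-cured patients from the drug_2 group, divided by
# all ways to choose the not-cured patients from everyone
manual = comb(drug_1, k) * comb(total - drug_1, not_cured - k) / comb(total, not_cured)

scipy_value = hypergeom.pmf(k, total, drug_1, not_cured)
print(manual, scipy_value)
```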


In [32]:
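# Probability of every possible table: i not-cured patients in the drug_1 group, i = 0..16
prob=np.array([hypergeom.pmf(i,total,16,12) for i in range(0,17)])
# Two-sided p-value: total probability of all tables no more likely than the observed one
p_value=np.sum(prob[prob<=rv])
print('p_value={}'.format(p_value))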

p_value=0.009530322558019238


In [33]:
prob

array([2.50501316e-07, 2.40481263e-05, 6.61323474e-04, 7.71544053e-03,
       4.51353271e-02, 1.44433047e-01, 2.64793919e-01, 2.83707770e-01,
       1.77317356e-01, 6.30461712e-02, 1.20360872e-02, 1.09418975e-03,
       3.50701842e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00])

In [34]:
from scipy.stats import fisher_exact
oddsr, p = fisher_exact(table, alternative='two-sided')
p

0.009530322558019238

The eastern chipmunk trills when pursued by a predator, possibly to warn other chipmunks. Burke da Silva et al. (2002) released chipmunks either 10 or 100 meters from their home burrow, then chased them (to simulate predator pursuit). Out of 24 female chipmunks released 10 m from their burrow, 16 trilled and 8 did not trill. When released 100 m from their burrow, only 3 female chipmunks trilled, while 18 did not trill. The two nominal variables are thus distance from the home burrow (because there are only two values, distance is a nominal variable in this experiment) and trill vs. no trill.

H0: Whether a chipmunk trills does not depend on the distance to the home burrow. <br>
HA: Otherwise.

In [36]:
table=np.array([[16,8],[3,18]])
oddsr, p = fisher_exact(table, alternative='two-sided')
p

0.0006862011459039608
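
The question here is directional (more trilling near the burrow), so a one-sided alternative is also worth a look. Choosing `alternative='greater'` is our assumption for illustration; the two-sided test above is the one the reported result is based on.

```python
import numpy as np
from scipy.stats import fisher_exact

# Rows: released 10 m, released 100 m; columns: trilled, did not trill
table = np.array([[16, 8],
                  [3, 18]])

# Two-sided test, as in the cell above
odds_ratio, p_two_sided = fisher_exact(table, alternative='two-sided')

# One-sided test: odds of trilling are higher at 10 m than at 100 m
_, p_greater = fisher_exact(table, alternative='greater')

print(odds_ratio, p_two_sided, p_greater)
```

The first returned value is the sample odds ratio, (16×18)/(8×3) for this table, which gives a sense of the size of the effect alongside the p-value.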

Conclusion: with P≈0.0007, far below the conventional 0.05 level, we reject H0. Applying Fisher's exact test, the proportion of chipmunks trilling is significantly higher when they are closer to their burrow.