# BIOEE 4940 : **Introduction to Quantitative Analysis in Ecology**
### ***Spring 2021***
### Instructor: **Xiangtao Xu** ( ✉️ xx286@cornell.edu)
### Teaching Assistant: **Yanqiu (Autumn) Zhou** (✉️ yz399@cornell.edu)

---

## <span style="color:royalblue">Lecture 4</span> *Statistical Inference II: Hypothesis Testing*
*Partly adapted from [How to be a quantitative ecologist](https://www.researchgate.net/publication/310239832_How_to_be_a_Quantitative_Ecologist_The_'A_to_R'_of_Green_Mathematics_and_Statistics) and [All of Statistics](https://www.stat.cmu.edu/~larry/all-of-statistics/)*




### 1. Rationale of Hypothesis Testing

**Hypotheses** are at the heart of the scientific method. Constructing testable hypotheses is one of the most challenging and exciting processes in exploring the unknown. In statistical inference, hypothesis testing serves as the most common bridge between data and the underlying processes. 

We will now walk through the general steps to conduct hypothesis testing:
1. For a random variable/process $\theta$, we separate the *event space* into two disjoint/non-overlapping sets $\Theta_0$ and $\Theta_1$. We then define a **null hypothesis** ($H_0: \theta \in \Theta_0$) and an **alternative hypothesis** ($H_1: \theta \in \Theta_1$).
    * Most frequently, the null hypothesis is a smaller event space, which is usually a more parsimonious/normal explanation of the real world (e.g. two quantities are strictly equal). It is like a legal trial, where we assume someone is innocent unless the evidence strongly suggests that the person is guilty.
    * Note that null hypothesis + alternative hypothesis should be equal to the whole event space
    * Examples:
       H0: Tree mortality of Species A is equal to tree mortality of Species B 
       H1: Tree mortality of Species A is **not** equal to tree mortality of Species B

2. Construct an estimator (also called a *test statistic*) based on $H_0$ and find its sampling distribution based on the data.
    * A common challenge in statistical hypothesis testing is to find an appropriate test statistic with an informative sampling distribution.
    * If we can assume a distribution of the population, we can usually use a *parametric* test. If not, we need to use a *non-parametric* test, which typically derives the sampling distribution of the test statistic from ranks, resampling, or large-sample results such as the central limit theorem.

3. Find *critical value(s)* for the estimator associated with a probability $\alpha$, which is called **the significance level**. These values tell us how wild the test statistic has to be at a given significance level.
    * *Two-sided test* (most common) has the form $H_0: \theta = \theta_0$ vs $H_1: \theta \neq \theta_0$. In these tests, a significance level will map into two critical values since we do not differentiate whether $\theta$ is greater/smaller than $\theta_0$
    * *One-sided test* (we have some prior knowledge about the test statistic) has the form $H_0: \theta \leq \theta_0$ vs $H_1: \theta \gt \theta_0$ (or the reverse). In this case, a significance level will only map into one critical value
    
4. Compare the data-based estimate and the critical values. If the estimate falls outside of the critical values (region of extremely rare events), reject $H_0$. Otherwise, we say that there is not enough evidence to reject $H_0$, so we prefer to retain the more parsimonious/normal/simplistic explanation of the world. 
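
The steps above can be sketched for a two-sided test of a mean. This is only an illustration under stated assumptions: the data are synthetic draws from a normal distribution, and the helper name `two_sided_t_decision` is hypothetical rather than part of any library.

```python
import numpy as np
from scipy import stats


def two_sided_t_decision(sample, mu0=0.0, alpha=0.05):
    """Steps 2-4 for H0: mu = mu0 vs H1: mu != mu0 (hypothetical helper)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Step 2: test statistic; under H0 it follows a t-distribution with n-1 df
    t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
    # Step 3: a two-sided test at level alpha maps into two critical values
    t_low, t_high = stats.t.ppf([alpha / 2.0, 1.0 - alpha / 2.0], df=n - 1)
    # Step 4: reject H0 if the statistic falls outside the critical values
    reject = bool(t_stat < t_low or t_stat > t_high)
    return t_stat, (t_low, t_high), reject


# synthetic data (assumption): 30 draws from a normal with true mean 0.5
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.5, scale=1.0, size=30)

t_stat, (t_low, t_high), reject = two_sided_t_decision(sample, mu0=0.0)
print(f't = {t_stat:.3f}, critical values = ({t_low:.3f}, {t_high:.3f}), reject H0: {reject}')
```

The decision should agree with comparing the p-value from `stats.ttest_1samp` against the same $\alpha$.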

In [None]:
# example of one-sided vs two-sided
from scipy import stats
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

# create a random variable with a t-distribution (degree of freedom is 5 so that the peak is not so extreme...)
rv_t = stats.t(5)

# plot theoretical distribution

fig = plt.figure()

x_lim = (-5,5)
x = np.arange(x_lim[0],x_lim[1],0.01)

plt.plot(x,rv_t.pdf(x))

ax = plt.gca()

ax.set_ylabel('PDF')

######################
# plot critical values

alpha = 0.05

# find two-sided alpha
x_2s_l, x_2s_h = rv_t.ppf(alpha/2.), rv_t.ppf(1. - alpha/2.)

plot_x = np.arange(x_lim[0],x_2s_l,0.001)
ax.fill_between(plot_x,rv_t.pdf(plot_x),facecolor='r',alpha=0.5, label='two-sided')

plot_x = np.arange(x_2s_h,x_lim[1],0.001)
ax.fill_between(plot_x,rv_t.pdf(plot_x),facecolor='r',alpha=0.5)

# find one-sided alpha

x_1s_h = rv_t.ppf(1. - alpha)
plot_x = np.arange(x_1s_h,x_lim[1],0.001)
ax.fill_between(plot_x,rv_t.pdf(plot_x),facecolor='k',alpha=0.5, label='one-sided')

ax.legend()



### 2. Outcomes of hypothesis testing


| | Retain $H_0$ | Reject $H_0$ |
| ---- | ------------- | ------------ |
| $H_0$ is true | Correct | Type I error |
| $H_1$ is true | Type II error | Correct |


* Type I error:  reject the null hypothesis when it is in fact true  (e.g. an innocent person is convicted; false positive)
* Type II error: retain the null hypothesis when it is in fact false (e.g. a guilty person is not convicted; false negative)
* Example:
    Test whether an experimental treatment has an effect on organismal behavior/functioning/...
    
    $H_0$: there is no effect vs $H_1$: there is an effect
    
    Type I error -> there is no effect but we conclude there is an effect based on data
    
    Type II error -> there is a real effect but we conclude there is no effect
    
    
    
*Quantify the risk of errors*
* p-values and the risk of type I error
    * Generally, if the test rejects $H_0$ at level $\alpha$, it will also reject $H_0$ at any level $\alpha' > \alpha$. Therefore, we can find the smallest $\alpha$ at which the test rejects, and we call this number the **p-value** (refer back to the figure above)
    * p-value is a measure of the evidence against $H_0$. It means the probability (under $H_0$) of observing a value of the test statistic the same as or more extreme than what was actually observed.
    * Note that p-value is **NOT** the probability that the null hypothesis is true. So a large p-value is not strong evidence in favor of $H_0$. It can also occur because the test has low power.
    * If p-value < a given $\alpha$, we can say the test is significant at level $\alpha$. In this case, the risk of type I error is $\alpha$
    
* statistical power and the risk of type II error
    * The power of a test is the probability of rejecting $H_0$ when $H_1$ is true, i.e. the probability of avoiding a type II error.
    * The power, and thus the risk of type II error, depends on (1) the effect size, or how far away the real process is from $H_0$, (2) the population variance, or how likely we are to observe extreme values, (3) the sample size, or how much information we have, and (4) the significance level of the test.
    * Example: 
    
        $H_0$: mean effect size $\mu$ is zero vs $H_1$: mean effect size $\mu$ is greater than zero (assume effect cannot be negative in this case)
    
        From our prior knowledge on point estimation of mean values, we can construct the following estimator
        
        $\hat{\theta}=\frac{\bar{X}}{\sigma/\sqrt{N}}$, where X is observed effect and $\sigma$ is population standard deviation.
        
        We will reject $H_0$ if $\frac{\bar{X}}{\sigma/\sqrt{N}} > Z_{1-\alpha}$ (the test is one-sided, so the whole of $\alpha$ sits in the upper tail)
    
        If $H_1$ is true, let's assume the real $\mu$ is equal to $\mu_1$ and rewrite $\hat{\theta}$ as
        
        $\hat{\theta}=\frac{\bar{X}-\mu_1+\mu_1}{\sigma/\sqrt{N}}$
        
        Therefore, we will only reject $H_0$ when $\frac{\bar{X}-\mu_1}{\sigma/\sqrt{N}} > Z_{1-\alpha} - \frac{\mu_1}{\sigma/\sqrt{N}}$. Note that $\frac{\bar{X}-\mu_1}{\sigma/\sqrt{N}}$ approximately follows a standard normal distribution if $H_1$ is true. So the power of the test is the probability of drawing a value larger than $Z_{1-\alpha} - \frac{\mu_1}{\sigma/\sqrt{N}}$ from a standard normal distribution, while the risk of type II error is 1 - power.
        
        The power of the test is higher if (1) we set a greater $\alpha$; (2) $\mu_1$ is larger; (3) $\sigma$ is smaller; and (4) N is larger. Conversely, the risk of type II error rises when any of these conditions is reversed. Most notably, **the risk of type II error is higher if we set a lower $\alpha$ (low risk of type I error)**. The risk of type II error is also larger when the sample size is small. Therefore, we usually conduct a *statistical power analysis* to determine the sample size needed to detect a given effect at a certain significance level.
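
The power expression in this example can be evaluated directly. The sketch below assumes the one-sided critical value $Z_{1-\alpha}$ and a known $\sigma$; the function name `one_sided_z_power` and the numbers passed to it are illustrative choices, not values from the course data.

```python
import numpy as np
from scipy import stats


def one_sided_z_power(mu1, sigma, n, alpha=0.05):
    """Power of the one-sided z-test of H0: mu = 0 vs H1: mu > 0
    when the true mean is mu1 (hypothetical helper)."""
    z_crit = stats.norm.ppf(1.0 - alpha)      # one-sided critical value
    shift = mu1 / (sigma / np.sqrt(n))        # standardized true effect
    # probability that a standard normal exceeds z_crit - shift
    return stats.norm.sf(z_crit - shift)


# illustrative values (assumption): true effect 0.5, sigma 1
for n in (5, 20, 50, 100):
    power = one_sided_z_power(mu1=0.5, sigma=1.0, n=n)
    print(f'N = {n:3d}: power = {power:.3f}, type II error risk = {1.0 - power:.3f}')
```

When `mu1` is zero the power collapses to $\alpha$, which is one quick sanity check on the formula.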

In [None]:
# type II error

# create a random variable with a t-distribution (degree of freedom is 5 so that the peak is not so extreme...)
rv_t = stats.t(5)

# plot theoretical distribution

fig = plt.figure()

x_lim = (-5,5)
x = np.arange(x_lim[0],x_lim[1],0.01)

# Distribution if Null Hypothesis is True
plt.plot(x,rv_t.pdf(x),'k--',lw=3,label='H0')

# Distribution if Alternate Hypothesis is True
effect_size = 1.
plt.plot(x+effect_size,rv_t.pdf(x),'r-',lw=3,label='H1')

ax = plt.gca()

ax.set_ylabel('PDF')

######################
# plot critical values

alpha = 0.05

# find one-sided alpha

x_1s_h = rv_t.ppf(1. - alpha)

ax.plot([x_1s_h,x_1s_h],[0,0.35],'b-',lw=3,label='alpha=0.05')

plot_x = np.arange(x_1s_h,x_lim[1],0.001)
ax.fill_between(plot_x,rv_t.pdf(plot_x),facecolor='none',hatch='///',
                edgecolor='b',alpha=0.5, label='Type I error Prob.')

plot_x = np.arange(x_lim[0],x_1s_h-effect_size,0.001)
ax.fill_between(plot_x+effect_size,rv_t.pdf(plot_x),facecolor='none',hatch='XX',
                edgecolor='r',alpha=0.5, label='Type II error Prob.')


# decrease alpha (a stricter test moves the critical value to the right)
alpha = 0.01

# find one-sided alpha

x_1s_h = rv_t.ppf(1. - alpha)

ax.plot([x_1s_h,x_1s_h],[0,0.35],'m-',lw=3,label='alpha=0.01')



ax.legend()



### 3. Common hypothesis testing and implementations in Python

A full list of test functions in `scipy` : https://docs.scipy.org/doc/scipy/reference/stats.html

Also in `statsmodels.stats` : https://www.statsmodels.org/stable/stats.html

And `scikit-posthocs` https://scikit-posthocs.readthedocs.io/en/latest/intro/

In [None]:
# prepare for testing
from statsmodels import stats as ss # differentiate from scipy.stats
import pandas as pd

# read baad data
baad_data_url = 'https://raw.githubusercontent.com/xiangtaoxu/QuantitativeEcology/main/Lab1/baad_data.csv'
baad_dictionary_url = 'https://raw.githubusercontent.com/xiangtaoxu/QuantitativeEcology/main/Lab1/baad_dictionary.csv'

# encodings are not always necessary 
# Here I include them because the raw csv is not compatible with utf-8 encoding

df_data = pd.read_csv(baad_data_url, encoding='latin_1') # can also read local files
df_dict = pd.read_csv(baad_dictionary_url, encoding='latin_1')

# create a new variable shoot to root ratio (leaf mass / fine root mass), log-transformed

df_data['log_s2r'] = np.log(df_data['m.lf'] / df_data['m.rf'])

In [None]:
df_data.boxplot(column=['log_s2r'],by=['vegetation'])

#### 3.1 Test for the mean (t-test)

* Simplest case, compare population mean ($\mu$) with a constant

$H_0: \mu = \mu_0$  vs $H_1: \mu \neq \mu_0$

One sample t-test, equivalent to examining whether the constant $\mu_0$ lies within the 1-$\alpha$ confidence interval of the mean.

The test statistic is $\frac{\bar{X}-\mu_0}{s/\sqrt{N}}$, which follows a t-distribution with $N-1$ degrees of freedom under $H_0$ (recall that $\bar{X}$ is normally distributed and the sample variance $s^2$ follows a scaled chi-square distribution)

**Assumption**: sample size is large enough so that the sampling distribution is normal OR if sample size is small, the underlying distribution is normal
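
The link between the one-sample t-test and the confidence interval can be checked numerically. The following is a sketch on synthetic data (an assumption, not the BAAD data): retaining $H_0$ at level $\alpha$ should coincide with $\mu_0$ falling inside the 1-$\alpha$ confidence interval.

```python
import numpy as np
from scipy import stats

# synthetic data (assumption): 40 draws from a normal with true mean 0.3
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.3, scale=1.0, size=40)

mu0, alpha = 0.0, 0.05
n = sample.size
sem = sample.std(ddof=1) / np.sqrt(n)           # standard error of the mean
t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 1)

# 1 - alpha confidence interval for the mean
ci_low = sample.mean() - t_crit * sem
ci_high = sample.mean() + t_crit * sem

res = stats.ttest_1samp(sample, mu0)
mu0_in_ci = ci_low <= mu0 <= ci_high

print(f'CI = ({ci_low:.3f}, {ci_high:.3f}), mu0 inside CI: {mu0_in_ci}')
print(f'p = {res.pvalue:.4f}, retain H0 at alpha = {alpha}: {res.pvalue >= alpha}')
```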

In [None]:
# usually in ecosystem models, we set s2r to be 1. -> log_s2r to be 0.

# test whether the average log_s2r is zero

# first plot histogram

df_data.plot(y='log_s2r',kind='hist',bins=50)

print(df_data['log_s2r'].count())
log_s2r = df_data['log_s2r'].values
log_s2r = log_s2r[~np.isnan(log_s2r)]

In [None]:
#scipy
print(stats.ttest_1samp(log_s2r,0.))
print(stats.ttest_1samp(log_s2r,0.,alternative='greater'))

* Comparing two different means

Most commonly used (e.g. to infer experimental effect by comparing control mean and experiment mean).

$H_0: \mu_1 = \mu_2$  vs $H_1: \mu_1 \neq \mu_2$

1. (Parametric) Two sample t-test

    estimator is $\frac{\bar{X_1} - \bar{X_2}}{s}$ where $s = \sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$, if $X_1$ and $X_2$ have the same $\sigma$

    A modified version of t-test is also possible if the $\sigma$s are different (Welch's t-test)
    
    Special case: t-test for paired data (can be converted to a one sample t-test, why?)
    
    **Assumption**: normality
    
2. (non-parametric) Wilcoxon rank sum test and Mann-Whitney test

    used when normality assumption is not met
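
As a sketch of how these tests relate, the block below computes the pooled two-sample statistic by hand (the square root of the pooled variance times $\sqrt{1/n_1 + 1/n_2}$ as the standard error) and compares it with `stats.ttest_ind`. It also shows that a paired t-test gives the same result as a one-sample t-test on the pairwise differences. All data are synthetic and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x1 = rng.normal(loc=0.0, scale=1.0, size=25)   # synthetic "control"
x2 = rng.normal(loc=0.8, scale=1.0, size=30)   # synthetic "treatment"

# pooled-variance (Student) t statistic computed by hand
n1, n2 = x1.size, x2.size
pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
s = np.sqrt(pooled_var) * np.sqrt(1.0 / n1 + 1.0 / n2)
t_manual = (x1.mean() - x2.mean()) / s
p_manual = 2.0 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

res_student = stats.ttest_ind(x1, x2)                   # equal variances assumed
res_welch = stats.ttest_ind(x1, x2, equal_var=False)    # Welch's t-test
res_mwu = stats.mannwhitneyu(x1, x2, alternative='two-sided')

print(f'manual t = {t_manual:.3f}, p = {p_manual:.4g}')
print(res_student)
print(res_welch)
print(res_mwu)

# paired data: the paired test is a one-sample test on the differences
before = rng.normal(loc=0.0, scale=1.0, size=20)
after = before + rng.normal(loc=0.3, scale=0.5, size=20)
res_paired = stats.ttest_rel(after, before)
res_diff = stats.ttest_1samp(after - before, 0.0)
print(res_paired)
print(res_diff)
```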

In [None]:
# test whether s2r in tropical rainforest differs from seasonal forest and from temperate forest
log_s2r_rf = df_data[df_data['vegetation'] == 'TropRF']['log_s2r'].values
log_s2r_sf = df_data[df_data['vegetation'] == 'TropSF']['log_s2r'].values
log_s2r_tf = df_data[df_data['vegetation'] == 'TempF']['log_s2r'].values

log_s2r_rf = log_s2r_rf[~np.isnan(log_s2r_rf)]
log_s2r_sf = log_s2r_sf[~np.isnan(log_s2r_sf)]
log_s2r_tf = log_s2r_tf[~np.isnan(log_s2r_tf)]

# parametric two sample t_test
print('Two Sample t-test: Rainforest vs Seasonal Forest')
print(stats.ttest_ind(log_s2r_rf,log_s2r_sf))

print('Two Sample t-test: Rainforest vs Temp Forest')
print(stats.ttest_ind(log_s2r_rf,log_s2r_tf))

In [None]:
# do not assume equal variance (Welch's t-test)
# parametric two sample t_test
print('Two Sample t-test: Rainforest vs Seasonal Forest')
print(stats.ttest_ind(log_s2r_rf,log_s2r_sf,equal_var=False))

print('Two Sample t-test: Rainforest vs Temp Forest')
print(stats.ttest_ind(log_s2r_rf,log_s2r_tf,equal_var=False))

In [None]:
# non-parametric Wilcoxon test
print('Ranksums test:')
print(stats.ranksums(log_s2r_rf,log_s2r_sf))
print(stats.ranksums(log_s2r_rf,log_s2r_tf))

In [None]:
# power analysis for T-test
from statsmodels.stats.power import TTestIndPower

fig, axes = plt.subplots(2,1,figsize=(3,6))

# power vs sample size

res = TTestIndPower().plot_power(
    dep_var='nobs',
    nobs=np.arange(5,100),
    effect_size= np.array([0.1, 0.5, 2]),
    ax=axes[0],
    title='Power vs N')


res = TTestIndPower().plot_power(
    dep_var='alpha',
    nobs=np.array([5,20,50,100]),
    alpha= np.arange(0.001,0.05,0.001),
    effect_size=0.5,
    ax=axes[1],
    title='Power vs alpha')

fig.tight_layout()

* Comparing multiple different means


1. One-way analysis of variance (ANOVA), test whether the mean values of two or more samples are the same.

$H_0: \mu_1 = \mu_2 = ... = \mu_n$  vs $H_1$: at least one of the $\mu_i$ is different

We can use an F-test if all samples are normally distributed and have the same variance.

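The test statistic F = between-group variability / within-group variability. If $H_0$ is true, the between-group variability is comparable to the within-group variability, so F tends to be close to 1; a large F is evidence against $H_0$. The theoretical distribution of F under $H_0$ is called an F distribution.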

2. (non-parametric) Kruskal-Wallis test, an extension of the Mann-Whitney test

3. Tukey's test

Most of the time, we also want to know the ranking of mean values across samples and whether their differences are significant. To achieve this, we can use Tukey's test, which compares the means of every pair of samples. It is useful for experiments with multiple treatments within the same level.
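
To make the F statistic concrete, here is a minimal sketch that computes the between-group and within-group mean squares by hand and compares the result with `stats.f_oneway`. The three groups are synthetic and the helper `one_way_f` is a hypothetical name used only for this illustration.

```python
import numpy as np
from scipy import stats


def one_way_f(*groups):
    """One-way ANOVA F statistic and p-value (hypothetical helper)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(g.size for g in groups)
    grand_mean = np.concatenate(groups).mean()
    # between-group variability: spread of group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # within-group variability: spread of observations around their group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    f_stat = ms_between / ms_within
    p_value = stats.f.sf(f_stat, k - 1, n_total - k)
    return f_stat, p_value


# synthetic groups (assumption): same variance, different means
rng = np.random.default_rng(3)
g1 = rng.normal(0.0, 1.0, 20)
g2 = rng.normal(0.5, 1.0, 25)
g3 = rng.normal(1.0, 1.0, 30)

f_stat, p_value = one_way_f(g1, g2, g3)
print(f'manual F = {f_stat:.3f}, p = {p_value:.4g}')
print(stats.f_oneway(g1, g2, g3))
```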



In [None]:
# F test
# test whether s2r is different across the four vegetation types (TropRF, TropSF, TempF, BorF)
log_s2r_rf = df_data[df_data['vegetation'] == 'TropRF']['log_s2r'].values
log_s2r_sf = df_data[df_data['vegetation'] == 'TropSF']['log_s2r'].values
log_s2r_tf = df_data[df_data['vegetation'] == 'TempF']['log_s2r'].values
log_s2r_bf = df_data[df_data['vegetation'] == 'BorF']['log_s2r'].values

log_s2r_rf = log_s2r_rf[~np.isnan(log_s2r_rf)]
log_s2r_sf = log_s2r_sf[~np.isnan(log_s2r_sf)]
log_s2r_tf = log_s2r_tf[~np.isnan(log_s2r_tf)]
log_s2r_bf = log_s2r_bf[~np.isnan(log_s2r_bf)]

print('One-way ANOVA F-test')
print(stats.f_oneway(log_s2r_rf,log_s2r_sf,log_s2r_tf,log_s2r_bf))
print('Kruskal-Wallis')
print(stats.kruskal(log_s2r_rf,log_s2r_sf,log_s2r_tf,log_s2r_bf))

In [None]:
# Tukey's test
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# create a new dataframe
df_rf = pd.DataFrame({
    'log_s2r' : log_s2r_rf,
    'vegetation' : ['TropRF'] * len(log_s2r_rf)
    
})

df_sf = pd.DataFrame({
    'log_s2r' : log_s2r_sf,
    'vegetation' : ['TropSF'] * len(log_s2r_sf)
    
})

df_tf = pd.DataFrame({
    'log_s2r' : log_s2r_tf,
    'vegetation' : ['TempF'] * len(log_s2r_tf)
    
})

df_bf = pd.DataFrame({
    'log_s2r' : log_s2r_bf,
    'vegetation' : ['BorF'] * len(log_s2r_bf)
    
})

# concatenate them together
df_tukey = pd.concat([df_rf,df_sf,df_tf,df_bf])

result = pairwise_tukeyhsd(df_tukey['log_s2r'],
                          groups=df_tukey['vegetation'])
print(result.summary())

#### 3.2 Comparing variances

* Bartlett's test for equal variances (samples are normally distributed)
* Levene's test for equal variances (samples deviate from normality)

In [None]:
print(stats.bartlett(log_s2r_rf,log_s2r_sf))
print(stats.levene(log_s2r_rf,log_s2r_sf))


#### 3.3 Test frequency distribution of multinomial variables (qualitative data)

Compare the observed frequency across different categories (e.g. species, locations, etc.) with given probabilities.

e.g. Is the probability of liana infestation different across a few tree species?

$H_0: O_i = E_i$  vs $H_1: O_i \neq E_i$

estimator is $\chi_0^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}$


**Assumption**: This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5.
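
A small sketch of the chi-square statistic follows. The counts and expected probabilities are made up for illustration (they are not from any dataset in this course); the manual sum is compared with `stats.chisquare`.

```python
import numpy as np
from scipy import stats

# hypothetical counts of individuals observed in four categories (e.g. species)
observed = np.array([18, 30, 12, 40])
# hypothetical probabilities under H0 for the same four categories
expected_prob = np.array([0.2, 0.3, 0.1, 0.4])
expected = expected_prob * observed.sum()

# chi-square statistic: sum of (O_i - E_i)^2 / E_i over the m categories
chi2_manual = np.sum((observed - expected) ** 2 / expected)
# under H0 the statistic follows a chi-square distribution with m - 1 df
p_manual = stats.chi2.sf(chi2_manual, df=observed.size - 1)

res = stats.chisquare(observed, f_exp=expected)
print(f'manual chi2 = {chi2_manual:.3f}, p = {p_manual:.4f}')
print(res)
```

All observed and expected counts here are at least 5, so the rule of thumb above is satisfied.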
    
    

In [None]:
# skip?
stats.chisquare

#### 3.4 Test against a distribution

1. normality test

We can first visually judge the deviation from normality with a quantile-quantile plot (qq-plot).

The most commonly used test is the Shapiro-Wilk test (which tests the null hypothesis of normality).



In [None]:
# qq plot
fig, ax = plt.subplots(1,1)
res = stats.probplot(log_s2r_rf,dist=stats.norm, plot=ax)

# also try tf

In [None]:
# Shapiro-Wilk test
print(stats.shapiro(log_s2r_rf))

2. test whether samples come from a certain distribution

The Kolmogorov-Smirnov test compares the CDFs of two samples (two-sample test) or the CDF of one sample with a theoretical distribution (one-sample test), and tests the null hypothesis that they come from the same distribution.
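
The Kolmogorov-Smirnov statistic is the largest vertical distance between two CDFs. As a sketch on synthetic samples (an assumption), the block below computes the one-sample distance to a standard normal by hand and compares it with `stats.kstest`; the helper name `ks_statistic_normal` is hypothetical.

```python
import numpy as np
from scipy import stats


def ks_statistic_normal(sample):
    """Largest distance between the empirical CDF of the sample and the
    standard normal CDF (hypothetical helper)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    cdf = stats.norm.cdf(x)
    # the empirical CDF jumps from (i-1)/n to i/n at the i-th sorted value
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)


rng = np.random.default_rng(4)
normal_sample = rng.normal(size=200)        # consistent with H0
skewed_sample = rng.exponential(size=200)   # clearly not standard normal

print(f'manual D = {ks_statistic_normal(normal_sample):.4f}')
print(stats.kstest(normal_sample, 'norm'))          # one-sample test
print(stats.kstest(skewed_sample, 'norm'))          # one-sample test
print(stats.ks_2samp(normal_sample, skewed_sample)) # two-sample test
```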


In [None]:
# compare rf and sf
print(stats.kstest(log_s2r_rf,log_s2r_sf))

#### 3.5 Test for correlation

One of the most common tasks in quantitative analysis is to assess the relationship between two random variables. The strength of linear association between two random variables can be formalized into the concept of **correlation**.

To get a quantitative assessment of correlation, we first calculate a quantity called **covariance**. Like variance, which describes the spread of a random variable, covariance describes the joint spread of two variables:

$cov(X,Y) = E((X-{\mu}_X)(Y-{\mu}_Y))$, where X, Y are random variables.

It is easy to see that if X, Y observations deviate from their mean values in the same direction, the product will be more positive, otherwise it will be more negative.

The value of the covariance depends on the units of the two variables and the spread of their respective distributions. In order to normalize the quantity to allow for consistent and universal comparisons across different relationships, we can scale it by the standard deviations of X and Y. In this way we get the **Pearson's correlation coefficient** for a population:

$\rho(X,Y) = \frac{cov(X,Y)}{\sigma_X\sigma_Y}$

We usually want to know whether the correlation is significantly different from zero (not correlated):

$H_0: \rho_{X,Y} = 0$ vs $H_1: \rho_{X,Y} \neq 0$

To conduct a hypothesis test, we can define an estimator **Pearson's r**

$r_{X,Y} = \frac{\sum_{i=1}^{N} (X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i-\bar{X})^2} \sqrt{\sum_{i=1}^{N} (Y_i-\bar{Y})^2}}$

When X, Y are both normally distributed, under $H_0$ (i.e., X, Y are independent), it turns out we can construct a t-test for $r_{X,Y}$ by certain transformations and get a two-sided p-value. This can be used to reject/retain the null hypothesis.
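
One common form of this transformation is $t = r\sqrt{\frac{N-2}{1-r^2}}$, which follows a t-distribution with $N-2$ degrees of freedom under $H_0$. The sketch below applies it to synthetic data (an assumption) and compares the resulting two-sided p-value with `stats.pearsonr`; the helper name `pearson_t_test` is hypothetical.

```python
import numpy as np
from scipy import stats


def pearson_t_test(x, y):
    """Pearson's r with a t-based two-sided p-value (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    dx, dy = x - x.mean(), y - y.mean()
    r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    # transform r into a statistic with a t-distribution (n - 2 df) under H0
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)
    return r, t_stat, p_value


# synthetic data (assumption): y depends weakly on x
rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)

r, t_stat, p_value = pearson_t_test(x, y)
print(f'r = {r:.3f}, t = {t_stat:.3f}, p = {p_value:.4g}')
print(stats.pearsonr(x, y))
```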

For non-normal distributions and small sample size, we can use some computational methods such as *permutation* and *bootstrap* to estimate the confidence interval of $r_{X,Y}$ and conduct the test.
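
A percentile bootstrap is one simple way to get such a confidence interval. The sketch below is only an illustration on synthetic, skewed data; the function name `bootstrap_r_ci`, the number of resamples, and the percentile method are all assumptions rather than the approach used elsewhere in this lecture.

```python
import numpy as np
from scipy import stats


def bootstrap_r_ci(x, y, n_boot=2000, ci=0.95, seed=0):
    """Percentile bootstrap confidence interval for Pearson's r
    (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    r_boot = np.empty(n_boot)
    for i in range(n_boot):
        # resample (x, y) pairs with replacement so the pairing is preserved
        idx = rng.integers(0, n, size=n)
        r_boot[i] = stats.pearsonr(x[idx], y[idx])[0]
    lower, upper = np.percentile(r_boot, [100 * (1 - ci) / 2, 100 * (1 + ci) / 2])
    return lower, upper


# synthetic, non-normal data (assumption)
rng = np.random.default_rng(7)
x = rng.exponential(size=60)
y = x + rng.normal(scale=0.5, size=60)

low, high = bootstrap_r_ci(x, y)
print(f'observed r = {stats.pearsonr(x, y)[0]:.3f}, 95% CI = ({low:.3f}, {high:.3f})')
# if zero lies outside the interval, we reject H0: rho = 0 at the 5% level
```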

 
If we are more interested in whether X is monotonically (not necessarily linearly) related to Y, we can use **Spearman's correlation coefficient**
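
Spearman's coefficient is Pearson's r computed on the ranks of the data, which is why it captures monotonic rather than strictly linear association. A minimal sketch with a synthetic, perfectly monotonic but nonlinear relationship (an assumption):

```python
import numpy as np
from scipy import stats

# synthetic data (assumption): y increases monotonically but nonlinearly with x
rng = np.random.default_rng(6)
x = rng.uniform(0, 5, size=100)
y = np.exp(x)

r_pearson = stats.pearsonr(x, y)[0]
rho_spearman = stats.spearmanr(x, y)[0]
# Spearman's rho equals Pearson's r applied to the ranks
rho_manual = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]

print(f'Pearson r = {r_pearson:.3f}')
print(f'Spearman rho = {rho_spearman:.3f} (from ranks: {rho_manual:.3f})')
```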


In [None]:
# pairwise plot

res = pd.plotting.scatter_matrix(df_data[['a.lf','a.ssbh','d.bh','log_s2r']])

In [None]:
la_data = df_data['a.lf']
sa_data = df_data['a.ssbh']
log_s2r_data = df_data['log_s2r']

# correlation between leaf area and sapwood area
data_mask = ~np.isnan(la_data) & ~np.isnan(sa_data)
coef, p = stats.pearsonr(la_data[data_mask],sa_data[data_mask])
print(f'LA vs SA: Pearson r = {coef}, p = {p}')
coef, p = stats.spearmanr(la_data[data_mask],sa_data[data_mask])
print(f'LA vs SA: Spearman r = {coef}, p = {p}')

# log transform
data_mask = ~np.isnan(la_data) & ~np.isnan(sa_data)
coef, p = stats.pearsonr(np.log(la_data[data_mask]),np.log(sa_data[data_mask]))
print(f'log_LA vs log_SA: Pearson r = {coef}, p = {p}')
coef, p = stats.spearmanr(np.log(la_data[data_mask]),np.log(sa_data[data_mask]))
print(f'log_LA vs log_SA: Spearman r = {coef}, p = {p}')

# leaf area and s2r
data_mask = ~np.isnan(la_data) & ~np.isnan(log_s2r_data)
coef, p = stats.pearsonr(la_data[data_mask],log_s2r_data[data_mask])
print(f'LA vs log_s2r: Pearson r = {coef}, p = {p}')
coef, p = stats.spearmanr(la_data[data_mask],log_s2r_data[data_mask])
print(f'LA vs log_s2r: Spearman r = {coef}, p = {p}')

In [None]:
# permutation to get the null distribution of r for sapwood area and s2r

data_mask = ~np.isnan(sa_data) & ~np.isnan(log_s2r_data)

sa_org = sa_data[data_mask]
s2r_org = log_s2r_data[data_mask]

print(stats.pearsonr(sa_org,s2r_org))

perm_N = 5000
r_perm = [stats.pearsonr(sa_org,np.random.permutation(s2r_org)) for i in range(perm_N)] 
r_perm = [res[0] for res in r_perm]


In [None]:
fig = plt.figure()
h = plt.hist(r_perm,bins=np.arange(-1,1,0.05))

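# two-sided p value based on permutation: fraction of permuted |r| that is
# at least as extreme as the observed |r| (computed here instead of hard-coded)
r_obs = stats.pearsonr(sa_org,s2r_org)[0]
print(np.sum(np.abs(np.array(r_perm)) >= np.abs(r_obs))/perm_N)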
