# BIOEE 4940 : **Introduction to Quantitative Analysis in Ecology**
### ***Spring 2021***
### Instructor: **Xiangtao Xu** ( ✉️ xx286@cornell.edu)
### Teaching Assistant: **Yanqiu (Autumn) Zhou** (✉️ yz399@cornell.edu)

---

## <span style="color:royalblue">Lecture 3</span> *Statistical Inference I: Point Estimation and Confidence Intervals*
*Partly adapted from [How to be a quantitative ecologist](https://www.researchgate.net/publication/310239832_How_to_be_a_Quantitative_Ecologist_The_'A_to_R'_of_Green_Mathematics_and_Statistics) and [All of Statistics](https://www.stat.cmu.edu/~larry/all-of-statistics/)*




### Overview
<img src="./img/Probability_and_inference.png" alt="Probability and Inference" style="width: 800px;"/>

*Source: All of Statistics*



**The process of statistical inference can be generally described as *Given X1,...,Xn (observations/data) drawn from a distribution F, what can we infer about F?***

Scenarios:
* Infer population dynamics from census
* Infer trait covariations from samples over a few individuals
* Infer spatio-temporal dynamics from a few spatial/temporal samplings

Questions:
* Estimate an unknown quantity of the distribution (Point Estimation)
* Quantify the uncertainty of the estimate (Confidence Interval)
* Is the estimate (significantly) different from a hypothesized value? (Hypothesis Testing)

Examples:
* What is the average tree size of species X in tropical moist forests around BCI, Panama?
* What is the uncertainty of our estimated average tree size?
* Is the average tree size of species X significantly different from the average tree size of species Y?

Approaches:
* Parametric (F can be described by a finite number of parameters)
* Non-parametric (F cannot be described by a finite number of parameters)

### 1. Point Estimation

Point estimation refers to providing a **single "best guess"** of some quantity of interest. By convention, the **estimator** (the formula used to calculate an estimate) of $\theta$ (a **fixed, unknown** parameter, from population distribution) is usually denoted as $\hat{\theta}$ (depends on data/sample and is a **random variable**)

Some estimators emerge very intuitively. For example, the estimator of the population mean is the average of all sample values ($\hat{\mu}=\frac{1}{N}\sum_{i=1}^{N} X_i$), as we have used in the previous lectures.

#### 1.1 Quality of Estimator

There might exist various different estimators for the same parameter. How can we quantify the quality of each estimator? How good is each estimator? We can assess the quality from a few dimensions.

* The **bias** of an estimator is defined as $E(\hat{\theta})-\theta$. We call an estimator $\hat{\theta}$ **unbiased** if $E(\hat{\theta})=\theta$, which means that the mean value of the estimator over a large number of repeated samples from the population is (strictly) equal to the true value of $\theta$.

* However, an unbiased estimator can sometimes be hard to derive/compute. A less strict but reasonable requirement is that $\hat{\theta}$ **converges** to the true $\theta$ as we collect more and more data (N increases). We call such estimators **consistent**.

    * Example: an estimator with $E(\hat{\theta})=(1-\frac{1}{N})\theta$ and a standard error that shrinks to zero as N increases is biased but consistent.
    * With small sample sizes, one might need to think of corrections for biased but consistent estimators.
    
* The distribution of $\hat{\theta}$ is called the **sampling distribution** and the standard deviation of $\hat{\theta}$ is called **standard error** (of a certain estimator --> $se(\hat{\theta})$). Sometimes, the true se value depends on the population distribution and is unknown but can be estimated ($\hat{se}(\hat{\theta})$). An estimator is called **precise** or **efficient** if $se(\hat{\theta})$ is low.
    

* Finally, the overall quality of a point estimate is sometimes assessed by the **mean squared error** (MSE), defined as $E[(\hat{\theta}-\theta)^2]$, which combines both bias and standard error: $MSE = bias^2(\hat{\theta}) + se^2(\hat{\theta})$ (see Ch.6 in All of Statistics for more details).
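
The relation between MSE, bias, and standard error can be checked numerically. The sketch below is illustrative only: it assumes a synthetic normal population (`sigma2_true`, `N`, and `repeat` are made-up values, not taken from the course data) and compares the two variance estimators that differ by the factor $\frac{N}{N-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible

sigma2_true = 4.0   # assumed population variance (illustrative value)
N = 10              # observations per sample
repeat = 20000      # number of repeated samples

# each row is one sample of N observations from the synthetic population
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=(repeat, N))

results = {}
for name, ddof in [('1/N', 0), ('1/(N-1)', 1)]:
    est = samples.var(axis=1, ddof=ddof)       # one variance estimate per sample
    bias = est.mean() - sigma2_true            # E(theta_hat) - theta
    se = est.std()                             # sd of the sampling distribution
    mse = np.mean((est - sigma2_true) ** 2)    # E[(theta_hat - theta)^2]
    results[name] = (bias, se, mse)
    print(f'{name}: bias={bias:.3f}, se={se:.3f}, mse={mse:.3f}, bias^2+se^2={bias**2 + se**2:.3f}')
```

In this setting the `1/N` estimator is biased low, yet for a normal population it typically ends up with the smaller MSE, which illustrates that unbiasedness alone does not guarantee the lowest overall error.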

#### 1.2 Formulating Estimators: example of mean and variance

The most common estimation problem is inferring the mean ($\mu$) and variance ($\sigma^2$) of the population. Intuitively, we can use the mean and variance of the sample distribution to estimate them. Let's try evaluating the quality of these estimators.

* Population Mean:

    Assume we have N observations X1, X2, ... XN. The estimator $\hat{\mu}=\bar{X}=\frac{1}{N}\sum_{i=1}^{N} X_i$.
    
    The bias of $\hat{\mu}$ is equal to $E(\hat{\mu})-\mu$. 
    
    Since each $X_i$ is a random variable independently (in most cases) drawn from the same population, $E(\hat{\mu})=\frac{1}{N}\sum_{i=1}^{N} E(X_i)$ based on the mathematical properties of expectations. 
    
    Given $E(X_i)=\mu$, $E(\hat{\mu})=\frac{1}{N}\sum_{i=1}^{N} \mu=\mu$. Therefore, $\hat{\mu}$ is an **unbiased** estimator of $\mu$.
    
    Similarly, the variance of $\hat{\mu}$ can be expanded as 
    
    $Var(\hat{\mu})=Var(\frac{1}{N}\sum_{i=1}^{N} X_i)=\frac{1}{N^2}\sum_{i=1}^{N} Var(X_i)=\frac{1}{N}\sigma^2$. 
    
    The **standard error** of the estimator is thus $\frac{1}{\sqrt{N}}\sigma$. Therefore, the **precision** of the estimator increases with increasing number of observations.
    
    From the *central limit theorem*, we even know that $\hat{\mu}=\frac{1}{N}\sum_{i=1}^{N} X_i$ is a random variable with an approximately normal distribution (when N is large enough).

In [None]:
# Here we use tree mortality data from a Costa Rican tropical dry forest during 2015 as an example
# We aim to show the standard error of the estimated mortality rates

import pandas as pd
import numpy as np

df_cr15 = pd.read_excel('./data/CR_Mortality_2015.xlsx')
df_cr15.describe()

In [None]:
# assume all the trees from the census are the 'population'
# we calculate the 'true' value for the mean and variance of the population
mu_mort = df_cr15['isMort'].mean()
var_mort  = df_cr15['isMort'].var(ddof=0) # by default ddof = 1
print(mu_mort,var_mort)

In [None]:
# we try to estimate mu and var with N samples

# first try N = 200, what about decreasing/increasing N?
N = 200

repeat = 5000 # to get standard error numerically

mu_hat_mort = [df_cr15['isMort'].sample(N).mean() for i in range(repeat)]

# plot histogram and calculate mean and var
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(1,1)

ax.hist(mu_hat_mort,bins=np.linspace(0,0.1,20),density=True)
# overlay sample mean and population mean

mu_hat_mean = np.mean(mu_hat_mort)
mu_hat_var  = np.var(mu_hat_mort)

ax.plot([mu_hat_mean,mu_hat_mean],[0.,25],'k--',label='sample')
ax.plot([mu_mort,mu_mort],[0.,25],'r--',label='population')
ax.legend() # show the labels defined above

print(f'mu_hat = {mu_hat_mean}, mu = {mu_mort}')
print(f'se(mu_hat) = {np.sqrt(mu_hat_var)}, sigma/sqrt(N)={np.sqrt(var_mort)/np.sqrt(N)}')

    
* Population Variance:

    Intuitively, the estimator $\hat{\sigma^2}=\frac{1}{N}\sum_{i=1}^{N} (X_i-\bar{X})^2$.
    
    For bias analysis, we need to calculate the expectation $E(\hat{\sigma^2})=\frac{1}{N}\sum_{i=1}^{N} E((X_i-\bar{X})^2)$. 
    
    Applying the identity $E(Y^2) = Var(Y) + [E(Y)]^2$ to $Y = X_i-\bar{X}$ gives $E((X_i-\bar{X})^2)=Var(X_i-\bar{X})+[E(X_i-\bar{X})]^2$. 
    
    The second term vanishes because $E(X_i-\bar{X})=E(X_i)-E(\bar{X})=\mu-\mu=0$. 
    
    For the first term, $Var(X_i-\bar{X})=Var((1-\frac{1}{N})X_i - \frac{1}{N}(X_1+X_2+...+X_{i-1}+X_{i+1}+...+X_N))$. 
    
    Again, because the $X_i$ are independent of each other, $Var(X_i-\bar{X})=(1-\frac{1}{N})^2\sigma^2 + \frac{N-1}{N^2}\sigma^2=\frac{N-1}{N}\sigma^2$. 
    
    Therefore, $E(\hat{\sigma^2})=\frac{N-1}{N}\sigma^2$ and it is a **biased** but **consistent** estimator for population variance.
    
    We can easily get an **unbiased** estimator by multiplying $\hat{\sigma^2}$ by a correction factor of $\frac{N}{N-1}$. That's why the sample variance is calculated as $\frac{1}{N-1}\sum_{i=1}^{N} (X_i-\bar{X})^2$ to account for the **loss of one degree of freedom**. However, the correction hardly matters in the many scenarios with large N.
    
    Furthermore, if the population obeys a normal distribution, the above unbiased $\hat{\sigma^2}$, scaled as $(N-1)\hat{\sigma^2}/\sigma^2$, obeys a **chi-squared** distribution with N-1 degrees of freedom, which can be used to estimate its standard error. A chi-squared distribution with k degrees of freedom is the distribution of the sum of k squared independent standard normal variables. It is a special Gamma distribution with mean of k and variance of 2k.
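
The chi-squared property can be checked with a quick simulation. This is a minimal sketch under stated assumptions: the population is a synthetic normal distribution, and `sigma2_true`, `N`, and `repeat` are illustrative values rather than course data. It rescales the unbiased sample variance by $(N-1)/\sigma^2$ and compares its mean and variance with those of a chi-squared distribution with N-1 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed for reproducibility

sigma2_true = 9.0   # assumed population variance (illustrative)
N = 8               # observations per sample
repeat = 50000      # number of repeated samples

samples = rng.normal(loc=5.0, scale=np.sqrt(sigma2_true), size=(repeat, N))
s2 = samples.var(axis=1, ddof=1)          # unbiased sample variance of each sample
scaled = (N - 1) * s2 / sigma2_true       # expected to follow chi-squared with N-1 df

print(f'mean of scaled variance = {scaled.mean():.3f} (theory: {N - 1})')
print(f'variance of scaled variance = {scaled.var():.3f} (theory: {2 * (N - 1)})')

# standard error of the unbiased variance estimator implied by the chi-squared result
se_theory = sigma2_true * np.sqrt(2.0 / (N - 1))
print(f'se of sample variance: simulated = {s2.std():.3f}, theory = {se_theory:.3f}')
```

The last line uses the chi-squared variance of 2(N-1) to obtain $se(\hat{\sigma^2})=\sigma^2\sqrt{2/(N-1)}$, which only holds when the population is normal.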

In [None]:
# example of how variance estimator changes with number of observations

Ns = range(10,100+1,10) # from sampling only 10 trees to sampling 100 trees, in steps of 10

repeat = 2000

# we will compare the expectation and standard error of both the biased and unbiased estimator

var_hat_biased = np.zeros((len(Ns),3)) # record mean, 2.5% and 97.5%
var_hat_unbiased = np.zeros((len(Ns),3)) # record mean, 2.5% and 97.5%

for i_N, N in enumerate(Ns):
    var_biased = [df_cr15['isMort'].sample(N).var(ddof=0) for i in range(repeat)]
    
    # biased estimator
    var_hat_biased[i_N,:] = [np.mean(var_biased),np.percentile(var_biased,2.5),np.percentile(var_biased,97.5)]
    
    # unbiased estimator
    var_unbiased = [var_val * N / (N-1) for var_val in var_biased]
    var_hat_unbiased[i_N,:] = [np.mean(var_unbiased),np.percentile(var_unbiased,2.5),np.percentile(var_unbiased,97.5)]


In [None]:
fig, ax = plt.subplots(1,1)

# biased
ax.plot(Ns,var_hat_biased[:,0],'b-',lw=2,label='Biased')
ax.plot(Ns,var_hat_biased[:,1:3],'b--',lw=1)

# unbiased
ax.plot(Ns,var_hat_unbiased[:,0],'r-',lw=2,label='Unbiased')
ax.plot(Ns,var_hat_unbiased[:,1:3],'r--',lw=1)

# population
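ax.plot(Ns,np.ones_like(Ns) * var_mort,'k-',lw=2,label='Population')

ax.legend() # show the labels defined above
ax.set_xlabel('Number of observations (N)')
ax.set_ylabel('Estimated variance of isMort')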


* More generally, the moments of the sample distribution are usually **consistent and asymptotically normal** estimators for population moments (see *All of Statistics* for more details).

* Instead of using moments, another way to formulate estimators theoretically is through the **maximum likelihood** approach.

    The **likelihood function** quantifies the probability of getting the observed values if the data generating processes have a given set of parameters. Maximum likelihood estimators are derived by maximizing the likelihood function (see Ch. 10.9 in How to be a quantitative ecologist for more details). The approach is also tightly linked with Bayesian estimation.
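
As a minimal sketch of the maximum likelihood idea, the example below assumes that tree death is a Bernoulli event with an unknown probability p. The data are synthetic (`p_true` and `N` are made-up values), and the helper `log_likelihood` is a hypothetical name introduced only for this illustration. The log-likelihood is evaluated on a grid of candidate values and the maximizer is compared with the sample mean, which is the analytical maximum likelihood estimator for a Bernoulli probability.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed for reproducibility

p_true = 0.05       # assumed mortality probability (illustrative)
N = 500             # number of observed trees
is_mort = rng.binomial(1, p_true, size=N)   # 1 = died, 0 = survived

def log_likelihood(p, data):
    """Bernoulli log-likelihood of the data for a mortality probability p."""
    k = data.sum()      # number of deaths
    n = data.size       # number of trees
    return k * np.log(p) + (n - k) * np.log(1.0 - p)

# evaluate the log-likelihood on a grid of candidate values and keep the maximizer
p_grid = np.linspace(0.001, 0.999, 999)
ll = log_likelihood(p_grid, is_mort)
p_mle = p_grid[np.argmax(ll)]

print(f'grid-search MLE = {p_mle:.3f}, sample mean = {is_mort.mean():.3f}')
```

A grid search is used here only because it makes the shape of the likelihood easy to inspect; for this simple model the maximizer has a closed form, and numerical optimizers are the usual choice for models with several parameters.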

### 2. Confidence Interval

A confidence interval describes how likely it is that an interval computed from our sample covers the true parameter value. More quantitatively, a 1 - $\alpha$ confidence interval for a parameter $\theta$ is an interval CI = (a, b) which traps $\theta$ with probability of 1 - $\alpha$. Note here, **CI is random and $\theta$ is fixed!**

If we know that the sampling distribution of the estimator follows a theoretical distribution (e.g. normal distribution), we can usually calculate the CI analytically (usually called a parametric approach to estimate CI).

* Example: 95% Confidence Interval for Population Mean with known variance

    We know from above that, for large N, the estimator $\hat{\mu}=\bar{X}$ is a random variable with an approximately normal distribution with mean of $\mu$ and variance of $\frac{1}{N}\sigma^2$ (usually written as $\bar{X}$ ~ $N(\mu,\frac{1}{N}\sigma^2)$).
    
    $\bar{X}$ can be converted to a standard normal variable by $Z=\sqrt{N}\frac{\bar{X}-\mu}{\sigma}$.
    
    To get the 95% CI, we need to find l and u such that $P(l \leq Z \leq u) = 0.95$, where l is the Z value with a CDF of 2.5% and u is the Z value with a CDF of 97.5%. For the standard normal distribution, l = -1.96 and u = 1.96.
    
    Therefore, with probability 0.95, $-1.96 \leq \sqrt{N}\frac{\bar{X}-\mu}{\sigma} \leq 1.96$.
    
    Rearranging the inequality, we get $\bar{X}-1.96\frac{\sigma}{\sqrt{N}} \leq \mu \leq \bar{X}+1.96\frac{\sigma}{\sqrt{N}}$.
    
    The parametric CI is derived using a t-distribution if the population variance is unknown (check Ch. 10.5 in How to be a quantitative ecologist).

    Below, we will further use a numerical experiment to illustrate the interpretation of the CI of population mean


In [None]:
# assume all the trees from the census are the 'population'
# we calculate the 'true' value for the mean and variance of the population
mu_dbh = df_cr15['DBH'].mean()
var_dbh  = df_cr15['DBH'].var(ddof=0) # by default ddof = 1
print(mu_dbh,var_dbh)

# we try to estimate mu and var with N samples

# first try N = 10
N = 10

repeat = 5000

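# count the number of times the CI of mu_hat covers true mu
# (mu_hat - 1.96*se <= mu <= mu_hat + 1.96*se is equivalent to
#  mu - 1.96*se <= mu_hat <= mu + 1.96*se, which is what we test below)
se_dbh = np.sqrt(var_dbh / N)

CI_hit = [mu_dbh - 1.96 * se_dbh <= df_cr15['DBH'].sample(N).mean() <= mu_dbh + 1.96 * se_dbh
          for i in range(repeat)]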

print(np.mean(CI_hit))
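
When the population variance is unknown, the standard error has to be estimated from each sample and the critical value comes from a t-distribution. The sketch below illustrates this under stated assumptions: the population is a synthetic normal distribution (`mu_true` and `sigma_true` are made-up values), and the critical value 2.262 is the tabulated 97.5% quantile of a t-distribution with 9 degrees of freedom, so it only applies to N = 10.

```python
import numpy as np

rng = np.random.default_rng(3)   # fixed seed for reproducibility

mu_true, sigma_true = 20.0, 5.0   # assumed normal population (illustrative)
N = 10                            # observations per sample
repeat = 5000                     # number of repeated samples

t_crit = 2.262   # 97.5% quantile of the t-distribution with N - 1 = 9 df (tabulated)
z_crit = 1.96    # 97.5% quantile of the standard normal distribution

samples = rng.normal(mu_true, sigma_true, size=(repeat, N))
x_bar = samples.mean(axis=1)
se_hat = samples.std(axis=1, ddof=1) / np.sqrt(N)   # standard error estimated per sample

# fraction of intervals x_bar +/- crit * se_hat that cover the true mean
cover_t = np.mean(np.abs(x_bar - mu_true) <= t_crit * se_hat)
cover_z = np.mean(np.abs(x_bar - mu_true) <= z_crit * se_hat)

print(f'coverage with t critical value ({t_crit}): {cover_t:.3f}')
print(f'coverage with z critical value ({z_crit}): {cover_z:.3f}')
```

With a small N, keeping 1.96 while plugging in an estimated standard error gives intervals that are too narrow, so their coverage falls below the nominal 95%; the t critical value corrects for this when the population is normal.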


### 3. Bootstrap

Not all standard errors and confidence intervals can be easily derived analytically, especially when the population distribution is unknown. **Bootstrap/bootstrapping** (literally lifting yourself up by pulling on the laces of your own shoes...) can become useful under these nasty scenarios, although it sounds suspicious scientifically.

The key idea of bootstrapping is to estimate the properties of the population distribution by **resampling** the sample distribution. The general procedure for doing a bootstrap includes:
1. Draw n random samples (with replacement) from N observations (typically n = N)
2. Compute the estimator/parameter of interest (e.g. mean, variance, median, quantile, ...)
3. Repeat steps 1 and 2 a large number of times (e.g. 1000+) to get a distribution of the estimator
4. Calculate the properties of the estimator directly using the distribution from 3.
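
The four steps can be wrapped into a small reusable helper. This is only a sketch: the function name `bootstrap`, its arguments, and the synthetic right-skewed sample are hypothetical and introduced for illustration; the percentile interval used here is one common choice among several bootstrap CI methods.

```python
import numpy as np

def bootstrap(sample, statistic, n_boot=2000, alpha=0.05, seed=None):
    """Return (estimate, bootstrap standard error, percentile CI) of statistic(sample)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = sample.size
    boot_stats = np.empty(n_boot)
    for b in range(n_boot):                       # step 3: repeat many times
        # step 1: resample n observations with replacement
        resample = rng.choice(sample, size=n, replace=True)
        # step 2: compute the statistic of interest on the resample
        boot_stats[b] = statistic(resample)
    # step 4: summarize the bootstrap distribution
    se = boot_stats.std(ddof=1)
    ci = np.percentile(boot_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return statistic(sample), se, ci

# usage on a synthetic, right-skewed sample (made-up data, not the census)
rng = np.random.default_rng(4)
dbh_sample = rng.lognormal(mean=2.7, sigma=0.4, size=150)
est, se, ci = bootstrap(dbh_sample, np.median, seed=5)
print(f'median = {est:.2f}, bootstrap se = {se:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})')
```

Passing the statistic as a function keeps the resampling logic separate from the quantity being estimated, so the same helper works for a mean, a variance, or a quantile.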

In [None]:
# bootstrap the median of DBH
boot_num = 5000
dbh_med_boot = [np.median(df_cr15['DBH'].sample(df_cr15.shape[0],replace=True)) for i in range(boot_num)]

fig, ax = plt.subplots(1,1)

ax.hist(dbh_med_boot,bins=np.linspace(10,20,100),density=True)

ax.boxplot(dbh_med_boot,vert=False,positions=[2])

ax.set_xlim((13,16))
plt.show()

In [None]:
# plot how mortality rates change with size classes and functional groups (fixer and non fixer)
# and estimate the 95% CI for each mortality value

dbh_class = [10,15,20,30,50]
pfts = ['F','NF']

mort_avg = np.full((2,len(dbh_class)),fill_value=np.nan,dtype=float) # fixer and non-fixer
mort_CI  = np.full((2,2,len(dbh_class)),fill_value=np.nan,dtype=float)

boot_num = 5000

for idbh, dbh_left in enumerate(dbh_class):
    if dbh_left != dbh_class[-1]:
        # not the last one
        size_mask = (df_cr15['DBH'] >= dbh_left) & (df_cr15['DBH'] < dbh_class[idbh+1]) # element-wise and operator
    else:
        # last one
        size_mask = (df_cr15['DBH'] >= dbh_left)

    for ipft, pft in enumerate(pfts):
        
        if pft == 'F':
            pft_mask = (df_cr15['N_Fixer'] == 1)
        else:
            pft_mask = (df_cr15['N_Fixer'] != 1)
            
        
        # now get the subset
        if (size_mask & pft_mask).sum() == 0:
            # no entries found
            # skip
            continue
            
        df_sub = df_cr15[size_mask & pft_mask]
        
        # calculate mean mortality and bootstrap CI
        mort_avg[ipft,idbh] = df_sub['isMort'].mean()
        
        mort_boot = [df_sub['isMort'].sample(df_sub.shape[0],replace=True).mean() for ib in range(boot_num)]
        
        mort_CI[:,ipft,idbh] = np.percentile(mort_boot,[2.5,97.5])


In [None]:
fig, ax = plt.subplots(1,1)

# plot fixer
ax.errorbar(dbh_class,mort_avg[0,:],
            yerr=np.array([mort_avg[0,:]-mort_CI[0,0,:],mort_CI[1,0,:]-mort_avg[0,:]]),
            fmt='ro-',lw=2,label='Fixer')

# plot non-fixer
ax.errorbar(np.array(dbh_class)+1,mort_avg[1,:],
            yerr=np.array([mort_avg[1,:]-mort_CI[0,1,:],mort_CI[1,1,:]-mort_avg[1,:]]),
            fmt='bs-',lw=2,label='Non-Fixer')

ax.legend(loc='upper left')

ax.set_xlabel('DBH Class')
ax.set_ylabel('Mortality')
            

Note that bootstrapping does not provide more real data to infer the population. Instead, it is a computational method to extract statistical properties from the sample, even when the number of observations is small. The bootstrap estimate can still be significantly biased if the sample distribution does not reflect the population distribution (e.g. due to sampling bias).
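
The effect of sampling bias on a bootstrap CI can be illustrated with a toy experiment. Everything in this sketch is assumed for illustration: the population is synthetic, the biased sample is drawn with a selection probability proportional to tree size, and `boot_ci_mean` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(6)   # fixed seed for reproducibility

# synthetic 'population' of tree sizes (made-up, right-skewed)
population = rng.lognormal(mean=0.0, sigma=0.5, size=20000)
mu_true = population.mean()

n = 300
# biased sampling: larger trees are more likely to be measured
p_select = population / population.sum()
biased_sample = rng.choice(population, size=n, replace=False, p=p_select)
# simple random sample for comparison
random_sample = rng.choice(population, size=n, replace=False)

def boot_ci_mean(sample, n_boot=2000):
    """95% percentile bootstrap CI of the mean."""
    # each row of idx selects one resample (with replacement) of the sample
    idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
    return np.percentile(sample[idx].mean(axis=1), [2.5, 97.5])

ci_biased = boot_ci_mean(biased_sample)
ci_random = boot_ci_mean(random_sample)

print(f'population mean = {mu_true:.3f}')
print(f'bootstrap CI from random sample = ({ci_random[0]:.3f}, {ci_random[1]:.3f})')
print(f'bootstrap CI from biased sample = ({ci_biased[0]:.3f}, {ci_biased[1]:.3f})')
```

The interval from the size-biased sample is about as narrow as the one from the random sample, but it is centered on the wrong value: resampling reproduces whatever bias the original sample carries.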

### Summary:
1. Get familiar with key concepts in point estimation: bias and consistency. Usually consistency is good enough for estimators.
2. Derivation of estimators using moments/maximum likelihood
3. Interpretation and calculation of confidence interval
4. Bootstrap is a useful way to estimate the uncertainty of estimators computationally