In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm

import matplotlib
matplotlib.use('nbagg')   # select the notebook backend before importing pyplot
import matplotlib.pyplot as plt

# Inferential Statistics

Inferential statistics involves drawing conclusions about a population from a sample. 

A sample **statistic** is a quantity calculated from the sample values. Typical notation is $\bar{x}$ for the sample mean and $s$ for the sample standard deviation. The goal of inferential statistics is to estimate values for parameters of the population, such as its mean $\mu$ and its standard deviation $\sigma$. 

Suppose we draw a sample of size $n$. The particular set of values obtained is denoted $x_1, x_2 \cdots x_n$. These are particular realizations of the random variables $X_1, X_2 \cdots X_n$. We have a **random sample** if the $X_i$ are independent of each other and identically distributed (iid). 



## Estimators
A function $g(X_1, X_2 \cdots X_n)$ designed to estimate a population parameter is called an **estimator**. For example, the function
$$
g(X_1, X_2 \cdots X_n) = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i
$$
is an estimator for $\mu$. Note that we could choose some other function as an estimator: a trimmed mean or a median are examples. The value computed from a particular sample is the **estimate**, and is denoted by a hat over the parameter. For example, an estimate of $\mu$ is $\hat{\mu}$.
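
As a rough illustration, the sketch below applies three candidate estimators of $\mu$ to one simulated sample. The population (normal with mean 5 and standard deviation 2), the sample size, the 10% trimming fraction, and the seed are all assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)      # assumed seed, for reproducibility only
mu_true, sigma_true = 5.0, 2.0       # assumed population parameters
x = rng.normal(mu_true, sigma_true, size=50)

# Three estimators of mu applied to the same sample give three estimates
estimates = {
    "sample mean": np.mean(x),
    "median": np.median(x),
    "10% trimmed mean": stats.trim_mean(x, 0.1),  # cuts 10% from each tail
}
for name, value in estimates.items():
    print(f"{name:>16}: {value:.3f}")
```

Each value printed is an estimate $\hat{\mu}$; the estimators differ in how they combine the same sample values.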

### Bias and Variance of estimators
Estimator functions are designed to estimate population parameters. Given a choice among estimators, we can develop criteria to select the best one among them. The main criteria are bias and efficiency. For an estimator function $g$ designed to estimate the population parameter $\theta$,

$$
\begin{split}
\text{Bias} &\equiv E(g) - \theta \\
\text{var}(g) &\equiv E\left[(g-E(g))^2\right]
\end{split}
$$

where the expectations are taken over all samples from the population. Clearly, we prefer estimators with low or zero bias and small variance.
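
Since the expectations run over all possible samples, they can be approximated by drawing many samples and averaging. The sketch below does this for the sample mean and the sample median as estimators of $\mu$. The helper `bias_and_variance`, the normal population, and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np

def bias_and_variance(estimator, theta, sampler, n, reps=20000):
    """Approximate the bias and variance of `estimator` by repeated sampling."""
    samples = sampler(size=(reps, n))      # one simulated sample per row
    g = estimator(samples, axis=1)         # one estimate per sample
    return g.mean() - theta, g.var()       # (bias, variance) over the replications

rng = np.random.default_rng(0)             # assumed seed
mu_true, sigma_true, n = 5.0, 2.0, 20      # assumed population and sample size

def sampler(size):
    return rng.normal(mu_true, sigma_true, size=size)

bias_mean, var_mean = bias_and_variance(np.mean, mu_true, sampler, n)
bias_median, var_median = bias_and_variance(np.median, mu_true, sampler, n)

print(f"mean:   bias={bias_mean:+.4f}, variance={var_mean:.4f}")
print(f"median: bias={bias_median:+.4f}, variance={var_median:.4f}")
```

For a normal population both estimators should show bias close to zero, with the mean having the smaller variance.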

#### Mean Squared Error

A metric that incorporates both estimator bias and variance is the mean squared error (MSE). For a given estimator $g(X_1,X_2,\cdots X_n)$ and population parameter $\theta$,

$$
MSE \equiv E\left[(g-\theta)^2\right] = E\left[g^2 - 2 \theta\;g + \theta^2\right] = E(g^2) -2\theta\;E(g) + \theta^2 
$$

Now, for any random variable $Y, E(Y^2) = \text{var}(Y) + E(Y)^2$. So,

$$
\begin{split}
MSE &= E(g^2) -2\theta\;E(g) + \theta^2\\ 
&= [E(g)^2 + \text{var}(g)]-2\theta\;E(g) + \theta^2\\
&= [E(g) - \theta]^2 + \text{var}(g) = \text{Bias}^2 + \text{var}(g)
\end{split}
$$
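
The decomposition can be checked numerically. The sketch below uses a deliberately biased estimator, $0.9\,\bar{X}$, chosen only so that the bias term is visibly nonzero; the population parameters and the seed are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)                  # assumed seed
mu_true, sigma_true, n, reps = 5.0, 2.0, 10, 50000

samples = rng.normal(mu_true, sigma_true, size=(reps, n))
g = 0.9 * samples.mean(axis=1)                  # deliberately biased estimator of mu

mse = np.mean((g - mu_true) ** 2)               # E[(g - theta)^2]
bias = g.mean() - mu_true                       # E(g) - theta
variance = g.var()                              # ddof=0, matching E[(g - E(g))^2]

print(f"MSE               = {mse:.5f}")
print(f"bias^2 + variance = {bias**2 + variance:.5f}")
```

The two printed numbers agree up to floating-point rounding, because the identity holds for the empirical moments as well as for the true expectations.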


### Estimator for population mean $\mu$

Consider estimators for the population mean $\mu$. Each random variable $X_i$ has an expected value $\mu$ and variance $\sigma^2$. So, if we pick one of them as our estimator for the mean, say $g(X_1,X_2,\cdots X_n) = X_1$, we will have an unbiased estimator with variance $\sigma^2$. However, it is possible to do better than that. The sample mean $\bar{X}$ is also unbiased but is more **efficient** because it has a lower variance, $\sigma^2/n$:

$$
\begin{split}
E(\bar{X}) &= E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n}\; n\mu = \mu \\ \\
\text{var}(\bar{X}) &= \frac{1}{n^2}\text{var}\left(\sum X_i\right) = \frac{1}{n^2} \left[\sum\text{var}( X_i) + 2\sum_{i<j} \text{cov}(X_i,X_j)\right]\\
 &= \frac{1}{n^2} \sum\text{var} (X_i) = \frac{1}{n^2}\; n\sigma^2 = \frac{\sigma^2}{n}
\end{split}
$$

Here, we have used the fact that $\text{cov}(X_i,X_j) = 0\; \forall\; i \ne j$ since the $\{X_i\}$ are all independent of each other.
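
A quick simulation can make the $\sigma^2/n$ result concrete. This is a sketch under assumed settings (normal population with $\sigma = 2$, three sample sizes, 40000 replications each), not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)                  # assumed seed
mu_true, sigma_true, reps = 0.0, 2.0, 40000     # assumed population and replications

results = {}
for n in (5, 25, 100):
    # Sample mean of each of `reps` independent samples of size n
    xbar = rng.normal(mu_true, sigma_true, size=(reps, n)).mean(axis=1)
    results[n] = (xbar.var(), sigma_true**2 / n)    # (simulated, theoretical)
    print(f"n={n:>3}: simulated var={results[n][0]:.4f}, sigma^2/n={results[n][1]:.4f}")
```

The simulated variance of $\bar{X}$ should track $\sigma^2/n$ closely, shrinking as $n$ grows.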



### Estimator for population variance $\sigma^2$
Let $g(X_1,X_2,\cdots X_n) = \sum (X_i - \bar{X})^2$. Then

$$
\begin{split}
g &= \sum (X_i^2 -2\bar{X}X_i + \bar{X}^2) = \sum X_i^2 -2n\bar{X}^2 + n \bar{X}^2 = \sum X_i^2 - n \bar{X}^2\\
  \\
E(g) &= \sum E(X_i^2) - n E(\bar{X}^2)
\end{split}
$$

We now use the fact that for any random variable $Y, E(Y^2) = \text{var}(Y) + E(Y)^2$. So,
$$
\begin{split}
E(g)&= \sum(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n}+\mu^2\right)\\
&= n(\sigma^2 + \mu^2) - (\sigma^2 + n\mu^2)\\
&= (n-1)\sigma^2
\end{split}
$$

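Since $E(g) = (n-1)\sigma^2$, dividing the sum of squared deviations from the mean by $n$ gives a biased estimator of the population variance, with expected value $\frac{n-1}{n}\sigma^2$. However, this is easily fixed by dividing by $n-1$ instead. The **sample variance**, $s^2$, defined as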
$$
s^2 \equiv \frac{1}{n-1} \sum_{i=1}^n (X_i-\bar{X})^2
$$
gives us an unbiased estimator of the population variance $\sigma^2$.
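
The effect of the denominator can be seen by averaging both versions over many small samples. In the sketch below, the normal population with $\sigma^2 = 4$, the sample size $n = 5$, and the seed are assumptions; `ddof` is NumPy's name for the quantity subtracted from $n$ in the denominator.

```python
import numpy as np

rng = np.random.default_rng(3)                  # assumed seed
sigma_true, n, reps = 2.0, 5, 200000            # assumed sigma, sample size, replications

samples = rng.normal(0.0, sigma_true, size=(reps, n))
mean_var_n = samples.var(axis=1, ddof=0).mean()     # denominator n
mean_var_n1 = samples.var(axis=1, ddof=1).mean()    # denominator n-1 (sample variance)

print(f"true variance           = {sigma_true**2:.3f}")
print(f"average with 1/n        = {mean_var_n:.3f}")   # expected near (n-1)/n * sigma^2
print(f"average with 1/(n-1)    = {mean_var_n1:.3f}")  # expected near sigma^2
```

With $n = 5$ the $1/n$ version should average close to $0.8\,\sigma^2$, while the $1/(n-1)$ version should average close to $\sigma^2$.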


## One sample t-test

For a random sample of size $n$ drawn from a normally distributed population with mean $\mu$, with sample mean $\bar{x}$ and sample standard deviation $s$, the test statistic 
$$
t = \frac{\bar{x}-\mu}{s/\sqrt{n}}
$$
has the t distribution with $n-1$ degrees of freedom.

In hypothesis testing, we normally compute the p-value, which is the probability, assuming the null hypothesis is true, of obtaining a value at least as extreme as the computed test statistic. Which values count as extreme depends on the form of the alternative hypothesis.
$$
\begin{split}
H_0 &: \mu = 0 \\
H_a &: \mu \ne 0 \\
p &= P[T \le -|t|] + P[T \ge |t|] \hspace{1in}\text{Two-tailed t test}\\
\\
H_0 &: \mu = 0 \\
H_a &: \mu \lt 0 \\
p &= P[T \le t] \hspace{1in}\text{Lower-tailed t test}\\
\\
H_0 &: \mu = 0 \\
H_a &: \mu \gt 0 \\
p &= P[T \ge t] \hspace{1in}\text{Upper-tailed t test}
\end{split}
$$
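
The three p-values can be sketched in one helper. The function `one_sample_t` and the simulated sample are hypothetical, introduced only for illustration. The one-tailed cases use the signed statistic $t$, and the two-sided result is compared with `stats.ttest_1samp`, whose default alternative is two-sided.

```python
import numpy as np
from scipy import stats

def one_sample_t(x, mu0=0.0):
    """Return the t statistic and the p-values for H0: mu = mu0 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    df = n - 1
    p_values = {
        "two-sided": 2 * stats.t.sf(abs(t), df),  # Ha: mu != mu0
        "lower": stats.t.cdf(t, df),              # Ha: mu <  mu0
        "upper": stats.t.sf(t, df),               # Ha: mu >  mu0
    }
    return t, p_values

rng = np.random.default_rng(0)                        # assumed seed
sample = rng.normal(loc=0.5, scale=1.0, size=12)      # assumed data

t_stat, p_values = one_sample_t(sample, mu0=0.0)
reference = stats.ttest_1samp(sample, 0.0)            # two-sided by default

print(t_stat, p_values)
print(reference.statistic, reference.pvalue)
```

Using `stats.t.sf` (the survival function, $1 - \text{cdf}$) for upper tails avoids the loss of precision that `1 - cdf` can suffer for very small p-values.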

In [25]:
# Demo of scipy.stats one sample t test function

np.random.seed(12345)
n=10
v1 = stats.norm.rvs(size=n)  # Standard normal random variates

xbar = np.mean(v1)
s = np.std(v1, ddof=1)    # ddof=1 uses (n-1) for the denominator, so gives us sample sd

# Default behavior of ttest_1samp:
# H0: mu = 0, Ha: mu not equal to zero

t = xbar / (s/np.sqrt(n))    # mu = 0 under H0, so the numerator is just xbar
p = 2*(1-stats.t.cdf(abs(t),n-1)) # Two-sided test, so twice the upper-tail probability of |t|. Note that degrees of freedom = n-1
t,p
stats.ttest_1samp(v1,0)


(1.8498368520005377, 0.09737624227318431)

Ttest_1sampResult(statistic=1.8498368520005373, pvalue=0.09737624227318435)

In [37]:
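# Lower-tailed one sample t test from summary statistics
# H0: mu = 43, Ha: mu < 43
mu = 43      # hypothesized population mean
xbar = 40    # sample mean
n = 125      # sample size
s = 15       # sample standard deviation
t = (xbar-mu) / (s/np.sqrt(n))
p = stats.t.cdf(t,n-1)    # Lower-tail probability P[T <= t], degrees of freedom = n-1
n, t, p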

(125, -2.23606797749979, 0.013567791664809134)

In [39]:
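stats.t.cdf(-1.658,124)  # Lower-tail probability P[T <= -1.658] with 124 degrees of freedom
stats.t.ppf(0.05,124)    # Percent Point Function (inverse of cdf): 5% lower-tail critical value

stats.t.ppf(0.025,185)   # 2.5% lower-tail critical value with 185 degrees of freedom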

0.04992258704151217

-1.6572349701441826

-1.9728699462074992

In [44]:
(3822396/3)/(96092902/185)   # Ratio of two quantities, each divided by its degrees of freedom (3 and 185)

2.452984716810821

In [47]:
stats.f.ppf(0.95,3,185)   # 95th percentile of the F distribution with (3, 185) degrees of freedom

2.6534284283390934