<a href="https://colab.research.google.com/github/tyro2001/hello-world/blob/master/DataScienceNotes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Central Limit Theorem**

If $X_1, X_2,..., X_n$ are $n$ i.i.d. random samples drawn from a population with mean $\mu$ and finite variance $\sigma^2 > 0$, and if $\overline{X}_n$ is the sample mean, then $Z_n = \sqrt{n}\left(\frac{\overline{X}_n - \mu}{\sigma} \right)$ converges in distribution to the standard normal distribution as $n \to \infty$.
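
The statement above can be checked by simulation. The sketch below is only an illustration: the Exponential(1) population, the sample size of 200, and the helper name `standardized_means` are assumptions made for this example, not part of the theorem.

```python
import random
import statistics

def standardized_means(n, reps, rate=1.0, seed=0):
    """Simulate sqrt(n) * (sample mean - mu) / sigma for Exponential(rate) data."""
    rng = random.Random(seed)
    mu = 1.0 / rate      # population mean of Exponential(rate)
    sigma = 1.0 / rate   # population standard deviation of Exponential(rate)
    values = []
    for _ in range(reps):
        xbar = statistics.fmean(rng.expovariate(rate) for _ in range(n))
        values.append(n ** 0.5 * (xbar - mu) / sigma)
    return values

z = standardized_means(n=200, reps=2000)
clt_mean = statistics.fmean(z)   # should be close to 0
clt_sd = statistics.stdev(z)     # should be close to 1
# Share of draws inside +/-1.96, which is about 0.95 for a standard normal
clt_cover = sum(abs(v) <= 1.96 for v in z) / len(z)
print(clt_mean, clt_sd, clt_cover)
```

Even though the exponential population is strongly skewed, the standardized sample means should already look close to standard normal at this sample size.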


**Bias-variance tradeoff**

* Mean squared error (MSE): $MSE_\theta(\widehat{\theta}) = E \left((\widehat{\theta}-\theta)^2\right)$


* Bias: $bias_{\theta}(\widehat{\theta}) = E(\widehat{\theta}) - \theta$


* Result: $MSE_\theta(\widehat{\theta}) = V(\widehat{\theta}) + bias_{\theta}^{2}(\widehat{\theta})$
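
The decomposition above can be verified numerically. This sketch assumes a N(0, 4) population and uses the divisor-$n$ variance estimator (which is biased) as $\widehat{\theta}$; both choices are illustrative only.

```python
import random
import statistics

rng = random.Random(1)
true_var = 4.0          # sigma^2 of the simulated N(0, 2^2) population
n, reps = 10, 5000

estimates = []
for _ in range(reps):
    sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
    estimates.append(statistics.pvariance(sample))  # divisor n, so biased

# Monte Carlo versions of MSE, bias, and variance of the estimator
mse = statistics.fmean((e - true_var) ** 2 for e in estimates)
bias = statistics.fmean(estimates) - true_var
variance = statistics.pvariance(estimates)

print(mse, variance + bias ** 2)   # the two numbers agree up to rounding
print(bias)                        # theory: -sigma^2 / n = -0.4
```

The equality holds exactly for the simulated values (it is an algebraic identity), while the bias itself is only estimated up to Monte Carlo error.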

**Three important cases**

1. Estimation of a population proportion.  Suppose $X$ has the binomial($n, p$) distribution.  The sample proportion, $\hat{p} = X/n$, is an unbiased estimator of $p$ with standard error $\sqrt{\frac{p(1-p)}{n}}$


2. Suppose $X_1, X_2,..., X_n$ are uncorrelated (as opposed to independent) RVs with $E(X_i) = \mu$ and $V(X_i) = \sigma^2$ for all $i$. Then the standard estimator for $\mu$ is $\overline{X}$, and the standard estimator for $\sigma^2$ is $s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\overline{X})^{2}$


3. Suppose $X_1, X_2,..., X_n$ are i.i.d. N($\mu, \sigma^{2}$).  Then we can use the four classic results below.

**4 Classic Results**

Assuming $X_1, X_2,..., X_n$ are i.i.d. N($\mu, \sigma^{2}$) then:

1. $\overline{X}$ and $s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\overline{X})^{2}$ are independent.

2. The quantity $\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}$ has the standard normal distribution.

3. The quantity $\frac{(n-1)s^{2}}{\sigma^{2}}$ has the Chi-squared distribution with $n-1$ degrees of freedom

4. The quantity $\frac{\overline{X} - \mu}{s/\sqrt{n}}$ has the $t$-distribution with $n-1$ degrees of freedom

**Facts about the [$\chi_k^{2}$ distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution)**

* pdf: $f_{X}(x) = \frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2 -1}e^{-x/2}$ with support $x > 0$ for $k=1$ and $x \ge 0$ for $k>1$

* Mean is $k$

* Variance is $2k$
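
As a rough check of the mean and variance facts, and of the classic result that $(n-1)s^2/\sigma^2$ is chi-squared with $n-1$ degrees of freedom, here is a small simulation. The values of $\mu$, $\sigma$, and $n$ are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(2)
mu, sigma, n, reps = 5.0, 3.0, 8, 5000

scaled_vars = []   # simulated values of (n - 1) s^2 / sigma^2
for _ in range(reps):
    x = [rng.gauss(mu, sigma) for _ in range(n)]
    s2 = statistics.variance(x)            # divisor n - 1
    scaled_vars.append((n - 1) * s2 / sigma ** 2)

chi2_mean = statistics.fmean(scaled_vars)     # theory: k = n - 1 = 7
chi2_var = statistics.variance(scaled_vars)   # theory: 2k = 14
print(chi2_mean, chi2_var)
```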





**Facts about the [Gamma distribution](https://en.wikipedia.org/wiki/Gamma_distribution)**

* pdf: $\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$ for $ 0 < x < \infty$ and $\alpha,\beta > 0 $


* Mean is $\frac{\alpha}{\beta}$


* Variance is $\frac{\alpha}{\beta^2}$


* Gamma$(\alpha=1, \beta)$ is Exponential$(\beta)$


* $cX \sim $ Gamma$(\alpha, \beta/c)$ for $X \sim $ Gamma$(\alpha, \beta)$ and any constant $c > 0$


* $\sum_{i=1}^{n}X_i$ is distributed Gamma$(n, \beta)$ where $X_i$ i.i.d. Exponential$(\beta)$


* $\overline{X}$ is distributed Gamma$(n, n\beta)$ where $X_i$ i.i.d. Exponential$(\beta)$
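
The last fact can be checked by simulation. One caveat when translating these notes to code: the notes parameterize the Gamma distribution by the rate $\beta$, whereas Python's `random.gammavariate` takes a shape and a scale, so the scale passed below is $1/(n\beta)$. The rate and sample size are arbitrary illustrative values.

```python
import random
import statistics

rng = random.Random(3)
rate, n, reps = 2.0, 5, 20000

# Sample means of n i.i.d. Exponential(rate) draws
xbars = [statistics.fmean(rng.expovariate(rate) for _ in range(n))
         for _ in range(reps)]

# Direct draws from Gamma(shape=n, rate=n*rate); gammavariate wants scale = 1/rate
direct = [rng.gammavariate(n, 1.0 / (n * rate)) for _ in range(reps)]

theory_mean = n / (n * rate)          # alpha / beta = 0.5
theory_var = n / (n * rate) ** 2      # alpha / beta^2 = 0.05
print(statistics.fmean(xbars), statistics.fmean(direct), theory_mean)
print(statistics.variance(xbars), statistics.variance(direct), theory_var)
```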

**Delta Method**

* One dimension: if $Y_n$ is approximately $N(\mu, \sigma_n^2)$ and $\sigma_n^2 \to 0$ as $n$ increases, then $g(Y_n)$ is approximately $N(g(\mu), (g^\prime(\mu))^2 \sigma_n^2)$, assuming that $g^\prime(\mu)$ exists and is not zero.

* Multiple dimensions: Suppose $Y_n$ is approximately $N_{p}(\mu, \Sigma/n)$.  Now consider an $m$-dimensional function $g()$ where $m \le p$:
$g(Y_n) = [g_1(Y_n) \: g_2(Y_n) \: ... \: g_m(Y_n)]^{T}$.  Then $g(Y_n)$ is approximately $N_m(g(\mu), G \Sigma G^T/n)$ where $G$ is the $m \times p$ matrix with entries $G_{ij} = \partial g_i(y)/\partial y_j$ evaluated at $y = \mu$, and $G$ is assumed to be full rank
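
A minimal one-dimensional illustration of the delta method, under assumptions chosen for this example: $Y_n$ is the mean of $n$ Exponential(rate) draws, so $\mu = 1/\text{rate}$ and $\sigma_n^2 = 1/(n \cdot \text{rate}^2)$, and $g(y) = 1/y$.

```python
import random
import statistics

rng = random.Random(5)
rate, n, reps = 2.0, 400, 2000

# Y_n = sample mean of n Exponential(rate) draws, approx N(1/rate, 1/(n rate^2))
# g(y) = 1/y, so g'(mu) = -1/mu^2 with mu = 1/rate
g_values = []
for _ in range(reps):
    y_n = statistics.fmean(rng.expovariate(rate) for _ in range(n))
    g_values.append(1.0 / y_n)

mu = 1.0 / rate
sigma_n2 = 1.0 / (n * rate ** 2)
delta_sd = abs(-1.0 / mu ** 2) * sigma_n2 ** 0.5   # |g'(mu)| * sigma_n
simulated_sd = statistics.stdev(g_values)
print(delta_sd, simulated_sd)
```

The delta-method standard deviation and the simulated standard deviation of $g(Y_n)$ should agree to within Monte Carlo error.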




**Fisher Information**

* One dimensional
$$I(\theta_0) = E\left(-\frac{\partial^2 \log f(X; \theta)}{\partial\theta^2} \Big|_{\theta = \theta_0} \right)$$


* Multi dimensional.  $\theta = (\theta_1, \theta_2,..., \theta_p)$
$$I_{ij}(\theta) = E\left[-\frac{\partial^2}{\partial\theta_{i}\partial\theta_{j}}\log f(X; \theta)  \right]$$
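
To make the one-dimensional definition concrete, the sketch below evaluates it numerically for a Bernoulli($p$) model, where the closed form $I(p) = 1/(p(1-p))$ is known. The Bernoulli model, the finite-difference step `h`, and the function names are assumptions for illustration.

```python
import math

def log_f(x, p):
    """Log of the Bernoulli(p) pmf at x in {0, 1}."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)

def fisher_information(p0, h=1e-4):
    """E[-d^2/dp^2 log f(X; p)] at p = p0, using a central finite difference."""
    info = 0.0
    for x, prob in ((0, 1 - p0), (1, p0)):   # expectation over X ~ Bernoulli(p0)
        second = (log_f(x, p0 + h) - 2 * log_f(x, p0) + log_f(x, p0 - h)) / h ** 2
        info += prob * (-second)
    return info

p0 = 0.3
numeric_info = fisher_information(p0)
exact_info = 1.0 / (p0 * (1 - p0))   # known closed form for Bernoulli(p)
print(numeric_info, exact_info)
```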

**Method of Moments**

* $E(X^k)$ is called the $k^{th}$ moment for the distribution of X.  We might call this the $k^{th}$ ***population*** moment to be more clear.


* The $k^{th}$ ***sample*** moment is $M_k = \sum_{i=1}^{n}X_i^k/n$


* NB: $M_k$ is an unbiased estimator for $E(X^k)$, i.e. $E(M_k) = E(X^k)$


* Suppose that $X_1, X_2, ..., X_n$ are i.i.d. with density $f_X(x; \theta)$ where $\theta$ is a vector of length $p$.  Set up a system of equations $M_i = E(X^i)$ with $i = 1, 2,..., p$ and solve them to find $\theta_{MOM}$. Note that this treats each sample moment as if it were exactly equal to the population moment.  In fact, $M_i \approx E(X^i)$, so the result is only approximately correct.
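
As a worked sketch of the recipe above, consider the Gamma$(\alpha, \beta)$ model with rate $\beta$, for which $E(X) = \alpha/\beta$ and $E(X^2) = \alpha(\alpha+1)/\beta^2$. Solving $M_1 = E(X)$ and $M_2 = E(X^2)$ gives $\alpha = M_1^2/(M_2 - M_1^2)$ and $\beta = M_1/(M_2 - M_1^2)$. The function name `gamma_mom` and the simulated data are illustrative assumptions.

```python
import random
import statistics

def gamma_mom(xs):
    """Method-of-moments estimates (alpha, beta) for Gamma with rate beta."""
    m1 = statistics.fmean(xs)                   # first sample moment
    m2 = statistics.fmean(x * x for x in xs)    # second sample moment
    spread = m2 - m1 ** 2
    return m1 ** 2 / spread, m1 / spread

rng = random.Random(7)
alpha, beta = 3.0, 2.0
# random.gammavariate takes (shape, scale), and scale = 1 / rate
data = [rng.gammavariate(alpha, 1.0 / beta) for _ in range(20000)]
alpha_mom, beta_mom = gamma_mom(data)
print(alpha_mom, beta_mom)
```

The estimates land near, but not exactly on, the true values, which is the approximation the last bullet describes.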

**Maximum Likelihood Estimation (MLE)**

* The likelihood function is constructed by calculating the distribution for the data, then holding the data constant,  and then allowing the parameters to vary.


* $L(\theta; x) = f_{\theta}(x)$


* The maximum likelihood estimator for the parameters is the choice of parameters which maximizes the likelihood function


* A common case is where $X_1, X_2, ..., X_n$ are i.i.d. with density $f_{X_i}(x; \theta)$.  In this case the joint distribution of the random variables is:
$$ f_X(x; \theta) = \prod_{i=1}^n f_{X_i}(x_i; \theta) $$
and the likelihood function is:
$$ L(\theta; x) = \prod_{i=1}^n f_{X_i}(x_i; \theta) $$


* The log likelihood function: $\ell(\theta; x) = \log L(\theta; x)$


* In the case where the $X_i$ are i.i.d. then: 
$$\ell(\theta; x) = \sum_{i=1}^n \log f_{X_i}(x_i; \theta)$$
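
The i.i.d. log likelihood can be maximized numerically. The sketch below assumes an Exponential(rate) model with density $f(x; \text{rate}) = \text{rate} \cdot e^{-\text{rate} \cdot x}$, for which the MLE is known in closed form as $1/\overline{x}$, and uses a simple ternary search (an illustrative choice; any one-dimensional optimizer would do) because this log likelihood is concave.

```python
import math
import random

rng = random.Random(8)
true_rate = 2.0
x = [rng.expovariate(true_rate) for _ in range(500)]

def log_likelihood(rate):
    """Sum of log f(x_i; rate) with f(x; rate) = rate * exp(-rate * x)."""
    return sum(math.log(rate) - rate * xi for xi in x)

def maximize_concave(f, lo, hi, iters=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

rate_numeric = maximize_concave(log_likelihood, 1e-6, 100.0)
rate_closed_form = len(x) / sum(x)   # 1 / sample mean
print(rate_numeric, rate_closed_form)
```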

**Facts about the MLE**

* Invariant: If $\widehat{\theta}_{MLE}$ is the MLE for $\theta$, then $g(\widehat{\theta}_{MLE})$ is the MLE for $g(\theta)$.


* Consistent: An estimator $\widehat{\theta}$ is consistent for $\theta_0$ if, as the sample size increases, the estimator converges in probability to the true value $\theta_0$ of the parameter. Formally, we would write: For any $\epsilon > 0$, 
$\lim_{n \to \infty} P(|\widehat{\theta} - \theta_0 | > \epsilon) = 0$


* Asymptotically Normal: If $X_1, X_2,..., X_n$ are i.i.d. with density $f(x; \theta_0)$, then, under suitable regularity conditions, as $n$ increases,
$$\sqrt{n}(\widehat{\theta} - \theta_0) \xrightarrow{d} N(0, I^{-1}(\theta_0))$$
where $I(\theta_0)$ is the Fisher Information


* Practical result: for large $n$, $\widehat{\theta} \approx N(\theta_0, I^{-1}(\theta_0)/n)$


* Apply the delta method by replacing $Y_n$ with the MLE for $\theta$, $\mu$ with $\theta_0$, and $\sigma_n^2$ with $I^{-1}(\theta_0)/n$, and we can conclude that
$$g(\widehat{\theta}) \approx N\left(g(\theta_0), (g^\prime(\theta_0))^2 \frac{I^{-1}(\theta_0)}{n} \right)$$


* Multi dimensional delta method.  Suppose $Y_n$ is approximately $N_p(\mu, \Sigma/n)$.  Now consider an $m$-dimensional function $g()$ where $m \le p$:
$$ g(Y_n) = \left[g_1(Y_n)\: g_2(Y_n) \: ... \: g_m(Y_n)\right]^{T}$$
then $g(Y_n)$ is approximately $N_m(g(\mu), G \Sigma G^{T}/n)$ where the $(i, j)$ entry of $G$ is $\partial g_i(y)/\partial y_j.$ This assumes that $G$ is full rank



* The MLEs for the normal distribution parameters $\mu$ and $\sigma^2$ are $\overline{X}$ and $(\frac{n-1}{n})s^2$, respectively
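
The practical result suggests a Wald-style interval $\widehat{\theta} \pm z_{\alpha/2}\sqrt{I^{-1}(\widehat{\theta})/n}$. The sketch below checks its coverage by simulation under assumptions chosen for this example: an Exponential(rate) model, for which $I(\text{rate}) = 1/\text{rate}^2$, with the unknown rate replaced by its MLE in the standard error.

```python
import random
from statistics import NormalDist

rng = random.Random(9)
true_rate, n, reps = 2.0, 100, 2000
z = NormalDist().inv_cdf(0.975)   # upper 2.5% point of the standard normal

covered = 0
for _ in range(reps):
    total = sum(rng.expovariate(true_rate) for _ in range(n))
    rate_hat = n / total                       # MLE of the rate
    # For Exponential(rate), I(rate) = 1 / rate^2, so I^{-1}(rate)/n = rate^2 / n.
    # The unknown rate is replaced by its MLE (a plug-in assumption).
    se = rate_hat / n ** 0.5
    if rate_hat - z * se <= true_rate <= rate_hat + z * se:
        covered += 1

wald_coverage = covered / reps
print(wald_coverage)   # should be near the nominal 0.95
```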

**Four important confidence intervals**

1. Confidence interval for population mean, $\mu$, when $n$ is large ($n \ge 30$):  
Assume that the sample is i.i.d. from a distribution with mean $\mu$ and variance $\sigma^2$, with both $\mu$ and $\sigma^2$ unknown.  Then a $100(1 - \alpha$)% confidence interval for $\mu$ is:
$$\overline{X} \pm z_{\alpha/2} \left( \frac{s}{\sqrt{n}} \right)$$
where $z_{\alpha}$ is such that $P(Z>z_{\alpha}) = \alpha$ when $Z$ has the standard normal distribution.


2. Confidence interval for population mean, $\mu$, when $n$ is small ($n < 30$):  
Assume that the sample is i.i.d. from the normal distribution with mean $\mu$ and variance $\sigma^2$, with both $\mu$ and $\sigma^2$ unknown.  Then a $100(1 - \alpha$)% confidence interval for $\mu$ is:
$$\overline{X} \pm t_{\alpha/2,n-1} \left( \frac{s}{\sqrt{n}} \right)$$
where $t_{\alpha,\nu}$ is such that $P(T>t_{\alpha,\nu}) = \alpha$ when $T$ has the $t$-distribution with $\nu$ degrees of freedom.


3. Confidence interval for population variance, $\sigma^2$:  
Assume that the sample is i.i.d. from the normal distribution with mean $\mu$ and variance $\sigma^2$, with both $\mu$ and $\sigma^2$ unknown. Then a $100(1 - \alpha$)% confidence interval for $\sigma^2$ is:
$$\left(\frac{(n-1)s^2}{\chi_{\alpha/2,n-1}^2}, \frac{(n-1)s^2}{\chi_{(1-\alpha/2),n-1}^2}\right)$$
where $\chi_{\alpha,\nu}^2$ is such that $P(U>\chi_{\alpha,\nu}^2) = \alpha$ when $U$ has the chi-squared distribution with $\nu$ degrees of freedom.


4. Confidence interval for population proportion, $p$:  
Assume that $n$ is large enough that you are confident that $np \ge 5$ and $n(1-p) \ge 5$.  Let $\hat{p} = X/n$ where $X$ is binomial$(n, p)$, and $p$ is unknown.  Then a $100(1 - \alpha$)% confidence interval for $p$ is:
$$\widehat{p} \pm z_{\alpha/2} \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}$$
where $z_{\alpha}$ is such that $P(Z>z_{\alpha}) = \alpha$ when $Z$ has the standard normal distribution.
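
The two $z$-based intervals (cases 1 and 4) can be sketched with the Python standard library alone. The function names and the sample data below are hypothetical. Cases 2 and 3 need $t$ and chi-squared quantiles, which the standard library does not provide; one option would be `scipy.stats.t.ppf` and `scipy.stats.chi2.ppf`, not shown here.

```python
from statistics import NormalDist, fmean, stdev

def z_upper(alpha):
    """z_alpha with P(Z > z_alpha) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha)

def mean_ci_large_n(xs, alpha=0.05):
    """Case 1: xbar +/- z_{alpha/2} * s / sqrt(n)."""
    n = len(xs)
    half = z_upper(alpha / 2) * stdev(xs) / n ** 0.5
    center = fmean(xs)
    return center - half, center + half

def proportion_ci(successes, n, alpha=0.05):
    """Case 4: p_hat +/- z_{alpha/2} * sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = successes / n
    half = z_upper(alpha / 2) * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half, p_hat + half

data = [i % 7 for i in range(35)]           # made-up sample with n = 35
mean_lo, mean_hi = mean_ci_large_n(data)
prop_lo, prop_hi = proportion_ci(40, 100)   # hypothetical: 40 successes in 100 trials
print(mean_lo, mean_hi)
print(prop_lo, prop_hi)
```

Note that `z_upper` follows the upper-tail convention used in these notes, $P(Z > z_\alpha) = \alpha$, so it calls the inverse CDF at $1 - \alpha$.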