# Estimation

An estimator is a statistic that approximates some attribute of the population. How do we choose an appropriate estimator? Here, we will go over two methods for finding estimators.

# Method of Moments

The method of moments is one of the oldest methods of deriving point estimators. Although method of moments estimators are usually not the best available, they are almost always asymptotically unbiased. Furthermore, the method relies on the substitution principle, an important idea in statistics. In general, method of moments estimators are also consistent (${\hat \theta}_{MOM} \overset{p }{\rightarrow} \theta$). As a result, it is worth understanding how the method of moments works.

Suppose we want to estimate $k$ parameters, $\theta_{1}, ..., \theta_{k}$, of a probability distribution with density (or mass function) $f(x | \theta_{1},..., \theta_{k})$, and we observe i.i.d. samples $X_{1},..., X_{n}$ from this distribution. We first compute the first $k$ population moments and express the parameters in terms of these moments. The method of moments estimators are then found by substituting the sample moments for the population moments. Since we are solving a system of equations, we need as many equations as there are parameters to estimate.
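As a sketch of the general recipe (the notation $\mu_{j}$ and $\hat \mu_{j}$ for the $j$-th population and sample moments is introduced here for illustration and matches the notation used in the Gamma example), the estimators solve the system

$$\mu_{j}(\theta_{1},..., \theta_{k}) := E[X^{j}], \qquad \hat \mu_{j} := \frac{1}{n} \sum_{i=1}^{n} X_{i}^{j}, \qquad j = 1,..., k$$

$$\mu_{j}({\hat \theta}_{1},..., {\hat \theta}_{k}) = \hat \mu_{j}, \qquad j = 1,..., k$$

This assumes the first $k$ moments exist and that the system can be solved for the parameters; when it cannot, higher moments may be needed.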

### Example: Method of Moments for Poisson Distribution

Suppose we have $n$ observations, $X_{1},... ,X_{n}$, and each $X_{i} \sim Pois(\lambda)$. Find a method of moments (MOM) estimator for $\lambda$.

Let $X \sim Pois(\lambda)$, then $E[X] = \lambda$. Then, our method of moments estimator is ${\hat \lambda}_{MOM} = {\bar X}$. 


### Example: Method of Moments for Gamma Distribution

Suppose we have $n$ observations, $X_{1},... ,X_{n}$ and each $X_{i} \sim Gamma(r, \lambda)$. Find a MOM estimator for $r$ and $\lambda$.


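With $\lambda$ as the rate parameter, we know that $E[X] = \frac{r}{\lambda}$ and $Var(X) = \frac{r}{\lambda^{2}}$. Then $E[X^{2}] = \frac{r}{\lambda^{2} } + \frac{r^{2}}{\lambda^{2}}$. Writing the second moment in terms of the first and solving for $\lambda$ and $r$ gives: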

$$E[X^{2}] = Var(X) + (E[X])^{2} = \frac{1}{\lambda} E[X] + (E[X])^{2}$$

$$\lambda = \frac{E[X]}{E[X^{2}]  - (E[X])^{2} }$$

$$r = \frac{(E[X])^{2}}{E[X^{2}]  - (E[X])^{2} }$$

Then our method of moments estimators are: 

$${\hat \lambda}_{MOM} = \frac{{\hat \mu_{1}}}{ {\hat \mu_{2}} - {\hat \mu_{1}}^{2}}$$

$${\hat r}_{MOM} = \frac{\hat \mu_{1}^{2}}{ \hat \mu_{2}  - \hat \mu_{1}^{2} }$$

where $\hat \mu_{1} = \frac{1}{n} \sum_{i=1}^{n} X_{i} =  {\bar X}$ and $\hat \mu_{2} = \frac{1}{n}{ \sum_{i=1}^{n} X_{i}^{2}}$.
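The Gamma estimators above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `gamma_mom`, the seed, and the true parameter values are assumptions chosen for illustration. Note that `random.gammavariate` takes a shape and a scale, so the scale is set to $1/\lambda$ to match the rate parameterization used here.

```python
import random

def gamma_mom(xs):
    """Method of moments estimates (r_hat, lam_hat) for Gamma(r, lambda), lambda = rate."""
    n = len(xs)
    mu1 = sum(xs) / n                   # first sample moment (sample mean)
    mu2 = sum(x * x for x in xs) / n    # second sample moment
    denom = mu2 - mu1 ** 2              # sample analogue of Var(X)
    r_hat = mu1 ** 2 / denom
    lam_hat = mu1 / denom
    return r_hat, lam_hat

# Illustrative simulation; these true values are hypothetical.
rng = random.Random(0)
true_r, true_lam = 3.0, 2.0
sample = [rng.gammavariate(true_r, 1.0 / true_lam) for _ in range(20000)]

r_hat, lam_hat = gamma_mom(sample)
print(r_hat, lam_hat)
```

With a large sample, both estimates should land close to the true values, which is the consistency property in action.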

The weak law of large numbers implies that the sample moments converge (in probability) to the population moments. Additionally, if we can write our estimators as a continuous function of the sample moments, then by the continuous mapping theorem the estimators also converge in probability to the parameters of interest.

# Maximum Likelihood Estimation

Maximum likelihood estimation aims to maximize the likelihood function, denoted as $L = f_{X_{1},..., X_{n}}(x_{1},..., x_{n} | \theta )$, so that the observed data are as probable as possible under the model. For i.i.d. data the likelihood is a product, which is often difficult to maximize directly, so we maximize the log-likelihood instead. Since the log function is monotonically increasing on $x>0$, the logarithm of a positive function achieves its maximum at the same point as the function itself. Denote $\ell = \log(L)$, where $\log$ is the natural logarithm. When $\ell$ is differentiable, we can take its derivative with respect to $\theta$, set it to 0, and solve for $\theta$ to obtain a candidate for the maximum likelihood estimate.

Put simply, we find our maximum likelihood estimator as ${\hat \theta}_{ML} = \arg \max_{\theta \in \Theta} L = \arg \max_{\theta \in \Theta} \ell$


### Example: Maximum Likelihood Estimation for Poisson Distribution

Let $X_{1},..., X_{n}$ be i.i.d. $Pois(\lambda)$. Then, we have that our likelihood function is

$$L = f_{X_{1},..., X_{n}}(x_{1},..., x_{n} | \lambda ) = \prod_{i=1}^{n} \frac{\lambda^{x_{i}} \exp\{ -\lambda\}}{x_{i}!}$$


$$\ell = \log( \lambda) \sum_{i=1}^{n} x_{i} -\sum_{i=1}^{n} \lambda  - \sum_{i=1}^{n}\log( x_{i}!)$$

$$\frac{\partial  \ell }{ \partial \lambda} = \frac{\sum_{i=1}^{n} x_{i}}{\lambda} -  n \overset{set}{=}0$$

Solving for ${\lambda}$, we have that ${\hat \lambda}_{MLE} = {{\bar X}}$

Notice that our method of moments and maximum likelihood estimator agree in the case where we have a Poisson distribution.
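As a quick numerical check of the Poisson result, the sketch below evaluates the log-likelihood on a grid of candidate values and compares the maximizer with the sample mean. The data in `counts`, the grid, and the function name `poisson_loglik` are hypothetical choices for illustration; a grid search is used only because it is simple, not because it is the recommended way to maximize a likelihood.

```python
import math

def poisson_loglik(lam, xs):
    """Poisson log-likelihood: sum of x*log(lam) - lam - log(x!)."""
    # math.lgamma(x + 1) equals log(x!) for non-negative integers x
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

counts = [2, 0, 3, 1, 4, 2, 1, 3]       # hypothetical observed counts
lam_hat = sum(counts) / len(counts)     # closed-form MLE (and MOM): sample mean

# Candidate values 0.50, 0.51, ..., 4.49
grid = [0.5 + 0.01 * k for k in range(400)]
best = max(grid, key=lambda lam: poisson_loglik(lam, counts))

print(lam_hat, best)
```

The grid maximizer should agree with the sample mean up to the grid spacing.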

### Example: Maximum Likelihood Estimation for Exponential

Let $X_{1},..., X_{n}$ be i.i.d. $Exp(\lambda)$. Then, we have that our likelihood function is

$$L = f_{X_{1},..., X_{n}}(x_{1},..., x_{n} | \lambda ) = \prod_{i=1}^{n} \lambda \exp \left\{ -\lambda x_{i} \right\}=\lambda^{n} \exp \left\{ -\lambda \sum_{i=1}^{n} x_{i} \right\}$$


$$\ell = n \log \lambda - \lambda \sum_{i=1}^{n} x_{i}$$

$$\frac{\partial  \ell }{ \partial \lambda} = \frac{n}{\lambda} -  \sum_{i=1}^{n} x_{i}\overset{set}{=}0$$

Solving for ${\lambda}$, we have that ${\hat \lambda}_{MLE} = \frac{1}{{\bar X}}$.
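The exponential derivation can be sketched in code as follows. The function names `exp_mle` and `exp_score`, the seed, and the true rate are assumptions for illustration; `random.expovariate` takes the rate $\lambda$ directly.

```python
import random

def exp_mle(xs):
    """MLE of the exponential rate: n / sum(x_i) = 1 / sample mean."""
    return len(xs) / sum(xs)

def exp_score(lam, xs):
    """Derivative of the log-likelihood: n / lam - sum(x_i)."""
    return len(xs) / lam - sum(xs)

rng = random.Random(1)
true_lam = 2.5                                   # hypothetical true rate
sample = [rng.expovariate(true_lam) for _ in range(10000)]

lam_hat = exp_mle(sample)
print(lam_hat, exp_score(lam_hat, sample))       # score is ~0 at the MLE
```

The score is positive below the estimate and negative above it, which confirms that the stationary point is a maximum.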


## Properties of MLE

Under certain regularity conditions, the MLE possesses several properties that make it more favorable than the method of moments estimator. They are:

1. MLE is equivariant: Let ${\hat \theta}_{ML}$ be the MLE of $\theta$. Then, for a function $g$, $g({\hat \theta}_{ML})$ is the MLE of $g(\theta)$.

2. MLE is consistent: ${\hat \theta}_{ML} \overset{ p  }{ \rightarrow } \theta$

3. MLE is asymptotically normal: $\sqrt{n} \left( {\hat \theta}_{ML} - \theta \right) \overset{ d }{ \rightarrow } {\mathcal N} \left(0, \left( I(\theta) \right)^{-1} \right)$, so for large $n$, ${\hat \theta}_{ML}$ is approximately ${\mathcal N} \left(\theta,\frac{1}{n} \left( I(\theta) \right)^{-1} \right)$, where $I(\theta)$ is the Fisher information of a single observation. $I(\theta) := E_{\theta} \left[ \left( \frac{\partial}{ \partial \theta } \log(f(X|\theta)) \right)^{2} \right]$

4. MLE is asymptotically unbiased: $\lim_{n \rightarrow \infty} E[ {\hat \theta}_{ML} ] = \theta$
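Asymptotic normality can be illustrated by simulation. The sketch below reuses the exponential model, for which $I(\lambda) = 1/\lambda^{2}$, so the variance of $\sqrt{n}({\hat \lambda}_{MLE} - \lambda)$ should be close to $\lambda^{2}$ for large $n$. The true rate, sample size, number of replications, and seed are all hypothetical choices.

```python
import math
import random
import statistics

rng = random.Random(42)
true_lam = 2.0   # hypothetical true rate
n = 200          # sample size per replication
reps = 2000      # number of simulated datasets

mles = []
for _ in range(reps):
    xs = [rng.expovariate(true_lam) for _ in range(n)]
    mles.append(n / sum(xs))            # exponential MLE: 1 / sample mean

# For Exp(lambda), I(lambda) = 1 / lambda^2, so the limiting variance
# of sqrt(n) * (lam_hat - lambda) is lambda^2.
scaled = [math.sqrt(n) * (m - true_lam) for m in mles]
empirical_var = statistics.pvariance(scaled)
theoretical_var = true_lam ** 2
ratio = empirical_var / theoretical_var

print(statistics.mean(mles), ratio)
```

The average of the estimates should sit near the true rate and the variance ratio near 1; both carry some finite-sample and Monte Carlo error, so exact agreement is not expected.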