# Sufficient statistic
Any function of the observations, i.e., $f(x_1, x_2, \ldots, x_n)$, is technically a "statistic" (whether it is useful or not). The usual point of a statistic is to make inferences about the parameters of the distribution. Since a statistic compresses the information contained in several observations into one value, a natural question arises: once a statistic meant for inference about a distribution parameter has been calculated, can we throw away the observations? That is, does the statistic contain all the information relevant to inference about the parameter of interest? The answer matters not only because it would spare us from storing hundreds or thousands of observations (which, in this day and age, may not be a concern at all), but more importantly because it makes for convenient mathematics in many situations, as we will see later.

**Definition**
Let $ X_1, X_2, \ldots, X_n $ be a random sample from a probability distribution with unknown parameter $\theta$. Then, the statistic:

$$
Y = u(X_1, X_2, ... , X_n)
$$

is said to be sufficient for $\theta$ if the conditional distribution of $X_1, X_2, \ldots, X_n$, given the statistic $Y$, doesn't depend on the parameter $\theta$.

Note that this is not a theorem, but a definition. The intuition behind this choice of definition is as follows:

1. The probability distribution of $ \mathbf X = [X_1, X_2, \ldots, X_n]^T $ is parameterised by $\theta$
2. If, knowing the value of $Y$, we can state the probability of $\mathbf{X}$ without knowing anything about $\theta$, then whatever influence $\theta$ had on the probability of $\mathbf{X}$ is "contained" within $Y$. In other words, once $Y$ is known, the rest of the sample carries no further information about $\theta$.
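The definition can be checked by brute force on a small case. The sketch below assumes $n$ i.i.d. Bernoulli($\theta$) observations with $Y = \sum_i X_i$ (an example chosen here for illustration; the helper name `conditional_given_sum` is hypothetical) and enumerates the conditional distribution of the sample given $Y = y$ for two different values of $\theta$.

```python
from itertools import product
from math import comb, isclose

def conditional_given_sum(theta, n, y):
    """P(X = x | sum(X) = y) for n i.i.d. Bernoulli(theta) draws, by enumeration."""
    # Joint pmf of every binary vector x whose entries sum to y.
    joint = {
        x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
        if sum(x) == y
    }
    p_y = sum(joint.values())  # P(Y = y)
    return {x: p / p_y for x, p in joint.items()}

n, y = 4, 2
cond_a = conditional_given_sum(0.2, n, y)  # theta = 0.2
cond_b = conditional_given_sum(0.7, n, y)  # theta = 0.7

# If the sum is sufficient, both columns should agree for every x.
for x in cond_a:
    print(x, round(cond_a[x], 6), round(cond_b[x], 6))

same = all(isclose(cond_a[x], cond_b[x]) for x in cond_a)
print("independent of theta:", same, "| uniform value:", 1 / comb(n, y))
```

In this example every sample with the given sum is equally likely once $Y$ is known, whatever $\theta$ was, which is what the definition asks for.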

## Factorisation theorem
Let $ X_1, X_2, \ldots, X_n $ denote random variables with joint probability density function or joint probability mass function $ f(x_1, x_2, \ldots, x_n; \theta) $, which depends on the parameter $\theta$. Then, the statistic $ Y = u(X_1, X_2, \ldots, X_n) $ is sufficient for $\theta$ if and only if the p.d.f (or p.m.f.) can be factored into two components, that is:

$$
\begin{align}
f(x_1, x_2, ... , x_n;\theta) &= \phi[ u(x_1, x_2, ... , x_n);\theta ] h(x_1, x_2, ... , x_n) \\
         f(\mathbf{x};\theta) &= \phi[ y;\theta ] h(\mathbf{x})
\end{align}
$$

where $\phi$ is a function that depends on the data $ x_1, x_2, \ldots, x_n $ only through the function $ u(x_1, x_2, \ldots, x_n) $, and the function $h(x_1, x_2, \ldots, x_n)$ does not depend on the parameter $\theta$.
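As a worked illustration (an example picked here, not one from the references), take $n$ independent Poisson observations with mean $\theta$. The joint pmf splits into exactly the two pieces the theorem asks for:

$$
\begin{align}
f(x_1, x_2, \ldots, x_n;\theta) &= \prod_{i=1}^{n} \dfrac{e^{-\theta}\theta^{x_i}}{x_i!} \\
                                &= \left[ e^{-n\theta}\, \theta^{\sum_{i} x_i} \right] \left[ \dfrac{1}{\prod_{i} x_i!} \right]
\end{align}
$$

so that $u(\mathbf{x}) = \sum_i x_i$, $\phi[y;\theta] = e^{-n\theta}\theta^{y}$ and $h(\mathbf{x}) = 1/\prod_i x_i!$. Under these assumptions the theorem says that $Y = \sum_i X_i$ is sufficient for $\theta$.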

*Proof* (for the discrete case): Suppose we can factorise the pmf as $ P(\mathbf{X} = \mathbf{x};\theta) = \phi[u(\mathbf{x});\theta] h(\mathbf{x})$. Then we can find $P(Y = y)$ as shown below:

$$
\begin{align}
P(Y=y) &= \sum_{\mathbf{x}: u(\mathbf{x}) = y} P(\mathbf{X}=\mathbf{x}) \\
       &= \sum_{\mathbf{x}: u(\mathbf{x}) = y} \phi[u(\mathbf{x});\theta] h(\mathbf{x}) \\
       &= \phi [y;\theta] \sum_{\mathbf{x}: u(\mathbf{x}) = y} h(\mathbf{x})
\end{align}
$$

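Using this, the conditional probability of $\mathbf{X} = \mathbf{x}$ given $Y=y$, for any $\mathbf{x}$ with $u(\mathbf{x}) = y$ (so that the event $\{\mathbf{X} = \mathbf{x}, Y = y\}$ is just $\{\mathbf{X} = \mathbf{x}\}$), is: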

$$
\begin{align}
P(\mathbf{X} = \mathbf{x} | Y = y) &= \dfrac{P(\mathbf{X} = \mathbf{x} , Y = y)}{P(Y = y)} \\
                          &= \dfrac{\phi[y;\theta] h(\mathbf{x})}{\phi [y;\theta] \sum_{\mathbf{x}': u(\mathbf{x}') = y} h(\mathbf{x}')} \\
                          &= \dfrac{h(\mathbf{x})}{\sum_{\mathbf{x}': u(\mathbf{x}') = y} h(\mathbf{x}')}
\end{align}
$$

Since this conditional probability doesn't depend on $\theta$, $Y = u(\mathbf{X})$ is a sufficient statistic.

Conversely, if the conditional distribution of $\mathbf{X}$ given $Y=y$ is independent of $\theta$, then, writing $y = u(\mathbf{x})$ and using the law of total probability,

$$
P(\mathbf{X} = \mathbf{x} | \theta) = P(\mathbf{X} = \mathbf{x} | u(\mathbf{X}) = y, \theta) P(u(\mathbf{X}) = y|\theta) + P(\mathbf{X} = \mathbf{x} | u(\mathbf{X}) \neq y, \theta) P(u(\mathbf{X}) \neq y|\theta) 
$$

The second term on the RHS is zero: if $u(\mathbf{X}) \neq y = u(\mathbf{x})$, then $\mathbf{X}$ cannot equal $\mathbf{x}$. So we have,

$$
P(\mathbf{X} = \mathbf{x} | \theta) = P(\mathbf{X} = \mathbf{x} | u(\mathbf{X}) = y, \theta) P(u(\mathbf{X}) = y | \theta)
$$

Since we started with the assumption that $Y$ is a sufficient statistic, given $Y=y$, the conditional probability of $\mathbf{X}$ doesn't depend on $\theta$: $ P(\mathbf{X} = \mathbf{x} | u(\mathbf{X}) = y, \theta) = P(\mathbf{X} = \mathbf{x} | u(\mathbf{X}) = y) $. Substituting this in the previous equation, we have:

$$
P(\mathbf{X} = \mathbf{x} | \theta) = P(\mathbf{X} = \mathbf{x} | Y = y) P(Y = y | \theta)
$$

which is in the desired form $ P(\mathbf{x};\theta) = h(\mathbf{x}) \phi[y;\theta] $, with $h(\mathbf{x}) = P(\mathbf{X} = \mathbf{x} | Y = y)$ and $\phi[y;\theta] = P(Y = y | \theta)$.
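The key step of the proof, that the conditional pmf reduces to $h(\mathbf{x})$ divided by the sum of $h$ over the level set $\{\mathbf{x}: u(\mathbf{x}) = y\}$, can be checked numerically. The sketch below assumes i.i.d. Poisson observations with $u(\mathbf{x}) = \sum_i x_i$ and $h(\mathbf{x}) = 1/\prod_i x_i!$; the values of `n`, `y` and `theta` are arbitrary choices for illustration.

```python
from itertools import product
from math import exp, factorial, prod

def poisson_pmf(k, theta):
    """pmf of a single Poisson(theta) observation."""
    return exp(-theta) * theta ** k / factorial(k)

def h(x):
    """The theta-free factor h(x) = 1 / prod(x_i!) of the Poisson joint pmf."""
    return 1.0 / prod(factorial(k) for k in x)

n, y, theta = 3, 4, 1.7

# All samples x with u(x) = sum(x) = y.
level_set = [x for x in product(range(y + 1), repeat=n) if sum(x) == y]

# Conditional pmf computed directly from the joint pmf (uses theta).
joint = {x: prod(poisson_pmf(k, theta) for k in x) for x in level_set}
p_y = sum(joint.values())
direct = {x: joint[x] / p_y for x in level_set}

# Conditional pmf predicted by the proof (never touches theta).
h_total = sum(h(x) for x in level_set)
from_h = {x: h(x) / h_total for x in level_set}

max_gap = max(abs(direct[x] - from_h[x]) for x in level_set)
print("samples in level set:", len(level_set))
print("largest discrepancy:", max_gap)
```

Changing `theta` should leave `from_h` untouched and keep the discrepancy at floating-point noise level.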

# Exponential family
## One parameter exponential family
Distributions in the exponential family have the following form:

$$
f_X(x|\theta) = exp[\ \eta(\theta)T(x) - A(\theta)\ ]\ h(x)
$$

where $\eta(\theta)$, $T(x)$, $A(\theta)$ and $h(x)$ are real-valued functions, and $h(x) \geq 0$. The term $exp(-A(\theta))$ can be thought of as normalising the rest of the RHS above so that the area under the *pdf* is 1; it is therefore entirely determined by the other functions.

If $X$ is a discrete RV, the *pmf* has the same form. Sometimes an alternative notation is used: $ f_X(x|\theta) = exp[\ \eta(\theta)T(x) - A(\theta) + B(x)\ ] $, with $B(x) = \log h(x)$. But we will use the first one in these notes.

Bernoulli, Binomial, Poisson, Exponential, Normal, Gamma and Chi-squared distributions are all examples of the exponential family, and their *pdf*s/*pmf*s can be written in the form stated above. All members of the exponential family share some common properties that make the mathematics related to them easier.
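Two of these can be put into the stated form with a line of algebra each. The identifications of $\eta$, $T$, $A$ and $h$ below are one standard choice; they are not unique, since constants can be moved between $A$ and $h$.

$$
\begin{align}
\text{Bernoulli: } f_X(x|\theta) &= \theta^x (1-\theta)^{1-x} = exp\left[\ x \log\dfrac{\theta}{1-\theta} + \log(1-\theta)\ \right] \\
\text{Poisson: } f_X(x|\theta) &= \dfrac{e^{-\theta}\theta^x}{x!} = exp[\ x \log\theta - \theta\ ]\ \dfrac{1}{x!}
\end{align}
$$

For the Bernoulli this gives $\eta(\theta) = \log\frac{\theta}{1-\theta}$, $T(x) = x$, $A(\theta) = -\log(1-\theta)$ and $h(x) = 1$. For the Poisson it gives $\eta(\theta) = \log\theta$, $T(x) = x$, $A(\theta) = \theta$ and $h(x) = 1/x!$.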

A one-parameter exponential family distribution in which $\eta$ is a one-to-one function of $\theta$ can be reparameterised in terms of $\eta$ itself; the distribution is then said to be in the **canonical exponential family** form. In this case, $A(\theta)$ can be written directly in terms of $\eta$, as below:

$$
f(x|\eta) = exp[\ \eta T(x) - A(\eta)\ ]\ h(x)
$$

For canonical exponential family distributions, $\eta$ is called the **natural parameter**. Again, $A(\eta)$'s job is to normalise the distribution, i.e., 

$$
\begin{align}
\mathcal{e}^{A(\eta)} &= \int_{R^d} \mathcal{e}^{\eta T(x)} h(x)\, dx < \infty \text{ (for continuous R.V.), or,} \\
\mathcal{e}^{A(\eta)} &= \sum_{x\in\mathcal{X}} \mathcal{e}^{\eta T(x)} h(x) < \infty \text{ (for discrete R.V.)}
\end{align}
$$

where $R^d$ is the range over which $X$ is defined. 
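The normalising role of $A(\eta)$ can be checked numerically in the discrete case. The sketch below assumes the Poisson family written in canonical form, with $T(x) = x$, $h(x) = 1/x!$ and $A(\eta) = \mathcal{e}^{\eta}$; the helper name `log_partition_numeric` and the truncation of the infinite sum at 100 terms are choices made for this illustration.

```python
from math import exp, factorial, log

def log_partition_numeric(eta, terms=100):
    """Approximate A(eta) = log sum_x exp(eta * T(x)) h(x) for T(x) = x, h(x) = 1/x!."""
    total = sum(exp(eta * x) / factorial(x) for x in range(terms))
    return log(total)

# Compare the truncated sum with the closed form A(eta) = exp(eta).
etas = (-1.0, 0.0, 0.5, 1.5)
rows = [(eta, log_partition_numeric(eta), exp(eta)) for eta in etas]
for eta, numeric, closed_form in rows:
    print(f"eta={eta:+.1f}  numeric={numeric:.10f}  closed form={closed_form:.10f}")
```

The two columns should agree closely for moderate $\eta$; for large $\eta$ the truncation point would need to grow.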

## Properties of canonical exponential family 
### Convexity
It can be shown that the space $\mathcal{N}$ on which $\eta$ is defined (the natural parameter space) is convex, and that $A(\eta)$ is a convex function over that space [(more explanation)](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf), i.e.,

$$
A(\alpha\eta_1 + (1-\alpha)\eta_2) \leq  \alpha A(\eta_1) + (1-\alpha) A(\eta_2),\ \forall \eta_1, \eta_2 \in \mathcal{N},\ \alpha \in [0, 1]
$$
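A quick numerical spot check of this inequality, as a sketch rather than a proof: it assumes the Bernoulli family in canonical form, for which $A(\eta) = \log(1 + \mathcal{e}^{\eta})$ and $\mathcal{N}$ is the whole real line. The sampling range, seed and tolerance are arbitrary choices.

```python
import random
from math import exp, log

def A(eta):
    """Log-partition function of the Bernoulli family in canonical form."""
    return log(1.0 + exp(eta))

rng = random.Random(0)  # fixed seed so the run is reproducible
violations = 0
for _ in range(10_000):
    eta1, eta2 = rng.uniform(-10, 10), rng.uniform(-10, 10)
    alpha = rng.random()  # alpha in [0, 1)
    lhs = A(alpha * eta1 + (1 - alpha) * eta2)
    rhs = alpha * A(eta1) + (1 - alpha) * A(eta2)
    if lhs > rhs + 1e-12:  # small slack for floating-point rounding
        violations += 1

print("violations:", violations)
```

Since $A$ is convex here, the count is expected to be zero; a random search like this can only fail to find counterexamples, it cannot establish convexity.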

### Moments and MGF
The function $A(\eta)$ is infinitely differentiable at every $\eta$ in the interior of $\mathcal{N}$. Furthermore, in the continuous case, $\mathcal{e}^{A(\eta)} = \int_{R^d} \mathcal{e}^{\eta T(x)} h(x)\, dx$ can be differentiated any number of times inside the integral, and in the discrete case, $ \mathcal{e}^{A(\eta)} = \sum_{x\in\mathcal{X}} \mathcal{e}^{\eta T(x)} h(x) $ can be differentiated any number of times inside the summation. 
+ In the continuous case, for any $k \geq 1$,
$$
\dfrac{d^k}{d\eta^k} \mathcal{e}^{A(\eta)} = \int_{R^d} [T(x)]^k \mathcal{e}^{\eta T(x)} h(x)\, dx
$$
+ In the discrete case, for any $k \geq 1$,
$$
\dfrac{d^k}{d\eta^k} \mathcal{e}^{A(\eta)} = \sum_{x\in\mathcal{X}} [T(x)]^k \mathcal{e}^{\eta T(x)} h(x)
$$

Using this, we have the following results:
+ $ E_\eta[T(X)] = A'(\eta) $
+ $ Var_\eta[T(X)] = A''(\eta) $
+ At any $t$ such that $\eta+t \in \mathcal{N}$, the *mgf* of $T(X)$ exists and is: $ M_\eta(t) = \mathcal{e}^{A(\eta+t) - A(\eta)} $
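These three results can be sanity-checked numerically. The sketch below assumes the Poisson family in canonical form ($T(x) = x$, $h(x) = 1/x!$, $A(\eta) = \mathcal{e}^{\eta}$), computes the mean, variance and *mgf* of $T(X)$ by direct summation over a truncated support, and compares them with finite-difference derivatives of $A$. The values of `eta`, `t`, `step` and the truncation point are arbitrary choices for illustration.

```python
from math import exp, factorial

def A(eta):
    """Log-partition function of the Poisson family in canonical form."""
    return exp(eta)

def pmf(x, eta):
    """Canonical-form pmf exp[eta * T(x) - A(eta)] h(x) with T(x) = x, h(x) = 1/x!."""
    return exp(eta * x - A(eta)) / factorial(x)

eta, t, step = 0.4, 0.3, 1e-4
support = range(120)  # truncated support; the tail mass is negligible here

# Moments and mgf of T(X) = X by direct summation.
mean = sum(x * pmf(x, eta) for x in support)
var = sum((x - mean) ** 2 * pmf(x, eta) for x in support)
mgf = sum(exp(t * x) * pmf(x, eta) for x in support)

# Central finite differences for A'(eta) and A''(eta).
A1 = (A(eta + step) - A(eta - step)) / (2 * step)
A2 = (A(eta + step) - 2 * A(eta) + A(eta - step)) / step ** 2

print("E[T]   vs A'(eta):          ", mean, A1)
print("Var[T] vs A''(eta):         ", var, A2)
print("mgf(t) vs e^{A(eta+t)-A(eta)}:", mgf, exp(A(eta + t) - A(eta)))
```

Each pair should agree up to finite-difference and truncation error.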

A corollary of these properties is that:
+ $ E_\eta[T(X)] $ is strictly increasing in $\eta$, since its derivative is $A''(\eta) = Var_\eta[T(X)]$, which is positive whenever $T(X)$ is not degenerate. [Apparently](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) this means that the canonical exponential family can be reparameterised by using $E_\eta[T(X)]$ instead of $\eta$. I don't know what the use of it is, but it sounds cool.

## Multi-dimensional one-parameter exponential family
So far, we have assumed $X$ to be a scalar RV. But the exponential family can be extended to vectors just as well. One example of a multi-dimensional exponential family is when we make $n$ observations of an RV from an exponential family distribution and then look at their $n$-dimensional joint *pdf*. It would be:

$$
f_\mathbf{X}(\mathbf{x}|\theta) = exp[\ \eta(\theta)T(\mathbf{x}) - A(\theta)\ ]\ h(\mathbf{x})
$$

Note that, although $\mathbf{X}$ is a vector now, the functions $T(\mathbf{x})$ and $h(\mathbf{x})$ are still scalar-valued. All the theory about the one-parameter scalar RV family that we discussed so far also applies to the multi-dimensional case.

In the case of a multi-dimensional RV resulting from multiple observations of a single-dimensional exponential family RV, the function $T(\mathbf{X})$ is a sufficient statistic for $\theta$ (by the factorisation theorem), and is called the **natural sufficient statistic** of that exponential family.
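To see where $T(\mathbf{x})$ comes from in that case, assume the $n$ observations are independent and identically distributed, each with the one-parameter form $exp[\ \eta(\theta)T(x_i) - A(\theta)\ ]\ h(x_i)$. The joint *pdf* is then the product of the individual ones:

$$
\begin{align}
f_\mathbf{X}(\mathbf{x}|\theta) &= \prod_{i=1}^{n} exp[\ \eta(\theta)T(x_i) - A(\theta)\ ]\ h(x_i) \\
                                &= exp\left[\ \eta(\theta)\sum_{i=1}^{n} T(x_i) - nA(\theta)\ \right]\ \prod_{i=1}^{n} h(x_i)
\end{align}
$$

Under this assumption the joint *pdf* is again of exponential family form, with $T(\mathbf{x}) = \sum_{i} T(x_i)$, normalising term $nA(\theta)$ and $h(\mathbf{x}) = \prod_{i} h(x_i)$. The first factor depends on the data only through $\sum_i T(x_i)$ and the second is free of $\theta$, which is the factorisation needed for sufficiency.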

The following properties are true for an exponential family of any dimension (we use multidimensional notation). For an RV $\mathbf{X}$ whose distribution is in the exponential family:
+ $T = T(\mathbf{X})$ also has a distribution in the exponential family
+ Any RV $\mathbf{Y} = A\mathbf{X} + c$ also has a distribution in the exponential family
+ If $\mathcal{I}_0$ is a proper subset of the index set $\mathcal{I} = \{1, 2, \ldots, d\}$, then the joint conditional distribution of $X_i, i \in \mathcal{I}_0$, given $X_j, j \in \mathcal{I} - \mathcal{I}_0$, is also in the exponential family

## Multi-parameter exponential family
We have so far discussed single and multi-dimensional RVs, but only in the context of one-parameter exponential families; for example, an $n$-dimensional Gaussian RV with fixed variance (so the mean is the only parameter). But we can have multi-parameter exponential families as well, and they have the same characteristics as the one-parameter ones. We state here the most general form of exponential family distributions, where the parameter as well as the RV can be of any dimension (including 1-d). The *pdf* looks like:

$$
f_\mathbf{X}(\mathbf{x}; \boldsymbol{\theta}) = exp[\ \boldsymbol{\eta}^T(\boldsymbol{\theta}) \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\theta})\ ]\ h(\mathbf{x})
$$

Note that, because $\boldsymbol{\theta}$ is a vector, $\boldsymbol{\eta}$ is a vector function, and consequently, $\mathbf{T}$ is also a vector function. However, $A$ and $h$ are still scalar functions, although their arguments $\boldsymbol{\theta}$ and $\mathbf{x}$ are both vectors.

Canonical form (when $\boldsymbol{\eta}$ has a one-to-one relationship with $\boldsymbol{\theta}$) is as shown below:

$$
f_\mathbf{X}(\mathbf{x};\boldsymbol{\eta}) = exp[\ \boldsymbol{\eta}^T \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\eta})\ ]\ h(\mathbf{x})
$$
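As a worked illustration of the multi-parameter form (an example chosen here, using one standard split of the constants between $A$ and $h$), take a scalar Normal RV with both mean $\mu$ and variance $\sigma^2$ unknown, so that $\boldsymbol{\theta} = [\mu, \sigma^2]^T$:

$$
\begin{align}
f_X(x; \boldsymbol{\theta}) &= \dfrac{1}{\sqrt{2\pi\sigma^2}}\ exp\left[ -\dfrac{(x-\mu)^2}{2\sigma^2} \right] \\
                            &= exp\left[\ \dfrac{\mu}{\sigma^2} x - \dfrac{1}{2\sigma^2} x^2 - \left( \dfrac{\mu^2}{2\sigma^2} + \log\sigma \right)\ \right]\ \dfrac{1}{\sqrt{2\pi}}
\end{align}
$$

Reading off the terms gives $\boldsymbol{\eta}(\boldsymbol{\theta}) = [\mu/\sigma^2,\ -1/(2\sigma^2)]^T$, $\mathbf{T}(x) = [x,\ x^2]^T$, $A(\boldsymbol{\theta}) = \mu^2/(2\sigma^2) + \log\sigma$ and $h(x) = 1/\sqrt{2\pi}$. In canonical form, substituting $\mu = -\eta_1/(2\eta_2)$ and $\sigma^2 = -1/(2\eta_2)$ gives $A(\boldsymbol{\eta}) = -\eta_1^2/(4\eta_2) - \frac{1}{2}\log(-2\eta_2)$, defined for $\eta_2 < 0$.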

## References
1. [Exponential family - Purdue](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) 
2. [Sufficiency - Missouri](https://people.missouristate.edu/songfengzheng/Teaching/MTH541/Lecture%20notes/Sufficient.pdf)
3. [Exponential family - Berkeley](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)
