# Sufficient statistic
Any function of the observations, i.e., $f(x_1, x_2, \ldots, x_n)$, is technically a "statistic" (whether it is useful or not). The point of a statistic is usually to make inferences about the distribution parameters. Since a statistic compresses the information contained in several observations into one value, a natural question is: once a statistic meant for making inferences about a distribution parameter is calculated, can we throw away the observations? In other words, does the statistic contain all the relevant information needed to make an inference about the parameter of interest? The reason we try to answer that question is not only that we would no longer need to store hundreds or thousands of observations (hard disk space may not be a concern at all in this day and age), but, more importantly, that it makes for convenient mathematics in many situations, as we will see later.

**Definition**
Let $ X_1, X_2, \ldots, X_n $ be a random sample from a probability distribution with unknown parameter $\theta$. Then, the statistic:

$$
Y = u(X_1, X_2, ... , X_n)
$$

is said to be sufficient for $\theta$ if the conditional distribution of $X_1, X_2, \ldots, X_n$, given the statistic $Y$, doesn't depend on the parameter $\theta$.

Note that this is not a theorem, but a definition. The intuition behind this choice of definition is as follows:

1. The probability distribution of $ \mathbf X = [X_1, X_2, \ldots, X_n]^T $ is parameterised by $\theta$
2. If, knowing the value of $Y$, we could tell the probability of $\mathbf{X}$ without knowing anything about $\theta$, then the impact that $\theta$ had on the probability of $\mathbf{X}$ is "contained" within $Y$. Conversely, once $Y$ is known, the observations themselves tell us nothing more about $\theta$.
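The definition can be checked by brute force on a small case. The sketch below assumes $n = 4$ i.i.d. Bernoulli($p$) observations and takes $Y = \sum_i X_i$ as the candidate statistic; the function name `conditional_given_sum` and the two values of $p$ are illustrative choices, not part of the definition.

```python
from itertools import product
from math import comb, isclose

def conditional_given_sum(n, p):
    """Return {x: P(X = x | sum(X) = sum(x))} for n i.i.d. Bernoulli(p) draws."""
    joint = {}
    marginal = {}
    for x in product([0, 1], repeat=n):
        y = sum(x)
        prob = p**y * (1 - p)**(n - y)   # joint pmf of the sequence x
        joint[x] = prob
        marginal[y] = marginal.get(y, 0.0) + prob   # accumulates P(Y = y)
    return {x: joint[x] / marginal[sum(x)] for x in joint}

n = 4
cond_a = conditional_given_sum(n, 0.2)
cond_b = conditional_given_sum(n, 0.7)

# The conditional distribution is the same for both values of p:
# every sequence with y ones has conditional probability 1 / C(n, y).
same_for_both_p = all(isclose(cond_a[x], cond_b[x]) for x in cond_a)
uniform_within_y = all(isclose(cond_a[x], 1 / comb(n, sum(x))) for x in cond_a)
print(same_for_both_p, uniform_within_y)
```

Both checks print `True`, which is what the definition asks for: given the sum, the sequence carries no trace of $p$.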

## Factorisation theorem
Let $ X_1, X_2, \ldots, X_n $ denote random variables with joint probability density function or joint probability mass function $ f(x_1, x_2, \ldots, x_n; \theta) $, which depends on the parameter $\theta$. Then, the statistic $ Y = u(X_1, X_2, \ldots, X_n) $ is sufficient for $\theta$ if and only if the p.d.f. (or p.m.f.) can be factored into two components, that is:

$$
\begin{align}
f(x_1, x_2, ... , x_n;\theta) &= \phi[ u(x_1, x_2, ... , x_n);\theta ] h(x_1, x_2, ... , x_n) \\
         f(\mathbf{x};\theta) &= \phi[ y;\theta ] h(\mathbf{x})
\end{align}
$$

where $\phi$ is a function that depends on the data $ x_1, x_2, \ldots, x_n $ only through the function $ u(x_1, x_2, \ldots, x_n) $, and the function $h(x_1, x_2, \ldots, x_n)$ does not depend on the parameter $\theta$.
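As a worked illustration (an example chosen here, not taken from the references), consider $n$ i.i.d. Bernoulli($p$) observations, so that $\theta = p$. The joint p.m.f. factors as:

$$
\begin{align}
f(x_1, x_2, \ldots, x_n; p) &= \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} \\
                            &= \underbrace{p^{\sum_i x_i} (1-p)^{n - \sum_i x_i}}_{\phi[u(\mathbf{x});\,p]} \cdot \underbrace{1}_{h(\mathbf{x})}
\end{align}
$$

Here $u(\mathbf{x}) = \sum_i x_i$ and $h(\mathbf{x}) = 1$, so by the theorem $Y = \sum_i X_i$ is sufficient for $p$.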

*Proof* (for the discrete case): Suppose we can factorise the pmf as $ P(\mathbf{X} = \mathbf{x};\theta) = \phi[u(\mathbf{x});\theta] h(\mathbf{x})$. Then we can find $P(Y = y)$ as shown below:

$$
\begin{align}
P(Y=y) &= \sum_{\mathbf{x}: u(\mathbf{x}) = y} P(\mathbf{X}=\mathbf{x}) \\
       &= \sum_{\mathbf{x}: u(\mathbf{x}) = y} \phi[u(\mathbf{x});\theta] h(\mathbf{x}) \\
       &= \phi [y;\theta] \sum_{\mathbf{x}: u(\mathbf{x}) = y} h(\mathbf{x})
\end{align}
$$

Now, using this, we find that the conditional probability of $\mathbf{x}$ conditioned on $Y=y$ is as given below:

$$
\begin{align}
P(\mathbf{X} = \mathbf{x} | Y = y) &= \dfrac{P(\mathbf{X} = \mathbf{x} , Y = y)}{P(Y = y)} \\
                          &= \dfrac{\phi[y;\theta] h(\mathbf{x})}{\phi [y;\theta] \sum_{\mathbf{x}: u(\mathbf{x}) = y} h(\mathbf{x})} \\
                          &= \dfrac{h(\mathbf{x})}{\sum_{\mathbf{x}: u(\mathbf{x}) = y} h(\mathbf{x})}
\end{align}
$$

Since this conditional probability doesn't depend on $\theta$, $Y = u(\mathbf{X})$ is a sufficient statistic.

Conversely, if the conditional distribution of $\mathbf{X}$ given $Y=y$ is independent of $\theta$, then, by the law of total probability,

$$
P(\mathbf{X} = \mathbf{x} | \theta) = P(\mathbf{X} = \mathbf{x} | u(\mathbf{x}) = y, \theta) P(u(\mathbf{x}) = y|\theta) + P(\mathbf{X} = \mathbf{x} | u(\mathbf{x}) \neq y, \theta) P(u(\mathbf{x}) \neq y|\theta) 
$$

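The second term on the RHS is zero, because here $y$ denotes $u(\mathbf{x})$ and the event $\mathbf{X} = \mathbf{x}$ cannot occur when $u(\mathbf{X}) \neq y$. So we have,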

$$
P(\mathbf{X} = \mathbf{x} | \theta) = P(\mathbf{X} = \mathbf{x} | u(\mathbf{x}) = y, \theta) P(u(\mathbf{x}) = y | \theta)
$$

Since we started with the assumption that $Y$ is a sufficient statistic, given $Y=y$, the conditional probability of $\mathbf{X}$ doesn't depend on $\theta$: $ P(\mathbf{X} = \mathbf{x} | u(\mathbf{x}) = y, \theta) = P(\mathbf{X} = \mathbf{x} | u(\mathbf{x}) = y) $. Substituting this in the previous equation, we have:

$$
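P(\mathbf{X} = \mathbf{x} | \theta) = P(\mathbf{X} = \mathbf{x} | Y = y) P(Y = y | \theta)
$$

which is in the desired form, $ P(\mathbf{x};\theta) = h(\mathbf{x}) \phi[y;\theta] $, with $ h(\mathbf{x}) = P(\mathbf{X} = \mathbf{x} | Y = y) $ and $ \phi[y;\theta] = P(Y = y | \theta) $.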

# Exponential family
## One parameter exponential family
Distributions in the exponential family have the following form:

$$
f_X(x|\theta) = exp[\ \eta(\theta)T(x) - A(\theta)\ ]\ h(x)
$$

where, $T(x)$, $A(\theta)$ and $h(x)$ are real valued functions, and $h(x) \geq 0$. The term $exp(-A(\theta))$ can be thought of as normalising the rest of the RHS above so that the area under the *pdf* will be 1. It is entirely determined by other functions.

If it is a discrete RV, then the *pmf* looks the same. Sometimes an alternative notation is used: $ f_X(x|\theta) = exp[\ \eta(\theta)T(x) - A(\theta) + B(x)\ ] $. But we will use the first one in these notes.

Bernoulli, Binomial, Poisson, Exponential, Normal, Gamma and Chi-squared are all examples of the exponential family, and their *pdf*s/*pmf*s can be written in the form stated above. All members of the exponential family share some common properties that make the mathematics related to them easier.
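To make the form concrete, here is a minimal sketch that writes the Bernoulli and Poisson pmfs through one generic helper and compares them with their textbook formulas. The helper name `exp_family` and the parameter values are assumptions made for illustration; the choices of $\eta$, $T$, $A$ and $h$ follow from rearranging each pmf, as noted in the comments.

```python
from math import exp, log, factorial, isclose

def exp_family(x, eta, T, A, h):
    """Evaluate exp(eta * T(x) - A) * h(x), with A the log-normaliser value."""
    return exp(eta * T(x) - A) * h(x)

def bernoulli_pmf(x, p):
    # p^x (1-p)^(1-x) = exp(x * log(p / (1-p)) + log(1-p))
    return exp_family(x, eta=log(p / (1 - p)), T=lambda v: v,
                      A=-log(1 - p), h=lambda v: 1.0)

def poisson_pmf(x, lam):
    # lam^x e^(-lam) / x! = exp(x * log(lam) - lam) / x!
    return exp_family(x, eta=log(lam), T=lambda v: v,
                      A=lam, h=lambda v: 1.0 / factorial(v))

p, lam = 0.3, 1.5
bernoulli_ok = all(isclose(bernoulli_pmf(x, p), p**x * (1 - p)**(1 - x))
                   for x in (0, 1))
poisson_ok = all(isclose(poisson_pmf(x, lam), lam**x * exp(-lam) / factorial(x))
                 for x in range(10))
print(bernoulli_ok, poisson_ok)
```

In both cases $T(x) = x$; only $\eta$, $A$ and $h$ differ between the two families.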

When $\eta$ is a one-to-one function of $\theta$, a one-parameter exponential family can be reparameterised in terms of $\eta$ itself, and the distribution is then said to be in **canonical exponential family** form. In this case, $A(\theta)$ can be written directly in terms of $\eta$ as below:

$$
f(x|\eta) = exp[\ \eta T(x) - A(\eta)\ ]\ h(x)
$$

For canonical exponential family distributions, $\eta$ is called the **natural parameter**. Again, $A(\eta)$'s job is to normalise the distribution, i.e., 

$$
\begin{align}
\mathcal{e}^{A(\eta)} &= \int_{R^d} \mathcal{e}^{\eta T(x)} h(x)\, dx < \infty \text{ (for continuous R.V.), or,} \\
\mathcal{e}^{A(\eta)} &= \sum_{x\in\mathcal{X}} \mathcal{e}^{\eta T(x)} h(x) < \infty \text{ (for discrete R.V.)}
\end{align}
$$

where $R^d$ is the range over which $X$ is defined (and $\mathcal{X}$ is the set of values $X$ can take in the discrete case).

## Properties of canonical exponential family 
### Convexity
It can be shown that the space $\mathcal{N}$ on which $\eta$ is defined is convex, and that $A(\eta)$ is a convex function over that space [(more explanation)](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf), i.e., for any $\alpha \in [0, 1]$,

$$
A(\alpha\eta_1 + (1-\alpha)\eta_2) \leq  \alpha A(\eta_1) + (1-\alpha) A(\eta_2),\ \forall \eta_1, \eta_2 \in \mathcal{N} 
$$

### Moments and MGF
The function $A(\eta)$ is infinitely differentiable at every $\eta$ in the interior of $\mathcal{N}$. Furthermore, in the continuous case, $\mathcal{e}^{A(\eta)} = \int_{R^d} \mathcal{e}^{\eta T(x)} h(x)\, dx$ can be differentiated any number of times inside the integral, and in the discrete case, $ \mathcal{e}^{A(\eta)} = \sum_{x\in\mathcal{X}} \mathcal{e}^{\eta T(x)} h(x) $ can be differentiated any number of times inside the summation.
+ In the continuous case, for any $k \geq 1$,
$$
\dfrac{d^k}{d\eta^k} \mathcal{e}^{A(\eta)} = \int_{R^d} [T(x)]^k \mathcal{e}^{\eta T(x)} h(x)\, dx
$$
+ In the discrete case, for any $k \geq 1$,
$$
\dfrac{d^k}{d\eta^k} \mathcal{e}^{A(\eta)} = \sum_{x\in\mathcal{X}} [T(x)]^k \mathcal{e}^{\eta T(x)} h(x)
$$

Using this, we have the following results:
+ $ E_\eta[T(X)] = A'(\eta) $
+ $ Var_\eta[T(X)] = A''(\eta) $
+ At any $t$ such that $\eta+t \in \mathcal{N}$, the *mgf* of $T(X)$ exists and is: $ M_\eta(t) = \mathcal{e}^{A(\eta+t) - A(\eta)} $

Another property, a corollary of these, is that:
+ $ E_\eta[T(X)] $ is strictly increasing in $\eta$ (its derivative is $A''(\eta) = Var_\eta[T(X)]$, which is positive unless $T(X)$ is constant). [Apparently](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) this means that the canonical exponential family can be reparameterised by using $E_\eta[T(X)]$ instead of $\eta$. I don't know what the use of it is, but it sounds cool.
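These identities are easy to check numerically. The sketch below uses the Poisson family in canonical form as an assumed example, where $\eta = \log\lambda$, $T(x) = x$, $h(x) = 1/x!$ and $A(\eta) = \mathcal{e}^{\eta}$. The derivatives of $A$ are approximated by finite differences, and the infinite sums are truncated at 60 terms, which is more than enough for the small $\lambda$ used here.

```python
from math import exp, factorial, isclose

def A(eta):
    """Log-normaliser of the Poisson family in canonical form."""
    return exp(eta)

def pmf(x, eta):
    return exp(eta * x - A(eta)) / factorial(x)

eta, t = 0.4, 0.3
support = range(60)   # truncated support; the tail beyond this is negligible

mean_T = sum(x * pmf(x, eta) for x in support)
var_T = sum((x - mean_T) ** 2 * pmf(x, eta) for x in support)
mgf_T = sum(exp(t * x) * pmf(x, eta) for x in support)

# Central finite differences for A'(eta) and A''(eta).
step = 1e-4
A_prime = (A(eta + step) - A(eta - step)) / (2 * step)
A_second = (A(eta + step) - 2 * A(eta) + A(eta - step)) / step**2

print(isclose(mean_T, A_prime, rel_tol=1e-5))                    # E[T] = A'
print(isclose(var_T, A_second, rel_tol=1e-5))                    # Var[T] = A''
print(isclose(mgf_T, exp(A(eta + t) - A(eta)), rel_tol=1e-9))    # mgf formula
```

For the Poisson family both $A'$ and $A''$ equal $\mathcal{e}^{\eta} = \lambda$, which recovers the familiar fact that its mean and variance coincide.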

## Multi-dimensional one-parameter exponential family
So far, we have assumed $X$ to be a scalar RV. But the exponential family can be extended to vectors just as well. One example of a multi-dimensional exponential family is when we make $n$ observations of an RV from an exponential family distribution and then look at their $n$-dimensional joint *pdf*. It would be:

$$
f_\mathbf{X}(\mathbf{x}|\theta) = exp[\ \eta(\theta)T(\mathbf{x}) - A(\theta)\ ]\ h(\mathbf{x})
$$

Note that, although $\mathbf{X}$ is a vector now, the functions $T(\mathbf{x})$ and $h(\mathbf{x})$ are still scalars. All the theory about the one-parameter scalar RV family that we discussed so far also applies to the multi-dimensional case.

In the case of a multi-dimensional RV resulting from multiple observations of a single-dimensional exponential family RV, the function $T(\mathbf{X})$ is a sufficient statistic, and is called the **natural sufficient statistic** of the parameter of that exponential family.

The following properties are true for an exponential family of any dimension (we use multidimensional notation). For an RV $\mathbf{X}$ from an exponential family:
+ $T = T(\mathbf{X})$ also has an exponential family distribution
+ Any RV $\mathbf{Y} = A\mathbf{X} + c$, with $A$ a nonsingular matrix (not the normaliser $A(\theta)$), also has a distribution in the exponential family
+ If $\mathcal{I}_0$ is a proper subset of the index set $\mathcal{I} = \{1, 2, \ldots, d\}$, then the joint conditional distribution of the components $X_i, i \in \mathcal{I}_0$ given the remaining components $X_j, j \in \mathcal{I} - \mathcal{I}_0$ is also an exponential family

## Multi-parameter exponential family
We have so far discussed single and multi-dimensional RVs, but only in the context of one-parameter exponential families, for example an n-dimensional Gaussian RV with fixed variance (so the mean is the only parameter). But we can have multi-parameter exponential families as well, and they have the same characteristics as the one-parameter ones. We state here the most general form of exponential family distributions, where the parameter as well as the RV can be of any dimension (including 1-d). The *pdf* looks like:

$$
f_\mathbf{X}(\mathbf{x}; \boldsymbol{\theta}) = exp[\ \boldsymbol{\eta}^T(\boldsymbol{\theta}) \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\theta})\ ]\ h(\mathbf{x})
$$

Note that, because $\boldsymbol{\theta}$ is a vector, $\boldsymbol{\eta}$ is a vector function, and consequently, $\mathbf{T}$ is also a vector function. However, $A$ and $h$ are still scalar functions, although their arguments $\boldsymbol{\theta}$ and $\mathbf{x}$ are both vectors.

Canonical form (when $\boldsymbol{\eta}$ has a one-to-one relationship with $\boldsymbol{\theta}$) is as shown below:

$$
f_\mathbf{X}(\mathbf{x};\boldsymbol{\eta}) = exp[\ \boldsymbol{\eta}^T \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\eta})\ ]\ h(\mathbf{x})
$$

**Example**: The univariate Gaussian distribution parameterised by $\boldsymbol{\theta} = [\mu\ \sigma]^T$ can be rewritten in canonical exponential family form as shown below:

$$
\begin{align}
f_X(x;\mu,\sigma) &= \dfrac{1}{\sqrt{2\pi}\sigma} exp \left( -\dfrac{1}{2} \dfrac{(x-\mu)^2}{\sigma^2} \right) \\
                  &= \dfrac{1}{\sqrt{2\pi}} exp \left( \dfrac{\mu}{\sigma^2}x - \dfrac{1}{2\sigma^2}x^2 - \dfrac{1}{2\sigma^2}\mu^2 - ln(\sigma) \right) \\
                  &= \dfrac{1}{\sqrt{2\pi}} exp \left( [\mu/\sigma^2\ \ -1/(2\sigma^2)][x\ x^2]^T - \left( \dfrac{\mu^2}{2\sigma^2}+ln(\sigma)\right) \right) \\
f(x;\eta)         &= h(x) exp\left( \boldsymbol{\eta}^T \mathbf{T}(x) - A(\boldsymbol{\eta}) \right) \\
\end{align}
$$

where,

$$
\begin{align}
\boldsymbol{\eta} &= \begin{bmatrix} \mu/\sigma^2\\ -1/(2\sigma^2) \end{bmatrix} \\
\mathbf{T}(x) &= \begin{bmatrix} x\\ x^2 \end{bmatrix} \\
A(\boldsymbol{\eta}) &= \dfrac{\mu^2}{2\sigma^2}+ln(\sigma) = -\dfrac{\eta_1^2}{4\eta_2} - \dfrac{1}{2} ln(-2\eta_2)\\
h(x) &= \dfrac{1}{\sqrt{2\pi}}
\end{align}
$$

Note that, $h(x)$ doesn't actually have an $x$ term, which is OK, because one can think of $h(x)$ having the same value for all $x$s. 

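We had seen, in the one-parameter canonical exponential family case, that $E_\eta[T(X)] = A'(\eta)$ and $ Var_\eta[T(X)] = A''(\eta) $. In the multi-parameter case, these still hold with the first derivative replaced by the gradient and the second derivative by the Hessian. The diagonal of the Hessian holds the variances of the components of $\mathbf{T}$, and the off-diagonal entries are their covariances, i.e.: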

$$
\begin{align}
E_\boldsymbol{\eta}[\mathbf{T}(X)] &= \nabla_\boldsymbol{\eta} A(\boldsymbol{\eta}) \\
Var_\boldsymbol{\eta}[\mathbf{T}(X)] &= \nabla^2_\boldsymbol{\eta} A(\boldsymbol{\eta})
\end{align}
$$

In our example of the univariate Gaussian (listing only the diagonal entries of the Hessian, i.e. the variances), we hence get:

$$
\begin{align}
E[ \begin{bmatrix} X\\ X^2 \end{bmatrix} ] &= \begin{bmatrix} \dfrac{\partial}{\partial\eta_1} \left( -\dfrac{\eta_1^2}{4\eta_2} - \dfrac{1}{2} ln(-2\eta_2) \right)\\ \dfrac{\partial}{\partial\eta_2} \left( -\dfrac{\eta_1^2}{4\eta_2} - \dfrac{1}{2} ln(-2\eta_2) \right) \end{bmatrix} \\
\text{and}\\
Var[ \begin{bmatrix} X\\ X^2 \end{bmatrix} ] &= \begin{bmatrix} \dfrac{\partial^2}{\partial\eta_1^2} \left( -\dfrac{\eta_1^2}{4\eta_2} - \dfrac{1}{2} ln(-2\eta_2) \right)\\ \dfrac{\partial^2}{\partial\eta_2^2} \left( -\dfrac{\eta_1^2}{4\eta_2} - \dfrac{1}{2} ln(-2\eta_2) \right) \end{bmatrix}
\end{align}
$$

One can work out the above equations. In the end, one would get:
$$
\begin{align}
E[ \begin{bmatrix} X\\ X^2 \end{bmatrix} ] &= \begin{bmatrix} \mu\\ \mu^2+\sigma^2 \end{bmatrix} \\
Var[ \begin{bmatrix} X\\ X^2 \end{bmatrix} ] &= \begin{bmatrix} \sigma^2\\ 4\mu^2\sigma^2+2\sigma^4 \end{bmatrix}
\end{align}
$$

which are as expected.
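For readers who would rather not differentiate by hand, the sketch below checks these values numerically. It assumes the natural parameters $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/(2\sigma^2)$ and the log-normaliser $A(\boldsymbol{\eta}) = -\eta_1^2/(4\eta_2) - \frac{1}{2}ln(-2\eta_2)$, and uses central finite differences; the particular $\mu$ and $\sigma$ are arbitrary illustrative values.

```python
from math import log, isclose

def A(eta1, eta2):
    """Log-normaliser of the univariate Gaussian in natural parameters (eta2 < 0)."""
    return -eta1**2 / (4 * eta2) - 0.5 * log(-2 * eta2)

mu, sigma = 1.5, 0.8
eta1, eta2 = mu / sigma**2, -1 / (2 * sigma**2)
step = 1e-4

# First partial derivatives: should give E[X] and E[X^2].
dA_1 = (A(eta1 + step, eta2) - A(eta1 - step, eta2)) / (2 * step)
dA_2 = (A(eta1, eta2 + step) - A(eta1, eta2 - step)) / (2 * step)

# Second partial derivatives (Hessian diagonal): should give Var[X] and Var[X^2].
d2A_1 = (A(eta1 + step, eta2) - 2 * A(eta1, eta2) + A(eta1 - step, eta2)) / step**2
d2A_2 = (A(eta1, eta2 + step) - 2 * A(eta1, eta2) + A(eta1, eta2 - step)) / step**2

print(isclose(dA_1, mu, rel_tol=1e-5))
print(isclose(dA_2, mu**2 + sigma**2, rel_tol=1e-5))
print(isclose(d2A_1, sigma**2, rel_tol=1e-5))
print(isclose(d2A_2, 4 * mu**2 * sigma**2 + 2 * sigma**4, rel_tol=1e-5))
```

Note that the sign of $\eta_2$ matters: with a positive $\eta_2$ the logarithm in $A$ would be undefined.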

## Maximum likelihood estimates
When we make $N$ observations from a canonical exponential family distribution, if they are independent observations, then their joint density function will simply be the product of the individual density functions, i.e., it will look like:

$$
\begin{align}
f(\mathbf{x};\boldsymbol{\eta}) &= \prod_{n=1}^N h(x_n) exp \left( \boldsymbol{\eta}^T \mathbf{T}(x_n) - A(\boldsymbol{\eta}) \right) \\
&= \left(  \prod_{n=1}^N h(x_n) \right) exp \left( \boldsymbol{\eta}^T \sum_{n=1}^N \mathbf{T}(x_n) - N A(\boldsymbol{\eta}) \right) 
\end{align}
$$

When we try to find the MLE, we try to find the parameter that maximises the joint *pdf*, given the observations $(x_1, x_2, \ldots, x_N)$. The joint *pdf* is renamed as the likelihood function with the role of the random variable and parameter flipped, like below:

$$
\mathcal{L}(\theta; \mathbf{x}) = f(\mathbf{x};\theta)
$$

$\theta$ here is just a placeholder for the parameter of interest and isn't the actual $\theta$ in the exponential family context.

It is common practice to maximise the log of the likelihood function instead of the likelihood function itself, as it gives the same result but is mathematically easier. In our case, the log likelihood function is as given below.

$$
\ell(\boldsymbol{\eta}) = log \left( \prod_{n=1}^N h(x_n) \right) + \boldsymbol{\eta}^T \sum_{n=1}^N \mathbf{T}(x_n) - N A(\boldsymbol{\eta})
$$

To get the MLE of $\boldsymbol{\eta}$, we take the gradient of $\ell(\boldsymbol{\eta})$ and equate it to zero:

$$
\begin{align}
\nabla_\boldsymbol{\eta}(\ell) &= \sum_{n=1}^N \mathbf{T}(x_n) - N \nabla_\boldsymbol{\eta}A(\boldsymbol{\eta}) = \mathbf{0} \\
\nabla_\boldsymbol{\eta}A(\boldsymbol{\eta}^*) &= \dfrac{1}{N} \sum_{n=1}^N \mathbf{T}(x_n)
\end{align}
$$

where, $\boldsymbol{\eta}^*$ is our MLE for $\boldsymbol{\eta}$. Something interesting happens when we use the earlier result about canonical exponential family members, where, $E_\boldsymbol{\eta}[\mathbf{T}(X)] = \nabla_\boldsymbol{\eta} A(\boldsymbol{\eta})$. Substituting this in the above MLE solution, we get,

$$
E_\boldsymbol{\eta}[\mathbf{T}(X)] = \dfrac{1}{N} \sum_{n=1}^N \mathbf{T}(x_n)
$$

We always knew, intuitively, that the sample mean is a good estimate for the mean of the original distribution (we are talking about the sample mean and the mean of $\mathbf{T}$, not $X$). Here we see that this is exactly what maximum likelihood prescribes, and that it extends from one parameter to the multi-parameter case (because here, $\mathbf{T}$ is a vector). For example, in our Gaussian example, we get,

$$
E_{\boldsymbol{\eta}^*}\left[\begin{bmatrix} X\\ X^2 \end{bmatrix}\right] = \dfrac{1}{N} \sum_{n=1}^N \begin{bmatrix} x_n\\ x_n^2 \end{bmatrix}
$$

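One can show that, for the exponential family, this MLE of the mean of the sufficient statistic, $\dfrac{1}{N} \sum_{n=1}^N \mathbf{T}(x_n)$, is unbiased for $E_\boldsymbol{\eta}[\mathbf{T}(X)]$.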
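For the Gaussian, this moment-matching recipe can be sketched in a few lines. Matching $E[X] = \mu$ and $E[X^2] = \mu^2 + \sigma^2$ to the sample means of $x$ and $x^2$ gives the estimates directly; the function name `gaussian_mle` and the data values below are made up for illustration.

```python
def gaussian_mle(xs):
    """MLE of (mu, sigma^2) by matching the sample means of T(x) = [x, x^2]."""
    n = len(xs)
    mean_x = sum(xs) / n                    # sample mean of T_1(x) = x
    mean_x2 = sum(x * x for x in xs) / n    # sample mean of T_2(x) = x^2
    mu_hat = mean_x                         # from E[X] = mu
    var_hat = mean_x2 - mean_x**2           # from E[X^2] = mu^2 + sigma^2
    return mu_hat, var_hat

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical observations
mu_hat, var_hat = gaussian_mle(data)
print(mu_hat, var_hat)   # 5.0 4.0
```

The variance estimate here divides by $N$, not $N-1$: the unbiasedness mentioned above applies to the sample means of $\mathbf{T}$, not to every quantity derived from them.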

## Generalised Linear Models
In regression analysis, we are commonly tasked with finding the relationship between a dependent variable and one or more predictor variables. The dependent variable is almost always a random variable because, no matter how many predictors we use, we can never come up with a single value for the dependent variable, only a set of possible values with associated probabilities, i.e., a probability distribution. For example, suppose we want to predict the number of people arriving at a restaurant in a given period of time. It is natural to model this as a Poisson RV, but the day of the week might affect the mean number of arrivals, i.e., $\lambda$ of that Poisson RV. Even if we factor in everything that can conceivably affect the number of people arriving at a restaurant, in the real world we will still get some prediction error. The error may be very small, but it will be present, and hence we will always have a probability distribution for the dependent variable.

Similarly, if we are predicting the price of a house based on square footage, number of bedrooms, pin code etc., we typically model the dependent variable as a Gaussian, with some $\mu$ and $\sigma$ values that are affected by the predictor variables. Another example is predicting whether a Presidential candidate will win in a district based on the demographic make-up of the voters, the amount of funding raised, whether the candidate's party won the most recent local elections there, unemployment figures in that area etc. At the end of the day, you cannot tell for sure whether the candidate will win or not. At best, you can tell how likely the candidate is to win, which makes the outcome a Bernoulli RV.

Basically, one can say that, in regression analysis, what we are actually doing is not predicting the value of a dependent variable, but predicting one or more parameters of its probability distribution: i.e., $\lambda$ for a Poisson R.V., $\mu$ and/or $\sigma$ for a Gaussian, $p$ for a Bernoulli etc. Before we continue this discussion, it is important to understand the following:
1. While regression analysis is used for estimating *pdf*/*pmf* parameters, which family of distributions the dependent variable comes from (e.g. Poisson or Gaussian) is a design choice. For instance, it makes no sense to model a candidate's win/loss as a Gaussian R.V. or the number of arrivals in a restaurant in a given period of time as a Gamma R.V. The design choice is made based on logical reasoning and/or domain expertise.
2. While the distribution family of the dependent variable is a design choice, the distribution of the predictor variables is not. For example, while housing price may be modelled as a Gaussian, the number of bedrooms need not be. In fact, given that it can only take integer values, it will certainly not be a Gaussian.

Generalised Linear Models (GLM) are about regression analysis for dependent variables from the Exponential family. It turns out that, since every sub-family in the Exponential family, be it Gaussian or Bernoulli or Poisson or anything else, has the same form, the mathematics associated with regression analysis of dependent variables from any of these families is the same. Suppose the distribution of the dependent variable $Y$ conditioned on a random vector of predictors, $\mathbf{X}$ of length $n$, is an exponential family pdf:

$$
f(y|\mathbf{x};\eta) = exp[\ \eta T(y) - A(\eta)\ ]\ h(y)
$$

For the sake of mathematical simplicity, we consider a one-parameter family in canonical form. The idea is to predict $\eta$ based on $\mathbf{X} = \mathbf{x}$. As the name suggests, we are interested in linear models here. So this translates to having $\eta$ as:

$$
\eta = \mathbf{c}^T\mathbf{x}
$$

Our job is to come up with the coefficients $\mathbf{c}$ that scale the predictors, given a set of $m$ training $(\mathbf{x}^i, y^i)$ pairs. The classical algorithm for doing this is to use gradient ascent on the log likelihood function, which is,

$$
\begin{align}
\ell(\eta) &= log \left( P(Y^1 = y^1, Y^2 = y^2, \ldots, Y^m = y^m |\mathbf{x}^1, \mathbf{x}^2, \dots, \mathbf{x}^m;\eta) \right)\\
\text{Assuming the training pairs are independent of one another,} \\
           &= log \left( \prod_{i=1}^{m} P(y^i|\mathbf{x}^i;\eta) \right) \\
           &= \sum_{i=1}^{m} log \left( P(y^i|\mathbf{x}^i;\eta) \right) 
\end{align}
$$

and the algorithm is,

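1. Update $c_j:= c_j + \alpha \dfrac{\partial}{\partial c_j}\ell(\eta(\mathbf{c}))$ for all $j \in \{1, \ldots, n\}$, where $\alpha$ is the step size
2. Update $\eta^i = \mathbf{c}^T\mathbf{x}^i$ for every training example $i$
3. Go back to step 1 and repeat until the updates become negligible.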

We can expand step 1 as follows:

$$
\begin{align}
c_j &:= c_j + \alpha \dfrac{\partial}{\partial c_j}\ell(\eta(\mathbf{c})) \\
    &:= c_j + \alpha \dfrac{\partial}{\partial c_j} \left(\sum_{i=1}^{m} log \left( P(y^i|\mathbf{x}^i;\eta^i) \right) \right), \quad \eta^i = \mathbf{c}^T\mathbf{x}^i \\
    &:= c_j + \alpha \dfrac{\partial}{\partial c_j} \left(\sum_{i=1}^{m} log \left( exp[\ \eta^i T(y^i) - A(\eta^i)\ ]\ h(y^i) \right)\right) \\
    &:= c_j + \alpha \sum_{i=1}^{m} \left( \dfrac{\partial}{\partial c_j} \eta^i T(y^i) - \dfrac{\partial}{\partial c_j} A(\eta^i) + \dfrac{\partial}{\partial c_j} log(h(y^i)) \right) \\
    &:= c_j + \alpha \sum_{i=1}^{m} \left( \dfrac{\partial}{\partial c_j} \mathbf{c}^T\mathbf{x}^i T(y^i) - \dfrac{\partial A(\eta^i)}{\partial \eta^i}  \dfrac{\partial \eta^i}{\partial c_j} + 0 \right) \\
    &:= c_j + \alpha \sum_{i=1}^{m} \left( x^i_j T(y^i) - E_{\eta^i}[T(Y)] x^i_j \right) \\
    &:= c_j + \alpha \sum_{i=1}^{m} \left( T(y^i) - E_\mathbf{c}[T(Y)|\mathbf{x}^i] \right) x^i_j
\end{align}
$$

where, in the penultimate step, we have utilised the result, $ E_\eta[T(X)] = A'(\eta) $, for exponential family RVs. 

In the case of a Gaussian R.V. with known unit variance, we have $ T(y) = y $ and $E[T(Y)] = E[Y] = \eta$. Therefore, the adaptation step above becomes:

$$
\begin{align}
c_j &:= c_j + \alpha \sum_{i=1}^{m} \left( y^i - \mathbf{c}^T\mathbf{x}^i \right) x^i_j \\
    &:= c_j - \alpha \sum_{i=1}^{m} \left( \mathbf{c}^T\mathbf{x}^i - y^i \right) x^i_j
\end{align}
$$

This is the familiar gradient descent formula used for linear regression. In the case of a Bernoulli R.V., we have, $ T(y) = y $, and $E[T(Y)] = p = \dfrac{1}{1+\mathcal{e}^{-\eta}}$. Therefore, the adaptation step above becomes:

$$
\begin{align}
c_j &:= c_j + \alpha \sum_{i=1}^{m} \left( y^i - \dfrac{1}{1+\mathcal{e}^{-\mathbf{c}^T\mathbf{x}^i}} \right) x^i_j \\
    &:= c_j - \alpha \sum_{i=1}^{m} \left( sigmoid(\mathbf{c}^T\mathbf{x}^i) - y^i \right) x^i_j
\end{align}
$$

which is the gradient descent formula used for logistic regression.
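Since the update has the same shape for every family, one routine can serve both cases. The following is a minimal sketch, assuming $T(y) = y$ and a canonical link, in which only the mean function changes between linear and logistic regression. The names `fit_glm`, `identity` and `dot`, the toy data, the step size and the iteration count are all illustrative assumptions.

```python
from math import exp

def sigmoid(eta):
    return 1.0 / (1.0 + exp(-eta))

def identity(eta):
    return eta

def dot(c, x):
    return sum(cj * xj for cj, xj in zip(c, x))

def fit_glm(xs, ys, mean_fn, alpha=0.05, iters=5000):
    """Batch gradient ascent on the log likelihood of a canonical GLM, T(y) = y.

    mean_fn maps eta = c . x to E[Y | x]: identity for a unit-variance
    Gaussian, sigmoid for a Bernoulli.
    """
    c = [0.0] * len(xs[0])
    for _ in range(iters):
        # Residuals T(y^i) - E[T(Y) | x^i] under the current coefficients.
        residuals = [y - mean_fn(dot(c, x)) for x, y in zip(xs, ys)]
        # c_j := c_j + alpha * sum_i residual_i * x^i_j, for all j at once.
        c = [cj + alpha * sum(r * x[j] for r, x in zip(residuals, xs))
             for j, cj in enumerate(c)]
    return c

# Hypothetical toy data; the first component of each x is a constant 1 (intercept).
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
c_linear = fit_glm(xs, [1.0, 3.0, 5.0, 7.0], identity)    # data lie on y = 1 + 2x
c_logistic = fit_glm(xs, [0.0, 1.0, 0.0, 1.0], sigmoid)   # non-separable labels
print(c_linear)     # approximately [1.0, 2.0]
print(c_logistic)
```

A fixed step size and iteration count keep the sketch short; a practical implementation would normally monitor the change in the log likelihood or in $\mathbf{c}$ instead.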

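When does the gradient ascent algorithm stop? In the generic GLM case we have,

$$
c_j := c_j + \alpha \sum_{i=1}^{m} \left( T(y^i) - E_\mathbf{c}[T(Y)|\mathbf{x}^i] \right) x^i_j
$$

The update leaves $\mathbf{c}$ unchanged when $\sum_{i=1}^{m} \left( T(y^i) - E_\mathbf{c}[T(Y)|\mathbf{x}^i] \right) x^i_j = 0$ for every $j$. If one of the predictors is a constant (an intercept term, with $x^i_j = 1$ for all $i$), the condition for that $j$ reads,

$$
\begin{align}
\sum_{i=1}^{m} T(y^i) &= \sum_{i=1}^{m} E_\mathbf{c}[T(Y)|\mathbf{x}^i] \\
\dfrac{\sum_{i=1}^{m} T(y^i)}{m} &= \dfrac{1}{m} \sum_{i=1}^{m} E_\mathbf{c}[T(Y)|\mathbf{x}^i]
\end{align}
$$

That is, the sample average of $T$ matches the average of its fitted expectations. When the intercept is the only predictor, $E_\mathbf{c}[T(Y)|\mathbf{x}^i]$ is the same for every $i$, and this reduces to the result we got while dealing with the maximum likelihood of the exponential family earlier.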
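The intercept-only case is small enough to verify directly. The sketch below assumes a Bernoulli response with a single constant predictor, so the update reduces to one scalar coefficient `c`; the data, step size and iteration count are illustrative.

```python
from math import exp, isclose

def sigmoid(eta):
    return 1.0 / (1.0 + exp(-eta))

ys = [1, 0, 0, 1, 1, 0, 1, 1]   # hypothetical binary outcomes
alpha = 0.1
c = 0.0                          # single coefficient multiplying the constant predictor 1

for _ in range(2000):
    # c := c + alpha * sum_i (T(y^i) - E_c[T(Y)]) * 1
    c += alpha * sum(y - sigmoid(c) for y in ys)

fitted_mean = sigmoid(c)
sample_mean = sum(ys) / len(ys)
print(isclose(fitted_mean, sample_mean, abs_tol=1e-9))
```

At the stopping point the fitted probability equals the fraction of ones in the data, which is the moment-matching condition from the maximum likelihood section.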

## References
1. [Exponential family - Purdue](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) 
2. [Sufficiency - Missouri](https://people.missouristate.edu/songfengzheng/Teaching/MTH541/Lecture%20notes/Sufficient.pdf)
3. [Exponential family - Berkeley](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)

# Hypothesis testing
## Basics
+ The null hypothesis must be the least harmful choice (e.g. the accused is not guilty). One needs to present strong evidence to have it rejected. If it is unclear which assumption is the least harmful, take the hypothesis test out of the picture and ask yourself: "If I were to just assume the null hypothesis, not test it, and continue acting on that assumption, how harmful would that be compared to doing the same with the counter-assumption?" For example, if you assume that the accused isn't guilty and act purely on that assumption, you will simply let the accused walk free. But if you assume the accused is guilty and act on that assumption, you will punish the accused. In the former case, some criminals may get away. In the latter case, some innocents may be punished. Sometimes the harm is measurable (e.g. in terms of money, when the hypothesis test relates to an investment or expense). In some cases, like our court example, it is obvious. In other cases, you need domain experts to tell you which is the most harmful case.
+ The null hypothesis implies a specific probability distribution for the data. Using that, you test how likely it is that you would see the data you have if it were a random observation of an RV belonging to that distribution.
    + Type I error is said to have been committed when you reject the null hypothesis when the null hypothesis is actually true. 
    + Type II error is said to have been committed when you fail to reject the null hypothesis when it is false. Given how the null hypothesis was chosen, this is less harmful than committing a Type I error: failing to reject a not-guilty assumption when the accused is actually guilty is less harmful than rejecting the not-guilty assumption when the accused is actually innocent.
    + $\alpha$ = Probability(Type I error) is called the significance level of the test. In the "critical value approach", $\alpha$ is a design choice from which the critical region is derived. In the "p-value approach", $\alpha$ is still chosen in advance; the p-value is derived from the test statistic, and the null hypothesis is rejected when the p-value is below $\alpha$.
    + $\beta$ = Probability(Type II error). To compute $\beta$, one must specify a specific alternative hypothesis and, from its implications and the critical region as decided by $\alpha$, derive $\beta$.
        + In this [example](https://online.stat.psu.edu/stat415/lesson/9/9.1), instead of specifying the alternative hypothesis as simply "not $H_0$" or "$p \neq 0.25$", give a specific value for $p$, like 0.27, and then ask what $\beta$ would be for a given $\alpha$ if the truth was actually that $p = 0.27$.
    + The implications of the null hypothesis (which decide the probability distribution) can be computed theoretically in some cases: if we assume a coin is unbiased, the implication is that the data must be from a Bernoulli R.V. with $p = 0.5$. In other cases, the implication must come from data. For example, if we are testing whether a special program reduces high school dropouts, you can use the nation-wide data available about dropouts to derive the implication (such as the mean dropout rate) of your null hypothesis (which would be that the program does *not* reduce dropouts). If one cannot derive the implication from theory and there is no existing data to go to either, one would have to use a control group and use data from that group to derive the implications of the null hypothesis.
+ The distribution of the sample statistic (e.g. the proportion of females in a sample) is often approximately normal, by the central limit theorem (a sample proportion is a sum of Bernoulli R.V.s divided by the sample size), even though the individual observations are from a non-normal distribution (like a Bernoulli R.V.).
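
To see how $\alpha$, the critical region and $\beta$ fit together, here is a minimal sketch for a one-sided test on a binomial count, using exact binomial probabilities. All the numbers ($n = 100$, $H_0: p = 0.25$, a specific alternative $p = 0.35$, $\alpha = 0.05$) and the function names are hypothetical choices for illustration and are not taken from the linked example.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k_min, n, p):
    """P(K >= k_min) for K ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))

def critical_value(n, p0, alpha):
    """Smallest k with P(K >= k | p0) <= alpha (one-sided, upper-tail test)."""
    for k in range(n + 2):
        if upper_tail(k, n, p0) <= alpha:
            return k

n, p0, p_alt, alpha = 100, 0.25, 0.35, 0.05   # hypothetical design values

k_crit = critical_value(n, p0, alpha)      # reject H0 when the count is >= k_crit
actual_alpha = upper_tail(k_crit, n, p0)   # achieved P(Type I error), at most alpha
beta = 1 - upper_tail(k_crit, n, p_alt)    # P(Type II error) if p is really p_alt

print(k_crit, actual_alpha, beta)
```

Because the count is discrete, the achieved Type I error probability is usually somewhat below the nominal $\alpha$. Moving the alternative closer to $p_0$ (say 0.27) raises $\beta$ sharply, which is why a specific alternative has to be stated before $\beta$ means anything.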
    
    