# Gaussian discriminant analysis

## 1.1 The multivariate normal distribution

<font size=4>
&emsp;&emsp;The multivariate normal distribution in $n$ dimensions, also called the multivariate Gaussian distribution, is parameterized by a **mean vector** $\mu\in\mathbb{R}^n$ and a **covariance matrix** $\Sigma\in\mathbb{R}^{n\times n}$, where $\Sigma$ is symmetric and positive semi-definite ($\Sigma\ge0$). For the density below to exist, $\Sigma$ must in fact be positive definite, so that $\Sigma^{-1}$ exists and $|\Sigma|>0$. The distribution is written $\mathcal{N}(\mu,\Sigma)$, and its density is given by:
</font>

<font size=4>
$$
p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\,exp\left( -\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) \right)
$$
</font>

<font size=4>
&emsp;&emsp;In the equation above, "$|\Sigma|$" denotes the determinant of the matrix $\Sigma$.
</font>
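
<font size=4>
&emsp;&emsp;To make the density formula concrete, here is a minimal NumPy sketch that evaluates it at a single point. The function name `mvn_pdf` is a hypothetical helper chosen for illustration, and the sketch assumes $\Sigma$ is symmetric positive definite.
</font>

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x; assumes Sigma is symmetric positive definite."""
    n = mu.shape[0]
    diff = x - mu
    # Solve Sigma z = diff rather than forming the inverse explicitly.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Standard 2-D normal evaluated at its mean: the value is 1 / (2 * pi).
value = mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```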

<font size=4>
&emsp;&emsp;For a random variable $X$ distributed $\mathcal{N}(\mu,\Sigma)$, the mean is (unsurprisingly) given by $\mu$:
</font>

<font size=4>
$$
E[X]=\int_xx\,p(x;\mu,\Sigma)dx=\mu
$$
</font>

<font size=4>
&emsp;&emsp;The **covariance** of a vector-valued random variable $Z$ is defined as $Cov(Z)=E[(Z-E[Z])(Z-E[Z])^T]$. This generalizes the notion of the variance of a real-valued random variable. The covariance can equivalently be written as $Cov(Z)=E[ZZ^T]-(E[Z])(E[Z])^T$. If $X\sim\mathcal{N}(\mu,\Sigma)$, then
</font>

<font size=4>
$$
Cov(X)=\Sigma
$$
</font>
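
<font size=4>
&emsp;&emsp;The two facts $E[X]=\mu$ and $Cov(X)=\Sigma$ can be checked empirically. The sketch below draws samples with NumPy and computes the sample mean and both forms of the covariance. The particular $\mu$, $\Sigma$, seed, and sample size are arbitrary choices made for illustration.
</font>

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Each row of X is one draw from N(mu, Sigma).
X = rng.multivariate_normal(mu, Sigma, size=200_000)

sample_mean = X.mean(axis=0)
centered = X - sample_mean
# Empirical version of E[(Z - E[Z])(Z - E[Z])^T]
sample_cov = centered.T @ centered / X.shape[0]
# Empirical version of E[Z Z^T] - E[Z] E[Z]^T
sample_cov_alt = X.T @ X / X.shape[0] - np.outer(sample_mean, sample_mean)
```

<font size=4>
&emsp;&emsp;With this many samples, `sample_mean` and `sample_cov` should lie close to `mu` and `Sigma`, and the two covariance forms should agree up to floating-point error.
</font>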

## 1.2 The Gaussian Discriminant Analysis model

<font size=4>
&emsp;&emsp;When we have a classification problem in which the input features $x$ are continuous-valued random variables, we can use the Gaussian Discriminant Analysis (GDA) model, which models $p(x\,|\,y)$ using a multivariate normal distribution. The model is:
</font>

<font size=4>
$$
\begin{align}
y&\sim{Bernoulli(\phi)} \\
x\,|\,y=0&\sim\mathcal{N}(\mu_0,\Sigma) \\
x\,|\,y=1&\sim\mathcal{N}(\mu_1,\Sigma) 
\end{align}
$$
</font>
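
<font size=4>
&emsp;&emsp;Read as a generative recipe, the model says: first draw the label $y$, then draw $x$ from the Gaussian belonging to that label. The following sketch simulates this with NumPy. The function name `sample_gda` and its arguments are hypothetical, introduced only for illustration.
</font>

```python
import numpy as np

def sample_gda(m, phi, mu0, mu1, Sigma, seed=0):
    """Draw m pairs (x, y) from the GDA model with a shared covariance matrix."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, phi, size=m)                  # y ~ Bernoulli(phi)
    means = np.where(y[:, None] == 1, mu1, mu0)       # row i holds mu_{y_i}
    noise = rng.multivariate_normal(np.zeros(len(mu0)), Sigma, size=m)
    X = means + noise                                 # x | y ~ N(mu_y, Sigma)
    return X, y

X, y = sample_gda(5000, 0.3, np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.eye(2))
```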

<font size=4>
&emsp;&emsp;Writing out the distribution, this is:
</font>

<font size=4>
$$
\begin{align}
p(y) &= \phi^y(1-\phi)^{1-y} \\
p(x\,|\,y=0) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\,exp\left( -\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0) \right) \\
p(x\,|\,y=1) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\,exp\left( -\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) \right)
\end{align}
$$
</font>

<font size=4>
&emsp;&emsp;Here, the parameters of our model are $\phi,\Sigma,\mu_0,\mu_1$. (Note that while there are two different mean vectors $\mu_0$ and $\mu_1$, this model is usually applied using only one covariance matrix $\Sigma$.) The log-likelihood of the data is given by
</font>

<font size=4>
$$
\begin{align}
l(\phi,\mu_0,\mu_1,\Sigma)
&= log\,\prod^m_{i=1}p(x^{(i)},y^{(i)};\phi,\mu_0,\mu_1,\Sigma) \\
&= log\,\prod^m_{i=1}p(x^{(i)}\,|\,y^{(i)};\phi,\mu_0,\mu_1,\Sigma)p(y^{(i)};\phi)
\end{align}
$$
</font>

<font size=4>
&emsp;&emsp;By maximizing $l$ with respect to the parameters, we find the maximum likelihood estimate of the parameters to be:
</font>

<font size=4>
$$
\begin{align}
\phi &= \frac{1}{m}\sum^{m}_{i=1}1\{y^{(i)}=1 \} \\
\mu_0 &= \frac{\sum^{m}_{i=1}1\{y^{(i)}=0\}x^{(i)}}{\sum^{m}_{i=1}1\{y^{(i)}=0\}} \\
\mu_1 &= \frac{\sum^{m}_{i=1}1\{y^{(i)}=1\}x^{(i)}}{\sum^{m}_{i=1}1\{y^{(i)}=1\}} \\
\Sigma &= \frac{1}{m}\sum^{m}_{i=1}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T
\end{align}
$$
</font>
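
<font size=4>
&emsp;&emsp;The four estimates translate almost line by line into NumPy. The sketch below is one possible implementation, assuming the data are stored as an array `X` of shape $(m, n)$ and a label vector `y` with entries in $\{0,1\}$. The name `fit_gda` and the toy data are hypothetical.
</font>

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA with a shared covariance matrix."""
    m = X.shape[0]
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Subtract each example's own class mean, then average the outer products.
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma

# Toy data: two examples per class.
X = np.array([[0.0, 1.0], [2.0, -1.0], [4.0, 3.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, Sigma = fit_gda(X, y)
```

<font size=4>
&emsp;&emsp;For this toy data the class means are $(1,0)$ and $(5,4)$, $\phi=0.5$, and the pooled covariance works out to the identity matrix.
</font>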

## 1.3 Detailed derivation

### 1.3.1 Preliminaries

<font size=4>
&emsp;&emsp;The derivation of Gaussian discriminant analysis uses the following four identities:
</font>

<font size=4>
$$
\triangledown_xx^TAx=2Ax,\quad\text{where } A \text{ is a symmetric matrix}\quad(1)
$$
</font>

<font size=4>
$$
\triangledown_A\big|\,A\,\big|=\big|\,A\,\big|\,(A^{-1})^T\quad(2)
$$
</font>

<font size=4>
$$
\triangledown_Alog\,\big|\,A\,\big|=A^{-1},\quad\text{where } A \text{ is a (symmetric) positive definite matrix}\quad(3)
$$
</font>

<font size=4>
$$
\triangledown_Ax^TAx=xx^T,\quad\text{where } A \text{ is a symmetric matrix}\quad(4)
$$
</font>

<font size=4>
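&emsp;&emsp;Since the matrix $A$ in (1) is symmetric, $x^TAx$ is a quadratic form. In general $\triangledown_xx^TAx=(A+A^T)x$, and for symmetric $A$ we have $A+A^T=2A$; therefore $\triangledown_xx^TAx=2Ax$.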
</font>
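
<font size=4>
&emsp;&emsp;Identity (1) can be checked numerically with central differences. The sketch below uses a hypothetical helper `numerical_gradient` and a randomly generated symmetric matrix; both are illustrative choices rather than part of the derivation.
</font>

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = B + B.T                      # symmetric test matrix
x = rng.normal(size=3)

numeric = numerical_gradient(lambda v: v @ A @ v, x)
exact = 2 * A @ x                # right-hand side of identity (1)
```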

<font size=4>
Proof of (2):
</font>

<font size=4>
&emsp;&emsp;From the cofactor expansion of the determinant along column $j$,
</font>

<font size=4>
$$
\big|\,A\,\big|=\sum^n_{i=1}(-1)^{i+j}A_{ij}\big|\,A_{\backslash{i},\backslash{j}}\big| \quad(\,\text{for any } j\in\{1,\cdots,n\})
$$
</font>

<font size=4>
&emsp;&emsp;we obtain
</font>

<font size=4>
$$
\frac{\partial}{\partial{A_{kl}}}\big|\,A\,\big|=\frac{\partial}{\partial{A_{kl}}}\sum^n_{i=1}(-1)^{i+j}A_{ij}\big|\,A_{\backslash{i},\backslash{j}}\big|=(-1)^{k+l}\big|\,A_{\backslash{k},\backslash{l}}\big|=(adj(A))_{lk}
$$
</font>

<font size=4>
&emsp;&emsp;where $adj(A)$ denotes the adjugate of the matrix $A$. Therefore
</font>

<font size=4>
$$
\triangledown_A\big|\,A\,\big|=(adj(A))^T=\big|\,A\,\big|\,(A^{-1})^T
$$
</font>
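
<font size=4>
&emsp;&emsp;Identity (2) can also be checked entry by entry with finite differences, treating every entry of $A$ as an independent variable. The helper `numerical_matrix_gradient` and the test matrix below are hypothetical choices for illustration; the matrix is deliberately non-symmetric so that the transpose in (2) matters.
</font>

```python
import numpy as np

def numerical_matrix_gradient(f, A, eps=1e-6):
    """Central-difference gradient of a scalar function f with respect to each entry of A."""
    grad = np.zeros_like(A)
    for k in range(A.shape[0]):
        for l in range(A.shape[1]):
            step = np.zeros_like(A)
            step[k, l] = eps
            grad[k, l] = (f(A + step) - f(A - step)) / (2 * eps)
    return grad

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 4.0]])   # invertible, determinant 25

numeric = numerical_matrix_gradient(np.linalg.det, A)
exact = np.linalg.det(A) * np.linalg.inv(A).T   # right-hand side of identity (2)
```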

<font size=4>
Proof of (3):
</font>

<font size=4>
&emsp;&emsp;Since $A$ is positive definite, $\big|\,A\,\big|>0$, so $log\,\big|\,A\,\big|$ is well defined. By the chain rule,
</font>

<font size=4>
$$
\frac{\partial\,log\,\big|\,A\,\big|}{\partial\,A_{ij}}=\frac{\partial\,log\,\big|\,A\,\big|}{\partial\,\big|\,A\,\big|}\frac{\partial\,\big|\,A\,\big|}{\partial\,A_{ij}}=\frac{1}{\big|\,A\,\big|}\frac{\partial\,\big|\,A\,\big|}{\partial\,A_{ij}}
$$
</font>

<font size=4>
&emsp;&emsp;Combining this with (2) gives
</font>

<font size=4>
$$
\triangledown_Alog\,\big|\,A\,\big|=\frac{1}{\big|\,A\,\big|}\triangledown_A\,\big|\,A\,\big|=A^{-1}
$$
</font>

<font size=4>
&emsp;&emsp;Because $A$ is symmetric, so is $A^{-1}$, which is why the final result carries no transpose.
</font>
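
<font size=4>
&emsp;&emsp;As a quick sanity check of (3), consider a diagonal positive definite matrix with entries $a,b>0$ (an illustrative special case, not part of the proof). Writing $\big|\,A\,\big|=A_{11}A_{22}-A_{12}A_{21}$ and differentiating with respect to each entry at this $A$:
</font>

<font size=4>
$$
A=\begin{bmatrix}a&0\\0&b\end{bmatrix},\qquad
\frac{\partial\,log\,\big|\,A\,\big|}{\partial\,A_{11}}=\frac{A_{22}}{\big|\,A\,\big|}=\frac{1}{a},\qquad
\frac{\partial\,log\,\big|\,A\,\big|}{\partial\,A_{22}}=\frac{A_{11}}{\big|\,A\,\big|}=\frac{1}{b},\qquad
\frac{\partial\,log\,\big|\,A\,\big|}{\partial\,A_{12}}=\frac{-A_{21}}{\big|\,A\,\big|}=0
$$
</font>

<font size=4>
&emsp;&emsp;These match the corresponding entries of $A^{-1}=\begin{bmatrix}1/a&0\\0&1/b\end{bmatrix}$.
</font>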

<font size=4>
Proof of (4):
</font>

<font size=4>
&emsp;&emsp;From
</font>

<font size=4>
$$
\frac{\partial\,(x^TAx)}{\partial\,A_{lk}}=\frac{\partial}{\partial\,A_{lk}}\sum_i\sum_jA_{ij}x_ix_j=x_lx_k
$$
</font>

<font size=4>
&emsp;&emsp;we obtain
</font>

<font size=4>
$$
\triangledown_Ax^TAx=xx^T
$$
</font>
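
<font size=4>
&emsp;&emsp;For a concrete picture of (4), the $2\times2$ case can be written out in full (a worked illustration, with the entries of $A$ treated as independent variables):
</font>

<font size=4>
$$
\begin{align}
x^TAx &= A_{11}x_1^2+A_{12}x_1x_2+A_{21}x_2x_1+A_{22}x_2^2 \\
\triangledown_Ax^TAx &= \begin{bmatrix}x_1^2&x_1x_2\\x_2x_1&x_2^2\end{bmatrix}=xx^T
\end{align}
$$
</font>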

### 1.3.2 Deriving the maximum likelihood estimates for GDA

<font size=4>
The log-likelihood function:
</font>

<font size=4>
$$
\begin{align}
l(\phi,\mu_0,\mu_1,\Sigma)
&= log\prod^m_{i=1}p(x^{(i)},y^{(i)}) \\
&= log\prod^m_{i=1}p(x^{(i)}\,|\,y^{(i)})p(y^{(i)}) \\
&= \sum^m_{i=1}log\,p(x^{(i)}\,|\,y^{(i)})+\sum^m_{i=1}log\,p(y^{(i)}) \\
&= \sum^m_{i=1}log\left( p(x^{(i)}\,|\,y^{(i)}=0)^{1-y^{(i)}} \cdot p(x^{(i)}\,|\,y^{(i)}=1)^{y^{(i)}} \right)+\sum^m_{i=1}log\,p(y^{(i)}) \\
&= \sum^m_{i=1}(1-y^{(i)})log\left(p(x^{(i)}\,|\,y^{(i)}=0)\right)+\sum^m_{i=1}y^{(i)}log\left(p(x^{(i)}\,|\,y^{(i)}=1)\right)+\sum^m_{i=1}log\,p(y^{(i)})
\end{align}
$$
</font>
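
<font size=4>
&emsp;&emsp;The three-term decomposition in the last line can be computed directly. The sketch below is one way to do it in NumPy; the function name `gda_log_likelihood`, the toy data, and the parameter values are hypothetical and serve only to illustrate the formula.
</font>

```python
import numpy as np

def gda_log_likelihood(X, y, phi, mu0, mu1, Sigma):
    """Log-likelihood of (X, y) under GDA, split into the three sums of the derivation."""
    m, n = X.shape
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    log_norm = -0.5 * n * np.log(2 * np.pi) - 0.5 * logdet

    def log_px(mu):
        d = X - mu
        # Row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu)
        quad = np.einsum("ij,jk,ik->i", d, Sigma_inv, d)
        return log_norm - 0.5 * quad

    part0 = np.sum((1 - y) * log_px(mu0))                           # depends on mu0, Sigma
    part1 = np.sum(y * log_px(mu1))                                 # depends on mu1, Sigma
    part_y = np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))    # depends on phi
    return part0 + part1 + part_y

X = np.array([[0.0, 1.0], [2.0, -1.0], [4.0, 3.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
mu0, mu1 = np.array([1.0, 0.0]), np.array([5.0, 4.0])   # class means of the toy data
ll_at_means = gda_log_likelihood(X, y, 0.5, mu0, mu1, np.eye(2))
ll_shifted = gda_log_likelihood(X, y, 0.5, mu0 + 0.5, mu1, np.eye(2))
```

<font size=4>
&emsp;&emsp;Moving $\mu_0$ away from the class-0 mean lowers the log-likelihood, so `ll_shifted` is smaller than `ll_at_means`.
</font>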

<font size=4>
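&emsp;&emsp;Note that this function splits into three parts: $\mu_0$ appears only in the first part, $\mu_1$ only in the second, and $\phi$ only in the third, while $\Sigma$ appears in both the first and second parts.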
</font>

<font size=4>
First, we solve for $\phi$:
</font>

<font size=4>
$$
\begin{align}
\triangledown_\phi\,l(\phi,\mu_0,\mu_1,\Sigma)
&= \triangledown_\phi\,\sum^m_{i=1}log\,p(y^{(i)})\\
&= \triangledown_\phi\,\sum^m_{i=1}log\,\phi^{y^{(i)}}(1-\phi)^{(1-y^{(i)})} \\
&= \triangledown_\phi\,\sum^m_{i=1}\left( y^{(i)}log\,\phi+(1-y^{(i)})log\,(1-\phi) \right) \\
&= \sum^m_{i=1}\left( y^{(i)}\frac{1}{\phi}-(1-y^{(i)})\frac{1}{1-\phi} \right) \\
&= \sum^m_{i=1}\left( 1\{y^{(i)}=1\}\frac{1}{\phi}-1\{y^{(i)}=0\}\frac{1}{1-\phi} \right)
\end{align}
$$
</font>

<font size=4>
&emsp;&emsp;Setting this to zero gives
</font>

<font size=4>
$$
\begin{align}
\sum^m_{i=1}\left( 1\{y^{(i)}=1\}\frac{1}{\phi}-1\{y^{(i)}=0\}\frac{1}{1-\phi} \right) &= 0 \\
\frac{\sum^m_{i=1}1\{y^{(i)}=1\}}{\phi}-\frac{\sum^m_{i=1}1\{y^{(i)}=0\}}{1-\phi} &= 0 \\
\frac{\sum^m_{i=1}1\{y^{(i)}=1\}}{\phi} &= \frac{\sum^m_{i=1}1\{y^{(i)}=0\}}{1-\phi} \\
\sum^m_{i=1}1\{y^{(i)}=1\}-\phi\sum^m_{i=1}1\{y^{(i)}=1\} &= \phi\sum^m_{i=1}1\{y^{(i)}=0\} \\
\phi\left( \sum^m_{i=1}1\{y^{(i)}=0\}+\sum^m_{i=1}1\{y^{(i)}=1\} \right) &= \sum^m_{i=1}1\{y^{(i)}=1\} \\
\phi &= \frac{\sum^m_{i=1}1\{y^{(i)}=1\}}{\sum^m_{i=1}1\{y^{(i)}=0\}+\sum^m_{i=1}1\{y^{(i)}=1\}}
\end{align}
$$
</font>

<font size=4>
&emsp;&emsp;Note that $\sum^m_{i=1}1\{y^{(i)}=0\}+\sum^m_{i=1}1\{y^{(i)}=1\}=m$, and therefore
</font>

<font size=4>
$$
\phi = \frac{\sum^m_{i=1}1\{y^{(i)}=1\}}{m}
$$
</font>
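
<font size=4>
&emsp;&emsp;A small numerical check of this result: for a made-up label vector, the derivative of the $\phi$-dependent part of the log-likelihood should vanish at the fraction of positive labels. The labels and the helper name `score` below are hypothetical.
</font>

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1, 0, 1, 1])   # 5 positives out of 8
phi_hat = np.mean(y == 1)                # closed-form estimate

def score(phi):
    """Derivative of sum_i log p(y_i; phi) with respect to phi."""
    return np.sum(y == 1) / phi - np.sum(y == 0) / (1 - phi)

score_at_estimate = score(phi_hat)
```

<font size=4>
&emsp;&emsp;The derivative is positive for $\phi$ below the estimate and negative above it, which is consistent with the stationary point being a maximum.
</font>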

<font size=4>
Next, we solve for $\mu_0$:
</font>

<font size=4>
$$
\begin{align}
\triangledown_{\mu_0}\,l(\phi,\mu_0,\mu_1,\Sigma)
&= \triangledown_{\mu_0}\,\sum^m_{i=1}(1-y^{(i)})log\,p(x^{(i)}\,|\,y^{(i)}=0)\\
&= \triangledown_{\mu_0}\,\sum^m_{i=1}(1-y^{(i)})(log\,\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}-\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)) \\
&= \sum^{m}_{i=1}(1-y^{(i)})\Sigma^{-1}(x^{(i)}-\mu_0) \\
&= \sum^{m}_{i=1}1\{y^{(i)}=0\}\Sigma^{-1}(x^{(i)}-\mu_0)
\end{align}
$$
</font>

<font size=4>
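&emsp;&emsp;Setting this to zero and left-multiplying by $\Sigma$ (which is invertible) leaves $\sum^{m}_{i=1}1\{y^{(i)}=0\}(x^{(i)}-\mu_0)=0$, which gives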
</font>

<font size=4>
$$
\mu_0=\frac{\sum^{m}_{i=1}1\{y^{(i)}=0\}x^{(i)}}{\sum^{m}_{i=1}1\{y^{(i)}=0\}}
$$
</font>

<font size=4>
Similarly,
</font>

<font size=4>
$$
\mu_1=\frac{\sum^{m}_{i=1}1\{y^{(i)}=1\}x^{(i)}}{\sum^{m}_{i=1}1\{y^{(i)}=1\}}
$$
</font>
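
<font size=4>
&emsp;&emsp;The stationarity condition for $\mu_0$ can be verified on toy data: the gradient $\sum_i 1\{y^{(i)}=0\}\Sigma^{-1}(x^{(i)}-\mu_0)$ should be zero at the class-0 sample mean, whatever positive definite $\Sigma$ is used. The data, the matrix `Sigma`, and the helper `grad_mu0` below are hypothetical.
</font>

```python
import numpy as np

X = np.array([[0.0, 1.0], [2.0, -1.0], [4.0, 3.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # any symmetric positive definite matrix works here

def grad_mu0(mu0):
    """sum_i 1{y_i = 0} Sigma^{-1} (x_i - mu0), the gradient from the derivation."""
    diffs = X[y == 0] - mu0
    return np.linalg.solve(Sigma, diffs.sum(axis=0))

mu0_hat = X[y == 0].mean(axis=0)   # closed-form estimate
grad_at_estimate = grad_mu0(mu0_hat)
```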

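<font size=4>
Finally, we solve for $\Sigma$. Before doing so, we first show that, for any vector $a$ that does not depend on $\Sigma$,
</font>

<font size=4>
$$
\triangledown_{\Sigma}\,a^T\Sigma^{-1}a=-\Sigma^{-1}aa^T\Sigma^{-1}
$$
</font>

<font size=4>
&emsp;&emsp;For an invertible matrix $A$ whose entries depend on a scalar $x$, differentiating $A^{-1}A=I$ gives
</font>

<font size=4>
$$
\begin{align}
\frac{\partial\,I}{\partial\,x}
&= \frac{\partial\,(A^{-1}A)}{\partial\,x} \\
&= A^{-1}\frac{\partial\,A}{\partial\,x}+\frac{\partial\,A^{-1}}{\partial\,x}A \\
&= 0
\end{align}
$$
</font>

<font size=4>
&emsp;&emsp;so that
</font>

<font size=4>
$$
\frac{\partial\,A^{-1}}{\partial\,x}A=-A^{-1}\frac{\partial\,A}{\partial\,x}
$$
</font>

<font size=4>
&emsp;&emsp;Right-multiplying both sides by $A^{-1}$ gives
</font>

<font size=4>
$$
\frac{\partial\,A^{-1}}{\partial\,x}=-A^{-1}\frac{\partial\,A}{\partial\,x}A^{-1}
$$
</font>

<font size=4>
&emsp;&emsp;Now take $A=\Sigma$ and $x=\Sigma_{kl}$. The matrix $\partial\Sigma/\partial\Sigma_{kl}$ has a one in position $(k,l)$ and zeros elsewhere, so $\frac{\partial}{\partial\Sigma_{kl}}a^T\Sigma^{-1}a=-(\Sigma^{-1}a)_k(\Sigma^{-1}a)_l$, where we used the symmetry of $\Sigma^{-1}$. Therefore
</font>

<font size=4>
$$
\triangledown_{\Sigma}\,a^T\Sigma^{-1}a=-\Sigma^{-1}aa^T\Sigma^{-1}
$$
</font>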
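
<font size=4>
&emsp;&emsp;The scalar-derivative identity $\frac{\partial A^{-1}}{\partial x}=-A^{-1}\frac{\partial A}{\partial x}A^{-1}$ used in this argument can be checked with a finite difference. In the sketch below, the matrices `A0` and `D` are arbitrary illustrative choices, and the scalar parameter is called `t` to avoid clashing with the data $x$.
</font>

```python
import numpy as np

A0 = np.array([[3.0, 1.0],
               [1.0, 2.0]])
D = np.array([[0.5, -1.0],
              [2.0, 1.0]])        # plays the role of dA/dt

def A(t):
    """A matrix that depends linearly on the scalar t, with A(0) = A0."""
    return A0 + t * D

eps = 1e-6
# Central difference of t -> A(t)^{-1} at t = 0.
numeric = (np.linalg.inv(A(eps)) - np.linalg.inv(A(-eps))) / (2 * eps)
A0_inv = np.linalg.inv(A0)
exact = -A0_inv @ D @ A0_inv
```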

<font size=4>
&emsp;&emsp;It follows that
</font>

<font size=3>
$$
\begin{align}
\triangledown_{\Sigma}\,l(\phi,\mu_0,\mu_1,\Sigma) 
&= \triangledown_{\Sigma}\left( \sum^m_{i=1}(1-y^{(i)})\,log\,p(x^{(i)}\,|\,y^{(i)}=0\,;\mu_0,\Sigma)+\sum^m_{i=1}y^{(i)}\,log\,p(x^{(i)}\,|\,y^{(i)}=1\,;\mu_1,\Sigma) \right) \\
&= \triangledown_{\Sigma}\left( \sum^m_{i=1}(1-y^{(i)})\,log\,\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)}+\sum^m_{i=1}y^{(i)}\,log\,\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x^{(i)}-\mu_1)^T\Sigma^{-1}(x^{(i)}-\mu_1)} \right) \\
&= \triangledown_{\Sigma}\left( \sum^m_{i=1}log\,\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}-\frac{1}{2}\sum^m_{i=1}(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}}) \right) \\
&= \triangledown_{\Sigma}\left( \sum^m_{i=1}(-\frac{n}{2}log\,2\pi-\frac{1}{2}log\,|\Sigma|)-\frac{1}{2}\sum^m_{i=1}(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}}) \right) \\
&= -\frac{m}{2}\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}\left(\sum^m_{i=1}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\right)\Sigma^{-1}
\end{align}
$$
</font>

<font size=4>
&emsp;&emsp;Setting this to zero and solving for $\Sigma$ gives
</font>

<font size=4>
$$
\Sigma = \frac{1}{m}\sum^{m}_{i=1}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T
$$
</font>
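
<font size=4>
&emsp;&emsp;As a numerical check, the gradient with respect to $\Sigma$ should vanish at this estimate. The sketch below writes the gradient in the symmetric form $-\frac{m}{2}\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}S\Sigma^{-1}$, where $S=\sum_i(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T$. The toy data and the names `S`, `Sigma_hat`, and `grad_Sigma` are hypothetical.
</font>

```python
import numpy as np

X = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 2.0],
              [4.0, 3.0], [6.0, 5.0], [5.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
m = X.shape[0]

mu0 = X[y == 0].mean(axis=0)
mu1 = X[y == 1].mean(axis=0)
centered = X - np.where(y[:, None] == 1, mu1, mu0)   # x_i - mu_{y_i}
S = centered.T @ centered                            # sum of outer products
Sigma_hat = S / m                                    # closed-form estimate

def grad_Sigma(Sigma):
    """Gradient of the log-likelihood with respect to Sigma, with mu0 and mu1 fixed."""
    Sigma_inv = np.linalg.inv(Sigma)
    return -0.5 * m * Sigma_inv + 0.5 * Sigma_inv @ S @ Sigma_inv

grad_at_estimate = grad_Sigma(Sigma_hat)
```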

### 1.3.3 Summary

<font size=4>
&emsp;&emsp;Maximizing the likelihood gives the following estimates of the four parameters:
</font>

<font size=4>
$$
\begin{align}
\phi &= \frac{1}{m}\sum^{m}_{i=1}1\{y^{(i)}=1 \} \\
\mu_0 &= \frac{\sum^{m}_{i=1}1\{y^{(i)}=0\}x^{(i)}}{\sum^{m}_{i=1}1\{y^{(i)}=0\}} \\
\mu_1 &= \frac{\sum^{m}_{i=1}1\{y^{(i)}=1\}x^{(i)}}{\sum^{m}_{i=1}1\{y^{(i)}=1\}} \\
\Sigma &= \frac{1}{m}\sum^{m}_{i=1}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T
\end{align}
$$
</font>

## 1.4 Discussion: GDA and logistic regression
