# Overview
In this notebook we establish basic properties and proofs related to the normal distribution family. This includes the general normal distribution as well as the standard normal, which is a special case of the normal distribution.

Note: The normal distribution is also referred to as the Gaussian distribution or the Laplace-Gauss distribution to acknowledge its discoverers.

The normal distribution, and the standard normal in particular, is especially important because of its connection with the central limit theorem (see section 2).

# 1. Univariate Normal Distribution

## 1.1. Overview
## 1.2. Definition

We start with the definition of the univariate Gaussian distribution:

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}
e^{-\frac{1}{2}\left( \frac{x - \mu}{\sigma} \right)^2}  
\tag{1.2.1}
$$

Which can also be stated using the $\exp()$ function rather than the symbol $e$:

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}
\exp \left\{ -\frac{1}{2}\left( \frac{x - \mu}{\sigma} \right)^2  \right\} 
\tag{1.2.2} 
$$

It is also common to reformulate the expression in terms of the variance $\sigma^2$ rather than the standard deviation $\sigma$. This is particularly useful when working in multidimensional spaces, where the variance $\sigma^2$ is replaced by the covariance matrix $\Sigma$.

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}
\exp \left\{ -\frac{1}{2}
\frac{(x - \mu)^2}{\sigma^2}  \right\}  
$$
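The two parameterisations above can be compared numerically. The following is a minimal sketch using only the Python standard library; the function names `normal_pdf` and `normal_pdf_var` are illustrative choices, not part of the notebook, and `statistics.NormalDist` is used only as an independent reference.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, written as in (1.2.1)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_pdf_var(x, mu, var):
    """The same density, parameterised by the variance sigma^2."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


x, mu, sigma = 1.3, 0.5, 2.0
print(normal_pdf(x, mu, sigma))          # standard deviation form
print(normal_pdf_var(x, mu, sigma ** 2))  # variance form
print(NormalDist(mu, sigma).pdf(x))       # reference value from the stdlib
```

All three printed values should agree up to floating point rounding.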

### 1.2.3. Intuition
Below are a few interesting notes on the structure of the equation:

We can think of the leading term $\frac{1}{\sigma\sqrt{2\pi}}$ as a "normalization factor" which ensures that the function integrates to 1 and therefore satisfies the axioms of a probability space.
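The role of the normalization factor can be checked numerically. The sketch below integrates the density with a simple trapezoidal rule; the integration window of ten standard deviations and the step count are arbitrary assumptions, and the helper names are illustrative.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate_pdf(mu, sigma, width=10.0, steps=20000):
    """Trapezoidal rule over [mu - width*sigma, mu + width*sigma]."""
    a = mu - width * sigma
    b = mu + width * sigma
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for k in range(1, steps):
        total += normal_pdf(a + k * h, mu, sigma)
    return total * h


sigma = 2.0
area = integrate_pdf(mu=0.5, sigma=sigma)
# Undo the leading factor to recover the area under the bare exponential.
unnormalised = area * sigma * math.sqrt(2.0 * math.pi)
print(area, unnormalised)
```

The first value should be very close to 1, while the second should be close to $\sigma\sqrt{2\pi}$, which is exactly the quantity the leading term divides out.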

Another interesting observation is that the quadratic expression in the exponential is what gives the distribution its shape. Because the deviation $x - \mu$ is raised to an even power, the exponent is never positive, so the density is strictly positive everywhere and attains its maximum at $x = \mu$. This connection with the quadratic is important and will surface again when the equation is generalized to n dimensions.

<center><img src='images/normal_distribution_quadratic.png' height='400px' width='600px'></center>

## 1.3. Derivation

We can see the makings of the derivation in the proof of the CLT mentioned in section 2 of this notebook.

The original derivation, which yields the standard normal, can be generalized by expressing a variable as a linear combination. This also applies to the derivation of the multivariate normal distribution.

A proof of deriving the general normal distribution function can be found [here](https://web.sonoma.edu/users/w/wilsonst/papers/Normal/default.html).

## 1.4. History

The formula for the normal density function was discovered by Abraham de Moivre in 1738 while he was working on solving a gambling problem. The solution depended on finding the sum of the terms of a binomial distribution. Ultimately de Moivre proved that the binomial distribution converges to the Gaussian (standard normal) distribution. A nice history on the subject can be found [here](https://higherlogicdownload.s3.amazonaws.com/AMSTAT/1484431b-3202-461e-b7e6-ebce10ca8bcd/UploadedImages/Classroom_Activities/HS_2__Origin_of_the_Normal_Curve.pdf). You can also find this information in *A History of Probability and Statistics and Their Applications Before 1750* and the follow-up volume by Anders Hald.


We can see a proof of de Moivre's claim [here](https://noahgolmant.com/writings/derivationsunivariatemultivariate.pdf). 

This discovery laid the foundations for the discovery of the CLT.

**TODO** Something to verify:
> Gauss and the Irish American mathematician Robert Adrain first derived the normal distribution as the only continuous distribution for which the sample mean is the value that maximises what Fisher later called the likelihood function, i.e. the joint probability of the observations considered as a function of the parameter—the actual observations being known. 
>
> https://www.quora.com/How-did-humans-derive-the-normal-distribution


# 2. Central Limit Theorem (CLT)

## 2.1. Overview

## 2.2. Definition
Using the characteristic function of a random variable it can be shown that the standardized sum $S_n$ of $n$ independent, identically distributed (iid) random variables with a common finite mean and variance converges in distribution to a standard normal as $n \to \infty$.

$$ S_n = \sum_{i=1}^{n} X_i $$

$$ \mu = \mathbb{E}[X_i] $$

$$ \sigma^2 = Var[X_i] $$

$$\lim\limits_{n \to \infty}
\mathbb{P}\left( \frac{S_n - n\mu}{\sigma \sqrt{n}} \le c\right)
= \Phi(c)
= \int_{-\infty}^{c} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} dx
\tag{2.2.1}
$$
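The limit in (2.2.1) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the $X_i$ are taken to be Uniform(0, 1) (so $\mu = 1/2$ and $\sigma^2 = 1/12$), and the sample size, number of trials, and seed are arbitrary choices made for illustration.

```python
import math
import random
from statistics import NormalDist


def standardized_sum(n, rng):
    """(S_n - n*mu) / (sigma*sqrt(n)) for n iid Uniform(0, 1) draws."""
    mu = 0.5
    sigma = math.sqrt(1.0 / 12.0)
    s_n = sum(rng.random() for _ in range(n))
    return (s_n - n * mu) / (sigma * math.sqrt(n))


def empirical_cdf_at(c, n=30, trials=20000, seed=0):
    """Fraction of simulated standardized sums that fall at or below c."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if standardized_sum(n, rng) <= c)
    return hits / trials


c = 1.0
estimate = empirical_cdf_at(c)
target = NormalDist().cdf(c)  # Phi(c) for the standard normal
print(estimate, target)
```

The empirical fraction should land close to $\Phi(c)$, and the gap should shrink as `n` and `trials` grow.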

## 2.3. Derivation

The proof relies on the characteristic function, the Fourier transform, and algebra.

The proof can be found [here](https://noahgolmant.com/writings/derivationsunivariatemultivariate.pdf).

## 2.4. History
See the History subsection of section 1.

# 3. Multivariate Normal Distribution

## 3.1. Overview

The multivariate normal (MVN) distribution is a probability distribution that models linear combinations of independent standard normal random variables. In other words, it models a joint random variable (a random vector) built from standard normal random variables.

## 3.2. Definition
There are many definitions of the multivariate normal distribution floating around. To put things simply, an MVN distribution is a symmetric probability distribution which is completely described by its first two moments $\mu$ and $\Sigma$. This translates into two constraints: the distribution functions (pdf/cdf) must satisfy the axioms of a probability space, and the function must contain certain structural elements.

## 3.3. Derivation

There are several methods to derive the multivariate normal distribution that I am aware of.

> To my knowledge, there are two primary approaches to developing the theory of multivariate Gaussian distributions. The first, and by far the most common approach in machine learning textbooks, is to define the multivariate gaussian distribution in terms of its density function, and to derive results by manipulating these density functions. With this approach, a lot of the work turns out to be elaborate matrix algebra calculations happening inside the exponent of the Gaussian density. One issue with this approach is that the multivariate Gaussian density is only defined when the covariance matrix is invertible. To keep the derivations rigorous, some care must be taken to justify that the new covariance matrices we come up with are invertible. For my taste, I find the rigor in our textbooks to be a bit light on these points. We’ve included the proof to Theorem 4 to give a flavor of the details one should add. The second major approach to multivariate Gaussian distributions does not use density functions at all and does not require invertible covariance matrices. This approach is much cleaner and more elegant, but it relies on the theory of characteristic functions and the Cramer-Wold device to get started, and these are beyond the prerequisites for this course. You can often find this development in more advanced probability and statistics books, such as Rao’s excellent Linear Statistical Inference and Its Applications (Chapter 8).
>
> https://davidrosenberg.github.io/mlcourse/in-prep/multivariate-gaussian.pdf


My derivation of the multivariate normal distribution starts by assuming the variables are independent and then matures to abandon that assumption. It also relies on several other prerequisite proofs which we will establish below.

### 3.3.1. Setup Space

We define the objects that our proofs rely on. $A$ denotes the matrix of an affine transform:


$$A = \begin{bmatrix}
a_{i,k}
\end{bmatrix}, \ \text{for} \ i,k \in \{1, 2, \cdots, n\} $$

$$ X = \begin{bmatrix}
X_1 \\
X_2 \\
\vdots \\
X_n
\end{bmatrix}$$

$$ X_i \sim \mathcal{N}(\mu_{X_i}, \sigma_{X_i}^2) $$

$$ Z = \begin{bmatrix}
Z_1 \\
Z_2 \\
\vdots \\
Z_n
\end{bmatrix}$$

$$Z_i \sim \mathcal{N}(0,1)$$

### 3.3.2 Derive Joint Distribution For Independent Univariate Normal Variables

In this proof we will derive the joint density function, and thus the joint distribution, for a set of random variables $X$.



We start by acknowledging that the variables are independent. The joint distribution tells us the probability of a set of variables (in our case two) realizing specific values at the same time.

Using the definition of conditional probability we know that

$$ p(X_1 | X_2) = \frac{p(X_1 \cap X_2)}{p(X_2)}$$

Rearranging gives the joint probability. Independence means that $p(X_1 | X_2) = p(X_1)$, so for this specific case the joint probability can be restated as:

$$ p(X_1 \cap X_2) = p(X_1 | X_2)p(X_2)$$

$$ p(X_1 \cap X_2) = p(X_1)p(X_2)$$

Generalizing this to n dimensions we have:

$$ f_X(X) = p(X_1 \cap X_2 \cap \cdots \cap X_n) = \prod_{i=1}^{n} p(X_i) $$

when $X_i \perp X_j \ \ \forall i \neq j$.

Injecting the univariate normal density equation (1.2.1) we have:

$$ = \prod_{i=1}^{n} 
\frac{1}{\sigma_{X_i}\sqrt{2\pi}}
\exp \left\{ -\frac{1}{2}
\frac{(X_i - \mu_{X_i})^2}{\sigma_{X_i}^2}  \right\}  
$$

$$ = 
\frac{1}{\sqrt{(2\pi)^n}}
\prod_{i=1}^{n} 
\frac{1}{\sigma_{X_i}}
\exp \left\{ -\frac{1}{2}
\frac{(X_i - \mu_{X_i})^2}{\sigma_{X_i}^2}  \right\}  
$$

Removing the exponential term from the product operator using the properties of exponents $(x^ax^b = x^{a+b})$ we have:

$$ = 
\frac{1}{\sqrt{(2\pi)^n}}
\exp 
\left\{ 
-\frac{1}{2}
\frac{(X_1 - \mu_{X_1})^2}{\sigma_{X_1}^2}  
- \cdots
-\frac{1}{2}
\frac{(X_n - \mu_{X_n})^2}{\sigma_{X_n}^2}  
\right\} 
\prod_{i=1}^{n} 
\frac{1}{\sigma_{X_i}} 
$$

Adjusting the exponent in the exponential term to use multi-dimensional matrix notation, with $\Sigma$ the diagonal matrix of the variances $\sigma_{X_i}^2$:

$$ = 
\frac{1}{\sqrt{(2\pi)^n}}
\exp 
\left\{ 
-\frac{1}{2}
(X - \mu_{X})^T\Sigma^{-1}(X - \mu_{X})  
\right\} 
\prod_{i=1}^{n} 
\frac{1}{\sigma_{X_i}} 
$$

We can finally remove the product operator by introducing the covariance matrix and its determinant. Because the variables are independent, the covariance matrix is diagonal and holds the variances $\Sigma_{ii} = \sigma_{X_i}^2$ of the $X_i$. The determinant of a diagonal matrix is the product of its diagonal entries, so $|\Sigma| = \prod \sigma_{X_i}^2$. Taking the square root of the determinant therefore gives the same result as the product operator: $\sqrt{|\Sigma|} = \prod \sigma_{X_i}$.

$$ = 
\frac{1}{\sqrt{(2\pi)^n}}
\exp 
\left\{ 
-\frac{1}{2}
(X - \mu_{X})^T\Sigma^{-1}(X - \mu_{X})  
\right\} 
\frac{1}{\sqrt{|\Sigma|}} 
$$

$$ = 
\frac{1}{\sqrt{(2\pi)^n |\Sigma|}}
\exp 
\left\{ 
-\frac{1}{2}
(X - \mu_{X})^T\Sigma^{-1}(X - \mu_{X})  
\right\}
\tag{3.3.2.1}
$$
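As a sanity check on this derivation, the product of univariate densities can be compared with the matrix form for a diagonal covariance matrix. This is a minimal sketch: the function names and the test point are illustrative assumptions, and the diagonal structure is exploited directly so that no matrix library is needed.

```python
import math


def normal_pdf(x, mu, sigma):
    """Univariate density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def joint_pdf_product(xs, mus, sigmas):
    """Joint density of independent normals as a product of marginals."""
    p = 1.0
    for x, mu, sigma in zip(xs, mus, sigmas):
        p *= normal_pdf(x, mu, sigma)
    return p


def joint_pdf_matrix_form(xs, mus, sigmas):
    """Matrix form (3.3.2.1) with Sigma = diag(sigma_1^2, ..., sigma_n^2)."""
    n = len(xs)
    # Determinant of a diagonal matrix: product of the diagonal entries.
    det_sigma = math.prod(s * s for s in sigmas)
    # (x - mu)^T Sigma^{-1} (x - mu) for diagonal Sigma.
    quad = sum((x - m) ** 2 / (s * s) for x, m, s in zip(xs, mus, sigmas))
    return math.exp(-0.5 * quad) / math.sqrt((2.0 * math.pi) ** n * det_sigma)


xs = [0.2, -1.0, 3.5]
mus = [0.0, -0.5, 3.0]
sigmas = [1.0, 2.0, 0.5]
print(joint_pdf_product(xs, mus, sigmas))
print(joint_pdf_matrix_form(xs, mus, sigmas))
```

Both lines should print the same value up to rounding.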

With this we now have a multivariate density function for mutually independent normal random variables.

If the variables are standard normal such that $Z_i \sim \mathcal{N}(0,1)$ we can reduce the equation even further to see that:

$$ f_Z(Z) = 
\frac{1}{\sqrt{(2\pi)^n}}
\exp 
\left\{ 
-\frac{1}{2}
Z^TZ 
\right\} 
\tag{3.3.2.2}
$$

But the independence assumption we made about the univariate random variables may not be appropriate for our situation, so we may want to extend things a bit further.

### 3.3.3 Derive General Formula for Joint Normal Distribution

In this proof we will further generalize (3.3.2.2).

We will do this using the basic laws of probability and a change of variables to define the general formula in terms of the previously derived standard iid formula.

#### 3.3.3.1. Define Linear combination Of Standard Normal Random Variables

We define a new random variable $Y$ that is a linear transformation of $Z$. Recall that $Z$ defines a joint variable that spans multiple dimensions. Its components $Z_i$ are the mutually independent univariate standard normal random variables.

$$ Y = c_1Z_1 + \cdots + c_nZ_n + d$$

$$ = cZ+d $$

$$ = t(Z) $$

Note: The standard normal case of 3.3.2. corresponds to $c=1$ and $d=0$. See 3.4.1 and 3.4.2 for additional verification.

#### 3.3.3.2. Create An Equality Between Probability Spaces
We know that the probability density functions of $Z$ and $Y$ must each integrate to 1 over their corresponding supports (non-trivial domains) in order to satisfy the axioms of a probability space. 

$$ \underset{S_Z} {\int \cdots \int} f_Z(Z) \ \partial Z = 1$$

$$ \underset{S_Y} {\int \cdots \int} f_Y(Y) \ \partial Y = 1$$


As such we can construct an equality which serves as the backbone of our derivation.

$$ \underset{S_Z} {\int \cdots \int} f_Z(Z) \ \partial Z =  \underset{S_Y} {\int \cdots \int} f_Y(Y) \ \partial Y$$

#### 3.3.3.3. Perform Change Of Variable
We can then perform a change of variable on this equation to restate the left hand side in terms of $Y$ rather than $Z$. The general change of variable formula for multiple integrals is as follows:

$$ \underset{S_Y} {\int \cdots \int} 
f_Z\left[ t^{-1}(Y) \right] |\mathbb{J}_{t^{-1}}|
\ \partial Y 
=
\underset{S_Y} {\int \cdots \int} f_Y(Y) \ \partial Y$$

This is a bit confusing so let's build up some intuition:

Our functions are well behaved, so we can use inversion to restate our transformation using an inverse function.

$$ Y = cZ + d = t(Z) $$

$$\Rightarrow Z = t^{-1}(Y) $$

$$ \Rightarrow t^{-1}(Y) = \frac{Y - d}{c} $$

$$ Z = \frac{Y - d}{c} $$


We can substitute $Z$ with the inverse transform of $t$. This restates the integrand in terms of $Y$, but it does not complete the change of variable.

$$ \underset{S_Z} {\int \cdots \int} f_Z\left[ t^{-1}(Y) \right] \ \partial Z 
=
\underset{S_Y} {\int \cdots \int} f_Y(Y) \ \partial Y$$

$$ \underset{S_Z} {\int \cdots \int} f_Z\left[ \frac{Y - d}{c} \right] \ \partial Z 
=
\underset{S_Y} {\int \cdots \int} f_Y(Y) \ \partial Y$$

Recall that the multiple integral over the support represents an n-dimensional volume. As such we need to convert the volume from one unit/space to another. In other words, we need to modify the integral so that it covers $S_Y$ rather than $S_Z$ without changing its value.

We will see that the reason we need both sides in terms of $Y$ is because we will eventually take the derivative of both sides with respect to $Y$.


We change the bounds of the integral using the Jacobian determinant $|\mathbb{J}_t|$ of our transformation $t$. Essentially we have:

$$ S_Y = |\mathbb{J}_t| S_Z$$

$$ S_Z = |\mathbb{J}_t|^{-1} S_Y $$

$$ |\mathbb{J}_t|^{-1} = |\mathbb{J}_t^{-1}| = |\mathbb{J}_{t^{-1}}|$$

Given $t(Z) = cZ+d$ we have 

$$ |\mathbb{J}_t| = |c| $$


$$ \Rightarrow |\mathbb{J}_t^{-1}| 
= \frac{1}{|\mathbb{J}_t|}
= \begin{vmatrix}\frac{1}{c}\end{vmatrix}
$$

For more information see the [Jacobian Notebook](../Matrix%20Algebra/Jacobian.ipynb).

If we plug this information into the equation we complete the change of variable within the integral and thus have a general expression for the cumulative distribution of $Y$.

$$ \underset{S_Y} {\int \cdots \int} 
f_Z\left[ t^{-1}(Y) \right] |\mathbb{J}_{t^{-1}}|
\ \partial Y 
=
\underset{S_Y} {\int \cdots \int} f_Y(Y) \ \partial Y$$

$$ \underset{S_Y} {\int \cdots \int} 
f_Z\left[ \frac{Y-d}{c} \right] \begin{vmatrix}\frac{1}{c}\end{vmatrix}
\ \partial Y 
=
\underset{S_Y} {\int \cdots \int} f_Y(Y) \ \partial Y$$

#### 3.3.3.4. Differentiate Both Sides
We can now take the derivative of each side which yields our general solution for the probability density function.

$$ f_Y(Y) = f_Z\left[ t^{-1}(Y) \right] |\mathbb{J}_t^{-1}|$$

$$ = f_Z\left[ \frac{Y-d}{c} \right] \begin{vmatrix}\frac{1}{c}\end{vmatrix} $$
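The change of variable result can be checked for the scalar case. The sketch below assumes $Z$ is standard normal and $c \neq 0$; the names `f_z` and `f_y` are illustrative, and `statistics.NormalDist` serves as an independent reference for the density of $Y = cZ + d$.

```python
import math
from statistics import NormalDist


def f_z(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def f_y(y, c, d):
    """Density of Y = c*Z + d via f_Z(t^{-1}(y)) * |1/c|, for c != 0."""
    return f_z((y - d) / c) * abs(1.0 / c)


c, d, y = -3.0, 1.5, 0.2
via_change_of_variable = f_y(y, c, d)
# Y = c*Z + d has mean d and standard deviation |c|.
reference = NormalDist(d, abs(c)).pdf(y)
print(via_change_of_variable, reference)
```

A negative `c` is used on purpose: it shows why the absolute value of the Jacobian determinant is needed to keep the density positive.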

Note: This technique will work for any joint distribution we may observe.

#### 3.3.3.5. Plug In Standard Normal Density Function
Up until this point, nothing required that the variables $Z_i$ be standard normal or independent. We can plug in any distribution at this point. The problem with straying away from a standard normal variable is that the math becomes very complicated. We will look at this in 3.3.3.6.

For now, we can derive a solution specific to the standard normal distribution by plugging in the density function of $Z$ and the Jacobian determinant.

$$ f_Y(Y) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{Y-d}{c}\right)^2}\frac{1}{|c|}$$

$$ = \frac{1}{\sqrt{2\pi}|c|}e^{-\frac{1}{2}\left(\frac{Y-d}{c}\right)^2}$$

$$ = \frac{1}{\sqrt{2\pi |c^2|}}e^{-\frac{1}{2}\left(\frac{Y-d}{c}\right)^2}$$


By looking at the structure of this equation we can now see clearly that this is a normal pdf, as it closely resembles the univariate normal distribution and the multivariate iid distribution derived previously.

To make this extremely clear we can switch notation, let $d$ be the mean of $Y$ such that 

$$d = \mu_Y$$ 

and let $c^2$ be the variance of $Y$ such that 

$$c^2 = \Sigma_Y$$

Plugging this notation into our solution yields:

$$ f_Y(Y) = \frac{1}{\sqrt{2\pi |\Sigma_Y|}}e^{-\frac{1}{2}\left(\frac{Y-\mu_Y}{\sqrt{\Sigma_Y}}\right)^2}$$

$$ = \frac{1}{\sqrt{2\pi |\Sigma_Y|}}e^{-\frac{1}{2} (Y-\mu_Y)^T\Sigma_Y^{-1}(Y-\mu_Y)} \tag{3.3.3.5}$$

#### 3.3.3.6. Define Linear Combination of Non-Standard Normal Variables
As mentioned previously, the usage of the joint standard normal distribution was for mathematical convenience. Below we will continue generalizing by assuming that $Y$ is a linear combination of arbitrary normal random variables rather than standard normal random variables.


Define $Y$ in terms of $X\sim\mathcal{N}(\mu_X,\Sigma_X)$ instead of $Z\sim\mathcal{N}(0,1)$

$$ Y = cX + d $$

Pick up at the 3.3.3.4 equation where we differentiate the inverse transform

$$ f_Y(Y) = f_X\left[ t^{-1}(Y) \right] |\mathbb{J}_t^{-1}|$$

$$ = f_X\left[ \frac{Y-d}{c} \right] \begin{vmatrix}\frac{1}{c}\end{vmatrix} $$

We inject the value of $X= \frac{Y-d}{c}$ into the pdf of $X$ and keep the Jacobian factor $\begin{vmatrix}\frac{1}{c}\end{vmatrix}$

$$ f_Y(Y) = \frac{1}{|c|\sqrt{2\pi |\Sigma_X|}}e^{-\frac{1}{2}\left(\frac{X-\mu_X}{\sqrt{\Sigma_X}}\right)^2}$$

$$ = \frac{1}{|c|\sqrt{2\pi |\Sigma_X|}}e^{-\frac{1}{2}\left(\frac{\left(\frac{Y-d}{c}\right)-\mu_X}{\sqrt{\Sigma_X}}\right)^2}$$

$$ = \frac{1}{|c|\sqrt{2\pi |\Sigma_X|}}
e^{
-\frac{1}{2}
\left(
\frac{\left(\frac{Y-d}{c}\right)-\frac{c\mu_X}{c}}{\sqrt{\Sigma_X}}
\right)^2}$$

$$ = \frac{1}{|c|\sqrt{2\pi |\Sigma_X|}}
e^{
-\frac{1}{2}
\left(
    \frac{Y-d-c\mu_X}{c\sqrt{\Sigma_X}}
\right)^2}$$

From 3.4.1. we know $\mu_Y = c\mu_X + d$ and thus

$$ = \frac{1}{|c|\sqrt{2\pi |\Sigma_X|}}
e^{
-\frac{1}{2}
\left(
    \frac{Y- \mu_Y}{c\sqrt{\Sigma_X}}
\right)^2}$$

From 3.4.2. we know $\Sigma_Y = c^2 \Sigma_X = c\Sigma_Xc^T$ and thus

$$ = \frac{1}{|c|\sqrt{2\pi |\Sigma_X|}}
e^{
-\frac{1}{2}
(Y- \mu_Y)^T\Sigma_Y^{-1}(Y- \mu_Y)
}$$

The same identity lets us absorb the factor $|c|$ into the square root, since $|c|\sqrt{|\Sigma_X|} = \sqrt{|c^2\Sigma_X|}$:

$$ = \frac{1}{\sqrt{2\pi |c^{2}\Sigma_X|}}
e^{
-\frac{1}{2}
(Y- \mu_Y)^T\Sigma_Y^{-1}(Y- \mu_Y)
}$$

$$ = \frac{1}{\sqrt{2\pi |c\Sigma_Xc^T|}}
e^{
-\frac{1}{2}
(Y- \mu_Y)^T\Sigma_Y^{-1}(Y- \mu_Y)
}$$

$$ = \frac{1}{\sqrt{2\pi |\Sigma_Y|}}
e^{
-\frac{1}{2}
(Y- \mu_Y)^T\Sigma_Y^{-1}(Y- \mu_Y)
} \tag{3.3.3.6}$$

## 3.4. Properties

### 3.4.1. Mean of Multivariate Normal

Given $Y = cX + d$ we will see that $\mu_Y = c\mu_X + d$

We start by taking the expectation:

$$ \mathbb{E}[Y] = \mathbb{E}[cX + d] $$

$$ = \mathbb{E}[cX] + \mathbb{E}[d]$$

$$ = c\mathbb{E}[X] + d$$


$$ = c\mu_X + d \tag{3.4.1}$$

### 3.4.2. Variance of Multivariate Normal

Start with the basic variance formulas

$$\mathbb{Var}[Y] = \mathbb{E}[(Y - \mu_Y)^2]$$

$$= \mathbb{E}[Y^2 - 2\mu_Y Y + \mu_Y^2]$$

$$= \mathbb{E}[Y^2] - \mathbb{E}[2\mu_Y Y] + \mathbb{E}[\mu_Y^2]$$

$$= \mathbb{E}[Y^2] - 2\mu_Y\mathbb{E}[ Y] + \mu_Y^2$$

$$= \mathbb{E}[Y^2] - \mu_Y\mathbb{E}[ Y]$$

$$= \mathbb{E}[Y^2] - \mathbb{E}[Y]^2$$

Introduce the linear combination formula

$$= \mathbb{E}[(cX + d)^2] - \mathbb{E}[cX+d]^2$$

$$= \mathbb{E}[(cX + d)(cX + d)] - (c\mu_X+d)^2$$

$$= \mathbb{E}[c^2X^2+2cdX+d^2] - (c\mu_X+d)(c\mu_X+d)$$

$$= \mathbb{E}[c^2X^2]+\mathbb{E}[2cdX]+\mathbb{E}[d^2] - c^2\mu_X^2 - 2cd\mu_X - d^2$$

$$= c^2\mathbb{E}[X^2]+ 2cd\mathbb{E}[X]+ d^2 - c^2\mu_X^2 - 2cd\mu_X - d^2$$

$$= c^2\mathbb{E}[X^2] + 2cd\mu_X + d^2 - c^2\mu_X^2 - 2cd\mu_X - d^2$$

$$= c^2\mathbb{E}[X^2] - c^2\mu_X^2$$

$$= c^2\mathbb{E}[X^2] - c^2\mathbb{E}[X]^2$$

$$= c^2(\mathbb{E}[X^2] - \mathbb{E}[X]^2)$$


$$= c^2 \mathbb{Var}[X] $$

$$= c\Sigma_X c^T \tag{3.4.2}$$
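The results (3.4.1) and (3.4.2) can be checked by simulation for the scalar case. This is a rough sketch: the parameter values, sample size, and seed are arbitrary assumptions, and the helper name `simulate_affine` is illustrative.

```python
import random


def simulate_affine(mu_x, sigma_x, c, d, n=100_000, seed=1):
    """Sample Y = c*X + d with X ~ N(mu_x, sigma_x^2); return mean and variance of Y."""
    rng = random.Random(seed)
    ys = [c * rng.gauss(mu_x, sigma_x) + d for _ in range(n)]
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    return mean, var


mu_x, sigma_x, c, d = 1.0, 2.0, 3.0, -4.0
mean_y, var_y = simulate_affine(mu_x, sigma_x, c, d)

expected_mean = c * mu_x + d          # (3.4.1)
expected_var = c * sigma_x ** 2 * c   # (3.4.2) in the scalar case
print(mean_y, expected_mean)
print(var_y, expected_var)
```

The sample moments should be close to the theoretical ones, with the shift `d` affecting only the mean and the scale `c` entering the variance quadratically.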


### 3.4.3. Other Moments Of Multivariate Normal
The other moments of the multivariate normal random variable can be derived using the moment generating function. For a refresher on moment generating functions see the [notebook on expectations and MGFs](../../Statistics/Expectation.ipynb#Moment-Generation-Function)

Recall that the multivariate normal random variable is defined as $Y=cX + d$. The moment generating function $M_Y(t)$ for $Y$ can be derived as follows, where the factorization of the expectation relies on the independence of the $X_i$:

$$ M_Y(t) = \mathbb{E}[e^{tY}] $$ 

$$ = \mathbb{E}[e^{t(cX + d)}] $$ 

$$ = \mathbb{E}[e^{tc_1X_1 + \cdots + tc_nX_n + td}] $$

$$ = \mathbb{E}[e^{tc_1X_1}]\cdots\mathbb{E}[e^{tc_nX_n}]\mathbb{E}[e^{td}] $$

$$ = M_{X_1}(tc_1)\cdots M_{X_n}(tc_n)e^{td} $$ 

$$ = e^{td}\prod M_{X_i}(tc_i) \tag{3.4.3}$$ 

We can then differentiate this function as needed to find the corresponding moment.
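To illustrate how differentiation of an MGF yields moments, the sketch below applies finite differences at $t = 0$ to the MGF of a univariate normal, $M_X(t) = e^{\mu t + \sigma^2 t^2 / 2}$. Finite differences stand in for symbolic derivatives here, and the step sizes are assumptions chosen for illustration.

```python
import math


def mgf_normal(t, mu, sigma):
    """MGF of N(mu, sigma^2): exp(mu*t + sigma^2 * t^2 / 2)."""
    return math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)


def first_moment(mu, sigma, h=1e-5):
    """Central difference approximation of M'(0) = E[X]."""
    return (mgf_normal(h, mu, sigma) - mgf_normal(-h, mu, sigma)) / (2.0 * h)


def second_moment(mu, sigma, h=1e-4):
    """Central difference approximation of M''(0) = E[X^2]."""
    return (
        mgf_normal(h, mu, sigma)
        - 2.0 * mgf_normal(0.0, mu, sigma)
        + mgf_normal(-h, mu, sigma)
    ) / (h * h)


mu, sigma = 1.5, 2.0
m1 = first_moment(mu, sigma)
m2 = second_moment(mu, sigma)
print(m1, mu)                    # E[X] = mu
print(m2, sigma ** 2 + mu ** 2)  # E[X^2] = sigma^2 + mu^2
```

The variance then follows as `m2 - m1 ** 2`, which should be close to $\sigma^2$.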

More on this subject can be found [here](https://online.stat.psu.edu/stat414/lesson/25/25.2).

### 3.4.4. Linear Combination of Independent Normal Random Variables is Normal

One property that makes the normal distribution extremely tractable from an analytical viewpoint is its closure under linear combinations: the linear combination of two independent normal distributions is a normal distribution. And the linear combination of normal random variables is a normal random variable.

https://www.statlect.com/probability-distributions/normal-distribution-linear-combinations

There are several ways to prove this. A convenient method is to show that the moment generating function of a linear combination has the same form as the moment generating function of a normal distribution.

We start from (3.4.3) and plug in the MGF for a normal random variable

$$ M_X(t) =  e^{\left( \mu t + \frac{\sigma^2 t^2}{2} \right)} $$

$$ M_Y(t) = e^{td}\prod M_{X_i}(tc_i)$$

$$ = e^{td}\prod e^{\left( \mu tc_i + \frac{\sigma^2 (tc_i)^2}{2} \right)}$$

$$ = e^{td}\prod e^{\mu tc_i} e^{\frac{\sigma^2 (tc_i)^2}{2}}$$

$$ = exp\big\{td\big\}\prod exp\big\{ \mu tc_i \big\} exp\Bigg\{\frac{\sigma^2 (tc_i)^2}{2}\Bigg\}$$

$$ = exp\big\{td\big\}exp\big\{ \sum \mu tc_i \big\} exp\Bigg\{\sum \frac{\sigma^2 (tc_i)^2}{2}\Bigg\}$$

$$ = exp\big\{td\big\}exp\big\{ t\sum \mu c_i \big\} exp\Bigg\{\frac{t^2}{2}\sum \sigma^2 c_i^2\Bigg\}$$

$$ = exp\big\{td\big\}exp
\Bigg\{ 
t \left(\sum \mu c_i \right) + \frac{t^2}{2} \left( \sum \sigma^2 c_i^2 \right)
\Bigg\}$$

Based on the structure we see that this equation now looks like $M_X(t)$, but now the mean is $d + \sum \mu c_i$ (after absorbing the factor $e^{td}$ into the exponent) and the variance is $\sum \sigma^2 c_i^2$. If $d = 0$ this becomes even more clear.

Therefore the linear combination $Y$ has the same moment generating function as a univariate normal distribution, just with different parameters. Since the moment generating function determines the distribution, $Y$ is normally distributed. As such we have proved that the linear combination of independent normal random variables is normal.
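The algebra above can be verified numerically by evaluating both sides at a fixed $t$. The sketch assumes iid $X_i \sim \mathcal{N}(\mu, \sigma^2)$ as in the derivation; the coefficients and parameter values are arbitrary illustrations.

```python
import math


def mgf_normal(t, mean, var):
    """MGF of a normal distribution with the given mean and variance."""
    return math.exp(mean * t + 0.5 * var * t * t)


def mgf_linear_combination(t, coeffs, d, mu, sigma):
    """(3.4.3): e^{t d} * prod_i M_X(t * c_i) for iid X_i ~ N(mu, sigma^2)."""
    value = math.exp(t * d)
    for c_i in coeffs:
        value *= mgf_normal(t * c_i, mu, sigma ** 2)
    return value


coeffs, d, mu, sigma = [0.5, -1.0, 2.0], 0.3, 1.2, 0.7
# Parameters read off from the final form of the MGF.
mean_y = d + mu * sum(coeffs)
var_y = sigma ** 2 * sum(c * c for c in coeffs)

t = 0.4
lhs = mgf_linear_combination(t, coeffs, d, mu, sigma)
rhs = mgf_normal(t, mean_y, var_y)
print(lhs, rhs)
```

Both values should match, which is consistent with $Y$ being normal with mean $d + \sum \mu c_i$ and variance $\sum \sigma^2 c_i^2$.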


https://online.stat.psu.edu/stat414/lesson/26/26.1

### 3.4.5 Marginal Distributions Of Joint Normal Are Normal

Previously in (3.3.3) we derived the multivariate normal distribution from a set of normal random variables with normal marginal distributions. Now we show that the marginal distributions of the multivariate normal must also be normal.

Let $C$ be an $(r \times n)$ matrix and let $Y$ be a linear combination of normal random variables $X$.

If $Y=CX$, then we know $Y \sim \mathcal{N}(C\mu, C\Sigma C^T)$ by the proofs for linear combinations shown in (3.4.1) through (3.4.4). From this we can deduce that the marginal distributions of the multivariate normal are also normal.

We can apply a transform that essentially splits the vector $X$ into two pieces. In doing so we can represent $X$ in terms of $X_1$ and $X_2$. Let $X_1$ be an $(r \times 1)$ vector while $X_2$ is an $((n-r) \times 1)$ vector; we would then have:

$$X = \begin{bmatrix}X_1 \\ X_2 \end{bmatrix}$$

We can create a linear transformation $C$ to transform $X$ into its corresponding parts $X_1$ and $X_2$ as follows:

For example, to derive $X_1$, we can multiply $X$ by $C_1=(I, 0)$ where $I$ is the $(r \times r)$ identity matrix, and $0$ is the $(r \times (n-r))$ zero matrix.

By defining $C_1$ in this way, we can say:

$$ X_1 = C_1X $$

We already proved that the linear combination of normal variables is normal. We are now showing that a marginal variable can be represented as a linear combination of a multivariate variable. As such that marginal variable must be normally distributed!
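The selection matrix $C_1 = (I, 0)$ can be made concrete with a small example. The sketch below uses plain nested lists to avoid any dependency; the helper names and the example $\mu$ and $\Sigma$ are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def selection_matrix(r, n):
    """C_1 = (I_r, 0): an r x n matrix that keeps the first r coordinates."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(r)]


mu = [[1.0], [2.0], [3.0]]
sigma = [
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.2],
    [0.5, 0.2, 2.0],
]

c1 = selection_matrix(2, 3)
mu_1 = matmul(c1, mu)                                  # C_1 mu
sigma_11 = matmul(matmul(c1, sigma), transpose(c1))    # C_1 Sigma C_1^T
print(mu_1)
print(sigma_11)
```

The marginal parameters are simply the first `r` entries of $\mu$ and the top-left $r \times r$ block of $\Sigma$.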

### 3.4.6. Closure

Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. Being closed under conditioning and marginalization means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable.

https://distill.pub/2019/visual-exploration-gaussian-processes/

### 3.4.7. Transformation

$$\mathcal{N}(\mu,\sigma^2) = \mu + \sigma \mathcal{N}(0,1)$$
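A standard normal variable maps to $\mathcal{N}(\mu, \sigma^2)$ by scaling with the standard deviation $\sigma$ and shifting by $\mu$. The sketch below illustrates this with `statistics.NormalDist`, which supports multiplication and addition by constants; the parameter values are arbitrary.

```python
from statistics import NormalDist

mu, sigma = 5.0, 2.0
z = NormalDist(0.0, 1.0)

# Scale by the standard deviation, then shift by the mean.
x = z * sigma + mu
print(x.mean, x.stdev)

# The same relation holds for quantiles: x_p = mu + sigma * z_p.
p = 0.9
quantile_direct = NormalDist(mu, sigma).inv_cdf(p)
quantile_via_z = mu + sigma * z.inv_cdf(p)
print(quantile_direct, quantile_via_z)
```

Note that the scale factor is $\sigma$ rather than $\sigma^2$: the variance of $\sigma Z$ is $\sigma^2$.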

### 3.4.8 Conditional Distributions

#### 3.4.8.1 Proof Overview
In this proof we will derive a conditional probability density function. There are many ways this can be done. We will take the simplest approach which relies on previously proven formulas from this notebook.



#### 3.4.8.2 Derive General Formula

We start with the definition of conditional probability which allows us to "convert" a joint probability into a conditional probability:

$$ P(A|B) = \frac{P(A \cap B)}{P(B)} $$

We know that $P(A)$ and $P(B)$ are normal a priori and from (3.4.5). From (3.4.4) we know that $P(A \cap B)$ is normal as it is a linear combination of $n$ normal random variables. As such we know the density functions.


We then apply the formulas for the probability density functions:

$$ f_{A|B} = \frac{f_{A, B}}{f_B}$$

$$ = f_{A, B}^{\ }{f_B^{-1}}$$

#### 3.4.8.3. Take a Shortcut
Rather than do all the complicated matrix algebra, we can take a shortcut to deriving this distribution using some of the proofs we established earlier.

If we think about the random variable $X$ representing $A|B$ we can think about it in terms of a linear combination. With a joint random variable we would simply have $X=A+B$ but because this is conditional we cannot quite say that. Instead we need to modify one of the terms. We know that B must occur but it's uncertain what will occur.

$$ Y = A + TB $$

$$  =
\begin{bmatrix}
1 & T
\end{bmatrix}
\begin{bmatrix}
A \\
B
\end{bmatrix}$$



#### 3.4.8.3 Derive Solution For Conditional Normal Distribution

Assume we have an n-dimensional multivariate normal random variable $X$. Assume we partition our variables into two halves such that

$$
X =
\begin{bmatrix}
A \\
B
\end{bmatrix}
$$

Let $C = A \cap B$



$$ f_{A|B} = f_{C}^{\ }{f_B^{-1}}$$

We can now insert the equations for the pdfs. But first, a quick note on our dimensions. We noted earlier that $A \cap B$ and thus $C$ is n-dimensional. Let $B$ be r-dimensional such that $n>r>0$. We would then have that $A$ is (n-r)-dimensional. With this in mind the pdfs would be:

$$
= 
\frac{1}{ \sqrt{(2\pi)^n |\Sigma_C|}} 
exp \left\{ -\frac{1}{2}(C-\mu_C)^T\Sigma_C^{-1}(C-\mu_C) \right\}
\Big(
\frac{1}{ \sqrt{(2\pi)^r |\Sigma_B|}} 
exp \left\{ -\frac{1}{2}(B-\mu_B)^T\Sigma_B^{-1}(B-\mu_B) \right\}
\Big)^{-1}
$$


$$
= 
\frac{1}{ \sqrt{(2\pi)^n |\Sigma_C|}} 
exp \left\{ -\frac{1}{2}(C-\mu_C)^T\Sigma_C^{-1}(C-\mu_C) \right\}
\sqrt{(2\pi)^r |\Sigma_B|}
exp \left\{ \frac{1}{2}(B-\mu_B)^T\Sigma_B^{-1}(B-\mu_B) \right\}
$$


$$
= 
\frac{\sqrt{(2\pi)^r |\Sigma_B|}}{ \sqrt{(2\pi)^n |\Sigma_C|}} 
exp \left\{ -\frac{1}{2}(C-\mu_C)^T\Sigma_C^{-1}(C-\mu_C) \right\}
exp \left\{ \frac{1}{2}(B-\mu_B)^T\Sigma_B^{-1}(B-\mu_B) \right\}
$$

$$
= 
\frac{\sqrt{(2\pi)^r |\Sigma_B|}}{ \sqrt{(2\pi)^n |\Sigma_C|}} 
exp \left\{ -\frac{1}{2}(C-\mu_C)^T\Sigma_C^{-1}(C-\mu_C)
+ \frac{1}{2}(B-\mu_B)^T\Sigma_B^{-1}(B-\mu_B) \right\}
$$

$$
= 
\frac{\sqrt{ |\Sigma_B|}}{ \sqrt{(2\pi)^{n-r} |\Sigma_C|}} 
exp 
\left\{ 
-\frac{1}{2}
\left(
(C-\mu_C)^T\Sigma_C^{-1}(C-\mu_C)
- (B-\mu_B)^T\Sigma_B^{-1}(B-\mu_B) 
\right)
\right\}
$$

$$
= 
\frac{1}{ \sqrt{(2\pi)^{n-r} |\Sigma_C||\Sigma_B|^{-1}}} 
exp 
\left\{ 
-\frac{1}{2}
\left(
(C-\mu_C)^T\Sigma_C^{-1}(C-\mu_C)
- (B-\mu_B)^T\Sigma_B^{-1}(B-\mu_B) 
\right)
\right\}
$$

We can further simplify the expression by expanding the quadratic $Q$ in the exponential $e^Q$

$$ Q
=
-\frac{1}{2}
\left(
(C-\mu_C)^T\Sigma_C^{-1}(C-\mu_C)
- (B-\mu_B)^T\Sigma_B^{-1}(B-\mu_B) 
\right)
$$

To simplify the expansion we will denote $Q$ as $-\frac{1}{2}(q_1 - q_2)$ and will expand $q_1$ and $q_2$ separately before combining them.

We start with $q_1$

$$ q_1 = (C-\mu_C)^T\Sigma_C^{-1}(C-\mu_C) $$

$$ = 
\begin{bmatrix}
A - \mu_A & B - \mu_B
\end{bmatrix}
\Sigma_{A,B}^{-1}
\begin{bmatrix}
A - \mu_A \\
B - \mu_B
\end{bmatrix}
$$

$$ = 
\begin{bmatrix}
A - \mu_A & B - \mu_B
\end{bmatrix}
\begin{bmatrix}
\Sigma_{A,A} & \Sigma_{A,B} \\
\Sigma_{B,A} & \Sigma_{B,B}
\end{bmatrix}^{-1}
\begin{bmatrix}
A - \mu_A \\
B - \mu_B
\end{bmatrix}
$$

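Here we treat the entries $\Sigma_{A,A}^{-1}, \Sigma_{A,B}^{-1}, \ldots$ as shorthand for the corresponding blocks of the inverse matrix. Note that the inverse of a block matrix is in general not the matrix of the inverted blocks, so reading them as literal block inverses is exact only when $\Sigma_{A,B} = 0$.

$$ = 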
\begin{bmatrix}
A - \mu_A & B - \mu_B
\end{bmatrix}
\begin{bmatrix}
\Sigma_{A,A}^{-1} & \Sigma_{A,B}^{-1} \\
\Sigma_{B,A}^{-1} & \Sigma_{B,B}^{-1}
\end{bmatrix}
\begin{bmatrix}
A - \mu_A \\
B - \mu_B
\end{bmatrix}
$$

$$ = 
\begin{bmatrix}
(A - \mu_A)\Sigma_{A,A}^{-1} + (B - \mu_B)\Sigma_{B,A}^{-1}
& 
(A - \mu_A)\Sigma_{A,B}^{-1} + (B - \mu_B)\Sigma_{B,B}^{-1}
\end{bmatrix}
\begin{bmatrix}
A - \mu_A \\
B - \mu_B
\end{bmatrix}
$$

$$ = 
(A - \mu_A)\Sigma_{A,A}^{-1}(A - \mu_A) \\ 
\ \ \ + (B - \mu_B)\Sigma_{B,A}^{-1}(A - \mu_A) \\
\ \ \ + (A - \mu_A)\Sigma_{A,B}^{-1}(B - \mu_B) \\
\ \ \ + (B - \mu_B)\Sigma_{B,B}^{-1}(B - \mu_B)
$$

Next we do $q_2$

$$ q_2 = (B-\mu_B)^T\Sigma_B^{-1}(B-\mu_B) $$

Note that $\Sigma_B^{-1}= \Sigma_{BB}^{-1}$

So $q_1 - q_2$ is thus:

$$ = 
(A - \mu_A)\Sigma_{A,A}^{-1}(A - \mu_A) \\ 
\ \ \ \ \ \ + (B - \mu_B)\Sigma_{B,A}^{-1}(A - \mu_A) \\
\ \ \ \ \ \ + (A - \mu_A)\Sigma_{A,B}^{-1}(B - \mu_B) \\
\ \ \ \ \ \ + (B - \mu_B)\Sigma_{B,B}^{-1}(B - \mu_B) \\
\ \ \ \ \ \ - (B - \mu_B)\Sigma_{B,B}^{-1}(B - \mu_B)
$$

$$ = 
(A - \mu_A)\Sigma_{A,A}^{-1}(A - \mu_A)  + (B - \mu_B)\Sigma_{B,A}^{-1}(A - \mu_A)  + (A - \mu_A)\Sigma_{A,B}^{-1}(B - \mu_B) $$

Putting this in matrix form we see that we again have a quadratic

$$
\begin{bmatrix}
A - \mu_A &
B - \mu_B
\end{bmatrix}
\begin{bmatrix}
\Sigma^{-1}_{AA} & \Sigma^{-1}_{AB}\\
\Sigma^{-1}_{BA} & 0
\end{bmatrix}
\begin{bmatrix}
A - \mu_A \\
B - \mu_B
\end{bmatrix}
$$

$$
\begin{bmatrix}
A - \mu_A \\
B - \mu_B
\end{bmatrix}^T
\begin{bmatrix}
\Sigma^{-1}_{AA} & \Sigma^{-1}_{AB}\\
\Sigma^{-1}_{BA} & 0
\end{bmatrix}
\begin{bmatrix}
A - \mu_A \\
B - \mu_B
\end{bmatrix}
$$

$$ (D - \mu_D)^T \Sigma_D^{-1} (D- \mu_D) $$

Now plug this information back into the equation

$$
= 
\frac{1}{ \sqrt{(2\pi)^{n-r} |\Sigma_C||\Sigma_B|^{-1}}} 
exp 
\left\{ 
Q
\right\}
$$

We can work on the determinants in the denominator of the scaling term

$$
\frac{1}{\sqrt{|\Sigma_C||\Sigma_B|^{-1}}}
$$


$$
\begin{vmatrix}
\Sigma_{A,A} & \Sigma_{A,B} \\
\Sigma_{B,A} & \Sigma_{B,B}
\end{vmatrix}
|\Sigma_B|^{-1}
$$

$$
= 
\frac{1}{ \sqrt{(2\pi)^{n-r} |\Sigma_C||\Sigma_B|^{-1}}} 
exp 
\left\{ 
    -\frac{1}{2}
    \begin{bmatrix}
    C - \mu_C \\
    B - \mu_B
    \end{bmatrix}
    \begin{bmatrix}
    \Sigma_C & 0 \\
    0 & \Sigma_B
    \end{bmatrix}^{-1}
    \begin{bmatrix}
    C - \mu_C \\
    B - \mu_B
    \end{bmatrix}^T
\right\}
$$

We can see that this equation takes on the form of a normal distribution given its structure. We can make this clear by grouping terms together. For this arbitrary operation we will use the term $D$ to denote this group. Making the transformation the equation is such that:

$$
D =
\begin{bmatrix}
C \\
B
\end{bmatrix}
$$

$$
\mu_D=    
\begin{bmatrix}
\mu_C \\
\mu_B
\end{bmatrix}
$$

$$
\Sigma_D =
\begin{bmatrix}
\Sigma_C & 0 \\
0 & \Sigma_B
\end{bmatrix}
$$


And so our equation can be reformulated as:

$$
= 
\frac{1}{ \sqrt{(2\pi)^{n-r} |\Sigma_D|^{-1}}} 
exp 
\left\{ 
    -\frac{1}{2}
    (D - \mu_D)^T \Sigma_D^{-1} (D - \mu_D)
\right\}
$$

We see that $D$ is a vector that was created by stacking two other vectors, $C = A \cap B$ and $B$, which are n- and r-dimensional respectively. The mean $\mu_D$ and covariance $\Sigma_D$ have matching dimensions.

We know D is normal as it is a linear combination of normal random variables.

#### 3.4.8.4. Derive the mean

$$
\mu_D=    
\begin{bmatrix}
\mu_C \\
\mu_B
\end{bmatrix}
=
\begin{bmatrix}
\mu_A \\
\mu_B \\
\mu_B
\end{bmatrix}
$$

We can represent $C$ 

#### Von Mises

The basic proof will construct a linear combination $Y$ from a set of normal random variables $X$ such that:

$$ Y = CX $$

It will then partition the set of variables and the linear combination into two halves such that:

$$ \begin{bmatrix}
Y_1 \\
Y_2 \end{bmatrix}
= C \begin{bmatrix}
X_1 \\
X_2 \end{bmatrix}
$$

We will prove that if $X \sim \mathcal{N}(\mu, \Sigma)$ then $X_1 | X_2 \sim \mathcal{N}$ with mean $\mu_{X_1|X_2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}[X_2 - \mu_2]$ and variance $\Sigma_{X_1|X_2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.

#### 3.4.8.2 Define Assumptions
We define a relationship between the two partitions such that $Y_1$ depends on both partitions of $X$ while $Y_2$ depends only on partition 2:

$$Y_1 = X_1 - \Sigma_{12} \Sigma_{22}^{-1} X_{2}$$
$$ Y_2 = X_2$$

The relationships between $X$ and $Y$ and partition 1 and 2 can be expressed by defining the linear transformation in matrix form with $C$ such that:

$$C =
\begin{bmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{bmatrix}
$$

The relationship then becomes:

$$
Y = \begin{bmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{bmatrix} 
X
$$

Where $I$ is the identity matrix sized to fit the dimensions of the two partitions.

We can see that, similar to the proof involving linear combinations, $Y$ is a non-singular linear transform of $X$ and is therefore normal. Additionally, we proved that if $Y = CX$ then $\mu_Y = C\mu$ and $\Sigma_Y = C \Sigma C^T$.



#### Derive the mean
So we can deduce the mean

$$ \mathbb{E}[Y] = C \mathbb{E}[X]$$

$$
\begin{bmatrix}
\mathbb{E}[Y_1] \\
\mathbb{E}[Y_2] \end{bmatrix}
=
\begin{bmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{bmatrix} 
\begin{bmatrix}
\mathbb{E}[X_1] \\
\mathbb{E}[X_2] \end{bmatrix}
$$

$$
\begin{bmatrix}
\mathbb{E}[Y_1] \\
\mathbb{E}[Y_2] \end{bmatrix}
=
\begin{bmatrix}
\mathbb{E}[X_1] -\Sigma_{12}\Sigma_{22}^{-1}\mathbb{E}[X_2] \\
\mathbb{E}[X_2]
\end{bmatrix} 
$$

Which yields:

$$ \mathbb{E}[Y_1] = \mathbb{E}[X_1] -\Sigma_{12}\Sigma_{22}^{-1}\mathbb{E}[X_2]$$
$$ \mathbb{E}[Y_2] = \mathbb{E}[X_2]$$

For compactness:

$$ \mu_{Y_1} = \mu_{X_1} - \Sigma_{12} \Sigma_{22}^{-1} \mu_{X_2}$$
$$ \mu_{Y_2} = \mu_{X_2}$$

If we subtract these means from our original definitions we see:

$$Y_1 = X_1 - \Sigma_{12} \Sigma_{22}^{-1} X_{2}$$
$$ \Rightarrow Y_1 - \mu_{Y_1} = (X_1 - \mu_{X_1}) - \Sigma_{12} \Sigma_{22}^{-1} (X_{2} - \mu_{X_2})$$


#### Derive the variance
We calculate the variance of $Y$:

$$\Sigma_Y = \mathbb{E}[(Y - \mu_Y)(Y - \mu_Y)^T] $$

$$= \begin{bmatrix}
\Sigma_{Y_1 Y_1} & \Sigma_{Y_1 Y_2}\\
\Sigma_{Y_2 Y_1} & \Sigma_{Y_2 Y_2}
\end{bmatrix}$$

$$= \mathbb{E}\begin{bmatrix}
(Y_1 - \mu_{Y_1})(Y_1 - \mu_{Y_1})^T & (Y_1 - \mu_{Y_1})(Y_2 - \mu_{Y_2})^T \\
(Y_2 - \mu_{Y_2})(Y_1 - \mu_{Y_1})^T & (Y_2 - \mu_{Y_2})(Y_2 - \mu_{Y_2})^T
\end{bmatrix}$$

We start with the top left block $\Sigma_{Y_1 Y_1}$ (written this way to distinguish it from $\Sigma_{11}$, the covariance of $X_1$):

$$ \Sigma_{Y_1 Y_1} = \mathbb{E}[(Y_1 - \mu_{Y_1})(Y_1 - \mu_{Y_1})^T] $$

We enforce the conditions $Y_1 = X_1-\Sigma_{12}\Sigma_{22}^{-1} X_{2}$ and $\mu_{Y_1} = \mu_{X_1} - \Sigma_{12} \Sigma_{22}^{-1} \mu_{X_2}$ on a single term

$$ Y_1 - \mu_{Y_1}$$

$$ = (X_1-\Sigma_{12}\Sigma_{22}^{-1} X_{2})-(\mu_{X_1} - \Sigma_{12} \Sigma_{22}^{-1} \mu_{X_2})$$

$$ = X_1-\Sigma_{12}\Sigma_{22}^{-1} X_{2} - \mu_{X_1} + \Sigma_{12} \Sigma_{22}^{-1} \mu_{X_2}$$

$$ = X_1 - \mu_{X_1} -\Sigma_{12}\Sigma_{22}^{-1} X_{2} + \Sigma_{12} \Sigma_{22}^{-1} \mu_{X_2}$$

$$ = X_1 - \mu_{X_1} -\Sigma_{12}\Sigma_{22}^{-1}( X_{2} - \mu_{X_2})$$

We then apply it to both terms

$$ \mathbb{E}\left[ (Y_1 - \mu_{Y_1})(Y_1 - \mu_{Y_1})^T \right]$$

$$ = \mathbb{E}\left[ 
\left( X_1 - \mu_{X_1} -\Sigma_{12}\Sigma_{22}^{-1}( X_{2} - \mu_{X_2}) \right)
\left( X_1 - \mu_{X_1} -\Sigma_{12}\Sigma_{22}^{-1}( X_{2} - \mu_{X_2}) \right)^T
\right]
$$

$$ = \mathbb{E}\left[
\left( (X_1 - \mu_{X_1}) -\Sigma_{12}\Sigma_{22}^{-1}( X_{2} - \mu_{X_2}) \right)
\left( (X_1 - \mu_{X_1}) -\Sigma_{12}\Sigma_{22}^{-1}( X_{2} - \mu_{X_2}) \right)^T
\right]
$$



$$ = \mathbb{E}\left[
(X_1 - \mu_{X_1})(X_1 - \mu_{X_1})^T
- (X_1 - \mu_{X_1})( X_{2} - \mu_{X_2})^T(\Sigma_{12}\Sigma_{22}^{-1})^T
- \Sigma_{12}\Sigma_{22}^{-1}( X_{2} - \mu_{X_2})(X_1 - \mu_{X_1})^T
+ \Sigma_{12}\Sigma_{22}^{-1}( X_{2} - \mu_{X_2})( X_{2} - \mu_{X_2})^T
(\Sigma_{12}\Sigma_{22}^{-1})^T
\right]
$$

Because $\Sigma_{22}$ is symmetric and $\Sigma_{12}^T = \Sigma_{21}$, we have $(\Sigma_{12}\Sigma_{22}^{-1})^T = \Sigma_{22}^{-1}\Sigma_{21}$:

$$ =
\mathbb{E}\left[
(X_1 - \mu_{X_1})(X_1 - \mu_{X_1})^T
- (X_1 - \mu_{X_1})( X_{2} - \mu_{X_2})^T\Sigma_{22}^{-1}\Sigma_{21}
- \Sigma_{12}\Sigma_{22}^{-1}( X_{2} - \mu_{X_2})(X_1 - \mu_{X_1})^T
+ \Sigma_{12}\Sigma_{22}^{-1}( X_{2} - \mu_{X_2})( X_{2} - \mu_{X_2})^T
\Sigma_{22}^{-1}\Sigma_{21}
\right]
$$

Taking the expectation term by term, with $\mathbb{E}[(X_i - \mu_{X_i})(X_j - \mu_{X_j})^T] = \Sigma_{ij}$:

$$ = 
\Sigma_{11}
- \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
- \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
+ \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22}\Sigma_{22}^{-1}\Sigma_{21}
$$

$$ = 
\Sigma_{11}
- 2\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
+ \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
$$

$$ = 
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
$$

If we continue applying this technique we will derive the variance of Y:

$$ \Sigma_Y =
\begin{bmatrix}
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\
0 & \Sigma_{22}
\end{bmatrix}
$$
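
The block-diagonal result can be sanity-checked numerically. The sketch below is a minimal illustration for the bivariate case only, where $X_1$ and $X_2$ are scalars and every block of $\Sigma$ is a single number. The covariance values and the helper names `matmul` and `transpose` are assumptions chosen for illustration, not part of the proof.

```python
# Bivariate sketch: X_1 and X_2 are scalars, so each block of Sigma is a number.
# The covariance values are illustrative assumptions, not taken from the text.

def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    """Transpose a matrix stored as nested lists."""
    return [list(row) for row in zip(*A)]

s11, s12, s22 = 4.0, 1.2, 2.0
Sigma = [[s11, s12],
         [s12, s22]]  # symmetric, so Sigma_21 equals Sigma_12

# C = [[I, -Sigma_12 Sigma_22^{-1}], [0, I]] with scalar blocks
C = [[1.0, -s12 / s22],
     [0.0, 1.0]]

# Sigma_Y = C Sigma C^T
Sigma_Y = matmul(matmul(C, Sigma), transpose(C))

# Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21 for scalar blocks
schur = s11 - s12 * (1.0 / s22) * s12

print("Sigma_Y:", Sigma_Y)
print("Schur complement:", schur)
```

If the derivation holds, the off-diagonal entries of `Sigma_Y` should vanish (up to floating point error), the top left entry should match `schur`, and the bottom right entry should remain `s22`.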



The covariance matrix tells us that $Y_1$ and $Y_2$ are uncorrelated because of the zero blocks in the top right and bottom left corners. Because $Y_1$ and $Y_2$ are jointly normal, being uncorrelated implies that they are independent: $Y_1 \perp Y_2$.

We also see that $\mu_{Y_2} = \mu_{X_2}$ and $\Sigma_{Y_2} = \Sigma_{22}$ (as shown above), which is consistent with the definition $Y_2 = X_2$.

Previously we proved that the joint density of two independent variables is the product $f(x)f(y)$ and that the marginal distributions of a joint normal distribution are normal. Having derived the moments of $Y_1$ and $Y_2$, we can now plug these parameters into the normal density functions to obtain the joint distribution of our variables:

$$ \mathcal{N}_{Y1, Y2} = 
\mathcal{N}_{Y_1}
\mathcal{N}_{Y_2} $$

$$ =
\mathcal{N}(\mu_{Y_1}, \Sigma_{Y_1})
\mathcal{N}(\mu_{Y_2}, \Sigma_{Y_2})
$$

$$ =
\mathcal{N}(\mu_{X_1} - \Sigma_{12} \Sigma_{22}^{-1} \mu_{X_2}, \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})
\mathcal{N}(\mu_{X_2}, \Sigma_{22})
$$

We can also derive the joint distribution of $X_1$ and $X_2$ from the joint distribution of $Y_1$ and $Y_2$ by performing a change of variable and multiplying by the Jacobian determinant (which is conveniently equal to 1).
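
As a quick check of the claim about the Jacobian, here is a short worked step using only the matrix $C$ defined earlier in this proof. Since $Y = CX$, the Jacobian of the change of variable is $C$ itself. $C$ is block triangular with identity blocks on its diagonal, so its determinant is the product of the determinants of those diagonal blocks:

$$ |C| =
\begin{vmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{vmatrix}
= |I||I| = 1
$$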

We start with the basic definition of the joint distribution:

$$\mathcal{N}(\mu, \Sigma) := 
\frac{1}{ \sqrt{(2\pi)^k |\Sigma|}} 
exp \left\{ -\frac{1}{2}(X-\mu)^T\Sigma^{-1}(X-\mu) \right\}$$

### 3.4.8 Conditional Distribution
#### 3.4.8.1. Proof Overview

#### 3.4.8.2. Partition A Joint Normal Random Variable

We start with a joint random variable following the multivariate normal distribution

$$ X \sim \mathcal{N}(\mu, \Sigma) $$

We then partition the dimensions of $X$ into two separate variables such that:

$$ X = \begin{bmatrix}X_1 \\ X_2 \end{bmatrix}$$

Now we want to find the conditional distribution of $X_1$ given $X_2$. We can apply the definition of conditional probability:

$$ P(X_1 | X_2) = \frac{P(X_1 \cap X_2)}{P(X_2)} $$

Rearranging this equation we then have:

$$ P(X_1 \cap X_2) = P(X_1|X_2)P(X_2)$$

So if we take the joint normal density function, which yields the value of $P(X_1 \cap X_2)$, we should be able to manipulate it until we extract the marginal density function for $X_2$ (which yields $P(X_2)$). Whatever factor remains must then be the conditional density $P(X_1|X_2)$.

Inserting the equation we have:

$$ P(X) = P(X_1 \cap X_2)$$

$$ =
\frac{1}{ \sqrt{(2\pi)^n |\Sigma_X|}} 
exp \left\{ -\frac{1}{2}(X-\mu_X)^T\Sigma_X^{-1}(X-\mu_X) \right\}
$$

Applying the partition we have:

$$ =
\frac{1}{ \sqrt{(2\pi)^n 
\begin{vmatrix}
\Sigma_{11} & \Sigma_{12} \\
\Sigma_{21} & \Sigma_{22}
\end{vmatrix}
}} 
exp 
\left\{ 
-\frac{1}{2}
\begin{bmatrix}
    X_1 - \mu_{X_1} \\
    X_2 - \mu_{X_2}
\end{bmatrix}^T
\begin{bmatrix}
\Sigma_{11} & \Sigma_{12} \\
\Sigma_{21} & \Sigma_{22}
\end{bmatrix}^{-1}
\begin{bmatrix}
    X_1 - \mu_{X_1} \\
    X_2 - \mu_{X_2}
\end{bmatrix}
\right\}
$$

#### 3.4.8.3. Diagonalize The Covariance Matrix

The covariance matrix appears in two places within the joint density function: as a determinant in the normalizing term and as an inverse in the exponential. This is the key to separating our equation into two equations.

The algebra to separate the equations is non-trivial. We will diagonalize the inverse covariance matrix $\Sigma_X^{-1}$. In doing so we will conveniently separate the equation in such a way that the variance $\Sigma_{22}$ of $X_2$ can be isolated. We need it to be isolated because we need to arrange the equation so that a portion of it looks like the marginal distribution of $X_2$, which requires $\Sigma_{22}$ as a parameter.

We begin diagonalizing the covariance matrix by eliminating the top right block:

$$
\begin{bmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{bmatrix}
\begin{bmatrix}
\Sigma_{11} & \Sigma_{12} \\
\Sigma_{21} & \Sigma_{22}
\end{bmatrix}
=
\begin{bmatrix}
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\
\Sigma_{21} & \Sigma_{22}
\end{bmatrix}
$$

$$
\begin{bmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{bmatrix}
\begin{bmatrix}
\Sigma_{11} & \Sigma_{12} \\
\Sigma_{21} & \Sigma_{22}
\end{bmatrix}
\begin{bmatrix}
I & 0 \\
-\Sigma_{22}^{-1}\Sigma_{21} & I
\end{bmatrix}
=
\begin{bmatrix}
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\
0 & \Sigma_{22}
\end{bmatrix}
$$

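Using [LDU decomposition](../../Matrix%20Algebra/Diagonalization.ipynb#2.%20LDU-Decomposition) and [Schur complements](../../Matrix%20Algebra/Schur%20Compliments.ipynb) we can write the result above as follows, where $L^{-1}$ and $U^{-1}$ are the triangular matrices applied on the left and right and $D$ is the resulting block-diagonal matrix: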

$$\Sigma = LDU$$
$$L^{-1}\Sigma U^{-1} = D$$
$$ S_{\Sigma / \Sigma_{22}} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
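
The same factorization also handles the determinant in the normalizing term of the density. As a sketch, assuming $L$ and $U$ are the block-triangular factors with identity blocks on their diagonals (so that $|L| = |U| = 1$) and $D$ is block diagonal:

$$ |\Sigma| = |L||D||U| = |D| = |S_{\Sigma / \Sigma_{22}}||\Sigma_{22}| $$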

Using the [properties of the inverse of a matrix product](../../Matrix%20Algebra/Matrix%20Division%20(Inversion).ipynb#5.1.2.%20Inverse-Product) we can also assert that:

$$ \Sigma^{-1} = (LDU)^{-1} = U^{-1}D^{-1}L^{-1} $$

Putting all this information together we can derive the inverse covariance matrix

$$\Sigma^{-1} =
\begin{bmatrix}
I & 0 \\
-\Sigma_{22}^{-1}\Sigma_{21} & I
\end{bmatrix}
\begin{bmatrix}
S_{\Sigma / \Sigma_{22}}^{-1} & 0 \\
0 & \Sigma_{22}^{-1}
\end{bmatrix}
\begin{bmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{bmatrix}
$$
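
The factored form of the inverse can be compared against a direct inverse. The sketch below does this for the bivariate case only (scalar blocks), with the inverse of the Schur complement as the top left entry of the middle factor. The covariance values and the helper `matmul` are illustrative assumptions.

```python
# Bivariate sketch: compare U^{-1} D^{-1} L^{-1} with the closed-form 2x2 inverse.
# The covariance values are illustrative assumptions, not taken from the text.

def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

s11, s12, s22 = 4.0, 1.2, 2.0

# Schur complement of Sigma_22 in Sigma (a scalar here)
schur = s11 - s12 * s12 / s22

U_inv = [[1.0, 0.0],
         [-s12 / s22, 1.0]]      # [[I, 0], [-Sigma_22^{-1} Sigma_21, I]]
D_inv = [[1.0 / schur, 0.0],
         [0.0, 1.0 / s22]]       # block-diagonal middle factor
L_inv = [[1.0, -s12 / s22],
         [0.0, 1.0]]             # [[I, -Sigma_12 Sigma_22^{-1}], [0, I]]

Sigma_inv_factored = matmul(matmul(U_inv, D_inv), L_inv)

# Direct inverse of a 2x2 matrix: adjugate divided by determinant
det = s11 * s22 - s12 * s12
Sigma_inv_direct = [[s22 / det, -s12 / det],
                    [-s12 / det, s11 / det]]

print("factored:", Sigma_inv_factored)
print("direct:  ", Sigma_inv_direct)
```

The two printed matrices should agree entry by entry up to floating point error.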

We can confirm that this is the inverse of sigma, as that is what appears in our equation:

$$ \Sigma^{-1} = (LDU)^{-1} $$

$$ \Sigma^{-1} = U^{-1}D^{-1}L^{-1} $$

$$ U\Sigma^{-1}L = UU^{-1}D^{-1}L^{-1}L $$

$$ U\Sigma^{-1}L = ID^{-1}I $$

$$ U\Sigma^{-1}L = D^{-1} $$

So we derive a formula for the inverse:

$$
\Sigma^{-1} =
\begin{bmatrix}
I & 0 \\
-\Sigma_{22}^{-1}\Sigma_{21} & I
\end{bmatrix}
\begin{bmatrix}
S_{\Sigma / \Sigma_{22}}^{-1} & 0 \\
0 & \Sigma_{22}^{-1}
\end{bmatrix}
\begin{bmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{bmatrix}
$$

#### 3.4.8.4. Apply To Exponential

Introducing this decomposition into the exponential of the joint density function we then have:

$$ 
exp 
\left\{ 
-\frac{1}{2}
\begin{bmatrix}
    X_1 - \mu_{X_1} \\
    X_2 - \mu_{X_2}
\end{bmatrix}^T
\begin{bmatrix}
\Sigma_{11} & \Sigma_{12} \\
\Sigma_{21} & \Sigma_{22}
\end{bmatrix}^{-1}
\begin{bmatrix}
    X_1 - \mu_{X_1} \\
    X_2 - \mu_{X_2}
\end{bmatrix} 
\right\}
$$

$$ =
exp 
\left\{ 
-\frac{1}{2}
\begin{bmatrix}
    X_1 - \mu_{X_1} \\
    X_2 - \mu_{X_2}
\end{bmatrix}^T
\begin{bmatrix}
I & 0 \\
-\Sigma_{22}^{-1}\Sigma_{21} & I
\end{bmatrix}
\begin{bmatrix}
S_{\Sigma/\Sigma_{22}}^{-1} & 0 \\
0 & \Sigma_{22}^{-1}
\end{bmatrix}
\begin{bmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{bmatrix}
\begin{bmatrix}
    X_1 - \mu_{X_1} \\
    X_2 - \mu_{X_2}
\end{bmatrix}
\right\}
$$

We then carry out the tedious matrix algebra to expand the expression, starting with the product of the transposed vector and the first triangular matrix:

$$ =
exp 
\left\{ 
-\frac{1}{2}
\begin{bmatrix}
X_1 - \mu_{X_1} - \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_{X_2}) \\
X_2 - \mu_{X_2}
\end{bmatrix}^T
\begin{bmatrix}
S_{\Sigma/\Sigma_{22}}^{-1} & 0 \\
0 & \Sigma_{22}^{-1}
\end{bmatrix}
\begin{bmatrix}
I & -\Sigma_{12}\Sigma_{22}^{-1} \\
0 & I
\end{bmatrix}
\begin{bmatrix}
    X_1 - \mu_{X_1} \\
    X_2 - \mu_{X_2}
\end{bmatrix}
\right\}
$$

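Applying the last triangular matrix to the right-hand vector in the same way and multiplying through the block-diagonal matrix separates the exponent into two quadratic forms:

$$ =
exp 
\left\{ 
-\frac{1}{2}
\left( X_1 - \mu_{X_1} - \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_{X_2}) \right)^T
S_{\Sigma/\Sigma_{22}}^{-1}
\left( X_1 - \mu_{X_1} - \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_{X_2}) \right)
-\frac{1}{2}
(X_2 - \mu_{X_2})^T \Sigma_{22}^{-1} (X_2 - \mu_{X_2})
\right\}
$$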
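
The factorization this section works toward, $P(X_1 \cap X_2) = P(X_1|X_2)P(X_2)$, can be checked numerically. The sketch below is a minimal bivariate illustration: it evaluates the joint density at one point and compares it with the product of a conditional normal density (mean $\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$, variance $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$) and the marginal density of $X_2$. The parameter values, the evaluation point and the function names `normal_pdf` and `bivariate_pdf` are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bivariate_pdf(x1, x2, mu1, mu2, s11, s12, s22):
    """Bivariate normal density using the explicit 2x2 inverse and determinant."""
    det = s11 * s22 - s12 * s12
    d1, d2 = x1 - mu1, x2 - mu2
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) written out for the 2x2 case
    q = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

# Illustrative parameters and evaluation point (assumptions, not from the text)
mu1, mu2 = 1.0, -0.5
s11, s12, s22 = 4.0, 1.2, 2.0
x1, x2 = 0.3, 0.8

# Conditional moments of X_1 given X_2 = x2
cond_mean = mu1 + s12 / s22 * (x2 - mu2)
cond_var = s11 - s12 * s12 / s22

joint = bivariate_pdf(x1, x2, mu1, mu2, s11, s12, s22)
factored = normal_pdf(x1, cond_mean, cond_var) * normal_pdf(x2, mu2, s22)

print("joint density:        ", joint)
print("conditional * marginal:", factored)
```

The two printed values should agree up to floating point error. A single evaluation point is only a spot check, not a proof.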

https://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall07/reading/gauss.pdf

von Mises, Richard (1964). Mathematical theory of probability and statistics. Chapter VIII.9.3. Academic Press.

## 3.4. History

p432

# 4. Gaussian Distribution

The standard normal distribution is a normal (Gaussian) distribution whose mean is zero and whose standard deviation is equal to one.

$$ \mathcal{G} \sim \mathcal{N}(0,1) $$

# 5. Switching between joint and conditional probability
I think that if we switch between parameters we can change the distributions, etc.

# 6. References
- http://noahgolmant.com/writings/derivationsunivariatemultivariate.pdf