## Joint distribution

- Discrete joint distribution
    - X and Y are discrete random variables
    - $Pr(X=x, Y=y) = \rho_{XY}(x,y)$
    - If X and Y are independent, $Pr(X=x, Y=y) = \rho_{XY}(x,y) = \rho_X(x) \cdot \rho_Y(y)$

- Continuous joint distribution
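    - X and Y are continuous random variables
    - Described by a joint PDF $f_{XY}(x,y)$
    - If X and Y are independent, $f_{XY}(x,y) = f_X(x) \cdot f_Y(y)$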

- Marginal distribution
    - Distribution of variable X ignoring variable Y
    - e.g. rolling 2 dice, with X and Y representing the result of each die
        - Marginal distribution of X:
            - The distribution of X across all possible X values, regardless of the value Y takes
            - In the discrete case, $\rho_X(x) = \sum_{y} \rho_{XY}(x,y)$

- Conditional distribution
    - Distribution of variable X CONDITIONAL on variable Y
    - e.g. rolling 2 dice, with X and Y representing the result of each die
        - Conditional distribution of X:
            - For a fixed value of Y, what is the distribution across all possible X values?
    - Discrete
        - $\rho_{Y | X=x}(y) = \frac{\rho_{XY}(x,y)}{\rho_X(x)}$
            - Conditional PMF of Y given X = x is the joint PMF at (x,y) divided by the marginal PMF of X at x
    - Continuous
        - $f_{Y | X=x}(y) = \frac{f_{XY}(x,y)}{f_X(x)}$
            - Conditional PDF of Y given X = x is the joint PDF of X and Y divided by the marginal PDF of X at x
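
The discrete definitions above can be checked by brute force on the two-dice example. The sketch below is only illustrative: it assumes X is the first die and Y is the sum of both dice (a choice made here so that X and Y are dependent), and the names `joint`, `marginal_x` and `conditional_y_given_x` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint PMF rho_XY(x, y). Assumption for illustration:
# X = first die, Y = sum of both dice, all 36 outcomes equally likely.
joint = defaultdict(Fraction)
for d1, d2 in product(range(1, 7), repeat=2):
    joint[(d1, d1 + d2)] += Fraction(1, 36)

# Marginal PMF of X: sum the joint PMF over every value of Y.
marginal_x = defaultdict(Fraction)
for (x, y), p in joint.items():
    marginal_x[x] += p


def conditional_y_given_x(x):
    """rho_{Y|X=x}(y) = rho_XY(x, y) / rho_X(x) for every y with nonzero mass."""
    return {y: p / marginal_x[x] for (xi, y), p in joint.items() if xi == x}


cond = conditional_y_given_x(3)
print(dict(marginal_x))  # marginal PMF of X
print(cond)              # conditional PMF of Y given X = 3
```

Using `Fraction` keeps the probabilities exact, so the conditional PMF can be seen to sum to exactly 1.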

## Covariance

- Definition
    - $Cov(x,y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)} {n}$
    - This assumes every pair of values $(x_i, y_i)$ has the same probability $p = \frac{1}{n}$
    - Covariance > 0: x and y tend to move in the same direction (positively correlated)
    - Covariance < 0: x and y tend to move in opposite directions (negatively correlated)

- What if the probabilities were not the same?
    - $Cov(x,y) = \sum_{i=1}^{n} \rho_{XY}(x_i,y_i) (x_i - \mu_x)(y_i - \mu_y)$, where $\rho_{XY}(x_i,y_i)$ is the joint probability of observing $(X=x_i, Y=y_i)$

- Covariance can also be written in terms of expectations, using $\mu_X = E[X]$ and $\mu_Y = E[Y]$:
$$\begin{align}
    Cov(X,Y) &= E[(X-\mu_X)(Y-\mu_Y)] \\
    &= E[XY] - E[X]\mu_Y - E[Y]\mu_X + \mu_X\mu_Y \\
    &= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] \\
    &= E[XY] - E[X]E[Y]
\end{align}$$
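
As a quick numerical check of the identity $Cov(X,Y) = E[XY] - E[X]E[Y]$, the sketch below evaluates both forms on a small joint PMF with unequal probabilities. The PMF values and the helper `expectation` are made up for illustration.

```python
from fractions import Fraction

# Hypothetical joint PMF over three (x, y) pairs; the probabilities sum to 1.
pmf = {(1, 2): Fraction(1, 2), (2, 3): Fraction(1, 4), (4, 1): Fraction(1, 4)}


def expectation(g):
    """E[g(X, Y)]: probability-weighted sum of g over the joint PMF."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())


mu_x = expectation(lambda x, y: x)
mu_y = expectation(lambda x, y: y)

# Probability-weighted definition: E[(X - mu_X)(Y - mu_Y)]
cov_definition = expectation(lambda x, y: (x - mu_x) * (y - mu_y))
# Shortcut form: E[XY] - E[X]E[Y]
cov_shortcut = expectation(lambda x, y: x * y) - mu_x * mu_y

print(cov_definition, cov_shortcut)
```

Both expressions evaluate to the same value for any valid joint PMF; exact fractions avoid floating-point noise in the comparison.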

- We see how the covariance of 2 random variables can be calculated
- In a typical dataset, there are multiple random variables, and so multiple covariances to compute
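    - e.g. If I have 3 variables, $x_1$, $x_2$, $x_3$, I have to compute the covariance $Cov(x_i, x_j)$ for all possible pairs of $i$ and $j$
- This is best represented in a variance-covariance matrix
    - The diagonal holds the variances, since $Cov(x_i, x_i) = Var(x_i)$
    - The matrix is symmetric, since $Cov(x_i, x_j) = Cov(x_j, x_i)$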

| | $x_1$ | $x_2$ | $x_3$ |
| --- | --- | --- | --- |
| $x_1$ | $Var(x_1)$ | $Cov(x_1, x_2)$ | $Cov(x_1, x_3)$ | 
| $x_2$ | $Cov(x_2, x_1)$ | $Var(x_2)$ | $Cov(x_2, x_3)$ | 
| $x_3$ | $Cov(x_3, x_1)$ | $Cov(x_3, x_2)$ | $Var(x_3)$ | 
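
A matrix like the one in the table can be built by applying the equal-probability covariance formula to every pair of variables. The following is a minimal sketch on a made-up dataset; the variable names and values are assumptions for illustration only.

```python
from statistics import fmean

# Hypothetical dataset: each key is a variable, each list holds its n observations.
data = {
    "x1": [2.0, 4.0, 6.0, 8.0],
    "x2": [1.0, 3.0, 2.0, 6.0],
    "x3": [9.0, 7.0, 5.0, 3.0],
}


def cov(a, b):
    """Population covariance: mean product of deviations (divides by n)."""
    mu_a, mu_b = fmean(a), fmean(b)
    return sum((ai - mu_a) * (bi - mu_b) for ai, bi in zip(a, b)) / len(a)


names = list(data)
# Entry (r, c) is Cov(x_r, x_c); the diagonal therefore holds the variances.
cov_matrix = [[cov(data[r], data[c]) for c in names] for r in names]

for name, row in zip(names, cov_matrix):
    print(name, row)
```

In practice a library routine such as NumPy's `np.cov` would normally be used; note that it divides by $n - 1$ by default, so its output differs from the divide-by-$n$ formula used here.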

## Correlation Coefficient

- $Corr(X,Y) = \frac{Cov(X,Y)}{Std(X) \cdot Std(Y)}$
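
Dividing by the standard deviations rescales the covariance into the range $[-1, 1]$. The sketch below illustrates this on made-up data where `y_up` and `y_down` are exact linear functions of `x`; all names and values are hypothetical.

```python
from math import sqrt
from statistics import fmean


def cov(a, b):
    """Population covariance: mean product of deviations (divides by n)."""
    mu_a, mu_b = fmean(a), fmean(b)
    return sum((ai - mu_a) * (bi - mu_b) for ai, bi in zip(a, b)) / len(a)


def corr(a, b):
    """Corr(a, b) = Cov(a, b) / (Std(a) * Std(b)), with Std(a) = sqrt(Cov(a, a))."""
    return cov(a, b) / sqrt(cov(a, a) * cov(b, b))


x = [2.0, 4.0, 6.0, 8.0]
y_up = [3 * xi + 1 for xi in x]   # increasing linear function of x
y_down = [-0.5 * xi for xi in x]  # decreasing linear function of x

print(cov(x, y_up), corr(x, y_up))      # covariance depends on scale, correlation does not
print(cov(x, y_down), corr(x, y_down))
```

The two covariances differ in magnitude because they depend on the units of the variables, while the correlations sit at the two ends of the range.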

## Multivariate Gaussian

- Let $X \sim N(\mu_x, \sigma_x^2)$, $Y \sim N(\mu_y, \sigma_y^2)$
- If X and Y are independent
    - PDF of joint distribution:
        - $\begin{align}
            f_{XY}(x,y) &= f_{X}(x) \cdot f_{Y}(y) \\
            &= \frac{1}{\sqrt{2 \pi } \cdot \sigma_x} e^{-0.5 \cdot (\frac{x-\mu_x}{\sigma_x})^2} * \frac{1}{\sqrt{2 \pi } \cdot \sigma_y} e^{-0.5 \cdot (\frac{y-\mu_y}{\sigma_y})^2} \\
            &= \frac{1}{2 \pi \sigma_x \sigma_y} e^{-0.5 \cdot ( (\frac{x-\mu_x}{\sigma_x})^2 + (\frac{y-\mu_y}{\sigma_y})^2)}
        \end{align}$ 
    
    - With independent components, the contours of the joint density are ellipses aligned with the coordinate axes (circles when $\sigma_x = \sigma_y$)

- Let's look at the exponent in the equation above more closely
    - $(\frac{x-\mu_x}{\sigma_x})^2 + (\frac{y-\mu_y}{\sigma_y})^2$
    - This can be rewritten as the squared L2-norm of a vector and simplified:
        - $\begin{align}\left\lVert \begin{bmatrix}
            \frac{x - \mu_x}{\sigma_x} \\ \frac{y - \mu_y}{\sigma_y}
        \end{bmatrix} \right\rVert_2^2 &= \begin{bmatrix} \frac{x - \mu_x}{\sigma_x} & \frac{y - \mu_y}{\sigma_y} \end{bmatrix} \cdot \begin{bmatrix} \frac{x - \mu_x}{\sigma_x} \\ \frac{y - \mu_y}{\sigma_y} \end{bmatrix} \\
        &= (\begin{bmatrix} x-\mu_x & y-\mu_y \end{bmatrix}) \cdot \begin{bmatrix} \frac{1}{\sigma_x^2} & 0 \\ 0 & \frac{1}{\sigma_y^2} \end{bmatrix} \cdot (\begin{bmatrix} x-\mu_x \\ y-\mu_y \end{bmatrix}) \\
        &= (\begin{bmatrix} x-\mu_x \\ y-\mu_y \end{bmatrix})^T \cdot \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix}^{-1} \cdot (\begin{bmatrix} x-\mu_x \\ y-\mu_y \end{bmatrix}) \\
        &= (\begin{bmatrix} x \\ y \end{bmatrix} - \mathbf{\mu})^T  \cdot \Sigma^{-1} \cdot (\begin{bmatrix} x \\ y \end{bmatrix} - \mathbf{\mu})
        \end{align}$
    - $\mathbf{\mu}$ is the vector of expectations $[E[X], E[Y], \dots]^T$
    - $\Sigma$ is the covariance matrix of general form $\begin{bmatrix} Var(X_1) & Cov(X_1, X_2) & \dots & Cov(X_1, X_n) \\ Cov(X_2, X_1) & Var(X_2) & \dots & Cov(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_n, X_1) & Cov(X_n, X_2) & \dots & Var(X_n) \end{bmatrix}$


- As such, the joint PDF can be rewritten as:
    - $\begin{align}
        f_{XY}(x,y) &= \frac{1}{2 \pi \sigma_x \sigma_y} e^{-0.5 \cdot ( (\frac{x-\mu_x}{\sigma_x})^2 + (\frac{y-\mu_y}{\sigma_y})^2)} \\
        &= \frac{1}{2 \pi \sigma_x \sigma_y} e^{-0.5 \cdot (\begin{bmatrix} x \\ y \end{bmatrix} - \mathbf{\mu})^T  \cdot \Sigma^{-1} \cdot (\begin{bmatrix} x \\ y \end{bmatrix} - \mathbf{\mu})} \\
        &= \frac{1}{2 \pi \cdot det(\Sigma)^{0.5}} e^{-0.5 \cdot (\begin{bmatrix} x \\ y \end{bmatrix} - \mathbf{\mu})^T  \cdot \Sigma^{-1} \cdot (\begin{bmatrix} x \\ y \end{bmatrix} - \mathbf{\mu})}
    \end{align}$ 
    - The last line follows because, for the diagonal $\Sigma$ of independent X and Y,
        - $\begin{align}
            det(\Sigma)^{0.5} &= (\sigma_x^2 \cdot \sigma_y^2 - 0)^{0.5} \\
            &= \sigma_x \cdot \sigma_y
        \end{align}$

    - Note that even when X and Y are not independent, the general expression (written with $\Sigma^{-1}$ and $det(\Sigma)$) still holds, provided X and Y are jointly Gaussian
        - The only caveat is that the covariance matrix is no longer diagonal i.e. $\Sigma \ne \begin{bmatrix}\sigma_x^2 & 0 \\ 0 & \sigma_y^2\end{bmatrix}$
        - Instead, the off-diagonals hold the covariance terms i.e. $\Sigma = \begin{bmatrix}\sigma_x^2 & Cov(X,Y) \\ Cov(Y,X) & \sigma_y^2\end{bmatrix}$


- In the general case with `n` variables, where $\mathbf{x} = [x_1, x_2, \dots, x_n]^T$:
    - $\begin{align}
        f_{\mathbf{X}}(x_1, x_2, \dots, x_n) = \frac{1}{ (2 \pi )^{ \frac{n}{2} } \cdot \begin{vmatrix} \Sigma \end{vmatrix}^{\frac{1}{2}}} \cdot e^{-0.5 \cdot (\mathbf{x} - \mathbf{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \mathbf{\mu})}
    \end{align}$
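
To see that the matrix form agrees with the product of two univariate densities when $\Sigma$ is diagonal, the sketch below implements the $n = 2$ case by hand, using the closed-form inverse and determinant of a 2x2 matrix. The function names and the numeric values are assumptions chosen for illustration.

```python
from math import exp, isclose, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sqrt(2 * pi) * sigma)


def bivariate_normal_pdf(x, y, mu, cov):
    """n = 2 case: exp(-0.5 * d^T Sigma^{-1} d) / (2 * pi * sqrt(det(Sigma)))."""
    (s_xx, s_xy), (s_yx, s_yy) = cov
    det = s_xx * s_yy - s_xy * s_yx
    dx, dy = x - mu[0], y - mu[1]
    # Quadratic form d^T Sigma^{-1} d, with Sigma^{-1} = [[s_yy, -s_xy], [-s_yx, s_xx]] / det
    quad = (s_yy * dx * dx - (s_xy + s_yx) * dx * dy + s_xx * dy * dy) / det
    return exp(-0.5 * quad) / (2 * pi * sqrt(det))


mu = (1.0, -2.0)
sigma_x, sigma_y = 2.0, 0.5

# Independent case: diagonal covariance matrix, so the joint PDF factorises.
cov_diag = [[sigma_x**2, 0.0], [0.0, sigma_y**2]]
matrix_form = bivariate_normal_pdf(0.3, -1.7, mu, cov_diag)
product_form = normal_pdf(0.3, mu[0], sigma_x) * normal_pdf(-1.7, mu[1], sigma_y)
print(isclose(matrix_form, product_form))

# Dependent case: same formula, with Cov(X, Y) on the off-diagonals.
cov_full = [[sigma_x**2, 0.6], [0.6, sigma_y**2]]
print(bivariate_normal_pdf(0.3, -1.7, mu, cov_full))
```

The same function handles both cases; only the off-diagonal entries of the covariance matrix change. For larger `n` a linear-algebra library would be the natural choice instead of a hand-written inverse.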