## Gaussian Distributions

$$N(x|\mu,\sigma^{2}) = \frac{1}{(2\pi\sigma^{2})^{\frac{1}{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}(x - \mu)^{2}\right\}$$

![image.png](attachment:e2d9b187-1cb4-4d50-9a03-2cd163bef782.png)

$$N(x|\mu,\sigma^{2}) > 0$$
$$\int^{\infty}_{-\infty}N(x|\mu,\sigma^{2}) dx = 1$$
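
The two properties above (positivity and unit area) can be checked numerically. This is a minimal sketch: the helper name `gaussian_pdf` and the parameter values are illustrative assumptions, and the integral is approximated by a simple Riemann sum over a wide grid.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Illustrative parameters (assumed, not taken from the notes).
mu, sigma2 = 1.0, 4.0

# Grid covering +/- 20 standard deviations around the mean.
xs = np.linspace(mu - 40.0, mu + 40.0, 200_001)
dx = xs[1] - xs[0]
density = gaussian_pdf(xs, mu, sigma2)

# Riemann-sum approximation of the integral over the real line.
area = np.sum(density) * dx

print("all values positive:", bool(np.all(density > 0.0)))
print("approximate area:", area)
```

The area should come out very close to 1; the small discrepancy comes from the finite grid, not from the formula.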

$$N(x|\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{D}{2}}}\frac{1}{|\Sigma|^{\frac{1}{2}}}\exp\left\{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right\}$$

This is the Gaussian defined over a $D$-dimensional vector $x$ of continuous variables, with a $D$-dimensional mean vector $\mu$.

The $D \times D$ matrix $\Sigma$ is called the covariance matrix, and $|\Sigma|$ is its determinant.
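
The multivariate density can be evaluated directly from the formula. The sketch below assumes a hypothetical helper `mvn_pdf` and an illustrative 2-D mean and covariance; it solves a linear system instead of forming $\Sigma^{-1}$ explicitly, which is a common design choice rather than something the notes prescribe.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) for a single point x."""
    D = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Illustrative parameters (assumed, not taken from the notes).
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

peak = mvn_pdf(mu, mu, Sigma)                     # density at the mean
off_peak = mvn_pdf(np.array([1.0, -1.0]), mu, Sigma)
print(peak, off_peak)
```

With $D = 1$ the same function reduces to the univariate formula, which is a quick sanity check on the normalisation constant.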

---

**Covariance between Two Variables:**

For two random variables $X$ and $Y$, the sample covariance is computed as:

$$\text{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Where:
- $x_i$ and $y_i$ are the individual data points.
- $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$, respectively.
- $n$ is the number of data points.

This formula measures how much, on average, two variables change together relative to their means. If they tend to increase together (a value above the mean of $X$ coincides with a value above the mean of $Y$, and vice versa), their covariance is positive. If one tends to increase when the other decreases, the covariance is negative.
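
As a small illustration of the sign behaviour, the sketch below implements the formula by hand and compares it with `np.cov`. The function name `sample_cov` and the data values are made up for the example.

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance with the 1/(n-1) normalisation used above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Made-up data: y_up tends to rise with x, y_down tends to fall with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
y_down = -y_up

print("cov(x, y_up)   =", sample_cov(x, y_up))     # positive
print("cov(x, y_down) =", sample_cov(x, y_down))   # negative
print("np.cov agrees  :", np.isclose(sample_cov(x, y_up), np.cov(x, y_up)[0, 1]))
```

`np.cov` uses the same $n-1$ denominator by default, so the two results should match.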

---

**Building the Covariance Matrix for $X$, $Y$, and $Z$:**

Given three variables $X$, $Y$, and $Z$, we need to compute the individual variances and pairwise covariances to build the covariance matrix.

1. **Variances**:
   - $\text{var}(X) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
   - $\text{var}(Y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$
   - $\text{var}(Z) = \frac{1}{n-1}\sum_{i=1}^{n}(z_i - \bar{z})^2$

2. **Covariances**:
   - $\text{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
   - $\text{cov}(X,Z) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z})$
   - $\text{cov}(Y,Z) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})(z_i - \bar{z})$

3. **Covariance Matrix**:

Using the variances and covariances calculated, we can form the covariance matrix $\Sigma$ as:

$$
\Sigma = 
\begin{bmatrix}
\text{var}(X) & \text{cov}(X,Y) & \text{cov}(X,Z) \\
\text{cov}(X,Y) & \text{var}(Y) & \text{cov}(Y,Z) \\
\text{cov}(X,Z) & \text{cov}(Y,Z) & \text{var}(Z) \\
\end{bmatrix}
$$

Note:
- The diagonal elements represent the variance of each variable.
- The off-diagonal elements represent the covariance between the pairs of variables.
- Covariance matrices are symmetric: $\text{cov}(X,Y) = \text{cov}(Y,X)$.
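
The three steps above can be sketched in a few lines. The data here is synthetic (generated with an assumed seed and assumed coefficients purely for illustration); the matrix is built once from centered data and once with `np.cov` for comparison.

```python
import numpy as np

# Synthetic data for three variables (assumed for illustration only).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.5, size=n)    # tends to move with x
z = -0.5 * x + rng.normal(scale=0.5, size=n)   # tends to move against x

# One row per observation, one column per variable.
data = np.column_stack([x, y, z])

# Manual construction: center each column, then average the outer products.
centered = data - data.mean(axis=0)
Sigma_manual = centered.T @ centered / (n - 1)

# NumPy equivalent; rowvar=False means "columns are variables".
Sigma_numpy = np.cov(data, rowvar=False)

print(Sigma_manual)
```

The diagonal holds the variances, and the matrix equals its own transpose, matching the notes above.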

---

**Understanding through $(x_i - \bar{x})(y_i-\bar{y})$:**

Let's intuitively understand the term $(x_i - \bar{x})(y_i-\bar{y})$:

- $(x_i - \bar{x})$: This represents how much the current data point $x_i$ deviates from the mean of $X$.

- $(y_i - \bar{y})$: Similarly, this represents how much the current data point $y_i$ deviates from the mean of $Y$.

- By multiplying them together, we're looking at their joint deviation. If both deviations have the same sign, their product is positive, indicating that $X$ and $Y$ are moving in the same direction relative to their means. If the deviations have opposite signs, their product is negative, indicating that $X$ and $Y$ are moving in opposite directions.

The covariance is the average of these products (using $n-1$ rather than $n$ in the denominator), which gives an indication of the overall trend of the joint movement of $X$ and $Y$.

---

![image.png](attachment:c97c1b6b-c520-410a-97fa-11234457310a.png)

### Maximum Likelihood Estimation (MLE)

MLE aims to find the parameter values that maximize the likelihood of observing the given data. For a multivariate Gaussian, we try to find the $\mu$ and $\Sigma$ that maximize the likelihood of the observed data samples.

The likelihood function $L(\mu, \Sigma)$ is:

$$
L(\mu, \Sigma) = \prod_{i=1}^{N} N(x_i|\mu,\Sigma)
$$

### Log-Likelihood

Maximizing the product of many exponential terms can be computationally challenging. So, we often take the natural logarithm of the likelihood to simplify the calculation. The log-likelihood is:

$$
\log L(\mu, \Sigma) = \sum_{i=1}^{N} \log N(x_i|\mu,\Sigma)
$$

The logarithm is a monotonically increasing function, meaning that if $x > y$, then $\log(x) > \log(y)$. Therefore, maximizing the log-likelihood is equivalent to maximizing the likelihood itself.
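
To make the log-likelihood concrete, here is a minimal sketch that evaluates it for a multivariate Gaussian and plugs in the standard closed-form maximum likelihood estimates: the sample mean for $\mu$ and the $1/N$-normalised covariance for $\Sigma$. The function name `log_likelihood`, the "true" parameters, and the random seed are assumptions made for this example.

```python
import numpy as np

def log_likelihood(X, mu, Sigma):
    """Sum over rows of X of log N(x_i | mu, Sigma)."""
    N, D = X.shape
    diff = X - mu
    # log|Sigma| computed stably; the sign is positive for a valid covariance.
    _, logdet = np.linalg.slogdet(Sigma)
    # Row-wise quadratic forms (x_i - mu)^T Sigma^{-1} (x_i - mu).
    quad = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    return -0.5 * (N * D * np.log(2.0 * np.pi) + N * logdet + quad.sum())

# Synthetic samples from an assumed 2-D Gaussian.
rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=500)

# Closed-form MLE: sample mean and covariance with 1/N (not 1/(N-1)).
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / X.shape[0]

ll_hat = log_likelihood(X, mu_hat, Sigma_hat)
ll_true = log_likelihood(X, true_mu, true_Sigma)
ll_shifted = log_likelihood(X, mu_hat + 0.3, Sigma_hat)
print(ll_hat, ll_true, ll_shifted)
```

Because the MLE maximises the log-likelihood on the observed sample, `ll_hat` should be at least as large as the value at the generating parameters and larger than the value at a shifted mean.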

### Advantages of Using Log-Likelihood

1. **Computational Simplicity**: Products become sums, and the logarithm cancels the exponential in the Gaussian, leaving a quadratic term in $x$ that is easier to handle.

2. **Numerical Stability**: Multiplying many small density values drives the product toward zero, and it can underflow in floating-point arithmetic. Summing logarithms avoids this problem.

3. **Analytic Solutions**: The log-likelihood often leads to easier-to-solve equations when setting the derivative to zero for maximization.

In summary, using the log-likelihood in MLE offers both computational and numerical advantages, and it's especially useful for distributions like the multivariate Gaussian where likelihoods can get quite complicated.
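
The numerical-stability point can be seen with a quick experiment. This sketch assumes 2000 samples from a standard normal (an arbitrary choice); every density value is below 0.4, so the raw product is far smaller than anything a 64-bit float can represent.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=2000)

# Standard normal density evaluated at each sample.
dens = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

likelihood = np.prod(dens)               # underflows to exactly 0.0 in float64
log_likelihood = np.sum(np.log(dens))    # stays finite

print("product of densities:", likelihood)
print("sum of log densities:", log_likelihood)
```

The product carries no usable information once it underflows, whereas the sum of logs can still be compared across parameter settings.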

## PCA Example (Not relevant)

1. **Hypothetical Data**:
Let's say we have 4 features in our data. We'll use a very basic toy dataset:
$$
\text{Data} = 
\begin{bmatrix}
2 & 3 & 1 & 4 \\
2.2 & 3.1 & 1.1 & 4.1 \\
2.4 & 3.2 & 1.3 & 4.3 \\
1.8 & 2.9 & 0.8 & 3.9 \\
2.1 & 3.0 & 1.0 & 4.0 \\
\end{bmatrix}
$$
Where each row represents an observation and each column a feature.

2. **Compute the Covariance Matrix**:
For simplicity, let's say the computed covariance matrix for the data is:

$$
\Sigma = 
\begin{bmatrix}
0.05 & 0.04 & 0.03 & 0.04 \\
0.04 & 0.06 & 0.04 & 0.05 \\
0.03 & 0.04 & 0.05 & 0.04 \\
0.04 & 0.05 & 0.04 & 0.06 \\
\end{bmatrix}
$$
(Note: This matrix is fictional and might not correspond to the actual covariance of the data above.)
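
Since the matrix above is fictional, it may help to see how the actual sample covariance of the toy data would be computed. This is a sketch using `np.cov`, with rows as observations and columns as features.

```python
import numpy as np

# Toy dataset from step 1: 5 observations (rows), 4 features (columns).
data = np.array([
    [2.0, 3.0, 1.0, 4.0],
    [2.2, 3.1, 1.1, 4.1],
    [2.4, 3.2, 1.3, 4.3],
    [1.8, 2.9, 0.8, 3.9],
    [2.1, 3.0, 1.0, 4.0],
])

# rowvar=False treats each column as a variable; the default ddof gives 1/(n-1).
Sigma_actual = np.cov(data, rowvar=False)

np.set_printoptions(precision=4, suppress=True)
print(Sigma_actual)
```

The result is a symmetric 4 x 4 matrix whose entries generally differ from the fictional values used in the rest of this walkthrough.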

3. **Compute the Eigenvectors and Eigenvalues**:
The next step in PCA is to compute the eigenvectors and eigenvalues of the covariance matrix. Each eigenvector will represent a principal component, and its corresponding eigenvalue will signify the amount of variance captured by that component.

Let's say, for simplicity, we found the following eigenvalues (illustrative values only; the true eigenvalues of the matrix above would sum to its trace, 0.22):

$$
\lambda_1 = 0.18, \quad \lambda_2 = 0.10, \quad \lambda_3 = 0.03, \quad \lambda_4 = 0.01
$$

And their corresponding eigenvectors (principal components), again chosen for illustration:

$$
v_1 = [0.5, 0.5, 0.5, 0.5], \quad v_2 = [0.5, -0.5, 0.5, -0.5], \quad \text{etc.}
$$

4. **Sort by Eigenvalues**:
The next step in PCA is to sort the eigenvectors based on the descending order of their corresponding eigenvalues. This gives the order of importance of the principal components:

$$
\text{Order: } v_1, v_2, v_3, v_4
$$
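
Steps 3 and 4 can be sketched with NumPy. The example below applies `np.linalg.eigh` (intended for symmetric matrices) to the fictional covariance matrix from step 2, then sorts the results in descending order; the numbers it produces will differ from the illustrative eigenvalues quoted above.

```python
import numpy as np

# Fictional covariance matrix from step 2.
Sigma = np.array([
    [0.05, 0.04, 0.03, 0.04],
    [0.04, 0.06, 0.04, 0.05],
    [0.03, 0.04, 0.05, 0.04],
    [0.04, 0.05, 0.04, 0.06],
])

# eigh returns eigenvalues in ascending order, eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Reorder so the largest eigenvalue (most variance) comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Fraction of total variance captured by each principal component.
explained = eigvals / eigvals.sum()

print("eigenvalues:", eigvals)
print("explained variance ratio:", explained)
```

Column `eigvecs[:, k]` is the principal component paired with `eigvals[k]`, so slicing the first few columns selects the most important directions.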

5. **Projection**:
To reduce dimensionality or to transform the data, you'd project it onto a subset of the top eigenvectors. For instance, if we wanted to reduce our 4-dimensional data to 2 dimensions, we'd use $v_1$ and $v_2$. Mathematically, this is done by taking the dot product of the data with each eigenvector.

Using just the first two eigenvectors for projection, a data point:

$$
x = [2, 3, 1, 4]
$$

Would be transformed into:

$$
x' = [x \cdot v_1, x \cdot v_2]
$$
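
Carrying out this projection with the illustrative eigenvectors from step 3 gives a concrete pair of numbers. The sketch follows the notes in projecting the raw point; in practice the feature means are usually subtracted first, which is an extra step not shown here.

```python
import numpy as np

x = np.array([2.0, 3.0, 1.0, 4.0])

# Illustrative eigenvectors from step 3.
v1 = np.array([0.5, 0.5, 0.5, 0.5])
v2 = np.array([0.5, -0.5, 0.5, -0.5])

# Stack the eigenvectors as columns of a 4 x 2 projection matrix.
W = np.column_stack([v1, v2])

# Each entry of x_proj is the dot product of x with one eigenvector.
x_proj = x @ W
print(x_proj)  # [ 5. -2.]
```

So the 4-dimensional point is represented by the two coordinates $x \cdot v_1 = 5$ and $x \cdot v_2 = -2$.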

6. **Interpret Results**:
The transformed data represents the original data in the new basis defined by the principal components. The dimensions (principal components) are now uncorrelated, and they capture the most variance in the data in descending order.
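
The claim that the principal components are uncorrelated can be checked on the toy data. This sketch uses the data's actual covariance matrix (not the fictional one), centers the data, projects it onto the top two eigenvectors, and inspects the covariance of the result.

```python
import numpy as np

# Toy dataset from step 1.
data = np.array([
    [2.0, 3.0, 1.0, 4.0],
    [2.2, 3.1, 1.1, 4.1],
    [2.4, 3.2, 1.3, 4.3],
    [1.8, 2.9, 0.8, 3.9],
    [2.1, 3.0, 1.0, 4.0],
])

# Center the data and compute its sample covariance.
centered = data - data.mean(axis=0)
Sigma = np.cov(data, rowvar=False)

# Eigendecomposition, sorted by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the top two principal components.
scores = centered @ eigvecs[:, :2]

# Covariance of the projected data: diagonal, with the top eigenvalues on it.
scores_cov = np.cov(scores, rowvar=False)

np.set_printoptions(precision=6, suppress=True)
print(scores_cov)
```

The off-diagonal entries vanish up to floating-point error, and the diagonal entries match the two largest eigenvalues, in descending order.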

Remember, in real applications, libraries like Scikit-learn or tools like MATLAB handle these computations efficiently, and for real datasets, the covariance matrix won't be as simple as our hypothetical example. The goal here was to convey the core concepts of PCA through a toy example.