# COVARIANCE MATRIX (always square and symmetric)

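The matrix forms are faster to compute than the element-wise sums, because they reduce to a single matrix product. Here $\mu$ is the empirical mean of the samples and $n$ is the number of samples.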

(if each column is a sample):

$$ \Sigma = \frac{1}{n}(X-\mu)({X}-\mu)^\top $$

(if each row is a sample):

$$ \Sigma = \frac{1}{n}(X-\mu)^\top({X}-\mu) $$

Alternatively, each entry of the matrix can be computed element-wise:

$$ var(x) = \frac {1}{n} \cdot \sum_{i=1}^n({x_i} - \mu_x)^2 $$

$$ var(y) = \frac {1}{n} \cdot \sum_{i=1}^n({y_i} - \mu_y)^2 $$

$$ cov(x, y) = cov(y, x) = \frac {1}{n} \cdot \sum_{i=1}^n({x_i} - \mu_x)({y_i} - \mu_y) $$
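As a quick check, here is a minimal sketch that compares the matrix form with the element-wise sums. It assumes NumPy, a synthetic matrix `X` with one sample per row, and the $1/n$ normalization; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # assumed layout: one sample per row
n = X.shape[0]

# Matrix form (rows are samples): Sigma = 1/n * (X - mu)^T (X - mu)
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / n

# Element-wise form for the two columns x and y
x, y = X[:, 0], X[:, 1]
var_x = np.sum((x - x.mean()) ** 2) / n
var_y = np.sum((y - y.mean()) ** 2) / n
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n

Sigma_elementwise = np.array([[var_x, cov_xy],
                              [cov_xy, var_y]])
```

Both results should also agree with `np.cov(X, rowvar=False, bias=True)`, which uses the same $1/n$ normalization.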

# PROJECTION OF A VECTOR ONTO A SUBSPACE 


The formula below computes the projection of a vector $\vec{x}$ onto the subspace spanned by a unit vector $\vec{u}$.

$$ \mathbb{P}_{u}{x} = ({x}^T{u})\vec{u} $$

where:

- $ ({x}^T{u}) $ indicates the (signed) length, and
- $ \vec{u} $ indicates the direction.

- Maximizing the distance from the origin of the projected point and minimizing the reconstruction error are equivalent. By Pythagoras, $\|\mathbf{x}\|_2^2 = \|\mathbb{P}_{\mathbf{u}}\mathbf{x}\|_2^2 + \|\mathbf{x} - \mathbb{P}_{\mathbf{u}}\mathbf{x}\|_2^2$, and $\|\mathbf{x}\|_2$ is fixed, so making the first term larger makes the second one smaller.

But why "maximize"? Because we want to keep the variance high, so that we do not lose information.

We will use the "maximize the length from the origin of the projected point onto the subspace" formulation, which gives the objective:

$$ \arg\max_{\mathbf{u}} \frac{1}{N}\sum_{i=1}^N \left\|   \mathbb{P}_{\mathbf{u}}\mathbf{x}_i \right\|_2^2 $$

- The **size** of the projection is measured with the squared $\ell_2$ norm:

$$ \| \mathbb{P}_{\mathbf{u}}\mathbf{x}\|_2^2 = (\mathbf{x}^\top\mathbf{u})^2 = \|\mathbf{x}\|_2^2 \cos^2\theta $$

since $ \|\mathbf{u}\|_2 = 1 $ and $ \theta $ is the angle between $\mathbf{x}$ and $\mathbf{u}$.
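A small numeric sketch of the projection, assuming NumPy and two hand-picked 2-D vectors (the values are illustrative only). It computes the projection as length times direction, then checks the squared size of the projection against $\|x\|_2^2\cos^2\theta$.

```python
import numpy as np

x = np.array([3.0, 4.0])
u = np.array([1.0, 1.0])
u = u / np.linalg.norm(u)              # make u a unit vector

length = x @ u                         # signed length of the projection
proj = length * u                      # P_u x = (x^T u) u

# Squared size of the projection, computed two ways
size_sq = np.linalg.norm(proj) ** 2
cos_theta = (x @ u) / np.linalg.norm(x)            # valid because ||u|| = 1
size_sq_from_angle = np.linalg.norm(x) ** 2 * cos_theta ** 2
```

Here `proj` is `[3.5, 3.5]` and both squared sizes equal `24.5`.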

# CENTERING A CLOUD OF POINTS

To center a cloud of points, we need to subtract the empirical mean from each point.

NOTE: the empirical mean is computed from the design matrix $X$ (one sample per row), as the average of each column over all samples.

In Python, this would be something like:

```python
# X has shape (n_samples, n_features); subtract the per-column mean
X_centered = X - X.mean(axis=0)
```

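where `axis=0` averages over the rows (the samples), giving one mean per column (feature). NumPy broadcasting then subtracts that mean from every row.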
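A tiny worked example, assuming NumPy and a made-up design matrix with one sample per row, shows that the column means are zero after centering:

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])            # 3 samples (rows), 2 features (columns)

mean = X.mean(axis=0)                  # one mean per column
X_centered = X - mean                  # broadcasting subtracts it from every row
```

The mean is `[2., 20.]`, and the centered rows are `[-1., -10.]`, `[0., 0.]` and `[1., 10.]`.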

# ROTATION MATRIX

To obtain the rotation matrix $R$:

Given the covariance matrix $\Sigma$, we compute its eigendecomposition:

1. Compute the eigenvectors and eigenvalues of $\Sigma$;
2. Arrange the eigenvectors as the columns of a matrix $V$, sorted by descending associated eigenvalue.
3. Transpose $V$ to obtain its inverse. Since the covariance matrix is symmetric, its eigenvectors can be chosen orthonormal, so $V$ is orthogonal:

$V^\top$ = $V^{-1}$

4. The rotation matrix $R$ is then $R = V^{-1} = V^\top$

Recall that the rotation matrix $R$ is orthogonal, so $R^\top = R^{-1}$ and $R^\top \cdot R = I$.

Recall also that, after the rotation, the covariance matrix of the data is diagonal, which means the features are decorrelated.
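The steps can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the data are synthetic, stored one sample per row, and the rotation is taken as $R = V^\top$ so that rotating the centered data diagonalizes the covariance. `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data, one sample per row (an assumption of this sketch)
A = np.array([[2.0, 0.0],
              [1.5, 0.5]])
X = rng.normal(size=(500, 2)) @ A.T
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / Xc.shape[0]

# 1. Eigendecomposition (eigh is meant for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(Sigma)

# 2. Sort the eigenvectors (columns) by descending eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], eigvecs[:, order]

# 3-4. V is orthogonal, so its transpose is its inverse; take R = V^T
R = V.T

# Rotate the centered data and recompute the covariance
X_rot = Xc @ R.T                       # each row x becomes R x
Sigma_rot = X_rot.T @ X_rot / X_rot.shape[0]
```

Up to floating-point error, `Sigma_rot` is diagonal with the sorted eigenvalues on its diagonal, and `R @ R.T` is the identity.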

# K-MEANS STEPS

1. **Initialization:** randomly sample the $K$ centroids $\{ {\mu}_1, \ldots, {\mu}_K\}$. A common trick is to pick $K$ of the available data points.

- This random initialization can be far from the optimal solution.

2. **Assignment:** assign each point to the closest centroid. 

- For each point, $ y_i = \arg\min_k || {x}_i -{\mu}_k ||_2^2 $.

- At the end of this step we should have a pair of aligned sets $\{ {x}_1, \ldots, {x_n}\}$ and $\{ {y}_1, \ldots, {y_n}\}$, where $y_i$ is the index of the centroid assigned to $x_i$.

3. **Update:** 

- Now we do the **inverse** of the previous step: the assignments are fixed and we consider the centroids.
- For every centroid $k \in [1,\ldots,K]$ **we re-center it**. How? We set it to the mean of all the points that are assigned to it.

- At the end of this step we should have a pair of aligned sets $\{ {x}_1, \ldots, {x_n}\}$ and $\{ {y}_1, \ldots, {y_n}\}$ and **updated centroids** $\{ {\mu}_1, \ldots, {\mu}_K\}$.
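The three steps can be put together in a short NumPy sketch. The function name `kmeans`, its parameters, the stopping rule (centroids stop moving) and the handling of empty clusters (keep the old centroid) are assumptions of this sketch, not something the notes prescribe.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means on X of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K distinct data points as centroids
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # 2. Assignment: label each point with its closest centroid
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # shape (n, K)
        y = d2.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its points
        new_mu = mu.copy()
        for k in range(K):
            if np.any(y == k):          # keep the old centroid if a cluster is empty
                new_mu[k] = X[y == k].mean(axis=0)
        if np.allclose(new_mu, mu):     # centroids stopped moving
            break
        mu = new_mu
    return mu, y

# Two well-separated synthetic blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
mu, y = kmeans(X, K=2)
```

On this toy data the two returned centroids should land close to the two blob centers; on harder data the result depends on the random initialization.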

# INVERSE TRANSFORM SAMPLING

Inverse transform sampling is a probabilistic method used in machine learning to generate random samples from a given probability distribution. The method relies on the cumulative distribution function (CDF) of the distribution to generate the samples.

To use inverse transform sampling, we need to first compute the CDF of the probability distribution we want to sample from. The CDF gives the probability that a random variable takes a value less than or equal to a given value. We can use this information to generate random samples from the distribution.

The steps to use inverse transform sampling are as follows:

1. Choose a probability distribution that you want to sample from. This could be any probability distribution, such as a normal distribution, uniform distribution, or any other distribution.

2. Compute the cumulative distribution function (CDF) of the chosen probability distribution. For a continuous distribution, it is the integral of the probability density function (PDF) from negative infinity to the given value. The CDF is a non-decreasing function that ranges from 0 to 1.

3. Generate a uniform random variable U between 0 and 1. This can be done using any method for generating random numbers in the range [0, 1], such as using a pseudorandom number generator.

4. Apply the inverse of the CDF, $F^{-1}$, to $U$ to obtain the sample $X = F^{-1}(U)$. This step is the core of inverse transform sampling: the inverse CDF maps probabilities back to values of the random variable, so $X$ is the value whose cumulative probability is $U$.

5. Repeat steps 3 and 4 as many times as necessary to generate the desired number of random samples. Each time step 4 is performed, we generate a new value of X that corresponds to a different probability U. By repeating this process many times, we can generate a large number of random samples from the desired probability distribution.
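As an illustration of the steps above, here is a minimal sketch for the exponential distribution, chosen here only because its CDF $F(x) = 1 - e^{-\lambda x}$ has the closed-form inverse $F^{-1}(u) = -\ln(1-u)/\lambda$. The function name `sample_exponential` and its parameters are hypothetical.

```python
import numpy as np

def sample_exponential(rate, size, rng):
    """Draw samples from Exp(rate) via inverse transform sampling."""
    u = rng.uniform(0.0, 1.0, size=size)   # step 3: U ~ Uniform[0, 1)
    return -np.log1p(-u) / rate            # step 4: X = F^{-1}(U) = -ln(1 - U) / rate

rng = np.random.default_rng(0)
samples = sample_exponential(rate=2.0, size=100_000, rng=rng)
```

The sample mean should be close to the theoretical mean $1/\lambda = 0.5$. Step 5 is handled here by drawing all the uniform values at once instead of looping.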

# K-MEANS ++

In K-Means++, the initial centroids are chosen in a way that maximizes the chances of finding good quality clusters. The algorithm works as follows:

1. Choose the first centroid randomly from the data points.
2. For each data point, compute the distance to the nearest centroid that has already been chosen.
3. Choose the next centroid from the data points, with a probability proportional to the squared distance to the nearest centroid.
4. Repeat steps 2 and 3 until $K$ centroids have been chosen.

By selecting the initial centroids in this way, K-Means++ makes it likely that the initial centroids are well spread out and represent different clusters. This reduces the risk of K-Means converging to a poor local optimum.

After the initialization step, K-Means++ proceeds in the same way as K-Means. It iteratively assigns each data point to the nearest centroid, and then updates the centroids based on the mean of the assigned data points. The algorithm repeats until convergence or a maximum number of iterations is reached.

In formulas:

1. Choose ${\mu}_1$ at random among the data points.
2. For $k \in [2,\ldots,K]$:
    - **Inverse Transform Sampling** w.r.t. the squared distances between each point and its nearest already-chosen centroid
    - $Pr[{\mu}_k={x}_m] \propto \min_{k^{\prime}<k} \left| \left| {x}_m - {\mu}_{k^{\prime}}\right|\right|_2^2 \qquad$
3. Repeat Lloyd’s method until convergence:
    - **Assignment step:** $\forall i\in[1,N] \quad y_i =  \arg\min_k || {x}_i -{\mu}_k ||_2^2 $
    - **Update step:** $\forall k\in[1,K] \quad  {\mu}_k \leftarrow \frac{\sum_i \delta\{y_i=k\}{x}_i}{\sum_i \delta\{y_i=k\}}$
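The seeding part (steps 1 and 2) can be sketched as below, using inverse transform sampling on the discrete distribution over data points. The function name `kmeans_pp_init` is hypothetical, and the sketch assumes the data contain at least $K$ distinct points, so the squared distances never sum to zero.

```python
import numpy as np

def kmeans_pp_init(X, K, rng):
    """Pick K initial centroids from the rows of X with K-means++ seeding."""
    n = len(X)
    centroids = [X[rng.integers(n)]]               # first centroid: uniform at random
    for _ in range(1, K):
        # Squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Inverse transform sampling: p_m is proportional to d2[m]
        cdf = np.cumsum(d2) / d2.sum()
        u = rng.uniform()
        m = np.searchsorted(cdf, u, side="right")  # first index with cdf > u
        m = min(m, n - 1)                          # guard against round-off in cdf[-1]
        centroids.append(X[m])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
init = kmeans_pp_init(X, K=3, rng=rng)
```

An already-chosen point has squared distance zero, so it adds nothing to the cumulative sum and cannot be drawn again. The returned centroids would then be passed to the assignment and update steps.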

# RESPONSIBILITIES

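$\gamma_k$ indicates the responsibility of component $k$ for a point $x$, i.e. the posterior probability that $x$ was generated by the $k$-th Gaussian of the mixture, where $\pi_k$ is the mixing weight of component $k$:

$$ \gamma_k = p(z = k \mid x) = \frac{\mathcal{N}(x; \mu_k, \sigma_k)\cdot \pi_k}{\sum_{j=1}^{K} \mathcal{N}(x; \mu_j, \sigma_j)\cdot \pi_j} $$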
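A minimal numeric sketch of the responsibilities for a 1-D mixture of Gaussians. It assumes that $\sigma_k$ is a standard deviation; the helper names `gaussian_pdf` and `responsibilities` and the parameter values are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def responsibilities(x, mus, sigmas, pis):
    """Return gamma with shape (len(x), K); each row sums to 1."""
    x = np.asarray(x, dtype=float)[:, None]            # shape (n, 1)
    weighted = pis * gaussian_pdf(x, mus, sigmas)      # (n, K): pi_k * N(x; mu_k, sigma_k)
    return weighted / weighted.sum(axis=1, keepdims=True)

mus = np.array([0.0, 4.0])
sigmas = np.array([1.0, 1.0])
pis = np.array([0.5, 0.5])
gamma = responsibilities([0.0, 2.0, 4.0], mus, sigmas, pis)
```

The point `2.0` is equidistant from both means, so its responsibilities are `[0.5, 0.5]`, while `0.0` and `4.0` are assigned almost entirely to the first and second component respectively.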