**CS596 - Machine Learning**
<br>
Date: **30 November 2020**


Title: **Lecture 12**
<br>
Speaker: **Dr. Shota Tsiskaridze**
<br>
Teaching Assistant: **Levan Sanadiradze**

<h1 align="center">Dimensionality Reduction</h1>

<h3 align="center">t-Distributed Stochastic Neighbor Embedding (t-SNE)</h3>

- **t-distributed stochastic neighbor embedding (t-SNE)** is a **nonlinear** dimensionality reduction technique well-suited for **embedding high-dimensional data** for **visualization** in a **low-dimensional space** of **two** or **three dimensions**.


- The **t-SNE algorithm** comprises **two main stages**:


- **First**, t-SNE **constructs** a **probability distribution** over **pairs of high-dimensional objects**.

  **Similar objects** are assigned a **higher probability** while **dissimilar points** are assigned a **lower probability**.
  

- **Second**, t-SNE **defines** a **similar probability distribution** over the points **in the low-dimensional map**.

  It **minimizes the Kullback–Leibler divergence** (KL divergence) between the two distributions with respect to the locations of the points in the map. 
 
 
- While the **original algorithm** uses the **Euclidean distance** between objects as the base of its similarity metric, this can be changed as appropriate.

<h3 align="center">Stochastic Neighbor Embedding</h3>

- Let's assume that we are given a **collection** of $N$ **high-dimensional objects** $x_1, \dots, x_N$.


- **Stochastic Neighbor Embedding (SNE)** starts by **converting** the **high-dimensional Euclidean distances** between datapoints **into conditional probabilities** that represent **similarities**.


- The **similarity** of datapoint $x_j$ to datapoint $x_i$ is the **conditional probability**, $p_{j|i}$, that $x_i$ **would pick** $x_j$ as its **neighbor** if neighbors were picked in proportion to their **probability density** under a **Gaussian** centered at $x_i$:

  $$p_{j|i} = \frac{e^{-\frac{|| x_i  - x_j||^2}{2 \sigma_i^2}}}{\sum_{k \neq i}e^{-\frac{|| x_i  - x_k||^2}{2 \sigma_i^2}}},$$
  
  where $\sigma_i$ is the **standard deviation** of the **Gaussian** that is centered on datapoint $x_i$ (so $\sigma_i^2$ is its variance).

  Because we are only interested in modeling pairwise similarities, we set the value of $p_{i|i}$ to zero.
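The definition of $p_{j|i}$ can be sketched in NumPy as follows. This is a minimal illustration rather than part of the original lecture: the function name `conditional_probabilities`, the toy data, and the choice $\sigma_i = 1$ for every point are assumptions made for the example.

```python
import numpy as np

def conditional_probabilities(X, sigma):
    """Return the N x N matrix P with P[i, j] = p_{j|i}.

    X     : (N, D) array of high-dimensional points.
    sigma : (N,) array holding one Gaussian bandwidth sigma_i per point.
    """
    # Squared Euclidean distances ||x_i - x_j||^2 for all pairs.
    sq_norms = np.sum(X ** 2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    D2 = np.maximum(D2, 0.0)  # guard against tiny negative round-off

    # Row i uses its own bandwidth sigma_i.
    logits = -D2 / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)             # enforces p_{i|i} = 0
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                       # toy data (assumed)
P = conditional_probabilities(X, sigma=np.ones(6))
```

Each row of `P` is a probability distribution over the other points, so the matrix is generally not symmetric: $p_{j|i} \neq p_{i|j}$.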
  
  
- For the **low-dimensional counterparts** $y_i$ and $y_j$ of the high-dimensional datapoints $x_i$ and $x_j$, it is possible to compute a similar conditional probability, which we denote by $q_{j|i}$:

  $$q_{j|i} = \frac{e^{-|| y_i  - y_j||^2}}{\sum_{k \neq i}e^{-|| y_i  - y_k||^2}},$$
  
  where the **standard deviation** of the Gaussian used to compute the conditional probabilities $q_{j|i}$ is **set to** $\frac{1}{\sqrt{2}}$, so that $2\sigma^2 = 1$.
  
  Again, since we are only interested in modeling pairwise similarities, we set $q_{i|i} = 0$.
  
  <img src="images/L12_SNE.png" width="600" alt="Example" />
 
- A natural **measure of the faithfulness** with which $q_{j|i}$ models $p_{j|i}$ is the **Kullback-Leibler divergence**.

  SNE **minimizes** the **sum of Kullback-Leibler divergences** over all datapoints using a **gradient descent method**.
  
  The **cost function** $C$ is given by:
  
  $$C = \sum_{i=1}^{N} KL(P_i || Q_i) = \sum_{i=1}^{N} \sum_{j=1}^{N} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$
  
  Here $P_i$ represents the **conditional probability distribution** over all other datapoints **given datapoint** $x_i$, and $Q_i$ represents the **conditional probability distribution** over all other map points **given map point** $y_i$.
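As a rough sketch of how the SNE cost could be evaluated, the snippet below computes $q_{j|i}$ from map points and sums the row-wise KL divergences. The helper names `sne_q` and `sne_cost` are hypothetical, and the matrix `P` is only a stand-in built from a second random point set, not probabilities derived from real high-dimensional data.

```python
import numpy as np

def sne_q(Y):
    """Q[i, j] = q_{j|i} for map points Y of shape (N, d)."""
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2)
    np.fill_diagonal(W, 0.0)                  # q_{i|i} = 0
    return W / W.sum(axis=1, keepdims=True)

def sne_cost(P, Q):
    """C = sum_i KL(P_i || Q_i), with the convention 0 * log 0 = 0."""
    mask = P > 0                              # skips the zero diagonal
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
Q = sne_q(rng.normal(size=(6, 2)))
P = sne_q(rng.normal(size=(6, 2)))            # stand-in for P (assumed)
cost = sne_cost(P, Q)
```

The cost is zero only when every row of `Q` matches the corresponding row of `P`, and positive otherwise.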


- The **remaining parameter** to be **selected** is the **standard deviation** $\sigma_i$ of the Gaussian that is centered over each high-dimensional datapoint, $x_i$.

  It is not likely that there is a single value of $\sigma_i$ that is optimal for all datapoints in the dataset because the density of the data is likely to vary. 
  
  In **dense regions**, a **smaller value** of $\sigma_i$ is usually **more appropriate** than in sparser regions. 
  
  Any particular value of $\sigma_i$ induces a probability distribution, $P_i$, over all of the other datapoints. 
  
  This distribution has an entropy which increases as $\sigma_i$ increases. 
  
  SNE performs a binary search for the value of $\sigma_i$ that produces a $P_i$ with a fixed perplexity that is specified by the user.
  
  The **perplexity** is defined as:
  
  $$Perp(P_i) = 2^{H(P_i)},$$
  
  where $H(P_i)$ is the **Shannon entropy** of $P_i$ measured in bits:
  
  $$H(P_i) = - \sum_{j=1}^{N} p_{j|i} \log_2 p_{j|i}.$$
  
  The perplexity can be interpreted as a **smooth measure of the effective number of neighbors**. 
  
  The **performance of SNE** is **fairly robust** to changes in the **perplexity**, and typical **values are between 5 and 50**.
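The binary search for $\sigma_i$ can be sketched as a simple bisection, which works because the perplexity grows monotonically with $\sigma_i$. The function names, the search interval `[lo, hi]`, the tolerance, and the toy data below are all assumptions chosen for illustration.

```python
import numpy as np

def row_perplexity(d2_row, sigma):
    """Perplexity 2^{H(P_i)} for squared distances from x_i to the other points."""
    logits = -d2_row / (2.0 * sigma ** 2)
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    p = p[p > 0]                              # convention 0 * log 0 = 0
    entropy_bits = -np.sum(p * np.log2(p))
    return 2.0 ** entropy_bits

def find_sigma(d2_row, target, lo=1e-10, hi=1e4, tol=1e-5, max_iter=200):
    """Bisection on sigma_i until the perplexity matches the target."""
    mid = 0.5 * (lo + hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        perp = row_perplexity(d2_row, mid)
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = mid                          # too many effective neighbors
        else:
            lo = mid                          # too few effective neighbors
    return mid

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                  # toy data (assumed)
d2 = np.sum((X[1:] - X[0]) ** 2, axis=1)      # squared distances from x_0
sigma_0 = find_sigma(d2, target=10.0)
```

The target perplexity must lie strictly between 1 and the number of other points (here 49) for the search to succeed.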
  
- The **minimization** of the **cost function** $C$ is performed using a **gradient descent method**, with the gradient:

  $$\frac{\partial C}{\partial y_i} = 2 \sum_{j=1}^{N} (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j).$$
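One way to sanity-check the SNE gradient, whose $j$-th term is $(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$, is to compare it with central finite differences of the cost. The sketch below does this on a small random problem. All function names are hypothetical, and `P` is a random row-stochastic matrix rather than one computed from data.

```python
import numpy as np

def sne_q(Y):
    """Q[i, j] = q_{j|i} for map points Y of shape (N, d)."""
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(axis=1, keepdims=True)

def sne_cost(P, Y):
    """Sum of KL(P_i || Q_i) over all points (off-diagonal terms only)."""
    Q = sne_q(Y)
    off = ~np.eye(len(Y), dtype=bool)
    return float(np.sum(P[off] * np.log(P[off] / Q[off])))

def sne_grad(P, Y):
    """Analytic gradient; row i is dC/dy_i."""
    Q = sne_q(Y)
    M = (P - Q) + (P - Q).T     # M[i, j] = p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}
    # sum_j M[i, j] * (y_i - y_j), written with matrix products
    return 2.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)

def numerical_grad(P, Y, h=1e-6):
    """Central finite differences of the cost, one coordinate at a time."""
    G = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[idx] += h
        Ym[idx] -= h
        G[idx] = (sne_cost(P, Yp) - sne_cost(P, Ym)) / (2.0 * h)
    return G

rng = np.random.default_rng(0)
N = 5
A = rng.random((N, N))
np.fill_diagonal(A, 0.0)
P = A / A.sum(axis=1, keepdims=True)          # random stand-in for p_{j|i}
Y = rng.normal(size=(N, 2))
max_err = np.max(np.abs(sne_grad(P, Y) - numerical_grad(P, Y)))
```

A small `max_err` indicates that the analytic expression and the numerical derivative agree.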
 

<h3 align="center">Crowding Problem</h3>

- Although SNE preserves local relationships well, it suffers from the **crowding problem**.


- The **area of the 2D map** that is available to accommodate moderately distant data points **will not be large enough** compared with the area available to accommodate nearby data points.


- Intuitively, there is **less space** in a **lower dimension** to **accommodate moderately distant data points** originally in higher dimension:

<img src="images/L12_Crowding_Problem.png" width="600" alt="Example" />

- Although the **distances** between the closest points $AB$ and $BC$ **are preserved**, the **global distance** $AC$ has to **shrink**.

<h3 align="center">t-Distributed Stochastic Neighbor Embedding (t-SNE)</h3>

- **To address** the **crowding problem** and make SNE more robust to outliers, **t-SNE** was introduced.


- Compared to SNE, **t-SNE** has **two main changes**:

  1. A **symmetrized** version of the SNE **cost function** with simpler gradients.
  
  2. A **Student-t distribution** rather than a Gaussian to compute the similarity **in the low-dimensional space** to alleviate the crowding problem.
  

- In **t-SNE**, the conditional probabilities are replaced by symmetrized **joint probabilities** $p_{ij}$, defined as follows:

  $$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N}.$$
  
  In this way, $\sum_{j = 1}^{N} p_{ij} > \frac{1}{2N}$ for all data points $x_i$.
  
  As a result, each $x_i$ **makes a significant contribution** to the **cost function**.
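The symmetrization step is short enough to write out directly. In this sketch the function name `joint_probabilities` is hypothetical and `P_cond` is a random row-stochastic matrix standing in for the conditional probabilities $p_{j|i}$.

```python
import numpy as np

def joint_probabilities(P_cond):
    """p_ij = (p_{j|i} + p_{i|j}) / (2N) from a row-stochastic matrix."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)

rng = np.random.default_rng(0)
N = 5
A = rng.random((N, N))
np.fill_diagonal(A, 0.0)
P_cond = A / A.sum(axis=1, keepdims=True)     # stand-in for p_{j|i}
P = joint_probabilities(P_cond)
row_sums = P.sum(axis=1)
```

Since every row of `P_cond` sums to one, each entry of `row_sums` equals $\frac{1}{2N}\left(1 + \sum_j p_{i|j}\right)$, which is where the lower bound $\frac{1}{2N}$ comes from.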


- **t-SNE** uses the **Student’s t-distribution** instead of the Gaussian to define $Q$:

  $$q_{ij} = \frac{\left (1 + ||y_i - y_j||^2 \right )^{-1}}{\sum_{k \neq l} \left(1 + ||y_k - y_l||^2 \right )^{-1}}$$

  The **cost function** of **t-SNE** is now a **single Kullback-Leibler divergence** between the joint distributions $P$ and $Q$:
  
  $$C = KL(P || Q) = \sum_{i=1}^{N} \sum_{j=1}^{N} p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$
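The Student-t similarities $q_{ij}$ and the resulting cost can be sketched as below. Note that the normalization runs over all pairs, so `Q` is a single joint distribution rather than one distribution per row. The helper names are hypothetical and `P` is a stand-in built from a second random point set.

```python
import numpy as np

def tsne_q(Y):
    """Joint Student-t similarities q_ij for map points Y of shape (N, d)."""
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D2)
    np.fill_diagonal(W, 0.0)                  # exclude k = l from the sum
    return W / W.sum()                        # normalize over all pairs

def tsne_cost(P, Q):
    """KL(P || Q) over all pairs, with the convention 0 * log 0 = 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
Q = tsne_q(rng.normal(size=(6, 2)))
P = tsne_q(rng.normal(size=(6, 2)))           # stand-in for p_ij (assumed)
cost = tsne_cost(P, Q)
```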
  
  
- The **heavy tails** of the normalized Student-t kernel **allow** dissimilar **input objects** $x_i$ and $x_j$ **to be modeled** by low-dimensional counterparts $y_i$ and $y_j$ that are **far apart**, because $q_{ij}$ decays only polynomially with distance and therefore stays comparatively large (relative to a Gaussian) for two embedded points that are far apart.
  
  Since $Q$ is the distribution being learned, the **outlier problem does not arise** in the low-dimensional map.
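To make the heavy-tail argument concrete, the short sketch below compares the unnormalized Gaussian kernel $e^{-d^2}$ with the unnormalized Student-t kernel $(1 + d^2)^{-1}$ at a few map distances $d$. The distances are arbitrary values picked for illustration.

```python
import numpy as np

d = np.array([0.5, 1.0, 2.0, 4.0])            # example map distances (assumed)
gaussian = np.exp(-d ** 2)                    # SNE kernel in the map
student_t = 1.0 / (1.0 + d ** 2)              # t-SNE kernel in the map
ratio = student_t / gaussian                  # how much heavier the t tail is

for dist, g, s in zip(d, gaussian, student_t):
    print(f"d = {dist:3.1f}  gaussian = {g:.2e}  student_t = {s:.2e}")
```

The ratio grows rapidly with distance, which is why the Student-t kernel lets moderately dissimilar points sit much farther apart in the map without being penalized.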
  
  The figure below shows the **pairwise Euclidean distances** between **two points** in the **high-dimensional** and **low-dimensional** data:

  <img src="images/L12_Gradients.png" width="600" alt="Example" />


- The **gradient** of the **cost function** is:

  $$\frac{\partial C}{ \partial y_i} = 4 \sum_{j = 1, j \neq i}^{N} (p_{ij} - q _{ij})\left ( 1 + ||y_i - y_j||^2\right)^{-1} (y_i - y_j).$$
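The t-SNE gradient can be checked against central finite differences of the cost in the same spirit as any analytic gradient. The sketch below is an illustration with hypothetical function names. `P` is a random symmetric matrix with zero diagonal that sums to one, standing in for the joint probabilities $p_{ij}$.

```python
import numpy as np

def tsne_kernel(Y):
    """Unnormalized Student-t kernel w_ij = (1 + ||y_i - y_j||^2)^{-1}."""
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D2)
    np.fill_diagonal(W, 0.0)
    return W

def tsne_cost(P, Y):
    """KL(P || Q) with Q obtained by normalizing the kernel over all pairs."""
    W = tsne_kernel(Y)
    Q = W / W.sum()
    off = ~np.eye(len(Y), dtype=bool)
    return float(np.sum(P[off] * np.log(P[off] / Q[off])))

def tsne_grad(P, Y):
    """Analytic gradient; row i is dC/dy_i."""
    W = tsne_kernel(Y)
    Q = W / W.sum()
    M = (P - Q) * W                           # (p_ij - q_ij)(1 + d_ij^2)^{-1}
    # sum_j M[i, j] * (y_i - y_j), written with matrix products
    return 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)

def numerical_grad(P, Y, h=1e-6):
    """Central finite differences of the cost, one coordinate at a time."""
    G = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[idx] += h
        Ym[idx] -= h
        G[idx] = (tsne_cost(P, Yp) - tsne_cost(P, Ym)) / (2.0 * h)
    return G

rng = np.random.default_rng(0)
N = 5
A = rng.random((N, N))
np.fill_diagonal(A, 0.0)
A = A + A.T
P = A / A.sum()                               # symmetric stand-in for p_ij
Y = rng.normal(size=(N, 2))
max_err = np.max(np.abs(tsne_grad(P, Y) - numerical_grad(P, Y)))
```

The factor `M[i, j]` is positive when two points should be closer ($p_{ij} > q_{ij}$) and negative when they should be pushed apart, which matches the attraction and repulsion reading of the gradient.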
  
  We can interpret the t-SNE gradient as a simulation of an **N-body system**:
  
    <img src="images/L12_Nbody.png" width="1000" alt="Example" />


- Note also that the t-SNE algorithm has an **exaggeration parameter** $\alpha > 1$, which is **used as a coefficient** for $p_{ij}$.

<img src="images/L12_tSNE_Algorithm.png" width="800" alt="Example" />
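The pieces above can be combined into a very small optimization loop. The following is a simplified sketch, not the exact algorithm in the figure: it assumes a single shared $\sigma$ instead of a per-point perplexity search, plain gradient descent with momentum, and exaggeration by $\alpha$ only during the first iterations. The step size, momentum, iteration counts, and toy data are assumed values.

```python
import numpy as np

def joint_p(X, sigma=2.0):
    """Symmetrized p_ij using one shared sigma (simplifying assumption)."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P_cond = W / W.sum(axis=1, keepdims=True)
    return (P_cond + P_cond.T) / (2.0 * len(X))

def q_and_kernel(Y):
    """Return the joint q_ij and the unnormalized Student-t kernel."""
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl(P, Q):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def tsne(P, n_iter=500, lr=1.0, momentum=0.5, alpha=4.0, exag_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(len(P), 2))    # small random start
    velocity = np.zeros_like(Y)
    for t in range(n_iter):
        # Exaggeration: scale p_ij by alpha during the first iterations.
        P_eff = alpha * P if t < exag_iters else P
        Q, W = q_and_kernel(Y)
        M = (P_eff - Q) * W
        grad = 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)
        velocity = momentum * velocity - lr * grad
        Y = Y + velocity
    return Y

# Toy data (assumed): two well-separated Gaussian clusters in 5 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 5)),
               rng.normal(6.0, 1.0, size=(30, 5))])
P = joint_p(X)
Y = tsne(P)
final_cost = kl(P, q_and_kernel(Y)[0])
```

Practical implementations add per-parameter adaptive gains and a momentum schedule, and they tune $\sigma_i$ per point from the perplexity; those refinements are omitted here to keep the sketch short.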


- A visualization of **6,000 digits** from the **MNIST dataset** produced by **t-SNE** is shown below:

<img src="images/L12_MNIST.png" width="800" alt="Example" />
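A similar embedding can be produced with an off-the-shelf implementation. The sketch below uses `TSNE` from scikit-learn on the small 8x8 digits dataset bundled with the library, as a stand-in for MNIST. The subset size of 500 samples and the parameter values are assumptions chosen to keep the run short, and this is not the code used to produce the figure above.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()                        # 8x8 digit images, not MNIST
X, labels = digits.data[:500], digits.target[:500]

tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(X)             # array of shape (500, 2)
```

The rows of `embedding` can then be drawn as a scatter plot colored by `labels` to inspect how well the digit classes separate.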


<h3 align="center">Weaknesses</h3>

- Although t-SNE compares favorably to other techniques for data visualization, it has three potential weaknesses:


1. It is **unclear how t-SNE performs** on general **dimensionality reduction** tasks.


2. The relatively local nature of t-SNE makes it **sensitive to the curse of intrinsic dimensionality** of the data.


3. t-SNE is **not guaranteed to converge** to a global optimum of its cost function.


<h1 align="center">End of Lecture</h1>