```{contents}
```

# t-SNE

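* **t-SNE** (t-distributed Stochastic Neighbor Embedding) is a **non-linear dimensionality reduction** technique.
* It’s mainly used for **visualizing high-dimensional data** in **2D or 3D**.
* Unlike PCA, which is linear, t-SNE focuses on preserving **local structure** (which points are neighbors of which), so **clusters** in the data tend to remain visible in the plot.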

**Use case:** Visualizing clusters in datasets like images, word embeddings, or gene expression data.

---

## **2. Key Idea**

t-SNE tries to **map similar points in high-dimensional space close together** in low-dimensional space, and **dissimilar points far apart**.

1. Compute **pairwise similarities** in high-dimensional space:

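   * Convert distances into conditional probabilities using a **Gaussian distribution** centered on $x_i$. The bandwidth $\sigma_i$ is set separately for each point so that the distribution matches the chosen perplexity:

   $$
   p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / (2\sigma_i^2)\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / (2\sigma_i^2)\right)}
   $$

   * Symmetrize the conditionals into joint probabilities, where $n$ is the number of data points: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$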

2. Compute **pairwise similarities in low-dimensional space**:

   * Use a **Student-t distribution with 1 degree of freedom** (heavy tails):

   $$
   q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}
   $$

3. Minimize **Kullback-Leibler (KL) divergence** between high- and low-dimensional similarities:

   $$
   \text{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
   $$

* Intuition: Points close in high-dimensional space should be close in 2D/3D space.

---

## **3. How t-SNE Works (Step by Step)**

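1. Compute the high-dimensional joint probabilities $p_{ij}$, which encode similarity between points.
2. Initialize the low-dimensional points $y_i$ randomly, typically with small values near the origin.
3. Compute the low-dimensional probabilities $q_{ij}$ from the current $y_i$.
4. Update the $y_i$ to reduce the KL divergence using **gradient descent**.
5. Repeat steps 3 and 4 for a fixed number of iterations or until the KL divergence stops improving, so that the embedding preserves local neighborhoods.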
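
For this objective, the gradient with respect to each embedded point is $\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}$. The loop below is a bare-bones sketch of the five steps using that gradient. It makes several simplifying assumptions: a shared Gaussian bandwidth instead of per-point $\sigma_i$, plain gradient descent without the momentum and early-exaggeration tricks that practical implementations use, and a small step size suited to a toy dataset. The names `tsne_sketch`, `n_steps`, and `step_size` are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    s = np.sum(Z ** 2, axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * Z @ Z.T, 0.0)

def gaussian_affinities(X, sigma=1.0):
    """Symmetric p_ij with one shared sigma (a simplification)."""
    n = X.shape[0]
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)
    return (cond + cond.T) / (2.0 * n)

def tsne_sketch(X, n_components=2, n_steps=1000, step_size=1.0, seed=0):
    rng = np.random.default_rng(seed)
    P = gaussian_affinities(X)                                   # step 1
    Y = 1e-2 * rng.standard_normal((X.shape[0], n_components))   # step 2
    mask = P > 0
    history = []                                                 # KL before each update
    for _ in range(n_steps):
        W = 1.0 / (1.0 + pairwise_sq_dists(Y))                   # Student-t kernel
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum()                                          # step 3
        kl = np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))
        history.append(float(kl))
        # Gradient: 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
        G = (P - Q) * W
        grad = 4.0 * (G.sum(axis=1)[:, None] * Y - G @ Y)
        Y = Y - step_size * grad                                 # step 4
    return Y, history                                            # step 5: loop ends

# Toy data: three well-separated clusters in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [6.0] * 5, [-6.0] * 5])
X = np.vstack([c + rng.normal(size=(20, 5)) for c in centers])

Y, history = tsne_sketch(X)
print(f"KL at start: {history[0]:.3f}, KL at end: {history[-1]:.3f}")
```

Tracking the KL divergence per iteration, as `history` does here, is a simple way to check that the optimization is making progress.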

---

## **4. Important Hyperparameters**

| Hyperparameter  | Effect                                                                                         | Typical Values              |
| --------------- | ---------------------------------------------------------------------------------------------- | --------------------------- |
| `perplexity`    | Balances local vs global structure. Low = focus on small clusters, high = larger neighborhoods | 5–50                        |
| `learning_rate` | Step size for gradient descent                                                                 | 10–1000                     |
| `n_iter`        | Number of optimization iterations (`max_iter` in recent scikit-learn versions)                 | 1000+                       |
| `metric`        | Distance metric                                                                                | 'euclidean', 'cosine', etc. |

> ⚡ **Tip:** t-SNE is mostly **for visualization**, not feature reduction for predictive models.
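
To make the role of `perplexity` concrete: for each point, the bandwidth $\sigma_i$ is chosen so that the conditional distribution $p_{\cdot|i}$ has perplexity $2^{H}$ equal to the requested value, where $H$ is its Shannon entropy in bits. The sketch below finds that $\sigma_i$ by bisection in log-space. It is an illustration under assumptions: the function names, search bounds, and step count are invented here, and real libraries may organize the search differently.

```python
import numpy as np

def conditional_row(sq_dists_i, sigma):
    """p_{j|i} for one point, given squared distances to all *other* points."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity_of(p):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    nz = p[p > 0]
    return 2.0 ** (-np.sum(nz * np.log2(nz)))

def sigma_for_perplexity(sq_dists_i, target, n_steps=60):
    """Bisection on sigma; perplexity grows as sigma grows."""
    lo, hi = 1e-10, 1e10                      # assumed search bounds
    for _ in range(n_steps):
        sigma = np.sqrt(lo * hi)              # geometric midpoint
        if perplexity_of(conditional_row(sq_dists_i, sigma)) > target:
            hi = sigma                        # distribution too flat: shrink sigma
        else:
            lo = sigma                        # distribution too peaked: grow sigma
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
target = 30.0                                 # must lie between 1 and n - 1

sigmas = np.empty(len(X))
achieved = np.empty(len(X))
for i in range(len(X)):
    d = np.sum((X - X[i]) ** 2, axis=1)
    d = np.delete(d, i)                       # drop the distance to itself
    sigmas[i] = sigma_for_perplexity(d, target)
    achieved[i] = perplexity_of(conditional_row(d, sigmas[i]))

print(f"sigma range: {sigmas.min():.3f} to {sigmas.max():.3f}")
```

Points in dense regions end up with a smaller $\sigma_i$ than points in sparse regions, which is how a single perplexity value adapts to varying density.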

---

## **5. Strengths & Limitations**

### **Strengths**

* Captures **non-linear structure**.
* Excellent for visualizing clusters.
* Preserves **local neighborhoods** better than PCA.

### **Limitations**

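* Does **not preserve global distances**: the distances between clusters and the relative sizes of clusters in the plot are generally not meaningful.
* Sensitive to **hyperparameters** (`perplexity`, `learning_rate`).
* Computationally **expensive** for large datasets: the exact algorithm scales quadratically with the number of points, and the Barnes-Hut approximation reduces this to roughly $O(n \log n)$.
* Embeddings are **non-deterministic** (different runs may differ unless the random seed is fixed).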
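
Because of the sensitivity to hyperparameters and to the seed, a common habit is to fix the seed and compare a few perplexity values side by side. The sketch below assumes scikit-learn is the library in use (the hyperparameter names in this document match its `TSNE` class); the toy data and the specific values chosen are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy data: three well-separated clusters in 10 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 10, [8.0] * 10, [-8.0] * 10])
X = np.vstack([c + rng.normal(size=(40, 10)) for c in centers])  # shape (120, 10)

embeddings = {}
for perplexity in (5, 30):          # perplexity must be smaller than the sample count
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        learning_rate=200.0,
        metric="euclidean",
        init="random",
        random_state=42,            # fixed seed so reruns start from the same layout
    )
    embeddings[perplexity] = tsne.fit_transform(X)
    print(perplexity, embeddings[perplexity].shape, tsne.kl_divergence_)
```

Plotting each entry of `embeddings` next to the others makes it easier to judge which visual features are stable and which are artifacts of one particular setting.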

---

## **6. Intuition**

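* Imagine every pair of points connected by a **spring** whose pull reflects how similar the two points are in high-dimensional space.
* t-SNE moves the points around in 2D until **similar points stay close** and **dissimilar points are far apart**. The heavy-tailed Student-t distribution lets dissimilar points sit farther apart in the map, which avoids crowding everything into the center.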


```{dropdown} Click here for Sections
```{tableofcontents}