# Lesson 3: Visualizing Sentence Embeddings with t-SNE

## Introduction

Welcome to the third lesson in our course on **Text Representation Techniques for RAG systems**! In our previous lesson, we explored how to generate sentence embeddings and saw how these richer representations capture semantic meaning better than the classic Bag-of-Words.

Now, we will build on that knowledge to visualize these embeddings in a two-dimensional space using **t-SNE** (t-distributed Stochastic Neighbor Embedding). By the end of this lesson, you'll have a visual way to see how thematically similar sentences group closer together, reinforcing the idea that embeddings preserve meaningful relationships between sentences.

---

## Understanding t-SNE

t-SNE helps us visualize high-dimensional embeddings by compressing them into a lower-dimensional space (usually 2D or 3D for visualization) while preserving relative similarities:

- **Similarity First:**  
  t-SNE prioritizes keeping similar sentences close. It calculates pairwise similarities in the original space (using a probability distribution) so nearby embeddings get higher similarity scores than distant ones.

- **Local Structure:**  
  It preserves neighborhoods of related points rather than exact distances. Clusters in the plot therefore usually reflect genuine thematic groupings (e.g., NLP vs. Food), but the axis values themselves have no intrinsic meaning, and the gaps between well-separated clusters should be read with caution.

- **Perplexity Matters:**  
  This parameter (typically between 5 and 50, and in scikit-learn always smaller than the number of samples) controls the effective neighborhood size.
  - Lower values emphasize tight clusters (good for spotting subtopics).
  - Higher values show broader trends (useful for separating major categories).

- **Tradeoffs:**  
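  While powerful for visualization, t-SNE is computationally expensive for large datasets: the exact algorithm compares every pair of sentences, and even scikit-learn's default Barnes-Hut approximation slows down noticeably as the sample grows. For RAG systems, it’s better suited for exploratory analysis of smaller samples than production-scale data.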
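
To make the "similarity" and "perplexity" ideas above concrete, here is a minimal NumPy sketch. It is not scikit-learn's implementation: it assumes a single shared Gaussian width `sigma` for every point, whereas real t-SNE searches for a separate width per point so that each row matches the requested perplexity. The function names and the toy data are hypothetical.

```python
import numpy as np

def conditional_probabilities(points, sigma):
    """Row i holds p(j|i): the Gaussian similarity of every point j to point i."""
    # Squared Euclidean distance between every pair of points.
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Gaussian kernel; a point is never counted as its own neighbor.
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)
    # Normalize each row into a probability distribution.
    return affinities / affinities.sum(axis=1, keepdims=True)

def perplexity_per_point(probs):
    """Perplexity is 2 raised to the Shannon entropy (in bits) of each row."""
    safe = np.where(probs > 0, probs, 1.0)  # log2(1) = 0, so zeros add nothing
    entropy = -np.sum(probs * np.log2(safe), axis=1)
    return 2.0 ** entropy

rng = np.random.default_rng(0)
toy_points = rng.normal(size=(12, 5))  # stand-in for 12 sentence embeddings

narrow = conditional_probabilities(toy_points, sigma=0.5)
wide = conditional_probabilities(toy_points, sigma=5.0)

print("narrow kernel:", perplexity_per_point(narrow).round(2))
print("wide kernel:  ", perplexity_per_point(wide).round(2))
```

A narrow kernel concentrates each row on a few close neighbors (low perplexity), while a wide kernel spreads probability over almost all other points (perplexity approaching the maximum of `n - 1`). This is why perplexity is often described as an effective neighbor count.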

> **Why does this matter for RAG?**  
> Seeing embeddings cluster by topic suggests they're capturing semantic relationships, a prerequisite for effective retrieval. If the NLP sentences were scattered randomly, we'd question the embedding quality before even building the RAG pipeline, prompting us to reevaluate our model choice.

---

## Building Our Data

To demonstrate how t-SNE reveals natural groupings, we'll gather sentences on four different topics: **NLP**, **ML**, **Food**, and **Weather**. Then, we assign each sentence a category so we can later color-code and shape-code the points in our 2D visualization.

```python
def get_sentences_and_categories():
    """
    Return the sentences and their corresponding categories.
    """
    sentences = [
        # Topic: NLP
        "RAG stands for Retrieval-Augmented Generation.",
        "Retrieval is a crucial aspect of modern NLP systems.",
        "Generating text with correct facts is challenging.",
        "Large language models can generate coherent text.",
        "GPT models have billions of parameters.",
        "Natural Language Processing enables computers to understand human language.",
        "Word embeddings capture semantic relationships between words.",
        "Transformer architectures revolutionized NLP research.",
        
        # Topic: Machine Learning
        "Machine learning benefits from large datasets.",
        "Supervised learning requires labeled data.",
        "Reinforcement learning is inspired by behavioral psychology.",
        "Neural networks can learn complex functions.",
        "Overfitting is a common problem in ML.",
        "Unsupervised learning uncovers hidden patterns in data.",
        "Feature engineering is critical for model performance.",
        "Cross-validation helps in assessing model generalization.",
        
        # Topic: Food
        "Bananas are commonly used in smoothies.",
        "Oranges are rich in vitamin C.",
        "Pizza is a popular Italian dish.",
        "Cooking pasta requires boiling water.",
        "Chocolate can be sweet or bitter.",
        "Fresh salads are a healthy and refreshing meal.",
        "Sushi combines rice, fish, and seaweed in a delicate balance.",
        "Spices can transform simple ingredients into gourmet dishes.",
        
        # Topic: Weather
        "It often rains in the Amazon rainforest.",
        "Summers can be very hot in the desert.",
        "Hurricanes form over warm ocean waters.",
        "Snowstorms can disrupt transportation.",
        "A sunny day can lift people's mood.",
        "Foggy mornings are common in coastal regions.",
        "Winter brings frosty nights and chilly winds.",
        "Thunderstorms can produce lightning and heavy rain."
    ]
    
    categories = (["NLP"] * 8 + ["ML"] * 8 + ["Food"] * 8 + ["Weather"] * 8)
    return sentences, categories

def get_color_and_shape_maps():
    """
    Return color and marker maps for each category.
    """
    color_map = {
        "NLP": "red",
        "ML": "blue",
        "Food": "green",
        "Weather": "purple"
    }
    shape_map = {
        "NLP": "o",
        "ML": "s",
        "Food": "^",
        "Weather": "X"
    }
    return color_map, shape_map
```

- **`get_sentences_and_categories()`** returns two lists: one of sentences and another labeling each sentence’s category.
- **`get_color_and_shape_maps()`** provides dictionaries to map each category to a specific color and marker shape (e.g., red circles for NLP).

---

## Generating and Reducing Embeddings

Next, we encode the sentences into embeddings and then reduce them to two dimensions using t-SNE:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

def compute_tsne_embeddings(
    sentences,
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    perplexity=10,
    max_iter=3000,
    random_state=42
):
    """
    Compute and return t-SNE reduced embeddings for the given sentences.
    """
    # 1. Initialize a SentenceTransformer model.
    model = SentenceTransformer(model_name)
    
    # 2. Convert each sentence into a high-dimensional embedding.
    embeddings = model.encode(sentences)
    
    # 3. Configure t-SNE with chosen parameters.
    #    Note: scikit-learn 1.5 renamed `n_iter` to `max_iter`;
    #    on older releases, pass `n_iter=max_iter` instead.
    tsne = TSNE(
        n_components=2,
        random_state=random_state,
        perplexity=perplexity,
        max_iter=max_iter
    )
    
    # 4. Fit t-SNE on the embeddings and return a 2D representation.
    return tsne.fit_transform(embeddings)
```

1. **Model Initialization:** Loads a pre-trained SentenceTransformer to generate embeddings.  
2. **Embedding Generation:** `model.encode(sentences)` converts sentences into high-dimensional vectors.  
3. **t-SNE Configuration:** Sets the hyperparameters `perplexity`, `max_iter` (called `n_iter` before scikit-learn 1.5), and `random_state` for the TSNE instance.  
4. **Dimensionality Reduction:** `tsne.fit_transform(embeddings)` reduces the embeddings to 2D points.
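
To get a feel for how the perplexity setting changes the result without downloading an embedding model, the following sketch runs t-SNE at several perplexities. It makes two assumptions: the embeddings are replaced by synthetic Gaussian blobs (four groups of eight points, mirroring the lesson's 32 sentences), and the iteration count is left at scikit-learn's default so the code does not depend on the parameter's name.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Hypothetical stand-in for sentence embeddings: 4 groups of 8 points in 16-D.
centers = rng.normal(scale=5.0, size=(4, 16))
fake_embeddings = np.vstack([c + rng.normal(size=(8, 16)) for c in centers])

layouts = {}
for perplexity in (2, 10, 25):
    # perplexity must stay below the number of samples (32 here).
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    layouts[perplexity] = tsne.fit_transform(fake_embeddings)
    print(perplexity, layouts[perplexity].shape)  # each layout has shape (32, 2)
```

Each entry of `layouts` can be passed to a plotting function to compare how tightly or loosely the groups are drawn at each setting.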

---

## Plotting the Embeddings

With our 2D points ready, we’ll plot them to visualize topic clusters and add short labels for clarity:

```python
def plot_embeddings(
    reduced_embeddings,
    sentences,
    categories,
    color_map,
    shape_map,
    xlim=(-125, 150),
    ylim=(-175, 125)
):
    """
    Plot the 2D embeddings with labels and a legend.
    """
    # 1. Create the figure.
    plt.figure(figsize=(10, 8))
    
    # 2. Plot each sentence as a scatter point.
    for i, (sentence, category) in enumerate(zip(sentences, categories)):
        x, y = reduced_embeddings[i]
        plt.scatter(x, y, color=color_map[category], marker=shape_map[category])
        plt.text(x - 2.5, y - 7.5, sentence[:20] + "...", fontsize=9)
    
    # 3. Build a legend.
    for cat, color in color_map.items():
        plt.scatter([], [], color=color, label=cat, marker=shape_map[cat])
    plt.legend(loc="best")
    
    # 4. Finalize plot.
    plt.title("t-SNE Visualization of Sentence Embeddings", fontsize=14)
    plt.xlabel("t-SNE Dimension 1", fontsize=12)
    plt.ylabel("t-SNE Dimension 2", fontsize=12)
    plt.tight_layout()
    plt.xlim(*xlim)
    plt.ylim(*ylim)
    plt.savefig('your_plot_image.png')
```

- **Scatter Plot:** Each sentence is shown as a colored, shaped marker.  
- **Annotations:** The first 20 characters of each sentence appear next to its point.  
- **Legend:** Empty markers illustrate which color/shape corresponds to each topic.  
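
The default `xlim` and `ylim` above are fixed numbers, while t-SNE coordinates change with the data, the perplexity, and the random seed, so some points could fall outside the visible area. One possible remedy, sketched here with a hypothetical helper `padded_limits`, is to derive the limits from the coordinates themselves:

```python
import numpy as np

def padded_limits(values, pad_fraction=0.1):
    """Return (low, high) limits that cover all values plus a margin."""
    low, high = float(np.min(values)), float(np.max(values))
    span = high - low
    # Fall back to a fixed margin if all values are identical.
    pad = span * pad_fraction if span > 0 else 1.0
    return low - pad, high + pad

# Made-up 2D coordinates standing in for t-SNE output.
points = np.array([[-40.0, 10.0], [25.0, -60.0], [5.0, 80.0]])
xlim = padded_limits(points[:, 0])
ylim = padded_limits(points[:, 1])
print(xlim, ylim)
```

The resulting tuples can be passed as the `xlim` and `ylim` arguments of `plot_embeddings`, which leaves room for the text labels on either side of the outermost points.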

---

## Interpreting the t-SNE Plot

The resulting t-SNE visualization reveals how sentence embeddings capture semantic similarity:

- **Distinct Clusters:**  
  - **NLP** (red circles)  
  - **ML** (blue squares)  
  - **Food** (green triangles)  
  - **Weather** (purple X’s)  

- **Cluster Proximity:**  
  - NLP and ML clusters tend to be closer, reflecting related technical content.  
  - Food and Weather are more distant, as they share fewer semantic overlaps.  

- **Overlap & Insights:**  
  - Some sentences (e.g., “GPT models have billions of parameters.”) may lie between NLP and ML clusters, highlighting shared vocabulary.  
  - The plot helps you spot outliers or subtopic distinctions at a glance.

This view suggests that the embeddings preserve meaningful relationships, which is crucial for debugging and improving RAG retrieval.
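
A plot is a qualitative check. As a complementary, numeric sanity check, one could compare the average cosine similarity of sentence pairs within a category against pairs from different categories, computed on the original high-dimensional embeddings. The sketch below is an illustration only: `category_similarity_gap` is a hypothetical helper, and the toy vectors stand in for real `model.encode(...)` output.

```python
import numpy as np

def category_similarity_gap(embeddings, categories):
    """Mean cosine similarity within categories, across categories, and their gap."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    labels = np.asarray(categories)
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)  # ignore self-similarity
    within = sims[same & off_diagonal].mean()
    across = sims[~same].mean()
    return within, across, within - across

rng = np.random.default_rng(0)
# Two toy "topics" pointing along different axes, plus a little noise.
topic_directions = 3.0 * np.eye(8)[:2]
toy_embeddings = np.vstack([
    topic_directions[0] + 0.1 * rng.normal(size=(4, 8)),
    topic_directions[1] + 0.1 * rng.normal(size=(4, 8)),
])
toy_categories = ["A"] * 4 + ["B"] * 4

within, across, gap = category_similarity_gap(toy_embeddings, toy_categories)
print(f"within={within:.3f} across={across:.3f} gap={gap:.3f}")
```

A clearly positive gap indicates that same-topic sentences are closer to each other than to other topics, which is the property the t-SNE plot shows visually.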

---

## Conclusion and Next Steps

You’ve now:

1. Represented text data with high-dimensional embeddings.  
2. Used t-SNE to reduce embeddings to 2D.  
3. Visualized clusters to uncover semantic relationships.

Equipped with this workflow, you can explore your own text datasets, experiment with t-SNE parameters, and use the resulting plots to gain deeper insights. In the next practice session, you’ll get hands-on experience modifying code and observing how different settings affect clustering.

> **Give it a try**, and have fun discovering the hidden patterns in your text!

## Visualize Sentence Clusters

Great job on mastering sentence embeddings! Now, let's bring those concepts to life by visualizing them in action.

Your task is simple: run the provided code to generate a scatter plot that reveals how different sentence clusters form, and observe how sentences with similar themes naturally group together.

Feel free to experiment with the code! Tweak parameters, change sentences, or adjust categories to see how these changes affect the visualization. This hands-on exploration will deepen your understanding of how embeddings capture semantic relationships. Dive in and enjoy uncovering the hidden patterns in your data!


```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer


def get_sentences_and_categories():
    """Return the sentences and their corresponding categories."""
    sentences = [
        # Topic: NLP
        "RAG stands for Retrieval-Augmented Generation.",
        "Retrieval is a crucial aspect of modern NLP systems.",
        "Generating text with correct facts is challenging.",
        "Large language models can generate coherent text.",
        "GPT models have billions of parameters.",
        "Natural Language Processing enables computers to understand human language.",
        "Word embeddings capture semantic relationships between words.",
        "Transformer architectures revolutionized NLP research.",
        
        # Topic: Machine Learning
        "Machine learning benefits from large datasets.",
        "Supervised learning requires labeled data.",
        "Reinforcement learning is inspired by behavioral psychology.",
        "Neural networks can learn complex functions.",
        "Overfitting is a common problem in ML.",
        "Unsupervised learning uncovers hidden patterns in data.",
        "Feature engineering is critical for model performance.",
        "Cross-validation helps in assessing model generalization.",
        
        # Topic: Food
        "Bananas are commonly used in smoothies.",
        "Oranges are rich in vitamin C.",
        "Pizza is a popular Italian dish.",
        "Cooking pasta requires boiling water.",
        "Chocolate can be sweet or bitter.",
        "Fresh salads are a healthy and refreshing meal.",
        "Sushi combines rice, fish, and seaweed in a delicate balance.",
        "Spices can transform simple ingredients into gourmet dishes.",
        
        # Topic: Weather
        "It often rains in the Amazon rainforest.",
        "Summers can be very hot in the desert.",
        "Hurricanes form over warm ocean waters.",
        "Snowstorms can disrupt transportation.",
        "A sunny day can lift people's mood.",
        "Foggy mornings are common in coastal regions.",
        "Winter brings frosty nights and chilly winds.",
        "Thunderstorms can produce lightning and heavy rain."
    ]
    
    categories = (["NLP"] * 8 + ["ML"] * 8 + ["Food"] * 8 + ["Weather"] * 8)
    return sentences, categories


def get_color_and_shape_maps():
    """Return color and marker maps for each category."""
    color_map = {
        "NLP": "red",
        "ML": "blue",
        "Food": "green",
        "Weather": "purple"
    }
    shape_map = {
        "NLP": "o",
        "ML": "s",
        "Food": "^",
        "Weather": "X"
    }
    return color_map, shape_map


def compute_tsne_embeddings(sentences, model_name="sentence-transformers/all-MiniLM-L6-v2",
                            perplexity=10, max_iter=3000, random_state=42):
    """Compute and return t-SNE reduced embeddings for the given sentences."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    tsne = TSNE(n_components=2, random_state=random_state,
                perplexity=perplexity, max_iter=max_iter)
    return tsne.fit_transform(embeddings)


def plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map):
    """Plot the 2D embeddings with labels and a legend."""
    plt.figure(figsize=(10, 8))
    for i, (sentence, category) in enumerate(zip(sentences, categories)):
        x, y = reduced_embeddings[i]
        plt.scatter(x, y, color=color_map[category], marker=shape_map[category])
        # Display only the first 20 characters for clarity
        plt.text(x - 2.5, y - 1.0, sentence[:20] + "...", fontsize=9)
    
    # Add an empty scatter for each category to create the legend
    for cat, color in color_map.items():
        plt.scatter([], [], color=color, label=cat, marker=shape_map[cat])
    plt.legend(loc="best")

    plt.title("t-SNE Visualization of Sentence Embeddings", fontsize=14)
    plt.xlabel("t-SNE Dimension 1", fontsize=12)
    plt.ylabel("t-SNE Dimension 2", fontsize=12)
    plt.tight_layout()
    plt.savefig('static/images/plot.png', bbox_inches='tight')

if __name__ == "__main__":
    sentences, categories = get_sentences_and_categories()
    color_map, shape_map = get_color_and_shape_maps()
    reduced_embeddings = compute_tsne_embeddings(sentences)
    plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map)


```

## Explore t-SNE Perplexity Effects

You've gained a solid understanding of generating and reducing sentence embeddings. Now, let's apply that knowledge to troubleshoot and refine a script built around the `compute_tsne_embeddings` function.

The code below contains a bug that makes the embedding visualization come out badly. Your task is to identify and fix whatever is causing the poor plot.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

def get_sentences_and_categories():
    """Return the sentences and their corresponding categories."""
    sentences = [
        # Topic: NLP
        "RAG stands for Retrieval-Augmented Generation.",
        "Retrieval is a crucial aspect of modern NLP systems.",
        "Generating text with correct facts is challenging.",
        "Large language models can generate coherent text.",
        "GPT models have billions of parameters.",
        "Natural Language Processing enables computers to understand human language.",
        "Word embeddings capture semantic relationships between words.",
        "Transformer architectures revolutionized NLP research.",
        
        # Topic: Machine Learning
        "Machine learning benefits from large datasets.",
        "Supervised learning requires labeled data.",
        "Reinforcement learning is inspired by behavioral psychology.",
        "Neural networks can learn complex functions.",
        "Overfitting is a common problem in ML.",
        "Unsupervised learning uncovers hidden patterns in data.",
        "Feature engineering is critical for model performance.",
        "Cross-validation helps in assessing model generalization.",
        
        # Topic: Food
        "Bananas are commonly used in smoothies.",
        "Oranges are rich in vitamin C.",
        "Pizza is a popular Italian dish.",
        "Cooking pasta requires boiling water.",
        "Chocolate can be sweet or bitter.",
        "Fresh salads are a healthy and refreshing meal.",
        "Sushi combines rice, fish, and seaweed in a delicate balance.",
        "Spices can transform simple ingredients into gourmet dishes.",
        
        # Topic: Weather
        "It often rains in the Amazon rainforest.",
        "Summers can be very hot in the desert.",
        "Hurricanes form over warm ocean waters.",
        "Snowstorms can disrupt transportation.",
        "A sunny day can lift people's mood.",
        "Foggy mornings are common in coastal regions.",
        "Winter brings frosty nights and chilly winds.",
        "Thunderstorms can produce lightning and heavy rain."
    ]
    
    categories = (["NLP"] * 8 + ["ML"] * 8 + ["Food"] * 8 + ["Weather"] * 8)
    return sentences, categories

def get_color_and_shape_maps():
    """Return color and marker maps for each category."""
    color_map = {
        "NLP": "red",
        "ML": "blue",
        "Food": "green",
        "Weather": "purple"
    }
    shape_map = {
        "NLP": "o",
        "ML": "s",
        "Food": "^",
        "Weather": "X"
    }
    return color_map, shape_map

def compute_tsne_embeddings(sentences, model_name="sentence-transformers/all-MiniLM-L6-v2",
                            perplexity=10, max_iter=3000, random_state=42):
    """Compute and return t-SNE reduced embeddings for the given sentences."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    tsne = TSNE(n_components=2, random_state=random_state,
                perplexity=perplexity, max_iter=max_iter)
    return tsne.fit_transform(embeddings)

def plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map):
    """Plot the 2D embeddings with labels and a legend."""
    plt.figure(figsize=(10, 8))
    for i, (sentence, category) in enumerate(zip(sentences, categories)):
        x, y = reduced_embeddings[i]
        plt.scatter(x, y, color=color_map[category], marker=shape_map[category])
        plt.text(x - 2.5, y - 1, sentence[:20] + "...", fontsize=9)
    
    for cat, color in color_map.items():
        plt.scatter([], [], color=color, label=cat, marker=shape_map[cat])
    plt.legend(loc="best")

    plt.title("t-SNE Visualization of Sentence Embeddings", fontsize=14)
    plt.xlabel("t-SNE Dimension 1", fontsize=12)
    plt.ylabel("t-SNE Dimension 2", fontsize=12)
    plt.tight_layout()
    plt.savefig('static/images/plot.png')

if __name__ == "__main__":
    sentences, categories = get_sentences_and_categories()
    color_map, shape_map = get_color_and_shape_maps()
    
    perplexity = 0.1
    max_iter = 250
    reduced_embeddings = compute_tsne_embeddings(sentences, perplexity=perplexity, max_iter=max_iter)
    plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map)


```

Here are the key issues causing your t-SNE plot to look “off,” and how to fix them:

1. **Perplexity too low**  
   You set `perplexity = 0.1`, but t-SNE’s “perplexity” really should lie between **5 and 50**. Perplexity acts like an effective neighbor count, and no neighbor distribution can have a perplexity below 1, so a value of 0.1 produces an essentially meaningless layout.

2. **Too few iterations**  
   You passed only `max_iter = 250`. t-SNE typically needs **≥ 1 000** iterations to converge to a useful layout.

3. **Version-dependent parameter name**  
   scikit-learn 1.5 renamed the TSNE argument `n_iter` to `max_iter` and deprecated the old name (scheduled for removal in 1.7). The call  
   ```python
   TSNE(..., max_iter=max_iter)
   ```  
   is therefore correct on scikit-learn 1.5 and later, and raises a `TypeError` on older releases, where the argument is still called `n_iter`. An unrecognized keyword is never silently ignored, so if the script runs at all, the 250-iteration limit really is in effect.

4. **Initialization**  
   Before scikit-learn 1.2, t-SNE started from a random layout by default; since 1.2 the default is `init='pca'`. Setting `init='pca'` explicitly often yields faster, more stable embeddings and keeps behavior consistent across versions.
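
Because the name of the iteration argument depends on the installed scikit-learn release (`n_iter` in older versions, `max_iter` in newer ones), code that must run in several environments can look the name up at runtime. The helper `tsne_iteration_kwarg` below is a hypothetical convenience, not part of scikit-learn; it only relies on the standard-library `inspect` module.

```python
import inspect
from sklearn.manifold import TSNE

def tsne_iteration_kwarg(iterations):
    """Return {'max_iter': n} or {'n_iter': n}, whichever this scikit-learn accepts."""
    params = inspect.signature(TSNE.__init__).parameters
    # Prefer the newer name when both are present (deprecation window).
    name = "max_iter" if "max_iter" in params else "n_iter"
    return {name: iterations}

iteration_kwarg = tsne_iteration_kwarg(1000)
tsne = TSNE(n_components=2, perplexity=10, random_state=42, **iteration_kwarg)
print(iteration_kwarg)
```

Printing `sklearn.__version__` alongside this is a quick way to confirm which branch was taken.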

---

### 🔧 Fixed `compute_tsne_embeddings`

```python
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

def compute_tsne_embeddings(
    sentences,
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    perplexity=10,
    max_iter=1000,
    random_state=42
):
    """
    Compute and return 2D t-SNE embeddings for the given sentences,
    with PCA-based initialization and sensible defaults.
    """
    # 1. Embed sentences
    model = SentenceTransformer(model_name)
    high_dim = model.encode(sentences)

    # 2. (Optional) Pre-reduce to speed up t-SNE and remove noise.
    #    PCA cannot keep more components than min(n_samples, n_features),
    #    so cap the usual 50 for small datasets such as our 32 sentences.
    n_components = min(50, *high_dim.shape)
    pca = PCA(n_components=n_components, random_state=random_state)
    reduced = pca.fit_transform(high_dim)

    # 3. t-SNE with proper params
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        max_iter=max_iter,  # named n_iter before scikit-learn 1.5
        init='pca',
        random_state=random_state
    )

    return tsne.fit_transform(reduced)
```

---

### 🚀 How to run it

```python
if __name__ == "__main__":
    sentences, categories = get_sentences_and_categories()
    color_map, shape_map = get_color_and_shape_maps()

    # Use a reasonable perplexity and enough iterations
    reduced = compute_tsne_embeddings(
        sentences,
        perplexity=10,   # well below the 32 samples; 5-50 is typical for larger sets
        max_iter=1000    # or more, for convergence
    )

    plot_embeddings(reduced, sentences, categories, color_map, shape_map)
```

With those changes you’ll get much cleaner, well-separated clusters that actually reflect your four topics.

## Adding a New Category

You've gained a solid understanding of generating and reducing sentence embeddings. Now, let's apply that knowledge to troubleshoot and refine a script built around the `compute_tsne_embeddings` function.

The code below contains a bug that makes the embedding visualization come out badly. Your task is to identify and fix whatever is causing the poor plot.


```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

def get_sentences_and_categories():
    """Return the sentences and their corresponding categories."""
    sentences = [
        # Topic: NLP
        "RAG stands for Retrieval-Augmented Generation.",
        "Retrieval is a crucial aspect of modern NLP systems.",
        "Generating text with correct facts is challenging.",
        "Large language models can generate coherent text.",
        "GPT models have billions of parameters.",
        "Natural Language Processing enables computers to understand human language.",
        "Word embeddings capture semantic relationships between words.",
        "Transformer architectures revolutionized NLP research.",
        
        # Topic: Machine Learning
        "Machine learning benefits from large datasets.",
        "Supervised learning requires labeled data.",
        "Reinforcement learning is inspired by behavioral psychology.",
        "Neural networks can learn complex functions.",
        "Overfitting is a common problem in ML.",
        "Unsupervised learning uncovers hidden patterns in data.",
        "Feature engineering is critical for model performance.",
        "Cross-validation helps in assessing model generalization.",
        
        # Topic: Food
        "Bananas are commonly used in smoothies.",
        "Oranges are rich in vitamin C.",
        "Pizza is a popular Italian dish.",
        "Cooking pasta requires boiling water.",
        "Chocolate can be sweet or bitter.",
        "Fresh salads are a healthy and refreshing meal.",
        "Sushi combines rice, fish, and seaweed in a delicate balance.",
        "Spices can transform simple ingredients into gourmet dishes.",
        
        # Topic: Weather
        "It often rains in the Amazon rainforest.",
        "Summers can be very hot in the desert.",
        "Hurricanes form over warm ocean waters.",
        "Snowstorms can disrupt transportation.",
        "A sunny day can lift people's mood.",
        "Foggy mornings are common in coastal regions.",
        "Winter brings frosty nights and chilly winds.",
        "Thunderstorms can produce lightning and heavy rain."
    ]
    
    categories = (["NLP"] * 8 + ["ML"] * 8 + ["Food"] * 8 + ["Weather"] * 8)
    return sentences, categories

def get_color_and_shape_maps():
    """Return color and marker maps for each category."""
    color_map = {
        "NLP": "red",
        "ML": "blue",
        "Food": "green",
        "Weather": "purple"
    }
    shape_map = {
        "NLP": "o",
        "ML": "s",
        "Food": "^",
        "Weather": "X"
    }
    return color_map, shape_map

def compute_tsne_embeddings(sentences, model_name="sentence-transformers/all-MiniLM-L6-v2",
                            perplexity=10, max_iter=3000, random_state=42):
    """Compute and return t-SNE reduced embeddings for the given sentences."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    tsne = TSNE(n_components=2, random_state=random_state,
                perplexity=perplexity, max_iter=max_iter)
    return tsne.fit_transform(embeddings)

def plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map):
    """Plot the 2D embeddings with labels and a legend."""
    plt.figure(figsize=(10, 8))
    for i, (sentence, category) in enumerate(zip(sentences, categories)):
        x, y = reduced_embeddings[i]
        plt.scatter(x, y, color=color_map[category], marker=shape_map[category])
        plt.text(x - 2.5, y - 1, sentence[:20] + "...", fontsize=9)
    
    for cat, color in color_map.items():
        plt.scatter([], [], color=color, label=cat, marker=shape_map[cat])
    plt.legend(loc="best")

    plt.title("t-SNE Visualization of Sentence Embeddings", fontsize=14)
    plt.xlabel("t-SNE Dimension 1", fontsize=12)
    plt.ylabel("t-SNE Dimension 2", fontsize=12)
    plt.tight_layout()
    plt.savefig('static/images/plot.png')

if __name__ == "__main__":
    sentences, categories = get_sentences_and_categories()
    color_map, shape_map = get_color_and_shape_maps()
    
    perplexity = 0.1
    max_iter = 250
    reduced_embeddings = compute_tsne_embeddings(sentences, perplexity=perplexity, max_iter=max_iter)
    plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map)



```

Here are the main issues I see that will wreck t-SNE’s ability to pull apart your four topics:

1. **Version-dependent parameter name**  
   You’re passing `max_iter` to `TSNE(...)`. That is the correct argument on scikit-learn 1.5 and later; earlier releases call it `n_iter` and raise a `TypeError` for `max_iter`. Either way the argument is never silently ignored, so check `sklearn.__version__` and use the name your release expects.  

2. **Perplexity way too low**  
   You set `perplexity = 0.1`. Good practice is **5 ≤ perplexity ≤ 50** (and ≤ `(n_samples−1)/3`). With 32 points, something like 5–10 is a better starting point.

3. **Too few iterations**  
   Whatever the argument is called in your version, 250 iterations is usually not enough to converge. I’d bump that to at least 1 000.

4. **Initialization**  
   Before scikit-learn 1.2, t-SNE started from a random 2D layout by default, which can make early clusters look like noise. Newer releases default to `init='pca'`; setting it explicitly gives a sensible starting arrangement on any version.

5. **No embedding normalization or PCA pre-reduction**  
   Since scikit-learn's t-SNE uses Euclidean distances by default, it often helps to (a) L2-normalize your sentence embeddings if you really care about *cosine* similarity, and/or (b) do a quick PCA to, say, 50D first (this denoises and speeds up the Barnes-Hut approximation).
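
The perplexity guidance in item 2 can be turned into a small guard that runs before t-SNE is called. This is a sketch under stated assumptions: `clamp_perplexity` is a hypothetical helper, the lower bound of 5 and the `(n_samples − 1) / 3` ceiling are the rules of thumb quoted above, and scikit-learn itself only requires the perplexity to be smaller than the number of samples.

```python
def clamp_perplexity(requested, n_samples, floor=5.0):
    """Keep perplexity between `floor` and (n_samples - 1) / 3 (rule of thumb)."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    ceiling = (n_samples - 1) / 3
    # For very small datasets the ceiling can drop below the usual floor.
    lower = min(floor, ceiling)
    return min(max(requested, lower), ceiling)

print(clamp_perplexity(0.1, n_samples=32))  # raised to the floor of 5.0
print(clamp_perplexity(30, n_samples=32))   # capped at 31 / 3, about 10.33
print(clamp_perplexity(10, n_samples=32))   # already in range, unchanged
```

Calling `compute_tsne_embeddings(sentences, perplexity=clamp_perplexity(p, len(sentences)))` would then reject neither too-small nor too-large values silently producing a poor layout.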

---

### Revised `compute_tsne_embeddings`

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA         # ← add this
from sklearn.preprocessing import normalize  # ← and this
from sentence_transformers import SentenceTransformer

def get_sentences_and_categories():
    """Return the sentences and their corresponding categories."""
    sentences = [
        # Topic: NLP
        "RAG stands for Retrieval-Augmented Generation.",
        "Retrieval is a crucial aspect of modern NLP systems.",
        "Generating text with correct facts is challenging.",
        "Large language models can generate coherent text.",
        "GPT models have billions of parameters.",
        "Natural Language Processing enables computers to understand human language.",
        "Word embeddings capture semantic relationships between words.",
        "Transformer architectures revolutionized NLP research.",
        
        # Topic: Machine Learning
        "Machine learning benefits from large datasets.",
        "Supervised learning requires labeled data.",
        "Reinforcement learning is inspired by behavioral psychology.",
        "Neural networks can learn complex functions.",
        "Overfitting is a common problem in ML.",
        "Unsupervised learning uncovers hidden patterns in data.",
        "Feature engineering is critical for model performance.",
        "Cross-validation helps in assessing model generalization.",
        
        # Topic: Food
        "Bananas are commonly used in smoothies.",
        "Oranges are rich in vitamin C.",
        "Pizza is a popular Italian dish.",
        "Cooking pasta requires boiling water.",
        "Chocolate can be sweet or bitter.",
        "Fresh salads are a healthy and refreshing meal.",
        "Sushi combines rice, fish, and seaweed in a delicate balance.",
        "Spices can transform simple ingredients into gourmet dishes.",
        
        # Topic: Weather
        "It often rains in the Amazon rainforest.",
        "Summers can be very hot in the desert.",
        "Hurricanes form over warm ocean waters.",
        "Snowstorms can disrupt transportation.",
        "A sunny day can lift people's mood.",
        "Foggy mornings are common in coastal regions.",
        "Winter brings frosty nights and chilly winds.",
        "Thunderstorms can produce lightning and heavy rain."
    ]
    
    categories = (["NLP"] * 8 + ["ML"] * 8 + ["Food"] * 8 + ["Weather"] * 8)
    return sentences, categories

def get_color_and_shape_maps():
    """Return color and marker maps for each category."""
    color_map = {
        "NLP": "red",
        "ML": "blue",
        "Food": "green",
        "Weather": "purple"
    }
    shape_map = {
        "NLP": "o",
        "ML": "s",
        "Food": "^",
        "Weather": "X"
    }
    return color_map, shape_map

def compute_tsne_embeddings(
    sentences,
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    perplexity: float = 10.0,
    n_iter: int = 1000,
    random_state: int = 42
):
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    
    # Normalize so Euclidean ≈ Cosine
    embeddings = normalize(embeddings, axis=1)

    # Cap PCA components by samples & features
    n_samples, n_features = embeddings.shape
    pca_dim = min(50, n_samples, n_features)
    pca = PCA(n_components=pca_dim, random_state=random_state)
    reduced = pca.fit_transform(embeddings)

    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        max_iter=n_iter,  # scikit-learn >= 1.5 calls this max_iter (n_iter is deprecated there)
        init='pca',
        random_state=random_state,
        metric='euclidean'
    )
    return tsne.fit_transform(reduced)

def plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map):
    """Plot the 2D embeddings with labels and a legend."""
    plt.figure(figsize=(10, 8))
    for i, (sentence, category) in enumerate(zip(sentences, categories)):
        x, y = reduced_embeddings[i]
        plt.scatter(x, y, color=color_map[category], marker=shape_map[category])
        plt.text(x - 2.5, y - 1, sentence[:20] + "...", fontsize=9)
    
    for cat, color in color_map.items():
        plt.scatter([], [], color=color, label=cat, marker=shape_map[cat])
    plt.legend(loc="best")

    plt.title("t-SNE Visualization of Sentence Embeddings", fontsize=14)
    plt.xlabel("t-SNE Dimension 1", fontsize=12)
    plt.ylabel("t-SNE Dimension 2", fontsize=12)
    plt.tight_layout()
    plt.savefig('static/images/plot.png')

if __name__ == "__main__":
    sentences, categories = get_sentences_and_categories()
    color_map, shape_map = get_color_and_shape_maps()

    # Pick a reasonable perplexity (e.g. 5–15 for ~30 points) and enough iterations:
    reduced_embeddings = compute_tsne_embeddings(
        sentences,
        perplexity=10.0,
        n_iter=1000
    )

    plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map)
    plt.show()
```

---

#### Why this helps

- **`n_iter=1000`** gives t-SNE enough time to settle; you can even push to 2 000–3 000 if you like.  
- **`perplexity=10`** balances local vs. global structure nicely with 32 samples.  
- **`init='pca'`** means you don’t start from pure noise.  
- **PCA → t-SNE** often yields cleaner clusters and runs faster.  
- **Normalization** lets the Euclidean metric approximate cosine similarity, which is what your sentence-transformers embeddings are really optimized for.

With these changes you should see four fairly distinct “islands” in your plot, one per category, instead of a noisy blob. NLP and ML may still sit close together, since those topics overlap.
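To make the normalization point concrete, here is a minimal numeric check. It assumes only NumPy and uses random toy vectors rather than real sentence embeddings; the helper name `l2_normalize` is illustrative. For unit-length vectors, squared Euclidean distance equals 2 − 2 × cosine similarity, so ranking neighbors by Euclidean distance gives the same order as ranking by cosine similarity.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (hypothetical helper, mirrors L2 row normalization)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

# Toy vectors standing in for embeddings (illustrative values only).
rng = np.random.default_rng(0)
toy = l2_normalize(rng.normal(size=(5, 8)))

# For unit vectors, the dot product is the cosine similarity.
cosine_sim = toy @ toy.T

# Pairwise squared Euclidean distances via broadcasting.
diff = toy[:, None, :] - toy[None, :, :]
sq_euclidean = (diff ** 2).sum(axis=-1)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(sq_euclidean, 2.0 - 2.0 * cosine_sim)
print("identity holds:", identity_holds)
```

Without the normalization step the identity generally fails, which is why the Euclidean metric alone is not a stand-in for cosine similarity on raw embeddings.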

Great job visualizing sentence embeddings with t-SNE!

Now, your task is to add a new category of your choice to the dataset. You'll need to:

- Add some sentences related to a topic of your choice
- Update the `get_sentences_and_categories` function to include your new category
- Adjust the color and shape maps to give your category a distinct appearance
- Run the code to see how your new sentences cluster with the existing ones

This is your chance to experiment! Feel free to try different categories or even multiple categories to observe how semantically similar content tends to cluster together. Pay attention to cases where your new categories might overlap with existing ones - this can reveal interesting semantic relationships.
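When you add a category, it is easy to end up with mismatched inputs: a missing comma between two string literals silently joins them into one sentence, and a forgotten map entry raises a `KeyError` only once plotting starts. The sketch below shows one possible sanity check to run before computing embeddings. The helper `validate_dataset` and the tiny demo inputs are hypothetical and not part of the lesson's code.

```python
def validate_dataset(sentences, categories, color_map, shape_map):
    """Return a list of problems found; an empty list means the inputs line up."""
    problems = []
    # Every sentence needs exactly one category label.
    if len(sentences) != len(categories):
        problems.append(
            f"{len(sentences)} sentences but {len(categories)} category labels"
        )
    # Every category that appears needs both a color and a marker.
    for cat in sorted(set(categories)):
        if cat not in color_map:
            problems.append(f"no color for category '{cat}'")
        if cat not in shape_map:
            problems.append(f"no marker for category '{cat}'")
    return problems

# Tiny illustrative inputs (not the lesson's dataset).
demo_sentences = ["Soccer is popular.", "Pizza is tasty."]
demo_categories = ["Sports", "Food"]
demo_colors = {"Food": "green"}                # "Sports" deliberately missing
demo_shapes = {"Food": "^", "Sports": "D"}

demo_problems = validate_dataset(
    demo_sentences, demo_categories, demo_colors, demo_shapes
)
print(demo_problems)
```

In the exercise you could call it as `validate_dataset(sentences, categories, color_map, shape_map)` right after building the inputs and stop if the returned list is not empty.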


```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

def get_sentences_and_categories():
    """Return the sentences and their corresponding categories."""
    sentences = [
        # Topic: NLP
        "RAG stands for Retrieval-Augmented Generation.",
        "Retrieval is a crucial aspect of modern NLP systems.",
        "Generating text with correct facts is challenging.",
        "Large language models can generate coherent text.",
        "GPT models have billions of parameters.",
        "Natural Language Processing enables computers to understand human language.",
        "Word embeddings capture semantic relationships between words.",
        "Transformer architectures revolutionized NLP research.",
        
        # Topic: Machine Learning
        "Machine learning benefits from large datasets.",
        "Supervised learning requires labeled data.",
        "Reinforcement learning is inspired by behavioral psychology.",
        "Neural networks can learn complex functions.",
        "Overfitting is a common problem in ML.",
        "Unsupervised learning uncovers hidden patterns in data.",
        "Feature engineering is critical for model performance.",
        "Cross-validation helps in assessing model generalization.",
        
        # Topic: Food
        "Bananas are commonly used in smoothies.",
        "Oranges are rich in vitamin C.",
        "Pizza is a popular Italian dish.",
        "Cooking pasta requires boiling water.",
        "Chocolate can be sweet or bitter.",
        "Fresh salads are a healthy and refreshing meal.",
        "Sushi combines rice, fish, and seaweed in a delicate balance.",
        "Spices can transform simple ingredients into gourmet dishes.",
        
        # Topic: Weather
        "It often rains in the Amazon rainforest.",
        "Summers can be very hot in the desert.",
        "Hurricanes form over warm ocean waters.",
        "Snowstorms can disrupt transportation.",
        "A sunny day can lift people's mood.",
        "Foggy mornings are common in coastal regions.",
        "Winter brings frosty nights and chilly winds.",
        "Thunderstorms can produce lightning and heavy rain.",
        
        # TODO: Add your new category sentences here
    ]
    
    categories = (["NLP"] * 8 + ["ML"] * 8 + ["Food"] * 8 + ["Weather"] * 8)
    # TODO: Update categories list with your new category
    return sentences, categories

def get_color_and_shape_maps():
    """Return color and marker maps for each category."""
    color_map = {
        "NLP": "red",
        "ML": "blue",
        "Food": "green",
        "Weather": "purple",
        # TODO: Add your new category color here
    }
    shape_map = {
        "NLP": "o",
        "ML": "s",
        "Food": "^",
        "Weather": "X",
        # TODO: Add your new category shape here
    }
    return color_map, shape_map

def compute_tsne_embeddings(sentences, model_name="sentence-transformers/all-MiniLM-L6-v2",
                            perplexity=10, max_iter=3000, random_state=42):
    """Compute and return t-SNE reduced embeddings for the given sentences."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    tsne = TSNE(n_components=2, random_state=random_state,
                perplexity=perplexity, max_iter=max_iter)
    return tsne.fit_transform(embeddings)

def plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map):
    """Plot the 2D embeddings with labels and a legend."""
    plt.figure(figsize=(10, 8))
    for i, (sentence, category) in enumerate(zip(sentences, categories)):
        x, y = reduced_embeddings[i]
        plt.scatter(x, y, color=color_map[category], marker=shape_map[category])
        plt.text(x - 2.5, y - 1.0, sentence[:20] + "...", fontsize=9)
    
    for cat, color in color_map.items():
        plt.scatter([], [], color=color, label=cat, marker=shape_map[cat])
    plt.legend(loc="best")

    plt.title("t-SNE Visualization of Sentence Embeddings", fontsize=14)
    plt.xlabel("t-SNE Dimension 1", fontsize=12)
    plt.ylabel("t-SNE Dimension 2", fontsize=12)
    plt.tight_layout()
    plt.savefig('static/images/plot.png')

if __name__ == "__main__":
    sentences, categories = get_sentences_and_categories()
    color_map, shape_map = get_color_and_shape_maps()
    reduced_embeddings = compute_tsne_embeddings(sentences)
    plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map)



```

Here’s an example adding a **“Sports”** category—you can, of course, swap in any topic you like. I’ve added eight sports-related sentences, updated the category list, and chosen **orange** diamonds (“D”) so they stand out:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

def get_sentences_and_categories():
    """Return the sentences and their corresponding categories."""
    sentences = [
        # Topic: NLP
        "RAG stands for Retrieval-Augmented Generation.",
        "Retrieval is a crucial aspect of modern NLP systems.",
        "Generating text with correct facts is challenging.",
        "Large language models can generate coherent text.",
        "GPT models have billions of parameters.",
        "Natural Language Processing enables computers to understand human language.",
        "Word embeddings capture semantic relationships between words.",
        "Transformer architectures revolutionized NLP research.",
        
        # Topic: Machine Learning
        "Machine learning benefits from large datasets.",
        "Supervised learning requires labeled data.",
        "Reinforcement learning is inspired by behavioral psychology.",
        "Neural networks can learn complex functions.",
        "Overfitting is a common problem in ML.",
        "Unsupervised learning uncovers hidden patterns in data.",
        "Feature engineering is critical for model performance.",
        "Cross-validation helps in assessing model generalization.",
        
        # Topic: Food
        "Bananas are commonly used in smoothies.",
        "Oranges are rich in vitamin C.",
        "Pizza is a popular Italian dish.",
        "Cooking pasta requires boiling water.",
        "Chocolate can be sweet or bitter.",
        "Fresh salads are a healthy and refreshing meal.",
        "Sushi combines rice, fish, and seaweed in a delicate balance.",
        "Spices can transform simple ingredients into gourmet dishes.",
        
        # Topic: Weather
        "It often rains in the Amazon rainforest.",
        "Summers can be very hot in the desert.",
        "Hurricanes form over warm ocean waters.",
        "Snowstorms can disrupt transportation.",
        "A sunny day can lift people's mood.",
        "Foggy mornings are common in coastal regions.",
        "Winter brings frosty nights and chilly winds.",
        "Thunderstorms can produce lightning and heavy rain.",
        
        # Topic: Sports
        "Soccer matches attract millions of fans worldwide.",
        "Basketball players require agility and height.",
        "Tennis matches can last over five hours.",
        "Marathons test endurance more than speed.",
        "Olympic athletes train years for their events.",
        "Swimmers compete in lanes within a pool.",
        "Cricket matches can span multiple days.",
        "Formula 1 cars can exceed 200 mph."
    ]
    
    categories = (
        ["NLP"] * 8 +
        ["ML"] * 8 +
        ["Food"] * 8 +
        ["Weather"] * 8 +
        ["Sports"] * 8
    )
    return sentences, categories

def get_color_and_shape_maps():
    """Return color and marker maps for each category."""
    color_map = {
        "NLP": "red",
        "ML": "blue",
        "Food": "green",
        "Weather": "purple",
        "Sports": "orange"      # new category in orange
    }
    shape_map = {
        "NLP": "o",
        "ML": "s",
        "Food": "^",
        "Weather": "X",
        "Sports": "D"          # diamond marker
    }
    return color_map, shape_map

def compute_tsne_embeddings(sentences, model_name="sentence-transformers/all-MiniLM-L6-v2",
                            perplexity=10, max_iter=3000, random_state=42):
    """Compute and return t-SNE reduced embeddings for the given sentences."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    tsne = TSNE(n_components=2, random_state=random_state,
                perplexity=perplexity, max_iter=max_iter, init='pca')
    return tsne.fit_transform(embeddings)

def plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map):
    """Plot the 2D embeddings with labels and a legend."""
    plt.figure(figsize=(10, 8))
    for i, (sentence, category) in enumerate(zip(sentences, categories)):
        x, y = reduced_embeddings[i]
        plt.scatter(x, y, color=color_map[category], marker=shape_map[category])
        plt.text(x - 2.5, y - 1.0, sentence[:20] + "...", fontsize=9)
    
    # legend handles
    for cat in color_map:
        plt.scatter([], [], color=color_map[cat], marker=shape_map[cat], label=cat)
    plt.legend(loc="best")

    plt.title("t-SNE Visualization of Sentence Embeddings", fontsize=14)
    plt.xlabel("t-SNE Dimension 1", fontsize=12)
    plt.ylabel("t-SNE Dimension 2", fontsize=12)
    plt.tight_layout()
    plt.savefig('static/images/plot.png')
    plt.show()

if __name__ == "__main__":
    sentences, categories = get_sentences_and_categories()
    color_map, shape_map = get_color_and_shape_maps()
    reduced_embeddings = compute_tsne_embeddings(sentences)
    plot_embeddings(reduced_embeddings, sentences, categories, color_map, shape_map)
```

**Next steps:**
1. Save and run this script.
2. Observe where the orange “Sports” points land relative to NLP, ML, Food, and Weather.
3. If you spot overlap (e.g. “Olympic athletes train years for their events” might sit near the ML sentences because of the shared notion of “training”), that’s a neat semantic signal!
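Spotting overlap by eye gets harder as the plot fills up. As a rough aid, the sketch below lists every point whose nearest neighbor in the 2D layout belongs to a different category. It assumes only NumPy; the function name `cross_category_neighbors` and the toy coordinates are made up for illustration. Keep in mind that t-SNE distances are only meaningful locally, so treat the result as a hint rather than a measurement.

```python
import numpy as np

def cross_category_neighbors(points, categories):
    """Return (i, j) pairs where j is i's nearest neighbor and their categories differ."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances via broadcasting.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)             # a point is not its own neighbor
    nearest = dist.argmin(axis=1)
    return [
        (i, int(j)) for i, j in enumerate(nearest)
        if categories[i] != categories[j]
    ]

# Made-up 2D coordinates standing in for t-SNE output.
toy_points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.2, 0.1]]
toy_categories = ["ML", "ML", "Sports", "Sports", "Sports"]

toy_pairs = cross_category_neighbors(toy_points, toy_categories)
print(toy_pairs)  # the last "Sports" point sits next to an "ML" point
```

With the script above you could call `cross_category_neighbors(reduced_embeddings, categories)` and print `sentences[i]` and `sentences[j]` for each returned pair.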

Feel free to swap in your own category—say **“Travel”**, **“Movies”**, **“Health”**, etc.—by following the same pattern.