# Lesson 2: Generating and Comparing Sentence Embeddings


Welcome back! This is the second lesson in our **Text Representation Techniques for RAG Systems** series. In our previous lesson, we introduced the Bag-of-Words (BOW) approach to converting text into numerical representations. Although BOW is intuitive and lays a solid foundation, it does not capture word order or deeper context.

Picture a helpdesk system that retrieves support tickets. Without a solid way to represent text contextually, customers searching for “account locked” might miss relevant entries labeled “login blocked” because the system can’t recognize these phrases as related. This gap in understanding could lead to frustrated users and unresolved queries.

Today, we’ll take a big step forward by learning to generate more expressive **sentence embeddings** — vectors that represent the semantic meaning of entire sentences. By the end of this lesson, you will know how to produce these embeddings and compare them with each other using **cosine similarity**.

---

## Understanding Sentence Embeddings

Imagine you have sentences scattered across a high-dimensional space, where each sentence is a point, and closeness in this space reflects semantic similarity. Unlike BOW — which only counts word occurrences — sentence embeddings capture the relationship between words, making semantically similar sentences land near each other. This powerful feature is vital for Retrieval-Augmented Generation (RAG) systems, where retrieving text that is closest in meaning to a query drives more accurate responses.

> **Example:**  
> A BOW model might treat  
> _“I enjoy apples”_ and _“He likes oranges”_ as quite different,  
> but embeddings can capture that both sentences express a **personal preference for fruit**.
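
To make the BOW side of this example concrete, here is a minimal sketch that scores two sentences using only their word counts. The helper name `bow_cosine` and the whitespace tokenization are assumptions made for illustration, and the cosine score it uses is the one defined in the next section.

```python
import math
from collections import Counter

def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between simple lowercase word-count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both sentences contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

score_disjoint = bow_cosine("I enjoy apples", "He likes oranges")
score_overlap = bow_cosine("I enjoy apples", "I enjoy oranges")

print(score_disjoint)           # 0.0: the sentences share no words
print(round(score_overlap, 4))  # 0.6667: two of three words are shared
```

With word counts alone, the two preference sentences look completely unrelated, which is the gap that sentence embeddings are meant to close.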

Sentence embeddings are especially helpful in complex applications such as:
- Semantic search  
- Recommendation engines  
- Advanced conversational systems  

---

## Understanding the Cosine Similarity Function

To measure how similar two vectors are, we use **cosine similarity**, which looks at the angle between them:

- **1**: Vectors point in exactly the same direction (maximally similar)  
- **0**: Vectors are orthogonal (no shared direction)  
- **–1**: Vectors point in completely opposite directions  

Mathematically:

\[
\text{cosine\_similarity}(A, B) \;=\; \frac{A \cdot B}{\|A\| \,\|B\|}
\]

- \(A \cdot B\) is the dot product of \(A\) and \(B\)  
- \(\|A\|\) and \(\|B\|\) are the magnitudes (norms) of \(A\) and \(B\)  
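
As a quick worked example of the formula, take two illustrative vectors chosen only to keep the arithmetic simple, \(A = (1, 2, 2)\) and \(B = (2, 1, 2)\):

\[
A \cdot B = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8, \qquad \|A\| = \sqrt{1^2 + 2^2 + 2^2} = 3, \qquad \|B\| = \sqrt{2^2 + 1^2 + 2^2} = 3
\]

\[
\text{cosine\_similarity}(A, B) \;=\; \frac{8}{3 \cdot 3} \;=\; \frac{8}{9} \;\approx\; 0.889
\]

The score is close to 1 because the two vectors point in nearly the same direction.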

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_a, vec_b):
    """
    Compute cosine similarity between two vectors.
    Range: -1 (opposite) to 1 (same direction).
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))
```

> Cosine similarity is insensitive to overall vector magnitude, making it ideal for comparing sentence embeddings (often normalized) in tasks like semantic search and document retrieval.
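
The following sketch checks the magnitude claim numerically. The vectors `a` and `b` are made-up values, not real embeddings. It shows that rescaling either vector leaves the score unchanged, and that for unit-length vectors cosine similarity reduces to a plain dot product.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_a, vec_b):
    """Same definition as in the lesson: dot product over the product of norms."""
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

# Made-up vectors used purely for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

base = cosine_similarity(a, b)
scaled = cosine_similarity(10.0 * a, 0.1 * b)  # rescale both vectors

# After L2 normalization, the denominator is 1, so only the dot product remains.
a_unit = a / norm(a)
b_unit = b / norm(b)
dot_of_units = np.dot(a_unit, b_unit)

print(np.isclose(base, scaled))        # True
print(np.isclose(base, dot_of_units))  # True
```

This is why libraries often normalize embeddings once up front: later comparisons then need only a dot product. Sentence Transformers' `encode` accepts a `normalize_embeddings` flag for this purpose, though the code in this lesson does not rely on it.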

---

## Loading the Sentence Transformers Library

We’ll use the **Sentence Transformers** library to load pre-trained models that produce high-quality, semantically meaningful embeddings. These models are built on Transformer architectures such as BERT or RoBERTa.

```python
from sentence_transformers import SentenceTransformer

# Initialize a pre-trained embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
```

> **Note:**  
> The `all-MiniLM-L6-v2` model is a compact variant of Microsoft’s MiniLM. It balances size and performance, making it ideal for real-time or large-scale applications.

---

## Encoding Sentences into Embeddings

Let’s define some sentences and encode them into numerical vectors:

```python
sentences = [
    "RAG stands for Retrieval Augmented Generation.",
    "A Large Language Model is a Generative AI model for text generation.",
    "RAG enhance text generation of LLMs by incorporating external data",
    "Bananas are yellow fruits.",
    "Apples are good for your health.",
    "What's monkey's favorite food?"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g., (6, 384)
print(embeddings[0])     # Sample embedding for the first sentence
```

The output shape `(6, 384)` indicates 6 sentence embeddings, each of length 384. Unlike BOW, where the vector length equals the vocabulary size and most entries are zero, these embeddings have a fixed length set by the model, and their dense values encode semantic relationships.
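
For contrast, the sketch below shows how the length of a BOW vector depends on the corpus. It reuses the lesson's six sentences; the helpers `build_vocabulary` and `bow_vector` and the whitespace tokenization are assumptions made for illustration.

```python
sentences = [
    "RAG stands for Retrieval Augmented Generation.",
    "A Large Language Model is a Generative AI model for text generation.",
    "RAG enhance text generation of LLMs by incorporating external data",
    "Bananas are yellow fruits.",
    "Apples are good for your health.",
    "What's monkey's favorite food?"
]

def build_vocabulary(texts):
    """Sorted list of unique lowercase, whitespace-separated tokens."""
    return sorted({token for text in texts for token in text.lower().split()})

def bow_vector(text, vocabulary):
    """Count how often each vocabulary entry occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

vocab_small = build_vocabulary(sentences[:3])
vocab_full = build_vocabulary(sentences)

# One slot per vocabulary entry: adding sentences makes every BOW vector longer.
print(len(bow_vector(sentences[0], vocab_small)))
print(len(bow_vector(sentences[0], vocab_full)))
```

The same sentence gets a longer vector as soon as the corpus grows, whereas the model's embedding stays at 384 dimensions no matter how many sentences are encoded.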

---

## Comparing Sentence Embeddings with Cosine Similarity

Now we can compare sentences by computing pairwise cosine similarities:

```python
for i, sent_i in enumerate(sentences):
    for j, sent_j in enumerate(sentences[i+1:], start=i+1):
        sim_score = cosine_similarity(embeddings[i], embeddings[j])
        print(f"Similarity('{sent_i}' , '{sent_j}') = {sim_score:.4f}")
```

**Sample Output** (abridged and reordered from highest to lowest score; the loop above prints pairs in index order):
```
Similarity('A Large Language Model is a Generative AI model for text generation.' , 'RAG enhance text generation of LLMs by incorporating external data') = 0.4983  
Similarity('Bananas are yellow fruits.' , 'What's monkey's favorite food?') = 0.4778  
Similarity('RAG stands for Retrieval Augmented Generation.' , 'RAG enhance text generation of LLMs by incorporating external data') = 0.4630  
Similarity('Bananas are yellow fruits.' , 'Apples are good for your health.') = 0.3568  
...
Similarity('A Large Language Model is a Generative AI model for text generation.' , 'Bananas are yellow fruits.') = 0.0042  
Similarity('RAG enhance text generation of LLMs by incorporating external data' , 'Apples are good for your health.') = 0.0025
```

Notice how semantically related sentences (for example, the two RAG sentences, or the banana sentence and the monkey question) yield higher scores even when they share few or no words, while unrelated pairs score close to 0.
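
To list pairs from most to least similar, one option is to collect the scores first and sort them before printing. The sketch below does this with made-up 3-dimensional vectors in place of real embeddings; the labels and values in `toy_embeddings` are hypothetical, and a dependency-free `cosine_similarity` is defined so the block runs on its own.

```python
import math
from itertools import combinations

def cosine_similarity(vec_a, vec_b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for real embeddings (labels and values are made up).
toy_embeddings = {
    "rag": [0.9, 0.1, 0.0],
    "llm": [0.8, 0.3, 0.1],
    "banana": [0.0, 0.2, 0.9],
}

# Score every unordered pair once, then sort by score, highest first.
scored_pairs = [
    (cosine_similarity(toy_embeddings[a], toy_embeddings[b]), a, b)
    for a, b in combinations(toy_embeddings, 2)
]
scored_pairs.sort(key=lambda item: item[0], reverse=True)

for score, a, b in scored_pairs:
    print(f"Similarity('{a}', '{b}') = {score:.4f}")
```

With real data, the dictionary would be replaced by the sentences and the rows of the array returned by `model.encode`.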

---

## Conclusion and Next Steps

By moving beyond Bag-of-Words, sentence embeddings capture richer semantic relationships between words and phrases. This capability is central to Retrieval-Augmented Generation systems, enabling more precise and flexible retrieval of relevant information.
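
As a rough illustration of that retrieval step, the sketch below ranks a few documents against a query and keeps the best matches. The function `top_k` and all vectors are hypothetical stand-ins; in a real system the vectors would come from an embedding model such as the one used in this lesson.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 3-d vectors standing in for real document and query embeddings.
doc_vecs = [
    [0.9, 0.1, 0.0],  # e.g. a passage about RAG
    [0.1, 0.9, 0.1],  # e.g. a passage about fruit
    [0.8, 0.2, 0.1],  # e.g. a passage about LLMs
]
query_vec = [1.0, 0.0, 0.1]

results = top_k(query_vec, doc_vecs, k=2)
for index, score in results:
    print(f"doc {index}: {score:.4f}")
```

A linear scan like this is fine for a handful of documents; larger collections typically rely on a vector index instead.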

**Next Up:**  
In the practice section, you’ll set up embedding models, generate your own sentence embeddings, and explore how they capture meaning. Happy coding!  


## Creating Sentence Embeddings

You've learned how to generate sentence embeddings and compare them using cosine similarity. Now, let's put that knowledge into practice!

Your task is to:

1. Create a list of three custom sentences.
2. Use the pre-trained SentenceTransformer model to encode them.
3. Print the shape of the resulting embeddings array.
4. Display the numerical vector for the first sentence.

This exercise will help you see how embeddings capture semantic meaning. Dive in and explore the power of sentence embeddings!

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    # Load a pre-trained embedding model from Sentence Transformers.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # TODO: Define your own list of three sentences

    # TODO: Encode each sentence into its embedding vector
    
    print(embeddings.shape)  # Should print (3, 384) for three sentences
    print(embeddings[0])     # Print the embedding vector for the first sentence
```

Here’s a complete script you can run to encode three custom sentences and inspect their embeddings:

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    # Load a pre-trained embedding model
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # 1. Define your own list of three sentences
    sentences = [
        "Artificial Intelligence is transforming the way we live and work.",
        "A gentle breeze rustles the leaves in the autumn trees.",
        "Learning to code opens up a world of problem-solving possibilities."
    ]

    # 2. Encode each sentence into its embedding vector
    embeddings = model.encode(sentences)

    # 3. Print the shape of the resulting embeddings array
    print("Embeddings shape:", embeddings.shape)  # e.g., (3, 384)

    # 4. Display the numerical vector for the first sentence
    print("First sentence embedding:\n", embeddings[0])
```

**What this does:**
1. Initializes the **all-MiniLM-L6-v2** model from Sentence Transformers.  
2. Encodes your three custom sentences into a NumPy array of shape `(3, 384)` (or whatever dimension the model uses).  
3. Prints out that shape so you can confirm you have one embedding per sentence.  
4. Prints the full 384-dimensional vector for the first sentence so you can inspect the actual numbers.


## Comparing Sentence Embeddings

## Finding the Most Similar Sentences

## Exploring Sentence Similarity Changes

## Ranking Sentences by Similarity