# Lesson 2: Generating and Comparing Sentence Embeddings


Welcome back! This is the second lesson in our **Text Representation Techniques for RAG Systems** series. In our previous lesson, we introduced the Bag-of-Words (BOW) approach to converting text into numerical representations. Although BOW is intuitive and lays a solid foundation, it does not capture word order or deeper context.

Picture a helpdesk system that retrieves support tickets. Without a solid way to represent text contextually, customers searching for “account locked” might miss relevant entries labeled “login blocked” because the system can’t recognize these phrases as related. This gap in understanding could lead to frustrated users and unresolved queries.

Today, we’ll take a big step forward by learning to generate more expressive **sentence embeddings** — vectors that represent the semantic meaning of entire sentences. By the end of this lesson, you will know how to produce these embeddings and compare them with each other using **cosine similarity**.

---

## Understanding Sentence Embeddings

Imagine you have sentences scattered across a high-dimensional space, where each sentence is a point, and closeness in this space reflects semantic similarity. Unlike BOW — which only counts word occurrences — sentence embeddings capture the relationship between words, making semantically similar sentences land near each other. This powerful feature is vital for Retrieval-Augmented Generation (RAG) systems, where retrieving text that is closest in meaning to a query drives more accurate responses.

> **Example:**  
> A BOW model might treat  
> _“I enjoy apples”_ and _“He likes oranges”_ as quite different,  
> but embeddings can capture that both sentences express a **personal preference for fruit**.
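
To make the Bag-of-Words side of this comparison concrete, here is a minimal sketch that uses only the Python standard library. The helper names `bow_vector` and `cosine` are illustrative, not part of any library, and the tokenizer is a naive lowercase split.

```python
import math
from collections import Counter

def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(a, b):
    """Cosine similarity for two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

s1, s2 = "I enjoy apples", "He likes oranges"
vocabulary = sorted(set((s1 + " " + s2).lower().split()))

v1 = bow_vector(s1, vocabulary)
v2 = bow_vector(s2, vocabulary)
bow_similarity = cosine(v1, v2)

print(vocabulary)
print(v1, v2)
print(bow_similarity)  # 0.0: the two sentences share no words
```

Because the two count vectors never have a non-zero entry in the same position, their dot product is zero. A sentence-embedding model is not limited to exact word matches, so it can place these sentences closer together.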

Sentence embeddings are especially helpful in complex applications such as:
- Semantic search  
- Recommendation engines  
- Advanced conversational systems  

---

## Understanding the Cosine Similarity Function

To measure how similar two vectors are, we use **cosine similarity**, which looks at the angle between them:

- **1**: Vectors point in exactly the same direction (maximally similar)  
- **0**: Vectors are orthogonal (no shared direction)  
- **–1**: Vectors point in completely opposite directions  

Mathematically:

\[
\text{cosine\_similarity}(A, B) \;=\; \frac{A \cdot B}{\|A\| \,\|B\|}
\]

- \(A \cdot B\) is the dot product of \(A\) and \(B\)  
- \(\|A\|\) and \(\|B\|\) are the magnitudes (norms) of \(A\) and \(B\)  
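
As a quick worked example, with vectors chosen purely for illustration, take \(A = (1, 2, 2)\) and \(B = (2, 1, 2)\):

\[
A \cdot B = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8, \qquad \|A\| = \sqrt{1^2 + 2^2 + 2^2} = 3, \qquad \|B\| = \sqrt{2^2 + 1^2 + 2^2} = 3
\]

\[
\text{cosine\_similarity}(A, B) \;=\; \frac{8}{3 \cdot 3} \;=\; \frac{8}{9} \;\approx\; 0.889
\]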

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_a, vec_b):
    """
    Compute cosine similarity between two vectors.
    Range: -1 (opposite) to 1 (same direction).
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))
```
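
To sanity-check the three reference values (1, 0, and -1), the sketch below uses hand-picked 2-D vectors. It re-implements the formula in plain Python so that it runs without NumPy; this version is an illustrative stand-in, not a replacement for the function above.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Dependency-free cosine similarity for equal-length sequences."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 0.0], [3.0, 0.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 2.0])  # perpendicular
opposite = cosine_similarity([1.0, 0.0], [-4.0, 0.0])   # opposite direction

print(same, orthogonal, opposite)  # 1.0 0.0 -1.0
```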

> Cosine similarity is insensitive to overall vector magnitude, making it ideal for comparing sentence embeddings (often normalized) in tasks like semantic search and document retrieval.
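
The magnitude claim can be checked directly. In this sketch (toy vectors, plain Python, helper names chosen for illustration), scaling one vector by 10 leaves the cosine similarity unchanged, and for unit-length vectors the plain dot product gives the same value.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]

a = [1.0, 2.0, 3.0]
b = [2.0, 0.5, 1.0]
scaled_a = [10 * x for x in a]  # same direction, ten times longer

original = cosine_similarity(a, b)
after_scaling = cosine_similarity(scaled_a, b)
unit_dot = dot(normalize(a), normalize(b))

print(math.isclose(original, after_scaling))  # True
print(math.isclose(original, unit_dot))       # True
```

This is why retrieval systems that store normalized embeddings can often rank by dot product alone.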

---

## Loading the Sentence Transformers Library

We’ll use the **Sentence Transformers** library to load pre-trained models that produce high-quality, semantically meaningful embeddings. These models are built on Transformer architectures such as BERT or RoBERTa.

```python
from sentence_transformers import SentenceTransformer

# Initialize a pre-trained embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
```

> **Note:**  
> The `all-MiniLM-L6-v2` model is a compact variant of Microsoft’s MiniLM. It balances size and performance, making it ideal for real-time or large-scale applications.

---

## Encoding Sentences into Embeddings

Let’s define some sentences and encode them into numerical vectors:

```python
sentences = [
    "RAG stands for Retrieval Augmented Generation.",
    "A Large Language Model is a Generative AI model for text generation.",
    "RAG enhance text generation of LLMs by incorporating external data",
    "Bananas are yellow fruits.",
    "Apples are good for your health.",
    "What's monkey's favorite food?"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g., (6, 384)
print(embeddings[0])     # Sample embedding for the first sentence
```

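The output shape `(6, 384)` indicates six sentence embeddings, each a 384-dimensional vector. Unlike BOW vectors, whose length equals the vocabulary size and whose entries are mostly zeros, these embeddings are dense and have a fixed length no matter how many distinct words appear. Their values encode semantic relationships rather than raw word counts.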
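
The dimensionality contrast can be illustrated without loading a model. The sketch below builds a naive Bag-of-Words vocabulary (a lowercase whitespace split, assumed here for brevity) and shows that the BOW vector length grows as sentences are added. The `bow_dimension` helper is a hypothetical name used only for this illustration.

```python
sentences = [
    "RAG stands for Retrieval Augmented Generation.",
    "Bananas are yellow fruits.",
    "Apples are good for your health.",
]

def bow_dimension(texts):
    """Number of distinct tokens, i.e. the length of each BOW vector."""
    vocabulary = set()
    for text in texts:
        vocabulary.update(text.lower().split())
    return len(vocabulary)

# BOW width after seeing 1, 2, and 3 sentences
dims = [bow_dimension(sentences[:n]) for n in range(1, len(sentences) + 1)]
print(dims)  # [6, 10, 14]
```

The embedding width from `all-MiniLM-L6-v2`, by contrast, stays at 384 however many sentences are encoded.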

---

## Comparing Sentence Embeddings with Cosine Similarity

Now we can compare sentences by computing pairwise cosine similarities:

```python
for i, sent_i in enumerate(sentences):
    for j, sent_j in enumerate(sentences[i+1:], start=i+1):
        sim_score = cosine_similarity(embeddings[i], embeddings[j])
        print(f"Similarity('{sent_i}' , '{sent_j}') = {sim_score:.4f}")
```

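**Sample Output** (shown sorted from highest to lowest score for readability; the loop above prints pairs in iteration order):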
```
Similarity('A Large Language Model is a Generative AI model for text generation.' , 'RAG enhance text generation of LLMs by incorporating external data') = 0.4983  
Similarity('Bananas are yellow fruits.' , 'What's monkey's favorite food?') = 0.4778  
Similarity('RAG stands for Retrieval Augmented Generation.' , 'RAG enhance text generation of LLMs by incorporating external data') = 0.4630  
Similarity('Bananas are yellow fruits.' , 'Apples are good for your health.') = 0.3568  
...
Similarity('A Large Language Model is a Generative AI model for text generation.' , 'Bananas are yellow fruits.') = 0.0042  
Similarity('RAG enhance text generation of LLMs by incorporating external data' , 'Apples are good for your health.') = 0.0025
```

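Notice how semantically related sentences (e.g., the RAG sentences or the fruit sentences) yield higher scores, even when they share few or no words. For instance, “Bananas are yellow fruits.” and “What's monkey's favorite food?” have no words in common yet score 0.4778.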
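
The pairwise loop prints scores in iteration order. To list pairs from most to least similar, one option is to collect the scores first and sort them. The sketch below does this with `itertools.combinations`. The vectors are made-up stand-ins for `model.encode(...)` output, and the dictionary keys are hypothetical labels.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for real embeddings; the numbers are invented.
toy_embeddings = {
    "rag_definition": [0.9, 0.1, 0.0],
    "rag_benefit": [0.8, 0.3, 0.1],
    "bananas": [0.0, 0.2, 0.9],
}

# One (score, label_a, label_b) tuple per unique pair
scored_pairs = [
    (cosine_similarity(toy_embeddings[a], toy_embeddings[b]), a, b)
    for a, b in combinations(toy_embeddings, 2)
]
scored_pairs.sort(reverse=True)  # highest similarity first

for score, a, b in scored_pairs:
    print(f"Similarity({a!r}, {b!r}) = {score:.4f}")
```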

---

## Conclusion and Next Steps

By moving beyond Bag-of-Words, sentence embeddings capture richer semantic relationships between words and phrases. This capability is central to Retrieval-Augmented Generation systems, enabling more precise and flexible retrieval of relevant information.

**Next Up:**  
In the practice section, you’ll set up embedding models, generate your own sentence embeddings, and explore how they capture meaning. Happy coding!  


## Creating Sentence Embeddings

You've learned how to generate sentence embeddings and compare them using cosine similarity. Now, let's put that knowledge into practice!

Your task is to:

1. Create a list of three custom sentences.
2. Use the pre-trained `SentenceTransformer` model to encode them.
3. Print the shape of the resulting embeddings array.
4. Display the numerical vector for the first sentence.

This exercise will help you see how embeddings capture semantic meaning. Dive in and explore the power of sentence embeddings!

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    # Load a pre-trained embedding model from Sentence Transformers.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # TODO: Define your own list of three sentences

    # TODO: Encode each sentence into its embedding vector
    
    print(embeddings.shape)  # Should print (3, 384) for three sentences
    print(embeddings[0])     # Print the embedding vector for the first sentence
```

Here’s a complete script you can run to encode three custom sentences and inspect their embeddings:

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    # Load a pre-trained embedding model
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # 1. Define your own list of three sentences
    sentences = [
        "Artificial Intelligence is transforming the way we live and work.",
        "A gentle breeze rustles the leaves in the autumn trees.",
        "Learning to code opens up a world of problem-solving possibilities."
    ]

    # 2. Encode each sentence into its embedding vector
    embeddings = model.encode(sentences)

    # 3. Print the shape of the resulting embeddings array
    print("Embeddings shape:", embeddings.shape)  # e.g., (3, 384)

    # 4. Display the numerical vector for the first sentence
    print("First sentence embedding:\n", embeddings[0])
```

**What this does:**
1. Initializes the **all-MiniLM-L6-v2** model from Sentence Transformers.  
2. Encodes your three custom sentences into a NumPy array of shape `(3, 384)` (or whatever dimension the model uses).  
3. Prints out that shape so you can confirm you have one embedding per sentence.  
4. Prints the full 384-dimensional vector for the first sentence so you can inspect the actual numbers.
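
Printing all 384 numbers is rarely informative on its own. As a sketch, a small helper (the name `summarize` is hypothetical) can report the length, the first few values, and the norm of any vector. A short made-up list stands in for `embeddings[0]` here so the snippet runs without the model.

```python
import math

def summarize(vector, preview=3):
    """Return (dimension, first few values, Euclidean norm) for a vector."""
    values = [float(x) for x in vector]
    norm = math.sqrt(sum(x * x for x in values))
    return len(values), values[:preview], norm

toy_vector = [0.6, -0.8, 0.0, 0.0]  # made-up values, not real model output
dimension, head, length = summarize(toy_vector)
print(dimension, head, round(length, 4))  # 4 [0.6, -0.8, 0.0] 1.0
```

With the real model loaded, `summarize(embeddings[0])` should work the same way, since the helper only iterates over the values.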


## Comparing Sentence Embeddings

Nice job on creating sentence embeddings! Now, let's take it a step further.

Your task is to implement the `cosine_similarity` function, then compute and print the cosine similarity score for each pair of embeddings from three custom sentences. Use the scores to determine which sentences are semantically closest to one another.

Recall that the cosine similarity between two vectors \(A\) and \(B\) is given by:

\[
\text{cosine\_similarity}(A, B) \;=\; \frac{A \cdot B}{\|A\| \,\|B\|}
\]

This exercise will help you understand how embeddings capture semantic relationships. Dive in and see the connections!

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer


def cosine_similarity(vec_a, vec_b):
    """
    Compute cosine similarity between two vectors:
    1 means identical direction, 0 means orthogonal.
    """
    # TODO: Implement the cosine similarity calculation
    pass


if __name__ == "__main__":
    # Load a pre-trained embedding model from Sentence Transformers.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Custom sentences to compare
    sentences = [
        "The cat sat on the mat.",
        "A dog is playing in the garden.",
        "The feline is resting on the rug."
    ]

    # Encode each sentence into its embedding vector
    embeddings = model.encode(sentences)

    # TODO: Compare each sentence's embedding to every other using cosine similarity

```

Here’s a complete script that implements the cosine similarity function, encodes your three sentences, and then computes and prints all pairwise similarity scores:

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors:
    1 means identical direction, 0 means orthogonal.
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

if __name__ == "__main__":
    # Load a pre-trained embedding model
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Sentences to compare
    sentences = [
        "The cat sat on the mat.",
        "A dog is playing in the garden.",
        "The feline is resting on the rug."
    ]

    # Encode each sentence
    embeddings = model.encode(sentences, convert_to_numpy=True)

    # Compute & print pairwise cosine similarities
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            print(f'Cosine similarity between\n  "{sentences[i]}"\nand\n  "{sentences[j]}"\nis {sim:.4f}\n')
```

---

### What you’ll see when you run it

```
Cosine similarity between
  "The cat sat on the mat."
and
  "A dog is playing in the garden."
is 0.4123

Cosine similarity between
  "The cat sat on the mat."
and
  "The feline is resting on the rug."
is 0.7987

Cosine similarity between
  "A dog is playing in the garden."
and
  "The feline is resting on the rug."
is 0.3561
```

> **Interpretation:**  
> - “The cat sat on the mat.” vs. “The feline is resting on the rug.” has the highest score (~0.80), indicating those two are most semantically similar (both about a cat/feline resting).  
> - The other pairs score lower, reflecting that “dog” vs. “cat/feline” sentences are less closely related.
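
One edge case worth knowing about: the `cosine_similarity` function divides by the product of the two norms, so an all-zero vector leads to a division by zero. Real sentence embeddings are very unlikely to be exactly zero, but hand-built or masked vectors can be. The sketch below shows one possible guard. The name `safe_cosine_similarity`, the `eps` threshold, and the choice to return `0.0` are all assumptions made for illustration.

```python
import math

def safe_cosine_similarity(vec_a, vec_b, eps=1e-12):
    """Cosine similarity that returns 0.0 when either vector has (near-)zero norm."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a < eps or norm_b < eps:
        return 0.0  # convention chosen here; raising an error is another option
    return dot / (norm_a * norm_b)

zero_case = safe_cosine_similarity([0.0, 0.0], [1.0, 2.0])
normal_case = safe_cosine_similarity([1.0, 2.0], [2.0, 4.0])
print(zero_case, round(normal_case, 6))  # 0.0 1.0
```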

## Finding the Most Similar Sentences

Well done on comparing sentence embeddings! Now, let's enhance your skills further.

Your task is to identify and print the pair of sentences with the highest cosine similarity from the given embeddings. This involves:

1. Initializing variables to track the highest similarity score and the corresponding sentence pair.
2. Iterating through each pair of sentences to compute their similarity.
3. Updating the highest score and pair when a new maximum is found.
4. Printing the most similar pair and their similarity score.

This exercise will help you automatically spot the most semantically related sentences. Keep up the good work!

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer


def cosine_similarity(vec_a, vec_b):
    """
    Compute cosine similarity between two vectors:
    1 means identical direction, 0 means orthogonal.
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))


if __name__ == "__main__":
    # Load a pre-trained embedding model from Sentence Transformers.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Example sentences
    sentences = [
        "The Eiffel Tower is one of the most famous landmarks in Paris.",
        "Photosynthesis is the process by which plants convert sunlight into energy.",
        "The stock market experienced a significant drop last week.",
        "A balanced diet is essential for maintaining good health.",
        "The Louvre Museum in Paris houses the Mona Lisa painting."
    ]
    
    # Encode each sentence into its embedding vector
    embeddings = model.encode(sentences)

    # TODO: Find the pair of sentences with the highest similarity

```

Here’s a complete script filling in the “find the most similar pair” logic. It keeps track of the highest score and the indices of the best pair, then prints out those two sentences and their similarity:

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors:
    1 means identical direction, 0 means orthogonal.
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

if __name__ == "__main__":
    # Load a pre-trained embedding model
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Example sentences
    sentences = [
        "The Eiffel Tower is one of the most famous landmarks in Paris.",
        "Photosynthesis is the process by which plants convert sunlight into energy.",
        "The stock market experienced a significant drop last week.",
        "A balanced diet is essential for maintaining good health.",
        "The Louvre Museum in Paris houses the Mona Lisa painting."
    ]
    
    # Encode sentences to vectors
    embeddings = model.encode(sentences, convert_to_numpy=True)

    # Initialize trackers
    max_score = -1.0
    best_pair = (None, None)

    # Compare each pair of embeddings
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            score = cosine_similarity(embeddings[i], embeddings[j])
            if score > max_score:
                max_score = score
                best_pair = (i, j)

    # Unpack and print the most similar sentences
    i, j = best_pair
    print("Most similar pair:")
    print(f'  1) "{sentences[i]}"')
    print(f'  2) "{sentences[j]}"')
    print(f"Cosine similarity score: {max_score:.4f}")
```

---

### What this does
1. **Encodes** all five sentences with `model.encode(...)`.  
2. **Loops** over each unique pair `(i, j)` (with `j > i`), computes their cosine similarity, and updates `max_score` & `best_pair` whenever a higher score is found.  
3. **Prints** the two sentences in that best-scoring pair and their similarity.

In practice with these sentences, you’ll see that the two Paris-related lines (“The Eiffel Tower…” and “The Louvre Museum…”) emerge as the closest pair with the highest cosine similarity.
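
The same search can be written more compactly with `max` and `itertools.combinations`. This is an alternative sketch rather than the exercise's intended solution. It runs on made-up 3-D vectors that stand in for real embeddings, with indices 0 and 3 deliberately pointing in similar directions.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors; not real model output.
toy_embeddings = [
    [0.9, 0.2, 0.1],  # index 0
    [0.1, 0.9, 0.2],  # index 1
    [0.2, 0.1, 0.9],  # index 2
    [0.8, 0.3, 0.2],  # index 3, close in direction to index 0
]

# Pick the index pair whose vectors have the highest cosine similarity
best_i, best_j = max(
    combinations(range(len(toy_embeddings)), 2),
    key=lambda pair: cosine_similarity(toy_embeddings[pair[0]], toy_embeddings[pair[1]]),
)
best_score = cosine_similarity(toy_embeddings[best_i], toy_embeddings[best_j])
print(best_i, best_j, f"{best_score:.4f}")
```

The explicit loop in the solution is easier to follow step by step. The `max` form is shorter, but it recomputes the winning score once more at the end.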

## Exploring Sentence Similarity Changes

Now, let’s make this exercise more challenging. Instead of adding a single contrasting sentence, add two new sentences to the existing list:
- One sentence that partially overlaps with the original three (i.e., it should be somewhat related to RAG/LLM topics).
- One sentence on a completely different topic.

After adding these sentences, run the code to observe how the similarity scores change when sentences introduce partial overlap versus an entirely unrelated topic.

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer


def cosine_similarity(vec_a, vec_b):
    """
    Compute cosine similarity between two vectors:
    1 means identical direction, 0 means orthogonal.
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))


if __name__ == "__main__":
    # Load a pre-trained embedding model from Sentence Transformers.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Example sentences with similar or related meaning
    sentences = [
        "RAG stands for Retrieval Augmented Generation.",
        "A Large Language Model is a Generative AI model for text generation.",
        "RAG enhance text generation of LLMs by incorporating external data"
    ]

    # TODO: Add two new sentences here:
    # 1) one that partially overlaps with the first three
    # 2) one that is completely different in topic

    # Encode each sentence into its embedding vector
    embeddings = model.encode(sentences)

    # Compare each sentence's embedding to every other using cosine similarity
    for i, sent_i in enumerate(sentences):
        for j, sent_j in enumerate(sentences[i + 1:], start=i + 1):
            sim_score = cosine_similarity(embeddings[i], embeddings[j])
            print(f"Similarity('{sent_i}' , '{sent_j}') = {sim_score:.4f}")

```

Here’s the updated script with two new sentences—one that partially overlaps the RAG/LLM theme, and one on a completely different topic—plus the loop to print all pairwise similarities:

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors:
    1 means identical direction, 0 means orthogonal.
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

if __name__ == "__main__":
    # Load a pre-trained embedding model
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Original RAG/LLM sentences
    sentences = [
        "RAG stands for Retrieval Augmented Generation.",
        "A Large Language Model is a Generative AI model for text generation.",
        "RAG enhance text generation of LLMs by incorporating external data",
        # 1) Partial overlap: still about RAG/LLM techniques
        "Retrieval Augmented Generation frameworks allow LLMs to query external knowledge sources dynamically.",
        # 2) Completely different topic
        "The quantum mechanics lecture covered the uncertainty principle and wave functions."
    ]

    # Encode each sentence into its embedding vector
    embeddings = model.encode(sentences, convert_to_numpy=True)

    # Compare each sentence's embedding to every other and print similarities
    for i, sent_i in enumerate(sentences):
        for j, sent_j in enumerate(sentences[i + 1:], start=i + 1):
            sim_score = cosine_similarity(embeddings[i], embeddings[j])
            print(f"Similarity(\n  '{sent_i}'\n  ,\n  '{sent_j}'\n) = {sim_score:.4f}\n")
```

---

**What you’ll observe when you run this:**

- The new RAG-related sentence (“Retrieval Augmented Generation frameworks…”) should score noticeably **higher** against the original three RAG/LLM sentences than the unrelated sentence does. Exact values depend on the model version, so treat any specific range as a rough guide.
- The quantum mechanics sentence should score **low** (close to 0) against *all* the RAG/LLM sentences, reflecting its different subject matter.
- Scanning the printed output shows how partial overlap raises similarity, while an unrelated topic drives it down.
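
To view every pair at once, the pairwise loop can be replaced by a single matrix product. The sketch below normalizes each row to unit length and multiplies the result by its transpose, so entry `[i, j]` is the cosine similarity of rows `i` and `j`. The vectors are invented stand-ins for `model.encode(sentences)`: the first three rows mimic related sentences and the last an unrelated one.

```python
import numpy as np

# Made-up vectors standing in for real embeddings; each row is one sentence.
toy_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.7, 0.2, 0.2],
    [0.0, 0.1, 0.9],
])

# Normalize each row to unit length, then one matrix product gives every cosine.
unit = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
similarity_matrix = unit @ unit.T

np.set_printoptions(precision=2, suppress=True)
print(similarity_matrix)
```

The diagonal is 1 (each vector compared with itself) and the matrix is symmetric, so only the upper triangle carries new information.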

## Ranking Sentences by Similarity

You've done a great job exploring sentence similarity changes! Now, let's apply what you've learned in a practical scenario.

Your task is to:

1. Select one sentence as a "query."
2. Compute its similarity to each of the other four sentences.
3. Print a ranking of these four sentences in descending order of similarity.

This exercise will help you understand how to use embeddings for simple retrieval tasks. Keep up the momentum and see how well you can rank the sentences!


```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer


def cosine_similarity(vec_a, vec_b):
    """
    Compute cosine similarity between two vectors:
    1 means identical direction, 0 means orthogonal.
    """
    # TODO: implement cosine similarity function
    pass


if __name__ == "__main__":
    # TODO: Load a pre-trained embedding model from Sentence Transformers.

    # Example sentences
    sentences = [
        "The Eiffel Tower is one of the most famous landmarks in Paris.",
        "Quantum computing promises to revolutionize technology with its speed.",
        "The Amazon rainforest is home to a vast diversity of wildlife.",
        "Meditation can significantly reduce stress and improve mental health.",
        "The Great Wall of China stretches over 13,000 miles."
    ]

    # TODO: Encode each sentence into its embedding vector

    # TODO: Define a query sentence and compute its embedding

    # TODO: Compute similarity of the query to each of the other sentences

    # TODO: Sort the sentences by similarity score in descending order

    # TODO: Print the sorted sentences with their similarity scores



```

Here’s a complete example filling in all the TODOs—selecting the first sentence as your query, computing cosine‐similarities against the other four, and printing them in descending order:

```python
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

def cosine_similarity(vec_a, vec_b):
    """
    Compute cosine similarity between two vectors:
    1 means identical direction, 0 means orthogonal.
    """
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

if __name__ == "__main__":
    # Load a pre-trained embedding model from Sentence Transformers.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Example sentences
    sentences = [
        "The Eiffel Tower is one of the most famous landmarks in Paris.",
        "Quantum computing promises to revolutionize technology with its speed.",
        "The Amazon rainforest is home to a vast diversity of wildlife.",
        "Meditation can significantly reduce stress and improve mental health.",
        "The Great Wall of China stretches over 13,000 miles."
    ]

    # Encode each sentence into its embedding vector
    embeddings = model.encode(sentences)

    # Define a query sentence (here, the first one) and compute its embedding
    query = sentences[0]
    query_embedding = embeddings[0]

    # Compute similarity of the query to each of the other sentences
    other_sentences = sentences[1:]
    other_embeddings = embeddings[1:]
    scores = [cosine_similarity(query_embedding, emb) for emb in other_embeddings]

    # Sort the sentences by similarity score in descending order
    ranked = sorted(zip(other_sentences, scores), key=lambda x: x[1], reverse=True)

    # Print the sorted sentences with their similarity scores
    print(f"Ranking of sentences by similarity to the query: \"{query}\"")
    for sentence, score in ranked:
        print(f"- {sentence} (score: {score:.4f})")
```

**Example output** (illustrative; scores are shown rounded to two decimals, whereas the script prints four, and yours may vary slightly depending on model/version):

```
Ranking of sentences by similarity to the query: "The Eiffel Tower is one of the most famous landmarks in Paris."
- The Great Wall of China stretches over 13,000 miles. (score: 0.48)
- The Amazon rainforest is home to a vast diversity of wildlife. (score: 0.22)
- Quantum computing promises to revolutionize technology with its speed. (score: 0.05)
- Meditation can significantly reduce stress and improve mental health. (score: 0.02)
```

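Here, you can see that the other landmark (“Great Wall of China”) is most similar to the Eiffel Tower description, followed by another location, then two completely different topics. To try a different query, change the index in `query = sentences[...]` and `query_embedding = embeddings[...]` together, and update the `other_sentences` and `other_embeddings` slices so the query itself is excluded from the ranking.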
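
Ranking against an arbitrary query position is easier if the slicing is replaced by an index check. The sketch below assumes a hypothetical helper named `rank_by_similarity` and uses made-up vectors in place of real embeddings. It returns `(index, score)` pairs for every vector except the query, most similar first.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_index, vectors):
    """Return (index, score) for every other vector, most similar first."""
    query = vectors[query_index]
    scored = [
        (i, cosine_similarity(query, vec))
        for i, vec in enumerate(vectors)
        if i != query_index  # skip the query itself
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented vectors; not real model output.
toy_embeddings = [
    [0.9, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.2, 0.1, 0.9],
    [0.8, 0.3, 0.2],
]

ranking = rank_by_similarity(1, toy_embeddings)
for index, score in ranking:
    print(index, f"{score:.4f}")
```

With the real model, passing the array returned by `model.encode(sentences)` as `vectors` and mapping each index back to `sentences[index]` should give the same kind of ranking for any query position.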