# Import Libraries

In [1]:
import re

import chromadb
import numpy as np
from chromadb import Documents, EmbeddingFunction, Embeddings
from IPython.display import Markdown
from ollama import embed, generate
from sklearn.metrics.pairwise import cosine_distances

# Create Document

In [2]:
# DOCUMENT1 = "Operating the Climate Control System  Your Googlecar has a climate control system that allows you to adjust the temperature and airflow in the car. To operate the climate control system, use the buttons and knobs located on the center console.  Temperature: The temperature knob controls the temperature inside the car. Turn the knob clockwise to increase the temperature or counterclockwise to decrease the temperature. Airflow: The airflow knob controls the amount of airflow inside the car. Turn the knob clockwise to increase the airflow or counterclockwise to decrease the airflow. Fan speed: The fan speed knob controls the speed of the fan. Turn the knob clockwise to increase the fan speed or counterclockwise to decrease the fan speed. Mode: The mode button allows you to select the desired mode. The available modes are: Auto: The car will automatically adjust the temperature and airflow to maintain a comfortable level. Cool: The car will blow cool air into the car. Heat: The car will blow warm air into the car. Defrost: The car will blow warm air onto the windshield to defrost it."
# DOCUMENT2 = 'Your Googlecar has a large touchscreen display that provides access to a variety of features, including navigation, entertainment, and climate control. To use the touchscreen display, simply touch the desired icon.  For example, you can touch the "Navigation" icon to get directions to your destination or touch the "Music" icon to play your favorite songs.'
# DOCUMENT3 = "Shifting Gears Your Googlecar has an automatic transmission. To shift gears, simply move the shift lever to the desired position.  Park: This position is used when you are parked. The wheels are locked and the car cannot move. Reverse: This position is used to back up. Neutral: This position is used when you are stopped at a light or in traffic. The car is not in gear and will not move unless you press the gas pedal. Drive: This position is used to drive forward. Low: This position is used for driving in snow or other slippery conditions."

# documents = [DOCUMENT1, DOCUMENT2, DOCUMENT3]

In [3]:
DOCUMENT1 = """
Sure! Let’s break this down into a **clear, structured explanation** of what BLEU and ROUGE are, how they’re used, and where they fit in the **larger picture of evaluating Large Language Models (LLMs)** — especially in tasks like prompt engineering, machine translation, summarization, or domain-specific applications.

---

## 🔹 What Are BLEU and ROUGE?

### **BLEU (Bilingual Evaluation Understudy)**
- Originally designed to **evaluate machine translation**.
- Measures **how similar a generated sentence is to one or more reference sentences**, based on matching sequences of words (n-grams).
- Focuses on **precision** — how much of the generated text overlaps with the reference.

**Example:**
If the model generates:  
`"The cat is on the mat"`  
and the reference is:  
`"The cat sat on the mat"`  
BLEU looks at the overlapping n-grams (like `the`, `cat`, `on`, `the`, `mat`), and calculates a score.

### **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**
- Developed for **evaluating automatic summarization**.
- Measures **how much of the reference text appears in the generated text**, again using n-grams.
- Focuses more on **recall** — how much of the reference content is captured.

---

## 🔹 How Are These Metrics Used in Evaluating LLMs?

BLEU and ROUGE are **automatic, quantitative metrics** used to evaluate the quality of text generated by LLMs. They're commonly applied in these contexts:

### 1. **Prompt Engineering**
- When experimenting with different prompts to elicit better responses from an LLM:
  - BLEU/ROUGE can **score how close the output is to a desired answer**.
  - This is useful for **automated prompt selection** (i.e., choosing the best prompt without needing human judgment every time).

### 2. **Domain-Specific Tasks (e.g., Med-PaLM, SecLM)**
- In complex domains like medicine or security:
  - There may not be one "correct" answer.
  - BLEU/ROUGE can help by comparing generated answers to **a set of high-quality reference answers** (a.k.a. **golden responses**).
  - Helps to **quantify** performance even when human evaluation is expensive or time-consuming.

### 3. **Text Generation (Machine Translation, Summarization)**
- In foundational NLP tasks:
  - **BLEU is used for translations** — does the output match human-translated sentences?
  - **ROUGE is used for summaries** — does the summary include key phrases from the original content?

---

## 🔹 Limitations of BLEU and ROUGE

While useful, these metrics have **important limitations**, especially for **creative or open-ended tasks**:

- They **penalize valid but novel responses** — because the generated text might be correct or high quality, even if it doesn't match the reference exactly.
- They **struggle with semantic equivalence** — they don’t "understand meaning", just surface word overlaps.
- This makes them **less reliable for tasks like story generation, Q&A, or chatbot interactions**, where many answers can be equally valid.

---

## 🔹 The Bigger Picture: Evaluating LLMs

To properly evaluate LLMs, especially in complex or real-world applications, a **multi-method evaluation strategy** is recommended:

### ✅ **Traditional Metrics (like BLEU/ROUGE)**
- Quick, objective, and reproducible.
- Best for **structured tasks** with clear reference outputs.

### ✅ **Human Evaluation**
- Gold standard for assessing **fluency, relevance, coherence, and creativity**.
- But expensive and not easily scalable.

### ✅ **LLM-Powered Auto-Raters**
- Use another calibrated language model to **simulate human evaluation**.
- Can scale more easily than human raters.
- Requires **calibration against human judgments** to ensure reliability.

---

## 🔹 Final Thoughts

So, while **BLEU and ROUGE are valuable tools**, especially in tasks like translation and summarization, they are **not enough by themselves** for a full evaluation of LLM outputs. The best evaluations combine:

- **Similarity-based automatic metrics** (like BLEU/ROUGE),
- **Human assessments**, and
- **LLM-based scoring systems**.

This ensures that both **objective correctness** and **subjective quality** are properly captured — especially important when working with LLMs in nuanced or high-stakes domains.

---

Would you like a visual summary or code example showing how BLEU or ROUGE is computed in practice?
"""

In [4]:
DOCUMENT2 = """
Word embeddings are dense vector representations of words that capture their meanings, syntactic properties, and relationships with other words. There are several types of word embeddings, generally categorized based on how they are learned and what kind of data they use.

Here's a breakdown of the main types of word embeddings:

---

### **1. Count-Based Embeddings (Matrix Factorization)**
These are derived from word co-occurrence matrices.

#### a. **Latent Semantic Analysis (LSA)**
- **Method**: Create a term-document matrix, then apply **Singular Value Decomposition (SVD)** to reduce dimensions.
- **Captures**: Word similarity based on co-occurrence.
- **Limitations**: Linear, doesn't capture context dynamically.

#### b. **Pointwise Mutual Information (PMI) + SVD**
- Uses **PMI matrix** (measuring association between words) and then reduces it using SVD.
- **Variants**: PPMI (Positive PMI), Shifted PMI, etc.

---

### **2. Predictive Embeddings (Neural Networks)**
These use shallow neural networks to predict word context.

#### a. **Word2Vec**
- **Models**:
  - **CBOW (Continuous Bag-of-Words)**: Predicts a word from its context.
  - **Skip-Gram**: Predicts context words from a given word.
- **Training**: Negative sampling or hierarchical softmax.
- **Captures**: Semantic similarity, analogies (`king - man + woman ≈ queen`).

```python
# Using gensim to load Word2Vec
from gensim.models import Word2Vec
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
```

#### b. **GloVe (Global Vectors for Word Representation)**
- **Method**: Factorizes a co-occurrence matrix with a cost function involving word probabilities.
- **Captures**: Global co-occurrence and local context.
- **Advantage**: Incorporates both count-based and predictive elements.

#### c. **FastText**
- **Extension of Word2Vec** developed by Facebook.
- **Key Feature**: Represents words as bags of character **n-grams**, allowing it to handle **out-of-vocabulary (OOV)** words.
- Better for morphologically rich languages.

---

### **3. Contextual Embeddings**
These generate embeddings **based on context**, meaning the same word can have different vectors depending on its sentence.

#### a. **ELMo (Embeddings from Language Models)**
- Uses a **bi-directional LSTM** over the entire sentence.
- **Context-aware**: Embeddings vary by usage.

#### b. **BERT (Bidirectional Encoder Representations from Transformers)**
- Uses Transformer encoders.
- Learns embeddings for entire sentences, with **deep bidirectional** context.
- Typically uses the output from an intermediate or final layer for word/sentence embeddings.

```python
# Example using Hugging Face Transformers
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Example sentence", return_tensors="pt")
outputs = model(**inputs)
word_embeddings = outputs.last_hidden_state
```

#### c. **Other Transformer Models**
- **GPT**, **RoBERTa**, **XLNet**, **T5**, etc.
- All provide contextual embeddings, fine-tuned for various tasks.

---

### Summary Table

| Type              | Examples         | Contextual | Handles OOV | Architecture        |
|-------------------|------------------|------------|-------------|---------------------|
| Count-based       | LSA, PMI         | ❌         | ❌          | Matrix factorization |
| Predictive        | Word2Vec, GloVe  | ❌         | FastText: ✅| Shallow NN          |
| Contextual        | ELMo, BERT, GPT  | ✅         | ✅          | RNNs / Transformers |

---

Let me know if you want a visual comparison or a deeper dive into how one of them works (like training Word2Vec step-by-step)."""

In [5]:
DOCUMENT3 = """
Based on the sources, the **Self-Attention Mechanism** is a **crucial component** within the architecture of **Large Language Models (LLMs)**, particularly those based on the **Transformer architecture**.

Here's a breakdown of its significance in the context of LLMs:

*   **Core Mechanism in Transformers:** The self-attention mechanism is a **key innovation of the Transformer network**, which has become the foundation for most modern LLMs. Unlike Recurrent Neural Networks (RNNs) that process sequences sequentially, **transformers can process sequences of tokens in parallel due to the self-attention mechanism**.

*   **Understanding Context and Dependencies:** Self-attention enables LLMs to **focus on specific parts of the input sequence that are relevant to the task at hand**. It allows the model to **capture long-range dependencies within sequences more effectively** than traditional RNNs. For example, in the sentence "The tiger jumped out of a tree to get a drink because it was thirsty," self-attention helps the model understand that "the tiger" and "it" refer to the same entity by determining the relationships between different words.

*   **How Self-Attention Works:** The process involves several steps:
    *   **Calculating Scores:** For each word in the input sequence, scores are calculated to determine how much it should 'attend' to other words. This is done by taking the **dot product of the query vector of one word with the key vectors of all the words in the sequence**.
    *   **Normalization:** The scores are then **divided by the square root of the key vector dimension** for stability and passed through a **softmax function** to obtain attention weights. These weights indicate the strength of connection between words.
    *   **Weighted Values:** Each **value vector is multiplied by its corresponding attention weight**, and the results are summed up to produce a **context-aware representation for each word**.
    *   In practice, these calculations are performed simultaneously using matrices for queries (Q), keys (K), and values (V).

*   **Multi-Head Attention:** Most LLMs utilize **multi-head attention**, which employs **multiple sets of query, key, and value weight matrices** running in parallel. Each 'head' can potentially focus on different aspects of the input relationships, and their outputs are combined to provide the model with a **richer representation of the input sequence**, improving its ability to handle complex language patterns.

*   **Foundation for Large Reasoning Models:** The "Foundational Large Language Models & Text Generation" whitepaper notes that **transformer architectures, with their self-attention mechanisms, are foundational for achieving robust reasoning capabilities in large models**.

*   **Evolution of Architectures:** Models like **BERT** (Bidirectional Encoder Representations from Transformers) are encoder-only architectures that heavily rely on the self-attention mechanism to understand context deeply. Other architectures like the original Transformer and GPT models also utilize self-attention in their encoder and/or decoder components.

*   **Efficiency Considerations:** While powerful, the self-attention mechanism in the original transformer has a **computational cost that is quadratic in the context length**, which can limit the size of the input it can effectively process. Techniques like **Flash Attention** aim to optimize the self-attention calculation to improve latency and reduce costs.

In summary, the **self-attention mechanism is a fundamental building block of modern LLMs**, enabling them to process information in parallel, understand contextual relationships between words in a sequence, capture long-range dependencies, and ultimately achieve strong performance in various natural language understanding and generation tasks. Its effectiveness has led to its widespread adoption in the Transformer architecture and its derivatives, which power many of the state-of-the-art LLMs discussed in the sources."""

In [6]:
documents = [DOCUMENT1, DOCUMENT2, DOCUMENT3]

# Create Document Embeddings

## Create document embeddings and add to chromadb

In [7]:
class OllamaEmbeddingFunction(EmbeddingFunction):
    """Chroma embedding function that embeds text with the bge-m3 model served by Ollama."""
    def __call__(self, input: Documents) -> Embeddings:
        response = embed(model='bge-m3', input=input)
        return response['embeddings']

In [8]:
DB_NAME = "testdb1"
chroma_client = chromadb.Client()
embed_fn = OllamaEmbeddingFunction()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)
db.add(documents=documents, ids=[str(i) for i in range(len(documents))])


In [9]:
db.count()

3

In [10]:
db.peek(2)

{'ids': ['0', '1'],
 'embeddings': array([[-0.00175362, -0.00174541,  0.00709319, ..., -0.0428962 ,
          0.01390002,  0.00645761],
        [-0.02732202,  0.00666147,  0.00107146, ..., -0.01805371,
         -0.01269804,  0.00306338]], shape=(2, 1024)),
 'documents': ['\nSure! Let’s break this down into a **clear, structured explanation** of what BLEU and ROUGE are, how they’re used, and where they fit in the **larger picture of evaluating Large Language Models (LLMs)** — especially in tasks like prompt engineering, machine translation, summarization, or domain-specific applications.\n\n---\n\n## 🔹 What Are BLEU and ROUGE?\n\n### **BLEU (Bilingual Evaluation Understudy)**\n- Originally designed to **evaluate machine translation**.\n- Measures **how similar a generated sentence is to one or more reference sentences**, based on matching sequences of words (n-grams).\n- Focuses on **precision** — how much of the generated text overlaps with the reference.\n\n**Example:**\nIf the model 

## Query chromadb for embeddings and get relevant documents

In [11]:
query = "What are the limitations of BLEU and ROUGE"
result = db.query(query_texts=[query], n_results=1)
[all_passages] = result["documents"]
Markdown(all_passages[0])


Sure! Let’s break this down into a **clear, structured explanation** of what BLEU and ROUGE are, how they’re used, and where they fit in the **larger picture of evaluating Large Language Models (LLMs)** — especially in tasks like prompt engineering, machine translation, summarization, or domain-specific applications.

---

## 🔹 What Are BLEU and ROUGE?

### **BLEU (Bilingual Evaluation Understudy)**
- Originally designed to **evaluate machine translation**.
- Measures **how similar a generated sentence is to one or more reference sentences**, based on matching sequences of words (n-grams).
- Focuses on **precision** — how much of the generated text overlaps with the reference.

**Example:**
If the model generates:  
`"The cat is on the mat"`  
and the reference is:  
`"The cat sat on the mat"`  
BLEU looks at the overlapping n-grams (like `the`, `cat`, `on`, `the`, `mat`), and calculates a score.

### **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**
- Developed for **evaluating automatic summarization**.
- Measures **how much of the reference text appears in the generated text**, again using n-grams.
- Focuses more on **recall** — how much of the reference content is captured.

---

## 🔹 How Are These Metrics Used in Evaluating LLMs?

BLEU and ROUGE are **automatic, quantitative metrics** used to evaluate the quality of text generated by LLMs. They're commonly applied in these contexts:

### 1. **Prompt Engineering**
- When experimenting with different prompts to elicit better responses from an LLM:
  - BLEU/ROUGE can **score how close the output is to a desired answer**.
  - This is useful for **automated prompt selection** (i.e., choosing the best prompt without needing human judgment every time).

### 2. **Domain-Specific Tasks (e.g., Med-PaLM, SecLM)**
- In complex domains like medicine or security:
  - There may not be one "correct" answer.
  - BLEU/ROUGE can help by comparing generated answers to **a set of high-quality reference answers** (a.k.a. **golden responses**).
  - Helps to **quantify** performance even when human evaluation is expensive or time-consuming.

### 3. **Text Generation (Machine Translation, Summarization)**
- In foundational NLP tasks:
  - **BLEU is used for translations** — does the output match human-translated sentences?
  - **ROUGE is used for summaries** — does the summary include key phrases from the original content?

---

## 🔹 Limitations of BLEU and ROUGE

While useful, these metrics have **important limitations**, especially for **creative or open-ended tasks**:

- They **penalize valid but novel responses** — because the generated text might be correct or high quality, even if it doesn't match the reference exactly.
- They **struggle with semantic equivalence** — they don’t "understand meaning", just surface word overlaps.
- This makes them **less reliable for tasks like story generation, Q&A, or chatbot interactions**, where many answers can be equally valid.

---

## 🔹 The Bigger Picture: Evaluating LLMs

To properly evaluate LLMs, especially in complex or real-world applications, a **multi-method evaluation strategy** is recommended:

### ✅ **Traditional Metrics (like BLEU/ROUGE)**
- Quick, objective, and reproducible.
- Best for **structured tasks** with clear reference outputs.

### ✅ **Human Evaluation**
- Gold standard for assessing **fluency, relevance, coherence, and creativity**.
- But expensive and not easily scalable.

### ✅ **LLM-Powered Auto-Raters**
- Use another calibrated language model to **simulate human evaluation**.
- Can scale more easily than human raters.
- Requires **calibration against human judgments** to ensure reliability.

---

## 🔹 Final Thoughts

So, while **BLEU and ROUGE are valuable tools**, especially in tasks like translation and summarization, they are **not enough by themselves** for a full evaluation of LLM outputs. The best evaluations combine:

- **Similarity-based automatic metrics** (like BLEU/ROUGE),
- **Human assessments**, and
- **LLM-based scoring systems**.

This ensures that both **objective correctness** and **subjective quality** are properly captured — especially important when working with LLMs in nuanced or high-stakes domains.

---

Would you like a visual summary or code example showing how BLEU or ROUGE is computed in practice?


In [12]:
query_oneline = query.replace("\n", " ")
## Concise answer prompt
# prompt = f"""You are a helpful and informative bot that answers questions using text from the reference passage included below. 
# Be sure to respond in a complete sentence, being concise, including only relevant information. 
# However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
# strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

# QUESTION: {query_oneline}
# """

## Detailed answer prompt
prompt = f"""You are a helpful and informative bot that answers questions using text from the reference passage included below. 
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. 
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: {query_oneline}
"""

# Add the retrieved documents to the prompt.
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

print(prompt)

You are a helpful and informative bot that answers questions using text from the reference passage included below. 
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. 
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: What are the limitations of BLEU and ROUGE
PASSAGE:  Sure! Let’s break this down into a **clear, structured explanation** of what BLEU and ROUGE are, how they’re used, and where they fit in the **larger picture of evaluating Large Language Models (LLMs)** — especially in tasks like prompt engineering, machine translation, summarization, or domain-specific applications.  ---  ## 🔹 What Are BLEU and ROUGE?  ### **BLEU (Bilingual Evaluation Understudy)** - Originally designed to **evaluate machine translation**. - Measures **how similar a generated

# Generate response from the query result using any LLM

In [13]:
response = generate('llama3.2', prompt)

In [14]:
Markdown(response['response'])

BLEU and ROUGE, two popular metrics used to evaluate the quality of text generated by Large Language Models (LLMs), have several limitations that make them less reliable for certain tasks. While they are useful for evaluating the precision and recall of a model's output, they can penalize valid but novel responses and struggle with semantic equivalence, which means they don't truly understand the meaning behind the words. This makes them less suitable for creative or open-ended tasks like story generation, Q&A, or chatbot interactions, where multiple answers can be equally valid.

For example, if a model generates a response that is correct but not identical to the reference answer, BLEU and ROUGE will likely score it lower. Similarly, if the model captures key phrases from the original content in its summary, ROUGE will measure the overlap, but may not necessarily evaluate the summary's coherence or relevance.

To get a comprehensive understanding of an LLM's performance, it's essential to combine automatic metrics like BLEU and ROUGE with human evaluations and LLM-based scoring systems. This ensures that both objective correctness and subjective quality are properly captured, which is crucial when working with LLMs in nuanced or high-stakes domains.

In summary, while BLEU and ROUGE are valuable tools for evaluating LLM outputs, they should be used in conjunction with other evaluation methods to ensure a more complete picture of the model's performance.

# Semantic RAG (Manual, Not as Accurate as Expected)

### Key Idea
1. Split the documents into sentences at sentence-ending punctuation (`.`, `?`, `!`).
2. Index each sentence by its position.
3. Group: for each sentence, add a buffer of neighboring sentences on either side, so that each group is a small window of text.
4. Compute the distance between the embeddings of consecutive groups.
5. Merge neighboring groups whose distance is small, i.e. keep similar sentences together.
6. Split wherever the distance is large, i.e. where neighboring sentences are not similar.

Three different strategies can be used to decide where to split in semantic chunking:

- percentile: all distances between consecutive groups are calculated, and a split is made wherever the distance is greater than the X-th percentile.

- standard_deviation: a split is made wherever the distance is more than X standard deviations above the mean distance.

- interquartile: a split is made wherever the distance exceeds the third quartile by more than 1.5 times the interquartile range.
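
As a rough illustration of how the three rules behave, the sketch below applies each of them to a small, made-up list of distances. The values in `toy_distances`, the variable names, and the parameter choices (75th percentile, 1 and 2 standard deviations) are assumptions for this example only and are not taken from the notebook's data.

```python
import numpy as np

# Hypothetical cosine distances between consecutive sentence groups.
toy_distances = np.array([0.05, 0.07, 0.06, 0.60, 0.08, 0.05, 0.55, 0.06])

# percentile: split where the distance exceeds the X-th percentile (here X = 75).
cut_percentile = np.flatnonzero(toy_distances > np.percentile(toy_distances, 75))

# standard_deviation: split where the distance exceeds mean + X * std.
mean, std = toy_distances.mean(), toy_distances.std()
cut_stddev = np.flatnonzero(toy_distances > mean + 1.0 * std)
cut_stddev_strict = np.flatnonzero(toy_distances > mean + 2.0 * std)

# interquartile: split where the distance exceeds Q3 + 1.5 * IQR.
q1, q3 = np.percentile(toy_distances, [25, 75])
cut_iqr = np.flatnonzero(toy_distances > q3 + 1.5 * (q3 - q1))

print(cut_percentile, cut_stddev, cut_stddev_strict, cut_iqr)
```

With these toy values, the percentile, 1-standard-deviation, and interquartile rules all flag positions 3 and 6, while the 2-standard-deviation rule flags nothing. The number of chunks is therefore quite sensitive to the threshold parameter.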

In [15]:
def split_into_sentences(text):
    # Split after sentence-ending punctuation followed by one or more spaces
    # (a newline after the punctuation is not treated as a boundary)
    sentences = re.split(r'(?<=[.!?]) +', text.strip())
    return [s.strip() for s in sentences if s.strip()]
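
The pattern in `split_into_sentences` only splits where the punctuation is followed by a literal space, so sentences separated by newlines (common in markdown-formatted text) stay glued together. The sketch below compares that pattern with a hypothetical variant that accepts any whitespace. The `sample` text and the `\s+` variant are assumptions for illustration and are not used elsewhere in this notebook.

```python
import re

sample = "First sentence. Second one!\nThird after a newline.\n\n- Bullet item."

# Pattern used above: punctuation followed by one or more literal spaces.
space_only = [s.strip() for s in re.split(r'(?<=[.!?]) +', sample.strip()) if s.strip()]

# Hypothetical variant: punctuation followed by any whitespace, newlines included.
any_whitespace = [s.strip() for s in re.split(r'(?<=[.!?])\s+', sample.strip()) if s.strip()]

print(len(space_only), len(any_whitespace))
```

On this sample the space-only pattern yields 2 pieces and the whitespace variant yields 4. Coarser pieces mean fewer places where a chunk boundary can fall, which may be one reason the chunks are less precise than expected.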

In [16]:
def get_sentence_groups(indexed_sentences, buffer=2):
    groups = []
    for i in range(len(indexed_sentences)):
        start = max(0, i - buffer)
        end = min(len(indexed_sentences), i + buffer + 1)
        group = " ".join([s for _, s in indexed_sentences[start:end]])
        groups.append((i, group))
    return groups
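
To make the windowing concrete, here is a small sketch on four placeholder sentences. `window_groups` is a hypothetical, compact stand-in for `get_sentence_groups` that takes plain strings, repeated here only so that the example runs on its own.

```python
def window_groups(sentences, buffer=1):
    """Pair each position with the text of its surrounding window."""
    groups = []
    for i in range(len(sentences)):
        start = max(0, i - buffer)                  # clamp at the first sentence
        end = min(len(sentences), i + buffer + 1)   # clamp at the last sentence
        groups.append((i, " ".join(sentences[start:end])))
    return groups

toy = ["A.", "B.", "C.", "D."]
toy_groups = window_groups(toy, buffer=1)
for idx, text in toy_groups:
    print(idx, text)
```

Neighboring windows overlap (position 1 covers `A. B. C.` and position 2 covers `B. C. D.`), so consecutive group embeddings share most of their text. A larger `buffer` smooths the distances between neighbors, and a smaller one makes them more sensitive to individual sentences.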

In [17]:
def embed_groups(groups):
    """Embed the text of each (index, text) group with the bge-m3 model via Ollama."""
    texts = [g[1] for g in groups]
    response = embed(model='bge-m3', input=texts)
    return response['embeddings']

def compute_distances(embeddings):
    """Return the cosine distance between each pair of consecutive embeddings."""
    return [cosine_distances([embeddings[i]], [embeddings[i+1]])[0][0] for i in range(len(embeddings)-1)]
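
For reference, cosine distance is one minus cosine similarity: it is close to 0 for vectors pointing in the same direction and 1 for orthogonal vectors. The sketch below computes it directly with NumPy on toy vectors. The helper name `cosine_distance` and the vectors are assumptions for illustration, while the notebook itself relies on scikit-learn's `cosine_distances`.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle between a and b), for two non-zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # unrelated directions

print(same_direction, orthogonal)
```

Because the measure ignores vector length, only the direction of the embeddings matters when consecutive groups are compared.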


### Strategy A: `percentile`

In [18]:
def split_by_percentile(distances, percentile=95):
    threshold = np.percentile(distances, percentile)
    return [i for i, d in enumerate(distances) if d > threshold]


### Strategy B: `standard_deviation`

In [19]:
def split_by_stddev(distances, num_std=2):
    mean = np.mean(distances)
    std = np.std(distances)
    return [i for i, d in enumerate(distances) if d > mean + num_std * std]


### Strategy C: `interquartile`

In [20]:
def split_by_interquartile(distances):
    q1 = np.percentile(distances, 25)
    q3 = np.percentile(distances, 75)
    iqr = q3 - q1
    threshold = q3 + 1.5 * iqr
    return [i for i, d in enumerate(distances) if d > threshold]


In [21]:
def make_chunks_from_splits(sentences, split_indices):
    """Join sentences into chunks, ending a chunk after each index in split_indices."""
    chunks = []
    start = 0
    for split in split_indices:
        # Accept either plain strings or (index, sentence) tuples
        chunk_sentences = [s[1] if isinstance(s, tuple) else s for s in sentences[start:split+1]]
        chunks.append(" ".join(chunk_sentences))
        start = split+1
    # Whatever follows the last split forms the final chunk
    if start < len(sentences):
        chunk_sentences = [s[1] if isinstance(s, tuple) else s for s in sentences[start:]]
        chunks.append(" ".join(chunk_sentences))
    return chunks
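
A split index `i` means "end a chunk after sentence `i`", since distance `i` compares group `i` with group `i+1`. The sketch below shows that mapping on placeholder sentences. `chunks_from_splits` is a hypothetical slicing-based equivalent for plain strings that assumes sorted split indices, written only so that the example runs on its own.

```python
def chunks_from_splits(sentences, split_indices):
    """Cut the sentence list after each split index and join every piece."""
    bounds = [0] + [i + 1 for i in split_indices] + [len(sentences)]
    # Skip empty slices, which occur when the last sentence is itself a split index.
    return [" ".join(sentences[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]

toy_sentences = ["s0.", "s1.", "s2.", "s3.", "s4."]
toy_chunks = chunks_from_splits(toy_sentences, [1, 3])
print(toy_chunks)
```

Here two split indices produce three chunks. In general, `n` split indices give `n + 1` chunks as long as the last index is not the final sentence.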




In [22]:
# Split the documents into sentences and index them by position across all documents
all_sentences = [s for doc in documents for s in split_into_sentences(str(doc))]
indexed_sentences = list(enumerate(all_sentences))

In [23]:
indexed_sentences

[(0, 'Sure!'),
 (1,
  'Let’s break this down into a **clear, structured explanation** of what BLEU and ROUGE are, how they’re used, and where they fit in the **larger picture of evaluating Large Language Models (LLMs)** — especially in tasks like prompt engineering, machine translation, summarization, or domain-specific applications.\n\n---\n\n## 🔹 What Are BLEU and ROUGE?\n\n### **BLEU (Bilingual Evaluation Understudy)**\n- Originally designed to **evaluate machine translation**.\n- Measures **how similar a generated sentence is to one or more reference sentences**, based on matching sequences of words (n-grams).\n- Focuses on **precision** — how much of the generated text overlaps with the reference.\n\n**Example:**\nIf the model generates:  \n`"The cat is on the mat"`  \nand the reference is:  \n`"The cat sat on the mat"`  \nBLEU looks at the overlapping n-grams (like `the`, `cat`, `on`, `the`, `mat`), and calculates a score.\n\n### **ROUGE (Recall-Oriented Understudy for Gisting Ev

In [24]:
groups = get_sentence_groups(list(indexed_sentences), buffer=2)

In [25]:
groups

[(0,
  'Sure! Let’s break this down into a **clear, structured explanation** of what BLEU and ROUGE are, how they’re used, and where they fit in the **larger picture of evaluating Large Language Models (LLMs)** — especially in tasks like prompt engineering, machine translation, summarization, or domain-specific applications.\n\n---\n\n## 🔹 What Are BLEU and ROUGE?\n\n### **BLEU (Bilingual Evaluation Understudy)**\n- Originally designed to **evaluate machine translation**.\n- Measures **how similar a generated sentence is to one or more reference sentences**, based on matching sequences of words (n-grams).\n- Focuses on **precision** — how much of the generated text overlaps with the reference.\n\n**Example:**\nIf the model generates:  \n`"The cat is on the mat"`  \nand the reference is:  \n`"The cat sat on the mat"`  \nBLEU looks at the overlapping n-grams (like `the`, `cat`, `on`, `the`, `mat`), and calculates a score.\n\n### **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

In [26]:
type(groups)

list

In [27]:
embeddings = embed_groups(groups)

In [28]:
len(embeddings)

31

In [29]:
distances = compute_distances(embeddings)

In [30]:
distances

[np.float64(0.014861479771087205),
 np.float64(0.6242932450524188),
 np.float64(0.31781167707691527),
 np.float64(0.19440118284302832),
 np.float64(0.30915355963671864),
 np.float64(0.07747864256306847),
 np.float64(0.10225654671764428),
 np.float64(0.02980765548101716),
 np.float64(0.47633028816037215),
 np.float64(0.1470827182584915),
 np.float64(0.03461220532224285),
 np.float64(0.12691344442639108),
 np.float64(0.07195052112201783),
 np.float64(0.14416966991275137),
 np.float64(0.14256233054457668),
 np.float64(0.08367140675740092),
 np.float64(0.15517983843461058),
 np.float64(0.5864761692184371),
 np.float64(0.247394926088923),
 np.float64(0.06169298272705537),
 np.float64(0.2675377873066991),
 np.float64(0.41050125420073114),
 np.float64(0.5596207794418939),
 np.float64(0.0890455808314572),
 np.float64(0.587182870487359),
 np.float64(0.42110101697568403),
 np.float64(0.23969887485799624),
 np.float64(0.42974350368931713),
 np.float64(0.6225042978472564),
 np.float64(0.0986388917

In [109]:
splits = split_by_percentile(distances, percentile=40)

In [110]:
splits

[1, 2, 3, 4, 8, 9, 13, 16, 17, 18, 20, 21, 22, 24, 25, 26, 27, 28]

In [111]:
documents = make_chunks_from_splits(list(indexed_sentences), splits)

In [112]:
documents

['Sure! Let’s break this down into a **clear, structured explanation** of what BLEU and ROUGE are, how they’re used, and where they fit in the **larger picture of evaluating Large Language Models (LLMs)** — especially in tasks like prompt engineering, machine translation, summarization, or domain-specific applications.\n\n---\n\n## 🔹 What Are BLEU and ROUGE?\n\n### **BLEU (Bilingual Evaluation Understudy)**\n- Originally designed to **evaluate machine translation**.\n- Measures **how similar a generated sentence is to one or more reference sentences**, based on matching sequences of words (n-grams).\n- Focuses on **precision** — how much of the generated text overlaps with the reference.\n\n**Example:**\nIf the model generates:  \n`"The cat is on the mat"`  \nand the reference is:  \n`"The cat sat on the mat"`  \nBLEU looks at the overlapping n-grams (like `the`, `cat`, `on`, `the`, `mat`), and calculates a score.\n\n### **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**\n- 

In [113]:
DB_NAME = "SemanticRAG01"
chroma_client = chromadb.Client()
embed_fn = OllamaEmbeddingFunction()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)
db.add(documents=documents, ids=[str(i) for i in range(len(documents))])

In [114]:
db.count()

19

In [115]:
db.peek(1)

{'ids': ['0'],
 'embeddings': array([[-0.05443733,  0.00357789, -0.03386477, ..., -0.03299839,
          0.06081931, -0.03797556]], shape=(1, 1024)),
 'documents': ['Sure! Let’s break this down into a **clear, structured explanation** of what BLEU and ROUGE are, how they’re used, and where they fit in the **larger picture of evaluating Large Language Models (LLMs)** — especially in tasks like prompt engineering, machine translation, summarization, or domain-specific applications.\n\n---\n\n## 🔹 What Are BLEU and ROUGE?\n\n### **BLEU (Bilingual Evaluation Understudy)**\n- Originally designed to **evaluate machine translation**.\n- Measures **how similar a generated sentence is to one or more reference sentences**, based on matching sequences of words (n-grams).\n- Focuses on **precision** — how much of the generated text overlaps with the reference.\n\n**Example:**\nIf the model generates:  \n`"The cat is on the mat"`  \nand the reference is:  \n`"The cat sat on the mat"`  \nBLEU looks 

In [120]:
# query = "what are the limitations of BLEU and ROUGE"
# query = "what are the different types of word embeddings in NLP"
query = "what is self attention in LLMs"
result = db.query(query_texts=[query], n_results=2)
[all_passages] = result["documents"]

In [121]:
for i in range(len(all_passages)):
    print(f"Passage {i}:")
    print(all_passages[i])
    print()

Passage 0:
Techniques like **Flash Attention** aim to optimize the self-attention calculation to improve latency and reduce costs.

In summary, the **self-attention mechanism is a fundamental building block of modern LLMs**, enabling them to process information in parallel, understand contextual relationships between words in a sequence, capture long-range dependencies, and ultimately achieve strong performance in various natural language understanding and generation tasks. Its effectiveness has led to its widespread adoption in the Transformer architecture and its derivatives, which power many of the state-of-the-art LLMs discussed in the sources.

Passage 1:
Based on the sources, the **Self-Attention Mechanism** is a **crucial component** within the architecture of **Large Language Models (LLMs)**, particularly those based on the **Transformer architecture**.

Here's a breakdown of its significance in the context of LLMs:

*   **Core Mechanism in Transformers:** The self-attention me

In [108]:
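# Clean up: remove every entry (IDs were assigned as "0".."n-1" when the collection was filled)
db.delete(ids=[str(i) for i in range(db.count())])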