<a href="https://colab.research.google.com/github/rolandkristo/llm/blob/main/retreival_augmented_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://cdn.githubraw.com/antndlcrx/oss_2024/main/images/dpir_oss.png?raw=true:,  width=70" alt="My Image" width=500>

# **Retrieval Augmented Generation**

In [None]:
#@title Default Set-Up
!pip install -q langchain langgraph pymupdf langchain-text-splitters langchain-community langchain-huggingface

import torch


🔍 **Retrieval-Augmented Generation (RAG)** is a powerful technique that improves how language models answer knowledge-intensive questions.

While LLMs store vast amounts of information in their parameters, they cannot easily update facts, explain their outputs, or know about events after their training. RAG addresses this by combining a language model with a **retrieval system**: instead of relying solely on stored knowledge, the model **fetches relevant documents** (e.g., from Wikipedia) and uses them as context to generate accurate, grounded responses.

## ⚙️ 1. **Implementing a RAG Pipeline**

A RAG pipeline consists of these steps:

1. 📚 **Represent Your Documents with Embeddings**

    First, we convert each document into a vector (a numeric representation of its meaning) using a sentence embedding model. This lets us compare documents and queries based on semantic similarity — not just exact wording.

    *Why*? Language models can't "search" on their own; we give them this ability by turning our document collection into a searchable index.

2. **🔍 Implement a Retriever**  
   
   Given a user's question (query), we compute its embedding and compare it to our document vectors to find the most relevant pieces of information.

    *Why*? Instead of relying on what the model remembers (which may be outdated or wrong), we retrieve accurate, up-to-date context.

3. 🧩 **Build the Prompt (Query + Context)**  

   We then take the original query and the top retrieved documents and merge them into a single prompt. This composite input is passed to the language model.

    *Why*? This ensures the model generates its answer using retrieved facts, not just memorized knowledge.

4. **✍️ Generate the Answer**  

   Finally, we ask the language model to answer the question using the combined input.

    *Why*? The generation is now grounded in relevant evidence, reducing hallucination and improving factual accuracy.
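The four steps above can be sketched end to end in a few lines of plain Python. This is only a toy illustration: the bag-of-words `embed` function stands in for a real sentence embedding model, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) and the two example documents are made up for this sketch.

```python
import math
from collections import Counter

def embed(text):
    """Step 1 (toy version): a bag-of-words count vector instead of a learned embedding."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=1):
    """Step 2: rank documents by similarity to the query and keep the top k."""
    q_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Step 3: merge the query and the retrieved context into a single prompt."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Cats are fluffy animals with cute faces.",
    "Cola is a soft drink that is rich in sugar.",
]
query = "Is cola rich in sugar?"
top_docs = retrieve(query, docs, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)  # Step 4 would pass this prompt to a language model
```

The rest of this tutorial replaces each toy piece with a real component: a learned embedding model, a vector store, a prompt template, and an instruction-tuned LLM.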

## **LangChain: Endow LMs with External Tools**

[**LangChain**](https://www.langchain.com/) is a powerful Python framework for building applications that **combine language models with other external tools, knowledge sources, and workflows**. It helps structure complex LLM applications — like chatbots, retrieval-based QA systems, and agents — into modular and reusable components.

At its core, LangChain simplifies working with **documents**, **retrieval**, **memory**, **tools**, and **chains of logic** — enabling developers to go from raw data to intelligent assistants quickly and flexibly.

---

#### **Key Building Blocks**

📄 **`Document` Class**  
LangChain represents each text unit as a [**Document**](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) object, which holds:

- `page_content`: the main text
- `metadata`: optional info like source, filename, or tags

```python
from langchain_core.documents import Document

document = Document(
    page_content="Hello, world!",
    metadata={"source": "https://example.com"}
)
```

---

**Document Loaders**  
LangChain offers a wide variety of [**document loaders**](https://python.langchain.com/docs/concepts/document_loaders/) to ingest and process data from different sources: PDFs, CSVs, websites, Google Drive, Notion, and more. Loaders return data as a list of `Document` objects, ready for further processing.

For example, loading from a CSV:

```python
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="data.csv")
documents = loader.load()
```

You can explore available loaders in [this guide](https://python.langchain.com/docs/integrations/document_loaders/) and the [official documentation](https://python.langchain.com/docs/concepts/document_loaders/).

In [None]:
#@title **Politics Articles**
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
import os

from google.colab import drive
drive.mount('/content/drive')

# DATA_LINK = "https://drive.google.com/drive/folders/1k7ANJEF8QYrKNb4oGs0--YBRJ61nXtog?usp=sharing"

pdf_dir = ""

# load each PDF as one long document (instead of page-by-page as default by PyMuPDFLoader)
def load_full_docs(pdf_dir):
    all_docs = []
    for root, dirs, files in os.walk(pdf_dir):
        for file in files:
            if file.endswith(".pdf"):
                loader = PyMuPDFLoader(os.path.join(root, file))
                # merge pages into one document per PDF
                pages = loader.load()
                full_text = "\n".join([page.page_content for page in pages])
                all_docs.append(Document(page_content=full_text, metadata={"source": file}))
    return all_docs

all_docs = load_full_docs(pdf_dir)
print(f"Loaded {len(all_docs)} full PDF documents")

Mounted at /content/drive
Loaded 24 full PDF documents


In [None]:
#@title **Alternative Data: Scrape a Web Page**
from langchain_community.document_loaders import WebBaseLoader
import bs4

LINK = "https://doomscrollingbabel.manoel.xyz/p/simulating-human-behavior"

# Load the contents of the blog post (chunking happens in a later step)
loader = WebBaseLoader(
    web_paths=(LINK,),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
        )
    ),
)

text = loader.load()

# minimal preprocessing
# text[0].page_content = text[0].page_content[421:]
# text[0].page_content = text[0].page_content[:-619]
# my_text = text[0].page_content



### **Split Documents into Chunks for Efficient Retrieval**

Long documents can be a challenge when working with language and embedding models — especially because most models have **maximum input length limits**. To ensure smooth performance and more accurate results, we break our documents into smaller, overlapping pieces using LangChain’s [**Text Splitters**](https://python.langchain.com/docs/concepts/text_splitters/).

This process helps in several key ways:
- **Handles input size limits**: Embedding and generation models can't process arbitrarily long text. Splitting lets us work within model constraints.
- **Improves embedding quality**: Smaller chunks produce more focused embeddings that better capture the meaning of each section.
- **Boosts retrieval accuracy**: Search and RAG systems work better when they can match queries to smaller, more targeted document sections.



We use the `RecursiveCharacterTextSplitter` 🛠️ from LangChain, which intelligently breaks text based on structure (like paragraphs or sentences), rather than just chopping at a fixed length.
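To build intuition for what chunk size and overlap mean, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers paragraph and sentence boundaries); the function `chunk_text` and its parameters are hypothetical names used only for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=150):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary is still seen in full by at least one chunk.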

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# create a splitter that prefers paragraph/sentence boundaries over blind character counts
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum number of characters per chunk
    chunk_overlap=150,  # characters shared between neighbouring chunks to preserve context
    separators=["\n\n", "\n", ".", " "]  # priority order of where to split
)

# apply to all loaded full-text documents
split_docs = text_splitter.split_documents(all_docs)
print(f"Split into {len(split_docs)} smaller chunks")

Split into 2143 smaller chunks


### 📚 **Step 1: Build a Document "Library" (Indexed Embeddings)**


Once we've split our documents into smaller, focused chunks, the next step is to **represent them numerically** — this is where embeddings come in. We use a **sentence embedding model** to turn each chunk of text into a dense vector (a list of numbers) that captures its semantic meaning.

This vector representation allows us to:
- 🔍 **Search** documents by meaning, not just keywords
- 💡 **Retrieve relevant chunks** to provide context for answering questions in RAG

---

Embedding models convert documents (splits) and queries into vector representations. Read more about [Embedding Models](https://python.langchain.com/docs/concepts/embedding_models/) and [LangChain Embedding Integrations](https://python.langchain.com/docs/integrations/text_embedding/).

In this tutorial, we use [ModernBERT Embed (Base)](https://huggingface.co/nomic-ai/modernbert-embed-base), a model specifically optimized for embedding tasks. It transforms each document into a **high-dimensional vector** such that **texts with similar meanings produce similar vectors** — even if the wording is different.

In [None]:
# runs about 2 minutes
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

# load a model to create vector representations of documents
embedding_model = HuggingFaceEmbeddings(model_name="nomic-ai/modernbert-embed-base")
vector_store = InMemoryVectorStore(embedding_model)

# index the chunks
vector_store.add_documents(split_docs)

### **How Do We Measure Semantic Similarity?**

Once we have document and query embeddings, we compare them using **cosine similarity**, a common way to measure how "close" two vectors are in meaning.

The formula is:

$$
\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}
$$

- $ A \cdot B $ is the **dot product** of the two vectors.
- $ \|A\| $ and $ \|B\| $ are their **magnitudes** (lengths).
- The result is a number between -1 and 1:  
  - 1 = very similar  
  - 0 = unrelated  
  - -1 = opposite direction (not common in embeddings)

---

#### 📐 **A Small Example: Cosine Similarity in Action**

Consider some toy 5D vectors:

```python
query = [0.1, 0.2, 0.3, 0.4, 0.5]
doc_1 = [0.1, 0.2, 0.3, 0.4, 0.5]   # same as query
doc_2 = [0.5, 0.4, 0.3, 0.2, 0.1]   # reversed pattern
```

---

**Query vs Doc 1: (Identical)**

1. **Dot product**
    $$
    0.1 \cdot 0.1 + 0.2 \cdot 0.2 + 0.3 \cdot 0.3 + 0.4 \cdot 0.4 + 0.5 \cdot 0.5 = 0.01 + 0.04 + 0.09 + 0.16 + 0.25 = 0.55
    $$

2. **Magnitude of each vector**
    $$
    \|A\| = \sqrt{0.1^2 + 0.2^2 + 0.3^2 + 0.4^2 + 0.5^2} = \sqrt{0.55} \approx 0.7416
    $$

3. **Cosine similarity**  
    $$
    \frac{0.55}{0.7416 \cdot 0.7416} = \frac{0.55}{0.55} = 1.0
    $$

✔️ Perfect match!

---

**Query vs Doc 2: (Reversed)**

1. **Dot product**
    $$
    0.1 \cdot 0.5 + 0.2 \cdot 0.4 + 0.3 \cdot 0.3 + 0.4 \cdot 0.2 + 0.5 \cdot 0.1 = 0.05 + 0.08 + 0.09 + 0.08 + 0.05 = 0.35
    $$

2. **Magnitude of each vector**  
    Since both vectors have the same numbers in different orders:  
    $$
    \|A\| = \|B\| = \sqrt{0.55} \approx 0.7416
    $$

3. **Cosine similarity**
    $$
    \frac{0.35}{0.7416 \cdot 0.7416} = \frac{0.35}{0.55} \approx 0.636
    $$

🔻 Lower similarity — even though the values are the same, the **order matters**! The semantic meaning has shifted.

---


Cosine similarity helps us detect **semantic closeness** — not just based on what numbers are used, but also **how they are arranged**. This is what powers document retrieval in RAG!
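The worked example can be checked with a few lines of plain Python. The helper name `cosine_sim` is chosen for this sketch only; libraries such as scikit-learn provide their own implementations.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.2, 0.3, 0.4, 0.5]
doc_1 = [0.1, 0.2, 0.3, 0.4, 0.5]   # same as query
doc_2 = [0.5, 0.4, 0.3, 0.2, 0.1]   # reversed pattern

sim_1 = cosine_sim(query, doc_1)
sim_2 = cosine_sim(query, doc_2)
print(round(sim_1, 3), round(sim_2, 3))  # 1.0 0.636
```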

#### 💡 **Why Does This Matter?**

Once we compute similarity for every document, we **select the top-k most similar** to the query. These are the passages we feed to the language model as external context — so it can generate more accurate, fact-grounded answers.

This similarity step makes RAG pipelines **powerful and reliable**, especially when **working with large corpora or when factual correctness is critical**.
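Selecting the top-k documents is just a sort over the similarity scores. The sketch below uses made-up scores for one query against five chunks; `top_k` is a hypothetical helper, not a LangChain function.

```python
def top_k(scores, k):
    """Return (index, score) pairs for the k highest-scoring chunks."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# made-up similarity scores between one query and five chunks
scores = [0.12, 0.67, 0.05, 0.58, 0.31]
best = top_k(scores, k=3)
print(best)  # [(1, 0.67), (3, 0.58), (4, 0.31)]
```

In the pipeline later in this tutorial, the retriever plays this role, with `k` set through `search_kwargs={"k": 3}`.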

---

We use the `HuggingFaceEmbeddings` class from `langchain_huggingface` to load an embedding model, and the `InMemoryVectorStore` to store and retrieve our chunk vectors.

After this step, your chunks are transformed into a **semantic index** — a searchable space where similar content is placed close together.

In [None]:
# Let us look at another example:
from sentence_transformers import SentenceTransformer
emb_model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# we want model to answer a set of questions of dire importance to us. They are:
query_embeddings = emb_model.encode([
    "search_query: why are cats cute?",
    "search_query: Is cola healthy?",
])

# we use model.encode to turn raw text into numerical vectors representing the substantive
# meaning of our documents and queries.

doc_embeddings = emb_model.encode([
    "search_document: cats are nice fluffy creatures with cute faces and horrible yet adorable character.",
    "search_document: TSNE is a dimensionality reduction algorithm.",
    "search_document: cola is a soft-drink that is rich in sugar."
])

print(f"Query embedding dimension: {query_embeddings.shape}", "\n",
    f"documents embedding shape: {doc_embeddings.shape}", "\n")
# (2, 768) (3, 768)

# Next, we search for most relevant documents for our queries.
# To do that, we rely on cosine similarity
similarities = emb_model.similarity(query_embeddings, doc_embeddings)
print(similarities)

# tensor([[0.6725, 0.1002, 0.1628],
#         [0.1014, 0.0234, 0.5753]])

# The first query (about cats) is most similar to document 0, with similarity score 0.67
# The second query (about cola) is most similar to document 2, with similarity score 0.58

Query embedding dimension: (2, 768) 
 documents embedding shape: (3, 768) 

tensor([[0.6725, 0.1002, 0.1628],
        [0.1014, 0.0234, 0.5753]])


### 📎 **Load a Text Generation Model with LangChain Integration**

Next, we load a **pretrained instruction-tuned model** from Hugging Face and prepare it for use inside a LangChain pipeline.

- `transformers`: Hugging Face's core library for working with pretrained models.
- `langchain_huggingface`: Lets us wrap Hugging Face models so they can be used in LangChain workflows.

We are using the [**Qwen2.5 0.5B Instruct**](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model, a small but capable model optimized for **instruction-following** tasks like question answering and summarization.

- `pipeline("text-generation")`: Creates a high-level interface to handle input, tokenization, generation, and output formatting. We have seen this in the intro to HF tutorial.
- `max_new_tokens=512`: Limits how much text the model is allowed to generate in response.

- `HuggingFacePipeline`: A LangChain wrapper that allows us to use Hugging Face pipelines as **LLMs inside LangChain chains** (e.g. `RetrievalQA`, `ConversationalRetrievalChain`, etc.).
- This turns the model into a plug-and-play component for more complex RAG pipelines and question-answering chains.

**Now** the model is ready to be dropped into a RAG pipeline or other LangChain components like `RetrievalQA` or `LLMChain`.

In [None]:
# pick a model and download it
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
pipe = pipeline("text-generation",
                model=model_id,
                max_new_tokens=512)

llm = HuggingFacePipeline(pipeline=pipe)



###  **RAG Pipeline (Composable LangChain Runnables)**

LangChain now encourages building RAG pipelines using **modular and flexible components** called `Runnable` objects. This new style lets us plug in a retriever, prompt, LLM, and output parser like building blocks 🧱 — and it's more customizable and future-proof.

Here is an example of how it works:

- `vector_store.as_retriever()` grabs the most relevant docs for your query.
- `format_docs()` flattens those documents into clean text.
- `prompt = hub.pull("rlm/rag-prompt")` loads a well-tested community prompt template.
- `llm` runs the final prompt through your language model.
- `StrOutputParser()` extracts clean text from the model’s output.

All of this is wrapped into a single chain you can call with `.invoke(query)`.

```python
response = qa_chain.invoke("What are political institutions?")
```
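To see why components can be chained with `|`, here is a toy mimic of the idea in plain Python. The `Step` class is a made-up stand-in written for this sketch and is not LangChain's actual `Runnable` implementation; it only shows how overloading `|` lets the output of one step feed the next.

```python
class Step:
    """A toy stand-in for a Runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # self | other -> a new Step that runs self first, then other
        return Step(lambda value: other.invoke(self.invoke(value)))

# three placeholder steps: "retrieve", "format the prompt", "generate"
retrieve = Step(lambda q: {"question": q, "context": "some retrieved text"})
to_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda p: p.upper())  # placeholder for a real model

chain = retrieve | to_prompt | fake_llm
result = chain.invoke("What are political institutions?")
print(result)
```

Calling `chain.invoke(...)` runs the steps left to right, which is the same reading order as the real chain built in the next code cell.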

---

#### **Controlling Model Behavior (Generation Settings)**

To adjust how the model responds (e.g., creativity or output length), set **generation parameters** when you create the Hugging Face `pipeline`, before wrapping it in `HuggingFacePipeline`:

```python
pipe = pipeline(
    "text-generation",
    model=model_id,
    max_new_tokens=512,  # upper bound on the number of generated tokens
    do_sample=True,      # enable sampling so that temperature has an effect
    temperature=0.7,
)
llm = HuggingFacePipeline(pipeline=pipe)
```

These options are forwarded to the model's `generate()` function. Any chain that should use the new settings has to be built with this `llm`. For example:

- `temperature`: controls randomness (higher = more creative)
- `max_new_tokens`: limits how long the model's response can be
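As a rough illustration of what `temperature` does during sampling, the sketch below divides a set of made-up token scores (logits) by the temperature before applying a softmax. The function name and the three logits are assumptions for this example; real models apply the same idea over their whole vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=2.0)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
```

With a low temperature the top-scoring token takes most of the probability mass; with a high temperature the distribution flattens, so less likely tokens are sampled more often.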

In [None]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableMap

# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Note on code syntax:
#   A | B | C is equivalent to C.invoke(B.invoke(A.invoke(input)))

qa_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough(),
    }

    | prompt
    | llm
    | StrOutputParser()
)

query = "What are the main findings about political polarization in 2024?"

# # "What are the main findings about political polarization in 2024?"
# # "What makes political institutions solid and robust?"

# for a single question to test
# response = qa_chain.invoke(query)

# also return the question and retrieved contexts so we can evaluate with RAGAS later
full_chain = RunnableMap({
    "user_input": lambda x: x,
    "response": qa_chain,
    "retrieved_contexts": lambda q: [doc.page_content for doc in retriever.invoke(q)]
})

output = full_chain.invoke(query)



In [None]:
print("Question:", output["user_input"])
print("Contexts:", output["retrieved_contexts"])

Question: What are the main findings about political polarization in 2024?
Contexts: ['one of two policy areas (along with criminal justice) that did not polarize during this time\nperiod.\nMore recent studies examining the role of parties and interest groups show evidence of grow-\ning partisan polarization. Scholars have investigated factors that could be driving polarization\nin education, including increasing nationalization of education politics, adoption of polarizing\npolicies, and growing alignment of education interest groups with political parties. For example,\nWeinschenk (2022) shows that the relationship between county-level partisan votes for presi-\ndent and state superintendent election vote share has grown much stronger over time—for both\npartisan and nonpartisan races for state superintendent. In other words, even for a down-ballot\neducation-specific office, and even when partisanship is not on the ballot, voters are increasingly\nsupporting candidates for state super

In [None]:
#@title Helper Function to Print Model Response

def extract_answer(response, prefix="Assistant:"):
    """
    Extracts the model's answer from a RAG pipeline response string.

    Args:
        response (dict or str): Output from the RAG pipeline (response["result"]).
        prefix (str): Marker string before the model's answer.

    Returns:
        str: Cleaned answer string (without prompt or retrieved context).
    """
    if isinstance(response, dict):
        text = response.get("result", "")
    else:
        text = response

    if prefix in text:
        return text.split(prefix)[1].strip()
    else:
        return text.strip()

In [None]:
clean_answer = extract_answer(output["response"])
print(clean_answer)

Main findings about political polarization in 2024 were primarily discussed in relation to the lack of polarization in policy areas along with criminal justice. More recent studies indicate increased partisan polarization related to education. These findings suggest that while polarization may not be entirely new, it appears to be becoming more pronounced in recent years. Additionally, observations show that rising polarization and future directions for research regarding education's impact remain uncertain. Political polarization in education can be influenced by changing party affiliations, shifting public opinions, and the presence of special interests within the field. However, the overall trend remains relatively stable. The research indicates that political polarization in education can be caused by factors like changes in party affiliation or ideology, shifts in public opinion, and the rise of special interests. Despite some potential changes, the current trends appear to be rel

## 🧠 2. **Evaluation with RAGAS: Retrieval-Augmented Generation Assessment**



[**RAGAS**](https://aclanthology.org/2024.eacl-demo.16/) introduces a framework for evaluating retrieval-augmented generation (RAG) systems — systems that combine information retrieval with large language models (LLMs). While earlier approaches often relied on specialized LMs, simply appending retrieved documents to a standard LM's input has shown strong performance.

Check out the RAGAS [library](https://docs.ragas.io/en/stable/).


⚠️ Why RAG Evaluation Is Challenging

1. Evaluating complex RAG architectures is non-trivial.
2. It's difficult to measure how well a system retrieves **relevant documents** and **focused passages**.
3. Even with good retrieval, it's challenging to evaluate how **faithfully** an LLM uses the retrieved content.

🎯 RAGAS Metrics

RAGAS proposes a suite of [**automatic evaluation metrics**](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/) that require **no human annotations**, enabling system-level evaluation along several dimensions.

---

RAG Pipeline Components

A typical RAG system has several moving parts:

- The **retrieval model**
- The **corpus** being searched
- The **LLM** used for generation
- The **prompt** design

RAGAS focuses on evaluating the pipeline along three key metrics:

---

### **1. Faithfulness**

**Goal:** The generated answer should be **grounded in the retrieved context** to avoid hallucination and ensure verifiability.

Let:
- $a_s(q)$: the generated answer for question $q$
- $c(q) $: the retrieved context
- $S(a_s(q)) $: a set of atomic statements extracted from $ a_s(q) $
- $ V \subseteq S $: the subset of statements verified as supported by $c(q)$

✍️ LLM Prompts:

- **Statement extraction** prompt:
  ```
  Given a question and answer, create one or more statements from each sentence in the given answer.
  question: [question]
  answer: [answer]
  ```

- **Statement verification** prompt:
  ```
  Consider the given context and following statements, then determine whether they are supported by the information present in the context.
  Provide a brief explanation for each statement before arriving at the verdict (Yes/No).
  ```

📐 Faithfulness Score:

The **faithfulness score** is computed as:

$$
F = \frac{|V|}{|S|}
$$

Where:
- $|V|$: number of supported statements
- $|S|$: total number of extracted statements
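Once each statement has a Yes/No verdict, the score itself is a simple ratio. A minimal sketch, assuming the verdicts are already available as booleans (the helper name and the example verdicts are made up):

```python
def faithfulness_score(verdicts):
    """F = |V| / |S|: share of extracted statements judged as supported."""
    if not verdicts:
        return 0.0  # assumption: with no statements we report 0 instead of failing
    return sum(verdicts) / len(verdicts)

verdicts = [True, True, False, True]  # made-up verdicts for four statements
score = faithfulness_score(verdicts)
print(score)  # 0.75
```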

---

### **2. Answer Relevance**

**Goal:** The answer should be **directly relevant** to the question.

To compute this:

1. Use an LLM to generate $n$ questions $q_i$ from the answer $a_s(q)$ using:

   ```
   Generate a question for the given answer.
   answer: [answer]
   ```

2. Compute the cosine similarity between each generated question $q_i$ and the original question $q$.

📐 Answer Relevance Score:

Let $\text{sim}(q, q_i)$ be the cosine similarity between embeddings of $q$ and $q_i$. Then:

$$
AR = \frac{1}{n} \sum_{i=1}^{n} \text{sim}(q, q_i)
$$

Where:
- $AR$: answer relevance
- $n$: number of generated questions
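In code, the score is the mean of the similarities between the original question and each generated question. A minimal sketch with made-up similarity values (the function name is hypothetical):

```python
def answer_relevance(similarities):
    """AR = (1/n) * sum of sim(q, q_i) over the n generated questions."""
    if not similarities:
        raise ValueError("need at least one generated question")
    return sum(similarities) / len(similarities)

sims = [0.9, 0.8, 0.7]  # made-up cosine similarities for n = 3 questions
ar = answer_relevance(sims)
print(round(ar, 3))  # 0.8
```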

---

### **3. Context Relevance**

**Goal:** The retrieved context should contain **only the necessary information** to answer the question.

Long or irrelevant context passages:
- Increase computational cost
- Decrease LLM effectiveness

To evaluate this:

- Use an LLM to extract the minimal subset of sentences $ S_{\text{ext}} \subseteq c(q) $ necessary to answer $ q $, using the prompt:

   ```
   Please extract relevant sentences from the provided context that can potentially help answer the following question.
   If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information".
   ```

- Compare extracted subset with the full context to judge how focused the retrieved content was.


### **Manual RAGAS Implementation**

In [None]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
llm_eval = AutoModelForCausalLM.from_pretrained(model_name,
                                             trust_remote_code=True,
                                             torch_dtype=torch.float16, # load model with smaller memory requirements
                                             device_map='auto',
                                             )

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


In [None]:
#@title Faithfulness Score
def extract_statements(question, answer, tokenizer, llm):
    prompt = f"""Given a question and answer, create one or more statements from each sentence in the given answer.
question: {question}
answer: {answer}"""

    messages = [
        {"role": "system", "content": "You are a helpful assistant that breaks down answers into atomic statements."},
        {"role": "user", "content": prompt}
    ]

    encoded = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(encoded, return_tensors="pt").to(llm.device)

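    outputs = llm.generate(**inputs, max_new_tokens=256)

    # decode only the newly generated tokens; slicing the decoded string by
    # len(encoded) is unreliable because special tokens are stripped on decode
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    generated_part = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # one statement per line; drop leading list numbering such as "1. "
    statements = [s.lstrip("0123456789. ").strip() for s in generated_part.split("\n") if s.strip()]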
    return statements


def verify_statements(statements, context, tokenizer, llm):
    results = []
    for s in statements:
        prompt = f"""Consider the given context and the following statement. Determine whether it is supported by the information present in the context.

context: {context}
statement: {s}

Respond with a final verdict: Yes or No."""

        messages = [
            {"role": "user", "content": prompt}
        ]
        enc = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(enc, return_tensors="pt").to(llm.device)

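        outputs = llm.generate(**inputs, max_new_tokens=64)
        # decode only the generated tokens: the prompt itself contains "Yes or No",
        # so searching the full decoded text would always find "yes"
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
        verdict = "yes" in answer.lower()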
        results.append(verdict)

    return results

In [None]:
# output['user_input']
len(output['retrieved_contexts'])

3

In [None]:
statements = extract_statements(output['user_input'],
                                clean_answer,
                                tokenizer,
                                llm_eval)

verifications = verify_statements(statements,
                                  "\n\n".join(output['retrieved_contexts']),  # pass the contexts as one string, not a Python list
                                  tokenizer,
                                  llm_eval)

faithfulness_score = sum(verifications) / len(statements)
print("Faithfulness score:", round(faithfulness_score, 3))

Faithfulness score: 1.0


In [None]:
for s, v in zip(statements, verifications):
    print(f"statement: {s}, verdict: {v}")

statement: in policy areas beyond criminal justice, verdict: True
statement: Recent studies have shown increased partisan polarization specifically in education, verdict: True
statement: Research indicates that educational polarization might be due to changing party affiliations, shifting public opinions, and the emergence of special interests, verdict: True
statement: The stability of the current trend suggests that although there has been an increase in polarization, these trends are likely to persist, verdict: True
statement: Current research points to both stable and dynamic aspects of political polarization in education, verdict: True
statement: Understanding the causes of polarization, including changes in party affiliation and shifts in public opinion, is crucial for future research, verdict: True
statement: Ongoing research aims to clarify the mechanisms driving the observed patterns of polarization in education, verdict: True


In [None]:
#@title Answer Relevance
from sklearn.metrics.pairwise import cosine_similarity

def generate_questions_from_answer(answer, tokenizer, llm):
    prompt = f"Generate 3 questions that could have this answer:\nanswer: {answer}"
    messages = [{"role": "user", "content": prompt}]

    # format as chat and tokenize
    encoded_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(encoded_prompt, return_tensors="pt").to(llm.device)

    # generate response
    output = llm.generate(**inputs, max_new_tokens=128)

    # decode only the newly generated tokens so the prompt is not mistaken for a question
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    decoded = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # extract just the questions, dropping bullets and list numbering
    questions = [line.lstrip("0123456789.-• ").strip() for line in decoded.split("\n") if "?" in line]

    return questions


def compute_answer_relevance(original_q, generated_qs, model):
    q_vec = model.encode([original_q])
    gen_vecs = model.encode(generated_qs)
    sims = cosine_similarity(q_vec, gen_vecs)[0]
    return float(sims.mean())

In [None]:
# generate questions from the model's answer (not from the original question)
generated_qs = generate_questions_from_answer(clean_answer, tokenizer, llm_eval)
answer_relevance_score = compute_answer_relevance(output['user_input'], generated_qs, emb_model)
print("Answer Relevance score:", round(answer_relevance_score, 3))

Answer Relevance score: 0.861


In [None]:
generated_qs

['answer: What are the main findings about political polarization in 2024?',
 '1. How did recent events or trends contribute to increased political polarization in 2024?',
 '2. What measures can be taken to reduce political polarization and promote unity among different groups?',
 '3. In what ways has social media played a role in exacerbating or mitigating political polarization in 2024?']
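
The list above contains four entries: one line that echoes the prompt (`answer: ...`) and three numbered questions. Both the echoed line and the leading numbering end up in the embeddings and can shift the score. One possible clean-up step is sketched below; `clean_generated_questions` and its `max_questions` parameter are hypothetical, not part of the notebook.

```python
import re

def clean_generated_questions(lines, max_questions=3):
    """Keep lines that look like questions and strip list markers."""
    cleaned = []
    for line in lines:
        text = line.strip()
        # skip the echoed prompt line, e.g. "answer: What are ...?"
        if text.lower().startswith("answer:"):
            continue
        # remove leading bullets or numbering such as "1.", "2)", "-", "•"
        text = re.sub(r"^\s*(?:[-•*]+|\d+[.)])\s*", "", text)
        if text.endswith("?"):
            cleaned.append(text)
    return cleaned[:max_questions]

raw = [
    "answer: What are the main findings about political polarization in 2024?",
    "1. How did recent events or trends contribute to increased political polarization in 2024?",
    "2. What measures can be taken to reduce political polarization and promote unity among different groups?",
    "3. In what ways has social media played a role in exacerbating or mitigating political polarization in 2024?",
]
questions = clean_generated_questions(raw)
for q in questions:
    print(q)
```

The cleaned list could then be passed to `compute_answer_relevance` in place of the raw one.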

In [None]:
#@title Context Relevance

def extract_relevant_context(context, question, tokenizer, llm):
    prompt = f"""Please extract relevant sentences from the provided context that can help answer the question.
 While extracting candidate sentences you’re not allowed to make any changes to sentences from given context. Print the sentences word-by-word as they are.

question: {question}
context: {context}"""

    # If none, say "Insufficient Information"

    messages = [{"role": "user", "content": prompt}]
    encoded = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(encoded, return_tensors="pt").to(llm.device)

    output = llm.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    answer = tokenizer.decode(output[0], skip_special_tokens=False)

    # Clean everything before the assistant's reply
    assistant_marker = "<|im_start|>assistant\n"
    if assistant_marker in answer:
        answer = answer.split(assistant_marker, 1)[-1].strip()
    else:
        answer = answer.strip()
    return answer

In [None]:
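# join the retrieved chunks into one string so the prompt contains plain text rather than a Python list repr
relevant_sentences = extract_relevant_context("\n".join(output['retrieved_contexts']), output['user_input'], tokenizer, llm_eval)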

In [None]:
# flatten all retrieved contexts into sentences using '.' as a naive sentence splitter
all_sentences = []
for chunk in output['retrieved_contexts']:
    # split by period, remove empty strings, and strip whitespace
    sentences = [s.strip() for s in chunk.split('.') if s.strip()]
    all_sentences.extend(sentences)

clean_relevant_sentences = [s.strip() for s in relevant_sentences.split('.') if s.strip()]

# now compare with number of relevant sentences
relevance_ratio = len(clean_relevant_sentences) / len(all_sentences) if all_sentences else 0
context_relevance_score = round(relevance_ratio, 3)

print("Context Relevance Score (sentence-level):", context_relevance_score)

📊 Context Relevance Score (sentence-level): 0.476
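
The ratio above counts every period-delimited fragment of the model's reply, so any numbering or explanation the model adds is counted as a "relevant sentence" and can inflate the score. A stricter variant, sketched below under stated assumptions, counts only context sentences that reappear verbatim in the reply. `split_sentences` and `context_relevance` are hypothetical helpers, not part of the notebook, and the regex splitter is still naive (it does not repair words hyphenated across line breaks in PDF text).

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", flat) if s.strip()]

def context_relevance(contexts, extracted_text):
    """Share of context sentences that appear verbatim in extracted_text."""
    context_sentences = []
    for chunk in contexts:
        context_sentences.extend(split_sentences(chunk))
    if not context_sentences:
        return 0.0
    extracted_flat = " ".join(extracted_text.split())
    matched = [s for s in context_sentences if s in extracted_flat]
    return len(matched) / len(context_sentences)

# toy data: two of the four context sentences are quoted in the reply
contexts = [
    "Institutions shape behavior. Cats are mammals.\n"
    "Voters accept Parliament. Rain is wet."
]
extracted = (
    '1. **"Institutions shape behavior."**\n'
    "   - Explanation: this is relevant.\n"
    '2. **"Voters accept Parliament."**'
)
score = context_relevance(contexts, extracted)
print("Context relevance (verbatim match):", round(score, 3))
```

Because the explanation line does not occur in the context, it no longer contributes to the numerator.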


In [None]:
relevant_sentences

'The extracted sentences related to "What are political institutions?" are:\n\n1. **"Individuals are governed by institutions they accept and support."**\n   - Explanation: This sentence directly addresses the concept of political institutions being accepted and supported by individuals within a political community.\n\n2. **"Political institutions play an important role in shaping political behavior and outcomes."**\n   - Explanation: This statement highlights the significance of political institutions in influencing political behaviors and results, which is crucial information about what political institutions entail.\n\n3. **"While it is unlikely in modern societies that a group would share a commitment to enact certain laws or to promote shared ethical or cultural values,"**\n   - Explanation: This sentence introduces the idea that despite potential disagreement, there is often a common commitment to institutions such as Parliaments or constitutions, indicating the presence of polit

In [None]:
output['retrieved_contexts']

['vidual members of that group. Instead, individuals are governed by institutions they accept and\nsupport. Of course, since all political communities feature deep disagreements, one might ques-\ntion whether groups can in fact share common political commitments. But while it is unlikely in\nmodern societies that a group would share a commitment to enact certain laws or to promote\nshared ethical or cultural values, it is more common that most members of a political community\nwill share a commitment to associate together in institutions that they accept as a legitimate way\nto define and enforce justice among themselves (e.g., to accept Parliament or to accept the Consti-\ntution as a source of valid law). For that reason, where feasible and consistent with the minimally\njust performance of state functions, groups with common political commitments ought to be al-\nlowed to govern themselves. This arrangement promotes the political autonomy of those groups’\nmembers.\nEXCLUSIONARY CLA