<center>
  <img src="https://python.langchain.com/assets/images/rag_concepts-4499b260d1053838a3e361fb54f376ec.png"
       width="640" alt="RAG concepts">
  <div><small><a href="https://python.langchain.com/docs/concepts/rag">Source</a></small></div>
</center>

# Building a minimal Retrieval-Augmented Generation pipeline

In this tutorial, you will build a simple Retrieval-Augmented Generation (RAG) pipeline using the [ETH Zurich Degree Programmes PDF](https://ethz.ch/content/dam/ethz/main/education/bachelor/studiengaenge/files/ETH-Zurich-Degree-programmes.pdf) as the corpus. We first show why plain LLM queries about this document do not always work, and then set up the RAG pipeline step by step.

![](https://i.ibb.co/nsJTYh6j/LLM-azure-text-clean.png)

We will call the LLM via an API (Application Programming Interface) — a defined interface that allows two programs to interact (here, the notebook code and the LLM service). 

## Preparations
As a first step, we need to install a few libraries:

In [None]:
%pip install openai

In [None]:
%pip install pypdf

In [None]:
%pip install langchain-text-splitters

In [None]:
%pip install sentence_transformers

In [None]:
%pip install -U ipywidgets  # may need to update ipywidgets

In [None]:
%pip install faiss-cpu

In [None]:
%pip install docling

In [None]:
%pip install accelerate

We recommend you restart the kernel so that the newly installed packages will be available.

## Launching models using AzureOpenAI

Here, we will use AzureOpenAI. The Azure API key is hard-coded below. Note that we will disable this key after the block, so if you want to run this notebook afterwards, you have to set a different value for `azure_key`; otherwise you will get an error and will not receive responses from the language model.

In [None]:
import os
from openai import AzureOpenAI

# Technical set-up
azure_key = "986IfxLKwN3Paiq4yx1Kn2iTG7FyG2GxFg17qQSyr1KZqGLaizAGJQQJ99BCACI8hq2XJ3w3AAABACOGQfvw"

endpoint = os.getenv("ENDPOINT_URL", "https://cas-dml-llm.openai.azure.com/")
deployment = os.getenv("DEPLOYMENT_NAME", "gpt-35-turbo")
subscription_key = os.getenv("AZURE_OPENAI_API_KEY", azure_key)

# Initialize AzureOpenAI Service client with key-based authentication
client = AzureOpenAI(
    azure_endpoint=endpoint,
    api_key=subscription_key,
    api_version="2024-05-01-preview",
)
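
Since a hard-coded key stops working once it is disabled, one option is to read the key from an environment variable and fail early with a clear message. This is a minimal sketch, assuming the variable name `AZURE_OPENAI_API_KEY` used above; `require_env` is a hypothetical helper and not part of the `openai` package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Set it before creating the client."
        )
    return value


# Usage (uncomment once the variable is set in your session):
# subscription_key = require_env("AZURE_OPENAI_API_KEY")
```

Failing at this point gives a more readable error than an authentication failure on the first request.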

We initialized the AzureOpenAI client and now want to send queries to the model. The system message sets the assistant’s behavior (answer in English, be helpful), and the user message contains our actual question.

In [None]:
# Core function to interact with the LLM over the API
def get_ai_response(query):
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Answer questions in English."},  # set the behavior
        {"role": "user", "content": query}  # our question
    ]

    response = client.chat.completions.create(
        model=deployment,  # deployment name in Azure
        messages=messages,
        temperature=0,  # deterministic answer
    )

    return response.choices[0].message.content  # extract the model's reply; `choices` contains the candidate answers

Now, let us check the model:

In [None]:
# Example query
query = "What is RAG in AI?"
answer = get_ai_response(query)
print(answer)

## Why do we need RAG?

The test answer looks reasonable. Now, let's try asking something more local and recent. The question we want to ask is: “How many ETH Zurich spin-off companies were founded in 2024?”.

In [None]:
query = "How many ETH Zurich spin-off companies were founded in 2024?"
answer = get_ai_response(query)
print(answer)

**Why did this happen?** The model doesn't have up-to-date or document-specific knowledge. Remember that the LLM’s knowledge comes from a fixed training snapshot. Therefore, if a fact is niche or very recent, the model likely didn’t see it during training.

So, let us try to enrich the prompt with relevant information extracted from the file about ETH Zurich so the model can answer based on facts, not guesswork.

## What is RAG (briefly)

* **R**etrieval — fetch relevant chunks from an external corpus (e.g., a PDF).

* **A**ugmented **G**eneration — inject these chunks into the prompt so the LLM relies on facts.

![](https://i.ibb.co/qMghzdwY/image.png)

### Main steps:

1. Extract text from the PDF → split it into chunks.

2. Build vector representations of the chunks using a neural model.

3. Create an index (a vector database).

4. For each question → embed the question → search for the top-k most similar chunks → build a prompt like:

```
Context:
<chunk1>
<chunk2>

Question: <question>

Instructions: answer using only the context; if the answer is not present — say so.
```

5. Send this prompt to the model via AzureOpenAI.
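
The prompt layout from step 4 can be sketched as a small function. This is an illustration only: `build_rag_prompt` is a hypothetical name, and the chunks in the example are made up to show the resulting layout.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the Context / Question / Instructions prompt from retrieved chunks."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Instructions: answer using only the context; "
        "if the answer is not present, say so."
    )


# Made-up chunks, used only to show the layout of the prompt
print(build_rag_prompt(
    "What is FAISS?",
    ["FAISS is a similarity-search library.", "It stores and searches vectors."],
))
```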


## Prepare the data

### Step 1. Install dependencies and download the PDF

The document we'll use as the basis for RAG is already available on RenkuLab.

If you need to download a file from a public source on the internet, you can use the code below (changing the file names as necessary).

In [None]:
# download the document by link

# !wget --no-check-certificate \
#      --header="User-Agent: Mozilla/5.0 (X11; Linux x86_64)" \
#      -O ETH-Zurich-Degree-programmes.pdf \
#      "https://ethz.ch/content/dam/ethz/main/education/bachelor/studiengaenge/files/ETH-Zurich-Degree-programmes.pdf"

### Step 2. Read the PDF and extract text

Next, we open the PDF file with pypdf, a Python library for reading and manipulating PDFs (extracting text, merging, splitting, rotating pages, etc.).

In [None]:
from pypdf import PdfReader
from pathlib import Path

FILE_PATH = Path("ETH-Zurich-Degree-programmes.pdf")
reader = PdfReader(FILE_PATH)
number_of_pages = len(reader.pages)

# Concatenate the text of all pages; the newline keeps words at page
# boundaries from running together
entire_text = ""
for page in reader.pages:
    entire_text += page.extract_text() + "\n"

# Let us have a look at the text
entire_text[:200]

Hidden or template layers often get mixed into the plain text. For instance, the extracted text contains "Spitztitel Lorem Ipsum dolor sit amet": "Spitztitel" is German for a running title, and the Lorem Ipsum is placeholder text left in a master-page layer. Text extractors still pick it up.

### Step 3. Split text into chunks

Now we’ll split the text into **chunks** — small, self-contained pieces that the model can search.

We’ll use ```RecursiveCharacterTextSplitter```, which cuts on natural boundaries (paragraphs → lines → words) and keeps a small overlap to preserve context.

Note: depending on your data, you may wish to pick another splitter: Markdown-aware (```MarkdownHeaderTextSplitter```), token-based (```TokenTextSplitter```) for strict context budgets, or language-aware for code.

In [None]:
# LangChain is an open-source framework for building LLM apps from modular blocks
# We are going to only take the splitter from it

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

text_chunks = text_splitter.split_text(entire_text)
print(f"Total chunks: {len(text_chunks)}")

```chunk_size``` is the maximum length of each chunk (counted in characters by default for this splitter, in tokens for token-based splitters), and ```chunk_overlap``` is how much content is repeated between adjacent chunks (e.g., 10–25%) so that information spanning a boundary isn’t lost. With the values above, the overlap is 50 / 500 = 10%.
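
To make the two parameters concrete, here is a deliberately naive fixed-window splitter. It is a simplified sketch and not how ```RecursiveCharacterTextSplitter``` works internally (that splitter prefers natural separators); `sliding_window_chunks` is a hypothetical helper that counts characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is exactly what the overlap is meant to achieve.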

**How to choose the chunk size and overlap?** A typical answer should fit in one chunk (sometimes two) without dragging along lots of extra text. Start with defaults, then test 5–10 real queries: if the relevant passage isn’t in the top-k, increase chunk size/overlap; if the context feels bloated, decrease them.

**Larger** ```chunk_size``` generally increases recall (the chance that relevant content appears in the top-k) because more context stays together; but it also adds noise, slows search, and reduces diversity. **Smaller** ```chunk_size``` improves precision and speed but can split facts across chunks (mitigate with 10–20% overlap or a higher k).

Use ```chunk_overlap``` to protect **boundary info**: a moderate 10–20% overlap keeps cross-boundary details; higher (20–30%, e.g., for code/step-by-step docs) improves recall but bloats the index and yields near-duplicates; lower (0–10%) is lighter and faster but may miss boundary context.
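
When testing a handful of real queries, it can help to count how often the expected passage shows up in the retrieved chunks. The sketch below is one simple way to do that, assuming you know a short snippet that the right chunk must contain; `hit_rate` is a hypothetical helper and the data in the example is made up.

```python
def hit_rate(retrieved_per_query: list[list[str]], expected_snippets: list[str]) -> float:
    """Fraction of queries whose expected snippet appears in at least one retrieved chunk."""
    if len(retrieved_per_query) != len(expected_snippets):
        raise ValueError("Need exactly one expected snippet per query")
    if not expected_snippets:
        return 0.0
    hits = sum(
        any(snippet.lower() in chunk.lower() for chunk in chunks)
        for chunks, snippet in zip(retrieved_per_query, expected_snippets)
    )
    return hits / len(expected_snippets)


# Two made-up queries: the first retrieval contains its snippet, the second does not
retrieved = [["The sky is blue today."], ["Grass is green."]]
expected = ["blue", "red"]
print(hit_rate(retrieved, expected))
# 0.5
```

A low hit rate points at retrieval (chunking, encoder, k) rather than at the language model.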

Let us have a look at the first chunks:

In [None]:
text_chunks[:2]

## Build the vector dataset

## How do we compare strings (what are embeddings)?

Now we have a list of chunks (context passages).

**What do we want?** Given a question, we want to automatically retrieve the most relevant chunks and pass them to the model.

**What do we need?** To do that, we need a way to measure the semantic similarity (closeness) between the question and each chunk.

**How to compare the closeness?** Raw strings are hard to compare “by meaning.” So we turn each chunk into a vector of numbers (an **embedding**). Embeddings have a useful property: texts with similar meaning → nearby vectors (high cosine similarity). That lets us search by meaning, not exact words.

**Why do embeddings have this property, and what is an encoder?** An encoder is a pretrained model that maps text → vector. It’s trained so that semantically similar texts land close together in vector space. Important: what counts as “similar” depends on the task, so different tasks may need different encoders.

**How to pick an encoder?** There are [lots of encoders](https://huggingface.co/spaces/mteb/leaderboard)! Choose based on your task and technical needs.

**What will we do next?** In practice, we convert text to embeddings (vectors), compare them (e.g., with cosine similarity), and send the top-k chunks to the model.
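
Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on made-up 3-dimensional vectors; real encoders produce vectors with hundreds of dimensions, and the variable names are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for a question and two chunks
question = [0.9, 0.1, 0.0]
chunk_same_topic = [0.8, 0.2, 0.1]
chunk_other_topic = [0.0, 0.1, 0.9]

print(cosine_similarity(question, chunk_same_topic))   # close to 1
print(cosine_similarity(question, chunk_other_topic))  # close to 0
```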

<center>
  <img src="https://arize.com/wp-content/uploads/2022/06/blog-king-queen-embeddings.jpg"
       width="640" alt="King and queen word embeddings">
  <div><small><a href="https://arize.com/blog-course/embeddings-meaning-examples-and-how-to-compute/">Source</a></small></div>
</center>

### Step 1. Chunks to embeddings (vectors)

In [None]:
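# Technical: disable tokenizer parallelism.
# Keep comments on their own line: %env takes the rest of the line as the value.
%env TOKENIZERS_PARALLELISM=false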

In [None]:
from sentence_transformers import SentenceTransformer

# Download a model to create vector representations of text
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode text chunks into embeddings (one vector per chunk)
embeddings = model.encode(text_chunks, batch_size=64, show_progress_bar=True)

### Step 2. Make a vector database

To store embeddings and quickly search for the nearest ones, we use [FAISS](https://github.com/facebookresearch/faiss) (Facebook AI Similarity Search), a fast library for finding nearest vectors. A **FAISS index** is both a container for your embedding vectors and the search method that makes lookups fast.

Let's make a FAISS index.

In [None]:
# Create a FAISS index for efficient similarity search

import faiss
embeddings = embeddings.astype("float32")

# Cosine-similarity trick: after L2 normalization, the inner product equals cosine similarity
faiss.normalize_L2(embeddings)

# Build an exact inner-product index
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
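
Conceptually, an exact inner-product index scores the query against every stored vector and keeps the best k. The pure-Python sketch below mimics that idea on made-up vectors; it is not how FAISS is implemented (FAISS uses optimized native code), and `l2_normalize` and `exact_top_k` are hypothetical helpers.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("Cannot normalize a zero vector")
    return [x / norm for x in vector]


def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[float, int]]:
    """Score every stored vector against the query; return the k best (score, index) pairs."""
    q = l2_normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, l2_normalize(v))), i)
        for i, v in enumerate(vectors)
    ]
    # Highest inner product first; ties are broken by the lower index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Made-up 2-dimensional vectors standing in for chunk embeddings
stored = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
print(exact_top_k([1.0, 0.0], stored, k=2))
```

Because of the normalization, the vector `[2.0, 0.0]` scores the same as `[1.0, 0.0]` would: only the direction matters, not the length.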

## Getting the answer

Now for the final step. For each question → embed the question → search for the top-k most similar chunks → build a prompt like:

```
Context:
<chunk1>
<chunk2>

Question: <question>

Instructions: answer using only the context; if the answer is not present, say so.
```

### Step 1. Embed the question

In [None]:
query = "How many ETH Zurich spin-off companies were founded in 2024?"

# reminder: model is an embedding model we initialized before
query_embedding = model.encode([query], normalize_embeddings=True)

### Step 2. Search for the top-k most similar chunks

In [None]:
# reminder: index is FAISS index (vector database) we have made before
D, I = index.search(query_embedding, k=2)  # search for top-2 closest chunks

retrieved_chunks = [text_chunks[i] for i in I[0]]  # I has a (n_queries, k) shape; we have 1 query
retrieved_chunks
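
To judge how close each retrieved chunk is, the similarity scores can be shown next to the chunk text. The sketch below uses made-up values to stay self-contained; `pair_hits` is a hypothetical helper. It also skips indices equal to -1, which FAISS returns as placeholders when fewer than k results are available.

```python
def pair_hits(scores, indices, chunks):
    """Return (score, chunk) pairs, skipping -1 placeholder indices."""
    return [
        (float(score), chunks[int(idx)])
        for score, idx in zip(scores, indices)
        if int(idx) >= 0
    ]


# In the notebook this would be called as pair_hits(D[0], I[0], text_chunks).
# Made-up values below keep the example self-contained.
demo_chunks = ["chunk A", "chunk B", "chunk C"]
for score, chunk in pair_hits([0.71, 0.55, -1.0], [2, 0, -1], demo_chunks):
    print(f"{score:.2f}  {chunk}")
```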

**The chunk we need.** The right chunk was retrieved, but its content is low-quality. We need the following fragment, but it was corrupted during text extraction:
>Switzerland created this \n place of innovation and \nknowledge\nspin-off companies \nfounded in 2024 \nof which CHF 1.42bn

**Why did this happen?** The PDF looks fine to a human, but the extracted text does not contain the number 37. Let us have a look at the PDF fragment we need:

<center>
  <img src="https://i.ibb.co/pv9j6FzJ/image.png"
       width="640" alt="PDF fragment">
</center>



### Step 3. Get the answer

Finally, we’ll generate an answer by injecting the retrieved chunks into the model prompt. First, let’s adapt the AI interaction function so that it accepts a context:

In [None]:
def get_ai_response_with_context(query, context):
    messages = [
        {
            "role": "system",
            "content": "You are an assistant that answers questions based on the provided information.",
        },  # set the behavior
        {
            "role": "user",
            "content": f"Question: {query}\n\nContext:\n{context}",
        },  # our question plus the retrieved context
    ]

    response = client.chat.completions.create(
        model=deployment,  # deployment name in Azure
        messages=messages,
        temperature=0,  # deterministic answer
    )

    return response.choices[0].message.content  # extract the model's reply; `choices` contains the candidate answers

Now, we can get an answer:

In [None]:
query = "How many ETH Zurich spin-off companies were founded in 2024?"
context = "\n\n".join(retrieved_chunks)

answer = get_ai_response_with_context(query, context)
print(answer)

As we have seen, the extracted text does not contain the number we need, so the model cannot answer reliably. The root cause is data preparation. Let's fix it.

## Another try: data preparation using Docling

### Step 1. Data preparation using Docling

Our goal is to extract the text from the PDF correctly. We use the [Docling](https://github.com/docling-project/docling?tab=readme-ov-file) library to convert the PDF into structured Markdown. Unlike simple text extraction, Docling combines existing text with OCR (text from scanned pages) and preserves basic structure such as headings, paragraphs, and tables.

This is a pretty compute-intensive task, so it will take some time.

In [None]:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("ETH-Zurich-Degree-programmes.pdf")

# the below code can be used to work with pdf files that are available on the internet:
# source = "https://ethz.ch/content/dam/ethz/main/education/bachelor/studiengaenge/files/ETH-Zurich-Degree-programmes.pdf"  # document per local path or URL
# converter = DocumentConverter()
# result = converter.convert(source)

In [None]:
md = result.document.export_to_markdown()
print(md[:500])

As we can see, the result is much better. Next, let us clean the text.

### Step 2. Clean the extracted text

The text still contains noise: invisible placeholders, fake headings, and formatting artifacts. This function cleans the text by removing image markers, “lorem ipsum” placeholders, and page artifacts, while also fixing line breaks and hyphenated words.

In [None]:
import re

def clean_docling_markdown(text: str) -> str:
    # remove comments <!-- image -->
    text = re.sub(r"<!--\s*image\s*-->\s*", "", text, flags=re.I)

    # remove obvious placeholder strings
    placeholders = [
        r"lorem ipsum.*",   # lorem ipsum ...
        r"spitztitel",      # layout headline
        r"dummy",           # the word "dummy"
        r"platzhalter",     # "placeholder" in German
        r"placeholder",     # "placeholder" in English
    ]
    text = re.sub("(?mi)^(" + "|".join(placeholders) + r")\s*$", "", text)

    # normalize line breaks and spaces
    text = text.replace("\r", "")
    text = re.sub(r"[ \t]+\n", "\n", text)              # remove trailing spaces before newline
    text = re.sub(r"\n{3,}", "\n\n", text)              # maximum two consecutive newlines

    # merge word breaks across line breaks: "Spin-\noff" -> "Spinoff"
    text = re.sub(r"(\w)[\-–]\n(\w)", r"\1\2", text)

    # remove very short single-line "placeholder headings"
    text = re.sub(r"(?m)^\s*[A-ZÄÖÜ][A-Za-zÄÖÜäöüß]{1,12}\s*$", "", text)

    return text.strip()

clean_md = clean_docling_markdown(md)
clean_md[:1000]

We don’t remove the page numbers here because the PDF’s layout makes it easy to accidentally delete other content. Normally, you might remove them to reduce noise.
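
To see what individual cleaning rules do, it can help to run them on a tiny string. The sketch below repeats three of the regular expressions from `clean_docling_markdown` on an invented sample; the sample text is not taken from the PDF.

```python
import re

# Invented sample with a word broken across lines, trailing spaces, and extra blank lines
sample = "Spin-\noff companies\n\n\n\nare listed here.  \nEnd"

merged = re.sub(r"(\w)[\-–]\n(\w)", r"\1\2", sample)  # join words split across a line break
trimmed = re.sub(r"[ \t]+\n", "\n", merged)           # drop trailing spaces before a newline
collapsed = re.sub(r"\n{3,}", "\n\n", trimmed)        # keep at most two consecutive newlines

print(repr(collapsed))
# 'Spinoff companies\n\nare listed here.\nEnd'
```

Note that the merge rule also removes the hyphen from genuine compounds such as "spin-off" when they happen to break across lines, which can matter if you later search for the exact spelling.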

### Step 3. Split the text into chunks

Now we just repeat the steps from before, this time on the cleaned Markdown.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_text(clean_md)

print(f"Total chunks: {len(chunks)}")
print(f"Example chunk: \n {chunks[0]}")

Looks nice, let us repeat the vector dataset preparation.

### Step 4. Encode chunks and create FAISS index

In [None]:
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(chunks, batch_size=64, show_progress_bar=True)

embeddings = embeddings.astype("float32")
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

### Step 5. Retrieve top chunks for a query

In [None]:
query = "How many ETH Zurich spin-off companies were founded in 2024?"
query_embedding = model.encode([query], normalize_embeddings=True)
D, I = index.search(query_embedding, k=2)  # get top-2 closest chunks

retrieved_chunks = [chunks[i] for i in I[0]]
retrieved_chunks

### Step 6. Ask the LLM with context

In [None]:
query = "How many ETH Zurich spin-off companies were founded in 2024?"
context = "\n\n".join(retrieved_chunks)

answer = get_ai_response_with_context(query, context)
print(answer)

## Assignments

### Extend a Query to a RAG Query

You want to know how many patent applications were reported at ETH Zurich in 2024. You ask the LLM:

In [None]:
query = "How many patent applications were reported at ETH Zurich in 2024?"
answer = get_ai_response(query)
print(answer)

**Task 1:** Fix this, using the same PDF about degree programmes at ETH Zurich. In particular, you should:
* get a query embedding
* find the chunks closest to the query (in embedding space)
* build a list of the retrieved chunks, and add this to the context
* get an AI response for a query with the constructed context.

Note that all the necessary function calls are already available above — so you just need to find the relevant code and put it together.

### Impact of `chunk_size` and `chunk_overlap` parameters
To get an intuition for the impact of the two parameters, you will experiment with the ```chunk_size``` and ```chunk_overlap``` parameters.

**Task 2:**

(a) Write a function that takes a cleaned Markdown string (we already have it!), chunk_size, chunk_overlap, and a query. It should return the retrieved chunks and the model’s answer. The function should do the following:

1. Split the cleaned Markdown into overlapping text chunks.

2. Encode the chunks with a SentenceTransformer.

3. Build a FAISS inner-product index (cosine similarity on L2-normalized vectors).

4. Retrieve the top-k chunks most similar to the query (use a reasonable default, e.g., k = 2).

5. Call an LLM helper to produce an answer from the query and the concatenated context.

Again, note that all relevant parts are already given above; you just have to select and combine them.

(b) Examine the function’s behavior for different chunk sizes and overlaps. Describe your findings.

You may want to print or inspect the retrieved chunks to see how they affect the final answer.

**For your experiments, here are sample questions with their answers (based on the document)**:

1. How many ETH Zurich spin-off companies were founded in 2024? — 37

2. At what semester will ETH Zurich change its academic calendar? — Autumn semester 2027

3. How many patent applications were reported at ETH Zurich in 2024? — More than 100

4. What percentage of Master’s graduates at ETH Zurich have a job one year after graduation? — 97%

5. After Autumn Semester 2027, what will be the maximum permitted duration of studies? — 6 years
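
For the experiments in Task 2 (b), it may be convenient to generate the parameter combinations systematically rather than by hand. The sketch below is one possible way, assuming the overlap is expressed as a fraction of the chunk size; `parameter_grid` is a hypothetical helper and the values are only examples.

```python
from itertools import product


def parameter_grid(chunk_sizes, overlap_fractions):
    """Yield (chunk_size, chunk_overlap) pairs, with overlap given as a fraction of the size."""
    for size, fraction in product(chunk_sizes, overlap_fractions):
        if not 0 <= fraction < 1:
            raise ValueError("overlap fraction must be in [0, 1)")
        yield size, round(size * fraction)


# Example values; pick sizes and fractions that suit your document
for chunk_size, chunk_overlap in parameter_grid([200, 500, 1000], [0.0, 0.1, 0.2]):
    print(chunk_size, chunk_overlap)
```

Each pair can then be passed to the function you write in part (a), together with the sample questions above.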