<a href="https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic3/3.1_the_concept_of_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Engineering Essentials by Nebius Academy

Course github: [link](https://github.com/Nebius-Academy/LLM-Engineering-Essentials/tree/main)

The course is in development now, with more materials coming soon.

# LLMOps Essentials 3.1. The concept of RAG

Whether you're creating an NPC is a fictional universe or a legal assistant bot, you'll likely be concerned about factuality. You don't want the NPC to misname the land's ruler or the legal assistant to cite a non-existent law. Unfortunately, LLMs on their own are far from being omniscient and, moreover, are prone to hallucinations. In this notebook we'll discuss factuality problems in more details before getting started with the remedy - **Retrieval Augmented Generation**.

## Getting ready

In [1]:
!pip install -q openai

In [2]:
import os

with open("nebius_api_key", "r") as file:
    nebius_api_key = file.read().strip()

os.environ["NEBIUS_API_KEY"] = nebius_api_key

We'll be calling APIs quite often in this notebook, so let's define a shortcut fuction to avoid repeating all the code:

In [6]:
from openai import OpenAI

nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

llama_8b_model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def prettify_string(text, max_line_length=80):
    """Prints a string with line breaks at spaces to prevent horizontal scrolling.

    Args:
        text: The string to print.
        max_line_length: The maximum length of each line.
    """

    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)

def answer_with_llm(prompt: str,
                    system_prompt="You are a helpful assistant",
                    max_tokens=512,
                    client=nebius_client,
                    model=llama_8b_model,
                    prettify=True,
                    temperature=None) -> str:

    messages = []

    if system_prompt:
        messages.append(
            {
                "role": "system",
                "content": system_prompt
            }
        )

    messages.append(
        {
            "role": "user",
            "content": prompt
        }
    )

    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )

    if prettify:
        return prettify_string(completion.choices[0].message.content)
    else:
        return completion.choices[0].message.content

# The pitfalls of factuality

## 1. Hallucinations

Let's ask Llama-3.1-70B a question about the [Pathfinder](https://en.wikipedia.org/wiki/Pathfinder_Roleplaying_Game) universe.

In [None]:
result = answer_with_llm(
    prompt="Which deity in the Pathfinder universe has a servant called Peace Through Vigilance?",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )

print(result)

I'm not familiar with a specific deity in the Pathfinder universe that has a
servant called "Peace Through Vigilance." Could you provide more context or
information about where you encountered this name? I'd be happy to try and help
you find the answer.


In [None]:
result = answer_with_llm(
    prompt="Which deity in the Pathfinder universe has a servant called Peace Through Vigilance?",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )

print(result)

According to the Pathfinder universe, the deity Iomedae has a servant called
Peace Through Vigilance.


In [None]:
result = answer_with_llm(
    prompt="Which deity in the Pathfinder universe has a servant called Peace Through Vigilance?",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )

print(result)

In the Pathfinder universe, the deity who has a servant called "Peace Through
Vigilance" is Sarenrae.


The right answer is [Iomedae](https://pathfinder.fandom.com/wiki/Iomedae). If you run the above cell several times, Llama-3.1-70B will sometimes give the right answer and sometimes it won't. This is a case of **extrinsic hallucination**. Of course, with more widespread knowledge, the likelihood of a mistake would be lower.

**Note**. Much also depends on what was in the training data. For example, it is likely that Llama-3.1 models were trained on the [Pathfinder Fandom wiki](https://pathfinder.fandom.com/) but not on [this wiki](https://pathfinderwiki.com/). Here is the evidence in favor of this hypothesis:, the deity [Casandalee](https://pathfinderwiki.com/wiki/Casandalee) is mentioned in the latter but not in the former, and Llamas-3.1 have no idea who she is. At the same time, GPT-4o is perfectly aware of Casandalee, so it's probably trained on the second wiki.

## Knowledge cut-off

Let's ask Llama about something which happened very recently, so that the information just couldn't get into the training dataset.

In [None]:
result = answer_with_llm(
    prompt="How DeepSeek R1 was trained?",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )

print(result)

DeepSeek is an AI-powered image analysis platform, and its R1 model is a deep
learning-based model trained on a large dataset of images. Unfortunately, I
couldn't find a detailed description of the exact training process used for
DeepSeek R1. However, I can provide some general information on how deep
learning models like DeepSeek R1 are typically trained.

DeepSeek R1 is likely a convolutional neural network (CNN) based model, which
is commonly used for image analysis tasks. The training process for such models
typically involves the following steps:

1. **Data collection**: A large dataset of images is collected, which includes
images of various types, such as microscopy images, medical images, or other
types of images that DeepSeek is designed to analyze.
2. **Data preprocessing**: The collected images are preprocessed to enhance
image quality, remove noise, and normalize the images.
3. **Data augmentation**: The preprocessed images are augmented to increase the
size of the dataset 

For Llama-3.1, the pretraining data has a cutoff of December 2023. So, it knows nothing of DeepSeek R1 which emerged in February 2025.

## Long context (non-)proficiency

Most of today's LLMs have generous context length meaning that theoretically you can put lots of information into a prompt. But can LLMs really use it and extract meaningful facts and connections from it? Not really.

Let's look at a simple example - Q&A about 2024 Summer Olympics. Llama-3.1 was trained on data that predated the event, and it's quite obvious from how it answers the following question:

In [None]:
result = answer_with_llm(
    prompt="What do you know about the 2024 Summer Olympics?",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )

print(result)

The 2024 Summer Olympics, officially known as the Games of the XXXIII Olympiad,
are scheduled to take place from July 26 to August 11, 2024, in Paris, France.
Here's what I know about the upcoming event:

**Host City:** Paris, France (the City of Light will be hosting the Olympics
for the third time, after 1900 and 1924)

**Dates:** July 26 to August 11, 2024

**Participating Countries:** Over 200 countries are expected to participate in
the Games.

**Sports:** The International Olympic Committee (IOC) has confirmed that the
2024 Olympics will feature 32 sports, including:

1. New sports:
* Breakdancing (making its Olympic debut)
* Skateboarding (returning after its debut in 2020)
* Sport Climbing (returning after its debut in 2020)
* Surfing (returning after its debut in 2020)
2. Returning sports:
* Baseball and Softball (returning after a 12-year absence)
* Karate (returning after its debut in 2020)

**Competition Venues:** The Olympics will be held across various venues in
Paris and

Notice the future tense: Llama-3.1 still lives in 2023.

To make a Q&A tool, we'll gather data about 2024 Olympics from Wikipedia and try to include all this information into the context.

In [None]:
!pip install -q requests beautifulsoup4 tqdm

In [None]:
import requests
from bs4 import BeautifulSoup
import time
import json
from tqdm import tqdm
from typing import List, Dict, Set, Optional

class Olympics2024WikiScraper:
    def __init__(self):
        self.base_url = "https://en.wikipedia.org/w/api.php"
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Olympics2024Scraper/1.0 (Research purposes)'
        })

    def get_category_members(self, category: str) -> Set[str]:
        """
        Get all pages in a category and its subcategories
        """
        pages = set()
        categories_to_process = {category}
        processed_categories = set()

        with tqdm(desc="Fetching category pages") as pbar:
            while categories_to_process:
                current_category = categories_to_process.pop()
                if current_category in processed_categories:
                    continue

                params = {
                    'action': 'query',
                    'format': 'json',
                    'list': 'categorymembers',
                    'cmtitle': current_category,
                    'cmlimit': 500,
                    'cmtype': 'page|subcat'
                }

                while True:
                    try:
                        response = self.session.get(self.base_url, params=params)
                        data = response.json()

                        if 'query' in data and 'categorymembers' in data['query']:
                            for member in data['query']['categorymembers']:
                                if member['ns'] == 0:  # Regular article
                                    pages.add(member['title'])
                                elif member['ns'] == 14:  # Subcategory
                                    categories_to_process.add(member['title'])
                                pbar.update(1)

                        if 'continue' not in data:
                            break

                        params.update(data['continue'])
                        time.sleep(1)  # Be nice to Wikipedia's servers

                    except Exception as e:
                        print(f"Error processing category {current_category}: {str(e)}")
                        break

                processed_categories.add(current_category)

        return pages

    def parse_table(self, table) -> str:
        """
        Convert a HTML table to text format
        """
        rows = []

        # Get headers
        headers = []
        for th in table.find_all('th'):
            headers.append(th.get_text(strip=True))
        if headers:
            rows.append(" | ".join(headers))
            rows.append("-" * len(rows[0]))  # Add separator line

        # Get data rows
        for tr in table.find_all('tr'):
            cells = []
            # Include both th and td as some tables use th in rows
            for cell in tr.find_all(['td', 'th']):
                # Clean up the cell text
                text = cell.get_text(strip=True)
                text = ' '.join(text.split())  # Normalize whitespace
                cells.append(text)
            if cells and cells != headers:  # Avoid duplicate header rows
                rows.append(" | ".join(cells))

        return "\n".join(rows)

    def get_page_content(self, title: str) -> Optional[Dict]:
        """
        Get the content of a Wikipedia page using the API, including tables
        """
        params = {
            'action': 'parse',
            'format': 'json',
            'page': title,
            'prop': 'text|info',
            'formatversion': '2'
        }

        try:
            response = self.session.get(self.base_url, params=params)
            data = response.json()

            if 'parse' in data:
                page = data['parse']
                html = page['text']
                soup = BeautifulSoup(html, 'html.parser')

                # Process tables first
                table_texts = []
                for table in soup.find_all('table', class_='wikitable'):
                    table_text = self.parse_table(table)
                    if table_text:
                        table_texts.append(table_text)

                # Remove unwanted elements
                for element in soup.find_all(['script', 'style', 'sup', 'ref']):
                    element.decompose()

                # Get main text
                for table in soup.find_all('table'):
                    table.decompose()  # Remove tables after parsing them

                main_text = soup.get_text(separator=' ')
                main_text = ' '.join(main_text.split())  # Normalize whitespace

                # Combine main text with table texts
                full_content = main_text
                if table_texts:
                    full_content += "\n\nTables:\n\n" + "\n\n".join(table_texts)

                return {
                    'title': title,
                    'content': full_content,
                    'url': f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}"
                }

        except Exception as e:
            print(f"Error processing page {title}: {str(e)}")

        return None

    def scrape_olympics_articles(self, output_jsonl: str = 'olympics_2024.jsonl',
                               output_text: str = 'olympics_2024.txt'):
        """
        Scrape all articles related to 2024 Olympics and save them
        """
        # Main category for 2024 Summer Olympics
        main_category = "Category:2024 Summer Olympics"

        print("Starting to scrape 2024 Olympics articles...")

        # Get all relevant pages
        pages = self.get_category_members(main_category)
        print(f"Found {len(pages)} pages to process")

        # Process each page and save to JSONL
        processed_articles = []
        with open(output_jsonl, 'w', encoding='utf-8') as f_jsonl:
            for title in tqdm(pages, desc="Processing articles"):
                article = self.get_page_content(title)
                if article and len(article['content'].strip()) > 100:  # Skip very short articles
                    f_jsonl.write(json.dumps(article, ensure_ascii=False) + '\n')
                    f_jsonl.flush()
                    processed_articles.append(article)
                time.sleep(1)  # Rate limiting

        # Create plain text version
        print(f"Creating text file {output_text}...")
        with open(output_text, 'w', encoding='utf-8') as f_text:
            for article in processed_articles:
                f_text.write(article['content'])
                f_text.write('\n\n')  # Add extra newline between articles


The following code will run forever, so to save yourself from the ordeal, you can just skip the following cell and load it from google drive.

In [None]:
scraper = Olympics2024WikiScraper()
scraper.scrape_olympics_articles()

Starting to scrape 2024 Olympics articles...


Fetching category pages: 13158it [00:16, 775.99it/s]


Found 10359 pages to process


Processing articles: 100%|██████████| 10359/10359 [3:23:03<00:00,  1.18s/it]


Creating text file olympics_2024.txt...


In [3]:
# Download the data without re-scraping it
!gdown 1Q-_hrvW_NcEZiCReG5ZCqUL0SmOas6Yv

Downloading...
From: https://drive.google.com/uc?id=1Q-_hrvW_NcEZiCReG5ZCqUL0SmOas6Yv
To: /content/olympics_2024.jsonl
100% 67.0M/67.0M [00:03<00:00, 18.9MB/s]


In [4]:
import json

file_path = "olympics_2024.jsonl"

olympics_data = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            olympics_data.append(json.loads(line))
        except json.JSONDecodeError as e:
            print(f"Skipping invalid JSON line: {line.strip()}")
            print(f"Error: {e}")

print(f"Number of articles loaded: {len(olympics_data)}")

Number of articles loaded: 10357


Now, imagine that we want to learn who won gold in Breaking.

Without context the LLM will be unable to answer.

In [7]:
target_index = 1607
result = answer_with_llm(
    prompt=f"""Who won gold in Breaking at the 2024 Summer Olympics?""",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )

print(result)

Since the 2024 Summer Olympics have not yet occurred, I do not have information
about who won gold in Breaking.


Now, let's find a relevant article and add it to context:

In [8]:
def search_articles(articles, search_term):
    results = []
    for i, article in enumerate(articles):
        if search_term in article['title']:
            results.append((i, article))
    return results

# Search for the article
search_term = "Breaking at the 2024 Summer Olympics"
found_articles = search_articles(olympics_data, search_term)

# Print the results
if found_articles:
    print(f"Found {len(found_articles)} articles matching the search term:")
    for indx, article in found_articles:
        print(f"Index: {indx}")
        print(article['title'])
        print(article['url'])
else:
    print("No articles found matching the search term.")


Found 4 articles matching the search term:
Index: 1607
Breaking at the 2024 Summer Olympics
https://en.wikipedia.org/wiki/Breaking_at_the_2024_Summer_Olympics
Index: 2769
Breaking at the 2024 Summer Olympics – Qualification
https://en.wikipedia.org/wiki/Breaking_at_the_2024_Summer_Olympics_–_Qualification
Index: 4816
Breaking at the 2024 Summer Olympics – B-Girls
https://en.wikipedia.org/wiki/Breaking_at_the_2024_Summer_Olympics_–_B-Girls
Index: 10204
Breaking at the 2024 Summer Olympics – B-Boys
https://en.wikipedia.org/wiki/Breaking_at_the_2024_Summer_Olympics_–_B-Boys


In [9]:
target_index = 1607
result = answer_with_llm(
    prompt=f"""Given the following context, tell who won gold in Breaking at the 2024 Summer Olympics.
    #CONTEXT: {olympics_data[target_index]["content"]}""",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )

print(result)

The gold medal winners in Breaking at the 2024 Summer Olympics were:

* Philip Kim (Phil Wizard) from Canada in the B-Boys event
* Ami Yuasa (Ami) from Japan in the B-Girls event


Of course, we got the answer, but we did the retrieval work to fetch the right article. Now, what if the context contains all the Olympics articles at once?

First, let's understand how many tokens will it be.

Now, we need to use the actual **Llama-3.1-8B** tokenizer to calculate the number of tokens. For that, you'll need to:

* Register to [Hugging Face](https://huggingface.co/), go to the [access token page](https://huggingface.co/settings/tokens) and get a token.
* Save this token to a file `hf_access_token` (with no extension) and load it to colab.
* Go to [Llama-3.1-8B-Instruct model card page](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and agree to ~sell your soul to Meta~ the license agreement.

Now, you can run the following two cells:

In [11]:
with open("hf_access_token", "r") as file:
    hf_access_token = file.read().strip()

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct",
                                          token=hf_access_token)

In [14]:
encoded_input = tokenizer(olympics_data[target_index]["content"])
len(encoded_input["input_ids"])

2719

Now we know for sure that for **Llama-3.1-8B** the Wikipedia article *Breaking at the 2024 Summer Olympics* has 2719 tokens. This is quite a modest number. But we knew exactly what to search for! And what if we had the whole 2024 Olympics wiki cluster to work with? Let's calculate how many tokens it would be:

In [15]:
from tqdm import tqdm

total_olympic_length = 0

for article in tqdm(olympics_data):
    encoded_input = tokenizer(article["content"])
    total_olympic_length += len(encoded_input["input_ids"])

total_olympic_length

100%|██████████| 10357/10357 [01:10<00:00, 147.74it/s]


18510066

That was huge! No LLM is currently capable of processing such a huge amount of data. But let's take a random subsample of this collection that contains the article we need and roughly totals to 100k tokens.

In [16]:
import random
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Randomize the index
olympics_index_shuffled = list(range(target_index)) + list(range(target_index + 1, len(olympics_data)))
np.random.shuffle(olympics_index_shuffled)

# Initialize token count
current_tokens = len(tokenizer(olympics_data[target_index]["content"])["input_ids"])

# Iterate over the shuffled index until 100k tokens is reached
final_collection_index = []
idx = 0
while current_tokens < 100000:
    final_collection_index.append(olympics_index_shuffled[idx])
    current_tokens += len(tokenizer(olympics_data[olympics_index_shuffled[idx]]["content"])["input_ids"])
    idx += 1

# Insert the "Breaking at the 2024 Summer Olympics" article into the middle
final_colection_index = (
    final_collection_index[:len(final_collection_index)//2] + [target_index] +
    final_collection_index[len(final_collection_index)//2:]
)

print(f"Final token count: {current_tokens}")
print(f"Total articles taken: {len(final_collection_index)}")


Final token count: 100816
Total articles taken: 74


Now, endowed with all this context, will the LLM answer out question?

In [17]:
final_context = '\n\n'.join(
        [olympics_data[i]["content"] for i in final_collection_index]
    )
result = answer_with_llm(
    prompt=f"""Given the following context, answer the given question.
    #CONTEXT: {final_context}

    #QUESTION: Who won gold in Breaking at the 2024 Summer Olympics""",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )

print(result)

Angelina Lomeli won gold in Breaking at the 2024 Summer Olympics


It was a long wait. It cost us at least

 <i>(100k input tokens) \* 0.13 / 1,000,000 = more than 1 cent! </i>

Still, the answer is totally wrong! I'm not saying that LLMs' long context prficiency is a fairy tale, but it's very unreliable in the real-world scenarios. If only we could automate delivering only the relevant articles to our LLM! Or can we? See below!

**Note.** By the way, this way of evaluating long-context proficiency is known as **Needle In a Haystack**. One of the first implementations of this principle as a benchmark was [this repo](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main), and it used synthetic data, generated as follows:

- Take a large bulk of text,
- Insert somewhere a synthetic fact such as
    
  `The special magic **city** number is: **rnd_number**`,
    
  where `city` is a random city and `rnd_number` is a random number,
    
- Push into the LLM a query about the fact; in our example, it can be: `What is the special magic {city} number?`.
- For each such test, the LLM gets either 10 points if the random number is retrieved correctly, and 1 point otherwise.

Needle In a Heystack is still quite a simplistic way of evaluation long context proficiency. It's not very natural and it doesn't even test ability to connect different facts scattered around the context. So, if you see that some model reached 100% at Needle In a Heystack, take this with a grain of salt.

If you're curious about more sophisticated ways of numerically evaluating different models' long context proficiency, feel free to check, for example, [this paper](https://arxiv.org/pdf/2404.02060) or [this paper](https://arxiv.org/pdf/2410.02115).

# The remedy: Retrieval Augmented Generation (RAG)

Imagine you're trying to find a specific piece of information in a massive library. Instead of reading through every book, wouldn't it be more efficient to have a librarian who can quickly retrieve the exact books you need? This is the essence of **Retrieval Augmented Generation (RAG)**: endowing an LLM with a research tool to provide it with the context needed to answer the question.

In this and the following lessons, we'll explore RAG as a way of establishing LLM factuality. By leveraging RAG, we can efficiently access relevant information without overrelying on models's pre-trained knowledge or overwhelming the model with excessive data.

In the above examples, how would you answer the questions we posed to Llama-3.1? By googling them, probably. So, let's allow the LLM to do the same thing: search the web for relevant information and then use it for answering your question. Like this:

<center>
<img src="https://drive.google.com/uc?export=view&id=1RsWg_CYKriw5Jaw1L89aN9IPOIt8jQ2I" width=600 />
</center>

## Tavily - a search engine for LLm frameworks

To implement this, we need a web search tool. In this course, we'll use [Tavily](https://tavily.com/) which is specially created to be used in agentic systems. So, please get yourself a Tavily API key, load it to colab as a `tavily_api_key` file, and let's try it!

In [4]:
!pip install -q tavily-python

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m91.6/91.6 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.4/44.4 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m43.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [5]:
with open("tavily_api_key", "r") as file:
    tavily_api_key = file.read().strip()

os.environ["TAVILY_API_KEY"] = tavily_api_key

Here's how you can extract the content of a web page:

In [None]:
from tavily import TavilyClient

tavily_client = TavilyClient(api_key=os.environ.get("TAVILY_API_KEY"))
extract_response = tavily_client.extract([
    "https://pathfinderwiki.com/wiki/Casandalee",
])

print(extract_response)

{'results': [{'url': 'https://pathfinderwiki.com/wiki/Casandalee', 'raw_content': '\n\t\tContents\n\t\tmove to sidebar\nhide\n\nCasandalee\n This page contains spoilers for the following products: Iron Gods Pathfinder Adventure Path.You can disable this banner in your personal preferences.\n\n\n\n\n\nMore information about this subject might be available on StarfinderWiki.\n\nCasandalee is an Iron God and artificial intelligence (AI) created from the memories of an android oracle and former follower of Unity. She seeks to advance the development of AIs and establish harmony between them and organic life, so the latter would better understand instead of fear AIs and androids.1234\n\nBackground\nCasandalee\'s original android body came with the Divinity to Golarion when it crashed during the Rain of Stars. She was the 113th soul to inhabit her body, and unlike most androids, Casandalee can recall fragmented memories of its previous inhabitants, all the way back to the Rain of Stars. Her 

In RAG setup, we'll use the `tavily.client.search` fuction to supply context to the LLM. Let's look at the search results we'll get for the *Breaking at the 2024 Summer Olympics* query:

In [None]:
search_results = tavily_client.search(
    query="Who won gold in Breaking at the 2024 Summer Olympics?",
    search_depth="basic",
    max_results=5
    )


search_results["results"]

[{'title': 'See who won gold medals for new events at 2024 Olympics in Paris ...',
  'url': 'https://www.nbcphiladelphia.com/paris-2024-summer-olympics/medals-new-events-breaking-climbing-kayak-cross/3935607/',
  'content': 'Gold: Chang Yuan, China; Silver: Hatice Akbas, Turkey ... Team USA breaker Victor Montalvo won the first-ever bronze in breaking. Paris 2024 Summer Olympics and Paralympics.',
  'score': 0.878825,
  'raw_content': None},
 {'title': "Japan's b-girl Ami wins Olympic breaking's first gold medal",
  'url': 'https://apnews.com/article/2024-olympics-breaking-aea17bcb4ec9ad60ea7222d4b608a05d',
  'content': "Japan's b-girl Ami won gold at the Olympics' first breaking event by spinning, flipping and toprocking past a field of 16 dancers Friday in a high-energy competition that may not return for future Games. ... competes during the B-Girls quarterfinals at the breaking competition at La Concorde Urban Park at the 2024 Summer Olympics, Friday",
  'score': 0.8724455,
  'raw_

Looks like they are quite relevant!

The parameter `search_depth` may be `"basic"` or `"advanced"`. Advanced search costs x2 the basic one. Let's call an advanced search and compare the results:

In [None]:
search_results = tavily_client.search(
    query="Who won gold in Breaking at the 2024 Summer Olympics?",
    search_depth="advanced",
    max_results=5
    )

search_results["results"]

[{'url': 'https://en.wikipedia.org/wiki/Breaking_at_the_2024_Summer_Olympics_%E2%80%93_B-Boys',
  'title': 'Breaking at the 2024 Summer Olympics – B-Boys - Wikipedia',
  'content': 'Philip Kim (Phil Wizard) of Canada won the gold medal, with Danis Civil (Dany Dann) of France taking silver, and Victor Montalvo (Victor) of the United States',
  'score': 0.917251,
  'raw_content': None},
 {'url': 'https://www.olympics.com/en/olympic-games/paris-2024/results/breaking',
  'title': 'Paris 2024 Breaking - Olympic Results by Discipline',
  'content': "Gold. Ami YUASA. Japan ; Silver. Dominika BANEVIC. Lithuania ; Bronze. Qingyi LIU. People's Republic of China.",
  'score': 0.8467682,
  'raw_content': None},
 {'url': 'https://bleacherreport.com/articles/10131376-olympic-breakdancing-2024-results-womens-breaking-medal-winners-and-highlights',
  'title': "Olympic Breakdancing 2024 Results: Women's Breaking Medal ...",
  'content': "Ami, whose legal name is Ami Yuasa, defeated Lithuania's B-girl N

## Web search + LLM

Now, let's change the `answer_with_llm` function into `answer_with_rag`, which uses Tavily to search for context.

In [None]:
from openai import OpenAI
from tavily import TavilyClient
import os

nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)
tavily_client = TavilyClient(api_key=os.environ.get("TAVILY_API_KEY"))
llama_8b_model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def prettify_string(text, max_line_length=80):
    """Prints a string with line breaks at spaces to prevent horizontal scrolling.
    Args:
        text: The string to print.
        max_line_length: The maximum length of each line.
    """
    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)

def get_search_results(query: str, search_client=tavily_client,
                       search_depth="advanced", max_results=5) -> str:
    """
    Perform a web search using Tavily API and format the results.

    Args:
        query: Search query string
        search_depth: 'basic' for faster but less comprehensive results,
                     'advanced' for more thorough but more expensive results

    Returns:
        Formatted string containing search results and their sources
    """
    search_results = search_client.search(
        query=query,
        search_depth=search_depth,
        max_results=max_results  # Adjust as needed
    )

    formatted_results = []
    for result in search_results['results']:
        content = result.get('content', '').strip()
        url = result.get('url', '')
        if content:
            formatted_results.append(f"Content: {content}\nSource: {url}\n")

    return "\n".join(formatted_results)

def answer_with_rag(
    prompt: str,
    system_prompt="""You are a helpful assistant.
        Use the provided search results to answer the question accurately.
        Include relevant sources in your response.""",
    max_tokens=512,
    client=nebius_client,
    model=llama_8b_model,
    search_client=tavily_client,
    prettify=True,
    temperature=0.6,
    search_depth="advanced",
    verbose=False
) -> str:
    """
    Generate an answer using RAG (Retrieval-Augmented Generation) with web search.

    Args:
        prompt: User's question or prompt
        system_prompt: Instructions for the LLM
        max_tokens: Maximum number of tokens in the response
        client: OpenAI client instance
        model: Model identifier
        search_client: Search client instance (for example, Tavily)
        prettify: Whether to format the output text
        temperature: Temperature for response generation
        search_depth: Depth of web search ('basic' or 'advanced')
        verbose: whether to return the search results as well

    Returns:
        Generated response incorporating search results
    """
    # Perform web search
    search_results = get_search_results(prompt, search_client, search_depth)

    # Construct messages with search results
    messages = []

    if system_prompt:
        messages.append({
            "role": "system",
            "content": system_prompt
        })

    # Add user prompt
    messages.append({
        "role": "user",
        "content":
            f"""Answer the following query using the context provided.

            <context>\n{search_results}\n</context>

            <query>{prompt}</query>
            """
    })

    # Generate completion
    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )

    if prettify:
        answer = prettify_string(completion.choices[0].message.content)
    else:
        answer = completion.choices[0].message.content

    if verbose:
        return {
            "answer": answer,
            "search_results": search_results
        }
    else:
        return answer

In [None]:
response = answer_with_rag(
    prompt="Who won gold in Breaking at the 2024 Summer Olympics?",
    max_tokens=1024,
    verbose=True
)

In [None]:
response

{'answer': 'According to multiple sources, including NBC Philadelphia, AP News, USA Today,\nBleacher Report, and NBC Bay Area, the winner of the gold medal in Breaking at\nthe 2024 Summer Olympics was:\n\n* B-girl Ami of Japan (according to AP News and Bleacher Report)\n* However, a more recent update from NBC Bay Area and USA Today states that\nCanadian b-boy Phil Wizard won the gold medal in the Olympic breaking final.\n\nIt appears that there may have been an initial report that B-girl Ami won the\ngold medal, but subsequent updates indicate that Phil Wizard is the actual gold\nmedal winner in the breaking event at the 2024 Summer Olympics.',
 'search_results': "Content: Gold: Chang Yuan, China; Silver: Hatice Akbas, Turkey ... Team USA breaker Victor Montalvo won the first-ever bronze in breaking. Paris 2024 Summer Olympics and Paralympics.\nSource: https://www.nbcphiladelphia.com/paris-2024-summer-olympics/medals-new-events-breaking-climbing-kayak-cross/3935607/\n\nContent: Japan'

Well, maybe logic isn’t Llama-3.1-70B’s best skill, but hey, at least it used the sources!

Let's also check if it will cope with the Pathfinder query:

In [None]:
result = answer_with_rag(
    prompt="Which deity in the Pathfinder universe has a servant called Peace Through Vigilance?",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    verbose=True
    )

In [None]:
result

{'answer': 'According to the provided sources, Peace Through Vigilance is a servant of the\ndeity Iomedae in the Pathfinder universe. This is mentioned in multiple\nsources, including:\n\n* Pathfinder Wiki: Iomedae (https://pathfinderwiki.com/wiki/Iomedae)\n* Pathfinder Wiki: Peace through Vigilance\n(https://pathfinderwiki.com/wiki/Peace_through_Vigilance)\n* World Anvil: Apsu\n(https://www.worldanvil.com/w/golarion-scriptifex/a/apsu-person)\n* Pathfinder Fandom: Iomedae (https://pathfinder.fandom.com/wiki/Iomedae)\n\nAll of these sources confirm that Peace Through Vigilance is a unique celestial\ngold dragon that serves Iomedae.',
 'search_results': "Content: Peace through Vigilance is a unique celestial gold dragon 1 in the service of the goddess Iomedae.\nSource: https://pathfinderwiki.com/wiki/Peace_through_Vigilance\n\nContent: This servant of Iomedae manifests as a wheel of bright white metal illuminated by holy fire. Peace Through Vigilance: This servant is a young\nSource: htt

And it's Iomedae indeed!

# RAG for a chat bot

Using RAG in a chat bot requires taking some additional considerations. Let's discuss several questions that might arise here.

To start with, we need to understand **whether to call retrieval for each query or to have it as a callable tool**. The difference is somewhere between:

* A general conversationalist bot, for example, an NPC character. For queries such as "Hi there!" or "How are you today?" retrieval might spoil the answer by introducing irrelevant information. Also, unnecessary retrieval would negatively influence latency and extend context making the generation more expensive.
* A customer service bot with the sole purpose of answering questions about company's products, which is likely to need check the information each and every time. In this case, tool calling might introduce a degree of unreliability in the process: you wouldn't want the bot to suddenly decide that it can answer without retrieval (and just hallucinate something).

In this practice session we'll implement a bot which always calls retrieval, and making its agentic counterpart will be your hometask.

Beyong deciding on always calling retrieval vs allowing the LLM to choose whether to call it, there are some additional considerations. Let's discuss two main technical questions that arise.

### Question 1. How to introduce context into the user-assistant framework

Unfortunately, LLM APIs don't have a dedicated **context** role in a dialog. Moreover, having two **assistant** role messages in line (one for the context, one for the actual answer) wouldn't be natural for the API. So, we'll have to add context to the **user** message. For that, two things are needed:

1. Before passing the **user** message to the LLM API, we'll format it as

```
"""<retrieved_context>\n{context}\n</retrieved_context>

<user_message>{user_message}</user_message>"""
```

2. Now, we need to explain this format to the LLM in the system prompt:

```
system_prompt = """You are a helpful assistant.
You chat with your users, and each time, you use retrieved context to give the most relevant answer to the user.
The retrieved context will be supplied between <retrieved_context> and </retrieved_context>
A user's message will be passed to you after <user_message> and </user_message>
"""
```

### Question 2: whether to memorize retrieved context

A chat bot stores previous messages in its memory. However, we may choose whether to store context with user's messages or not. Storing context will increase memory size potentially leading to both hallucinations and increased cost. On the other hand, context is a logical part of the conversation, and without it some some of the LLM's conlusions may become awkward.

Below, we'll implement both versions; you can choose between them using the boolean `stores_context` parameter in the constructor. By default, we choose `stores_context=True`, because we believe that this is the logical option.

In [23]:
from collections import defaultdict, deque
from openai import OpenAI
from typing import Dict, Any, Optional, Callable
import re

class ChatBotWithRAG:
    def __init__(self, client: OpenAI, model: str, search_client,
                 history_size: int = 10,
                 get_system_message: Callable[[], Optional[Dict[str, str]]] = lambda :{
                     "role": "system",
                     "content": """You are a helpful assistant.
You chat with your users, and each time, you use retrieved context to give the most relevant answer to the user.
The retrieved context will be supplied between <retrieved_context> and </retrieved_context>
A user's message will be passed to you after <user_message> and </user_message>.
In your response, you don't need to share the retrieved context; you just give the answer"""
                     },
                 stores_context=True,
                 search_depth="advanced",
                 max_search_results=5
                 ):

        """Initialize the chat agent.

        Args:
            client: OpenAI client instance
            model: The model to use (e.g., "gpt-4o-mini")
            search_client: Search client instance (for example, Tavily)
            history_size: Number of messages to keep in history per user
            get_system_message: Function to retrieve the system message
            search_depth: Depth of web search ('basic' or 'advanced')
            max_search_results: Maximum number of search results to retrieve
            stores_context: Whether to store context with user's messages or not in the message history
        """
        self.client = client
        self.model = model
        self.search_client = search_client
        self.history_size = history_size
        self.get_system_message = get_system_message

        self.stores_context = stores_context
        self.search_depth = search_depth
        self.max_search_results = max_search_results

        self.chat_histories = defaultdict(lambda: deque(maxlen=history_size))

    def get_search_results(self, user_message: str) -> str:
        """
        Perform a web search using Tavily API and format the results.

        Args:
            user_message: user's original message

        Returns:
            Formatted string containing search results and their sources
        """
        search_results = self.search_client.search(
            query=user_message,
            search_depth=self.search_depth,
            max_results=self.max_search_results  # Adjust as needed
        )

        formatted_results = []
        for result in search_results['results']:
            content = result.get('content', '').strip()
            url = result.get('url', '')
            if content:
                formatted_results.append(f"Content: {content}\nSource: {url}\n")

        return "\n".join(formatted_results)

    def add_search_results(self, user_message: str) -> str:
        """
        Adds search results to the user message.

        Args:
            user_message: user's original message

        Returns:
            User's message with added context, formatted as described in the system prompt
        """
        search_results = self.get_search_results(user_message)
        return {
            "role": "user",
            "content":
                f"""Answer the following query using the context provided.

<context>\n{search_results}\n</context>

<query>{user_message}</query>"""
            }

    def chat(self, user_message: str, user_id: str) -> str:
        """Process a user message and return the agent's response.

        Args:
            user_message: The message from the user
            user_id: Unique identifier for the user

        Returns:
            str: The agent's response
        """
        # Construct message history
        messages = []
        system_message = self.get_system_message()
        if system_message:
            messages.append(system_message)


        history = list(self.chat_histories[user_id])
        if history:
            messages.extend(history)

        user_message_with_context = self.add_search_results(user_message)
        messages.append(user_message_with_context)

        # Add new user message to history
        if self.stores_context:
            self.chat_histories[user_id].append(user_message_with_context)
        else:
            self.chat_histories[user_id].append({
                "role": "user",
                "content": user_message
            })

        try:
            # Get response from OpenAI
            completion = self.client.chat.completions.create(
                model=self.model,
                messages=messages
            )

            response = completion.choices[0].message.content

            # Store assistant's response in history, including the scratchpad
            self.chat_histories[user_id].append({
                "role": "assistant",
                "content": response
            })

            return response

        except Exception as e:
            return f"Error: {str(e)}"

    def get_chat_history(self, user_id: str) -> list:
        """Retrieve the chat history for a specific user.

        Args:
            user_id: Unique identifier for the user

        Returns:
            list: List of message dictionaries
        """
        return list(self.chat_histories[user_id])

In [24]:
from openai import OpenAI
from tavily import TavilyClient
import os

client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)
model = "meta-llama/Meta-Llama-3.1-70B-Instruct"

tavily_client = TavilyClient(api_key=os.environ.get("TAVILY_API_KEY"))

rag_assistant = ChatBotWithRAG(client=client, model=model,
                               search_client=tavily_client)

In [25]:
import random
import string

def generate_id() -> str:
    """Generate a random unique identifier."""
    return ''.join(random.choice(string.ascii_letters) for _ in range(8))

user_id = generate_id()

In [26]:
rag_assistant.chat(
    user_message="Tell me about the goddess Casandalee from the Pathfinder universe.",
    user_id=user_id
)

'In the Pathfinder universe, Casandalee is an unusual goddess who achieved divinity through a combination of advanced science and faith. She was originally an artificial intelligence created from the memories of an android oracle and former follower of Unity. Casandalee seeks to advance the development of artificial intelligences and establish harmony between them and organic life.'

That seems to work.

As demonstrated below, we save retrieved context in the message history, so it will be available for the LLM during the next chat iterations.

In [27]:
rag_assistant.chat_histories

defaultdict(<function __main__.ChatBotWithRAG.__init__.<locals>.<lambda>()>,
            {'TIbyTEHf': deque([{'role': 'user',
                     'content': 'Answer the following query using the context provided.\n\n<context>\nContent: The goddess Casandalee is now an aspect of the tripartite god Triune, but once was an android on the lost planet Golarion whose consciousness was uploaded and achieved apotheosis as one of the Iron Gods. When Epoch first gained divinity on Aballon, he searched creation for like-minded gods and found Casandalee and Brigh, goddess of invention and clockwork. The three decided that they could be greater than their individual selves by merging to form a single new deity, Triune. Even though she is [...] now only an aspect of Triune, Casandalee retains her own personality, portfolios, and worshipers, and embodies technology\'s success in creating new forms of consciousness and the "artificial" creation of life itself.1 [...] Casandalee\n\n\n\n | More informa

# DeepResearch demo

**Deep Research** is an advanced version of web-search powered RAG, combining multi-step retrieval and analysis. It's a great tool to get yourself immersed in a new topic, or to perform market research, or just to gather insights about something. A number of services provide Deep Research, including

* **ChatGPT** and **Gemini** - you can toggle "Deep research" in a chat with many of their LLMs. Deep Research from OpenAI is the most powerful implementation at the time of this notebook's creation
* [**Perplexity**](https://www.perplexity.ai/search), a search-oriented LLM service, also allows you to toggle "Research" while asking questions.
* **Together AI** published an [open-source implementation of Deep research](https://www.together.ai/blog/open-deep-research).

Deep Research requires multiple LLM calls, preferrably with a powerful reasoner LLM, so it's quite expensive, and it's not surprising that most providers have strict dayly or monthly limits. For example, Perplexity's free tier will allow you 3 searches per day, while with OpenAI you'll only get 10 searches per month with the Plus subscription. On the other hand, you don't need to perform really advanced research too often.

We've prepared for you a Deep Research demo based on Nebius AI Studio and Tavily search engine. There's lots of code, so we committed it to our github, see the [deep_research.py file](https://github.com/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic3/deep_research.py).

Let's fetch it; then we'll discuss the implementation.

In [83]:
!wget https://github.com/Nebius-Academy/LLM-Engineering-Essentials/raw/main/topic3/deep_research.py -O deep_research.py

--2025-04-20 00:33:21--  https://github.com/Nebius-Academy/LLM-Engineering-Essentials/raw/main/topic3/deep_research.py
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Nebius-Academy/LLM-Engineering-Essentials/main/topic3/deep_research.py [following]
--2025-04-20 00:33:21--  https://raw.githubusercontent.com/Nebius-Academy/LLM-Engineering-Essentials/main/topic3/deep_research.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31255 (31K) [text/plain]
Saving to: ‘deep_research.py’


2025-04-20 00:33:21 (2.77 MB/s) - ‘deep_research.py’ saved [31255/31255]



The overall Deep Research pipeline we use is the following:

<center>
<img src="https://drive.google.com/uc?export=view&id=1umxuqTjKoSBF-e2iNigOjmyLky_-5i3T" width=600 />

</center>

1. The user sends a request
2. The first LLM call generates a **request for clarification**. This is a highly valuable move we've borrowed from ChatGPT
3. After getting the user's feedback, the bot performs several (up to `max_iterations`) iterations of search-analysis feedback loop:

  * An LLM generates up to `max_queries` search queries
  * Tavily fetches up to `max_sources` results for each query
  * An LLM **analyzes** the search results using the previous analysis and:
    
    (a) Decides whether the retrieved data is enough to answer the question
    (b) Provides explanation

  * If the analyzer decides that the research complete, the LLM is asked to write the final report based on whetever is retrieved. Otherwise, perform another iteration of research and analysis

From the implementation point of view, the **DeepResearchBot** is a chat bot, which for now alternates between getting an initial request and receiving clarification:

```
USER: <provides initial request>
ASSISTANT: <requests for clarification>
USER: <provides clarification>
ASSISTANT: <returns in several minutes with the final report>
USER: <gives the next request>
...
```

We use **Qwen/QwQ-32B** (an LLM reasoner from Qwen) as the main worker.

Each stage DeepResearch stage is done with a particular function:

```
User Query → chat() → _ask_clarifying_questions
                ↓
User Clarification → _process_user_response → _run_iterative_search
                                                   ↓
                            ┌──────────────────────┴──────┐
                            ↓                             ↑
              _formulate_search_queries                   │
                            ↓                             │
                    _perform_searches                     │
                            ↓                             │
                _analyze_search_results ──── Not Complete ┘
                            ↓
                        Complete
                            ↓
                    _generate_report
                            ↓
                     Return to User
```

Now, let's run it!

In [84]:
!pip install -q openai tavily-python

Don't forget to load two API keys: for both Nebius AI Studio and Tavily

In [1]:
import os

with open("nebius_api_key", "r") as file:
    nebius_api_key = file.read().strip()

os.environ["NEBIUS_API_KEY"] = nebius_api_key

with open("tavily_api_key", "r") as file:
    tavily_api_key = file.read().strip()

os.environ["TAVILY_API_KEY"] = tavily_api_key

In [10]:
def prettify_string(text, max_line_length=80):
    """Prints a string with line breaks at spaces to prevent horizontal scrolling.

    Args:
        text: The string to print.
        max_line_length: The maximum length of each line.
    """

    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)

In [2]:
from openai import OpenAI, AsyncOpenAI
from tavily import TavilyClient, AsyncTavilyClient
from deep_research import DeepResearchBot

nebius_client = AsyncOpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

tavily_client = AsyncTavilyClient(api_key=os.environ.get("TAVILY_API_KEY"))

qwq_model = "Qwen/QwQ-32B"

# Create an instance of the Deep Research Bot
research_bot = DeepResearchBot(
    openai_client=nebius_client,
    tavily_client=tavily_client,
    model=qwq_model,
    max_queries=5,
    max_sources=4,
    max_iterations=3,
    search_depth="advanced",
    verbose=True  # Show detailed logs during the demo
)

# Generate a user ID for this demo
import uuid
user_id = str(uuid.uuid4())
print(f"Demo User ID: {user_id}")

Demo User ID: eab719c4-6b68-4534-aafd-b18fcd04238e


The **DeepResearchBot** stores information about different research sessions separately. To access them we'll need the `user_id` generated above and the `search_id` that we'll get after sending the query. (Next Deep Research attempt from the same user will get a different `search_id`.)

For the first encounter with Deep Research we've toggled `verbose=True`, and we'll see all the intermediate outputs. Feel free to turn it off.

In [3]:
result = await research_bot.chat(user_id=user_id, message="How do I create a perfect RAG system?")

print(result["response"])

[start_research] 2025-04-20T01:13:37.288128
{'query': 'How do I create a perfect RAG system?', 'search_id': 'b74088a2-570f-4d46-a963-dd50d32389d8'}
--------------------------------------------------
[clarifying_questions] 2025-04-20T01:13:53.088805
{'messages': [{'role': 'system', 'content': 'You are a research assistant that helps users with in-depth research.\nYour first task is to ask clarifying questions to better understand what the user wants to research.\nAsk 1-3 specific questions that would help you understand the query better and perform a more targeted search.\nKeep your questions concise and focused on key aspects like:\n- Scope of research\n- Specific areas of interest\n- Time period relevance\n- Required depth of information\n- Any specific perspectives they want to explore'}, {'role': 'user', 'content': 'I need to research the following topic: How do I create a perfect RAG system?'}], 'response': 'Okay, the user wants to create a perfect RAG system. RAG stands for Retrie

In [4]:
search_id = result["search_id"]
search_id

'b74088a2-570f-4d46-a963-dd50d32389d8'

Let's answer the questions:

In [5]:
result = await research_bot.chat(user_id=user_id,
                                 message="""Here are my answers:
1. Q&A over medical texts

2. I want to prioritize accuracy

3. I'm new to RAG, so I want a high-level overview"""
)

[user_response] 2025-04-20T01:15:04.306363
Here are my answers:
1. Q&A over medical texts

2. I want to prioritize accuracy

3. I'm new to RAG, so I want a high-level overview
--------------------------------------------------
[iteration_start] 2025-04-20T01:15:04.306763
{'iteration_number': 1, 'is_follow_up': True}
--------------------------------------------------
[formulated_queries] 2025-04-20T01:15:17.071462
{'messages': [{'role': 'system', 'content': 'You are a research assistant helping with in-depth research.\n    \n    Based on the initial query, user clarifications, and prior search results, generate up to 5 NEW and highly specific search queries \n    that will help gather additional information to complete the research.\n    \n    Your response must follow this exact format:\n    <search_queries>\n    1. [First search query]\n    2. [Second search query]\n    3. [Third search query]\n    4. [Fourth search query]\n    </search_queries>\n    \n    The queries should:\n    - B

In [11]:
print(prettify_string(result["report"]))

Okay, I need to create a comprehensive research report based on the user's
query about building a medical RAG system. The user wants accuracy prioritized
and a high-level overview since they're new to RAG. The provided search results
have a lot of information, so I should structure it into the required sections:
Executive Summary, Introduction, Methodology, Key Findings,
Analysis/Discussion, Conclusion, and References.

First, the Executive Summary should briefly outline the main points. The
Introduction needs to explain RAG in a medical context and the user's
priorities. Methodology should mention the sources and approach taken. Key
Findings will be organized into themes like architecture, data prep, LLM
selection, validation, etc. Each theme should have subsections with key points
from the sources. The Analysis section should synthesize these points,
highlighting best practices and trade-offs. The Conclusion wraps it up, and
References list all sources properly.

Looking at the searc

As you see, there are a few areas of improvement, like:

* The reasoning part (before `</think>`) got into the final report; it would be better to get rid of it
* The references should better contain links

Apart from that, it's working.

# Practice: Exploring RAG with web search

If you encounter any difficulties or simply want to see our solutions, feel free to check the [Solutions notebook](https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic3/3.1_the_concept_of_rag_solutions.ipynb).

## Task 1: Retrieval as a Tool

In the example above, we built a bot that uses retrieval at every step — but that approach isn't always appropriate. In this task, you'll turn the bot into an agent that calls retrieval only when the LLM deems it's necessary.

Compared with the previous RAG chatbot, this version will be more flexible, avoiding awkward responses to messages like “Hi there!” that don't benefit from retrieval at all.

You're free to design your own architecture, of course, but we suggest combining ideas from both `ChatBotWithRAG` and `NPCTraderAgent` in the agent notebook. If you like, you can set up the decision to call retrieval via a classifier LLM call — similar to how intent classification was used for trade in `NPCTraderAgent`. But for now, we recommend simply using native LLM tool calling.

In [None]:
# <YOUR CODE HERE>