<a href="https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic3/3.3_advanced_rag_components_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Engineering Essentials 3.3. Advanced RAG components

# Practice solutions

## Task 1. Adding a reranker

In this task, you'll add a **reranking stage** to the `answer_with_rag` function.

Compare the results with and without reranking and with different reranking models. Try to come up with tricky and confusing prompts.

**Solution**. Here is our implementation. The interesting part starts at `answer_with_rag`, so feel free to scroll straight there.

In [None]:
!pip install lancedb pyarrow tiktoken -q
!pip install -qU langchain-text-splitters

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
        ".",
        " "
    ],
    chunk_size=1024,
    chunk_overlap=128,
    length_function=len,
    is_separator_regex=False,
)

In [None]:
import os
from typing import List
from functools import partial

import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

import openai
import pyarrow as pa

In [None]:
import os
import re

from tqdm import tqdm
from bs4 import BeautifulSoup
from markdown import markdown
from pathlib import Path


def markdown_to_text(markdown_string):
    """ Converts a markdown string to plaintext """

    # md -> html -> text since BeautifulSoup can extract text cleanly
    html = markdown(markdown_string)

    # Non-greedy match, so that text between two separate comments is kept
    html = re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)
    html = re.sub('<code>bash', '<code>', html)

    # extract text
    soup = BeautifulSoup(html, "html.parser")
    text = ''.join(soup.findAll(text=True))

    text = re.sub('```(py|diff|python)', '', text)
    text = re.sub('```\n', '\n', text)
    text = re.sub('-         .*', '', text)
    text = text.replace('...', '')
    text = re.sub('\n(\n)+', '\n\n', text)

    return text


def prepare_files(input_dir="transformers/docs/source/en/", output_dir="docs"):
    # Convert string paths to Path objects
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)

    # Check if input directory exists
    assert input_dir.is_dir(), "Input directory doesn't exist"
    output_dir.mkdir(parents=True, exist_ok=True)

    for root, subdirs, files in tqdm(os.walk(input_dir)):
        root_path = Path(root)
        for file_name in files:
            file_path = root_path / file_name
            parent = root_path.stem if root_path.stem != input_dir.stem else ""

            if file_path.is_file():
                with open(file_path, encoding="utf-8") as f:
                    md = f.read()
                text = markdown_to_text(md)

                output_file = output_dir / f"{parent}_{Path(file_name).stem}.txt"
                with open(output_file, "w", encoding="utf-8") as f:
                    f.write(text)


In [None]:
!git clone https://github.com/huggingface/transformers

Cloning into 'transformers'...
remote: Enumerating objects: 278088, done.
remote: Counting objects: 100% (133/133), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 278088 (delta 80), reused 40 (delta 36), pack-reused 277955 (from 3)
Receiving objects: 100% (278088/278088), 289.32 MiB | 23.16 MiB/s, done.
Resolving deltas: 100% (206618/206618), done.
Updating files: 100% (4934/4934), done.


In [None]:
prepare_files()

  text = ''.join(soup.findAll(text=True))
6it [00:06,  1.13s/it]


In [None]:
# Clear the db dir in case you've run this cell before
!rm -rf /tmp/lancedb

db = lancedb.connect("/tmp/lancedb")

# We use this model as the encoder: https://huggingface.co/BAAI/bge-small-en-v1.5
embed_func = get_registry().get("huggingface").create(name="BAAI/bge-small-en-v1.5")


class BasicSchema(LanceModel):
    '''
    This is how we store data in the database.
    We need to have a vector here, but apart from this, we may have many other fields
    '''
    text: str = embed_func.SourceField()
    vector: Vector(embed_func.ndims()) = embed_func.VectorField(default=None)

lance_table = db.create_table(
    "transformer_docs",
    mode='overwrite',
    schema=BasicSchema
)

# Populating the database

from tqdm import tqdm
splitted_docs = []

for file in tqdm(os.listdir("docs")):
    with open("docs/"+file, "r", encoding="utf-8") as f:
        text = f.read()
        docs = text_splitter.create_documents([text])
        splitted_docs.extend([{"text": doc.page_content} for doc in docs])

lance_table.add(
    splitted_docs,
    on_bad_vectors='drop'  # or 'fill' with fill_value=0.0
)

100%|██████████| 515/515 [00:00<00:00, 4816.30it/s]


---

Now, the interesting part.

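The change to `answer_with_rag` is small: we first fetch `max_stage_1_results` candidates from the table, and then, if a reranker is given, let it keep the best `max_results` of them. It can be summarized in the following snippet: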

```python
    # Perform database search
    if table:
        try:
            stage_1_results = search_table(table, prompt,
                                           max_results=max_stage_1_results)

            # Here comes the reranker!
            if reranker_model:
                stage_1_docs = search_results_to_text(stage_1_results)
                search_results = reranker_model.rank(
                    prompt, stage_1_docs, return_documents=True, top_k=max_results
                )
            else:
                search_results = search_result_to_context(stage_1_results)
```
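One detail to keep in mind: with a reranker, `search_results` is whatever `rank` returns rather than a plain string, so it ends up in the prompt as a Python repr. Below is a minimal sketch of a helper that flattens it into text. It assumes each item is a dict with `text` and `score` keys, which is the shape `CrossEncoder.rank` from sentence-transformers returns with `return_documents=True`; other rerankers may return different objects. The name `ranked_to_context` is hypothetical and not part of the notebook.

```python
def ranked_to_context(ranked, sep="\n\n"):
    """Join reranked passages into one context string, best score first.

    Assumption: every item is a dict with at least "text" and "score" keys.
    """
    ordered = sorted(ranked, key=lambda item: item["score"], reverse=True)
    return sep.join(item["text"] for item in ordered)


# Toy input in the assumed shape (not real reranker output)
ranked = [
    {"corpus_id": 3, "score": 0.12, "text": "Unrelated passage."},
    {"corpus_id": 0, "score": 0.91, "text": "Use BitsAndBytesConfig(load_in_4bit=True)."},
]
context = ranked_to_context(ranked)
print(context)
```

Passing such a string into the `<context>` block keeps the prompt free of dict syntax and score values, which the LLM doesn't need.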

In [None]:
from openai import OpenAI
import os
from sentence_transformers import CrossEncoder

nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)
llama_8b_model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

reranker_model = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v1")

def prettify_string(text, max_line_length=80):
    """Prints a string with line breaks at spaces to prevent horizontal scrolling.
    Args:
        text: The string to print.
        max_line_length: The maximum length of each line.
    """
    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)

def search_table(table, query, max_results=15):
    return table.search(query).limit(max_results).to_pydantic(BasicSchema)

def search_result_to_context(search_result):
    return "\n\n".join(
        [record.text for record in search_result]
    )

def search_results_to_text(search_result):
    return [record.text for record in search_result]

def answer_with_rag(
    prompt: str,
    system_prompt=None,
    max_tokens=512,
    client=nebius_client,
    model=llama_8b_model,
    reranker_model=reranker_model,
    table=None,
    prettify=True,
    temperature=0.6,
    max_stage_1_results=15,
    max_results=5,
    verbose=False
) -> str:
    """
    Generate an answer using RAG (Retrieval-Augmented Generation) with database search.

    Args:
        prompt: User's question or prompt
        system_prompt: Instructions for the LLM
        max_tokens: Maximum number of tokens in the response
        client: OpenAI client instance
        model: Model identifier
        reranker_model: Reranker exposing a `rank` method; pass None to skip reranking
        table: LanceDB table to search; if None, no context is retrieved
        prettify: Whether to format the output text
        temperature: Temperature for response generation
        max_stage_1_results: Number of candidates fetched from the table
        max_results: Number of candidates kept after reranking
        verbose: Whether to return the search results as well

    Returns:
        Generated response incorporating search results; if verbose is True,
        a dict with the answer, the stage 1 results and the final search results
    """
    # Perform database search
    if table:
        try:
            stage_1_results = search_table(table, prompt,
                                           max_results=max_stage_1_results)

            # Here comes the reranker!
            if reranker_model:
                stage_1_docs = search_results_to_text(stage_1_results)
                search_results = reranker_model.rank(
                    prompt, stage_1_docs, return_documents=True, top_k=max_results
                )
            else:
                search_results = search_result_to_context(stage_1_results)

        except (AttributeError, ValueError) as err:
            print(err)
            stage_1_results = []
            search_results = []
    else:
        stage_1_results = []
        search_results = []

    # Construct messages with search results
    messages = []

    if system_prompt:
        messages.append({
            "role": "system",
            "content": system_prompt
        })

    # Add user prompt
    messages.append({
        "role": "user",
        "content":
            f"""Answer the following query using the context provided.

            <context>\n{search_results}\n</context>

            <query>{prompt}</query>
            """
    })

    # Generate completion
    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )

    if prettify:
        answer = prettify_string(completion.choices[0].message.content)
    else:
        answer = completion.choices[0].message.content

    if verbose:
        return {
            "answer": answer,
            "stage_1_results": stage_1_results,
            "search_results": search_results
        }
    else:
        return answer

In [None]:
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)
model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

reranker_model = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v1")

results = answer_with_rag("""How to quantize a model in 4 bits?""",
               client=client, model=model, reranker_model=reranker_model,
               table=lance_table, verbose=True,
               max_stage_1_results=15, max_results=5)
print(results["answer"])

Based on the provided context, the query can be answered as follows:

To quantize a model in 4 bits, you can use the `BitsAndBytesConfig` class from
the `transformers` library. Here is an example of how to do it:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=quantization_config,
torch_dtype=torch.bfloat16,
device_map="auto"
)
```

This code creates a `BitsAndBytesConfig` instance with `load_in_4bit=True` and
`bnb_4bit_compute_dtype=torch.bfloat16`, which tells the library to load the
model in 4 bits and use bfloat16 as the compute dtype. Then, it uses the
`from_pretrained` method to load the model with the specified quantization
configuration.

Alternatively, you can also use the example provided in the con

Let's take a look at the retrieved context pieces and their reranker scores:

In [None]:
results

{'answer': 'Based on the provided context, the query can be answered as follows:\n\nTo quantize a model in 4 bits, you can use the `BitsAndBytesConfig` class from\nthe `transformers` library. Here is an example of how to do it:\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer,\nBitsAndBytesConfig\n\nmodel_id = "meta-llama/Llama-3.1-8B-Instruct"\nquantization_config = BitsAndBytesConfig(\nload_in_4bit=True,\nbnb_4bit_compute_dtype=torch.bfloat16\n)\n\nmodel = AutoModelForCausalLM.from_pretrained(\nmodel_id,\nquantization_config=quantization_config,\ntorch_dtype=torch.bfloat16,\ndevice_map="auto"\n)\n```\n\nThis code creates a `BitsAndBytesConfig` instance with `load_in_4bit=True` and\n`bnb_4bit_compute_dtype=torch.bfloat16`, which tells the library to load the\nmodel in 4 bits and use bfloat16 as the compute dtype. Then, it uses the\n`from_pretrained` method to load the model with the specified quantization\nconfiguration.\n\nAlternatively, you ca
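The raw dictionary is hard to read. A small sketch of a more compact view of the reranked passages follows, again assuming each entry of `results["search_results"]` is a dict with `corpus_id`, `score` and `text` keys (the `CrossEncoder.rank` output shape). `summarize_ranked` is a hypothetical helper, and the input below is toy data, not actual output.

```python
def summarize_ranked(ranked, width=60):
    """Return one line per passage: position, score, stage 1 index, text preview."""
    lines = []
    for position, item in enumerate(ranked, start=1):
        # Collapse newlines and repeated spaces, then cut to a short preview
        snippet = " ".join(item["text"].split())[:width]
        lines.append(
            f"{position}. score={item['score']:.3f} id={item['corpus_id']} {snippet}"
        )
    return lines


# Toy data in the assumed shape
toy_ranked = [
    {"corpus_id": 4, "score": 0.9137, "text": "Load the model\nin 4-bit precision."},
    {"corpus_id": 0, "score": 0.25, "text": "Some other passage."},
]
summary = summarize_ranked(toy_ranked)
print("\n".join(summary))
```

With the real pipeline, this would be called as `summarize_ranked(results["search_results"])`.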

Now let's try a different reranker, `mxbai-rerank-base-v2`, and run the whole pipeline on our favourite query:

In [None]:
# This might require you to restart a session
!pip install -q mxbai-rerank

In [None]:
from mxbai_rerank import MxbaiRerankV2

reranker_model = MxbaiRerankV2("mixedbread-ai/mxbai-rerank-base-v2")

results = answer_with_rag("""How to quantize a model in 4 bits?""",
               client=client, model=model, reranker_model=reranker_model,
               table=lance_table, verbose=True,
               max_stage_1_results=15, max_results=5)
print(results["answer"])

You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Based on the provided context, to quantize a model in 4 bits, you can use the
`BitsAndBytesConfig` class from the Transformers library.

Here's an example code snippet from the context:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM,
BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained(
"tiiuae/falcon-7b",
torch_dtype=torch.bfloat16,
device_map="auto",
quantization_config=quantization_config,
)
```

In this example, the `BitsAndBytesConfig` class is used to configure the
quantization of the model. The `load_in_4bit=True` parameter specifies that the
model should be loaded in 4-bit precision. The `bnb_4bit_compute_dtype`
parameter specifies the data type to use for 4-bit computations, and the
`bnb_4bit_quant_type` and

In [None]:
results

{'answer': 'Based on the provided context, to quantize a model in 4 bits, you can use the\n`BitsAndBytesConfig` class from the Transformers library.\n\nHere\'s an example code snippet from the context:\n\n```python\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM,\nBitsAndBytesConfig\n\nquantization_config = BitsAndBytesConfig(\nload_in_4bit=True,\nbnb_4bit_compute_dtype=torch.bfloat16,\nbnb_4bit_quant_type="nf4",\nbnb_4bit_use_double_quant=True,\n)\n\ntokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")\nmodel = AutoModelForCausalLM.from_pretrained(\n"tiiuae/falcon-7b",\ntorch_dtype=torch.bfloat16,\ndevice_map="auto",\nquantization_config=quantization_config,\n)\n```\n\nIn this example, the `BitsAndBytesConfig` class is used to configure the\nquantization of the model. The `load_in_4bit=True` parameter specifies that the\nmodel should be loaded in 4-bit precision. The `bnb_4bit_compute_dtype`\nparameter specifies the data type to use for 4-bit compu
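To compare retrieval with and without reranking, as the task asks, it can help to see how far each passage moved between the stage 1 order and the reranked order. Here is a minimal sketch that works on plain lists of passage texts, so it doesn't depend on the output format of a particular reranker. `rank_shifts` is a hypothetical helper, and it assumes the passage texts are unique.

```python
def rank_shifts(stage_1_texts, reranked_texts):
    """For each reranked passage, return (new_position, old_position, shift).

    Positions are zero-based; a positive shift means the reranker promoted
    the passage relative to the stage 1 (vector search) order.
    Assumption: passage texts are unique, since they are used as dict keys.
    """
    old_position = {text: i for i, text in enumerate(stage_1_texts)}
    return [
        (new, old_position[text], old_position[text] - new)
        for new, text in enumerate(reranked_texts)
    ]


# Toy example: four stage 1 passages, the reranker keeps two of them
stage_1 = ["a", "b", "c", "d"]
reranked = ["c", "a"]
shifts = rank_shifts(stage_1, reranked)
print(shifts)
```

Large positive shifts on tricky prompts are a sign that the reranker is doing real work; if all shifts are near zero, the first-stage order was already good.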