# Textual RAG

![RAG Image](../data/rag.png)

Retrieval-Augmented Generation (RAG) is an AI technique that combines information retrieval with text generation. Instead of relying solely on a pre-trained language model’s internal knowledge, RAG dynamically retrieves relevant documents from an external knowledge base before generating a response.

![Why RAG Image](../data/why_rag.png)

1. **Improved Accuracy:** RAG enhances the factual correctness of generated responses by retrieving up-to-date and domain-specific information, reducing the likelihood of hallucinations (fabricated information).

2. **Better Generalization:** Since RAG dynamically retrieves relevant documents, it performs well across various domains without requiring extensive fine-tuning, making it more adaptable to new topics.

3. **Reduced Model Size Requirements:** Instead of embedding all knowledge within a large model, RAG leverages external databases, allowing for smaller, more efficient models while maintaining high-quality responses.

4. **Enhanced Explainability:** By referencing retrieved documents, RAG provides verifiable sources for its answers, making it more transparent and easier to trust compared to purely generative models.

5. **And more...**

In this exercise, you will learn how to implement a Retrieval-Augmented Generation (RAG) pipeline from scratch, without relying on tools like `langchain`. While `langchain` is a powerful framework that simplifies the development of RAG pipelines, it can sometimes lack flexibility for custom implementations, as it abstracts many components.

The different components of the pipeline are:  

- **Text extraction from PDFs** – Extract raw text from PDF files to make the content processable.  
- **Text chunking** – Break the extracted text into smaller, meaningful segments to improve retrieval efficiency.  
- **Embedding of the chunks** – Convert text chunks into numerical representations (embeddings) using a pre-trained model.  
- **Storage of the embeddings in a vector store** – Save the embeddings in a specialized database (vector store) to enable fast similarity searches.  
- **Relevant chunks retrieval** – Query the vector store to find the most relevant text chunks based on user input.  
- **Setting and prompting of the LLM for a RAG** – Structure prompts and configure the language model to integrate retrieved information into its responses.  
- **Additional tools for improved retrieval** – Use techniques like query expansion to reformulate user queries for better recall and reciprocal rank fusion to combine results from multiple retrieval methods.  
- **Final RAG pipeline implementation** – Integrate all components into a complete system that retrieves relevant information and generates enhanced responses using the language model.  

**Note:** To complete this exercise, you need an OpenAI API key, the PDF files, and the necessary libraries installed (see `requirements.txt`).  

In [1]:
!pip install -r requirements.txt



In [2]:
import os
import getpass
import json

import chromadb

from src.data_classes import Chunk
from src.data_processing import SimpleChunker, PDFExtractorAPI
from src.embedding import (
    OpenAITextEmbeddings,
    compute_openai_large_embedding_cost,
)
from src.vectorstore import (
    ChromaDBVectorStore,
    VectorStoreRetriever,
)
from src.llm import OpenAILLM
from src.rag import Generator, DefaultRAG, query_expansion

In [3]:
data_folder = "../data"

pdf_files = [
    "Explainable_machine_learning_prediction_of_edema_a.pdf",
    "Modeling tumor size dynamics based on real‐world electronic health records.pdf",
]
example_pdf_file = "Explainable_machine_learning_prediction_of_edema_a.pdf"
example_pdf_path = os.path.join(data_folder, example_pdf_file)

vector_store_collection = "text_collection"

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Example

The example uses only `Explainable_machine_learning_prediction_of_edema_a.pdf`. Please have a quick look at it before starting the exercise.

In [5]:
test_question = "According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?"

## LLM  

The LLM is the core of the RAG system, responsible for generating responses based on the retrieved information. Many options are available, on-premise or online, each with different trade-offs in performance, speed, specialized knowledge, and cost. In this case, we use `gpt-4o-mini`.  

This LLM expects input in the form of a list of messages, where each message includes the content and the role of the speaker (e.g., system, user, assistant).  

In this codebase, roles and messages are defined as follows:

```python
class Roles(str, Enum):
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
    TOOL = "tool"

class LLMMessage(BaseModel):
    content: Optional[str] = None
    role: Optional[Roles] = None
```

In [6]:
llm = OpenAILLM(temperature=0.5)

OpenAI LLM loaded: gpt-4o-mini; temperature: 0.5; seed: 42


In [7]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [8]:
answer, price = llm.generate([{"role": "user", "content": test_question}], verbose=True)

Total input tokens: 30
Total output tokens: 362
Total tokens: 392
Estimated cost: $0.0002


In [9]:
print(answer.content)

SHAP (SHapley Additive exPlanations) analysis is a method used to interpret the output of machine learning models by assigning each feature an importance value for a particular prediction. While I don't have access to specific datasets or studies conducted after October 2023, I can provide general guidance on how to interpret SHAP analysis results for predicting higher-grade edema (Grade 2+).

In a typical SHAP analysis for predicting medical conditions like edema, the most influential factors may include:

1. **Demographic Factors**: Age, sex, and ethnicity could play a significant role in the risk of developing higher-grade edema.

2. **Clinical History**: Previous medical history, including conditions like heart failure, kidney disease, or liver dysfunction, may be critical.

3. **Medication Use**: Certain medications, such as those that affect fluid retention (e.g., corticosteroids, NSAIDs), might be influential.

4. **Biomarkers**: Laboratory results, such as levels of electrolyte

## PDF Text Extraction  

The first step in the pipeline is to extract text from the document.  

In this exercise, we use the `MinerU` library, which relies on several models under the hood, including `doclayout_yolo` for layout segmentation. Note that the license of this model is not commercially permissive.

The choice of extraction tool should be carefully considered. Depending on the document type and formatting, different methods may be required to preserve text integrity and leverage structural elements such as headings, tables, or metadata for better processing. Examples include `pdfplumber` (better for tables) and `Tesseract OCR` (for scanned PDFs).

In [10]:
data_extractor = PDFExtractorAPI()
_, text, _ = data_extractor.extract_text_and_images(example_pdf_path)

In [11]:
print(text[:1000])

DOI: [10.1111/cts.70010](https://doi.org/10.1111/cts.70010)

### **ARTICLE**

![](_page_0_Picture_4.jpeg)

# **Explainable machine learning prediction of edema adverse events in patients treated with tepotinib**

**Federico Amato[1](#page-0-0)** | **Rainer Strotmann[2](#page-0-1)** | **Roberto Castell[o1](#page-0-0)** | **Rolf Bruns[2](#page-0-1)** | **Vishal Ghori[3](#page-0-2)** | **Andreas John[e2](#page-0-1)** | **Karin Berghoff[2](#page-0-1)** | **Karthik Venkatakrishna[n4](#page-0-3)** | **Nadia Terranova[5](#page-0-4)**

<span id="page-0-0"></span>1 Swiss Data Science Center (EPFL and ETH Zurich), Lausanne, Switzerland

<span id="page-0-1"></span>2 The healthcare business of Merck KGaA, Darmstadt, Germany

<span id="page-0-2"></span>3 Ares Trading S.A., Eysins, Switzerland, an affiliate of Merck KGaA, Darmstadt, Germany

<span id="page-0-3"></span>4 EMD Serono, Billerica, Massachusetts, USA

<span id="page-0-4"></span>5 Quantitative Pharmacology, Ares Trading S.A., Lausanne, Swi

## Text Chunking  

The second step is to split the extracted text into smaller chunks, which will later be embedded and retrieved efficiently.  

In this exercise, we use a simple heuristic approach. The text is split hierarchically: first by headings (`#`), then by line breaks (`\n`), and finally by sentences (`.`). A piece is split at the next level only if it still exceeds a predefined maximum length (`max_chunk_size`). More advanced techniques exist, such as **semantic chunking** (which splits based on meaning rather than syntax) or **agentic chunking** (which dynamically adapts chunk sizes based on context).  

Each chunk is enriched with metadata, including:  
- **Source file** – The document from which the chunk originates.  
- **Chunk counter** – The position of the chunk within the file.  
- **Unique identifier (`chunk_id`)** – Ensures each chunk can be referenced independently.  

Additional metadata could be included to enable more refined filtering and retrieval strategies.  

Here, our chunks are defined as:
```python
class Chunk(BaseModel):
    chunk_id: int
    content: str
    metadata: dict = Field(default_factory=dict)
    score: Optional[float] = None
```  
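The hierarchical splitting described above can be sketched in a few lines. This is a minimal illustration, not the actual `SimpleChunker` implementation: the function names `split_recursive` and `merge_small`, the exact separators, and the greedy merging step are all assumptions made for the example.

```python
def split_recursive(text, max_chunk_size, separators=("\n#", "\n", ". ")):
    """Split text on coarser separators first, finer ones only when needed."""
    text = text.strip()
    if not text:
        return []
    # Stop when the piece fits, or when no separator is left to try
    # (in that case the piece may still exceed max_chunk_size).
    if len(text) <= max_chunk_size or not separators:
        return [text]
    head, *rest = separators
    pieces = []
    for part in text.split(head):
        pieces.extend(split_recursive(part, max_chunk_size, tuple(rest)))
    return pieces


def merge_small(pieces, max_chunk_size):
    """Greedily merge neighbouring pieces while they fit in one chunk."""
    merged, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


sample = "# A\nfirst line. second sentence.\n# B\nshort"
for chunk in merge_small(split_recursive(sample, 20), 20):
    print(repr(chunk))
```

Note that this sketch drops the separators themselves (for instance the `#` of a heading after the first one), which a production chunker would probably want to preserve.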

In [12]:
file_metadata = {"source_text": example_pdf_file}

text_chunker = SimpleChunker(max_chunk_size=1000)

chunks = text_chunker.chunk_text(text, file_metadata)

In [13]:
print(len(chunks))
chunks[0]

62


Chunk(chunk_id=0, content='DOI: [10.1111/cts.70010](https://doi.org/10.1111/cts.70010) ### **ARTICLE** ![](_page_0_Picture_4.jpeg) # **Explainable machine learning prediction of edema adverse events in patients treated with tepotinib** **Federico Amato[1](#page-0-0)** | **Rainer Strotmann[2](#page-0-1)** | **Roberto Castell[o1](#page-0-0)** | **Rolf Bruns[2](#page-0-1)** | **Vishal Ghori[3](#page-0-2)** | **Andreas John[e2](#page-0-1)** | **Karin Berghoff[2](#page-0-1)** | **Karthik Venkatakrishna[n4](#page-0-3)** | **Nadia Terranova[5](#page-0-4)** <span id="page-0-0"></span>1 Swiss Data Science Center (EPFL and ETH Zurich), Lausanne, Switzerland <span id="page-0-1"></span>2 The healthcare business of Merck KGaA, Darmstadt, Germany <span id="page-0-2"></span>3 Ares Trading S.A., Eysins, Switzerland, an affiliate of Merck KGaA, Darmstadt, Germany <span id="page-0-3"></span>4 EMD Serono, Billerica, Massachusetts, USA', metadata={'source_text': 'Explainable_machine_learning_prediction_of

## Embedding Model  

Once the text is split into chunks, each chunk is converted into a numerical representation (embedding) that captures its meaning.  

Here, we use OpenAI’s `text-embedding-3-large`, but other options exist, each with different trade-offs in deployment (on-premise vs. online), accuracy, speed, and cost. The choice of model depends on the specific needs of the retrieval task.
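To give an intuition for how embeddings are compared, here is a small sketch of cosine similarity and top-k selection on toy 2-dimensional vectors. The helper names `cosine_similarity` and `top_k` are hypothetical, and cosine similarity is only one common choice; the vector store used in this exercise may rely on a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return (index, similarity) pairs of the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_chunks, k=2))
```

Real embeddings work the same way, only with 3072 dimensions instead of 2.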

In [14]:
_ = compute_openai_large_embedding_cost(chunks, verbose=True)

Total tokens: 13665
Estimated cost: $0.0018
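
The estimate above can be reproduced with simple arithmetic. The sketch below assumes a price of $0.13 per million tokens for `text-embedding-3-large`; this figure is an assumption that should be checked against current OpenAI pricing, and the internals of `compute_openai_large_embedding_cost` are not shown here.

```python
# Assumed price in USD per one million input tokens (verify before relying on it).
PRICE_PER_MILLION_TOKENS_USD = 0.13


def embedding_cost(total_tokens, price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Estimated embedding cost in USD for a given token count."""
    return total_tokens * price_per_million / 1_000_000


print(f"${embedding_cost(13665):.4f}")  # prints $0.0018
```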


In [15]:
embedding_model = OpenAITextEmbeddings()
embeddings = embedding_model.get_embedding([chunk.content for chunk in chunks])

In [16]:
print(embeddings.shape)
embeddings[0]

(62, 3072)


array([-0.03154377,  0.00651717, -0.01348095, ...,  0.00990808,
        0.00328753, -0.00139772], shape=(3072,))

## Vector Store and Retriever  

After embedding the chunks, they need to be stored for efficient retrieval. The choice of vector store depends on factors like accuracy, speed, and filtering options. In this exercise, we use `ChromaDB`.  

The next step is retrieving the most relevant chunks based on a query. In this implementation, the retriever uses only embeddings (dense search). However, in some cases, sparse search methods like BM25, or hybrid approaches combining both dense and sparse search, can be used for better accuracy.
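For comparison with embedding-based retrieval, here is a minimal sketch of BM25, a lexical scoring function based on term frequencies. It is not part of this notebook's pipeline. The tokenizer, the function names, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric tokens only (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "albumin levels and edema risk",
    "tumor size dynamics",
    "patient age and dose",
]
print(bm25_scores("albumin edema", docs))
```

A hybrid retriever would typically run both a lexical and an embedding-based search and then combine the two rankings.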

In [17]:
vector_store = ChromaDBVectorStore(vector_store_collection)
vector_store.insert_documents(chunks, embeddings)

In [18]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [19]:
retriever = VectorStoreRetriever(embedding_model, vector_store)
results = retriever.retrieve(test_question, 5)
results

[[{'chunk_id': '39',
   'score': 0.5416353940963745,
   'chunk': Chunk(chunk_id=39, content='Points are colored based on the edema grade at the following safety visit. SHAP, Shapley Additive exPlanations. with higher grades of edema, particularly grade 2+. On the other hand, for higher albumin levels the corresponding SHAP values are mostly negative and ranging from 0 to −0.5, suggesting a reduced risk of edema of grade 2+. The association between age greater than 70years and an increased likelihood of edemas of grades 2+ was also confirmed. Additionally, for all ages, higher SHAP values were assigned to patients who experienced edemas, particularly of grade 2+. Finally, within low ranges of cumulated dose in the interval [ *t* − 14 days, *t* ] normalized over 14days, higher SHAP values were assigned to samples corresponding to edemas of grades 2+. This could reflect the tendency to adjust administered doses in those cases where the risk of edema was identified. # **DISCUSSION**', meta

## Generator  

Once the LLM is set up, a specific prompt needs to be defined for the RAG system. This prompt must include the retrieved chunks as context. The prompt has to be adapted to each specific project.

In addition to the basic prompt, we incorporate **prompt engineering** by asking the LLM to justify its answer. The model is also instructed to indicate which chunks were most relevant in forming its response, improving **interpretability**, and to provide the answer in **JSON format** for easier data management.
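
As a rough illustration of the idea, the sketch below shows how retrieved chunks might be numbered and inserted into a template, and how a JSON answer could be parsed defensively. The functions `build_messages` and `parse_answer` are hypothetical and are not the `Generator` implementation used in this exercise.

```python
import json


def build_messages(system_prompt, rag_template, query, chunk_texts):
    """Number the chunks and fill the template's {context} and {query} slots."""
    context = "\n\n".join(
        f"Document {i}: \n{text}" for i, text in enumerate(chunk_texts, start=1)
    )
    return [
        # The system prompt is passed through untouched (it may contain braces).
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": rag_template.format(context=context, query=query)},
    ]


def parse_answer(raw):
    """Parse the model's JSON reply, falling back to the raw text if it is invalid."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        return {"step_by_step_thinking": "", "document_used": [], "answer": raw}
    return parsed


messages = build_messages(
    "You answer using the documents.",
    "DOCUMENTS:\n{context}\n\nQUESTION:\n{query}",
    "Which factors matter?",
    ["First chunk.", "Second chunk."],
)
print(messages[1]["content"])
```

Numbering the documents in the prompt is what allows the model to report which ones it used in the `document_used` field.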

In [20]:
default_system_prompt = """You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer."""
print(default_system_prompt)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.


In [21]:
default_rag_template = """
Here are the relevant DOCUMENTS:
{context}

--------------------------------------------

Here is the USER QUESTION:
{query}

--------------------------------------------

Please think step-by-step and generate your output in json:
"""
print(default_rag_template)


Here are the relevant DOCUMENTS:
{context}

--------------------------------------------

Here is the USER QUESTION:
{query}

--------------------------------------------

Please think step-by-step and generate your output in json:



In [22]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [23]:
generator = Generator(
    llm, system_prompt=default_system_prompt, rag_template=default_rag_template
)

In [24]:
answer, cost = generator.generate(
    history=[],
    query=test_question,
    chunks=[
        results[0][0]["chunk"],
        Chunk(chunk_id=1, content="DATE: 1999.12.02", metadata={}),
    ],
    verbose=True,
)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
Points are colored based on the edema grade at the following safety visit. SHAP, Shapley Additive exPlanations. with higher grades of edema, particularly grade 2+. On the other hand, for higher albumin levels the corresponding SHAP values are mostly negative and ranging from 0 to −0.5, suggesting a reduced risk of edema of grade 2+. The association between age greater than 70years and an increased likelihood of edemas of grades 2+ was also confirmed. Additionally, for all ages, higher SHA

In [25]:
print(answer.content)

{
  "step_by_step_thinking": "I reviewed Document 1 to identify the factors influencing higher-grade edema (Grade 2+). The document mentions that higher edema grades correlate with certain factors, particularly age greater than 70 years, higher SHAP values for patients with edema, and lower albumin levels. Additionally, it discusses the impact of administered doses normalized over 14 days, suggesting that adjustments in doses occur when the risk of edema is identified. Therefore, the most influential factors identified are age, albumin levels, and cumulated dose adjustments.",
  "document_used": [1],
  "answer": "The most influential factors in predicting higher-grade edema (Grade 2+) according to SHAP analysis are age greater than 70 years, lower albumin levels, and adjustments in cumulated doses."
}


## RAG Tools  

There are several methods to improve the efficiency of a RAG pipeline, such as query contextualization, query reformulation, re-ranking, query expansion, etc.

In this notebook, we implement **query expansion** to enhance retrieval and apply **reciprocal rank fusion** to optimize the ranking of chunks when multiple queries are involved.
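
Reciprocal rank fusion can be sketched as follows: each chunk receives a score of `1 / (k + rank)` from every ranked list in which it appears, and these scores are summed. The function name and the constant `k=60` (a commonly used default) are assumptions; the implementation in `src.rag` may differ in its details.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings of chunk ids into one list sorted by fused score."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, so the top result contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# One ranking per (expanded) query, best match first.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
for chunk_id, score in reciprocal_rank_fusion(rankings):
    print(chunk_id, round(score, 5))
```

Because only ranks are used, the method does not require the scores of the individual retrievals to be comparable with each other.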

In [26]:
query_expansion_system_message = {
    "role": "system",
    "content": "You are a focused assistant designed to generate multiple, relevant search queries based solely on a single input query. Your task is to produce a list of these queries in English, without adding any further explanations or information.",
}

query_expansion_template_query = """
        Generate multiple search queries related to: {query}, and translate them in english if they are not already in english. Only output {expansion_number} queries in english.
        OUTPUT ({expansion_number} queries):
    """

In [27]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [28]:
answer, cost = query_expansion(
    test_question,
    llm,
    query_expansion_system_message,
    template_query_expansion=query_expansion_template_query,
    expansion_number=5,
)

answer

Total input tokens: 113
Total output tokens: 89
Total tokens: 202
Estimated cost: $0.0001


['1. What factors influence higher-grade edema according to SHAP analysis?  ',
 '2. How does SHAP analysis determine the predictors of Grade 2+ edema?  ',
 '3. Which variables are most significant in predicting higher-grade edema using SHAP?  ',
 '4. What insights does SHAP analysis provide on factors affecting Grade 2+ edema?  ',
 '5. Can SHAP analysis identify key predictors for severe edema (Grade 2+)?']
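
The returned queries still carry list numbering and trailing whitespace. If that is undesirable before embedding them, a small clean-up step could be added. The helper below is a hypothetical sketch and is not part of `query_expansion`.

```python
import re


def clean_expanded_queries(raw_queries):
    """Strip leading numbering such as '1. ' or '2) ', trim, and drop duplicates."""
    cleaned = []
    for line in raw_queries:
        text = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if text and text not in cleaned:
            cleaned.append(text)
    return cleaned


print(clean_expanded_queries(["1. What factors influence edema?  ", "2. How does SHAP work?"]))
```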

## RAG  

Finally, the RAG pipeline is defined by integrating all the previously discussed components into a unified process.

In [29]:
rag = DefaultRAG(
    llm=llm,
    text_embedding_model=embedding_model,
    text_vector_store=vector_store,
    generator=generator,
    query_expansion_system_message=query_expansion_system_message,
    query_expansion_template_query=query_expansion_template_query,
    params={"top_k": 5, "number_query_expansion": 3},
)

In [30]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [31]:
answer, sources, cost = rag.execute(test_question, {}, verbose=True)

Total input tokens: 113
Total output tokens: 59
Total tokens: 172
Estimated cost: $0.0001
Query expansion cost: 0.0001
Expanded queries:
1. What factors influence higher-grade edema (Grade 2+) according to SHAP analysis?  
2. How does SHAP analysis identify key predictors for Grade 2+ edema?  
3. Which variables are most significant in predicting Grade 2+ edema using SHAP analysis?

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
Points are colored based on the edema grade at the following safety visit. SHAP, Shapley Additive exPlanati

In [32]:
print(json.dumps(answer, indent=3))

{
   "step_by_step_thinking": "To determine the most influential factors in predicting higher-grade edema (Grade 2+), I reviewed the relevant documents. Document 5 indicates that the past current edema grade is the most influential input, especially if the same grade persists to the next safety visit. Additionally, Document 5 also highlights that albumin is the most informative time-varying covariate for predicting Grade 2+ edemas. Document 6 further supports this by showing that lower albumin levels are associated with an increased risk of developing Grade 2+ edema. Furthermore, Document 1 mentions that age greater than 70 years is associated with a higher likelihood of Grade 2+ edemas. Therefore, the key factors identified are: past current edema grade, albumin levels, and age over 70 years.",
   "document_used": [
      1,
      5,
      6
   ],
   "answer": "The most influential factors in predicting higher-grade edema (Grade 2+) are: past current edema grade, lower albumin levels,

In [33]:
# The documents retrieved by the retriever:
print(len(sources))
print(sources[0])

6
{'chunk_id': '39', 'chunk': Chunk(chunk_id=39, content='Points are colored based on the edema grade at the following safety visit. SHAP, Shapley Additive exPlanations. with higher grades of edema, particularly grade 2+. On the other hand, for higher albumin levels the corresponding SHAP values are mostly negative and ranging from 0 to −0.5, suggesting a reduced risk of edema of grade 2+. The association between age greater than 70years and an increased likelihood of edemas of grades 2+ was also confirmed. Additionally, for all ages, higher SHAP values were assigned to patients who experienced edemas, particularly of grade 2+. Finally, within low ranges of cumulated dose in the interval [ *t* − 14 days, *t* ] normalized over 14days, higher SHAP values were assigned to samples corresponding to edemas of grades 2+. This could reflect the tendency to adjust administered doses in those cases where the risk of edema was identified. # **DISCUSSION**', metadata={'source_text': 'Explainable_m

In [34]:
print(cost)

0.000396


# Exercises

The different blocks are redefined below, and a new pipeline is created that uses both PDFs.

1. Quickly go through the code and the notebook above to ensure you understand how each block works.
2. Answer the following questions related to `Explainable_machine_learning_prediction_of_edema_a.pdf` and analyze the answers:
   1. "What was identified as the most important predictor for edema occurrence?"
   2. "Which machine learning algorithm performed best for predicting edema, and what was its F1 score?"
   3. "How did cumulative tepotinib dose impact edema predictions, and what insights did SHAP provide about this relationship?"
   4. Propose your own question.
3. Review `Modeling tumor size dynamics based on real‐world electronic health records.pdf` and come up with a question. Ask it, analyze the answer, and confirm that the retriever uses relevant chunks from this source.
4. Discuss how the pipeline could be improved to achieve better answers. If time permits, implement those changes.

In [35]:
data_extractor = PDFExtractorAPI()
text_chunker = SimpleChunker(max_chunk_size=1000)

chunks = []

for pdf_file in pdf_files:
    pdf_path = os.path.join(data_folder, pdf_file)
    _, text, _ = data_extractor.extract_text_and_images(pdf_path)
    chunks_curr = text_chunker.chunk_text(text, {"source_text": pdf_file})
    chunks.extend(chunks_curr)
    print(len(chunks))

len(chunks)

62
127


127

In [36]:
_ = compute_openai_large_embedding_cost(chunks)

Total tokens: 27879
Estimated cost: $0.0036


In [37]:
embedding_model = OpenAITextEmbeddings()
embeddings = embedding_model.get_embedding([chunk.content for chunk in chunks])

In [38]:
# Reset previous
client = chromadb.Client()
client.delete_collection(vector_store_collection)

# Create new one
vector_store = ChromaDBVectorStore(vector_store_collection)
vector_store.insert_documents(chunks, embeddings)

In [39]:
llm = OpenAILLM(temperature=1.0)

OpenAI LLM loaded: gpt-4o-mini; temperature: 1.0; seed: 42


In [40]:
system_prompt = """You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer."""
print(system_prompt)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.


In [41]:
rag_template = """
Here are the relevant DOCUMENTS:
{context}

--------------------------------------------

Here is the USER QUESTION:
{query}

--------------------------------------------

Please think step-by-step and generate your output in json:
"""
print(rag_template)


Here are the relevant DOCUMENTS:
{context}

--------------------------------------------

Here is the USER QUESTION:
{query}

--------------------------------------------

Please think step-by-step and generate your output in json:



In [42]:
query_expansion_system_message = {
    "role": "system",
    "content": "You are a focused assistant designed to generate multiple, relevant search queries based solely on a single input query. Your task is to produce a list of these queries in English, without adding any further explanations or information.",
}

query_expansion_template_query = """
        Generate multiple search queries related to: {query}, and translate them in english if they are not already in english. Only output {expansion_number} queries in english.
        OUTPUT ({expansion_number} queries):
    """

In [43]:
generator = Generator(llm, system_prompt=system_prompt, rag_template=rag_template)

In [44]:
rag = DefaultRAG(
    llm=llm,
    text_embedding_model=embedding_model,
    text_vector_store=vector_store,
    generator=generator,
    query_expansion_system_message=query_expansion_system_message,
    query_expansion_template_query=query_expansion_template_query,
    params={"top_k": 1, "number_query_expansion": 0},
)

In [45]:
answer, sources, cost = rag.execute(
    "Here goes my amazing question!",
    {},
    verbose=True,
)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
<span id="page-11-1"></span>Additional supporting information can be found online in the Supporting Information section at the end of this article. **How to cite this article:** Amato F, Strotmann R, Castello R, et al. Explainable machine learning prediction of edema adverse events in patients treated with tepotinib. *Clin Transl Sci*. 2024;17:e70010. doi[:10.1111/cts.70010](https://doi.org/10.1111/cts.70010)

Document 2: 
DOI: [10.1111/cts.70010](https://doi.org/10.1111/cts.70010) ### **

In [46]:
print(json.dumps(answer, indent=3))

{
   "step_by_step_thinking": "I analyzed the given documents to answer the question regarding the machine learning study on edema in patients treated with tepotinib. I focused particularly on Document 3, which discusses the specific study objectives, the factors evaluated (54 covariates), and the use of machine learning to assess predictors of edema. Additionally, Document 5 outlines the use of SHAP values to interpret the model's predictions, indicating how different inputs affect the prediction of edema severity. These details collectively illustrate how the study utilizes machine learning for assessing edemas in a clinical context.",
   "document_used": [
      2,
      3,
      5
   ],
   "answer": "The study uses machine learning to identify baseline and time-varying factors that predict the likelihood of edema in patients treated with tepotinib. It assesses 54 covariates and employs explainability tools like SHAP values to analyze the relationship between these factors and edema

In [47]:
# The documents retrieved by the retriever:
print(len(sources))
print(sources[0])

5
{'chunk_id': '61', 'chunk': Chunk(chunk_id=61, content='<span id="page-11-1"></span>Additional supporting information can be found online in the Supporting Information section at the end of this article. **How to cite this article:** Amato F, Strotmann R, Castello R, et al. Explainable machine learning prediction of edema adverse events in patients treated with tepotinib. *Clin Transl Sci*. 2024;17:e70010. doi[:10.1111/cts.70010](https://doi.org/10.1111/cts.70010)', metadata={'source_text': 'Explainable_machine_learning_prediction_of_edema_a.pdf', 'document_chunk_id': 61}, data_type=<DataType.TEXT: 'text'>, score=None), 'fused_score': 0.016129032258064516, 'average_original_score': 1.6858880519866943}


In [48]:
print(cost)

0.0003126


# Solutions

## 1. Quickly go through the code and the notebook above to ensure you understand how each block works.

No solution provided for this exercise.

## 2. Questions about `Explainable_machine_learning_prediction_of_edema_a.pdf`

Answer the following questions related to `Explainable_machine_learning_prediction_of_edema_a.pdf` and analyze the answers.

### 2.1 "What was identified as the most important predictor for edema occurrence?"

In [49]:
answer, sources, cost = rag.execute(
    "What was identified as the most important predictor for edema occurrence?",
    {},
    verbose=True,
)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
Consistently with the above sensitivity analysis, past current edema grade was found to be the most influential input, particularly if a same grade persisted to the following safety visit. The exposure-derived features were also informative for the model probability predictions. Albumin was found as the most informative time-varying covariate, especially for predicting edemas of grades 2+. Figure [3](#page-7-1) illustrates the contribution of the input variables toward the predicted proba

In [50]:
print(json.dumps(answer, indent=3))

{
   "step_by_step_thinking": "To answer the question about the most important predictor for edema occurrence, I reviewed Document 1, which highlighted that the past current edema grade was found to be the most influential input. Additionally, it mentioned that albumin was notably informative as a time-varying covariate. However, the specific mention of the 'current edema grade' as the most informative input indicates that this is the primary focus when predicting future occurrences of edema grades 2+. Therefore, the current edema grade stands out as the most critical predictor of edema occurrence.",
   "document_used": [
      1
   ],
   "answer": "The past current edema grade was identified as the most important predictor for edema occurrence."
}


In [51]:
# The documents retrieved by the retriever:
print(len(sources))
print(sources[0])

5
{'chunk_id': '32', 'chunk': Chunk(chunk_id=32, content='Consistently with the above sensitivity analysis, past current edema grade was found to be the most influential input, particularly if a same grade persisted to the following safety visit. The exposure-derived features were also informative for the model probability predictions. Albumin was found as the most informative time-varying covariate, especially for predicting edemas of grades 2+. Figure [3](#page-7-1) illustrates the contribution of the input variables toward the predicted probability of edemas of grades 2+. The analysis reveals that the current edema grade is the most informative input, as patients with a history of edemas of grades 2+ are considered highly likely to experience the same grade in the future. Interestingly, albumin once again emerges as the most informative among the longitudinal covariates, with lower levels associated <span id="page-6-0"></span>', metadata={'source_text': 'Explainable_machine_learning
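When analyzing an answer, it can help to look up which retrieved chunks the LLM actually cited. The sketch below assumes that `answer` is a dict with a `document_used` list and that the document numbers in the prompt are 1-based positions in `sources`. This matches the output above, where Document 1 is `sources[0]`, but it is not confirmed by the pipeline code. The helper name `cited_sources` and the stand-in data are hypothetical.

```python
from types import SimpleNamespace


def cited_sources(answer, sources):
    """Return the retrieved entries whose numbers appear in answer["document_used"].

    Assumption: "Document N" in the prompt corresponds to sources[N - 1].
    """
    cited = []
    for number in answer.get("document_used", []):
        index = number - 1  # the prompt numbers documents starting at 1
        if 0 <= index < len(sources):  # ignore numbers the LLM may have invented
            cited.append(sources[index])
    return cited


# Hypothetical stand-in data that mimics the structure printed above.
example_answer = {"document_used": [1], "answer": "..."}
example_sources = [
    {"chunk_id": "32", "chunk": SimpleNamespace(content="first chunk text")},
    {"chunk_id": "7", "chunk": SimpleNamespace(content="second chunk text")},
]

for entry in cited_sources(example_answer, example_sources):
    print(entry["chunk_id"], entry["chunk"].content[:80])
```

In the notebook, the same call would be `cited_sources(answer, sources)` with the values returned by `rag.execute`.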

### 2.2 "Which machine learning algorithm performed best for predicting edema, and what was its F1 score?"

In [52]:
answer, sources, cost = rag.execute(
    "Which machine learning algorithm performed best for predicting edema, and what was its F1 score?",
    {},
    verbose=True,
)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
Generalization performances were assessed on the previously unused test set, resulting in a weighted F1 score of 0.959. Effects of calibration via Isotonic Regression on such model are shown in Figures [S3,](#page-11-0) [S4.](#page-11-0) Then, SHAP values were used to determine the 10 most relevant predictors for this model. Finally, a last model was trained using only such predictors, leading to a weighted F1 score of 0.961. Precision and recall values for this model are reported in Tabl

In [53]:
print(json.dumps(answer, indent=3))

{
   "step_by_step_thinking": "I analyzed the documents to find information on the performance of machine learning algorithms used for predicting edema. Document 2 mentions that the Random Forest (RF) algorithm performs better than Gradient Boosting Trees (GBT) for edema prediction. Document 1 provides specific F1 scores, noting an F1 score of 0.961 for models trained with relevant predictors. Additionally, Document 4 indicates that the RF model achieved an F1 score of 0.994 during evaluations on follow-up data. This suggests that the best algorithm for predicting edema is Random Forest, and its highest reported F1 score is 0.994.",
   "document_used": [
      2,
      1,
      4
   ],
   "answer": "The Random Forest algorithm performed best for predicting edema, with an F1 score of 0.994."
}


In [54]:
# The documents retrieved by the retriever:
print(len(sources))
print(sources[0])

5
{'chunk_id': '29', 'chunk': Chunk(chunk_id=29, content='Generalization performances were assessed on the previously unused test set, resulting in a weighted F1 score of 0.959. Effects of calibration via Isotonic Regression on such model are shown in Figures [S3,](#page-11-0) [S4.](#page-11-0) Then, SHAP values were used to determine the 10 most relevant predictors for this model. Finally, a last model was trained using only such predictors, leading to a weighted F1 score of 0.961. Precision and recall values for this model are reported in Table [S2](#page-11-0), showing consistent results across the different output classes. As increased age was previously found to be associated with increasing risk of edema,[10](#page-10-6) the performances of the model have been verified within the different age terciles, showing consistent results across them (weighted F1 score equal to 0.969, 0.975, and 0.938 for the three terciles, respectively).', metadata={'source_text': 'Explainable_machine_l

### 2.3 "How did cumulative tepotinib dose impact edema predictions, and what insights did SHAP provide about this relationship?"

In [55]:
answer, sources, cost = rag.execute(
    "How did cumulative tepotinib dose impact edema predictions, and what insights did SHAP provide about this relationship?",
    {},
    verbose=True,
)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
The second objective of the study was the identification of the factors predicting edema occurrence and evolution over time. The Shapley Additive exPlanations (SHAP) method was used to investigate the role different factors have toward a specific estimation of edema occurrence obtained via the best predictive model, both at population and patient level. The use of this approach overcomes the lack of explainability of ML models, which approximate complex nonlinear functions from data in a 

In [56]:
print(json.dumps(answer, indent=3))

{
   "step_by_step_thinking": "I examined Document 4, where it is noted that a higher cumulative tepotinib dose correlates with a lower probability of experiencing severe edema (grade 2+). This suggests that patients receiving longer treatment durations experience fewer severe adverse effects due to cumulative dosing. Additionally, Document 5 discusses the use of SHAP analysis, which revealed that certain covariates, including age and serum albumin levels, significantly influence edema predictions. Together, these documents indicate that while higher cumulative doses tend to reduce severe edema occurrences, SHAP provided insightful relationships that inform how various factors, including treatment duration and specific patient characteristics, interact with edema risk.",
   "document_used": [
      4,
      5
   ],
   "answer": "Higher cumulative tepotinib dose is associated with a lower probability of severe edema (grade 2+). SHAP analysis identified that serum albumin levels and pati

In [57]:
# The documents retrieved by the retriever:
print(len(sources))
print(sources[0])

5
{'chunk_id': '11', 'chunk': Chunk(chunk_id=11, content='The second objective of the study was the identification of the factors predicting edema occurrence and evolution over time. The Shapley Additive exPlanations (SHAP) method was used to investigate the role different factors have toward a specific estimation of edema occurrence obtained via the best predictive model, both at population and patient level. The use of this approach overcomes the lack of explainability of ML models, which approximate complex nonlinear functions from data in a not straightforwardly interpretable manner.[20,21](#page-10-11) # **METHODS** # **Clinical data** Data from 612 patients enrolled in five Phase I/II clinical studies with tepotinib were collected (NCT01014936, NCT01832506, NCT01988493, NCT02115373, VISION – NCT02864992).', metadata={'source_text': 'Explainable_machine_learning_prediction_of_edema_a.pdf', 'document_chunk_id': 11}, data_type=<DataType.TEXT: 'text'>, score=None), 'fused_score': 0.0

### 2.4 Propose your own question.

No solution provided for this exercise.

## 3. Review the `Modeling tumor size dynamics based on real‐world electronic health records.pdf` and come up with a question. Ask it, analyze the answer, and confirm that the retriever uses relevant chunks from this source.

In [58]:
answer, sources, cost = rag.execute(
    "What was the rationale for using an ON/OFF treatment effect model instead of a dose-dependent model?",
    {},
    verbose=True,
)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
The paucity of baseline images prevented us to estimate the growth rate constant that was fixed to a literature value obtained from a population similar to our cohort[.4](#page-10-3) Sensitivity analyses provided confidence regarding the suitability of the value for our analysis. Our data supported an ON/OFF treatment effect, with better model performances as compared to a model with a treatment effect linearly dependent on the doses. This is in line with literature evidence indicating th

In [59]:
print(json.dumps(answer, indent=3))

{
   "step_by_step_thinking": "To understand the rationale for using an ON/OFF treatment effect model over a dose-dependent model, I analyzed several documents. Document 1 states that the data supported an ON/OFF treatment effect with better model performances as compared to a model with a treatment effect linearly dependent on the doses. Document 3 mentions a significantly lower objective function value (OFV) in the ON/OFF treatment effect model compared to the dose-dependent model, indicating that the ON/OFF model is a better fit for the data. Together, these documents suggest that the ON/OFF treatment effect provides a more accurate representation of how treatment may affect outcomes in this patient population, especially at the dose levels used in clinical practice.",
   "document_used": [
      1,
      3
   ],
   "answer": "The rationale for using an ON/OFF treatment effect model is based on its better model performance and lower objective function value (OFV) compared to a dose-

In [60]:
# The documents retrieved by the retriever:
print(len(sources))
print(sources[0])

5
{'chunk_id': '103', 'chunk': Chunk(chunk_id=103, content='The paucity of baseline images prevented us to estimate the growth rate constant that was fixed to a literature value obtained from a population similar to our cohort[.4](#page-10-3) Sensitivity analyses provided confidence regarding the suitability of the value for our analysis. Our data supported an ON/OFF treatment effect, with better model performances as compared to a model with a treatment effect linearly dependent on the doses. This is in line with literature evidence indicating that, at the dose ranges used in clinical practice, the exposure-response relationship of ICIs is at the plateau of the maximal response.[4,32](#page-10-3) Additionally, we investigated different combinations of *k*kill to deal with limited data in our population of patients ![](_page_8_Figure_1.jpeg) <span id="page-8-0"></span>**FIGURE 2** Forest plots of covariate effects on TTB0 for the model including clinical covariates.', metadata={'source

In [None]:
for source in sources:
    print(source["chunk"].metadata)

{'source_text': 'Modeling tumor size dynamics based on real‐world electronic health records.pdf', 'document_chunk_id': 41}
{'source_text': 'Modeling tumor size dynamics based on real‐world electronic health records.pdf', 'document_chunk_id': 19}
{'source_text': 'Modeling tumor size dynamics based on real‐world electronic health records.pdf', 'document_chunk_id': 27}
{'source_text': 'Modeling tumor size dynamics based on real‐world electronic health records.pdf', 'document_chunk_id': 18}
{'source_text': 'Explainable_machine_learning_prediction_of_edema_a.pdf', 'document_chunk_id': 43}


The four most relevant sources do indeed come from the correct document; only the fifth comes from the other paper.
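This check can be made less manual by counting the retrieved chunks per source document. The following is a minimal sketch, assuming each entry in `sources` holds a `chunk` object whose `metadata` dict has a `source_text` key, as in the output above. The helper name `count_sources_by_document` and the example file names are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


def count_sources_by_document(sources):
    """Count how many retrieved chunks come from each source document."""
    return Counter(entry["chunk"].metadata["source_text"] for entry in sources)


# Hypothetical stand-in data with the same shape as the retriever output.
example_sources = [
    {"chunk": SimpleNamespace(metadata={"source_text": "paper_a.pdf"})},
    {"chunk": SimpleNamespace(metadata={"source_text": "paper_a.pdf"})},
    {"chunk": SimpleNamespace(metadata={"source_text": "paper_b.pdf"})},
]

counts = count_sources_by_document(example_sources)
for document, n_chunks in counts.most_common():
    print(f"{n_chunks} chunk(s) from {document}")
```

In the notebook, `count_sources_by_document(sources)` would show at a glance how the retrieved chunks are distributed across the two PDFs.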

## 4. Discuss how the pipeline could be improved to achieve better answers. If time permits, implement those changes.

No solution provided for this exercise.

Participants can discuss every step of the pipeline: data extraction, embedding models, retrieval improvements, sending more chunks to the LLM, passing part of the metadata to the LLM (for example, which document each chunk comes from), better prompting, etc.
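One of these ideas, passing source metadata to the LLM, could be sketched as follows. The function builds the "DOCUMENTS" part of the prompt and adds the name of the originating file to each document header. It is an illustration only, not the pipeline's actual prompt builder: the name `format_documents`, the header layout, and the stand-in data are assumptions.

```python
from types import SimpleNamespace


def format_documents(sources):
    """Build the documents section of a prompt, including each chunk's source file."""
    blocks = []
    for number, entry in enumerate(sources, start=1):
        chunk = entry["chunk"]
        # Fall back to a placeholder if the metadata has no source name.
        source_name = chunk.metadata.get("source_text", "unknown source")
        blocks.append(f"Document {number} (source: {source_name}):\n{chunk.content}")
    return "\n\n".join(blocks)


# Hypothetical stand-in data with the same shape as the retriever output.
example_sources = [
    {"chunk": SimpleNamespace(content="first chunk text",
                              metadata={"source_text": "paper_a.pdf"})},
    {"chunk": SimpleNamespace(content="second chunk text", metadata={})},
]

prompt_documents = format_documents(example_sources)
print(prompt_documents)
```

With the source name visible, the LLM can distinguish statements that come from different papers, which matters when the retriever returns chunks from more than one document, as in exercise 3.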

----------------