# Textual RAG

![RAG Image](../data/rag.png)

Retrieval-Augmented Generation (RAG) is an AI technique that combines information retrieval with text generation. Instead of relying solely on a pre-trained language model’s internal knowledge, RAG dynamically retrieves relevant documents from an external knowledge base before generating a response.

![Why RAG Image](../data/why_rag.png)

1. **Improved Accuracy:** RAG enhances the factual correctness of generated responses by retrieving up-to-date and domain-specific information, reducing the likelihood of hallucinations (fabricated information).

2. **Better Generalization:** Since RAG dynamically retrieves relevant documents, it performs well across various domains without requiring extensive fine-tuning, making it more adaptable to new topics.

3. **Reduced Model Size Requirements:** Instead of embedding all knowledge within a large model, RAG leverages external databases, allowing for smaller, more efficient models while maintaining high-quality responses.

4. **Enhanced Explainability:** By referencing retrieved documents, RAG provides verifiable sources for its answers, making it more transparent and easier to trust compared to purely generative models.

5. **And more...**

In this exercise, you will learn how to implement a Retrieval-Augmented Generation (RAG) pipeline from scratch, without relying on tools like `langchain`. While `langchain` is a powerful framework that simplifies the development of RAG pipelines, it can sometimes lack flexibility for custom implementations, as it abstracts many components.

The different components of the pipeline are:  

- **Text extraction from PDFs** – Extract raw text from PDF files to make the content processable.  
- **Text chunking** – Break the extracted text into smaller, meaningful segments to improve retrieval efficiency.  
- **Embedding of the chunks** – Convert text chunks into numerical representations (embeddings) using a pre-trained model.  
- **Storage of the embeddings in a vector store** – Save the embeddings in a specialized database (vector store) to enable fast similarity searches.  
- **Relevant chunks retrieval** – Query the vector store to find the most relevant text chunks based on user input.  
- **Setting and prompting of the LLM for a RAG** – Structure prompts and configure the language model to integrate retrieved information into its responses.  
- **Additional tools for improved retrieval** – Use techniques like query expansion to reformulate user queries for better recall and reciprocal rank fusion to combine results from multiple retrieval methods.  
- **Final RAG pipeline implementation** – Integrate all components into a complete system that retrieves relevant information and generates enhanced responses using the language model.  

**Note:** To complete this exercise, you need an OpenAI API key, the PDF files, and the necessary libraries installed (see `requirements.txt`).  

In [1]:
import os
import getpass
import json

import chromadb

from src.data_classes import Chunk
from src.data_processing import SimpleChunker, PDFExtractorAPI
from src.embedding import (
    OpenAITextEmbeddings,
    compute_openai_large_embedding_cost,
)
from src.vectorstore import (
    ChromaDBVectorStore,
    VectorStoreRetriever,
)
from src.llm import OpenAILLM
from src.rag import Generator, DefaultRAG, query_expansion

In [2]:
data_folder = "../data"

pdf_files = [
    "Explainable_machine_learning_prediction_of_edema_a.pdf",
    "Modeling tumor size dynamics based on real‐world electronic health records.pdf",
]
example_pdf_file = "Explainable_machine_learning_prediction_of_edema_a.pdf"
example_pdf_path = os.path.join(data_folder, example_pdf_file)

vector_store_collection = "text_collection"

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Example

The example uses only `Explainable_machine_learning_prediction_of_edema_a.pdf`. Please have a quick look at it before starting the exercise.

In [4]:
test_question = "According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?"

## LLM  

The LLM is the core of the RAG system, responsible for generating responses based on the retrieved information. Many options are available, both on-premise and hosted, each with different trade-offs in answer quality, speed, specialized knowledge, and cost. In this case, we use `gpt-4o-mini`.  

This LLM expects input in the form of a list of messages, where each message includes the content and the role of the speaker (e.g., system, user, assistant).  

In this codebase, roles and messages are defined as follows:

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class Roles(str, Enum):
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
    TOOL = "tool"

class LLMMessage(BaseModel):
    content: Optional[str] = None
    role: Optional[Roles] = None
```
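
As a minimal illustration of this message format, the sketch below assembles a conversation as plain dictionaries, the same shape that is passed to `llm.generate` later in this notebook. The helper `build_messages` is hypothetical and is not part of `src.llm`.

```python
from typing import Dict, List, Optional

VALID_ROLES = {"system", "user", "assistant", "tool"}


def build_messages(
    user_query: str,
    system_prompt: Optional[str] = None,
    history: Optional[List[Dict[str, str]]] = None,
) -> List[Dict[str, str]]:
    """Assemble a chat as a list of {"role", "content"} dictionaries."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for message in history or []:
        # Reject roles the chat format does not know about.
        if message["role"] not in VALID_ROLES:
            raise ValueError(f"Unknown role: {message['role']}")
        messages.append(message)
    # The new user query always comes last.
    messages.append({"role": "user", "content": user_query})
    return messages


messages = build_messages(
    "What is edema?", system_prompt="You are a helpful assistant."
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The system message comes first, prior turns follow in order, and the new user query is always last.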

In [5]:
llm = OpenAILLM(temperature=0.5)

OpenAI LLM loaded: gpt-4o-mini; temperature: 0.5; seed: 42


In [6]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [7]:
answer, price = llm.generate([{"role": "user", "content": test_question}], verbose=True)

Total input tokens: 30
Total output tokens: 306
Total tokens: 336
Estimated cost: $0.0002


In [8]:
print(answer.content)

SHAP (SHapley Additive exPlanations) analysis is a method used to interpret the output of machine learning models by assigning each feature an importance value for a particular prediction. While I don't have access to specific datasets or studies conducted after October 2023, I can provide general insights into factors that are commonly influential in predicting higher-grade edema (Grade 2+) based on existing literature and clinical knowledge.

Typically, the following factors may be influential in predicting higher-grade edema:

1. **Clinical Characteristics**: Patient demographics (age, sex), comorbidities (e.g., diabetes, hypertension), and previous medical history can significantly impact edema severity.

2. **Treatment Factors**: The type and dosage of medications (e.g., chemotherapy agents, corticosteroids) can play a crucial role. Certain treatments may increase the risk of edema.

3. **Radiological Findings**: Imaging characteristics, such as tumor size, location, and the prese

## PDF Text Extraction  

The first step in the pipeline is to extract text from the document.  

In this exercise, we use the `MinerU` library, which relies on several models under the hood, including `doclayout_yolo` for layout segmentation. Note that the license of this model is not commercially permissive.

The choice of extraction tool should be carefully considered. Depending on the document type and formatting, different methods may be required to preserve text integrity and leverage structural elements such as headings, tables, or metadata for better processing. For example, `pdfplumber` is better suited to tables and `Tesseract OCR` to scanned PDFs.

In [9]:
data_extractor = PDFExtractorAPI()
_, text, _ = data_extractor.extract_text_and_images(example_pdf_path)

In [10]:
print(text[:1000])

DOI: [10.1111/cts.70010](https://doi.org/10.1111/cts.70010)

### **ARTICLE**

![](_page_0_Picture_4.jpeg)

# **Explainable machine learning prediction of edema adverse events in patients treated with tepotinib**

**Federico Amato[1](#page-0-0)** | **Rainer Strotmann[2](#page-0-1)** | **Roberto Castell[o1](#page-0-0)** | **Rolf Bruns[2](#page-0-1)** | **Vishal Ghori[3](#page-0-2)** | **Andreas John[e2](#page-0-1)** | **Karin Berghoff[2](#page-0-1)** | **Karthik Venkatakrishna[n4](#page-0-3)** | **Nadia Terranova[5](#page-0-4)**

<span id="page-0-0"></span>1 Swiss Data Science Center (EPFL and ETH Zurich), Lausanne, Switzerland

<span id="page-0-1"></span>2 The healthcare business of Merck KGaA, Darmstadt, Germany

<span id="page-0-2"></span>3 Ares Trading S.A., Eysins, Switzerland, an affiliate of Merck KGaA, Darmstadt, Germany

<span id="page-0-3"></span>4 EMD Serono, Billerica, Massachusetts, USA

<span id="page-0-4"></span>5 Quantitative Pharmacology, Ares Trading S.A., Lausanne, Swi

## Text Chunking  

The second step is to split the extracted text into smaller chunks, which will later be embedded and retrieved efficiently.  

In this exercise, we use a simple heuristic approach: the text is split iteratively, first by heading levels (`#`), then by line breaks (`\n`), and finally by sentence (`.`). A segment is split further only if it exceeds a predefined maximum length. More advanced techniques exist, such as **semantic chunking** (which splits based on meaning rather than syntax) or **agentic chunking** (which dynamically adapts chunk sizes based on context).  

Each chunk is enriched with metadata, including:  
- **Source file** – The document from which the chunk originates.  
- **Chunk counter** – The position of the chunk within the file.  
- **Unique identifier (`chunk_id`)** – Ensures each chunk can be referenced independently.  

Additional metadata could be included to enable more refined filtering and retrieval strategies.  

Here, our chunks are defined as:
```python
from typing import Optional

from pydantic import BaseModel, Field


class Chunk(BaseModel):
    chunk_id: int
    content: str
    metadata: dict = Field(default_factory=dict)
    score: Optional[float] = None
```  
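
The heading, line-break, and sentence heuristic described above can be sketched as a small recursive function. This is a simplified illustration under stated assumptions, not the actual `SimpleChunker` implementation: the function name `split_text`, the separator patterns, and the choice not to merge small neighbouring pieces are all hypothetical.

```python
import re
from typing import List

# Coarse-to-fine separators: headings, line breaks, sentence ends.
SEPARATOR_PATTERNS = [
    r"\n(?=#)",      # split before a line that starts with '#'
    r"\n+",          # split on line breaks
    r"(?<=\.)\s+",   # split after a period followed by whitespace
]


def split_text(text: str, max_chunk_size: int, level: int = 0) -> List[str]:
    """Recursively split text until each piece fits in max_chunk_size."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chunk_size or level >= len(SEPARATOR_PATTERNS):
        # Small enough, or no finer separator left: keep the piece as is.
        return [text]
    pieces: List[str] = []
    for part in re.split(SEPARATOR_PATTERNS[level], text):
        pieces.extend(split_text(part, max_chunk_size, level + 1))
    return pieces


sample = (
    "# Intro\n"
    "RAG retrieves documents. Then it generates an answer.\n"
    "# Method\n"
    "Text is split into chunks."
)
chunks = split_text(sample, max_chunk_size=40)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

A piece that is still too long after sentence splitting is kept whole in this sketch; a production chunker would typically also merge very small pieces and attach metadata to each chunk.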

In [11]:
file_metadata = {"source_text": example_pdf_file}

text_chunker = SimpleChunker(max_chunk_size=1000)

chunks = text_chunker.chunk_text(text, file_metadata)

In [12]:
print(len(chunks))
chunks[0]

62


Chunk(chunk_id=0, content='DOI: [10.1111/cts.70010](https://doi.org/10.1111/cts.70010) ### **ARTICLE** ![](_page_0_Picture_4.jpeg) # **Explainable machine learning prediction of edema adverse events in patients treated with tepotinib** **Federico Amato[1](#page-0-0)** | **Rainer Strotmann[2](#page-0-1)** | **Roberto Castell[o1](#page-0-0)** | **Rolf Bruns[2](#page-0-1)** | **Vishal Ghori[3](#page-0-2)** | **Andreas John[e2](#page-0-1)** | **Karin Berghoff[2](#page-0-1)** | **Karthik Venkatakrishna[n4](#page-0-3)** | **Nadia Terranova[5](#page-0-4)** <span id="page-0-0"></span>1 Swiss Data Science Center (EPFL and ETH Zurich), Lausanne, Switzerland <span id="page-0-1"></span>2 The healthcare business of Merck KGaA, Darmstadt, Germany <span id="page-0-2"></span>3 Ares Trading S.A., Eysins, Switzerland, an affiliate of Merck KGaA, Darmstadt, Germany <span id="page-0-3"></span>4 EMD Serono, Billerica, Massachusetts, USA', metadata={'source_text': 'Explainable_machine_learning_prediction_of

## Embedding Model  

Once the text is split into chunks, each chunk is converted into a numerical representation (embedding) that captures its meaning.  

Here, we use OpenAI’s `text-embedding-3-large`, but other options exist, each with different trade-offs in deployment (on-premise vs. hosted), accuracy, speed, and cost. The choice of model depends on the specific needs of the retrieval task.

In [13]:
_ = compute_openai_large_embedding_cost(chunks, verbose=True)

Total tokens: 13665
Estimated cost: $0.0018
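
This estimate is consistent with a flat per-token price. The sketch below reproduces the arithmetic, assuming a price of $0.13 per million tokens for `text-embedding-3-large`; that price is an assumption to verify against current OpenAI pricing, and how `compute_openai_large_embedding_cost` counts tokens is not shown here.

```python
# Assumed price in USD per one million input tokens; verify before relying on it.
PRICE_PER_MILLION_TOKENS_USD = 0.13


def estimate_embedding_cost(total_tokens: int) -> float:
    """Return the estimated embedding cost in USD for a token count."""
    return total_tokens * PRICE_PER_MILLION_TOKENS_USD / 1_000_000


cost = estimate_embedding_cost(13665)
print(f"Estimated cost: ${cost:.4f}")
```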


In [14]:
embedding_model = OpenAITextEmbeddings()
embeddings = embedding_model.get_embedding([chunk.content for chunk in chunks])

In [15]:
print(embeddings.shape)
embeddings[0]

(62, 3072)


array([-0.03147229,  0.00638142, -0.01330947, ...,  0.00983923,
        0.00326732, -0.00159018], shape=(3072,))

## Vector Store and Retriever  

After embedding the chunks, they need to be stored for efficient retrieval. The choice of vector store depends on factors like accuracy, speed, and filtering options. In this exercise, we use `ChromaDB`.  

The next step is retrieving the most relevant chunks based on a query. In this implementation, the retriever uses only embeddings (dense search). However, in some cases, sparse search methods like BM25, or hybrid approaches combining both dense and sparse search, can be used for better accuracy.
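
Embedding-based retrieval boils down to scoring every stored vector against the query vector and keeping the best matches. The sketch below shows this with cosine similarity on toy three-dimensional vectors. It is illustrative only: the function names are hypothetical, and the similarity or distance metric actually configured in `ChromaDBVectorStore` is not shown in this notebook and may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], stored: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k stored vectors closest to query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(stored)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real 3072-dimensional vectors.
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_embedding = [1.0, 0.1, 0.0]

ranking = top_k(query_embedding, toy_embeddings, k=2)
for index, score in ranking:
    print(index, round(score, 3))
```

A vector store performs the same ranking, usually with an approximate nearest-neighbour index so that the search stays fast on large collections.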

In [16]:
vector_store = ChromaDBVectorStore(vector_store_collection)
vector_store.insert_documents(chunks, embeddings)

In [17]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [18]:
retriever = VectorStoreRetriever(embedding_model, vector_store)
results = retriever.retrieve(test_question, 5)
results

[[{'chunk_id': '39',
   'score': 0.5417101979255676,
   'chunk': Chunk(chunk_id=39, content='Points are colored based on the edema grade at the following safety visit. SHAP, Shapley Additive exPlanations. with higher grades of edema, particularly grade 2+. On the other hand, for higher albumin levels the corresponding SHAP values are mostly negative and ranging from 0 to −0.5, suggesting a reduced risk of edema of grade 2+. The association between age greater than 70years and an increased likelihood of edemas of grades 2+ was also confirmed. Additionally, for all ages, higher SHAP values were assigned to patients who experienced edemas, particularly of grade 2+. Finally, within low ranges of cumulated dose in the interval [ *t* − 14 days, *t* ] normalized over 14days, higher SHAP values were assigned to samples corresponding to edemas of grades 2+. This could reflect the tendency to adjust administered doses in those cases where the risk of edema was identified. # **DISCUSSION**', meta

## Generator  

Once the LLM is set up, a specific prompt needs to be defined for the RAG system. This prompt must include the retrieved chunks as context. The prompt has to be adapted to each specific project.

In addition to the basic prompt, we incorporate **prompt engineering** by asking the LLM to justify its answer. The model is also instructed to indicate which chunks were most relevant in forming its response, improving **interpretability**, and to provide the answer in **JSON format** for easier data management.
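
To make the role of the context concrete, here is a minimal sketch of how retrieved chunks might be numbered and injected into a prompt template. The numbering mirrors the `Document 1:` labels visible in the verbose output of this notebook, but the helper names are hypothetical and the actual `Generator` implementation is not shown and may differ.

```python
from typing import List

# Shortened template for illustration, with the same placeholders.
RAG_TEMPLATE = """
Here are the relevant DOCUMENTS:
{context}

--------------------------------------------

Here is the USER QUESTION:
{query}
"""


def format_context(chunk_texts: List[str]) -> str:
    """Number each chunk so the model can cite it by index."""
    return "\n\n".join(
        f"Document {i}: \n{text}" for i, text in enumerate(chunk_texts, start=1)
    )


def build_rag_prompt(query: str, chunk_texts: List[str]) -> str:
    """Fill the template with the numbered chunks and the user query."""
    return RAG_TEMPLATE.format(context=format_context(chunk_texts), query=query)


prompt = build_rag_prompt(
    "Which factors predict higher-grade edema?",
    ["Higher albumin levels suggest a reduced risk.", "DATE: 1999.12.02"],
)
print(prompt)
```

Numbering the chunks is what allows the model to report which documents it used.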

In [19]:
default_system_prompt = """You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer."""
print(default_system_prompt)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.


In [20]:
default_rag_template = """
Here are the relevant DOCUMENTS:
{context}

--------------------------------------------

Here is the USER QUESTION:
{query}

--------------------------------------------

Please think step-by-step and generate your output in json:
"""
print(default_rag_template)


Here are the relevant DOCUMENTS:
{context}

--------------------------------------------

Here is the USER QUESTION:
{query}

--------------------------------------------

Please think step-by-step and generate your output in json:



In [21]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [22]:
generator = Generator(
    llm, system_prompt=default_system_prompt, rag_template=default_rag_template
)

In [23]:
answer, cost = generator.generate(
    history=[],
    query=test_question,
    chunks=[
        results[0][0]["chunk"],
        Chunk(chunk_id=1, content="DATE: 1999.12.02", metadata={}),
    ],
    verbose=True,
)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
Points are colored based on the edema grade at the following safety visit. SHAP, Shapley Additive exPlanations. with higher grades of edema, particularly grade 2+. On the other hand, for higher albumin levels the corresponding SHAP values are mostly negative and ranging from 0 to −0.5, suggesting a reduced risk of edema of grade 2+. The association between age greater than 70years and an increased likelihood of edemas of grades 2+ was also confirmed. Additionally, for all ages, higher SHA

In [24]:
print(answer.content)

{
  "step_by_step_thinking": "I reviewed Document 1 to identify the factors influencing higher-grade edema (Grade 2+). The document mentions that higher edema grades correlate with certain factors, particularly age greater than 70 years, higher SHAP values for patients with edema, and lower albumin levels. Additionally, it discusses the influence of cumulated dose within a specific time frame. Therefore, the most influential factors in predicting higher-grade edema include age, albumin levels, and cumulated dose adjustments.",
  "document_used": [1],
  "answer": "The most influential factors in predicting higher-grade edema (Grade 2+) according to SHAP analysis are age greater than 70 years, lower albumin levels, and higher SHAP values assigned to patients experiencing edema."
}
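
Because the model is asked to reply in JSON, the reply can be parsed and validated before use. The sketch below is a hypothetical helper, not part of `src.rag`; it assumes the three keys requested in the system prompt and tolerates a Markdown code fence around the reply, which chat models sometimes add.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = {"step_by_step_thinking", "document_used", "answer"}
FENCE = "`" * 3  # Markdown code fence marker


def parse_llm_json(raw: str) -> Dict[str, Any]:
    """Parse a JSON reply, tolerating an optional code fence around it."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    parsed = json.loads(cleaned)
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object in the LLM reply")
    missing = REQUIRED_KEYS - set(parsed)
    if missing:
        raise ValueError(f"Missing keys in LLM reply: {sorted(missing)}")
    return parsed


raw_reply = (
    FENCE + "json\n"
    '{"step_by_step_thinking": "Used Document 1.", '
    '"document_used": [1], "answer": "Age and albumin."}\n' + FENCE
)
parsed_reply = parse_llm_json(raw_reply)
print(parsed_reply["document_used"], parsed_reply["answer"])
```

Validating the keys early makes a malformed reply fail loudly instead of surfacing later in the pipeline.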


## RAG Tools  

There are several methods to improve the quality of a RAG pipeline, such as query contextualization, query reformulation, query expansion, and re-ranking.

In this notebook, we implement **query expansion** to enhance retrieval and apply **reciprocal rank fusion** to optimize the ranking of chunks when multiple queries are involved.
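
Reciprocal rank fusion merges several ranked lists by giving each chunk a score of `1 / (k + rank)` in every list where it appears and summing these scores. The sketch below is a minimal illustration: the function name is hypothetical, `k = 60` is a commonly used constant chosen here as an assumption, and the implementation in `src.rag` may use different constants or rank offsets.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(
    rankings: List[List[str]], k: int = 60
) -> Dict[str, float]:
    """Fuse ranked lists of chunk ids into one score per chunk id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the best hit contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# One ranked list of chunk ids per (expanded) query.
rankings = [
    ["39", "12", "7"],
    ["12", "39", "40"],
    ["39", "40", "12"],
]
fused = reciprocal_rank_fusion(rankings)
for chunk_id, score in fused.items():
    print(chunk_id, round(score, 5))
```

Chunks that rank well for several queries rise to the top, while a chunk that appears in only one list receives a small score.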

In [25]:
query_expansion_system_message = {
    "role": "system",
    "content": "You are a focused assistant designed to generate multiple, relevant search queries based solely on a single input query. Your task is to produce a list of these queries in English, without adding any further explanations or information.",
}

query_expansion_template_query = """
        Generate multiple search queries related to: {query}, and translate them in english if they are not already in english. Only output {expansion_number} queries in english.
        OUTPUT ({expansion_number} queries):
    """

In [26]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [27]:
answer, cost = query_expansion(
    test_question,
    llm,
    query_expansion_system_message,
    template_query_expansion=query_expansion_template_query,
    expansion_number=5,
)

answer

Total input tokens: 113
Total output tokens: 89
Total tokens: 202
Estimated cost: $0.0001


['1. What factors influence higher-grade edema according to SHAP analysis?  ',
 '2. How does SHAP analysis determine the predictors of Grade 2+ edema?  ',
 '3. Which variables are most significant in predicting higher-grade edema using SHAP?  ',
 '4. What insights does SHAP analysis provide on factors affecting Grade 2+ edema?  ',
 '5. Can SHAP analysis identify key predictors for severe edema (Grade 2+)?']

## RAG  

Finally, the RAG pipeline is defined by integrating all the previously discussed components into a unified process.
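
Conceptually, the pipeline chains four stages: expand the query, retrieve chunks for each query, fuse the rankings, and generate the answer. The sketch below shows that control flow with toy stand-in functions so that it runs without any API call. All names are hypothetical, and the internals of `DefaultRAG` are not shown in this notebook and may differ.

```python
from typing import Callable, Dict, List, Tuple


def run_rag(
    query: str,
    expand: Callable[[str], List[str]],
    retrieve: Callable[[str, int], List[str]],
    fuse: Callable[[List[List[str]]], List[str]],
    generate: Callable[[str, List[str]], str],
    top_k: int = 5,
) -> Tuple[str, List[str]]:
    """Chain the stages: expand -> retrieve per query -> fuse -> generate."""
    queries = [query] + expand(query)
    rankings = [retrieve(q, top_k) for q in queries]
    chunk_ids = fuse(rankings)
    return generate(query, chunk_ids), chunk_ids


# Toy stand-ins for the real components.
toy_index: Dict[str, List[str]] = {
    "edema factors": ["39", "12"],
    "edema predictors": ["12", "40"],
}


def toy_expand(query: str) -> List[str]:
    return ["edema predictors"]


def toy_retrieve(query: str, top_k: int) -> List[str]:
    return toy_index.get(query, [])[:top_k]


def toy_fuse(rankings: List[List[str]]) -> List[str]:
    # Order-preserving union; a real pipeline would score and re-rank here.
    seen: Dict[str, None] = {}
    for ranking in rankings:
        for chunk_id in ranking:
            seen.setdefault(chunk_id, None)
    return list(seen)


def toy_generate(query: str, chunk_ids: List[str]) -> str:
    return f"Answer to '{query}' using chunks {chunk_ids}"


toy_answer, toy_sources = run_rag(
    "edema factors", toy_expand, toy_retrieve, toy_fuse, toy_generate
)
print(toy_answer)
print(toy_sources)
```

Passing each stage as a callable keeps the orchestration independent of any specific LLM, embedding model, or vector store.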

In [28]:
rag = DefaultRAG(
    llm=llm,
    text_embedding_model=embedding_model,
    text_vector_store=vector_store,
    generator=generator,
    query_expansion_system_message=query_expansion_system_message,
    query_expansion_template_query=query_expansion_template_query,
    params={"top_k": 5, "number_query_expansion": 3},
)

In [29]:
print(test_question)

According to SHAP analysis, which factors were the most influential in predicting higher-grade edema (Grade 2+)?


In [30]:
answer, sources, cost = rag.execute(test_question, {}, verbose=True)

Total input tokens: 113
Total output tokens: 59
Total tokens: 172
Estimated cost: $0.0001
Query expansion cost: 0.0001
Expanded queries:
1. What factors influence higher-grade edema (Grade 2+) according to SHAP analysis?  
2. How does SHAP analysis identify key predictors for Grade 2+ edema?  
3. Which variables are most significant in predicting Grade 2+ edema using SHAP analysis?

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
Points are colored based on the edema grade at the following safety visit. SHAP, Shapley Additive exPlanati

In [31]:
print(json.dumps(answer, indent=3))

{
   "step_by_step_thinking": "To determine the most influential factors in predicting higher-grade edema (Grade 2+), I analyzed the relevant documents. Document 1 indicates that higher albumin levels are associated with a reduced risk of edema, while age greater than 70 years correlates with an increased likelihood of higher-grade edemas. Document 2 highlights the ranking of inputs based on SHAP values, suggesting that certain covariates significantly impact the predicted probabilities of edemas of grade 2+. Document 5 emphasizes that past current edema grade is the most influential input, especially if the same grade persists. Additionally, Document 6 confirms that older age increases the probability of higher-grade edema. Therefore, the primary influential factors identified are past edema grade, age, and albumin levels.",
   "document_used": [
      1,
      2,
      5,
      6
   ],
   "answer": "The most influential factors in predicting higher-grade edema (Grade 2+) according to

In [32]:
# The documents retrieved by the retriever:
print(len(sources))
print(sources[0])

6
{'chunk_id': '39', 'chunk': Chunk(chunk_id=39, content='Points are colored based on the edema grade at the following safety visit. SHAP, Shapley Additive exPlanations. with higher grades of edema, particularly grade 2+. On the other hand, for higher albumin levels the corresponding SHAP values are mostly negative and ranging from 0 to −0.5, suggesting a reduced risk of edema of grade 2+. The association between age greater than 70years and an increased likelihood of edemas of grades 2+ was also confirmed. Additionally, for all ages, higher SHAP values were assigned to patients who experienced edemas, particularly of grade 2+. Finally, within low ranges of cumulated dose in the interval [ *t* − 14 days, *t* ] normalized over 14days, higher SHAP values were assigned to samples corresponding to edemas of grades 2+. This could reflect the tendency to adjust administered doses in those cases where the risk of edema was identified. # **DISCUSSION**', metadata={'source_text': 'Explainable_m

In [33]:
print(cost)

0.0004002


# Exercises

The different blocks are redefined below, and a new pipeline is created that uses both PDFs.

1. Quickly go through the code and the notebook above to ensure you understand how each block works.
2. Answer the following questions related to `Explainable_machine_learning_prediction_of_edema_a.pdf` and analyze the answers:
   1. "What was identified as the most important predictor for edema occurrence?"
   2. "Which machine learning algorithm performed best for predicting edema, and what was its F1 score?"
   3. "How did cumulative tepotinib dose impact edema predictions, and what insights did SHAP provide about this relationship?"
   4. Propose your own question.
3. Review the `Modeling tumor size dynamics based on real‐world electronic health records.pdf` and come up with a question. Ask it, analyze the answer, and confirm that the retriever uses relevant chunks from this source.
4. Discuss how the pipeline could be improved to achieve better answers. If time permits, implement those changes.

In [34]:
data_extractor = PDFExtractorAPI()
text_chunker = SimpleChunker(max_chunk_size=1000)

chunks = []

for pdf_file in pdf_files:
    pdf_path = os.path.join(data_folder, pdf_file)
    _, text, _ = data_extractor.extract_text_and_images(pdf_path)
    chunks_curr = text_chunker.chunk_text(text, {"source_text": pdf_file})
    chunks.extend(chunks_curr)
    print(len(chunks))

len(chunks)

62
127


127

In [35]:
_ = compute_openai_large_embedding_cost(chunks)

Total tokens: 27879
Estimated cost: $0.0036


In [36]:
embedding_model = OpenAITextEmbeddings()
embeddings = embedding_model.get_embedding([chunk.content for chunk in chunks])

In [37]:
# Reset previous
client = chromadb.Client()
client.delete_collection(vector_store_collection)

# Create new one
vector_store = ChromaDBVectorStore(vector_store_collection)
vector_store.insert_documents(chunks, embeddings)

In [38]:
llm = OpenAILLM(temperature=1.0)

OpenAI LLM loaded: gpt-4o-mini; temperature: 1.0; seed: 42


In [39]:
system_prompt = """You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer."""
print(system_prompt)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.


In [40]:
rag_template = """
Here are the relevant DOCUMENTS:
{context}

--------------------------------------------

Here is the USER QUESTION:
{query}

--------------------------------------------

Please think step-by-step and generate your output in json:
"""
print(rag_template)


Here are the relevant DOCUMENTS:
{context}

--------------------------------------------

Here is the USER QUESTION:
{query}

--------------------------------------------

Please think step-by-step and generate your output in json:



In [41]:
query_expansion_system_message = {
    "role": "system",
    "content": "You are a focused assistant designed to generate multiple, relevant search queries based solely on a single input query. Your task is to produce a list of these queries in English, without adding any further explanations or information.",
}

query_expansion_template_query = """
        Generate multiple search queries related to: {query}, and translate them in english if they are not already in english. Only output {expansion_number} queries in english.
        OUTPUT ({expansion_number} queries):
    """

In [42]:
generator = Generator(llm, system_prompt=system_prompt, rag_template=rag_template)

In [43]:
rag = DefaultRAG(
    llm=llm,
    text_embedding_model=embedding_model,
    text_vector_store=vector_store,
    generator=generator,
    query_expansion_system_message=query_expansion_system_message,
    query_expansion_template_query=query_expansion_template_query,
    params={"top_k": 1, "number_query_expansion": 0},
)

In [44]:
answer, sources, cost = rag.execute(
    "Here goes my amazing question!",
    {},
    verbose=True,
)

You are a helpful assistant, and your task is to answer questions using relevant documents. Please first think step-by-step by mentioning which documents you used and then answer the question. Organize your output in a json formatted as dict{"step_by_step_thinking": Str(explanation), "document_used": List(integers), "answer": Str{answer}}. Your responses will be read by someone without specialized knowledge, so please have a definite and concise answer.

Here are the relevant DOCUMENTS:


Document 1: 
<span id="page-11-1"></span>Additional supporting information can be found online in the Supporting Information section at the end of this article. **How to cite this article:** Amato F, Strotmann R, Castello R, et al. Explainable machine learning prediction of edema adverse events in patients treated with tepotinib. *Clin Transl Sci*. 2024;17:e70010. doi[:10.1111/cts.70010](https://doi.org/10.1111/cts.70010)

Document 2: 
DOI: [10.1111/cts.70010](https://doi.org/10.1111/cts.70010) ### **

In [45]:
print(json.dumps(answer, indent=3))

{
   "step_by_step_thinking": "I analyzed the given documents to answer the question regarding the machine learning study on edema in patients treated with tepotinib. I focused particularly on Document 3, which discusses the specific study objectives, the factors evaluated (54 covariates), and the use of machine learning to assess predictors of edema. I also looked at Document 2 for article citation and additional context. Document 5 provides insights into the SHAP (Shapley Additive Explanations) values, which indicate how the model interprets the importance of different inputs in predicting edema. This combination of information helps in understanding the machine learning approach taken in the study.",
   "document_used": [
      2,
      3,
      5
   ],
   "answer": "The study on explainable machine learning predicts edema adverse events in patients treated with tepotinib by assessing 54 covariates as potential predictors. It uses SHAP values to determine the influence of these inpu

In [46]:
# The documents retrieved by the retriever:
print(len(sources))
print(sources[0])

5
{'chunk_id': '61', 'chunk': Chunk(chunk_id=61, content='<span id="page-11-1"></span>Additional supporting information can be found online in the Supporting Information section at the end of this article. **How to cite this article:** Amato F, Strotmann R, Castello R, et al. Explainable machine learning prediction of edema adverse events in patients treated with tepotinib. *Clin Transl Sci*. 2024;17:e70010. doi[:10.1111/cts.70010](https://doi.org/10.1111/cts.70010)', metadata={'source_text': 'Explainable_machine_learning_prediction_of_edema_a.pdf', 'document_chunk_id': 61}, data_type=<DataType.TEXT: 'text'>, score=None), 'fused_score': 0.016129032258064516, 'average_original_score': 1.6858607530593872}


In [47]:
print(cost)

0.00031919999999999995

