<a href="https://colab.research.google.com/github/zahere-dev/multi-representation-indexing-advanced-rag/blob/main/Advanced_RAG_Multi_representation_indexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Semi-structured RAG

Many documents contain a mixture of content types, including text and tables.

Semi-structured data can be challenging for conventional RAG for at least two reasons:

* Text splitting may break up tables, corrupting the data returned at retrieval time
* Embedding raw tables may work poorly for semantic similarity search
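
The first point can be illustrated with a small sketch. It assumes a naive fixed-width character splitter (the hypothetical `split_fixed` below) and an invented three-row table; neither comes from the PDF used later in this notebook.

```python
# Toy illustration: a fixed-width character splitter cutting through a table.
# The table contents are invented for the example.
table = (
    "Trial | Arm | Median OS\n"
    "T-001 | Drug A | 9.2 months\n"
    "T-002 | Drug B | 6.0 months\n"
)
document = "Intro paragraph before the table.\n" + table + "Closing paragraph."


def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = split_fixed(document, chunk_size=60)

# Find which chunk holds the header row and which holds the last data row.
header_chunks = [i for i, c in enumerate(chunks) if "Median OS" in c]
row_chunks = [i for i, c in enumerate(chunks) if "6.0 months" in c]

# True when the header and the data row ended up in different chunks.
table_is_split = header_chunks != row_chunks
print(table_is_split)
```

Once the header and a data row land in different chunks, a retrieved chunk may contain numbers with no column names attached, which is the corruption described above.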

This cookbook shows how to perform RAG on documents with semi-structured data:

* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).
* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store the raw tables and text alongside summaries that are better suited for retrieval.
* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.

The overall flow is shown below:

![MVR.png](attachment:7b5c5a30-393c-4b27-8fa1-688306ef2aef.png)

## Packages

In [None]:
! pip install langchain langchain-chroma "unstructured[all-docs]" pydantic lxml langchainhub langchain_openai nltk==3.9.1

The PDF partitioning used by Unstructured will use:

* `tesseract` for Optical Character Recognition (OCR)
* `poppler` for PDF rendering and processing

In [None]:
!sudo apt-get install -y tesseract-ocr
!pip install pytesseract
!pip install pdf2image
!sudo apt-get install -y poppler-utils

## Data Loading

### Partition PDF tables and text


We use Unstructured's [`partition_pdf`](https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf), which segments a PDF document using a layout model.

This layout model makes it possible to extract elements, such as tables, from PDFs.

We can also use Unstructured's chunking, which:

* Tries to identify document sections (e.g., Introduction, etc)
* Then, builds text blocks that maintain sections while also honoring user-defined chunk sizes
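
To make the chunking behaviour concrete, here is a minimal sketch of title-based chunking with size limits. It is a toy model, not Unstructured's implementation: the function `chunk_by_title` and its merging rules are assumptions for illustration, and only the three parameter names mirror the ones passed to `partition_pdf` below.

```python
def chunk_by_title(
    sections: list[tuple[str, str]],
    max_characters: int = 4000,
    new_after_n_chars: int = 3800,
    combine_text_under_n_chars: int = 2000,
) -> list[str]:
    """Simplified by-title chunking over (title, body) pairs.

    Assumes new_after_n_chars <= max_characters.
    """
    chunks: list[str] = []
    current = ""
    for title, body in sections:
        section = f"{title}\n{body}"
        # Hard cap: cut an oversized section into pieces of max_characters.
        pieces = [
            section[i : i + max_characters]
            for i in range(0, len(section), max_characters)
        ]
        for piece in pieces:
            joined = f"{current}\n\n{piece}" if current else piece
            small = len(current) < combine_text_under_n_chars
            if current and small and len(joined) <= new_after_n_chars:
                # The current chunk is still small: absorb the next piece.
                current = joined
            else:
                # Otherwise a section boundary closes the current chunk.
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


# Tiny limits so the behaviour is visible on short strings.
sections = [("Intro", "a" * 10), ("Methods", "b" * 10), ("Results", "c" * 100)]
chunks = chunk_by_title(
    sections,
    max_characters=60,
    new_after_n_chars=50,
    combine_text_under_n_chars=30,
)
print([len(c) for c in chunks])
```

In this toy run the two short sections are merged into one chunk, while the long section is cut at the hard limit.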

In [2]:
import nltk
print(nltk.__version__)
nltk.download('punkt_tab')

3.9.1


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
path = "/content/Immunotherapy_in_Non-Small-Cell_Lung_Cancer.pdf"

In [4]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Get elements
raw_pdf_elements = partition_pdf(
    filename=path,
    # Do not extract embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post-processing to aggregate text under each title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard limit of 4000 chars per chunk
    # Attempt to start a new chunk after 3800 chars
    # Attempt to keep chunks > 2000 chars
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)


We can examine the elements extracted by `partition_pdf`.

`CompositeElement` objects are aggregated chunks.

In [5]:
# Count how many elements of each type were extracted
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    category_counts[category] = category_counts.get(category, 0) + 1

# The set of distinct element types
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 24,
 "<class 'unstructured.documents.elements.Table'>": 5}

In [6]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

5
24


## Multi-vector retriever

Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text.

Alongside each summary, we also store the raw element (table or text) it was generated from.

The summaries are used to improve the quality of retrieval, [as explained in the multi vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).

The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer.
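
The idea can be sketched without any external library. In the sketch below, a word-overlap score stands in for embedding similarity, a plain dict stands in for the document store, and the example strings are invented; all names (`overlap_score`, `docstore`, `summary_index`, `retrieve`) are hypothetical.

```python
import uuid


def tokenize(text: str) -> set[str]:
    return set(text.lower().split())


def overlap_score(query: str, text: str) -> float:
    """Jaccard overlap between word sets; a toy stand-in for embedding similarity."""
    q, t = tokenize(query), tokenize(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


# Raw elements (what the LLM should see) and summaries (what gets searched).
raw_elements = [
    "Arm | Median OS\nA | 9.2\nB | 6.0",
    "Long narrative paragraph about study design and enrollment criteria.",
]
summaries = [
    "table comparing median overall survival for arm A and arm B",
    "text describing study design and enrollment",
]

docstore: dict[str, str] = {}               # doc_id -> raw element
summary_index: list[tuple[str, str]] = []   # (summary, doc_id)
for raw, summary in zip(raw_elements, summaries):
    doc_id = str(uuid.uuid4())
    docstore[doc_id] = raw
    summary_index.append((summary, doc_id))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank summaries against the query, then return the linked raw elements."""
    ranked = sorted(
        summary_index,
        key=lambda item: overlap_score(query, item[0]),
        reverse=True,
    )
    return [docstore[doc_id] for _, doc_id in ranked[:k]]


result = retrieve("median overall survival table")
print(result[0])
```

The query is matched against the summary, but the caller receives the raw table, so the downstream model still sees every cell.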

### Summaries

In [7]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

We create a simple summarization chain and apply it to each element.

You can also see, re-use, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).

```
from langchain import hub
obj = hub.pull("rlm/multi-vector-retriever-summarization")
```

In [8]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] =  userdata.get("OPENAI_API_KEY")
model = ChatOpenAI(temperature=0, model="gpt-4o-mini")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
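
The `|` composition used in the cell above can be pictured as ordinary function chaining. The sketch below is not how LangChain implements it; the `Step` class and the stand-in prompt, model, and parser are hypothetical, and `batch` here runs sequentially rather than concurrently.

```python
from typing import Any, Callable


class Step:
    """A minimal composable step: `a | b` runs a, then feeds its output to b."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x: Any) -> Any:
        return self.fn(x)

    def batch(self, inputs: list) -> list:
        # Sequential for simplicity; no concurrency control in this toy.
        return [self.invoke(x) for x in inputs]


# Hypothetical stand-ins for the input mapping, prompt, model, and output parser.
to_vars = Step(lambda x: {"element": x})
fill_prompt = Step(lambda v: f"Give a concise summary. Chunk: {v['element']}")
fake_model = Step(lambda p: {"content": p.upper()})  # pretends to be an LLM
parse = Step(lambda m: m["content"])

toy_chain = to_vars | fill_prompt | fake_model | parse
outputs = toy_chain.batch(["first chunk", "second chunk"])
print(outputs[0])
```

Each stage only needs to accept the previous stage's output, which is why the prompt, model, and parser can be swapped independently.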

In [9]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

In [23]:
for summary in table_summaries:
    print(summary)
    print("\n")

The summarized findings from the clinical trials CheckMate 017, CheckMate 057, and Keynote-010 indicate that:

1. **CheckMate 017**: In patients with advanced squamous cell carcinoma (SCC) non-small cell lung cancer (NSCLC) who progressed after first-line chemotherapy, nivolumab (nivo) showed significantly better overall survival (OS) compared to docetaxel (doce), with a median OS of 9.2 months for nivo versus 6.0 months for doce. The one-year OS rate was 42% for nivo compared to 24% for doce, and the response rate (RR) was 20% for nivo versus 9% for doce (p = 0.008).

2. **CheckMate 057**: In patients with advanced non-SCC NSCLC who had progressed after platinum-based chemotherapy, nivo also demonstrated superior OS (12.2 months for nivo vs. 9.4 months for doce; HR, 0.73; p = 0.002). The one-year OS rate was 51% for nivo versus 39% for doce, and the RR was 19% for nivo versus 12% for doce (p = 0.02).

3. **Keynote-010**: Pembrolizumab (pembro) showed prolonged OS in previously treated

In [10]:
# Apply to texts
texts = [i.text for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

### Add to vectorstore

Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries:

* `InMemoryStore` stores the raw text and tables
* `vectorstore` stores the embedded summaries

In [11]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vectorstore used to index the summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"))

# The storage layer for the raw text and tables
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

In [15]:
query = "What is Dual Immunotherapy without Chemotherapy?"
sub_docs = vectorstore.similarity_search(query)
sub_docs[0]


Document(metadata={'doc_id': '1f2b41d3-0c50-431d-b27c-3515508fb92c'}, page_content='The text discusses significant advancements in immunotherapy for metastatic non-small cell lung cancer (NSCLC), particularly focusing on single-agent and dual immunotherapy approaches. Key findings include:\n\n1. **Single-Agent Immunotherapy**: Initial trials of anti-PD-1 therapies like nivolumab and pembrolizumab showed better response rates (20%) compared to chemotherapy (9-13%). Nivolumab was FDA-approved in 2016 after demonstrating improved overall survival (OS) in patients with PD-L1 ≥ 1%. Notable trials include Checkmate-017 and Checkmate-057, which confirmed the efficacy of nivolumab. Pembrolizumab also showed significant OS benefits in patients with high PD-L1 expression.\n\n2. **First-Line Studies**: Trials such as Keynote-024 and Keynote-042 demonstrated that pembrolizumab outperformed platinum-based chemotherapy in terms of progression-free survival (PFS) and OS for patients with varying PD-L

In [20]:
retrieved_docs = retriever.invoke(query)
retrieved_docs[0]

'Significant improvements in\n\n17 months; HR 0.58; 95% CI,\n\n0.46–0.72) and median OS (NR\n\nvs. 52.4 months; HR 0.72; 95%\n\nSignificant improvements in\n\n2. Immunotherapy for Metastatic NSCLC\n\n2.1. Single-Agent Immunotherapy\n\nThe first trials for anti-program death ligand 1 (anti-PD(L)1) monotherapy were used in the second-line or third-line setting, with encouraging results demonstrating response rates of around 20%, compared to 9–13% for chemotherapy [4–6,31]. Following the landmark phase III trial Checkmate-017 comparing nivolumab to docetaxel, the Food and Drug Administration (FDA) approved nivolumab as the first anti-PD-1 agent for NSCLC with PDL-1 ≥ 1% in 2016. Similar results were seen in Checkmate-057 (nivolumab in non-squamous NSCLC) and Keynote-010 (pembrolizumab in advanced NSCLC) [5,32], particularly in PDL-1 > 50% patients who derived a median overall survival (OS) of 14.9 months on pembrolizumab vs. 8.2 months on docetaxel (HR 0.54, 95% CI [0.38–0.77]). Clinical 

## RAG

Run the [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval).

In [16]:
from langchain_core.runnables import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
model = ChatOpenAI(temperature=0, model="gpt-4o-mini")

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
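
As a rough picture of what the first two stages of this chain do, the sketch below routes one question both to a retriever and, unchanged, to the question slot, then fills the template. `toy_retriever` and `build_prompt` are hypothetical helpers with invented return values, and joining the retrieved elements with blank lines is a choice made for this sketch only.

```python
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""


def toy_retriever(question: str) -> list[str]:
    """Hypothetical retriever returning raw elements (text or tables) as strings."""
    return [
        "Arm | Median OS\nA | 9.2\nB | 6.0",
        "Narrative paragraph about the trial.",
    ]


def build_prompt(question: str) -> str:
    # The same input is sent to the retriever and passed through as the question.
    inputs = {
        "context": "\n\n".join(toy_retriever(question)),
        "question": question,
    }
    return template.format(**inputs)


rendered = build_prompt("What was the median OS in arm A?")
print(rendered)
```

Printing the rendered prompt like this is a quick way to check that tables reach the model intact.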

In [25]:
chain.invoke("What is Dual Immunotherapy without Chemotherapy?")

'Dual Immunotherapy without Chemotherapy refers to a treatment approach for metastatic non-small cell lung cancer (NSCLC) that involves the use of two immune checkpoint inhibitors (ICIs) without the addition of chemotherapy. An example of this is the combination of nivolumab and ipilimumab, which has been FDA-approved for first-line treatment in patients with PD-L1 expression ≥ 1% and without EGFR/ALK alterations. \n\nIn the phase III trial Checkmate 227, this combination demonstrated a positive overall survival (OS) benefit across all PD-L1 expression subgroups, with a 4-year OS rate of 29% for patients receiving nivolumab plus ipilimumab compared to 18% for those receiving chemotherapy. This indicates that dual immunotherapy can provide significant long-term survival benefits for patients with advanced NSCLC.'