<a href="https://colab.research.google.com/github/shivanshs9/making-malai/blob/main/rag-pdf-tables/parse_summarize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Semi-structured RAG

Many documents contain a mixture of content types, including text and tables.

Semi-structured data can be challenging for conventional RAG for at least two reasons:

* Text splitting may break up tables, corrupting the data in retrieval
* Embedding tables may pose challenges for semantic similarity search
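
To make the first point concrete, here is a minimal sketch, assuming a naive fixed-size character splitter. The `split_fixed` helper and the small table are made up for illustration and are not part of LangChain or Unstructured. The first split falls in the middle of a row, so no single chunk contains the complete table:

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Naively split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical table, used only for illustration.
table = (
    "| Model | Params | Tokens |\n"
    "| A     | 7B     | 1.0T   |\n"
    "| B     | 13B    | 2.0T   |\n"
)

chunks = split_fixed(table, chunk_size=40)
for n, chunk in enumerate(chunks):
    print(f"--- chunk {n} ---")
    print(chunk)
```

A retriever that returns only one of these chunks hands the LLM a row without its header, or a header without its rows.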

This cookbook shows how to perform RAG on documents with semi-structured data:

* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).
* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store the raw tables and text alongside table summaries, which are better suited for retrieval.
* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.

The overall flow is here:

![MVR.png](attachment:7b5c5a30-393c-4b27-8fa1-688306ef2aef.png)

## Packages

In [27]:
! pip install langchain langchain-chroma langchain-openai unstructured-client langchain-community langchain-unstructured pydantic lxml langchainhub


PDF partitioning in Unstructured relies on:

* `tesseract` for Optical Character Recognition (OCR)
* `poppler` for PDF rendering and processing

In [1]:
! apt install tesseract-ocr
! apt install libtesseract-dev
! apt-get install poppler-utils
! pip install pytesseract


## Data Loading

### Partition PDF tables and text

The example question at the end of this notebook refers to the [`LLaMA2`](https://arxiv.org/pdf/2307.09288.pdf) paper, while the partitioning cells below load `r04 sub1.pdf`, a Japanese-language PDF.

We use Unstructured's [`partition_pdf`](https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf), called here through the Unstructured API client, which segments a PDF document using a layout model.

This layout model makes it possible to extract elements, such as tables, from PDFs.

We can also use `Unstructured` chunking, which:

* Tries to identify document sections (e.g., Introduction)
* Then builds text blocks that maintain sections while also honoring user-defined chunk sizes
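
As a rough illustration of this idea, the sketch below groups a list of `(kind, text)` pairs into chunks: a title always starts a new chunk, and a size limit closes a chunk early inside a long section. It is a simplified stand-in, not Unstructured's actual `by_title` implementation; the `chunk_by_title` helper and the sample elements are hypothetical, and oversized single texts are not split.

```python
def chunk_by_title(elements, max_characters=80):
    """Group (kind, text) pairs into chunks that start at each title.

    A chunk is also closed early when adding the next text would push it
    past max_characters.
    """
    chunks, current = [], []

    def flush():
        # Emit the chunk collected so far, if any.
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for kind, text in elements:
        current_len = len("\n".join(current))
        if kind == "Title":
            flush()  # a new section always starts a new chunk
        elif current and current_len + 1 + len(text) > max_characters:
            flush()  # honor the size limit inside a long section
        current.append(text)
    flush()
    return chunks


# Made-up elements, for illustration only.
sample_elements = [
    ("Title", "Introduction"),
    ("NarrativeText", "Short opening paragraph."),
    ("Title", "Methods"),
    ("NarrativeText", "A" * 50),
    ("NarrativeText", "B" * 50),
]

section_chunks = chunk_by_title(sample_elements, max_characters=80)
for c in section_chunks:
    print(repr(c))
```

Here the "Methods" section is too long for one chunk, so its second paragraph spills into a chunk of its own, while the short "Introduction" section stays together.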

In [6]:
base_path = "/content"
output_path = f'{base_path}/output'

filename="r04 sub1.pdf"
pdf_langs=['jpn']

In [7]:
import json
from typing import Any
from google.colab import userdata
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, errors

client = UnstructuredClient(
    api_key_auth=userdata.get("UNSTRUCTURED_API_KEY"),
    server_url=userdata.get("UNSTRUCTURED_API_URL"),
)

req = {
    "partition_parameters": {
        "files": {
            "content": open(f'{base_path}/{filename}', "rb"),
            "file_name": filename,
        },
        # Post-processing: aggregate text under each detected title
        "chunking_strategy": shared.ChunkingStrategy.BY_TITLE,
        "strategy": shared.Strategy.HI_RES,
        "languages": pdf_langs,
        # Chunking params to aggregate text blocks:
        # hard limit of 4000 chars per chunk,
        # attempt to start a new chunk after 3000 chars,
        # attempt to combine chunks shorter than 2000 chars
        "max_characters": 4000,
        "new_after_n_chars": 3000,
        "combine_under_n_chars": 2000,
        "split_pdf_allow_failed": True,    # If True, partitioning continues even if some pages fail.
        "split_pdf_concurrency_level": 15  # Number of concurrent requests, set to the maximum value: 15.
    }
}

try:
    res = client.general.partition(request=req)
    if res.elements is None:
        raise ValueError("The API returned no elements; check the input file")
    element_dicts = list(res.elements)

    # Print only the first processed element.
    print(element_dicts[0])

    # Write the processed data to a local file.
    json_elements = json.dumps(element_dicts, indent=2)

    with open(f'{output_path}/{filename}.json', "w") as file:
        file.write(json_elements)
except (errors.HTTPValidationError, errors.ServerError) as e:
    # e.data is errors.HTTPValidationErrorData or errors.ServerErrorData
    print(e.data)
    raise
except Exception as e:
    # Any other failure (including the ValueError above) is only reported,
    # so later cells can fall back to a previously saved JSON file.
    print(e)




{'type': 'CompositeElement', 'element_id': '482eac2054285106253eb19d05c4c00e', 'text': '運転免許統計（令和４年版）補足資料１\n\n警察庁交通局運転免許課\n\n目 次', 'metadata': {'filetype': 'application/pdf', 'languages': ['jpn'], 'page_number': 1, 'orig_elements': 'eJy1k9tu2zAMhl8l0HVs6Cxq77CiQHsXF4EOdOLBsY1E3hoUe/dJTjOgQzagxXojkL9+WqQ/cPNCsMcDDmnbRfJlRTgHSalxFTqDlVRMV156qJTn0WopUUZP1itywOSiSy7XvJAwjsfYDS7hacl7dx7ntN1jt9unrHAhbK55lX90Me2zyrSSWZ3GbkilbrMRnNdmvTJU1uxpvfqda1urkjMhaC1uCUtFVsjpfEp4KJPcd8/YP0wuIPmZLyImDKkbh23o3em0nY6jzzZaG2UFZEPb9ZjOEy6191/J0vCwm91umWpDvk0DKU9MWdkO88HjsUxRPp7wucxJmtlS6ZsZEHkzKyZjjp1gzWyiYSWm0MxtW06JKLNHQigKKzGafBouLx6b/RDzLUQv8hmCaWatrF38rDR47ffOHY8udd/xsTSSO/oTq7ZccM9C5T34SlrAyjswFYCOUukgKTOfhlUxKBiZ0armBdtVAA61XDgyAbW9qVyKPopWMPXf0ILXOkPysS2ogBWEThbkzGUl/8O/4wfn+fuABWgDFb6ivOyhpFg5plSFAaloWWToPw8Ys7psFRfXNXvNDVxy4Opmvvj/ieqDJPhbEkYjrvIueP5mCx671Odnnn4B321rjw==', 'filename': 'r04 sub1.pdf'}}


We can examine the elements extracted by `partition_pdf`.

`CompositeElement` items are aggregated chunks.

In [11]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in res.elements:
    category = element["type"]
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element types
unique_categories = set(category_counts.keys())
category_counts

{'CompositeElement': 38, 'Table': 52}

In [19]:
from typing_extensions import TypedDict
from typing import List
from enum import Enum

class Metadata(TypedDict):
  text_as_html: str
  page_number: int
  languages: List[str]
  filename: str

class ElementType(str, Enum):
  COMPOSITE = 'CompositeElement'
  TABLE = 'Table'

class Element(TypedDict):
  type: ElementType
  text: str
  metadata: Metadata

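# Use the API response if this session has one; otherwise (for example,
# when `res` is undefined after a restart) load the JSON saved earlier.
elements: List[Element]
try:
  elements = res.elements
except Exception:
  with open(f'{output_path}/{filename}.json') as f:
    elements = json.load(f)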

table_elements = [e for e in elements if e["type"] == ElementType.TABLE]
text_elements = [e for e in elements if e['type'] == ElementType.COMPOSITE]
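
Each `Table` element carries an HTML rendering in `metadata['text_as_html']`, which is what the summarization step consumes. When inspecting these tables by hand, it can help to turn the HTML into rows. The following is a minimal sketch using only the standard library; the `TableRows` class and the sample HTML string are hypothetical and not part of Unstructured:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace around it is trimmed.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up HTML in the same spirit as a `text_as_html` value.
sample_html = (
    "<table><tr><th>Prefecture</th><th>Count</th></tr>"
    "<tr><td>Tokyo</td><td>100</td></tr></table>"
)

row_parser = TableRows()
row_parser.feed(sample_html)
print(row_parser.rows)
```

With a real element, `sample_html` would be replaced by `table_elements[0]['metadata']['text_as_html']`.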

## Multi-vector retriever

Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text.

Alongside each summary, we also store the raw table element.

The summaries are used to improve the quality of retrieval, [as explained in the multi vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).

The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer.  
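
The idea can be sketched without any library: search runs over the summaries, but each summary carries a `doc_id` that points back to the raw content, and the raw content is what gets returned. The toy word-overlap score below is a stand-in for embedding similarity, and all names and data are hypothetical rather than the LangChain API:

```python
import uuid

# Raw content (what the LLM should see), keyed by id: the "docstore".
docstore = {}
# Summaries (what is searched), each paired with the id of its raw content.
summary_index = []


def add(raw: str, summary: str) -> str:
    """Store raw content and register its summary under a shared id."""
    doc_id = str(uuid.uuid4())
    docstore[doc_id] = raw
    summary_index.append({"doc_id": doc_id, "summary": summary})
    return doc_id


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank summaries by score, then return the raw content they point to."""
    ranked = sorted(
        summary_index, key=lambda e: overlap(query, e["summary"]), reverse=True
    )
    return [docstore[e["doc_id"]] for e in ranked[:k]]


raw_table = "<table><tr><td>7B</td><td>2.0T</td></tr></table>"
raw_text = "Some narrative paragraph about evaluation."
add(raw_table, "table of model sizes and training tokens")
add(raw_text, "text describing evaluation results")

print(retrieve("how many training tokens"))
```

The query never has to match the HTML itself; it only has to match the summary, which is why summaries of tables tend to retrieve better than the tables' raw markup.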

### Summaries

In [28]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatPerplexity

We create a simple summarize chain for each element.

You can also see, re-use, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).

```
from langchain import hub
obj = hub.pull("rlm/multi-vector-retriever-summarization")
```

In [35]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables. \
The text is in Japanese language, so feel free to translate to English before giving an output. \
The table format is in HTML. \
Give a concise summary of the table. Table: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatPerplexity(temperature=0, model="llama-3.1-sonar-large-128k-online", pplx_api_key=userdata.get("PPLX_API_KEY"))
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [36]:
# Apply to tables
tables = [i['metadata']['text_as_html'] for i in table_elements]
# Summarize only a small slice (tables 1-4) to limit API calls
table_summaries = summarize_chain.batch(tables[1:5], {"max_concurrency": 5})

In [39]:
for summary in table_summaries:
  print(summary)
  print("---")

The provided table appears to be a statistical breakdown of various types of vehicle licenses or registrations in different Japanese prefectures. Here is a concise summary of the table:

## Columns Explanation
- **種類**: Type of vehicle license or registration.
- **都道府**: Prefectures in Japan.
- The subsequent columns represent different categories of vehicle licenses, including:
  - **第二種免許** (Second-class license)
  - **第一種免許** (First-class license)
  - Various specific types such as **大型** (Large), **中型** (Medium), **普通** (Ordinary), **大型特殊** (Large special), etc.
  - **小計** (Subtotal)
  - Other specific categories like **中型垢** (Medium dirt), **準中型田垣** (Semi-medium field fence), etc.
  - **原付** (Light vehicle or motorcycle)

## Key Points
- The table lists the number of vehicle licenses or registrations for each prefecture in Japan.
- Each row represents a different prefecture, and the columns break down the numbers into various categories of vehicle licenses.
- The data includes bot

In [None]:
# Apply to texts
texts = [i['text'] for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

### Add to vectorstore

Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries:

* `InMemoryStore` stores the raw text, tables
* `vectorstore` stores the embedded summaries

In [None]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

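# Add tables
# Only the tables that were actually summarized can be indexed; this slice
# must match the one passed to summarize_chain.batch for the tables.
summarized_tables = tables[1:5]
table_ids = [str(uuid.uuid4()) for _ in summarized_tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, summarized_tables)))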

## RAG

Run [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval).

In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
model = ChatOpenAI(temperature=0, model="gpt-4")

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
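
For readers new to LCEL, the data flow of a chain shaped like this one can be sketched in plain Python. This is only a conceptual equivalent under simplifying assumptions, not how LCEL is implemented: the `run_chain` function and the stub steps are hypothetical. The dict step passes the same input to each of its values, and every `|` hands the previous output to the next step.

```python
def run_chain(question, retrieve, format_prompt, call_model, parse):
    # Dict step: every value receives the same input (the question).
    # The passthrough entry simply forwards the question unchanged.
    inputs = {"context": retrieve(question), "question": question}
    # Each `|` feeds the previous step's output into the next step.
    return parse(call_model(format_prompt(inputs)))


# Stub steps, standing in for the retriever, prompt, model, and parser.
def fake_retrieve(question):
    return ["raw table html"]


def fake_prompt(inputs):
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}"


def fake_model(prompt_text):
    return {"content": f"echo: {prompt_text}"}


def fake_parse(message):
    return message["content"]


answer = run_chain("what?", fake_retrieve, fake_prompt, fake_model, fake_parse)
print(answer)
```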

In [None]:
chain.invoke("What is the number of training tokens for LLaMA2?")

'The number of training tokens for LLaMA2 is 2.0T.'

We can check the [trace](https://smith.langchain.com/public/4739ae7c-1a13-406d-bc4e-3462670ebc01/r) to see which chunks were retrieved.

The retrieved context includes Table 1 of the paper, which lists the number of tokens used for training.

```
Training Data Params Context GQA Tokens LR Length 7B 2k 1.0T 3.0x 10-4 See Touvron et al. 13B 2k 1.0T 3.0 x 10-4 LiaMa 1 (2023) 33B 2k 14T 1.5 x 10-4 65B 2k 1.4T 1.5 x 10-4 7B 4k 2.0T 3.0x 10-4 Liama 2 A new mix of publicly 13B 4k 2.0T 3.0 x 10-4 available online data 34B 4k v 2.0T 1.5 x 10-4 70B 4k v 2.0T 1.5 x 10-4
```