<a href="https://colab.research.google.com/github/stackbacker/Langchain-work/blob/main/Multimodal_RAG_with_GPT_4o_and_Pathway.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hands-on Multimodal RAG with GPT-4o and Pathway**


**Multimodal Retrieval-Augmented Generation (MM-RAG)** systems are changing how we enhance language models and generative AI. By combining several data types, such as text and images, in one application, these systems significantly expand what such applications can do.

While traditional [RAG systems](https://pathway.com/blog/retrieval-augmented-generation-beginners-guide-rag-apps) primarily retrieve and parse text, multimodal RAG systems also integrate media such as images, audio, and video. This helps even in use cases that initially look like pure text scenarios, such as documents whose charts, tables, and other data are stored as images.

By the end of this hands-on guide, you will:

- Have a concise understanding of multimodal RAG systems.
- Appreciate the enhanced retrieval and generation capabilities of multimodal search and RAG, especially in contexts involving financial data and complex visual elements.
- Have an app template you can replicate to build a multimodal RAG application with production-ready open-source frameworks such as Pathway.

# Legacy Retrieval Augmented Generation (RAG)
A Retrieval-Augmented Generation (RAG) system enhances language models by fetching relevant information from external data sources that aren't part of the model's training data. This retrieved context is then incorporated into the user prompt, helping the model generate more accurate and contextually informed responses without the need for extensive retraining. RAG systems are particularly useful for:
- **Privacy for Enterprise Use Cases**: Keeping sensitive information secure and within your own environment, as if in a Faraday cage.
- **High Accuracy**: Reducing the LLM application’s tendency to generate incorrect information by grounding answers in retrieved sources.
- **Verifiability of Information**: Providing references to verify the generated content.
- **Lower Compute Costs**: Reducing the need for frequent retraining.
- **Scalability**: Easily updating and expanding the model’s knowledge base.
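To make the retrieve-then-augment loop concrete, here is a minimal, dependency-free sketch. It is illustrative only: word-overlap scoring stands in for the embedding similarity a real system would use, and `retrieve` and `build_prompt` are hypothetical helpers, not Pathway APIs.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set, used here as a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents that share the most words with the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Augment the user's question with the retrieved context."""
    context_block = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}"
    )


# Placeholder documents, not taken from any real report.
docs = [
    "Operating lease cost was reported in the lease note.",
    "The company headquarters is in Mountain View.",
    "Revenue grew year over year.",
]
query = "What was the operating lease cost?"
print(build_prompt(query, retrieve(query, docs, k=1)))
```

The model never sees the full corpus, only the few retrieved snippets, which is what keeps answers grounded without retraining.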



# How is Multimodal RAG Different?

Traditional RAG systems are limited to text-based data because most LLMs historically understood only text, which leads to less coherent outputs when images, or text stored as images, are involved. This is now changing.
New generative models, both closed and open source, can understand text and images. With these advancements, multimodal RAG systems can retrieve and process multimedia data, such as images, audio, and video, alongside text. By integrating multimodal search and retrieval with LLMs, we get more coherent outputs, especially for complex queries that need information in several formats. This approach can substantially improve answer quality, as the example later in this guide illustrates.



# Why is Multimodal Search and RAG Useful?

Multimodal search and RAG allow systems to access and interpret diverse data types, leading to richer and more accurate responses. For instance:
- **Visual Data**: Tables, charts, and diagrams, which are critical in documents such as financial reports, can be interpreted with models like GPT-4o, improving the accuracy of generative AI applications. See this [popular example](https://github.com/pathwaycom/llm-app/blob/7e6a32985a3932daf71178230220993553a5e893/examples/pipelines/gpt_4o_multimodal_rag/src/_parser_utils.py#L116) or the walkthrough later in this guide, where visual data is parsed as images to improve understanding and searchability.
- **Indexing**: The LLM's explanation of each table is saved with its document chunk in the index, making table content searchable and more useful for specific queries.
- **Multimodal In-Context Learning**: Modern multimodal RAG systems are capable of in-context learning. For example, they can generate images from demonstrations, meaning you can feed the model demonstration images and text so it generates new images that follow the visual characteristics of these in-context examples. This capability further broadens the applications and effectiveness of multimodal RAG systems.
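A minimal sketch of the indexing idea from the list above: the LLM's explanation of a table is appended to the text of the chunk it belongs to, so ordinary text search over the index can match table contents. The record layout and the `with_table_explanation` helper are hypothetical, and the table values are dummy placeholders.

```python
def with_table_explanation(chunk: dict, explanation: str) -> dict:
    """Return a new record whose text also contains the table explanation."""
    return {
        **chunk,
        "text": f"{chunk['text']}\n\nTable explanation:\n{explanation}",
        # Copy the metadata so the original chunk is left untouched.
        "metadata": {**chunk.get("metadata", {}), "has_table": True},
    }


chunk = {"text": "Leases note", "metadata": {"page": 1}}
explanation = '{"rows": [{"item": "Example cost", "2021": "$1 million"}]}'  # dummy values
record = with_table_explanation(chunk, explanation)
print(record["text"])
```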


# Architecture Used for Multimodal RAG for Production Use Cases

Building a multimodal RAG system for production requires a robust and scalable architecture that can handle diverse data types and ensure seamless integration and retrieval of context. This architecture must efficiently manage data ingestion, processing, and querying, while providing accurate and timely responses to user queries. Key components include data parsers, vector databases, LLMs, and real-time data synchronization tools.








## Specific Architecture for This Guide

For this guide, Pathway reads and watches the source files, a custom OpenParse parser extracts text and uses GPT-4o to explain tables, an OpenAI embedding model feeds Pathway's vector store, and `BaseRAGQuestionAnswerer` serves the question-answering API. The key components are described below.


## Leveraging Pathway for Multimodal Search and RAG

Pathway enhances this architecture by providing real-time data synchronization, secure document handling, and a built-in vector store. Pathway’s enterprise connectors enable incremental synchronization with platforms like SharePoint and Google Drive, which allows live document indexing with efficient and secure data management.

## Key Components of the Multimodal RAG Architecture

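- **BaseRAGQuestionAnswerer Class**: Pathway's question-answering class, which connects the LLM, the document index, and the prompt template, and serves the RAG endpoints.
- **GPT-4o by OpenAI**: Used to extract and understand multimodal data (tables captured as images) and to answer queries with the retrieved context. Vector embeddings are generated separately by an OpenAI embedding model through Pathway's `OpenAIEmbedder`.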
- **Pathway**: Provides real-time synchronization, secure document handling, and a robust in-memory vector store for indexing.

This architecture ensures our multimodal RAG system is efficient, scalable, and capable of handling complex data types, making it ideal for production use cases, especially in finance where understanding data within PDFs is crucial.
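The in-memory vector store mentioned above boils down to nearest-neighbour search over embedding vectors. The toy class below shows the idea with cosine similarity and brute-force top-k search. It is not Pathway's implementation; the hand-written 3-dimensional vectors stand in for real embeddings, and the default `k=6` simply mirrors the `search_topk=6` used later in this notebook.

```python
import math


class ToyVectorStore:
    """Brute-force cosine-similarity search over (vector, text) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((vector, text))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0  # treat zero vectors as dissimilar

    def search(self, query: list[float], k: int = 6) -> list[str]:
        """Return the texts of the k most similar vectors, best first."""
        scored = sorted(
            self._items, key=lambda item: self._cosine(query, item[0]), reverse=True
        )
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0, 0.0], "chunk about leases")
store.add([0.0, 1.0, 0.0], "chunk about revenue")
store.add([0.7, 0.7, 0.0], "chunk about both")
print(store.search([1.0, 0.1, 0.0], k=2))
```

A production index replaces the linear scan with an approximate nearest-neighbour structure, but the ranking criterion is the same.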


# **Step-by-Step Guide for Multimodal RAG**

**Finance Use Case: Understanding Data within PDFs**

In this guide, we focus on a popular finance use case: **understanding data within PDFs**. Financial documents often contain **complex tables** and **charts** that require precise interpretation. This can also be done with open-source models, keeping the entire multimodal RAG pipeline within a Faraday cage so **data stays within your ecosystem**.

However, here we use OpenAI’s popular multimodal LLM, [**GPT-4o**](https://openai.com/index/hello-gpt-4o/). It is used at two key stages:
1. **Parsing Process**: Tables are extracted as images, and GPT-4o then explains the content of these tables in detail. The explained content is saved with the document chunk into the index for easy searchability.
   
2. **Answering Questions**: Questions are sent to the LLM with the relevant context, including parsed tables. This allows the generation of accurate responses based on the comprehensive multimodal context.
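For the parsing stage, each cropped table image travels to GPT-4o inside a chat message that mixes text and an image. A minimal sketch of that payload, assuming the OpenAI Chat Completions content format with a base64 `data:` URL (the same shape `llm_parse_table` builds later in this notebook). `build_table_message` is a hypothetical helper, and the image bytes are a dummy placeholder.

```python
import base64


def build_table_message(
    image_bytes: bytes, prompt: str, mime: str = "image/png"
) -> list[dict]:
    """Wrap a prompt and an image into one user message for a vision-capable chat model."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    content = [
        {"type": "text", "text": prompt},
        # The MIME type in the data URL should match how the image was encoded.
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
    ]
    return [{"role": "user", "content": content}]


messages = build_table_message(b"\x89PNG placeholder bytes", "Explain this table in JSON.")
```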



## **Install Required Libraries**

In this cell, we install all the necessary libraries required for the project. These libraries include:

- **pathway[xpack-llm]>=0.11.0**: Provides tools for building and deploying LLM applications.
- **openparse==0.5.6**: Library for parsing various document formats including PDFs.
- **python-dotenv==1.0.1**: Manages environment variables from a `.env` file.
- **unstructured[all-docs]==0.10.28**: A library for working with unstructured document formats.
- **mpmath==1.3.0**: A library for arbitrary-precision arithmetic.
- **pydantic**: Data validation and settings management using Python type annotations.
- **pypdf**: A library for working with PDF documents.
- **Pillow**: The Python Imaging Library for opening, manipulating, and saving many different image file formats.


In [1]:
!pip install "pathway[xpack-llm]>=0.11.0" openparse==0.5.6 python-dotenv==1.0.1 "unstructured[all-docs]==0.10.28" mpmath==1.3.0 pydantic pypdf Pillow


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 1.8.0 requires sqlglot<=20.11,>=20.8.0, but you have sqlglot 10.6.1 which is incompatible.
cudf-cu12 24.4.1 requires pandas<2.2.2dev0,>=2.0, but you have pandas 2.2.2 which is incompatible.
google-colab 1.0.0 requires pandas==2.0.3, but you have pandas 2.2.2 which is incompatible.
ibis-framework 8.0.0 requires sqlglot<=20.11,>=18.12.0, but you have sqlglot 10.6.1 which is incompatible.

## **Set Up OpenAI API Key**

In this cell, we set the OpenAI API key as an environment variable. Replace the placeholder with your actual API key.


In [2]:
OPENAI_API_KEY = "Paste your OpenAI API key"

In [3]:
import os

# Set the OpenAI API key
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
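Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. A minimal alternative sketch: reuse `OPENAI_API_KEY` if it is already set in the environment, and otherwise prompt for it with the standard-library `getpass`, which hides the input. `ensure_openai_key` is a hypothetical helper, not part of any library.

```python
import getpass
import os


def ensure_openai_key(var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, prompting once if it is missing."""
    key = os.environ.get(var)
    if not key:
        key = getpass.getpass(f"Enter {var}: ")
        os.environ[var] = key  # later cells read the key from the environment
    return key
```

Calling `ensure_openai_key()` can replace the two cells above.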

## **Document Parsers**

Document parsers are functions that take raw bytes and return a list of text chunks along with their metadata.
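To make that contract concrete, here is a deliberately simple parser with the same shape as the `__wrapped__` method of the `OpenParse` UDF defined below: raw bytes in, a list of `(text, metadata)` tuples out. `naive_text_parser` is a hypothetical illustration that splits UTF-8 text on blank lines; it does not handle PDFs.

```python
def naive_text_parser(contents: bytes) -> list[tuple[str, dict]]:
    """Split UTF-8 text into paragraph chunks, each paired with simple metadata."""
    text = contents.decode("utf-8")
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(p, {"chunk_index": i}) for i, p in enumerate(paragraphs)]


print(naive_text_parser(b"First paragraph.\n\nSecond paragraph.\n"))
# [('First paragraph.', {'chunk_index': 0}), ('Second paragraph.', {'chunk_index': 1})]
```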

### **Import Necessary Libraries**

This cell imports the libraries and modules required for custom document parsing and related functionality, including openparse's table-parsing helpers and Pathway's `OpenAIChat` wrapper.


In [4]:
# Imports for the custom document parser and related functionality
import asyncio
import base64
import concurrent.futures
import io
import logging
from typing import List, Literal, Union

import PIL.Image  # import the submodule so that PIL.Image is available below
from openparse.pdf import Pdf
from openparse import DocumentParser, consts, tables, text
from openparse.schemas import ParsedDocument, TableElement
from openparse.tables.parse import (
    Bbox,
    PyMuPDFArgs,
    TableTransformersArgs,
    UnitableArgs,
    _ingest_with_pymupdf,
    _ingest_with_table_transformers,
    _ingest_with_unitable,
)

from openparse.tables.utils import adjust_bbox_with_padding, crop_img_with_padding
from pathway.internals import udfs
from pathway.xpacks.llm._utils import _coerce_sync
from pathway.xpacks.llm.llms import OpenAIChat
from pydantic import BaseModel, ConfigDict, Field

### **Default Table Parse Prompt**

We define the default prompt that will be used for parsing tables from images. This prompt instructs the language model to explain the table in JSON format, ensuring no details are skipped.



In [5]:
# Define default prompt for table parsing
DEFAULT_TABLE_PARSE_PROMPT = """Explain the given table in JSON format in detail.
Do not skip over details or units/metrics.
Make sure column and row names are understandable.
If it is not a table, return 'No table.' ."""
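Because the prompt asks the model to reply `No table.` when a detected region is not really a table, it can be worth dropping those replies before they reach the index. A minimal sketch, assuming the model returns that sentinel verbatim (possibly with surrounding whitespace or quotes). `filter_table_responses` is a hypothetical helper and is not used elsewhere in this notebook.

```python
NO_TABLE_SENTINEL = "no table."


def filter_table_responses(responses: list[str]) -> list[str]:
    """Keep only responses that are not the 'No table.' sentinel."""
    kept = []
    for r in responses:
        # Normalize whitespace, wrapping quotes, and case before comparing.
        normalized = r.strip().strip("'\"").strip().lower()
        if normalized != NO_TABLE_SENTINEL:
            kept.append(r)
    return kept
```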

### **Logging Configuration**

Configure logging settings if needed. This helps in debugging and monitoring the code execution.

In [None]:
# Configure logging if needed
# logging.basicConfig(
#     level=logging.INFO,
#     format="%(asctime)s %(name)s %(levelname)s %(message)s",
#     datefmt="%Y-%m-%d %H:%M:%S",
# )

logger = logging.getLogger(__name__)

### **LLM Table Parsing Function**

Define the function `llm_parse_table`, which sends a base64-encoded table image and a prompt to the language model and returns the model's description of the table. It calls the global `chat` object (an `OpenAIChat` instance) that is created later in this notebook, so `chat` must exist before any documents are parsed.

In [None]:
def llm_parse_table(
    image, model="gpt-4o", prompt=DEFAULT_TABLE_PARSE_PROMPT, **kwargs
) -> str:
    """
    Use OpenAIChat to parse a table image encoded as base64.

    Args:
    - image: Base64-encoded image string.
    - model: LLM model to use (default: "gpt-4o").
    - prompt: Prompt for the language model (default: DEFAULT_TABLE_PARSE_PROMPT).
    - kwargs: Additional keyword arguments.

    Returns:
    - str: Parsed table content as a string.
    """
    content = [
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            # img_to_b64 encodes images as PNG, so the data URL declares image/png.
            "image_url": {"url": f"data:image/png;base64,{image}"},
        },
    ]

    messages = [{"role": "user", "content": content}]

    logger.info(f"Parsing table, model: {model}\nmessages: {str(content)[:350]}...")

    # `chat` is the global OpenAIChat created later; _coerce_sync runs its
    # async __wrapped__ method synchronously and returns the result.
    response = _coerce_sync(chat.__wrapped__)(model=model, messages=messages, **kwargs)

    return response
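`_coerce_sync` is a private Pathway helper, so its exact behaviour may change between versions. The idea it relies on can be sketched with the standard library: wrap a coroutine function so that calling it runs the coroutine to completion and returns the result. `coerce_sync_sketch` and `fake_chat` are hypothetical stand-ins, not Pathway's implementation. `asyncio.run` refuses to start when an event loop is already running in the current thread (Jupyter typically runs one in the main thread), so the demo calls it from a worker thread, as `_ingest_with_llm` does.

```python
import asyncio
import functools
import inspect
from concurrent.futures import ThreadPoolExecutor


def coerce_sync_sketch(func):
    """Return a blocking version of an async function; sync functions pass through."""
    if not inspect.iscoroutinefunction(func):
        return func

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # asyncio.run creates a fresh event loop, runs the coroutine, then closes the loop.
        return asyncio.run(func(*args, **kwargs))

    return wrapper


async def fake_chat(messages, **kwargs):
    """Stand-in for an async LLM call."""
    await asyncio.sleep(0)
    return f"answered {len(messages)} message(s)"


with ThreadPoolExecutor() as executor:
    result = executor.submit(
        coerce_sync_sketch(fake_chat), [{"role": "user", "content": "hi"}]
    ).result()
print(result)  # answered 1 message(s)
```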


### **LLMArgs Class**

Define the `LLMArgs` class using Pydantic. This class models the arguments needed for LLM table parsing, including the parsing algorithm, minimum table confidence, LLM model, and prompt.

In [None]:
class LLMArgs(BaseModel):
    """
    Pydantic model for LLM parsing arguments.
    """
    parsing_algorithm: Literal["llm"] = Field(default="llm")
    min_table_confidence: float = Field(default=0.9, ge=0.0, le=1.0)
    llm_model: str = Field(default="gpt-4o")
    prompt: str = Field(default=DEFAULT_TABLE_PARSE_PROMPT)

    model_config = ConfigDict(extra="forbid")


### **Convert Args Dictionary to Model**

Define the function `_table_args_dict_to_model` which converts a dictionary of table parsing arguments to the appropriate model based on the parsing algorithm specified.

In [None]:
def _table_args_dict_to_model(args_dict: dict):
    """
    Convert a dictionary of table parsing arguments to the appropriate model.

    Args:
    - args_dict: Dictionary of table parsing arguments.

    Returns:
    - Union[TableTransformersArgs, PyMuPDFArgs, UnitableArgs, LLMArgs]: Parsed table arguments as a model.
    """
    if args_dict["parsing_algorithm"] == "table-transformers":
        return TableTransformersArgs(**args_dict)
    elif args_dict["parsing_algorithm"] == "pymupdf":
        return PyMuPDFArgs(**args_dict)
    elif args_dict["parsing_algorithm"] == "unitable":
        return UnitableArgs(**args_dict)
    elif args_dict["parsing_algorithm"] == "llm":
        return LLMArgs(**args_dict)
    else:
        raise ValueError(
            f"Unsupported parsing_algorithm: {args_dict['parsing_algorithm']}"
        )



### **Convert Image to Base64**

Define the function `img_to_b64` which converts a PIL image to a base64-encoded string. This is used for sending images to the language model.

In [None]:
from PIL import Image

def img_to_b64(img: Image.Image) -> str:
    """
    Convert a PIL image to a base64-encoded string.

    Args:
    - img: PIL Image object.

    Returns:
    - str: Base64-encoded image string.
    """
    buffer = io.BytesIO()
    img.save(buffer, format="PNG")
    buffer.seek(0)

    img_bytes = buffer.read()

    return base64.b64encode(img_bytes).decode("utf-8")


### **Ingest Tables with LLM**

Define the function `_ingest_with_llm` which uses the language model to ingest and parse tables from a PDF document. The function converts PDF pages to images, detects table bounding boxes, crops the table images, and sends them to the language model for parsing.

In [None]:
def _ingest_with_llm(
    doc: Pdf,
    args: LLMArgs,
    verbose: bool = False,
) -> List[TableElement]:
    try:
        from openparse.tables.table_transformers.ml import find_table_bboxes
        from openparse.tables.utils import doc_to_imgs

    except ImportError as e:
        raise ImportError(
            "Table detection and extraction requires the `torch`, `torchvision` and `transformers` libraries to be installed.",  # noqa: E501
            e,
        )
    pdoc = doc.to_pymupdf_doc()
    pdf_as_imgs = doc_to_imgs(pdoc)

    # Detect table bounding boxes on every page image.
    pages_with_tables = {}
    for page_num, img in enumerate(pdf_as_imgs):
        pages_with_tables[page_num] = find_table_bboxes(img, args.min_table_confidence)

    # Crop each table and remember which page and padded bbox it came from, so
    # every LLM result is mapped back to its own table (not just the last one).
    padding_pct = 0.05
    image_ls = []
    table_locations = []
    for page_num, table_bboxes in pages_with_tables.items():
        page = pdoc[page_num]
        for table_bbox in table_bboxes:
            padded_bbox = adjust_bbox_with_padding(
                bbox=table_bbox.bbox,
                page_width=page.rect.width,
                page_height=page.rect.height,
                padding_pct=padding_pct,
            )
            table_img = crop_img_with_padding(pdf_as_imgs[page_num], padded_bbox)
            image_ls.append(img_to_b64(table_img))
            table_locations.append((page_num, padded_bbox))

    # Parse all table images concurrently; executor.map preserves input order.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        task_results = list(
            executor.map(
                lambda img: llm_parse_table(img, args.llm_model, args.prompt),
                image_ls,
            )
        )

    tables = []
    for (page_num, padded_bbox), table_str in zip(table_locations, task_results):
        page = pdoc[page_num]
        # Flip y so the origin is the page's bottom-left corner, matching
        # openparse's coordinate system (PyMuPDF measures y from the top).
        fy0 = page.rect.height - padded_bbox[3]
        fy1 = page.rect.height - padded_bbox[1]

        table_elem = TableElement(
            bbox=Bbox(
                page=page_num,
                x0=padded_bbox[0],
                y0=fy0,
                x1=padded_bbox[2],
                y1=fy1,
                page_width=page.rect.width,
                page_height=page.rect.height,
            ),
            text=table_str,
        )
        tables.append(table_elem)

    return tables
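The `fy0`/`fy1` lines convert a box from a top-left origin (y grows downward, as in PyMuPDF) to a bottom-left origin (y grows upward). In isolation the conversion is a reflection about the page height; `flip_bbox_y` is a hypothetical helper shown only to make that arithmetic explicit.

```python
def flip_bbox_y(
    bbox: tuple[float, float, float, float], page_height: float
) -> tuple[float, float, float, float]:
    """Convert (x0, y0, x1, y1) from a top-left origin to a bottom-left origin."""
    x0, y0, x1, y1 = bbox
    # In top-left coordinates y1 is the bottom edge; after flipping it yields
    # the smaller y value, and the top edge y0 yields the larger one.
    return (x0, page_height - y1, x1, page_height - y0)


print(flip_bbox_y((50.0, 100.0, 300.0, 250.0), page_height=792.0))
# (50.0, 542.0, 300.0, 692.0)
```

Applying the flip twice returns the original box, which is a quick sanity check for this kind of conversion.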

### **Ingest Function**

Define the main `ingest` function which decides which table parsing method to use based on the provided arguments. It supports various parsing algorithms including `table-transformers`, `pymupdf`, `unitable`, and `llm`.

In [None]:
def ingest(
    doc: Pdf,
    parsing_args: Union[
        TableTransformersArgs, PyMuPDFArgs, UnitableArgs, LLMArgs, None
    ] = None,
    verbose: bool = False,
) -> List[TableElement]:
    if isinstance(parsing_args, TableTransformersArgs):
        return _ingest_with_table_transformers(doc, parsing_args, verbose)
    elif isinstance(parsing_args, PyMuPDFArgs):
        return _ingest_with_pymupdf(doc, parsing_args, verbose)
    elif isinstance(parsing_args, UnitableArgs):
        return _ingest_with_unitable(doc, parsing_args, verbose)
    elif isinstance(parsing_args, LLMArgs):
        return _ingest_with_llm(doc, parsing_args, verbose)
    else:
        raise ValueError("Unsupported parsing_algorithm.")


### **Custom Document Parser Class**

Define the `CustomDocumentParser` class which extends the base `DocumentParser` class. This custom parser uses the language model to parse tables from documents. The `parse` method combines text and table parsing results into a single parsed document.


In [8]:
class CustomDocumentParser(DocumentParser):
    """
    Custom document parser using multi-modal LLM.

    Uses pymupdf to parse the document and runs LLM on table images.

    Args:
    - DocumentParser: Base class for document parsing.

    Methods:
    - parse(doc): Parse a given document with multi-modal LLM.
    """

    def parse(self, doc) -> ParsedDocument:
        """
        Parse a given document with multi-modal LLM.

        Args:
        - doc: Document to be parsed.

        Returns:
        - ParsedDocument: Parsed document containing nodes and metadata.
        """
        text_engine = "pymupdf"
        text_elems = text.ingest(doc, parsing_method=text_engine)
        text_nodes = self._elems_to_nodes(text_elems)

        table_nodes = []
        table_args_obj = None
        if self.table_args:
            table_args_obj = _table_args_dict_to_model(self.table_args)
            table_elems = ingest(doc, table_args_obj, verbose=self._verbose)
            table_nodes = self._elems_to_nodes(table_elems)

        nodes = text_nodes + table_nodes
        nodes = self.processing_pipeline.run(nodes)

        parsed_doc = ParsedDocument(
            nodes=nodes,
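            # The source path is not passed in when parsing from bytes, so fall back to a placeholder.
            filename=getattr(doc, "file_path", None) or "unknown",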
            num_pages=doc.num_pages,
            coordinate_system=consts.COORDINATE_SYSTEM,
            table_parsing_kwargs=(
                table_args_obj.model_dump() if table_args_obj else None
            ),
            creation_date=doc.file_metadata.get("creation_date"),
            last_modified_date=doc.file_metadata.get("last_modified_date"),
            last_accessed_date=doc.file_metadata.get("last_accessed_date"),
            file_size=doc.file_metadata.get("file_size"),
        )
        return parsed_doc


### **OpenParse Class**

The `OpenParse` class is defined here, extending the `pw.UDF` class. This class uses the `open-parse` library to parse documents. The parsing algorithm can be specified through the `table_args` dictionary, allowing the use of different algorithms like `llm`, `unitable`, `pymupdf`, and `table-transformers`.

### Arguments:
- **table_args**: A dictionary containing the table parser arguments. By default, it uses the `llm` algorithm.
- **cache_strategy**: Defines the caching mechanism. If provided, it should be a valid `CacheStrategy` object to enable caching.


### **Document Parsing Method**

The `__wrapped__` method of the `OpenParse` class handles the core functionality of parsing the document. It reads the contents of a PDF file, uses the `CustomDocumentParser` to parse the document, and returns the parsed content as a list of tuples containing the text and metadata.

### Steps:
1. **Read the PDF file** using `PdfReader`.
2. **Parse the document** using `CustomDocumentParser`.
3. **Extract nodes** from the parsed content.
4. **Log the number of nodes** parsed.
5. **Return the parsed documents** as a list of tuples with text and metadata.

In [9]:

import logging
from io import BytesIO

import pathway as pw
from pathway.internals import udfs
from pathway.optional_import import optional_imports

logger = logging.getLogger(__name__)


class OpenParse(pw.UDF):
    """
    Parse document using `https://github.com/Filimoa/open-parse <https://github.com/Filimoa/open-parse>`_.

    `parsing_algorithm` can be one of `llm`, `unitable`, `pymupdf`, `table-transformers`.
    While using in the VectorStoreServer, splitter can be set to `None` as OpenParse already chunks the documents.


    Args:
        - table_args: dict containing the table parser arguments.
        - cache_strategy: Defines the caching mechanism. To enable caching,
            a valid `CacheStrategy` should be provided.
            See `Cache strategy <https://pathway.com/developers/api-docs/udfs#pathway.udfs.CacheStrategy>`_
            for more information. Defaults to None.
    """

    def __init__(
        self,
        table_args: dict = {"parsing_algorithm": "llm"},
        cache_strategy: udfs.CacheStrategy | None = None,
    ):
        with optional_imports("xpack-llm"):
            import openparse  # noqa:F401
            from pypdf import PdfReader  # noqa:F401

            # In the llm-app repository this comes from ._parser_utils;
            # here CustomDocumentParser is defined in an earlier cell.

        super().__init__(cache_strategy=cache_strategy)

        self.doc_parser = CustomDocumentParser(table_args=table_args)

        self.kwargs = dict(table_args=table_args)

    def __wrapped__(self, contents: bytes) -> list[tuple[str, dict]]:
        import openparse
        from pypdf import PdfReader

        reader = PdfReader(stream=BytesIO(contents))
        doc = openparse.Pdf(file=reader)

        parsed_content = self.doc_parser.parse(doc)
        nodes = [i for i in parsed_content.nodes]

        logger.info(
            f"OpenParser completed parsing, total number of nodes: {len(nodes)}"
        )

        metadata: dict = {}
        docs = list(map(lambda x: (x.dict()["text"], metadata), nodes))

        return docs


## **Document Processing and Question Answering Setup**




### **Create Data Directory**

Create a `data` directory if it doesn't already exist, then upload your PDF document. Uploaded files are moved into this folder.


In [10]:
import shutil

from google.colab import files

# Create the 'data' folder if it doesn't exist
!mkdir -p data

# Upload files (Colab saves them to the working directory), then move them into 'data'
uploaded = files.upload()
for filename in uploaded.keys():
    shutil.move(filename, f"./data/{filename}")



Saving 20230203_alphabet_10K.pdf to 20230203_alphabet_10K.pdf


### **Import Necessary Modules**

Import additional necessary modules, set up the environment variable for Tesseract, and configure logging settings.


In [11]:
import logging
import os

os.environ["TESSDATA_PREFIX"] = "/usr/share/tesseract/tessdata/"

from dotenv import load_dotenv
import pathway as pw
from pathway.udfs import DiskCache, ExponentialBackoffRetryStrategy
from pathway.xpacks.llm import embedders, llms, prompts
from pathway.xpacks.llm.question_answering import BaseRAGQuestionAnswerer
from pathway.xpacks.llm.vector_store import VectorStoreServer


In [12]:
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)


### **Read Document**

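Read the documents from the `data` folder. `pw.io.fs.read` defines a Pathway input table over every file in the folder, read as binary with metadata. By default it runs in streaming mode, so files added or changed later are picked up while the pipeline is running.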


In [13]:
path = "./data/"

# Read every file in the data folder (here, the uploaded PDF) as raw bytes, with metadata
folder = pw.io.fs.read(
    path=path,
    format="binary",
    with_metadata=True,
)


**Check that your file is in the `data` folder**

In [14]:
!ls data

20230203_alphabet_10K.pdf


In [15]:
sources = [
    folder,
]

# Note: `llm_parse_table` (defined earlier) calls this global `chat` object.
chat = llms.OpenAIChat(
    model="gpt-4o",
    retry_strategy=ExponentialBackoffRetryStrategy(max_retries=6),
    cache_strategy=DiskCache(),
    temperature=0.0,
)

In [16]:
parser = OpenParse()
embedder = embedders.OpenAIEmbedder(cache_strategy=DiskCache())

doc_store = VectorStoreServer(
    *sources,
    embedder=embedder,
    splitter=None,  # OpenParse parser handles the chunking
    parser=parser,
)

print("Pathway components configured.")

Pathway components configured.


In [17]:
app = BaseRAGQuestionAnswerer(
    llm=chat,
    indexer=doc_store,
    search_topk=6,
    short_prompt_template=prompts.prompt_qa,
)

### **Configure and Run Question Answering Server**

Configure and run the question answering server using `BaseRAGQuestionAnswerer`. This server listens on port 8000 and processes incoming queries.


In [18]:
app.build_server(host="0.0.0.0", port=8000)



In [19]:
# app.run_server()

In [20]:
import threading

In [21]:
t = threading.Thread(target=app.run_server, name="BaseRAGQuestionAnswerer")
t.daemon = True
t.start()  # run the server in the background so the notebook stays interactive
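Because the server starts in a background thread, the first requests may fail while it is still starting up. A minimal sketch, assuming you only need to know that something is accepting TCP connections on the port; it does not confirm that indexing has finished. `wait_for_port` is a hypothetical helper built on the standard-library `socket` module.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout: float = 30.0, interval: float = 0.5
) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:  # refused or timed out: the server is not up yet
            time.sleep(interval)
    return False


# Example: wait_for_port("127.0.0.1", 8000) before sending the first request.
```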

**List Documents**

List the documents the server has indexed, first with `curl` and then with the `requests` library.


In [24]:
!curl -X 'POST'   'http://0.0.0.0:8000/v1/pw_list_documents'   -H 'accept: */*'   -H 'Content-Type: application/json'

[{"created_at": 1718873636, "modified_at": 1718873636, "owner": "root", "path": "data/20230203_alphabet_10K.pdf", "seen_at": 1718873664}]

In [25]:
import requests

url = "http://0.0.0.0:8000/v1/pw_list_documents"
headers = {
    "accept": "*/*",
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers)

In [26]:
response.json()

[{'created_at': 1718873636,
  'modified_at': 1718873636,
  'owner': 'root',
  'path': 'data/20230203_alphabet_10K.pdf',
  'seen_at': 1718873664}]

### **Ask Questions and Get Answers**

Query the server to get answers from the documents. This cell sends a prompt to the `pw_ai_answer` endpoint and prints the response.

Edit the prompt to ask your own questions about your documents.
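Questions can also be sent from Python. The sketch below uses only the standard library (`urllib.request`) to POST the JSON body `{"prompt": ...}` to the `/v1/pw_ai_answer` endpoint. `build_answer_request` and `ask` are hypothetical helpers, and the base URL assumes the server from this notebook is running on port 8000; `127.0.0.1` (loopback) reaches the same server as the `0.0.0.0` address used in the curl commands.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_answer_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the pw_ai_answer endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/pw_ai_answer",
        data=body,
        headers={"Content-Type": "application/json", "accept": "*/*"},
        method="POST",
    )


def ask(prompt: str, base_url: str = BASE_URL, timeout: float = 60.0):
    """Send the question and decode the JSON answer."""
    request = build_answer_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Requires the server to be running:
# print(ask("How much was Operating lease cost in 2021?"))
```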

In [27]:
!curl -X 'POST'   'http://0.0.0.0:8000/v1/pw_ai_answer'   -H 'accept: */*'   -H 'Content-Type: application/json'   -d '{"prompt": "How much was Operating lease cost in 2021?"}'


"$2,699 million"

# **Conclusion**

This showcase demonstrates the setup of a robust Retrieval-Augmented Generation (RAG) pipeline using GPT-4o and Pathway, specifically tailored for processing financial reports and tables. By integrating advanced natural language processing with multimodal capabilities, this solution enhances accuracy and usability in handling complex document structures.

### **Key Highlights:**
- **Advanced Table Parsing**: Utilizing GPT-4o to extract and understand table data from PDFs, improving accuracy in answering queries based on financial information.
  
- **Dynamic Document Synchronization**: The pipeline automatically updates document indices as files are added or modified, ensuring real-time access to the latest data.

- **Table-Aware Answers**: Targets a common weakness of text-only RAG approaches, namely questions whose answers are stored in tables.

### **Architecture Overview:**
The architecture leverages Pathway's modules, including document parsers, LLMs, and indexing strategies, orchestrated via the BaseRAGQuestionAnswerer class. This setup supports seamless integration and efficient query handling.

### **Next Steps:**
Explore advanced features such as re-ranking of retrieved chunks and hybrid indexing for better retrieval quality.
Customize your RAG application with tailored document processing and UI design to optimize user interaction.

### Ready to start building?
Check out a range of easy-to-use [app templates](https://pathway.com/developers/showcases) and start building multimodal RAG apps with the free community version of Pathway.

### Learn More:
- [Pathway Documentation](https://pathway.com/)

