<a href="https://colab.research.google.com/github/sugarforever/Advanced-RAG/blob/main/01_semi_structured_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced RAG - 01. RAG on Semi-structured data

**What is RAG?**

Retrieval-augmented generation (RAG) is a natural language processing (NLP) technique that combines a retrieval model with a generative AI model: relevant documents are retrieved first and then passed to the generator as context for its answer.

**What is Naive RAG?**

Naive RAG usually means splitting documents into chunks, embedding them, and retrieving the chunks that are most semantically similar to the user's question.

It's simple, but its overall performance is often poor.
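The three steps of naive RAG can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the sample document, chunk size, and function names are made up for this example.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into fixed-size character chunks (no overlap)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up sample document for illustration.
document = (
    "Chroma is a vector store. "
    "Tesseract is an OCR engine. "
    "Poppler renders PDF files."
)
chunks = split_into_chunks(document, chunk_size=28)
top_chunks = retrieve("Which tool is an OCR engine?", chunks, k=1)
print(top_chunks)
```

Note that the fixed-size splitter cuts through words and sentences, which is one reason this approach tends to retrieve incomplete context.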

**That's why we need Advanced RAG.**

In this tutorial series (**Advanced RAG**), we will learn techniques and best practices for RAG application development that improve the quality of the results.

Getting these right is crucial to the success of a RAG application.

## 01. RAG on Semi-structured data

### Introduction

#### ✏️ What is Structured Data?

Structured data is organized information with a predefined format, typically stored in rows and columns, making it easy to search and analyze.

#### ✏️ What is Unstructured Data?

Unstructured data is information that lacks a specific format or organization, often in the form of text, images, or multimedia, making it challenging to analyze without specialized techniques.

#### ✏️ What is Semi-structured Data?

Semi-structured data is a mix of the two: structured parts, such as tables, embedded in unstructured content, such as free text.

It's challenging for RAG to process semi-structured data, as:

1. Text splitting may break up tables
2. Tables and images are challenging for embedding and semantic search
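The first problem is easy to reproduce. The sketch below runs a plain fixed-size character splitter over a made-up document that contains a small table; the table contents, the chunk size, and the `split_fixed` helper are assumptions chosen only to make the effect visible.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Split text into fixed-size character chunks, ignoring its structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Made-up sample document: one sentence, a small table, one sentence.
table = (
    "| Item   | Quantity |\n"
    "| ------ | -------- |\n"
    "| Widget | 12       |\n"
    "| Gadget | 7        |\n"
)
document = "Intro paragraph before the table.\n" + table + "Closing paragraph.\n"

chunks = split_fixed(document, chunk_size=60)

# Chunks that contain at least part of a table row.
chunks_with_rows = [c for c in chunks if "|" in c]
# Chunks that still carry the column header.
chunks_with_header = [c for c in chunks if "Quantity" in c]

for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
```

The table rows end up spread over several chunks, while only one of them keeps the header row, so the other chunks contain numbers without the column names that give them meaning.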

A typical example of semi-structured data is a PDF document that contains text, tables, images, and so on.

In this tutorial, let's use the following components to showcase how to build RAG on top of semi-structured data:

1. ✂️ [unstructured](https://github.com/Unstructured-IO/unstructured)
  
  Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

  We will use it to parse PDF documents and extract the different types of elements separately, such as text, tables, and images.

2. 🦜 [LangChain](https://github.com/langchain-ai/langchain)

3. 🗂 [Chromadb](https://github.com/chroma-core/chroma)

  Vector data storage

The PDF document we use in this example is the [NVIDIA Statement of Changes](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf). It's a small PDF file containing several tables, which makes it a good example for quick processing and a clear demonstration.

### Prepare Environment

Let's install the necessary Python packages.

In [1]:
# !pip install langchain unstructured[all-docs] pydantic lxml openai chromadb tiktoken -q -U

Download the PDF file and save it as `statement_of_changes.pdf`.

In [2]:
# ! wget -O statement_of_changes.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf

Install required platform packages:

- poppler-utils
  
  A collection of command-line utilities built on Poppler's library API for managing PDF files and extracting their contents

- tesseract-ocr

  Optical character recognition engine

In [3]:
# ! apt-get install poppler-utils tesseract-ocr

In [4]:
import os
from getpass import getpass

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["AZURE_OPENAI_API_VERSION"] = "2023-09-01-preview"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://nnitkitn.openai.azure.com/"
# Prompt for the key at run time instead of committing a secret to the notebook.
os.environ["AZURE_OPENAI_API_KEY"] = getpass("Azure OpenAI API key: ")

### Coding

1. Use the `unstructured` library to partition the PDF document into different types of elements.

In [5]:
# ! ping https://huggingface.co
! pip list | grep nltk

nltk                                     3.8.1


In [1]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="statement_of_changes.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)

This function will be deprecated in a future release and `unstructured` will simply use the DEFAULT_MODEL from `unstructured_inference.model.base` to set default model name


LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

2. Categorize the elements

In [None]:
# Count how many elements of each class the partitioner returned.
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    category_counts[category] = category_counts.get(category, 0) + 1

unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 2}
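The same tally can be written more compactly with `collections.Counter`. In this sketch the two classes are empty stand-ins defined only for illustration, so that it runs without the `unstructured` package; in the notebook the elements would come from `raw_pdf_elements`.

```python
from collections import Counter


# Empty stand-ins for the element classes, defined here only for illustration.
class CompositeElement:
    pass


class Table:
    pass


elements = [CompositeElement(), Table(), CompositeElement()]

# Count elements by class name instead of by the full str(type(...)) string.
category_counts = Counter(type(element).__name__ for element in elements)
print(category_counts)
```

Counting by class name keeps the keys short, which makes the result easier to read than the full `<class '...'>` strings.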

In [None]:
class Element(BaseModel):
    type: str
    text: Any

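# Separate tables from text so that each kind can be summarized on its own.
table_elements = []
text_elements = []
for element in raw_pdf_elements:
    # Match on the class path; anything that is neither a table nor a composite text chunk is skipped.
    if "unstructured.documents.elements.Table" in str(type(element)):
        table_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        text_elements.append(Element(type="text", text=str(element)))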

In [None]:
print(len(table_elements))
print(len(text_elements))

0
2


In [None]:
table_elements[0]

IndexError: list index out of range

In [None]:
table_elements[2]

IndexError: list index out of range

In [1]:
from langchain.chat_models import ChatOpenAI
from langchain.chat_models import AzureChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

3. Build a summarization chain with the LangChain framework

In [6]:
prompt_text = """
  You are responsible for concisely summarizing table or text chunk:

  {element}
"""
prompt = ChatPromptTemplate.from_template(prompt_text)
# summarize_chain = {"element": lambda x: x} | prompt | ChatOpenAI(temperature=0, model="gpt-3.5-turbo") | StrOutputParser()
# Credentials come from the environment variables set earlier, not from hard-coded strings.
llm = AzureChatOpenAI(temperature=0, model="gpt-3.5-turbo", api_key=os.environ["AZURE_OPENAI_API_KEY"], api_version=os.environ["AZURE_OPENAI_API_VERSION"], azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"])
summarize_chain = {"element": lambda x: x} | prompt | llm | StrOutputParser()

4. Summarize each text and table element

In [7]:
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

texts = [i.text for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

NameError: name 'table_elements' is not defined
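Conceptually, batching with `max_concurrency` means running the same chain over many inputs with a cap on how many calls are in flight, and returning the results in input order. The sketch below illustrates that idea with a thread pool; it is not LangChain's implementation, and `summarize` is a hypothetical stand-in for the LLM call.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(element: str) -> str:
    """Hypothetical stand-in for an LLM call: keep the first five words."""
    return " ".join(element.split()[:5])


def batch(inputs: list[str], max_concurrency: int) -> list[str]:
    """Run summarize over all inputs with a bounded number of workers."""
    # At most `max_concurrency` calls run at once; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(summarize, inputs))


elements = [f"Element {i} has some longer text that needs a summary" for i in range(8)]
summaries = batch(elements, max_concurrency=5)
print(summaries)
```

Capping concurrency is mainly a way to stay within the rate limits of the model provider while still finishing faster than a sequential loop.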

5. Use LangChain's `MultiVectorRetriever` to associate the summaries of tables and texts with the original chunks in a parent-child relationship.

In [None]:
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    id_key=id_key,
)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
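The parent-child idea behind the cell above can be sketched without any library: the summaries are what gets searched, and the shared `doc_id` leads back to the original chunk that is actually returned. Everything in this sketch is an assumption made for illustration: the sample data, the word-overlap scoring (a stand-in for embedding similarity), and the plain `dict` used as a document store.

```python
import re
import uuid

id_key = "doc_id"

# Made-up originals (parents) and their summaries (children).
originals = [
    "Item | Quantity\nWidgets | 12\nGadgets | 7",
    "This report describes the quarterly inventory process in detail.",
]
summaries = [
    "Table listing quantities for widgets and gadgets.",
    "Narrative overview of the inventory process.",
]

# Each summary carries the ID of its original; the docstore maps IDs to originals.
doc_ids = [str(uuid.uuid4()) for _ in originals]
summary_index = [
    {"page_content": s, "metadata": {id_key: doc_ids[i]}}
    for i, s in enumerate(summaries)
]
docstore = dict(zip(doc_ids, originals))


def tokens(text: str) -> set[str]:
    """Lowercased word set, used for a toy similarity score."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Search the summaries, then return the originals they point to."""
    q = tokens(query)
    ranked = sorted(
        summary_index,
        key=lambda d: len(q & tokens(d["page_content"])),
        reverse=True,
    )
    return [docstore[d["metadata"][id_key]] for d in ranked[:k]]


results = retrieve("How many widgets and gadgets?")
print(results[0])
```

The query matches the table's summary, but the caller receives the full table, so the model later sees the raw rows rather than a lossy description of them.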

In [None]:
from langchain.schema.runnable import RunnablePassthrough

template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(temperature=0, model="gpt-4")
    | StrOutputParser()
)
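The data flow of this chain can be sketched in plain Python: the same question is sent to the retriever and passed through unchanged, both values fill the prompt template, and the resulting prompt goes to the model. The `fake_retriever` and `fake_llm` functions are hypothetical stand-ins, and joining the retrieved chunks with blank lines is a simplification made for this sketch.

```python
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""


def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever: always returns one made-up table chunk."""
    return ["Item | Quantity\nWidgets | 12"]


def fake_llm(prompt: str) -> str:
    """Hypothetical model: reports the prompt size instead of answering."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def rag_chain(question: str) -> tuple[str, str]:
    """Retrieve context, fill the template, and call the model."""
    # The question feeds both the retriever and the prompt.
    inputs = {"context": fake_retriever(question), "question": question}
    prompt = template.format(
        context="\n\n".join(inputs["context"]),
        question=inputs["question"],
    )
    return prompt, fake_llm(prompt)


prompt_text, answer = rag_chain("How many widgets are listed?")
print(prompt_text)
print(answer)
```

Printing the filled prompt like this is a useful way to check what context the model actually receives when an answer looks wrong.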

In [None]:
chain.invoke("How many stocks were disposed? Who is the beneficial owner?")

'2300 stocks were disposed. The beneficial owner is the Welch-Drell 2009 Revocable Trust.'

6. Experiment with GPT-3.5

It looks like it doesn't perform as well as GPT-4.

In [None]:
# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    # | ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
    | AzureChatOpenAI(temperature=0, model="gpt-4-32k", api_key=os.environ["AZURE_OPENAI_API_KEY"], api_version=os.environ["AZURE_OPENAI_API_VERSION"], azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"])
    | StrOutputParser()
)
chain.invoke("How many stocks were disposed? Who is the beneficial owner?")

'Based on the given context, it is not possible to determine how many stocks were disposed or who the beneficial owner is. The context does not provide any specific information about the disposal of stocks or the identification of the beneficial owner.'