# Retrieval from PDF tables using OCR table extraction

**Table of contents**<a id='toc0_'></a>    
- 1. [Load documents](#toc1_)    
  - 1.1. [Unstructured Pytesseract loader](#toc1_1_)    
  - 1.2. [Paddle OCR loader](#toc1_2_)    
- 2. [Simple Retrieval](#toc2_)    
  - 2.1. [Text splitting](#toc2_1_)    
  - 2.2. [Embedding and storing](#toc2_2_)    
  - 2.3. [Model and retrieval](#toc2_3_)    
- 3. [Multivector retrieval](#toc3_)    
  - 3.1. [Text splitting](#toc3_1_)    
  - 3.2. [Summarization](#toc3_2_)    
  - 3.3. [Embedding and storing](#toc3_3_)    
  - 3.4. [Model and retrieval](#toc3_4_)    


This notebook walks through two examples that use data extracted from PDF tables: a simple retriever and a multivector retriever.

In [1]:
import tqdm as notebook_tqdm
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

from dotenv import load_dotenv
from utils.sambanova_endpoint import SambaNovaEndpoint
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

In [4]:
import glob
# Provide location of pdf files
folder_loc = os.path.join(kit_dir,'data/sample_data/sample_pdfs')
pdf_files = [f for f in glob.glob(f'{folder_loc}/*.pdf')]
sample_pdf = pdf_files[0]

## 1. <a id='toc1_'></a>[Load documents](#toc0_)

### 1.1. <a id='toc1_1_'></a>[Unstructured Pytesseract loader](#toc0_)

To run this loader, install the pytesseract and poppler-utils packages on your machine, or run this notebook in the data_extarction docker container.

Behind the scenes, this loader uses Unstructured and the pytesseract module to detect the page layout, then transcribes the text and renders tables as HTML tables.
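Because tables arrive as HTML markup inside `page_content`, it can help to see how that markup maps back to rows and cells. The following is a minimal sketch using only Python's standard library; the `TableRowParser` class and the sample table are hypothetical illustrations and are not part of the loader.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up table in the style a loader might emit; not taken from the sample PDF
sample_html = (
    "<table><tr><th>Model</th><th>Params</th></tr>"
    "<tr><td>A</td><td>7B</td></tr></table>"
)
parser = TableRowParser()
parser.feed(sample_html)
print(parser.rows)  # [['Model', 'Params'], ['A', '7B']]
```

A parser like this is only useful for inspecting extraction quality; the retrieval pipelines in this notebook pass the HTML to the model unchanged.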

In [4]:
from data_extraction.src.pdf_table_text_extraction import UnstructuredPdfPytesseractLoader

loader = UnstructuredPdfPytesseractLoader(sample_pdf)
docs=loader.load()

for doc in docs:
    print(f'{doc.page_content}\n---\n')



commercial and public sector industries and is as disruptive decades ago. And like the internet—Al promises decisive to organizations that can leverage it for innovation sooner

Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE

The AI talent shorta

### 1.2. <a id='toc1_2_'></a>[Paddle OCR loader](#toc0_)

To run this loader, run this notebook in the paddle-ocr environment or in the data_extarction_paddel docker container.

Behind the scenes, this loader uses the Paddle OCR and Paddle Structure modules to detect the page layout, mask images and equations, and then transcribe the text and render tables as HTML tables.

In [3]:
from data_extraction.src.multi_column_ocr import PaddleOCRLoader

loader = PaddleOCRLoader(sample_pdf, output_folder=os.path.join(kit_dir,'data/extraction'))
docs=loader.load()

for doc in docs:
    print(f'{doc.page_content}\n---\n')

[2024/02/06 15:47:02] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=False, use_xpu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, shape_info_filename=None, precision='fp32', gpu_mem=500, image_dir=None, det_algorithm='DB', det_model_dir='/Users/jorgep/.paddleocr/whl/det/en/en_PP-OCRv3_det_infer', det_limit_side_len=960, det_limit_type='max', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_sast_polygon=False, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_box_type='quad', det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, det_fce_box_type='poly', rec_algorithm='SVTR_LCNet', rec_model_dir='/Users/jorgep/.paddleocr/whl/rec/en/en_PP-OCRv3_rec_infer', rec_image_shape='3, 48, 320', rec_batch_num=6, max_te

## 2. <a id='toc2_'></a>[Simple Retrieval](#toc0_)

This is an example of a simple RAG pipeline for question answering (QA) over the document data.

Documents are split into smaller chunks, and their embeddings are stored in a vector database.

Then, for each user question, the **k** most related chunks are retrieved and used as context for QA.
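The retrieval step described above can be sketched without any external dependencies. This is a toy illustration, assuming a bag-of-words count vector as a stand-in for a real embedding model; the names `embed`, `cosine`, `top_k`, and `toy_chunks` are hypothetical and do not appear elsewhere in this notebook.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up chunks for illustration only
toy_chunks = [
    "the platform scales from one to hundreds of systems",
    "machine learning talent is scarce",
    "tables are transcribed as html",
]
context = top_k("how does the platform scale", toy_chunks, k=1)
print(context)
```

In the actual pipeline, the embedding model and the vector database take the place of `embed` and the sorted list, but the ranking idea is the same.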

### 2.1. <a id='toc2_1_'></a>[Text splitting](#toc0_)

In [5]:
# Define a recursive character text splitter and use the end-of-table tag </table> as a separator
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, 
                                                chunk_overlap=200, 
                                                length_function=len,
                                                separators=["\n\n", "</table>", "\n"])
chunks = text_splitter.split_documents(docs)

for chunk in chunks:
    print(f'{chunk.page_content}\n---\n')

commercial and public sector industries and is as disruptive decades ago. And like the internet—Al promises decisive to organizations that can leverage it for innovation sooner

Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE
---

UNLOCKING THE FU
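To give an intuition for the ordered `separators` list used above, here is a deliberately simplified sketch of priority-based splitting. It is not LangChain's algorithm: it does not merge small pieces, it does not apply any overlap, and it drops the separator itself. The function `split_by_priority` and the sample text are hypothetical.

```python
def split_by_priority(text, separators, chunk_size):
    """Split on the first separator, recursing with the remaining ones for long pieces."""
    if len(text) <= chunk_size or not separators:
        return [text]
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        if part.strip():  # skip empty fragments left by the split
            pieces.extend(split_by_priority(part, rest, chunk_size))
    return pieces


# Made-up text mixing a paragraph, a small HTML table, and trailing lines
toy_text = (
    "Intro paragraph.\n\n"
    "<table><tr><td>a</td></tr></table>"
    "Trailing note.\nSecond line."
)
pieces = split_by_priority(toy_text, ["\n\n", "</table>", "\n"], chunk_size=40)
for piece in pieces:
    print(repr(piece))
```

The point of placing `</table>` ahead of `\n` is that an oversized piece is cut at a table boundary before it is cut at an arbitrary line break, which keeps table rows together where possible.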

### 2.2. <a id='toc2_2_'></a>[Embedding and storing](#toc0_)

In [6]:
# Define HuggingFace embedding model
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceInstructEmbeddings(
    model_name="BAAI/bge-large-en",
    query_instruction="Represent this sentence for searching relevant passages: ",
    encode_kwargs=encode_kwargs,
)

# Store the embeddings in a simple in-memory vector database
vectorstore = FAISS.from_documents(documents=chunks, embedding=embeddings)


load INSTRUCTOR_Transformer
max_seq_length  512


### 2.3. <a id='toc2_3_'></a>[Model and retrieval](#toc0_)

In [9]:
# Define the SambaNova model to run
load_dotenv(os.path.join(repo_dir,'.env'))

# Remove "select_expert" if you are not using CoE
llm = SambaNovaEndpoint(
    model_kwargs={"do_sample": False, "temperature": 0.0, "select_expert": "llama-2-7b-chat-hf"},
)

In [10]:
# Define retriever
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5, "k": 4},
)
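The two `search_kwargs` above combine a score cutoff with a cap on the number of results. As a rough sketch of that behavior, assuming relevance scores where higher means more similar, the selection could look like the following; the function `filter_by_threshold` and the scores are made up for illustration.

```python
def filter_by_threshold(scored_chunks, score_threshold=0.5, k=4):
    """Keep at most k (chunk, score) pairs with score >= score_threshold, best first."""
    kept = [pair for pair in scored_chunks if pair[1] >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Hypothetical relevance scores, not produced by the notebook's vector store
scored = [("chunk a", 0.82), ("chunk b", 0.41), ("chunk c", 0.67), ("chunk d", 0.50)]
selected = filter_by_threshold(scored)
print(selected)  # [('chunk a', 0.82), ('chunk c', 0.67), ('chunk d', 0.5)]
```

With a threshold in place, a question that matches nothing well can return fewer than `k` chunks, or none at all, so the prompt should tell the model what to do when the context is empty.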

In [11]:
# Define retrieval chain
simple_retrieval_chain = RetrievalQA.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    input_key="question",
    output_key="answer",
)

In [12]:
custom_prompt_template = """<s>[INST] <<SYS>>\nUse the following pieces of context to answer the question at the end. 
Be careful when reading tables in HTML format; do not mix information between rows or columns.
If the answer is not in the context, say that you don't know; don't try to make up an answer or provide an answer not extracted from the provided context. 
Cross-check whether the answer is contained in the provided context. If not, then say "I do not have information regarding this."

Context
{context}
End of context
<</SYS>>

Question: {question}
Helpful Answer: [/INST]"""

CUSTOMPROMPT = PromptTemplate(
    template=custom_prompt_template, input_variables=["context", "question"]
)

## Inject custom prompt
simple_retrieval_chain.combine_documents_chain.llm_chain.prompt = CUSTOMPROMPT
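To make the two placeholders concrete, the sketch below fills a shortened, hypothetical template with plain `str.format`. It assumes the retrieved chunks are joined with blank lines before being substituted into `{context}`; the names `template`, `retrieved`, and `filled` are illustrative only.

```python
# Shortened, hypothetical template using the same two placeholders
template = "Context\n{context}\nEnd of context\n\nQuestion: {question}"

# Stand-ins for the page_content of the retrieved chunks
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]

filled = template.format(
    context="\n\n".join(retrieved),  # chunks joined into a single context string
    question="What does the first chunk say?",
)
print(filled)
```

Printing the filled prompt in this way is a quick check that HTML tables reach the model intact and that the context markers surround them as intended.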

In [13]:
user_question = "What is the main advantage of using SambaNova's DataScale platform compared to traditional AI computing solutions?"
response = simple_retrieval_chain.invoke({"question": user_question})

In [14]:
print(f'Response ={response["answer"]}')

Response = According to the provided context, the main advantage of using SambaNova's DataScale platform compared to traditional AI computing solutions is its ability to scale seamlessly from one to hundreds of systems to meet the demands of modern AI computing, while delivering efficiency with a software-defined-hardware approach and a highly flexible modular architecture. Additionally, DataScale is optimized from algorithms to silicon, which allows it to deliver unrivaled performance, accuracy, and ease


## 3. <a id='toc3_'></a>[Multivector retrieval](#toc0_)

This is an example of a multivector RAG pipeline for QA over the document data.


Documents are split into smaller chunks, and each chunk is summarized by an LLM.

Then, the summary embeddings are stored in a vector database, and the original chunks are kept in a document store, linked to their summaries by a shared ID.

Later, for each user question, the most related summaries indicate which original chunks to retrieve and use as context for QA.
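The linkage between summaries and original chunks can be illustrated with plain Python data structures. This is a toy sketch under stated assumptions: the summaries are hand-written rather than generated by an LLM, word overlap stands in for embedding similarity, and every name (`original_chunks`, `toy_summaries`, `summary_index`, `chunk_store`, `retrieve_original`) is hypothetical.

```python
import uuid

# Hypothetical chunks and hand-written summaries; in the notebook an LLM writes the summaries
original_chunks = [
    "<table><tr><td>Systems</td><td>1 to 100s</td></tr></table>",
    "Critical skill gaps due to machine learning talent scarcity",
]
toy_summaries = [
    "table about how many systems the platform scales to",
    "challenges such as scarce talent",
]

# One id per original chunk, shared with its summary
ids = [str(uuid.uuid4()) for _ in original_chunks]
summary_index = list(zip(toy_summaries, ids))   # searched at query time
chunk_store = dict(zip(ids, original_chunks))   # looked up by id afterwards


def word_overlap(a, b):
    """Toy relevance score: number of shared lowercase words (stand-in for embeddings)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve_original(question):
    """Match the question against summaries, then return the linked original chunk."""
    _, best_id = max(summary_index, key=lambda item: word_overlap(question, item[0]))
    return chunk_store[best_id]


print(retrieve_original("how many systems does the platform support"))
```

The search runs over short natural-language summaries, which tend to match questions better than raw HTML, while the model still receives the full original table as context.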

In [15]:

from langchain_core.documents import Document
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import Chroma
from langchain.storage import InMemoryByteStore
import uuid

### 3.1. <a id='toc3_1_'></a>[Text splitting](#toc0_)

In [16]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, 
                                                    chunk_overlap=200, 
                                                    length_function=len,
                                                    separators=["\n\n", "</table>", "\n"])

chunks=text_splitter.split_documents(docs)
for chunk in chunks:
    print(f'{chunk.page_content}\n---\n')

commercial and public sector industries and is as disruptive decades ago. And like the internet—Al promises decisive to organizations that can leverage it for innovation sooner

Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE
---

UNLOCKING THE FU

### 3.2. <a id='toc3_2_'></a>[Summarization](#toc0_)

In [17]:
#Define summarization chain
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("<s>[INST] <<SYS>>Summarize the following document:<</SYS>>\n\n{doc}[/INST]")
    | llm
    | StrOutputParser()
)


In [18]:
# call summarization chain in batch
summaries = chain.batch(chunks, {"max_concurrency": 1})

# create an id for each original chunk
doc_ids = [str(uuid.uuid4()) for _ in chunks]

# Create a document for each summary and store the doc_id of the original chunk in its metadata
id_key = "doc_id"
summarized_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

for summary in summarized_docs:
    print(f'{summary.page_content}\n---\n')

 The document discusses the impact of Artificial Intelligence (AI) on various industries, comparing its disruptive potential to that of the internet. It highlights the challenges faced by organizations in adopting AI, including skill gaps, lack of expertise, and difficulty in keeping up with the latest models and techniques. The document also emphasizes the importance of leveraging AI for innovation and competitive advantage, and suggests that organizations that can overcome these challenges
---

 SambaNova offers a solution to the AI talent shortage crisis by enabling organizations to build and deploy AI solutions for natural language processing, high-resolution computer vision, and recommendation. SambaNova provides state-of-the-art accuracy, scalability, and ease of use, allowing companies to access AI capabilities at a lower cost and time investment than developing in-house infrastructure and machine learning expertise.
---

 SambaNova offers an AI platform that can be delivered as

### 3.3. <a id='toc3_3_'></a>[Embedding and storing](#toc0_)

In [19]:
# Define HuggingFace embedding model
encode_kwargs = {"normalize_embeddings": True}
embeding=HuggingFaceInstructEmbeddings(model_name="BAAI/bge-large-en",
    query_instruction="Represent this sentence for searching relevant passages: ",
    encode_kwargs=encode_kwargs)

load INSTRUCTOR_Transformer
max_seq_length  512


In [20]:
# Define the vectorstore used to index the summarized chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=embeding)

# Define the storage layer for the parent documents (original chunks)
store = InMemoryByteStore()

# Define the retriever (empty to start), passing the vectorstore and the store of original chunks
multivector_retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

In [21]:
# Embed the summarized chunks and add them to the vectorstore
multivector_retriever.vectorstore.add_documents(summarized_docs)

# Store the original chunks with their respective doc_id in the docstore
multivector_retriever.docstore.mset(list(zip(doc_ids, chunks)))

### 3.4. <a id='toc3_4_'></a>[Model and retrieval](#toc0_)

In [22]:
# Define retrieval chain
multivector_retrieval_chain = RetrievalQA.from_llm(
    llm=llm,
    retriever=multivector_retriever,
    return_source_documents=True,
    input_key="question",
    output_key="answer",
)

In [23]:
custom_prompt_template = """<s>[INST] <<SYS>>\nUse the following pieces of context to answer the question at the end. 
Be careful when reading tables in HTML format; do not mix information between rows or columns.
If the answer is not in the context, say that you don't know; don't try to make up an answer or provide an answer not extracted from the provided context. 
Cross-check whether the answer is contained in the provided context. If not, then say "I do not have information regarding this."

Context
{context}
End of context
<</SYS>>

Question: {question}
Helpful Answer: [/INST]"""
CUSTOMPROMPT = PromptTemplate(
    template=custom_prompt_template, input_variables=["context", "question"]
)
## Inject custom prompt
multivector_retrieval_chain.combine_documents_chain.llm_chain.prompt = CUSTOMPROMPT

In [24]:
user_question = "What is the main advantage of using SambaNova's DataScale platform compared to traditional AI computing solutions?"
response = multivector_retrieval_chain.invoke({"question": user_question})

In [25]:
print(f'Response ={response["answer"]}')

Response = According to the provided context, the main advantage of using SambaNova's DataScale platform compared to traditional AI computing solutions is that it offers unrivaled performance, accuracy, scale, and ease of use. DataScale is an integrated AI software and hardware accelerator that is optimized from algorithms to silicon, delivering efficiency with a software-defined-hardware approach and a highly flexible modular architecture. Additionally, DataScale can scale seamlessly from one to hundreds
