# OCR PDF extraction

**Table of contents**<a id='toc0_'></a>    
- 1. [Unstructured Pytesseract loader](#toc1_)    
- 2. [Paddle OCR loader](#toc2_)    
- 3. [Simple Retrieval](#toc3_)    
  - 3.1. [Text splitting](#toc3_1_)    
  - 3.2. [Embedding and storing](#toc3_2_)    
  - 3.3. [Model and retrieval](#toc3_3_)    
- 4. [Multivector retrieval](#toc4_)    
  - 4.1. [Text splitting](#toc4_1_)    
  - 4.2. [Summarization](#toc4_2_)    
  - 4.3. [Embedding and storing](#toc4_3_)    
  - 4.4. [Model and retrieval](#toc4_4_)    


This notebook shows examples of text and table extraction from PDF files with different **OCR** packages.

It also includes an example of a simple retriever and an example of a multivector retriever, both using the extracted data.

In [1]:
import tqdm as notebook_tqdm
import sys
sys.path.append('../src')
from dotenv import load_dotenv
from models.sambanova_endpoint import SambaNovaEndpoint
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

In [2]:
import glob
# Provide location of pdf files
folder_loc = 'sample_data/sample_pdfs'
pdf_files = [f for f in glob.glob(f'{folder_loc}/*.pdf')]
sample_pdf = pdf_files[0]

## 1. <a id='toc1_'></a>[Unstructured Pytesseract loader](#toc0_)

To run this loader, install the pytesseract and poppler-utils packages on your machine, or run this notebook inside the `data_extarction` Docker container.

Behind the scenes, this loader uses Unstructured and pytesseract to detect the page layout, then transcribes the text and renders tables as HTML tables.
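
When running outside the container, it can help to confirm that the system binaries are reachable before calling the loader. The following is a minimal sketch, assuming the Tesseract executable is named `tesseract` and that poppler-utils provides `pdftoppm`; the helper name `missing_binaries` is hypothetical and not part of this repository.

```python
import shutil


def missing_binaries(names=("tesseract", "pdftoppm")):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_binaries()
if missing:
    print(f"Missing system binaries: {', '.join(missing)}")
else:
    print("All OCR system dependencies were found on PATH.")
```

If anything is reported as missing, installing it through the system package manager or switching to the Docker container are the two options described above.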

In [3]:
from src.pdf_table_text_extraction import UnstructuredPdfPytesseractLoader

loader = UnstructuredPdfPytesseractLoader(sample_pdf)
docs=loader.load()

for doc in docs:
    print(f'{doc.page_content}\n---\n')



commercial and public sector industries and is as disruptive decades ago. And like the internet—Al promises decisive to organizations that can leverage it for innovation sooner

Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE

The AI talent shorta

## 2. <a id='toc2_'></a>[Paddle OCR loader](#toc0_)

To run this loader, run this notebook in the paddle-ocr environment, i.e. the `data_extarction_paddel` Docker container.

Behind the scenes, this loader uses the Paddle OCR and Paddle Structure modules to detect the page layout, mask images and equations, and then transcribe the text, rendering tables as HTML tables.

In [4]:
from src.multi_column_ocr import PaddleOCRLoader

loader = PaddleOCRLoader(sample_pdf)
docs=loader.load()

for doc in docs:
    print(f'{doc.page_content}\n---\n')

**figure l**
Al is increasingly being adopted across commercial and public sector industries and is as disruptive
today as the advent of the internet a few decades ago.And like the internet-Al promises decisive
competitive and operational advantages to organizations that can leverage it for innovation sooner
rather than later.
Today,all but the big tech giants face seemingly insurmountable obstacles to bring Al applications into production.These
challenges include:
Critical skill gaps due to machine learning talent scarcity
Lack of expertise in computing architectures
Difficulty in keeping on top of latest models and technigues
Investment justification without proof of prior impact
UNLOCKING THE FUTURE
The Al talent shortage continues to burden companies across all industries. And the reality is, unless you're a company lik
Google or Facebook,you'll be hard-pressed to attract top talent,resulting in a subpar Al solution.It's time you demand
solution to this challenge and it's time that

## 3. <a id='toc3_'></a>[Simple Retrieval](#toc0_)

This is an example of a simple RAG pipeline for QA over the document data.

Documents are split into smaller chunks, and their embeddings are stored in a vector database.

Then, for each user question, the **k** most related chunks are retrieved and used as context for QA.
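
To make the retrieval step concrete, here is a small sketch of "pick the **k** most related chunks" using cosine similarity on toy two-dimensional vectors. The vectors, the chunk labels, and the helper names `cosine` and `top_k` are made up for illustration; the notebook itself delegates this work to the embedding model and the FAISS vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy data: labels stand in for chunks, short lists stand in for embeddings.
toy_chunks = ["pricing table", "talent shortage", "hardware specs"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query = [0.9, 0.1]

best = top_k(toy_query, toy_vectors, k=2)
print([toy_chunks[i] for i in best])
```

The retrieved chunks are then passed to the LLM as context, which is what the retrieval chain in section 3.3 automates.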

### 3.1. <a id='toc3_1_'></a>[Text splitting](#toc0_)
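
The splitter in this section receives an ordered list of separators, including `</table>`, so that chunk boundaries tend to fall at the end of an HTML table. As a rough intuition for how a recursive splitter uses such a list, here is a simplified sketch. It is not LangChain's implementation: `split_recursive` is a hypothetical helper, and it ignores `chunk_overlap` entirely.

```python
def split_recursive(text, separators, chunk_size):
    """Simplified recursive splitting: try separators in order, merge small pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = current + sep + piece if current else piece
            if len(candidate) <= chunk_size:
                # The piece still fits: keep growing the current chunk.
                current = candidate
                continue
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # The piece alone is too long: retry with the finer separators.
                chunks.extend(split_recursive(piece, separators[i + 1:], chunk_size))
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to fixed-width cuts.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]


demo = split_recursive("aaaa\n\nbbbb\n\ncccc", ["\n\n", "</table>", "\n"], 10)
print(demo)
```

The real splitter additionally carries `chunk_overlap` characters from one chunk into the next, which is why consecutive chunks in the output of this section share some text.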

In [5]:
# Define a recursive character text splitter; use the end-of-table tag </table> as one of the separators
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, 
                                                chunk_overlap=200, 
                                                length_function=len,
                                                separators=["\n\n", "</table>", "\n"])
chunks = text_splitter.split_documents(docs)

for chunk in chunks:
    print(f'{chunk.page_content}\n---\n')

**figure l**
Al is increasingly being adopted across commercial and public sector industries and is as disruptive
today as the advent of the internet a few decades ago.And like the internet-Al promises decisive
competitive and operational advantages to organizations that can leverage it for innovation sooner
rather than later.
Today,all but the big tech giants face seemingly insurmountable obstacles to bring Al applications into production.These
challenges include:
Critical skill gaps due to machine learning talent scarcity
Lack of expertise in computing architectures
Difficulty in keeping on top of latest models and technigues
Investment justification without proof of prior impact
UNLOCKING THE FUTURE
The Al talent shortage continues to burden companies across all industries. And the reality is, unless you're a company lik
Google or Facebook,you'll be hard-pressed to attract top talent,resulting in a subpar Al solution.It's time you demand
---

Google or Facebook,you'll be hard-presse

### 3.2. <a id='toc3_2_'></a>[Embedding and storing](#toc0_)

In [7]:
# Define the HuggingFace embedding model
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceInstructEmbeddings(
    model_name="BAAI/bge-large-en",
    query_instruction="Represent this sentence for searching relevant passages: ",
    encode_kwargs=encode_kwargs,
)

# Store the embeddings on an in-memory simple vector database 
vectorstore = FAISS.from_documents(documents=chunks, embedding=embeddings)


load INSTRUCTOR_Transformer
max_seq_length  512


### 3.3. <a id='toc3_3_'></a>[Model and retrieval](#toc0_)

In [8]:
# Define the SambaNova-hosted model
load_dotenv('export.env')

llm = SambaNovaEndpoint(
    model_kwargs={"do_sample": False, "temperature": 0.0},
)

In [9]:
# Define retriever
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5, "k": 4},
)
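
As an illustration of what the two `search_kwargs` express, the sketch below filters a list of already-scored hits. It assumes relevance scores are normalized so that higher means more similar; the function `filter_hits` and the toy scores are hypothetical, and the real filtering happens inside the vector store retriever.

```python
def filter_hits(scored_chunks, score_threshold=0.5, k=4):
    """Keep hits scoring at least `score_threshold`, best first, at most `k` of them."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


toy_hits = [("a", 0.9), ("b", 0.4), ("c", 0.6), ("d", 0.55), ("e", 0.7), ("f", 0.8)]
print(filter_hits(toy_hits))
```

One consequence worth keeping in mind: if no chunk reaches the threshold, the retriever returns nothing and the chain has no context to answer from.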

In [10]:
# Define the retrieval chain
simple_retrieval_chain = RetrievalQA.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    input_key="question",
    output_key="answer",
)

In [11]:
custom_prompt_template = """<s>[INST] <<SYS>>\nUse the following pieces of context to answer the question at the end. 
Be careful when reading tables in HTML format; do not mix information between rows or columns.
If the answer is not in the context, say that you don't know; don't try to make up an answer or provide an answer not extracted from the provided context.
Cross-check whether the answer is contained in the provided context. If not, then say "I do not have information regarding this."

Context
{context}
End of context
<</SYS>>

Question: {question}
Helpful Answer: [/INST]"""

CUSTOMPROMPT = PromptTemplate(
    template=custom_prompt_template, input_variables=["context", "question"]
)

## Inject custom prompt
simple_retrieval_chain.combine_documents_chain.llm_chain.prompt = CUSTOMPROMPT
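
To show how the two placeholders in the prompt get filled, here is a plain-Python sketch. It assumes the retrieved chunks are simply joined with blank lines before being substituted into `{context}`; the chain's actual document formatting may differ, and `build_prompt` and `toy_template` are hypothetical names used only for illustration.

```python
def build_prompt(template, chunks, question, separator="\n\n"):
    """Join the retrieved chunks and substitute them into the prompt template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Shortened stand-in for the custom template defined in the cell above.
toy_template = (
    "Context\n{context}\nEnd of context\n\n"
    "Question: {question}\nHelpful Answer:"
)

filled = build_prompt(toy_template, ["chunk one", "chunk two"], "What is stated?")
print(filled)
```

Because every retrieved chunk ends up in a single prompt, the chunk size and **k** together determine how much of the model's context window is consumed.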

In [12]:
user_question = "What is the main advantage of using SambaNova's DataScale platform compared to traditional AI computing solutions?"
response = simple_retrieval_chain({"question": user_question})

In [13]:
print(f'Response ={response["answer"]}')

Response = According to the provided context, the main advantage of using SambaNova's DataScale platform compared to traditional AI computing solutions is that it provides a 5x improvement in performance compared to a comparable GPU running the same models. This is mentioned in the statement by BRONIS DE SUPINSKI, CHIEF TECHNOLOGY OFFICER, LAWRENCE LIVERMORE NATIONAL LABORATORY, who says,


## 4. <a id='toc4_'></a>[Multivector retrieval](#toc0_)

This is an example of a multivector RAG pipeline for QA over the document data.


Documents are split into smaller chunks, and each chunk is summarized by an LLM.

Then, the embeddings of the summaries are stored in a vector database, and the original chunks are kept in a separate document store.

Later, for each user question, the most related summaries indicate which original chunks to retrieve and use as context for QA.
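
The link between a summary and its original chunk is just a shared identifier. The sketch below shows that bookkeeping with plain dictionaries, leaving out embeddings and the LLM; `build_indexes`, `resolve`, and the toy strings are hypothetical and only illustrate the idea that the following cells implement with LangChain.

```python
import uuid


def build_indexes(chunks, summaries):
    """Pair each summary with the id of the chunk it was generated from."""
    doc_ids = [str(uuid.uuid4()) for _ in chunks]
    summary_index = [{"summary": s, "doc_id": d} for s, d in zip(summaries, doc_ids)]
    docstore = dict(zip(doc_ids, chunks))
    return summary_index, docstore


def resolve(matched_summaries, docstore):
    """Map matched summaries back to their original chunks, without duplicates."""
    seen, originals = set(), []
    for entry in matched_summaries:
        doc_id = entry["doc_id"]
        if doc_id not in seen:
            seen.add(doc_id)
            originals.append(docstore[doc_id])
    return originals


toy_chunks = ["full text A", "full text B"]
toy_summaries = ["summary of A", "summary of B"]
summary_index, docstore = build_indexes(toy_chunks, toy_summaries)

# Pretend the vector search matched the second summary.
print(resolve([summary_index[1]], docstore))
```

Searching over summaries while answering from the full chunks keeps the search text short and focused without discarding detail from the context given to the model.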

In [14]:

from langchain_core.documents import Document
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import Chroma
from langchain.storage import InMemoryByteStore
import uuid

### 4.1. <a id='toc4_1_'></a>[Text splitting](#toc0_)

In [15]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, 
                                                    chunk_overlap=200, 
                                                    length_function=len,
                                                    separators=["\n\n", "</table>", "\n"])

chunks=text_splitter.split_documents(docs)
for chunk in chunks:
    print(f'{chunk.page_content}\n---\n')

**figure l**
Al is increasingly being adopted across commercial and public sector industries and is as disruptive
today as the advent of the internet a few decades ago.And like the internet-Al promises decisive
competitive and operational advantages to organizations that can leverage it for innovation sooner
rather than later.
Today,all but the big tech giants face seemingly insurmountable obstacles to bring Al applications into production.These
challenges include:
Critical skill gaps due to machine learning talent scarcity
Lack of expertise in computing architectures
Difficulty in keeping on top of latest models and technigues
Investment justification without proof of prior impact
UNLOCKING THE FUTURE
The Al talent shortage continues to burden companies across all industries. And the reality is, unless you're a company lik
Google or Facebook,you'll be hard-pressed to attract top talent,resulting in a subpar Al solution.It's time you demand
---

Google or Facebook,you'll be hard-presse

### 4.2. <a id='toc4_2_'></a>[Summarization](#toc0_)

In [16]:
# Define the summarization chain
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("<s>[INST] <<SYS>>Summarize the following document:<</SYS>>\n\n{doc}[/INST]")
    | llm
    | StrOutputParser()
)


In [17]:
# call summarization chain in batch
summaries = chain.batch(chunks, {"max_concurrency": 1})

# create an id for each original chunk
doc_ids = [str(uuid.uuid4()) for _ in chunks]

# create a document for each summary and store the doc_id of the original chunk in its metadata
id_key = "doc_id"
summarized_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

for summary in summarized_docs:
    print(f'{summary.page_content}\n---\n')

 The document discusses the increasing adoption of Artificial Intelligence (Al) across various industries and its potential to bring significant competitive and operational advantages to organizations. However, it highlights the challenges faced by companies, excluding big tech giants, in implementing Al applications due to skill gaps, lack of expertise in computing architectures, and difficulty in keeping up with the latest models and techniques. Additionally, the document mentions the scarcity of machine learning talent
---

 SambaNova offers a next-generation AI platform that enables organizations to build and deploy AI solutions for natural language processing, high-resolution computer vision, and recommendation. The platform is designed to be affordable and attainable, with a CapEx-modeled rack infrastructure solution or an OpEx-modeled subscription service that is fully managed by SambaNova and comes with pretrained models. The platform offers state-of-the-art
---

 SambaNova Sys

### 4.3. <a id='toc4_3_'></a>[Embedding and storing](#toc0_)

In [18]:
# Define the HuggingFace embedding model
encode_kwargs = {"normalize_embeddings": True}
embeding=HuggingFaceInstructEmbeddings(model_name="BAAI/bge-large-en",
    query_instruction="Represent this sentence for searching relevant passages: ",
    encode_kwargs=encode_kwargs)

load INSTRUCTOR_Transformer
max_seq_length  512


In [19]:
# Define the vectorstore used to index the summarized chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=embeding)

# Define the storage layer for the parent documents (original chunks)
store = InMemoryByteStore()

# Define the retriever (empty to start), passing the vectorstore and the store of original chunks
multivector_retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

In [20]:
# embed the summarized chunks and add them to the vectorstore
multivector_retriever.vectorstore.add_documents(summarized_docs)

# storing the original chunks and their respective doc_id to the docstore
multivector_retriever.docstore.mset(list(zip(doc_ids, chunks)))

### 4.4. <a id='toc4_4_'></a>[Model and retrieval](#toc0_)

In [21]:
# Define the retrieval chain
multivector_retrieval_chain = RetrievalQA.from_llm(
    llm=llm,
    retriever=multivector_retriever,
    return_source_documents=True,
    input_key="question",
    output_key="answer",
)

In [22]:
custom_prompt_template = """<s>[INST] <<SYS>>\nUse the following pieces of context to answer the question at the end. 
Be careful when reading tables in HTML format; do not mix information between rows or columns.
If the answer is not in the context, say that you don't know; don't try to make up an answer or provide an answer not extracted from the provided context.
Cross-check whether the answer is contained in the provided context. If not, then say "I do not have information regarding this."

Context
{context}
End of context
<</SYS>>

Question: {question}
Helpful Answer: [/INST]"""
CUSTOMPROMPT = PromptTemplate(
    template=custom_prompt_template, input_variables=["context", "question"]
)
## Inject custom prompt
multivector_retrieval_chain.combine_documents_chain.llm_chain.prompt = CUSTOMPROMPT

In [23]:
user_question = "What is the main advantage of using SambaNova's DataScale platform compared to traditional AI computing solutions?"
response = multivector_retrieval_chain({"question": user_question})

In [24]:
print(f'Response ={response["answer"]}')

Response = According to the provided context, the main advantage of using SambaNova's DataScale platform compared to traditional AI computing solutions is that it allows businesses to deploy fully managed Al solutions within weeks, as opposed to the typical 12-18 months. This is made possible by SambaNova's subscription-based AI acceleration services platform, which enables enterprises to jump-start their Al initiatives by leveraging SambaNova's expertise
