In [1]:
from llmsherpa.readers import LayoutPDFReader
import openai
from llama_index.core import Document
from llama_index.core import VectorStoreIndex
from sentence_transformers import SentenceTransformer
import torch


## RAG & parsing PDFs

### What is RAG?
- RAG stands for Retrieval-Augmented Generation.
- It is a technique that combines retrieval of relevant document information with generative models to enhance the quality and relevance of generated responses.
- RAG systems typically retrieve relevant documents from a knowledge base or corpus and use them to inform the generation of responses.
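
A minimal sketch of the retrieve-then-generate flow described above. It assumes a toy word-overlap scorer in place of a real embedding model, and it stops at the prompt that would be handed to a generative model. The helper names `retrieve` and `build_prompt` and the sample corpus are illustrative only, not part of any library.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words (punctuation removed)."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical knowledge base, for illustration only
corpus = [
    "The contractor shall deliver monthly status reports.",
    "The background section describes the agency mission.",
    "Performance metrics are reviewed quarterly.",
]
question = "What reports must the contractor deliver?"
passages = retrieve(question, corpus, k=1)
prompt = build_prompt(question, passages)
print(prompt)
```

In a real system the word-overlap scorer would be replaced by embedding similarity, and the prompt would be sent to a generative model.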

### Parsing PDFs
- Parsing PDFs is more challenging than parsing text formats such as docx or txt files because of the complex structure of PDF documents.
    - What is complex about PDFs?
        - PDFs are highly formatted and meant to be hard to edit.
        - They can contain images, tables, and varied layouts that make text extraction difficult.
        - PDFs often do not contain readily accessible text, but rather a visual representation of it.

### The `nlm-ingestor` parser:
- We are going to use a parser specifically designed for RAG and PDF documents called `nlm-ingestor`, which we access through the `llmsherpa` module. Why?
    - It is designed to handle the complexities of PDF documents and to produce data structures useful for the retrieval portion of RAG.
    - It combines OCR (Optical Character Recognition), the text layer of PDFs, and the font objects (text coordinates/bounding boxes, graphics, and font data) to parse the structure and content of PDF documents.
    - It can also handle other document formats like docx, txt, and images.

### Install requirements:
- The easiest way to install and run the parser server is with Docker.
    - Why Docker?
        - Problem: I (Cameron) write Python on Windows, and the parser depends on `libxml2` and `libxslt`, which must be built from source with tools that Windows does not include by default. More information is found [here](https://lxml.de/installation.html#source-builds-on-ms-windows).

        - Solution: The Windows Subsystem for Linux (WSL2) + Docker. Docker is a container manager that can pull and build GitHub projects and run them in containers, which behave like a sort of lightweight virtual machine. With WSL2 we install a Linux distribution (the default is Ubuntu) that Docker can use as its backend instead of Windows.

### Install process:
1. Set up WSL2 if you are on Windows; instructions are found [here](https://learn.microsoft.com/en-us/windows/wsl/install).
2. Install Docker using the setup guide; instructions are found [here](https://docs.docker.com/desktop/features/wsl/#turn-on-docker-desktop-wsl-2).
3. Install `nlm-ingestor` from the Docker UI's terminal window with these commands:
    - Get the current parser build: `docker pull ghcr.io/nlmatics/nlm-ingestor:latest`
    - Run the server: `docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest`
    - More details are in the nlm-ingestor readme: [link](https://github.com/nlmatics/nlm-ingestor?tab=readme-ov-file#about)
4. Install the Python client: `pip install llmsherpa`
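
Before parsing anything, it can help to confirm that the container is actually listening. The sketch below is one way to do that with only the standard library. It assumes the server was started with the `-p 5010:5001` mapping above, so the host port is 5010. The helper name `server_is_up` is made up for this example.

```python
import socket


def server_is_up(host: str = "localhost", port: int = 5010, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections at host:port."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("Parser server reachable:", server_is_up())
```

This only checks that the port is open; it does not verify that the process behind it is the parser.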

### RAG step 1: Parsing through the server
The code below sends a PDF file to the parser server, which returns the parsed data as JSON; `llmsherpa` wraps that response in a `Document` object. The server must be running for this to work!


In [2]:
llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all" # local API endpoint
pdf_url = "ExampleRFPs/GoodFit/IETSS DRAFT PWS v2 for RFI FINAL to POST_no_contents.pdf"
# "ExampleRFPs/GoodFit/IETSS DRAFT PWS v2 for RFI FINAL to POST.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)


The basic idea of the retrieval step is to ask questions of the document. To do this we use the `sentence_transformers` library to embed both the questions and the document text, then use cosine similarity to find the parts of the document most relevant to each question. This is called "semantic search": as the name implies, it searches for the meaning of the text rather than for exact word matches.

The parser returns a `Document` object that holds the parsed content as a tree of nodes. We use this object to extract the text and embed it for semantic search.

### RAG step 2: Extracting text with context
- Extract the text together with context about the document's sections and structure; the semantic search will use this context.


In [None]:
for table in doc.tables():
    print("Table:")
    print(table.to_context_text())

In [None]:
for section in doc.sections():
    print(f"Section: {section.title}")
    print(f"Children: {[child.to_text() for child in section.children]}")

The code cell below shows how the nlm server splits the PDF using smart chunking. Setting `include_section_info=True` adds the section information to each chunk, so we can see the context each chunk belongs to. Each chunk is a logical, semantically meaningful unit of text, such as a paragraph, a table, or a list item.

  

In [None]:
for i,chunk in enumerate(doc.chunks()):
    print(f"-----Chunk {i}-----")
    # print(chunk.to_text(include_children=True, recurse=True))
    print(chunk.to_context_text(include_section_info=True))
    print("----------")
    

In [None]:
for i, section in enumerate(doc.sections()):
    print(f"Section {i}:")
    print(section.to_context_text(include_section_info=True))

As you can see, the parser offers many options for getting structured data out of the PDF, including section context in smart chunking and the ability to include or exclude images, tables, and other elements.

The next step requires us to choose the method best suited to the type of query we want to perform. For example, if we want to ask general questions about the document content, we can provide section titles, and possibly child nodes, to help guide the semantic search. If we want to ask specific questions about a section's paragraphs, we probably want to include the child nodes to get the most relevant text.
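
The two options can be sketched with a toy stand-in for the parsed tree. The `Section` dataclass and both helpers below are hypothetical and only illustrate the trade-off. With `llmsherpa` the rough equivalents would be `section.title` and `section.to_text(include_children=True, recurse=True)`.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical stand-in for a parsed section node."""
    title: str
    paragraphs: list[str] = field(default_factory=list)


def titles_only(sections: list[Section]) -> list[str]:
    """Short texts: suited to 'which section is about X?' queries."""
    return [s.title for s in sections]


def titles_with_children(sections: list[Section]) -> list[str]:
    """Longer texts: suited to questions about a section's content."""
    return ["\n".join([s.title, *s.paragraphs]) for s in sections]


# Illustrative data, not taken from the parsed document
sections = [
    Section("1.0 BACKGROUND", ["The agency provides IT services."]),
    Section("3.0 SCOPE OF WORK", ["The contractor shall test software.", "Work is on site."]),
]

print(titles_only(sections))
print(titles_with_children(sections)[1])
```

Whichever list is chosen becomes the input to the embedding model in the next step.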

### RAG step 3: Semantic search
- embed the text and the questions using the `sentence_transformers` library.
- calculate the cosine similarity between the embedded questions and the embedded text.
- return the most relevant sections of the document that match the questions.
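
To make the ranking step concrete, here is a small sketch of cosine similarity and top-k selection in plain Python. The two-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `top_k` are illustrative helpers rather than library functions. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[tuple[float, int]]:
    """Return (score, index) pairs for the k most similar document vectors."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:k]


# Toy "embeddings": index 0 points the same way as the query, index 2 is orthogonal
doc_vecs = [[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]]
query_vec = [1.0, 0.0]
results = top_k(query_vec, doc_vecs, k=2)
print(results)
```

The notebook cells that follow do the same thing on real embeddings, using `model.similarity` and `torch.topk`.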


In [28]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [59]:
# Encode each section's text (with its section context) into embeddings.
sections = []
for section in doc.sections():
    sections.append(section.to_context_text(include_section_info=True))
    # sections.append(section.title)
doc_embeddings = model.encode(sections, convert_to_tensor=True)

In [60]:
queries = [
    "which section is about the objective?",
    "which section is about the background?",
    "which section is about the work scope?",
    "which section is about the deliverables?",
    "which section is about the performance metrics?",
]

for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)
    # Similarity between the query and every section embedding
    similarities = model.similarity(query_embedding, doc_embeddings)[0]
    # Top 5 results (or fewer if the document has fewer than 5 sections)
    scores, indices = torch.topk(similarities, k=min(5, len(sections)))
    for rank, (score, idx) in enumerate(zip(scores, indices)):
        print(f"Query: {query}\nResult {rank}:\n{sections[int(idx)]}\nScore: {score.item():.4f}\n")

Query: which section is about the objective?
Result 0:
1.2 OIT SERVICES AND METHODOLOGY > Figure 1.2-3: Service Offerings and the SDLC and SACI phases
1.2.1 Services Executed During SDLC and SACI Planning Phase
Score: 0.3704

Query: which section is about the objective?
Result 1:
1.2 OIT SERVICES AND METHODOLOGY
Figure 1.2-3: Service Offerings and the SDLC and SACI phases
Score: 0.3652

Query: which section is about the objective?
Result 2:

1.2 OIT SERVICES AND METHODOLOGY
Score: 0.3279

Query: which section is about the objective?
Result 3:

3.0 SCOPE OF WORK
Score: 0.3192

Query: which section is about the objective?
Result 4:
1.2 OIT SERVICES AND METHODOLOGY > Figure 1.2-3: Service Offerings and the SDLC and SACI phases
1.2.2 Services Executed During SDLC and SACI Test Services Execution Phase
Score: 0.3170

Query: which section is about the background?
Result 0:

1.0 BACKGROUND
Score: 0.5665

Query: which section is about the background?
Result 1:
1.2 OIT SERVICES AND METHODOLOGY 

### RAG step 4: Summarization
- summarize the relevant sections using a generative model.

Example code using the Hugging Face `transformers` library to summarize text is shown below. It uses a pre-trained model to generate a summary of the relevant sections of the document.
```python
from transformers import pipeline
# Load a summarization pipeline
summarizer = pipeline("summarization", 
                      model="facebook/bart-large-cnn")

def summarize_text(text: str) -> str:
    """Summarize the input text using the summarization pipeline."""
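    # max_length and min_length are counted in tokens (not characters), and max_length must be an integer.
    # The word count is only a rough proxy for tokens; keep the input within the model's maximum input length.
    max_length = max(40, len(text.split()) // 2)
    summary = summarizer(text, max_length=max_length, min_length=30, do_sample=False)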
    return summary[0]['summary_text'] if summary else "No summary available."

if __name__ == "__main__":
        # input_text will be the chunk of text to summarize in the actual application
        input_text = '''The Contractor will be required to design, develop, or operate a system of records on individuals, to accomplish 
                        an agency function subject to the Privacy Act of 1974, Public Law 93-579, December 31, 1974 (5 U.S.C. 552a) and 
                        applicable agency regulations. Violation of the Act may involve the imposition of criminal penalties.'''
        summary = summarize_text(input_text)
        print("Summary:")
        print(summary)
```
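
Because summarization models accept only a limited input length, longer sections may need to be split before they are summarized. The sketch below splits on a word budget, which is a rough proxy for the model's token limit. The helper `split_by_words` and the default of 400 words are assumptions for illustration, not values taken from the model's documentation.

```python
def split_by_words(text: str, max_words: int = 400) -> list[str]:
    """Split text into consecutive pieces of at most max_words words each."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Tiny demonstration with a small budget so the split is visible
pieces = split_by_words("one two three four five", max_words=2)
print(pieces)
```

Each piece could then be passed to a summarizer separately and the partial summaries joined. Splitting on section or paragraph boundaries would preserve more context than a fixed word count.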

The cell below sketches how we can target an online API such as the OpenAI API instead. It sets the API key and loads the parsed chunks into a LlamaIndex `VectorStoreIndex`, whose retriever can then supply the relevant sections of the document to a generative model for summarization.

In [None]:
openai.api_key = "your-openai-api-key"
index = VectorStoreIndex([])
for chunk in doc.chunks():
    # Metadata belongs to the Document itself, not to index.insert()
    index.insert(Document(text=chunk.to_context_text(), metadata={}))
retriever = index.as_retriever()
