# Build Document Embedding Tool

This notebook is a fork of `retrieval-v1` from the local-rag development effort.

In this notebook the goal is to design an approach for loading content from a PDF document into a vector database. For this experiment we will use Pinecone because it will most likely be our tool of choice for production-grade development.

In [3]:
# load libraries
import os
import requests
from tqdm.auto import tqdm # for progress bars
import random
import re
import torch




In [4]:
# list of PDF documents
list_of_pdf_docs = [
    "https://www.faa.gov/sites/faa.gov/files/regulations_policies/handbooks_manuals/aviation/airplane_handbook/00_afh_full.pdf",
    "https://www.faa.gov/sites/faa.gov/files/regulations_policies/handbooks_manuals/aviation/FAA-H-8083-15B.pdf",
    "https://www.faa.gov/regulations_policies/handbooks_manuals/aviation/faa-h-8083-25c.pdf",
    "https://www.faa.gov/sites/faa.gov/files/2022-06/risk_management_handbook_2A.pdf",
    "https://www.faa.gov/sites/faa.gov/files/regulations_policies/handbooks_manuals/aviation/FAA-H-8083-1.pdf",
]
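
The URLs above are not downloaded anywhere in this notebook. As a minimal sketch of how they could be fetched with `requests`, something like the following might work. The helper names `local_name` and `download_pdf` and the `./src` target directory are assumptions for illustration, not part of the original workflow.

```python
import os
from urllib.parse import urlparse


def local_name(url: str) -> str:
    """Return the file name at the end of a URL path."""
    return os.path.basename(urlparse(url).path)


def download_pdf(url: str, dest_dir: str = "./src") -> str:
    """Download one PDF into dest_dir and return the local path (hypothetical helper)."""
    import requests  # imported lazily so local_name works without it

    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, local_name(url))
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(path, "wb") as f:
        f.write(response.content)
    return path


# Only the file-name step is run here; download_pdf needs network access.
example_url = "https://www.faa.gov/sites/faa.gov/files/2022-06/risk_management_handbook_2A.pdf"
example_name = local_name(example_url)
print(example_name)
```

Calling `download_pdf(url)` for each entry in `list_of_pdf_docs` would then produce local paths that can be passed to the parser defined later.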

## Extract Content From PDFs

Data will be extracted from each PDF file we load. The content can vary from text to images to tables, so we will need to keep developing different functions to load each type accurately into our notebook.

The text is cleaned to remove noise before it is stored, which should improve the quality of the embeddings later in the process.

In [5]:
### Document Loader

import pymupdf

# extract image from pdf
def get_page_images(doc, page_content, page_index):
    image_paths = []
    try:
        image_list = page_content.get_images()

        # print number of images found on page
        if image_list: 
            print(f"found {len(image_list)} images on page {page_index}")

        for image_index, img in enumerate(image_list, start=1): # enumerate the image list
            xref = img[0] # get XREF of image
            pix = pymupdf.Pixmap(doc, xref) # create a Pixmap

            if pix.n - pix.alpha > 3: #CMYK: convert to RGB first
                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

            image_path = "page_%s-image_%s.png" % (page_index, image_index)
            pix.save(image_path) # save the image as png
            pix = None

            image_paths.append(image_path)

    except Exception as e: 
        print(f"error occurred getting images: {e}")

    return image_paths

# clean text
def clean_text(text: str) -> str:
    """
    Format text to remove noise.

    In the document we are experimenting with, there are long runs of "."
    (dot leaders). Newline characters are also replaced with spaces.
    """

    # replace multiple dots (.............) with a single space
    text = re.sub(r'\.{2,}', ' ', text)

    # replace new line character with space
    cleaned = text.replace("\n", " ").strip()

    # Add more formatting if needed
    return cleaned

# parse document
def parse_document(filepath):
    """
    This will extract all the data from document and 
    populate a dictionary with the extracted data
    """

    print("---PARSING DOCUMENT---")
    doc = pymupdf.open(filepath) # open document
    pages_and_texts = []

    for page_number, page in enumerate(doc):
        text = page.get_text() # get text from page
        text = clean_text(text) # clean text
        pages_and_texts.append({
            "page_number": page_number,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count": len(text) / 4, # average token = ~4 char
            "images": get_page_images(doc, page, page_number),
            "text": text,
        })

    return pages_and_texts
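
As a quick illustration of what the cleaning step does, the sketch below applies the same two substitutions to a made-up table-of-contents fragment. The `collapse_whitespace` helper is an assumption added here for comparison; the notebook's `clean_text` does not collapse repeated spaces.

```python
import re


def clean_text(text: str) -> str:
    """Same two steps as the notebook's clean_text."""
    text = re.sub(r"\.{2,}", " ", text)     # runs of dots -> one space
    return text.replace("\n", " ").strip()  # newlines -> spaces


def collapse_whitespace(text: str) -> str:
    """Optional extra step (an assumption, not in the notebook)."""
    return re.sub(r"\s+", " ", text).strip()


# Illustrative input modelled on a table-of-contents line.
sample = "1 Purpose of This Advisory Circular ........ 1\n2 Audience ............ 1"
cleaned = clean_text(sample)
compact = collapse_whitespace(cleaned)
print(repr(cleaned))  # dot leaders and newline are gone, extra spaces remain
print(repr(compact))  # single spaces only
```

The leftover double and triple spaces are visible in the `text` fields of the parsed output, which is why a later step replaces them when chunks are joined.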

Now we will run some code to extract data from a local file.

In [6]:
filepath = "./src/61-65-certifications.pdf"

# parse document
pdf_content = parse_document(filepath)

---PARSING DOCUMENT---
found 1 images on page 0


In [7]:
pdf_content

[{'page_number': 0,
  'page_char_count': 573,
  'page_word_count': 91,
  'page_sentence_count_raw': 3,
  'page_token_count': 143.25,
  'images': ['page_0-image_1.png'],
  'text': 'U.S. Department  of Transportation  Federal Aviation  Administration  Advisory  Circular  Subject: Certification: Pilots and Flight and  Ground Instructors  Date: 8/27/18  AC No: 61-65H  Initiated by: AFS-800  Change:  This advisory circular (AC) provides guidance for pilot and instructor applicants, pilots, flight  instructors, ground instructors, and examiners on the certification standards, knowledge test  procedures, and other requirements in Title 14 of the Code of Federal Regulations (14 CFR)  part 61.  Rick Domingo  Executive Director, Flight Standards Service'},
 {'page_number': 1,
  'page_char_count': 1436,
  'page_word_count': 323,
  'page_sentence_count_raw': 1,
  'page_token_count': 359.0,
  'images': [],
  'text': '8/27/18  AC 61-65H  ii  CONTENTS  Paragraph  Page  1  Purpose of This Advisory Cir

We successfully retrieved the content from the PDF document.

Now this content needs to be broken down into chunks that will fit into the context window of our LLM.

We will use spaCy to break the text into sentences. It is an NLP library, so it should be more accurate than splitting with `text.split(". ")`.
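
To see why the naive split is unreliable, consider the made-up two-sentence string below (an illustration, not text from the PDF). Splitting on `". "` breaks at every abbreviation that is followed by a space.

```python
# Two sentences, but three abbreviation/sentence boundaries contain ". "
text = "U.S. Department of Transportation. AC No. 61-65H applies."

naive_pieces = text.split(". ")
print(len(naive_pieces))
for piece in naive_pieces:
    print(repr(piece))
```

The naive split reports four pieces for two sentences. How the spaCy sentencizer handles each abbreviation depends on its tokenizer rules, so it is still worth spot-checking its output on pages that are heavy with references such as `§ 61.87`.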

In [8]:
from spacy.lang.en import English

# initialize model and sentencizer once
nlp = English()
nlp.add_pipe("sentencizer")

def chunk_content(content):
    for item in tqdm(content):
        # Process the text to get sentences
        doc = nlp(item["text"])
        item["sentences"] = list(doc.sents)

        # Convert sentences to strings
        item["sentences"] = [str(sentence) for sentence in item["sentences"]]

        # Count the sentences
        item["page_sentence_count_spacy"] = len(item["sentences"])

    return content

In [9]:
pdf_content = chunk_content(pdf_content)

# inspect sample
random.sample(pdf_content, k=1)

100%|██████████| 58/58 [00:00<00:00, 142.54it/s]


[{'page_number': 30,
  'page_char_count': 2964,
  'page_word_count': 473,
  'page_sentence_count_raw': 25,
  'page_token_count': 741.0,
  'images': [],
  'text': '8/27/18    AC 61-65H  28  35 OTHER INSTRUCTOR ENDORSEMENTS. Specific requirements for knowledge,  aeronautical experience, and, as appropriate, testing for the complex airplane, high  performance airplane, tailwheel airplane, pressurized aircraft capable of operating at high  altitudes, and type specific training are found in § 61.31.  36 GROUND INSTRUCTOR CERTIFICATION. The applicability, eligibility,  privileges, and recency requirements for the ground instructor certificate are located in  part 61 subpart I.  37 AUTHORIZED INSTRUCTORS. Section 61.1 defines an “authorized instructor” as:  1. A person who holds a ground instructor certificate issued under part 61 and is  in compliance with § 61.217 when conducting ground training in accordance  with the privileges and limitations of his or her ground instructor certificate; 

These lists of sentences need to be prepared for embedding. We will group them into larger chunks to preserve context between sentences.

Assuming roughly 36 tokens per sentence, we will chunk 10 sentences together.

In [10]:
# split size
sentences_in_chunk = 10

# split list into sublists of the desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Split the input_list into sublists of size slice_size (as close as possible)
    """

    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# loop through pages and texts and split sentences into chunks
for item in tqdm(pdf_content):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                          slice_size=sentences_in_chunk)
    item["num_chunks"] = len(item["sentence_chunks"])

100%|██████████| 58/58 [00:00<00:00, 183047.13it/s]
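
To make the chunking concrete, here is a small worked example using placeholder sentences. The 25-sentence page and the 36-tokens-per-sentence figure are assumptions for illustration; the slicing logic is the same as `split_list` above.

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


sentences = [f"Sentence {n}." for n in range(1, 26)]  # 25 placeholder sentences
chunks = split_list(sentences, slice_size=10)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)

# Rough size of a full chunk under the 36-tokens-per-sentence assumption.
tokens_per_sentence = 36
estimated_tokens_full_chunk = 10 * tokens_per_sentence
print(estimated_tokens_full_chunk)
```

The last chunk on a page is usually shorter than the others, which is one reason very small chunks show up and are filtered out later.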


In [11]:
# sample example from group
random.sample(pdf_content, k=1)

[{'page_number': 21,
  'page_char_count': 3275,
  'page_word_count': 518,
  'page_sentence_count_raw': 21,
  'page_token_count': 818.75,
  'images': [],
  'text': '8/27/18    AC 61-65H  19  certificate with an airplane multiengine rating, the applicant would still be required to  receive a logbook endorsement that attests to satisfactory demonstration of instructional  proficiency in stall awareness, spin entry, spins, and spin recovery procedures. However,  the training would be required to be performed in an airplane, most likely a single-engine  land airplane, that is not restricted from spins.  26.2 Flight Instructor Certificate with Rotorcraft Category and Helicopter Class Rating.  For applicants applying for a flight instructor certificate with rotorcraft category and  helicopter class rating, the applicant will be required to demonstrate touchdown  autorotations. An examiner may accept, at his or her discretion, a logbook endorsement in  lieu of demonstrating the touchdown porti

**Splitting each chunk into its own item**

Each chunk will be embedded into its own numerical representation, so we create a new list in which every chunk is a separate item.

In [12]:
import re

# split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pdf_content):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # join sentences together into a paragraph-like structure, aka a chunk (single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # get stats about chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 char

        pages_and_chunks.append(chunk_dict)

100%|██████████| 58/58 [00:00<00:00, 9290.77it/s]


In [14]:
# view random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 17,
  'sentence_chunk': 'An authorized flight instructor must supervise the training and experience required in furtherance of a higher level of certificate. Each flight conducted by the recreational pilot under those provisions must be authorized by the flight instructor’s endorsement in the recreational pilot’s logbook. 2. Recreational pilots may act as PIC on a flight that is in Class B, C, and D airspace, at an airport located in Class B, C, or D airspace, and to, from, through, or at an airport having an operational control tower after having received the required training and endorsement in accordance with',
  'chunk_char_count': 594,
  'chunk_word_count': 96,
  'chunk_token_count': 148.5}]

Remove chunks that are too short to be useful (for example stray dates or page headers) before embedding.

In [17]:
import pandas as pd

df = pd.DataFrame(pages_and_chunks)

# show a random chunk with 10 tokens or fewer
min_token_length = 10
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(1).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]} | Page number: {row["page_number"]}')

pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

Chunk token count: 2.0 | Text: 12-31-19 | Page number: 46


[{'page_number': 0,
  'sentence_chunk': 'U. S. Department of Transportation Federal Aviation Administration Advisory Circular Subject: Certification: Pilots and Flight and Ground Instructors Date: 8/27/18 AC No: 61-65H Initiated by: AFS-800 Change: This advisory circular (AC) provides guidance for pilot and instructor applicants, pilots, flight instructors, ground instructors, and examiners on the certification standards, knowledge test procedures, and other requirements in Title 14 of the Code of Federal Regulations (14 CFR) part 61. Rick Domingo Executive Director, Flight Standards Service',
  'chunk_char_count': 557,
  'chunk_word_count': 75,
  'chunk_token_count': 139.25},
 {'page_number': 1,
  'sentence_chunk': '8/27/18 AC 61-65H ii CONTENTS Paragraph Page 1 Purpose of This Advisory Circular (AC) 1 2 Audience 1 3 Safety Message 1 4 Where You Can Find This AC 1 5 What This AC Cancels 1 6 Related Reading Material (current editions)  1 7 Summary of Changes  2 8 Pilot Training and Tes
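
The filter relies on the rough rule of thumb that one token is about four characters. The sketch below reproduces that arithmetic without pandas; `estimate_tokens` is a hypothetical helper name and the second sample string is illustrative.

```python
def estimate_tokens(text: str) -> float:
    """Approximate token count using the ~4 characters per token heuristic."""
    return len(text) / 4


min_token_length = 10
samples = [
    "12-31-19",  # the short chunk printed above: 8 characters
    "A student pilot may not operate an aircraft in solo flight at night.",
]

# Keep only chunks strictly above the threshold, as the DataFrame filter does.
kept = [s for s in samples if estimate_tokens(s) > min_token_length]
print([estimate_tokens(s) for s in samples])
print(kept)
```

Because this is only an estimate, a tokenizer-based count may differ; the threshold is meant to drop obvious fragments, not to be exact.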

**Embedding text chunks**

In [18]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cpu") # choose device to load model to

# Notes: this will embed using local computing power. Learn more about the benefits (if any)
# of computing in the cloud

# Make sure the model is on the CPU
embedding_model.to("cpu")

# Embed each chunk one by one
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

100%|██████████| 127/127 [00:22<00:00,  5.58it/s]


In [20]:
pages_and_chunks_over_min_token_len[0]

{'page_number': 0,
 'sentence_chunk': 'U. S. Department of Transportation Federal Aviation Administration Advisory Circular Subject: Certification: Pilots and Flight and Ground Instructors Date: 8/27/18 AC No: 61-65H Initiated by: AFS-800 Change: This advisory circular (AC) provides guidance for pilot and instructor applicants, pilots, flight instructors, ground instructors, and examiners on the certification standards, knowledge test procedures, and other requirements in Title 14 of the Code of Federal Regulations (14 CFR) part 61. Rick Domingo Executive Director, Flight Standards Service',
 'chunk_char_count': 557,
 'chunk_word_count': 75,
 'chunk_token_count': 139.25,
 'embedding': array([-6.03953516e-03, -3.27171981e-02, -1.24537116e-02, -1.97144281e-02,
         3.52388583e-02, -2.52726618e-02,  4.48607728e-02,  5.23233376e-02,
        -2.36425102e-02,  2.69900472e-03,  3.13407667e-02, -3.44269983e-02,
         7.40530565e-02,  4.55278298e-03,  2.25385502e-02, -4.18319218e-02,
   

**Save to file**

In [21]:
# Save to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)
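
One caveat worth checking: CSV stores every cell as text, so the `embedding` column will come back as a string rather than an array when the file is reloaded. A minimal sketch of converting such a cell back into floats follows, assuming the cell holds NumPy's default space-separated representation. `parse_embedding` is a hypothetical helper and the sample cell is shortened to three values.

```python
def parse_embedding(cell: str) -> list[float]:
    """Turn a string like '[0.1 0.2\\n 0.3]' back into a list of floats."""
    return [float(value) for value in cell.strip().strip("[]").split()]


# Shortened example of how an array cell may look after a CSV round trip.
cell = "[-6.03953516e-03 -3.27171981e-02\n  3.52388583e-02]"
vector = parse_embedding(cell)
print(vector)
```

A binary format that preserves arrays would avoid this step entirely, if the CSV file is not needed for inspection.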

## Add to Pinecone

Using Pinecone will allow us to retrieve documents seamlessly. The embeddings are stored persistently outside the notebook, so they can be retrieved whenever they are needed.

First we will need to initialize a connection.

In [27]:
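# note: the .env file is only loaded in the next cell, so this is None
# unless the variable is already set in the environment
api_key = os.getenv("PINECONE_API_KEY")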

In [28]:
# Initialize connection
import dotenv
dotenv.load_dotenv()

from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

In [29]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

In [32]:
index_name = "rag-retriever-v2"

In [33]:
# check if index already exists
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=768,
        metric="cosine",
        spec=spec,
    )
# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We now have a new, empty index to hold our embeddings.
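
Because the index was created with `dimension=768`, every vector we upsert should have exactly 768 values. A small pre-flight check along these lines can catch a mismatch before any network call is made. `wrong_dimension` is a hypothetical helper and the records below are dummy data.

```python
def wrong_dimension(records: list[dict], expected_dim: int = 768) -> list[int]:
    """Return positions of records whose embedding length differs from expected_dim."""
    return [
        position
        for position, record in enumerate(records)
        if len(record["embedding"]) != expected_dim
    ]


# Dummy records: the second one has the wrong length on purpose.
records = [
    {"sentence_chunk": "first", "embedding": [0.0] * 768},
    {"sentence_chunk": "second", "embedding": [0.0] * 384},
]
bad_positions = wrong_dimension(records)
print(bad_positions)
```

Running the same check on `pages_and_chunks_over_min_token_len` should return an empty list if the embedding model and the index agree.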

In [39]:
# create ids for embeddings
for i, item in enumerate(pages_and_chunks_over_min_token_len):
    item["ids"] = str(i)

In [43]:
pages_and_chunks_over_min_token_len[1]

{'page_number': 1,
 'sentence_chunk': '8/27/18 AC 61-65H ii CONTENTS Paragraph Page 1 Purpose of This Advisory Circular (AC) 1 2 Audience 1 3 Safety Message 1 4 Where You Can Find This AC 1 5 What This AC Cancels 1 6 Related Reading Material (current editions)  1 7 Summary of Changes  2 8 Pilot Training and Testing 2 9 Knowledge Tests  2 10 Completion of Ground Training or a Home Study Curriculum  3 11 Verification of Identity, Age, and English Language Standard  4 12 Practical Tests 5 13 Light-Sport Aircraft (LSA) With a Single Seat 6 14 Prerequisites for Practical Tests  8 15 Student Pilot Certification 8 16 Acceptance of a Student Pilot Application 9 17 Student Pilot Certificate Eligibility  9 18 Student Pilot Application Process: IACRA 10 19 Student Pilot Application Process: Paper FAA Form 8710-1 11 20 Pre-Solo Requirements and Privileges 11 21 Sport Pilot Certification 14 22 Recreational Pilot Certification  15 23 Private Pilot Certification 16 24 Commercial Pilot Certification 1

In [45]:
batch_size = 100 # number of vectors to upsert at once

for i in tqdm(range(0, len(pages_and_chunks_over_min_token_len), batch_size)):
    # find end of batch
    i_end = min(len(pages_and_chunks_over_min_token_len), i+batch_size)
    meta_batch = pages_and_chunks_over_min_token_len[i:i_end]
    # get ids
    ids_batch = [x["ids"] for x in meta_batch]
    # get precomputed embeddings
    embeddings = [x["embedding"] for x in meta_batch]

    # keep only the fields we want stored as metadata
    meta_batch = [{
        "text": x["sentence_chunk"],
        "ids": x["ids"],
        "page_number": x["page_number"],
        "chunk_char_count": x["chunk_char_count"],
        "chunk_word_count": x["chunk_word_count"],
        "chunk_token_count": x["chunk_token_count"],
    } for x in meta_batch]
    # upsert to pinecone as (id, vector, metadata) tuples
    to_upsert = list(zip(ids_batch, embeddings, meta_batch))
    index.upsert(vectors=to_upsert)

100%|██████████| 2/2 [00:03<00:00,  1.73s/it]
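
The progress bar shows two iterations because the 127 embedded chunks are sent in batches of 100. The sketch below reproduces only the index arithmetic of that loop; `batch_bounds` is a hypothetical helper name.

```python
def batch_bounds(total: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) slice bounds covering `total` items in batches."""
    return [
        (start, min(total, start + batch_size))
        for start in range(0, total, batch_size)
    ]


# 127 chunks were embedded earlier; batch_size matches the upsert loop.
bounds = batch_bounds(total=127, batch_size=100)
print(bounds)
```

The final batch is simply whatever remains, so no chunk is dropped or sent twice.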


These are now uploaded into Pinecone with the necessary metadata.

We can confirm that the notebook variables and Pinecone agree by looking records up by their ids.
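
One way to sketch that comparison is a small function that checks a local record against the metadata stored for the same id. The remote dictionary would come from a Pinecone lookup, which is not called here; the values below are shortened, made-up stand-ins, and `metadata_matches` is a hypothetical helper. Numeric metadata may come back as floats, as in `'page_number': 14.0` in the query result shown in this notebook, which Python still treats as equal to the integer.

```python
METADATA_KEYS = [
    "ids",
    "page_number",
    "chunk_char_count",
    "chunk_word_count",
    "chunk_token_count",
]


def metadata_matches(local: dict, remote_metadata: dict) -> bool:
    """Compare a notebook record with the metadata stored for the same id."""
    if local["sentence_chunk"] != remote_metadata["text"]:
        return False
    return all(local[key] == remote_metadata[key] for key in METADATA_KEYS)


# Illustrative local record and the metadata assumed to be stored remotely.
local_record = {
    "ids": "36",
    "page_number": 14,
    "sentence_chunk": "3. Proficiency and Safety.",
    "chunk_char_count": 26,
    "chunk_word_count": 4,
    "chunk_token_count": 6.5,
}
remote_metadata = {
    "ids": "36",
    "page_number": 14.0,
    "text": "3. Proficiency and Safety.",
    "chunk_char_count": 26.0,
    "chunk_word_count": 4.0,
    "chunk_token_count": 6.5,
}
is_match = metadata_matches(local_record, remote_metadata)
print(is_match)
```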

**Embedding our query:**

In [50]:
query = "what do i need to sign for a student to solo?"
res = embedding_model.encode(query)

xq = res.tolist()

res = index.query(vector=xq, top_k=2, include_metadata=True)

In [51]:
res

{'matches': [{'id': '36',
              'metadata': {'chunk_char_count': 1019.0,
                           'chunk_token_count': 254.75,
                           'chunk_word_count': 172.0,
                           'ids': '36',
                           'page_number': 14.0,
                           'text': '3. Proficiency and Safety. Prior to '
                                   'conducting a solo flight, a student pilot '
                                   'must have demonstrated satisfactory '
                                   'proficiency and safety, as judged by an '
                                   'authorized instructor, on the maneuvers '
                                   'and procedures required by § 61.87 in the '
                                   'make and model of aircraft or similar make '
                                   'and model of aircraft to be flown. Refer '
                                   'to § 61.87(c)(2). The student must also '
                   

We can retrieve relevant documents by finding items that are similar to our query. 
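
"Similar" here means cosine similarity, the metric the index was created with. As a minimal sketch of what that score measures, the plain-Python version below compares toy two-dimensional vectors; real embeddings have 768 dimensions and Pinecone computes the score on its side.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]  # longer, but pointing the same way
orthogonal = [0.0, 3.0]      # unrelated direction

score_same = cosine_similarity(query_vec, same_direction)
score_orthogonal = cosine_similarity(query_vec, orthogonal)
print(score_same, score_orthogonal)
```

Only the direction of a vector matters for this metric, not its length, so a chunk is ranked by how closely its embedding points the same way as the query's.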

Now we will wrap this in a retrieval function.

In [54]:
limit = 3750

def retrieve(query):
    res = embedding_model.encode(query)

    # retrieve from Pinecone
    xq = res.tolist()

    # get relevant documents
    res = index.query(vector=xq, top_k=2, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]

    # build our prompt with the retrieved context included
    prompt_start = (
        "Answer the question based on the context below. \n\n"+
        "Context: \n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # add contexts in ranked order until the next one would push the joined
    # text past the character limit; the top match is always kept
    separator = "\n\n---\n\n"
    selected = []
    for context in contexts:
        if selected and len(separator.join(selected + [context])) > limit:
            break
        selected.append(context)

    return prompt_start + separator.join(selected) + prompt_end

In [56]:
# first we retrieve relevant items from pinecone
query_with_context = retrieve(query)
print(query_with_context)

Answer the question based on the context below. 

Context: 
3. Proficiency and Safety. Prior to conducting a solo flight, a student pilot must have demonstrated satisfactory proficiency and safety, as judged by an authorized instructor, on the maneuvers and procedures required by § 61.87 in the make and model of aircraft or similar make and model of aircraft to be flown. Refer to § 61.87(c)(2). The student must also meet the FAA English language standard as stated in the ACS or PTS. 20.3 Ninety Calendar-Day Endorsement. A student pilot may not operate an aircraft in solo flight unless that student pilot has received an endorsement in the student’s logbook for the specific make and model aircraft to be flown by an authorized instructor who gave the training within the 90 calendar-days preceding the date of the flight. Refer to § 61.87(n). 20.4 Solo at Night. A student pilot may not operate an aircraft in solo flight at night unless that student pilot has received the training required b

**Using OpenAI for LLM**

For our generator we will use OpenAI's API to create content. This can be swapped for a local model down the road if needed, but for now this is sufficient.

In [60]:
from openai import OpenAI

In [69]:
client = OpenAI()

# Define the query as a list of message objects
query = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "what is an airplane?"}
]

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=query
)

print(completion.choices[0].message.content)

An airplane is a powered flying vehicle with fixed wings and a fuselage. It is typically used for transportation of passengers and goods through the Earth's atmosphere. Airplanes rely on various technologies, such as engines and aerodynamics, to generate lift and thrust to stay airborne and travel through the air.


In [70]:
def complete(prompt):
    query = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=query
    )

    return completion.choices[0].message.content

In [74]:
# test function
query = "what do i need to sign for a student to solo?"

print(complete(query))

In order for a student to solo, you typically need to sign a student solo endorsement on their student pilot certificate. This endorsement will be provided by the flight instructor once they have determined that the student is ready to fly solo. Additionally, you may need to sign the student's logbook to signify that they have met the necessary requirements for solo flight. It's important to ensure that the student meets all relevant federal aviation regulations and training requirements before allowing them to fly solo.


In [72]:
print(complete(query_with_context))

For a student to solo, you would need to sign an endorsement in the student's logbook indicating that the student has met the pre-solo requirements, demonstrated satisfactory proficiency and safety on the maneuvers and procedures required, and received the necessary training for the specific make and model of aircraft to be flown. Additionally, the endorsement must be given within the 90 calendar-days preceding the date of the solo flight for the specific aircraft being flown.
