## EchoPDF

EchoPDF is a Retrieval-Augmented Generation (RAG) tool that lets users upload a PDF document, ask questions about its content, and receive answers grounded in the document's text. Designed to make documents more accessible, EchoPDF combines NLP and deep learning to extract and retrieve the relevant passages, so responses come directly from the uploaded PDF.


In [5]:
import os
import requests
from dotenv import load_dotenv

# from utils.helper_functions import open_and_read_pdf

In [6]:
load_dotenv()

# Path to pdf
pdf_path = "human_nutrition.pdf"

# Import the pdf
if not os.path.exists(pdf_path):
    print("[INFO]: File doesn't exist, downloading...")

    # The download URL is read from the `pdf_url` environment variable (.env)
    url = os.getenv("pdf_url")
    if url is None:
        raise ValueError("Environment variable `pdf_url` is not set.")

    # A timeout keeps the cell from hanging on an unresponsive server
    response = requests.get(url, timeout=30)

    # Check if request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(pdf_path, "wb") as file:
            file.write(response.content)
        print(f"[INFO]: File has been downloaded and saved as {pdf_path}")
    else:
        print(f"[INFO]: Failed to download the file. Status code: {response.status_code}")
else:
    print("[INFO]: File already exists.")


[INFO]: File already exists.


In [19]:
import fitz
from tqdm.auto import tqdm

def format_text(text: str) -> str:
    """
    Performs text formatting and returns formatted text
    """
    # Replace newlines with spaces and trim surrounding whitespace
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens the pdf, creates one dictionary of stats and text per page, and returns the list
    """
    document = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(document)):
        text = page.get_text()
        text = format_text(text)
        pages_and_texts.append({
            "page_number": page_number,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count": len(text) / 4,  # rough estimate: ~4 characters per token
            "text": text
        })

    return pages_and_texts

In [20]:
# Let's open the pdf and read its content
pages_and_text = open_and_read_pdf(pdf_path="human_nutrition.pdf")
pages_and_text[:5]

1208it [00:01, 690.81it/s]


[{'page_number': 0,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': 1,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''},
 {'page_number': 2,
  'page_char_count': 320,
  'page_word_count': 54,
  'page_sentence_count_raw': 1,
  'page_token_count': 80.0,
  'text': 'Human Nutrition: 2020  Edition  UNIVERSITY OF HAWAI‘I AT MĀNOA  FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM  ALAN TITCHENAL, SKYLAR HARA,  NOEMI ARCEO CAACBAY, WILLIAM  MEINKE-LAU, YA-YUN YANG, MARIE  KAINOA FIALKOWSKI REVILLA,  JENNIFER DRAPER, GEMADY  LANGFELDER, CHERYL GIBBY, CHYNA  NICOLE CHUN, AND ALLISON  CALABRESE'},
 {'page_number': 3,
  'page_char_count': 212,
  'page_word_count': 32,
  'page_sentence_count_raw': 1,
  'page_token_count': 53.0,
  'text': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science 
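The `page_token_count` values above come from dividing the character count by 4. The sketch below isolates that heuristic so the numbers are easy to check by hand. The helper name `estimate_token_count` and its `chars_per_token` parameter are hypothetical (not part of the notebook), and the 4-characters-per-token ratio is a rule of thumb for English text, not the count a real tokenizer would return.

```python
def estimate_token_count(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate: character count divided by an assumed ratio."""
    return len(text) / chars_per_token


title = "Human Nutrition: 2020 Edition"  # text of page 0 in the output above
print(len(title))                   # 29
print(estimate_token_count(title))  # 7.25
```

If exact counts matter later (for example, to respect an embedding model's input limit), the estimate would need to be replaced with the tokenizer of the model actually used.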

In [9]:
import random

random.sample(pages_and_text, k=2)

[{'page_number': 714,
  'page_char_count': 974,
  'page_word_count': 197,
  'page_sentence_count_raw': 15,
  'page_token_count': 243.5,
  'text': 'Food  Serving  Zinc (mg) Percent Daily Value  Oysters  3 oz.  74  493  Beef, chuck roast 3 oz.  7  47  Crab  3 oz.  6.5  43  Lobster  3 oz.  3.4  23  Pork loin  3 oz.  2.9  19  Baked beans  ½ c.  2.9  19  Yogurt, low fat  8 oz.  1.7  11  Oatmeal, instant  1 packet 1.1  7  Almonds  1 oz.  0.9  6  Fact Sheet for Health Professionals: Zinc. National Institute of  Health, Office of Dietary Supplements. https://ods.od.nih.gov/ factsheets/Zinc-HealthProfessional/. Updated February 11, 2016.  Accessed November 10, 2017.  Learning Activities  Technology Note: The second edition of the Human  Nutrition Open Educational Resource (OER) textbook  features interactive learning activities.\xa0 These activities are  available in the web-based textbook and not available in the  downloadable versions (EPUB, Digital PDF, Print_PDF, or  Open Document).  Learni

In [10]:
import pandas as pd

df = pd.DataFrame(pages_and_text)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | 1 | 0 | 1 | 1 | 0.0 | |
| 2 | 2 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | 3 | 212 | 32 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | 4 | 797 | 145 | 2 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [11]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 603.5 | 1148.0 | 198.3 | 9.97 | 287.0 |
| std | 348.86 | 560.38 | 95.76 | 6.19 | 140.1 |
| min | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 301.75 | 762.0 | 134.0 | 4.0 | 190.5 |
| 50% | 603.5 | 1231.5 | 214.5 | 10.0 | 307.88 |
| 75% | 905.25 | 1603.5 | 271.0 | 14.0 | 400.88 |
| max | 1207.0 | 2308.0 | 429.0 | 32.0 | 577.0 |
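Note that the minimum `page_word_count` is 1 even though the minimum `page_char_count` is 0. This follows from how `str.split(" ")` behaves: splitting an empty string on an explicit separator returns a list containing one empty string, and consecutive spaces produce empty entries as well. The sketch below shows the difference from `str.split()` with no argument; `count_words` is a hypothetical helper, not something the notebook defines.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words; empty text yields 0."""
    return len(text.split())


empty_page = ""
print(empty_page.split(" "))  # ['']
print(empty_page.split())     # []

double_spaced = "Human Nutrition: 2020  Edition"  # two spaces before "Edition"
print(len(double_spaced.split(" ")))  # 5
print(count_words(double_spaced))     # 4
```

Switching the notebook to `text.split()` would change the word-count statistics shown above, so the outputs would need to be regenerated.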


## Further text processing (splitting pages into sentences)

In [26]:
from spacy.lang.en import English

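nlp = English()

# Add a rule-based sentence splitter (no trained model required)
nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_text):
    # Pages with no text are skipped and get no "sentences" key
    if item["text"]:
        # Split the page into sentences and store them as plain strings
        sentences = nlp(item["text"]).sents
        item["sentences"] = [str(sentence) for sentence in sentences]

        item["page_sentence_count_spacy"] = len(item["sentences"])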

100%|██████████| 1208/1208 [00:01<00:00, 836.01it/s]
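The earlier `page_sentence_count_raw` value is computed with `text.split(". ")`, which is only a rough estimate: it ignores sentences ending in `?` or `!` and reports 1 for empty text. The sketch below contrasts it with a slightly stricter standard-library splitter. Both function names are hypothetical, and the regex is a stand-in for illustration only; it is not how spaCy's sentencizer works, and neither approach handles abbreviations such as "oz." correctly.

```python
import re


def count_sentences_naive(text: str) -> int:
    """Mirror of the notebook's raw count: split on a period followed by a space."""
    return len(text.split(". "))


def count_sentences_regex(text: str) -> int:
    """Split after '.', '!' or '?' when followed by whitespace; ignore empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


sample = "Is zinc essential? Yes! Oysters are a rich source."
print(count_sentences_naive(sample))  # 1
print(count_sentences_regex(sample))  # 3

abbreviation = "Oysters provide 74 mg per 3 oz. serving. Beef provides 7 mg."
print(count_sentences_naive(abbreviation))  # 3 (two real sentences)
print(count_sentences_regex(abbreviation))  # 3 (two real sentences)
```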


In [28]:
random.sample(pages_and_text, k=1)

[{'page_number': 178,
  'page_char_count': 1284,
  'page_word_count': 239,
  'page_sentence_count_raw': 13,
  'page_token_count': 321.0,
  'text': 'Image by  Shutterstock.  All Rights  Reserved.  Measuring Body Fat Content  Water, organs, bone tissue, fat, and muscle tissue make up a  person’s weight. Having more fat mass may be indicative of disease  risk, but fat mass also varies with sex, age, and physical activity  level. Females have more fat mass, which is needed for reproduction  and, in part, is a consequence of different levels of hormones. The  optimal fat content of a female is between 20 and 30 percent of  her total weight and for a male is between 12 and 20 percent. Fat  mass can be measured in a variety of ways. The simplest and lowest- cost way is the skin-fold test. A health professional uses a caliper to  measure the thickness of skin on the back, arm, and other parts of  the body and compares it to standards to assess body fatness. It is  a noninvasive and fairly accu

In [29]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1179.0 |
| mean | 603.5 | 1148.0 | 198.3 | 9.97 | 287.0 | 10.57 |
| std | 348.86 | 560.38 | 95.76 | 6.19 | 140.1 | 6.16 |
| min | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 301.75 | 762.0 | 134.0 | 4.0 | 190.5 | 5.0 |
| 50% | 603.5 | 1231.5 | 214.5 | 10.0 | 307.88 | 11.0 |
| 75% | 905.25 | 1603.5 | 271.0 | 14.0 | 400.88 | 15.0 |
| max | 1207.0 | 2308.0 | 429.0 | 32.0 | 577.0 | 28.0 |
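The `page_sentence_count_spacy` column has a count of 1179 rather than 1208 because the sentence-splitting loop skips pages whose text is empty, so those page dictionaries never receive the key and pandas fills the gap with NaN. One possible way to keep every page dictionary uniform is sketched below. The helper `fill_sentence_defaults` and the two sample pages are hypothetical and only illustrate the idea on plain dictionaries.

```python
def fill_sentence_defaults(pages: list[dict]) -> list[dict]:
    """Give every page a 'sentences' list and a sentence count, defaulting to empty."""
    for page in pages:
        page.setdefault("sentences", [])
        page.setdefault("page_sentence_count_spacy", 0)
    return pages


# Two sample pages: one processed by the loop, one skipped because its text is empty
pages = [
    {
        "page_number": 0,
        "text": "Human Nutrition: 2020 Edition",
        "sentences": ["Human Nutrition: 2020 Edition"],
        "page_sentence_count_spacy": 1,
    },
    {"page_number": 1, "text": ""},
]

missing_before = sum("sentences" not in page for page in pages)
fill_sentence_defaults(pages)
missing_after = sum("sentences" not in page for page in pages)
print(missing_before, missing_after)  # 1 0
```

Filling in zeros is a design choice: it makes later code simpler because every page has the same keys, but it also pulls the column's minimum and mean down, so the summary statistics above would change.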
