## EchoPDF

EchoPDF is a Retrieval-Augmented Generation (RAG) tool that lets users upload a PDF document, ask questions about its content, and receive answers grounded in the document's text. It combines NLP and deep learning to extract and retrieve the relevant passages, so responses come directly from the uploaded PDF.


In [15]:
import os
import requests
from dotenv import load_dotenv

# from utils.helper_functions import open_and_read_pdf

In [17]:
load_dotenv()

# Path to pdf
pdf_path = "human_nutrition.pdf"

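# Download the pdf if it isn't already on disk
if not os.path.exists(pdf_path):
    print("[INFO]: File doesn't exist, downloading...")
    file_name = pdf_path

    # The download URL is read from the `pdf_url` environment variable (.env)
    url = os.getenv("pdf_url")
    if url is None:
        raise ValueError("Environment variable 'pdf_url' is not set.")

    # A timeout keeps the cell from hanging on an unresponsive server
    response = requests.get(url, timeout=60)

    # Check if request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(file_name, "wb") as file:
            file.write(response.content)
        print(f"[INFO]: File has been downloaded and saved as {file_name}")
    else:
        print(f"[INFO]: Failed to download the file. Status code: {response.status_code}")
else:
    print("[INFO]: File already exists.")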


[INFO]: File already exists.


In [20]:
import fitz
from tqdm.auto import tqdm

def format_text(text: str) -> str:
    """
    Replaces "hello world" with "Hello World", strips leading and trailing
    whitespace, and returns the formatted text
    """
    cleaned_text = text.replace("hello world", "Hello World").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens the pdf, creates a dictionary of text and basic statistics for each page,
    and returns the list of dictionaries
    """
    document = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(document)):
        text = page.get_text()
        text = format_text(text=text)
        pages_and_texts.append({
            "page_number": page_number,  # zero-based index within the pdf
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count": len(text) / 4,  # rough estimate: ~4 characters per token
            "text": text
        })

    return pages_and_texts
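
The per-page statistics above are simple string heuristics, which explains some of the values they produce. The sketch below is a minimal illustration using only the standard library: it shows why an empty page is still counted as one word by `split(" ")` and how the characters-divided-by-4 token estimate behaves. The helper name `page_stats` is hypothetical and not part of the notebook.

```python
def page_stats(text: str) -> dict:
    """Compute the same rough per-page statistics as open_and_read_pdf."""
    return {
        "page_char_count": len(text),
        "page_word_count": len(text.split(" ")),
        "page_sentence_count_raw": len(text.split(". ")),
        "page_token_count": len(text) / 4,  # rough estimate, not a real tokenizer
    }

# Title page text: 29 characters, 4 words, estimated 7.25 tokens
print(page_stats("Human Nutrition: 2020 Edition"))

# An empty page: "".split(" ") returns [''], so the word and sentence counts are 1
print(page_stats(""))

# Splitting on any whitespace with split() would return an empty list instead
print(len("".split(" ")), len("".split()))  # 1 0
```

The counts are therefore best read as approximations: a page with no text still reports one word and one sentence, and the token count is only an estimate.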

In [22]:
# Let's open the pdf and read its content
# import fitz
# from tqdm.auto import tqdm
pages_and_text = open_and_read_pdf(pdf_path="human_nutrition.pdf")
pages_and_text[:5]

1208it [00:02, 594.03it/s]


[{'page_number': 0,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': 1,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''},
 {'page_number': 2,
  'page_char_count': 320,
  'page_word_count': 42,
  'page_sentence_count_raw': 1,
  'page_token_count': 80.0,
  'text': 'Human Nutrition: 2020 \nEdition \nUNIVERSITY OF HAWAI‘I AT MĀNOA \nFOOD SCIENCE AND HUMAN \nNUTRITION PROGRAM \nALAN TITCHENAL, SKYLAR HARA, \nNOEMI ARCEO CAACBAY, WILLIAM \nMEINKE-LAU, YA-YUN YANG, MARIE \nKAINOA FIALKOWSKI REVILLA, \nJENNIFER DRAPER, GEMADY \nLANGFELDER, CHERYL GIBBY, CHYNA \nNICOLE CHUN, AND ALLISON \nCALABRESE'},
 {'page_number': 3,
  'page_char_count': 212,
  'page_word_count': 30,
  'page_sentence_count_raw': 1,
  'page_token_count': 53.0,
  'text': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa F

In [23]:
import random

random.sample(pages_and_text, k=2)

[{'page_number': 787,
  'page_char_count': 767,
  'page_word_count': 124,
  'page_sentence_count_raw': 6,
  'page_token_count': 191.75,
  'text': 'Image by \nAllison \nCalabrese / \nCC BY 4.0 \nBuilding a Healthy Plate: Choose \nNutrient-Dense Foods \nClick on the different food groups listed to view their food gallery: \n• Fruits \n• Grains \n• Dairy \n• Vegetables \n• Protein \nPlanning a healthy diet using the MyPlate approach is not difficult. \nAccording to the icon, half of your plate should have fruits and \nvegetables, one-quarter should have whole grains, and one-quarter \nshould have protein. Dairy products should be low-fat or non-fat. \nThe ideal diet gives you the most nutrients within the fewest \ncalories. This means choosing nutrient-rich foods. \nFill half of your plate with red, orange, and dark green vegetables \nand fruits, such as kale, bok choy, kalo (taro), tomatoes, sweet \n746  |  MyPlate Planner'},
 {'page_number': 230,
  'page_char_count': 187,
  'page_word_c

In [25]:
import pandas as pd

df = pd.DataFrame(pages_and_text)
df.head()

|   | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | 1 | 0 | 1 | 1 | 0.0 | |
| 2 | 2 | 320 | 42 | 1 | 80.0 | Human Nutrition: 2020 \nEdition \nUNIVERSITY O... |
| 3 | 3 | 212 | 30 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | 4 | 797 | 114 | 2 | 199.25 | Contents \nPreface \nUniversity of Hawai‘i at ... |


In [26]:
df.describe().round(2)

|   | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 603.5 | 1148.0 | 172.31 | 9.97 | 287.0 |
| std | 348.86 | 560.38 | 86.27 | 6.18 | 140.1 |
| min | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 301.75 | 762.0 | 110.0 | 4.0 | 190.5 |
| 50% | 603.5 | 1231.5 | 182.5 | 10.0 | 307.88 |
| 75% | 905.25 | 1603.5 | 238.0 | 14.0 | 400.88 |
| max | 1207.0 | 2308.0 | 394.0 | 32.0 | 577.0 |
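
The summary reports a minimum `page_char_count` of 0, so at least one page has no extractable text (page 1 in the earlier output). Below is a minimal sketch of how such pages could be separated out before further processing. The function `split_blank_pages` is an illustrative assumption, not part of the notebook, and the sample records copy the values of the first two pages printed earlier.

```python
def split_blank_pages(pages: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (pages_with_text, blank_pages), split on page_char_count."""
    with_text = [p for p in pages if p["page_char_count"] > 0]
    blank = [p for p in pages if p["page_char_count"] == 0]
    return with_text, blank

# Sample records in the same shape as the dictionaries built by open_and_read_pdf
sample_pages = [
    {"page_number": 0, "page_char_count": 29, "text": "Human Nutrition: 2020 Edition"},
    {"page_number": 1, "page_char_count": 0, "text": ""},
]

with_text, blank = split_blank_pages(sample_pages)
print([p["page_number"] for p in with_text])  # [0]
print([p["page_number"] for p in blank])      # [1]
```

In the notebook itself, the same split could be applied to `pages_and_text`, since every record carries a `page_char_count` key.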
