## EchoPDF

EchoPDF is a Retrieval-Augmented Generation (RAG) tool that lets users upload a PDF document, ask questions about its content, and receive answers grounded in the document's text. It combines NLP and deep learning to extract the text of the PDF and retrieve the passages relevant to each question, so responses come directly from the uploaded document.


In [1]:
import os
import requests
from dotenv import load_dotenv

# from utils.helper_functions import open_and_read_pdf

In [2]:
load_dotenv()

# Path to pdf
pdf_path = "human_nutrition.pdf"

# Download the PDF if it is not already on disk
if not os.path.exists(pdf_path):
    print("[INFO]: File doesn't exist, downloading it.")
    file_name = pdf_path

    # The download URL is read from the "pdf_url" variable loaded by load_dotenv()
    url = os.getenv("pdf_url")
    if url is None:
        raise ValueError("Environment variable 'pdf_url' is not set.")

    response = requests.get(url, timeout=30)

    # Check if request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(file_name, "wb") as file:
            file.write(response.content)
        print(f"[INFO]: File has been downloaded and saved as {file_name}")
    else:
        print(f"[INFO]: Failed to download the file. Status code: {response.status_code}")
else:
    print("[INFO]: File already exists.")


[INFO]: File already exists.


In [3]:
import fitz
from tqdm.auto import tqdm

def format_text(text: str) -> str:
    """
    Performs text formatting and returns the formatted text
    """
    # Replace newlines with spaces and trim surrounding whitespace
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens the PDF and returns a list with one dictionary of statistics and text per page
    """
    document = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(document)):
        text = page.get_text()
        text = format_text(text=text)
        pages_and_texts.append({
                # Offset by 41, presumably so numbering lines up with the book's printed page numbers
                "page_number": page_number - 41,
                "page_char_count": len(text),
                # Note: split(" ") returns [''] for an empty page, so it counts as 1 word
                "page_word_count": len(text.split(" ")),
                "page_sentence_count_raw": len(text.split(". ")),
                # Rough estimate, assuming about 4 characters per token
                "page_token_count": len(text) / 4,
                "text": text
        })

    return pages_and_texts



In [4]:
# Let's open the PDF and read its content
pages_and_text = open_and_read_pdf(pdf_path="human_nutrition.pdf")
pages_and_text[:5]

1208it [00:01, 709.77it/s]


[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''},
 {'page_number': -39,
  'page_char_count': 320,
  'page_word_count': 54,
  'page_sentence_count_raw': 1,
  'page_token_count': 80.0,
  'text': 'Human Nutrition: 2020  Edition  UNIVERSITY OF HAWAI‘I AT MĀNOA  FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM  ALAN TITCHENAL, SKYLAR HARA,  NOEMI ARCEO CAACBAY, WILLIAM  MEINKE-LAU, YA-YUN YANG, MARIE  KAINOA FIALKOWSKI REVILLA,  JENNIFER DRAPER, GEMADY  LANGFELDER, CHERYL GIBBY, CHYNA  NICOLE CHUN, AND ALLISON  CALABRESE'},
 {'page_number': -38,
  'page_char_count': 212,
  'page_word_count': 32,
  'page_sentence_count_raw': 1,
  'page_token_count': 53.0,
  'text': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food 
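
The per-page statistics in this output come from plain string operations. The sketch below recomputes them for the first two pages shown; `page_stats` is a hypothetical helper name, not part of the notebook, and dividing the character count by 4 is only a rough rule of thumb (assuming about four characters per token), not a real tokenizer.

```python
def page_stats(text: str) -> dict:
    """Recompute the notebook's per-page statistics for one page of text."""
    return {
        "page_char_count": len(text),
        # str.split(" ") returns [''] for an empty string,
        # so an empty page is counted as having 1 word.
        "page_word_count": len(text.split(" ")),
        # Same quirk: an empty page is counted as 1 "sentence".
        "page_sentence_count_raw": len(text.split(". ")),
        # Rough estimate: assumes ~4 characters per token.
        "page_token_count": len(text) / 4,
    }


title_page = page_stats("Human Nutrition: 2020 Edition")
empty_page = page_stats("")

print(title_page)
print(empty_page)
```

Calling `text.split()` with no argument would give 0 words for an empty page instead of 1, at the cost of changing the counts reported in this notebook.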

In [5]:
import random

random.sample(pages_and_text, k=2)

[{'page_number': 1040,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''},
 {'page_number': 625,
  'page_char_count': 1327,
  'page_word_count': 250,
  'page_sentence_count_raw': 12,
  'page_token_count': 331.75,
  'text': 'Age Group  RDA (mg/day) UL (mg/day)  Infants (0–6 months)  200*  –  Infants (6–12 months)  260*  –  Children (1–3 years)  700  2,500  Children (4–8 years)  1,000  2,500  Children (9–13 years)  1,300  2,500  Adolescents (14–18 years)  1,300  2,500  Adults (19–50 years)  1,000  2,500  Adult females (50–71 years)  1,200  2,500  Adults, male & female (> 71 years) 1,200  2,500  * denotes Adequate Intake  Source: Ross AC, Manson JE, et al. The 2011 Report on Dietary  Reference Intakes for Calcium and Vitamin D from the Institute of  Medicine: What Clinicians Need to Know. J Clin Endocrinol Metab.  2011; 96(1), 53–8. http://www.ncbi.nlm.nih.gov/pubmed/21118827.  Accessed October 10, 2017.  Dietary Source

In [6]:
import pandas as pd

df = pd.DataFrame(pages_and_text)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 145 | 2 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [7]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 9.97 | 287.0 |
| std | 348.86 | 560.38 | 95.76 | 6.19 | 140.1 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.88 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.88 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 |
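
As a quick sanity check on the summary above, the `page_number` column can be reproduced without the PDF: 1208 pages indexed from 0, each shifted by the offset of 41 used in `open_and_read_pdf`. This is a minimal sketch; the constant names are made up for illustration, and it relies on `describe()` reporting the sample standard deviation, which is also what `statistics.stdev` computes.

```python
import statistics

NUM_PAGES = 1208   # page count reported by the progress bar
PAGE_OFFSET = 41   # offset subtracted in open_and_read_pdf

# Rebuild the page_number column: 0-based index minus the offset.
page_numbers = [index - PAGE_OFFSET for index in range(NUM_PAGES)]

summary = {
    "min": min(page_numbers),
    "max": max(page_numbers),
    "mean": statistics.mean(page_numbers),
    "std": round(statistics.stdev(page_numbers), 2),
}
print(summary)
```

The values should agree with the `min`, `max`, `mean`, and `std` rows for `page_number`, which confirms that the column is just a shifted page index.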


## Further text processing (splitting pages into sentences)

In [8]:
from spacy.lang.en import English

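# Blank English pipeline (tokenizer only, no trained model)
nlp = English()

# Add the rule-based sentence boundary detector
nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_text):
    # Split the page text into sentences and convert each spaCy Span to a plain string
    item["sentences"] = [str(sentence) for sentence in nlp(item["text"]).sents]

    # Count the sentences found by spaCy
    item["page_sentence_count_spacy"] = len(item["sentences"])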

100%|██████████| 1208/1208 [00:01<00:00, 1000.21it/s]


In [9]:
random.sample(pages_and_text, k=1)

[{'page_number': 37,
  'page_char_count': 1463,
  'page_word_count': 252,
  'page_sentence_count_raw': 11,
  'page_token_count': 365.75,
  'text': 'Results. This study was conducted on over four-thousand school  children, and found that iodized salt prevented goiter.  Conclusions. Seven other studies similar to Marine’s were  conducted in Italy and Switzerland, which also demonstrated the  effectiveness of iodized salt in treating goiter. In 1924, US public  health officials initiated the program of iodizing salt and started  eliminating the scourge of goiter. Today, more than 70% of American  households use iodized salt and many other countries have followed  the same public health strategy to reduce the health consequences  of iodine deficiency.  Career Connection  What are some of the ways in which you think like a  scientist, and use the scientific method in your everyday  life? Any decision-making process uses some aspect of the  scientific method. Think about some of the major de

In [10]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 9.97 | 287.0 | 10.32 |
| std | 348.86 | 560.38 | 95.76 | 6.19 | 140.1 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.88 | 10.0 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.88 | 15.0 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 | 28.0 |
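
The two sentence counts differ because `text.split(". ")` only breaks on a period followed by a space, and it returns one piece even for an empty page (hence a minimum of 1 rather than 0). The sketch below illustrates the gap using a regular expression as a stand-in for a punctuation-based splitter. The regex and both helper names are assumptions for illustration and do not reproduce spaCy's implementation.

```python
import re


def count_sentences_raw(text: str) -> int:
    """Naive count, as used for page_sentence_count_raw."""
    return len(text.split(". "))


def count_sentences_punct(text: str) -> int:
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty pieces so an empty page counts as 0 sentences.
    return len([part for part in parts if part])


sample = "What is fiber? It is a carbohydrate. Eat more of it!"

raw_count = count_sentences_raw(sample)
punct_count = count_sentences_punct(sample)
print(raw_count, punct_count)

# An empty page: the naive split still reports one "sentence".
print(count_sentences_raw(""), count_sentences_punct(""))
```

On the sample, the naive split misses the boundary after the question mark, so it reports fewer sentences than the punctuation-based splitter.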


## Splitting the sentences into chunks

In [11]:
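chunk_size = 10  # number of sentences per chunk

def split_sentences(input_list: list, slice_size: int = chunk_size) -> list[list]:
    """
    Splits input_list into consecutive sublists of at most slice_size items
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]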

In [12]:
# Loop through the pages and text and split sentences into chunks
for item in tqdm(pages_and_text):
    item["sentence_chunks"] = split_sentences(input_list=item["sentences"])
    item["num_chunks"] = len(item["sentence_chunks"])

100%|██████████| 1208/1208 [00:00<00:00, 565229.72it/s]
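
To make the chunking concrete, here is a small worked example on made-up data. It restates the slicing logic so that it runs on its own; `chunk_list` and the sample sentences are hypothetical, not part of the notebook. A page with 25 sentences and a chunk size of 10 yields chunks of 10, 10, and 5 sentences, so the number of chunks is the sentence count divided by the chunk size, rounded up.

```python
import math


def chunk_list(items: list, size: int = 10) -> list:
    """Split items into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# 25 made-up sentences standing in for one page's "sentences" list.
sentences = [f"Sentence {n}." for n in range(1, 26)]
chunks = chunk_list(sentences, size=10)

chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)

# The chunk count equals ceil(sentence_count / chunk_size).
print(len(chunks) == math.ceil(len(sentences) / 10))
```

A page with no sentences produces an empty list of chunks, so its `num_chunks` is 0.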
