
# Document Loaders for RAG

In [1]:
!pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-cpu openpyxl pacmap datasets langchain-community ragatouille

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-cloud-bigquery 2.34.4 requires packaging<22.0dev,>=14.3, but you have packaging 24.2 which is incompatible.
jupyterlab 4.2.5 requires jupyter-lsp>=2.0.0, but you have jupyter-lsp 1.5.1 which is incompatible.
jupyterlab-lsp 5.1.0 requires jupyter-lsp>=2.0.0, but you have jupyter-lsp 1.5.1 which is incompatible.
kfp 2.5.0 requires google-cloud-storage<3,>=2.2.1, but you have google-cloud-storage 1.44.0 which is incompatible.
kfp 2.5.0 requires requests-toolbelt<1,>=0.8.0, but you have requests-toolbelt 1.0.0 which is incompatible.
libpysal 4.9.2 requires shapely>=2.0.1, but you have shapely 1.8.5.post1 which is incompatible.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.9.1 which is incompatible.
thinc 8.3.2 requires numpy<2.1.0,>=2.0.0; python_version >= "3.9", but you have numpy 1.26.4

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/#using-pypdf

In [None]:
# https://arxiv.org/pdf/1706.03762

## Download PDF Helper Function

In [2]:
import requests

def download_pdf(url, save_path):
    """
    Downloads a PDF from the given URL and saves it to the specified path.

    Parameters:
        url (str): The URL of the PDF to download.
        save_path (str): The file path where the PDF should be saved.

    Returns:
        bool: True if the download was successful, False otherwise.
    """
    try:
        # Send a GET request to download the file; the timeout keeps a stalled server from hanging the call
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Write the content to a file
        with open(save_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"PDF downloaded and saved as '{save_path}'")
        return True
    except requests.exceptions.RequestException as e:
        print(f"Failed to download PDF. Error: {e}")
        return False

## Downloading a Single PDF from a Given URL

In [3]:
# Example usage
url = "https://arxiv.org/pdf/1706.03762"
save_path = "Attention_Is_All_You_Need.pdf"
download_pdf(url, save_path)

PDF downloaded and saved as 'Attention_Is_All_You_Need.pdf'


True

## Download Multiple PDFs from a given list of URLs

In [9]:
pdf_urls = [
    "https://arxiv.org/pdf/1706.03762",
    "https://arxiv.org/pdf/1801.06146",
    "https://arxiv.org/pdf/2103.15348",
]

for i, url in enumerate(pdf_urls, start=1):
    save_path = f"paper_{i}.pdf"
    download_pdf(url, save_path)


PDF downloaded and saved as 'paper_1.pdf'
PDF downloaded and saved as 'paper_2.pdf'
PDF downloaded and saved as 'paper_3.pdf'
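
The loop above saves files under positional names (`paper_1.pdf`, `paper_2.pdf`, ...), which makes it hard to tell later which paper is which. A minimal sketch of one alternative, assuming the last path segment of each URL is a usable identifier (as it is for the arXiv URLs listed here). The helper name `filename_from_url` and its `default` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url: str, default: str = "document.pdf") -> str:
    """Derive a local PDF file name from the last path segment of a URL."""
    # Only the path component is used, so query strings and fragments are ignored.
    name = PurePosixPath(urlparse(url).path).name
    if not name:
        # URLs such as "https://example.com/" have no usable last segment.
        return default
    # Append the extension only when the URL does not already end in ".pdf".
    return name if name.lower().endswith(".pdf") else f"{name}.pdf"


pdf_urls = [
    "https://arxiv.org/pdf/1706.03762",
    "https://arxiv.org/pdf/1801.06146",
    "https://arxiv.org/pdf/2103.15348",
]

for url in pdf_urls:
    print(filename_from_url(url))
```

Each result could then be passed as `save_path` to `download_pdf`, so the arXiv identifier stays visible in the `source` metadata of the loaded pages.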


## Exploring the LangChain document loaders

In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("paper_1.pdf")
pages = loader.load_and_split()

In [6]:
pages[0]

Document(metadata={'source': 'paper_1.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transfo

In [8]:
print(pages[0].page_content)

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Exp

## Extracting Image Content from PDFs (OCR compatibility)

To extract text from images embedded in a PDF, pass `extract_images=True` to `PyPDFLoader`, for example:
`PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)`
This option requires the `rapidocr-onnxruntime` package.
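
Since image extraction depends on an optional package, it may help to check for it before enabling the flag. The following is a minimal sketch using only the standard library. It assumes the PyPI package `rapidocr-onnxruntime` is imported under the module name `rapidocr_onnxruntime`; the helper name `module_available` is hypothetical.

```python
from importlib.util import find_spec


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be found for import."""
    return find_spec(name) is not None


# Assumption: the OCR backend installs a module named `rapidocr_onnxruntime`.
use_ocr = module_available("rapidocr_onnxruntime")
print(f"extract_images={use_ocr}")
```

The resulting boolean could be passed as `extract_images=use_ocr`, so the loader falls back to plain text extraction when the OCR backend is missing.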

In [11]:
!pip install rapidocr-onnxruntime

Collecting rapidocr-onnxruntime
  Downloading rapidocr_onnxruntime-1.4.0-py3-none-any.whl.metadata (1.3 kB)
Collecting onnxruntime>=1.7.0 (from rapidocr-onnxruntime)
  Downloading onnxruntime-1.20.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting coloredlogs (from onnxruntime>=1.7.0->rapidocr-onnxruntime)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime>=1.7.0->rapidocr-onnxruntime)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Downloading rapidocr_onnxruntime-1.4.0-py3-none-any.whl (14.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.9/14.9 MB[0m [31m73.5 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[?25hDownloading onnxruntime-1.20.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (13.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.3/13.3 MB[0m [31m78.8 MB/s[0m eta [36m

In [12]:
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)
pages = loader.load()
pages[4].page_content

'LayoutParser: A Uniﬁed Toolkit for DL-Based DIA 5\nTable 1: Current layout detection models in the LayoutParser model zoo\nDataset Base Model1 Large ModelNotes\nPubLayNet [38] F / M M Layouts of modern scientiﬁc documents\nPRImA [3] M - Layouts of scanned modern magazines and scientiﬁc reports\nNewspaper [17] F - Layouts of scanned US newspapers from the 20th century\nTableBank [18] F F Table region on modern scientiﬁc and business document\nHJDataset [31] F / M - Layouts of history Japanese documents\n1For each dataset, we train several models of diﬀerent sizes for diﬀerent needs (the trade-oﬀ between accuracy\nvs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101\nbackbones [13], respectively. One can train models of diﬀerent architectures, like Faster R-CNN [28] (F) and Mask\nR-CNN [12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained\nusing the ResNet 101 backbone. The platform is ma

In [14]:
print(pages[3].page_content)

4 Z. Shen et al.
Efficient Data Annotation
C u s t o m i z e d  M o d e l  T r a i n i n g
Model Cust omization
DI A Model Hub
DI A Pipeline Sharing
Community Platform
La y out Detection Models
Document Images 
T h e  C o r e  L a y o u t P a r s e r  L i b r a r y
OCR Module St or age & VisualizationLa y out Data Structur e
Fig. 1: The overall architecture of LayoutParser. For an input document image,
the core LayoutParser library provides a set of oﬀ-the-shelf tools for layout
detection, OCR, visualization, and storage, backed by a carefully designed layout
data structure. LayoutParser also supports high level customization via eﬃcient
layout annotation and model training functions. These improve model accuracy
on the target samples. The community platform enables the easy sharing of DIA
models and whole digitization pipelines to promote reusability and reproducibility.
A collection of detailed documentation, tutorials and exemplar projects make
LayoutParser easy to learn and use.
Al
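
The extracted text above contains letter-spaced figure labels such as `C u s t o m i z e d  M o d e l  T r a i n i n g`, which tend to hurt chunking and retrieval. Below is a rough heuristic sketch for collapsing them, assuming that letters are separated by single spaces and words by two or more spaces, as in this particular output. The function name `collapse_letter_spacing` is hypothetical, and the rule will not repair every pattern (for example, `DI A Model Hub` is left unchanged).

```python
import re


def collapse_letter_spacing(line: str) -> str:
    """Join runs of single characters separated by single spaces into words."""
    # Two or more spaces are treated as the boundary between words.
    words = re.split(r" {2,}", line.strip())
    fixed = []
    for word in words:
        tokens = word.split(" ")
        # Only collapse when every token is one character long.
        if len(tokens) > 1 and all(len(token) == 1 for token in tokens):
            fixed.append("".join(tokens))
        else:
            fixed.append(word)
    # Side effect: runs of spaces between segments become a single space.
    return " ".join(fixed)


sample = [
    "C u s t o m i z e d  M o d e l  T r a i n i n g",
    "T h e  C o r e  L a y o u t P a r s e r  L i b r a r y",
    "4 Z. Shen et al.",
]
for line in sample:
    print(collapse_letter_spacing(line))
```

Applied line by line to `pages[3].page_content`, this would normalize the two spaced-out labels while leaving ordinary sentences as they are.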