PDF Document Extraction and Indexing

This example builds a pipeline that extracts text, tables, and figures from a PDF document. It embeds the extracted text, tables, and images and writes the embeddings into a vector store. The example uses open-source components such as Docling for document parsing and Elasticsearch as the vector store (ChromaDB is also supported, see below), making it easy to deploy in commercial settings.

The pipeline is hosted as an endpoint on the server running in one of the containers. The endpoint can be called from any Python application.

Tools and Libraries Used

  • Orchestration: Indexify
  • PDF parsing: Docling
  • Text splitting: Langchain
  • Text and image embedding: SentenceTransformers
  • Vector store: Elasticsearch

Start the Server Endpoint

docker compose up

Deploy the Graph

The Graph contains all the code that performs PDF parsing, embedding, and writing to the vector DB. We will deploy the Graph on the server to create an endpoint for our workflow. Make sure to deploy the right graph before running the example.

pip install indexify tensorlake
python workflow.py

This stage deploys the workflow on the server. At this point, you can also open the UI to see the deployed Graph.

After this, you can call the endpoint with PDFs to make Indexify start parsing documents.

Calling the Endpoint

from indexify import RemoteGraph

graph = RemoteGraph.by_name("Extract_pages_tables_images_pdf")
# Pass in a publicly available PDF from the internet
invocation_id = graph.run(block_until_done=True, url="https://arxiv.org/pdf/2501.12948")

Outputs

You can read the output of every function of the Graph. For example, to get the chunks from the documents:

chunks = graph.output(invocation_id, "chunk_text")

Vector Search

Once the documents are processed, you can run vector search queries against Elasticsearch. Here is some sample code for that.

The indexes created in Elasticsearch are text_embeddings and image_embeddings. Port 9200 is forwarded from the container to the host, so you can query the indexes directly.
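As a minimal sketch of such a query, the snippet below builds an Elasticsearch kNN search body and sends it to the forwarded port using only the standard library. The vector field name ("embedding") and the source field ("text") are assumptions; check the Elasticsearch writer functions in this example for the actual mapping.

```python
import json
import urllib.request


def build_knn_query(query_vector, k=5, field="embedding"):
    """Build an Elasticsearch kNN search body.

    The field name "embedding" is an assumption; match it to the mapping
    created by the writer functions in this example.
    """
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
        },
        "_source": ["text"],  # assumed field holding the chunk text
    }


def search(index, body, host="http://localhost:9200"):
    """POST the query to the Elasticsearch port forwarded to the host."""
    req = urllib.request.Request(
        f"{host}/{index}/_search",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example usage: embed the query text with the same SentenceTransformers
# model used at indexing time, then run:
#   vec = model.encode("What is the paper about?").tolist()
#   hits = search("text_embeddings", build_knn_query(vec))
```

The query vector must come from the same embedding model that was used when indexing, otherwise the similarity scores are meaningless.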

Customization

Copy the folder, modify the code as you like and simply upload the new Graph.

python workflow.py

Using Chroma DB

The ChromaDB collections are populated automatically by the ChromaDBWriter class. The collections used in the example are named text_embeddings and image_embeddings. The database runs inside the container on port 8000, which is forwarded to the host for convenience.

The code to retrieve from ChromaDB is here.
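For reference, here is a minimal retrieval sketch, assuming the default collection names and forwarded port described above, and the chromadb client package installed. The query embedding should be produced with the same SentenceTransformers model used at indexing time.

```python
def query_chroma(query_embedding, n_results=5, collection_name="text_embeddings"):
    """Query a ChromaDB collection populated by ChromaDBWriter.

    The collection name and port come from this example's defaults; adjust
    them if you changed the writer. chromadb is imported lazily so this
    module still loads where the package is absent.
    """
    import chromadb

    client = chromadb.HttpClient(host="localhost", port=8000)
    collection = client.get_collection(collection_name)
    # Pass a precomputed vector; embed the query text with the same model
    # that was used when writing the collection.
    return collection.query(
        query_embeddings=[query_embedding], n_results=n_results
    )
```

Querying with a precomputed embedding (rather than raw text) avoids depending on whatever default embedding function the client would otherwise attach to the collection.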

Swapping OCR Engines

You can swap out Docling for another open-source or proprietary engine. The code here needs to be updated to use the other library.

Using GPU

You have to make a couple of changes to use GPUs for PDF parsing.

  1. Uncomment the lines in the pdf-parser-executor block that are marked as enabling GPU support.
  2. Use the gpu_image in the PDFParser, extract_chunks and extract_images classes/functions so that the workflow routes them to the GPU-enabled image.
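For step 1, the lines to uncomment will look roughly like the standard Docker Compose GPU reservation below; this is a sketch of the general shape, not the exact contents of this example's compose file, so check the pdf-parser-executor block for the real lines.

```yaml
# Hypothetical sketch of a Compose GPU reservation; the actual lines to
# uncomment live in the pdf-parser-executor block of the compose file.
services:
  pdf-parser-executor:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

This requires the NVIDIA Container Toolkit to be installed on the host so Docker can expose the GPU to the container.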