Offline PDF Parser using Docling

This project provides a complete offline PDF processing solution using Docling, capable of extracting text, tables, and images from PDF documents without requiring an internet connection.

Features

Completely Offline: All models are pre-downloaded and stored locally
Comprehensive Extraction: Extracts text, tables, images, and document structure
Multiple Output Formats: JSON, Markdown, and plain text
GPU Acceleration: Uses CUDA for faster processing when available
Table Recognition: Advanced table structure recognition and extraction
OCR Support: Optical Character Recognition for scanned documents

Setup

1. Environment Setup

# Create conda environment
conda create -n docling python=3.12 -y
conda activate docling

# Install docling
pip install docling

2. Download Models (One-time setup)

# Download all required models for offline usage
docling-tools models download

This downloads the models to ~/.cache/docling/models/ (approximately 1.2 GB).
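To confirm the download landed where expected, a quick size check can help. The following is a sketch using only the standard library; the function name `dir_size_gb` is hypothetical. The ~1.2 GB figure is approximate, so a much smaller total is only a hint that the download may be incomplete, not a hard rule.

```python
from pathlib import Path


def dir_size_gb(root: Path) -> float:
    """Return the total size of all regular files under root, in GB (10**9 bytes)."""
    if not root.is_dir():
        return 0.0
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total_bytes / 1e9


if __name__ == "__main__":
    # Default download location stated in this README.
    models_dir = Path.home() / ".cache" / "docling" / "models"
    print(f"{models_dir}: {dir_size_gb(models_dir):.2f} GB")
```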

Usage

Basic Usage

# Activate environment
conda activate docling

# Run the parser
python final_offline_parser.py

Files

final_offline_parser.py - Main parser script (recommended)
offline_pdf_parser.py - Original parser script
fixed_offline_parser.py - Fixed version of the original parser
test_offline.py - Test script to verify setup
debug_document_structure.py - Debug script for document structure analysis

Output

The parser creates an output/ directory with:

companies_house_document_content.json - Complete structured data
companies_house_document_content.md - Markdown formatted output
companies_house_document_text.txt - Plain text output
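The naming pattern above can be sketched as a small helper. This is an illustration, not the parser's actual code: `output_paths` is a hypothetical name, and it assumes each output file name is derived from the input PDF's stem.

```python
from pathlib import Path


def output_paths(pdf_path: str, out_dir: str = "output") -> dict[str, Path]:
    """Map each output format to its file path, derived from the PDF's stem."""
    stem = Path(pdf_path).stem
    out = Path(out_dir)
    return {
        "json": out / f"{stem}_content.json",
        "markdown": out / f"{stem}_content.md",
        "text": out / f"{stem}_text.txt",
    }


if __name__ == "__main__":
    for fmt, path in output_paths("companies_house_document.pdf").items():
        print(f"{fmt}: {path}")
```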

Example Results

For the companies_house_document.pdf:

Pages processed: 32
Tables found: 20
Figures found: 3
Text extracted: 76,975 characters
Processing time: ~4 minutes (with GPU acceleration)

Technical Details

Offline Configuration

The parser is configured to run completely offline:

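from pathlib import Path
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Local model directory created by `docling-tools models download`
artifacts_path = Path.home() / ".cache" / "docling" / "models"

pipeline_options = PdfPipelineOptions(
    artifacts_path=artifacts_path,
    enable_remote_services=False,  # No internet required
    do_table_structure=True,       # Table recognition
    do_ocr=True,                   # OCR processing
    do_chunking=True,              # Document chunking
)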
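For context, options like these are typically passed to a `DocumentConverter`. The following is a minimal sketch, not the contents of final_offline_parser.py: the function name `convert_offline` is hypothetical, only the offline, table, and OCR options are set, and the import paths and export methods are those of recent Docling releases and may differ in older versions.

```python
import json
from pathlib import Path


def convert_offline(pdf_path: str, artifacts_path: Path, out_dir: Path) -> None:
    """Convert one PDF with local models and write JSON and Markdown outputs."""
    # Imported here so the helper can be defined even before docling is installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(
        artifacts_path=artifacts_path,   # local model directory
        enable_remote_services=False,    # never call out to the network
        do_table_structure=True,
        do_ocr=True,
    )
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    document = converter.convert(pdf_path).document

    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    (out_dir / f"{stem}_content.md").write_text(document.export_to_markdown(), encoding="utf-8")
    (out_dir / f"{stem}_content.json").write_text(
        json.dumps(document.export_to_dict(), indent=2), encoding="utf-8"
    )


if __name__ == "__main__":
    convert_offline(
        "companies_house_document.pdf",
        Path.home() / ".cache" / "docling" / "models",
        Path("output"),
    )
```

The plain text output is omitted here to keep the sketch short.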

Model Requirements

Layout Model: Document layout analysis
Tableformer Model: Table structure recognition
Picture Classifier: Image classification
Code Formula Model: Code and mathematical formula recognition
EasyOCR Models: Text recognition

Troubleshooting

Common Issues

Models not found: Run docling-tools models download
CUDA errors: Ensure NVIDIA drivers and CUDA are properly installed
Memory issues: Reduce batch size or use CPU-only mode
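One way to force CPU-only mode without editing the parser is to hide the GPUs from the process. The sketch below relies on the standard CUDA environment variable `CUDA_VISIBLE_DEVICES`; the helper name `cpu_only_env` is hypothetical. It assumes the parser selects its device automatically, which may not hold if the script pins a device explicitly.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None) -> dict:
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # an empty value exposes no GPUs to CUDA
    return env


if __name__ == "__main__":
    # Run the parser in a child process that cannot see any GPU.
    subprocess.run([sys.executable, "final_offline_parser.py"], env=cpu_only_env(), check=True)
```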

Testing Setup

python test_offline.py

This will verify that all models are downloaded and the environment is properly configured.
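The kind of check involved can be sketched as follows. This is not the contents of test_offline.py; `check_environment` is a hypothetical helper that only tests whether the `docling` package is importable and whether the models directory exists and is non-empty.

```python
import importlib.util
from pathlib import Path


def check_environment(models_dir: Path, package: str = "docling") -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if importlib.util.find_spec(package) is None:
        problems.append(f"package '{package}' is not installed")
    if not models_dir.is_dir():
        problems.append(f"models directory not found: {models_dir}")
    elif not any(models_dir.iterdir()):
        problems.append(f"models directory is empty: {models_dir}")
    return problems


if __name__ == "__main__":
    issues = check_environment(Path.home() / ".cache" / "docling" / "models")
    print("OK" if not issues else "\n".join(issues))
```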

Performance

GPU Processing: ~4 minutes for 32-page document
CPU Processing: ~8-10 minutes for 32-page document
Memory Usage: ~2-4GB RAM during processing
Storage: ~1.2GB for all models
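For a rough per-page view of these figures, the timings can be divided by the page count. This is plain arithmetic on the numbers above; `seconds_per_page` is a hypothetical helper. It assumes time scales roughly linearly with page count, which is not guaranteed for documents with a different mix of tables or scanned pages.

```python
def seconds_per_page(minutes: float, pages: int) -> float:
    """Convert a total processing time in minutes to seconds per page."""
    return minutes * 60 / pages


if __name__ == "__main__":
    pages = 32
    print(f"GPU: ~{seconds_per_page(4, pages):.1f} s/page")
    print(f"CPU: ~{seconds_per_page(8, pages):.1f} to {seconds_per_page(10, pages):.1f} s/page")
```

For the 32-page example this works out to about 7.5 seconds per page on GPU and roughly 15 to 19 seconds per page on CPU.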

Model Structure

For detailed information about all model files and folder structure, see MODEL_STRUCTURE.md.

License

This project uses Docling, which is licensed under the Apache License 2.0.
