This project provides a complete offline PDF processing solution using Docling, capable of extracting text, tables, and images from PDF documents without requiring an internet connection.
- Completely Offline: All models are pre-downloaded and stored locally
- Comprehensive Extraction: Extracts text, tables, images, and document structure
- Multiple Output Formats: JSON, Markdown, and plain text
- GPU Acceleration: Uses CUDA for faster processing when available
- Table Recognition: Advanced table structure recognition and extraction
- OCR Support: Optical Character Recognition for scanned documents
# Create conda environment
conda create -n docling python=3.12 -y
conda activate docling
# Install docling
pip install docling
# Download all required models for offline usage
docling-tools models download
This downloads the models to ~/.cache/docling/models/ (approximately 1.2 GB).
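To confirm the download finished, one option is to measure the size of the model directory and compare it with the figure above. The sketch below is a rough check only: the helper name `dir_size_bytes` and the 1.0 GB threshold are assumptions made for illustration, not part of Docling.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Default download location stated above; adjust if you used a custom path.
    models_dir = Path.home() / ".cache" / "docling" / "models"
    size_gb = dir_size_bytes(models_dir) / 1e9
    print(f"{models_dir}: {size_gb:.2f} GB")
    # Hypothetical threshold, set a little below the ~1.2 GB quoted above.
    if size_gb < 1.0:
        print("Models look incomplete; try: docling-tools models download")
```

A size well below 1.2 GB usually means the download was interrupted and should be repeated.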
# Activate environment
conda activate docling
# Run the parser
python final_offline_parser.py
- final_offline_parser.py: Main parser script (recommended)
- offline_pdf_parser.py: Original parser script
- fixed_offline_parser.py: Fixed version
- test_offline.py: Test script to verify setup
- debug_document_structure.py: Debug script for document structure analysis
The parser creates an output/ directory with:
- companies_house_document_content.json: Complete structured data
- companies_house_document_content.md: Markdown formatted output
- companies_house_document_text.txt: Plain text output
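The output names above follow a pattern based on the input PDF's file name. A minimal sketch of that pattern, assuming the parser derives all three paths from the PDF stem (the helper `output_paths` is hypothetical and not taken from the parser script):

```python
from pathlib import Path


def output_paths(pdf_path: str, out_dir: str = "output") -> dict[str, Path]:
    """Derive the three output file paths from the input PDF name."""
    stem = Path(pdf_path).stem  # file name without directory or ".pdf"
    base = Path(out_dir)
    return {
        "json": base / f"{stem}_content.json",
        "markdown": base / f"{stem}_content.md",
        "text": base / f"{stem}_text.txt",
    }


if __name__ == "__main__":
    for kind, path in output_paths("companies_house_document.pdf").items():
        print(f"{kind}: {path}")
```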
For companies_house_document.pdf:
- Pages processed: 32
- Tables found: 20
- Figures found: 3
- Text extracted: 76,975 characters
- Processing time: ~4 minutes (with GPU acceleration)
The parser is configured to run completely offline:
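from docling.datamodel.pipeline_options import PdfPipelineOptions

# artifacts_path must point at the local model directory, e.g. ~/.cache/docling/models
pipeline_options = PdfPipelineOptions(
    artifacts_path=artifacts_path,
    enable_remote_services=False,  # No internet required
    do_table_structure=True,  # Table recognition
    do_ocr=True,  # OCR processing
    do_chunking=True,  # Document chunking
)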
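For context, the sketch below shows one way such options might be passed to Docling's `DocumentConverter`. The function names `convert_offline` and `default_artifacts_path` are hypothetical, and this is not necessarily how final_offline_parser.py is written. The `do_chunking` flag is left out because it is not confirmed here that every Docling version accepts it.

```python
from pathlib import Path


def default_artifacts_path() -> str:
    """Default model location used by `docling-tools models download`."""
    return str(Path.home() / ".cache" / "docling" / "models")


def convert_offline(pdf_path: str, artifacts_path: str):
    """Convert one PDF using locally stored models (assumes docling is installed)."""
    # Imported lazily so this module can be loaded without docling present.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(
        artifacts_path=artifacts_path,
        enable_remote_services=False,  # no network calls
        do_table_structure=True,
        do_ocr=True,
    )
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    # The returned document offers exports such as export_to_markdown().
    return converter.convert(pdf_path).document


# Example call (requires docling and a real PDF file):
# doc = convert_offline("companies_house_document.pdf", default_artifacts_path())
# print(doc.export_to_markdown()[:500])
```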
Models included in the download:
- Layout Model: Document layout analysis
- Tableformer Model: Table structure recognition
- Picture Classifier: Image classification
- Code Formula Model: Code and mathematical formula recognition
- EasyOCR Models: Text recognition
Common issues:
- Models not found: Run docling-tools models download
- CUDA errors: Ensure NVIDIA drivers and CUDA are properly installed
- Memory issues: Reduce batch size or use CPU-only mode
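For the CUDA and memory issues listed above, a simple fallback is to check for a usable GPU before choosing a device. This is a minimal sketch, assuming PyTorch is available in the environment (Docling installs it as a dependency). The helper `pick_device` is hypothetical.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"  # explicit CPU-only mode, e.g. to work around memory issues
    try:
        import torch  # optional here; fall back to CPU if it is missing
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

The result could then be passed to Docling through its accelerator options on the pipeline configuration. The exact import path for those options may differ between Docling versions, so check the documentation for the installed release.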
python test_offline.py
This will verify that all models are downloaded and the environment is properly configured.
- GPU Processing: ~4 minutes for 32-page document
- CPU Processing: ~8-10 minutes for 32-page document
- Memory Usage: ~2-4GB RAM during processing
- Storage: ~1.2GB for all models
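The timings above can be turned into a per-page figure to estimate other documents. The sketch assumes processing time scales linearly with page count, which is a simplification: pages with many tables or scanned content will take longer. Both helper names are hypothetical.

```python
def seconds_per_page(total_minutes: float, pages: int) -> float:
    """Average processing time per page, in seconds."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return total_minutes * 60 / pages


def estimate_minutes(pages: int, sec_per_page: float) -> float:
    """Linear estimate of the total time for a document, in minutes."""
    return pages * sec_per_page / 60


if __name__ == "__main__":
    gpu = seconds_per_page(4, 32)       # 7.5 s/page from the GPU figure above
    cpu_low = seconds_per_page(8, 32)   # 15.0 s/page
    cpu_high = seconds_per_page(10, 32) # 18.75 s/page
    print(f"GPU: {gpu} s/page, CPU: {cpu_low}-{cpu_high} s/page")
    print(f"100 pages on GPU: ~{estimate_minutes(100, gpu)} min")  # ~12.5 min
```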
For detailed information about all model files and folder structure, see MODEL_STRUCTURE.md.
This project uses Docling, which is licensed under the Apache License 2.0.