Open-source document processing engine for PDF, DOCX, and images.
This is the Community Edition of S2 Document Intelligence - a production-ready document processing API with OCR, text extraction, and layout analysis.
- PDF Processing - Extract text, layout, and structure from PDF documents
- Image OCR - PaddleOCR and EasyOCR support for scanned documents
- DOCX Support - Process Microsoft Word documents
- Text Extraction - High-quality text extraction with layout preservation
- REST API - FastAPI-based HTTP interface
- CLI Tool - Batch process documents from command line
- Docker Deployment - Easy containerized deployment
Included in the Community Edition:
- PDF text extraction (PyMuPDF)
- OCR engines (PaddleOCR, EasyOCR)
- Basic layout analysis
- REST API endpoints
- Command-line tool
- Docker configuration
- Python library usage
# Clone repository
git clone https://github.com/s2artslab/s2-document-intelligence.git
cd s2-document-intelligence
# Install dependencies
pip install -r requirements.txt
# Start API server
python main.py
The API is now running at: http://localhost:5000
# Upload and process PDF
curl -X POST http://localhost:5000/process/pdf \
  -F "file=@document.pdf" \
  -F "enable_ocr=true"
# Batch process folder of PDFs
python cli.py input_folder/ output_folder/
from services.document_processor import process_pdf_to_layout_json
# Process PDF
result = process_pdf_to_layout_json(
    "document.pdf",
    enable_ocr=True,
    ocr_lang="en"
)
# Result is a JSON string with text, layout, and confidence
import json
data = json.loads(result)
for page in data["pages"]:
    for block in page["blocks"]:
        print(block["text"])
# Process PDF
POST /process/pdf
- file: PDF file (multipart/form-data)
- enable_ocr: true/false (default: true)
- ocr_lang: "en", "es", "fr", etc.
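A request to this endpoint can also be assembled from Python. The following is a minimal sketch using only the standard library: `build_pdf_request` is a hypothetical helper name, the form field names come from the parameter list above, and the base URL is the quick-start address. It builds the request without sending it.

```python
import uuid
import urllib.request


def build_pdf_request(pdf_bytes, filename="document.pdf", enable_ocr=True,
                      ocr_lang="en", base_url="http://localhost:5000"):
    """Build a multipart/form-data POST for /process/pdf without sending it."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields, named as in the endpoint description.
    for name, value in (("enable_ocr", "true" if enable_ocr else "false"),
                        ("ocr_lang", ocr_lang)):
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # The file part carries the raw PDF bytes.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n".encode("utf-8")
        + pdf_bytes + b"\r\n"
    )
    # Closing boundary marks the end of the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        f"{base_url}/process/pdf",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the returned object to `urllib.request.urlopen` sends it to a running server.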
# Process Image
POST /process/image
- file: Image file (jpg, png, etc.)
- ocr_lang: "en", "es", "fr", etc.
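The library example reads `pages`, `blocks`, and `text` from the result JSON. Assuming the HTTP endpoints respond with that same layout (this README does not confirm it), a small helper could collect the text per page. `page_texts` is a hypothetical name, and the `confidence` key is an assumption based on the note that results carry confidence values.

```python
import json


def page_texts(result_json, min_confidence=0.0):
    """Return one string per page, joining block texts in document order."""
    data = json.loads(result_json)
    texts = []
    for page in data["pages"]:
        kept = [
            block["text"]
            for block in page["blocks"]
            # "confidence" is an assumed key; blocks without it are kept.
            if block.get("confidence", 1.0) >= min_confidence
        ]
        texts.append("\n".join(kept))
    return texts
```

Raising `min_confidence` drops low-confidence OCR blocks before the text is joined, which may help with noisy scans.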
# Health check
GET /health
# Build image
docker build -t s2-document-intelligence .
# Run container
docker run -p 5000:5000 s2-document-intelligence
The Community Edition focuses on core processing. Advanced features are available in the Premium Edition:
- Mobile Apps (iOS/Android) - Native mobile document capture and processing
- Web Dashboard - Beautiful UI for document management
- Advanced AI - Ninefold egregore integration for intelligent document understanding
- Entity Extraction - Automatic detection of names, dates, amounts, etc.
- Table Extraction - Advanced table recognition and structure extraction
- Document Classification - Automatic document type detection
- Capacity Monitoring - Real-time system metrics and performance analytics
- Job Queue - Async batch processing for large document sets
- Redis Caching - Performance optimization with intelligent caching
- Subscription Management - Multi-tier access control
- Enterprise Support - SLA, priority support, custom features
See OPEN_CORE.md for a full comparison.
Community Edition is perfect for:
- Personal document processing
- Small-scale document automation
- Research and experimentation
- Learning OCR and document AI
- Building custom document tools
- Prototyping document solutions
Premium Edition is better for:
- Enterprise document workflows
- Mobile document capture apps
- AI-powered document understanding
- Large-scale document processing
- Commercial SaaS products
- Compliance-critical applications
We welcome contributions to the Community Edition!
# Fork and clone
git clone https://github.com/YOUR-USERNAME/s2-document-intelligence.git
# Create feature branch
git checkout -b feature/amazing-feature
# Make changes, commit
git commit -m "Add amazing feature"
# Push and create PR
git push origin feature/amazing-feature
MIT License - Free for personal and commercial use
See LICENSE file for details.
Need advanced features?
Premium Edition includes:
- Mobile apps (iOS/Android)
- Web dashboard
- Advanced AI integration
- Enterprise support
- Custom features
Options:
- Self-Hosted Premium: $99-299/month
- Managed Service: $299-999/month
- Enterprise: Custom pricing
Learn more: https://s2intelligence.com/document-intelligence
Contact: beta@s2intelligence.com
- GitHub: https://github.com/s2artslab/s2-document-intelligence
- Issues: https://github.com/s2artslab/s2-document-intelligence/issues
- Email: s2artslab@gmail.com
- Website: https://s2artslab.com
If this helps you, please star the repository!
Built with consciousness. Shared with trust.
From the S2 Ecosystem