This project provides a robust system for loading documents from various sources into RAG (Retrieval-Augmented Generation) systems. The default configuration populates databases for the Soliplex RAG system, but it can be customized to support other storage systems and language models.
Document ingestion can be a time-consuming and error-prone process. Soliplex Ingester aims to provide a robust, scalable, and observable pathway from source systems to one or more vector databases. It provides a user interface and REST endpoints for following the progress of documents, and it supports restarting failed processes.
This ingester has been tested with workflows containing hundreds of documents and with PDF files of more than one thousand pages (on appropriate hardware), reflecting the project's focus on scalability and reliability.
Soliplex Ingester has been designed alongside agents that are able to load data from filesystems and source control management systems, but other tools can be used as well.
- Getting Started Guide - Quick start tutorial for new users
  - Installation and setup
  - First batch processing
  - Basic operations
  - Common patterns
- Architecture Overview - System design and components
  - Component overview
  - Technology stack
  - Scalability considerations
- API Reference - Complete REST API documentation
  - All endpoints with examples
  - Request/response formats
  - Data models
  - Error handling
- Workflow System - Workflow concepts and configuration
  - Workflow step types
  - Configuration files
  - Execution model
  - Custom step handlers
  - Retry logic
  - Monitoring and troubleshooting
- Database Schema - Data models and relationships
  - All database tables
  - Field descriptions
  - Relationships and constraints
  - Query examples
  - Migration guide
- Configuration Guide - Environment variables and settings
  - All configuration options
  - Environment-specific configs
  - Performance tuning
  - Secrets management
  - Troubleshooting
- CLI Reference - Command-line interface guide
  - `si-cli` management commands
  - `si-diag` read-only diagnostic commands
  - Usage examples
  - Deployment patterns
  - Systemd integration
- Docker Deployment - Docker Compose setup and production deployment
  - Quick start guide
  - Service configuration
  - GPU setup and optimization
  - Authentication with OAuth2 Proxy
  - Production best practices
  - Comprehensive troubleshooting
- Parameter Sets - Document processing configuration
  - YAML schema reference
  - Creating and managing parameter sets
  - Embedding model configuration
  - Chunking strategies
  - Storage configuration
  - Best practices and examples
- Start with Getting Started
- Review Architecture to understand the system
- Explore API Reference for integration
- Read Architecture for system design
- Study Workflows to understand processing
- Check Database for data models
- Review Configuration for environment setup
- Start with Docker Deployment for production setup
- Review Configuration for deployment settings
- Use CLI Reference for management commands
- Configure Parameter Sets for document processing
- Monitor using API Reference stats endpoints
- Troubleshoot with Workflows and Docker guides
Step-by-step guide to:
- Install Soliplex Ingester
- Configure the system
- Create your first batch
- Ingest and process documents
- Monitor progress
- Deploy to production
Audience: New users, evaluators
Time to complete: 15-30 minutes
Technical overview covering:
- System components (API server, workers, storage)
- Workflow execution model
- Data flow and processing pipeline
- Storage backends (database, files, vectors)
- Scalability and performance
- Extension points
Audience: Developers, architects
Time to read: 15-20 minutes
Complete REST API reference including:
- Document ingestion endpoints
- Batch management APIs
- Workflow control and monitoring
- Parameter set configuration
- Data models and schemas
- Error handling
Audience: API consumers, integrators
Format: Reference documentation
In-depth workflow documentation:
- Workflow concepts and terminology
- Step types (ingest, parse, chunk, embed, store)
- YAML configuration format
- Lifecycle events
- Worker processing model
- Custom handler development
- Retry and error handling
- Performance tuning
- Troubleshooting guide
Audience: Power users, developers
Time to read: 25-30 minutes
Database schema reference covering:
- All table definitions
- Field types and constraints
- Relationships and foreign keys
- Enums and constants
- Migration procedures
- Query examples
- Indexing recommendations
- Backup and maintenance
Audience: Database administrators, developers
Format: Reference documentation
Comprehensive configuration guide:
- All environment variables
- Default values and types
- Configuration validation
- Environment-specific configs
- Performance tuning parameters
- Worker settings
- Storage configuration
- Secrets management
- Docker and Kubernetes examples
- Troubleshooting
Audience: DevOps, system administrators
Format: Reference + guide
Command-line tool reference:
- `si-cli` - Server management, workers, database operations
- `si-diag` - Read-only diagnostic CLI (batch, document, config, run-group, workflow, status)
- Usage examples
- Deployment patterns
- Systemd service files
- Docker usage
- Signal handling
- Platform notes
- Troubleshooting
Audience: System administrators, developers
Format: Reference documentation
Comprehensive Docker deployment guide:
- Quick start with docker-compose
- Service architecture and overview
- GPU configuration and setup
- Environment variable configuration
- Volume management and backups
- Load balancing with HAProxy
- Authentication with OAuth2 Proxy
- Production deployment checklist
- Performance tuning
- Detailed troubleshooting guide
Audience: DevOps engineers, system administrators
Time to read: 30-40 minutes
Parameter set configuration reference:
- Complete YAML schema documentation
- Parse, chunk, embed, and store configuration
- Creating parameter sets (file, API, Web UI)
- Managing and versioning parameter sets
- Embedding provider configuration (Ollama, OpenAI, Azure)
- Chunking strategies and best practices
- Real-world examples
- Troubleshooting guide
Audience: Data engineers, ML engineers, power users
Time to read: 25-30 minutes
Check config/ directory for:
- `workflows/*.yaml` - Example workflow definitions
- `params/*.yaml` - Example parameter sets
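To see which example files are present in a checkout, a short script can enumerate both folders. This is a minimal sketch that assumes it is run from the project root and that the layout matches the paths listed above; the function name `find_example_configs` is hypothetical and not part of the project.

```python
from pathlib import Path


def find_example_configs(config_dir: str = "config") -> dict[str, list[str]]:
    """Return the YAML file names found under workflows/ and params/.

    Assumption: config_dir points at the project's config/ directory.
    Missing subdirectories simply produce an empty list.
    """
    root = Path(config_dir)
    found: dict[str, list[str]] = {}
    for sub in ("workflows", "params"):
        folder = root / sub
        if folder.is_dir():
            found[sub] = sorted(p.name for p in folder.glob("*.yaml"))
        else:
            found[sub] = []
    return found
```

Calling `find_example_configs()` from the project root returns a dictionary with one sorted list of file names per folder, which can be compared against the output of `si-cli list-workflows` and `si-cli list-param-sets`.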
For production deployment with Docker Compose, see the Docker Deployment Guide.
Quick Start:
cd docker
docker-compose up -d

The docker-compose configuration includes:
- Soliplex Ingester - Main application (API + Worker)
- PostgreSQL - Document and workflow database
- Docling (x3) - PDF parsing services with GPU support and HAProxy load balancing
- Ollama - Embedding generation with GPU
- SeaweedFS - S3-compatible object storage
Access the application:
- Web UI: http://localhost:8002
- API Docs: http://localhost:8002/docs
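The services may need some time to start after `docker-compose up -d`, so it can help to poll one of the URLs above before opening the UI or running scripts. The following is a minimal sketch, not part of the project: the helper names `http_ok` and `wait_until_ready` are hypothetical, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(
    url: str = "http://localhost:8002/docs",
    attempts: int = 30,
    delay: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll url until probe succeeds or the attempts are used up."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next try
    return False
```

`wait_until_ready()` returns `True` as soon as the API docs page responds and `False` if it never does. The `probe` parameter exists so that the check can be swapped out, for example when an authenticating proxy sits in front of the application and a plain GET would not return a 2xx status.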
Comprehensive guide includes:
- Service configuration and resource requirements
- GPU setup and optimization
- Volume management and backups
- Authentication with OAuth2 Proxy
- Production deployment best practices
- Monitoring and maintenance
- Detailed troubleshooting
See docs/DOCKER.md for complete instructions.
When updating documentation:
- Keep it accurate - Test all examples before committing
- Stay consistent - Follow existing formatting and style
- Be comprehensive - Cover edge cases and gotchas
- Add examples - Show, don't just tell
- Update index - Modify this README when adding docs
- Format: Markdown with GitHub-flavored syntax
- Code blocks: Always specify language for syntax highlighting
- Examples: Include both curl and Python/script examples where applicable
- Cross-references: Link to related sections and documents
- Versioning: Note breaking changes and version requirements
Found an error or unclear section? Please:
- Open an issue describing the problem
- Suggest improvements via pull request
- Ask questions in discussions
- Version: 0.1.0
- Python: 3.12+
See LICENSE file in project root.
- Found a mistake? Open an issue
- Need clarification? Start a discussion
- Have suggestions? Submit a pull request
- HaikuRAG: https://github.com/ggozad/haiku.rag
- Docling: https://docling-project.github.io/docling/
- LanceDB: https://lancedb.com/docs/
- Soliplex: https://github.com/soliplex/soliplex
# Installation
pip install -e .
# Configuration
si-cli validate-settings
# Database
si-cli db-init
# Server
si-cli serve --reload # Development
si-cli serve --host 0.0.0.0 --workers 4 # Production
# Workers
si-cli worker # Start worker
# Inspection
si-cli list-workflows # List workflows
si-cli dump-workflow batch # View workflow
si-cli list-param-sets # List parameters
si-cli validate-haiku 1 # Validate batch
# Diagnostics (si-diag)
si-diag batch list # List batches
si-diag document find "quarterly" # Search documents by URI
si-diag document info sha256-abc... # Document details
si-diag config workflows # List workflow definitions
si-diag config params # List parameter sets
si-diag run-group list --batch-id 1 # List run groups
si-diag workflow list 1 # Workflow runs for a run group
si-diag status running # Currently running steps
si-diag status recent hour # Recent activity
si-diag status details 1 # Aggregated details (PostgreSQL)
# API
curl http://localhost:8000/docs # Swagger UI
curl http://localhost:8000/api/v1/batch/ # List batches
curl http://localhost:8000/api/v1/document/ingest-document # Load document into database
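
The batch listing request above can also be issued from Python. This is a minimal sketch using only the standard library; it assumes the server is reachable at the host and port used in the curl examples and that the endpoint returns a JSON body, whose shape is not assumed here. The helper names `api_url` and `list_batches` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Host and port taken from the curl examples; adjust for your deployment.
BASE_URL = "http://localhost:8000"


def api_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path, tolerating stray slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def list_batches(base_url: str = BASE_URL, timeout: float = 10.0):
    """GET /api/v1/batch/ and return the decoded JSON body unchanged.

    Assumption: the response body is JSON.
    """
    url = api_url("api/v1/batch/", base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `print(json.dumps(list_batches(), indent=2))` shows the same data as the curl call. For a Docker Compose deployment, pass the base URL of that deployment instead of the default.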