Soliplex Ingester

This project provides a robust system for loading documents from various sources into RAG (Retrieval-Augmented Generation) systems. The default configuration populates databases for the Soliplex RAG system, but it can be customized to support other storage systems and language models.

Document ingestion can be a time-consuming and error-prone process. Soliplex Ingester aims to provide a robust, scalable, and observable pathway from source systems to one or more vector databases. It provides a user interface and REST endpoints for following the progress of documents, and it supports restarting failed processes.

Scalability and reliability are primary design goals: the ingester has been tested with workflows containing hundreds of documents and with PDF files of over one thousand pages (on appropriate hardware).

Soliplex Ingester has been designed alongside agents that can load data from filesystems and source control management systems, but other tools can be used as well.

Documentation Index

Getting Started

  • Getting Started Guide - Quick start tutorial for new users
    • Installation and setup
    • First batch processing
    • Basic operations
    • Common patterns

Core Documentation

  • Architecture Overview - System design and components

    • Component overview
    • Technology stack
    • Scalability considerations
  • API Reference - Complete REST API documentation

    • All endpoints with examples
    • Request/response formats
    • Data models
    • Error handling
  • Workflow System - Workflow concepts and configuration

    • Workflow step types
    • Configuration files
    • Execution model
    • Custom step handlers
    • Retry logic
    • Monitoring and troubleshooting
  • Database Schema - Data models and relationships

    • All database tables
    • Field descriptions
    • Relationships and constraints
    • Query examples
    • Migration guide
  • Configuration Guide - Environment variables and settings

    • All configuration options
    • Environment-specific configs
    • Performance tuning
    • Secrets management
    • Troubleshooting
  • CLI Reference - Command-line interface guide

    • si-cli management commands
    • si-diag read-only diagnostic commands
    • Usage examples
    • Deployment patterns
    • Systemd integration
  • Docker Deployment - Docker Compose setup and production deployment

    • Quick start guide
    • Service configuration
    • GPU setup and optimization
    • Authentication with OAuth2 Proxy
    • Production best practices
    • Comprehensive troubleshooting
  • Parameter Sets - Document processing configuration

    • YAML schema reference
    • Creating and managing parameter sets
    • Embedding model configuration
    • Chunking strategies
    • Storage configuration
    • Best practices and examples

Quick Links

For New Users

  1. Start with Getting Started
  2. Review Architecture to understand the system
  3. Explore API Reference for integration

For Developers

  1. Read Architecture for system design
  2. Study Workflows to understand processing
  3. Check Database for data models
  4. Review Configuration for environment setup

For Operations

  1. Start with Docker Deployment for production setup
  2. Review Configuration for deployment settings
  3. Use CLI Reference for management commands
  4. Configure Parameter Sets for document processing
  5. Monitor using API Reference stats endpoints
  6. Troubleshoot with Workflows and Docker guides

Document Summaries

GETTING_STARTED.md

Step-by-step guide to:

  • Install Soliplex Ingester
  • Configure the system
  • Create your first batch
  • Ingest and process documents
  • Monitor progress
  • Deploy to production

Audience: New users, evaluators
Time to complete: 15-30 minutes


ARCHITECTURE.md

Technical overview covering:

  • System components (API server, workers, storage)
  • Workflow execution model
  • Data flow and processing pipeline
  • Storage backends (database, files, vectors)
  • Scalability and performance
  • Extension points

Audience: Developers, architects
Time to read: 15-20 minutes


API.md

Complete REST API reference including:

  • Document ingestion endpoints
  • Batch management APIs
  • Workflow control and monitoring
  • Parameter set configuration
  • Data models and schemas
  • Error handling

Audience: API consumers, integrators
Format: Reference documentation


WORKFLOWS.md

In-depth workflow documentation:

  • Workflow concepts and terminology
  • Step types (ingest, parse, chunk, embed, store)
  • YAML configuration format
  • Lifecycle events
  • Worker processing model
  • Custom handler development
  • Retry and error handling
  • Performance tuning
  • Troubleshooting guide

Audience: Power users, developers
Time to read: 25-30 minutes
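
The step sequence and retry behavior listed above can be pictured with a small, generic sketch. This is not the Soliplex Ingester implementation: the function names, the handler signature (a dict in, a dict out), and the fixed attempt limit are all assumptions made for illustration. WORKFLOWS.md describes the actual execution model.

```python
from typing import Callable

# Step names taken from the list above; the ordering is assumed to be linear.
STEP_ORDER = ["ingest", "parse", "chunk", "embed", "store"]

Handler = Callable[[dict], dict]  # hypothetical handler signature


def run_with_retry(step: Handler, doc: dict, max_attempts: int = 3) -> dict:
    """Call one step, retrying on any exception up to max_attempts times."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            return step(doc)
        except Exception as exc:  # a real system would narrow this
            last_error = exc
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error


def run_workflow(doc: dict, handlers: dict[str, Handler], max_attempts: int = 3) -> dict:
    """Pass a document through every step in STEP_ORDER, in order."""
    for name in STEP_ORDER:
        doc = run_with_retry(handlers[name], doc, max_attempts)
    return doc
```

Each handler receives the output of the previous step, so a failure that survives all retries stops the document at that step and leaves earlier results intact.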


DATABASE.md

Database schema reference covering:

  • All table definitions
  • Field types and constraints
  • Relationships and foreign keys
  • Enums and constants
  • Migration procedures
  • Query examples
  • Indexing recommendations
  • Backup and maintenance

Audience: Database administrators, developers
Format: Reference documentation


CONFIGURATION.md

Comprehensive configuration guide:

  • All environment variables
  • Default values and types
  • Configuration validation
  • Environment-specific configs
  • Performance tuning parameters
  • Worker settings
  • Storage configuration
  • Secrets management
  • Docker and Kubernetes examples
  • Troubleshooting

Audience: DevOps, system administrators
Format: Reference + guide


CLI.md

Command-line tool reference:

  • si-cli - Server management, workers, database operations
  • si-diag - Read-only diagnostic CLI (batch, document, config, run-group, workflow, status)
  • Usage examples
  • Deployment patterns
  • Systemd service files
  • Docker usage
  • Signal handling
  • Platform notes
  • Troubleshooting

Audience: System administrators, developers
Format: Reference documentation


DOCKER.md

Comprehensive Docker deployment guide:

  • Quick start with docker-compose
  • Service architecture and overview
  • GPU configuration and setup
  • Environment variable configuration
  • Volume management and backups
  • Load balancing with HAProxy
  • Authentication with OAuth2 Proxy
  • Production deployment checklist
  • Performance tuning
  • Detailed troubleshooting guide

Audience: DevOps engineers, system administrators
Time to read: 30-40 minutes


PARAMETER_SETS.md

Parameter set configuration reference:

  • Complete YAML schema documentation
  • Parse, chunk, embed, and store configuration
  • Creating parameter sets (file, API, Web UI)
  • Managing and versioning parameter sets
  • Embedding provider configuration (Ollama, OpenAI, Azure)
  • Chunking strategies and best practices
  • Real-world examples
  • Troubleshooting guide

Audience: Data engineers, ML engineers, power users
Time to read: 25-30 minutes
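
As a rough illustration of what a chunking strategy controls, the sketch below splits text into fixed-size character windows with overlap. It is a generic technique, not necessarily one of the strategies Soliplex Ingester ships; the function name, parameter names, and default values are assumptions. PARAMETER_SETS.md documents the options that are actually supported.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that share overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.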


Configuration Examples

Check the config/ directory for:

  • workflows/*.yaml - Example workflow definitions
  • params/*.yaml - Example parameter sets

Docker Deployment

For production deployment with Docker Compose, see the Docker Deployment Guide.

Quick Start:

cd docker
docker-compose up -d

The docker-compose configuration includes:

  • Soliplex Ingester - Main application (API + Worker)
  • PostgreSQL - Document and workflow database
  • Docling (x3) - PDF parsing services with GPU support and HAProxy load balancing
  • Ollama - Embedding generation with GPU
  • SeaweedFS - S3-compatible object storage

Access the application:

The comprehensive guide covers:

  • Service configuration and resource requirements
  • GPU setup and optimization
  • Volume management and backups
  • Authentication with OAuth2 Proxy
  • Production deployment best practices
  • Monitoring and maintenance
  • Detailed troubleshooting

See docs/DOCKER.md for complete instructions.

Documentation Maintenance

Contributing to Docs

When updating documentation:

  1. Keep it accurate - Test all examples before committing
  2. Stay consistent - Follow existing formatting and style
  3. Be comprehensive - Cover edge cases and gotchas
  4. Add examples - Show, don't just tell
  5. Update index - Modify this README when adding docs

Documentation Standards

  • Format: Markdown with GitHub-flavored syntax
  • Code blocks: Always specify language for syntax highlighting
  • Examples: Include both curl and Python/script examples where applicable
  • Cross-references: Link to related sections and documents
  • Versioning: Note breaking changes and version requirements

Feedback

Found an error or unclear section? Please:

  • Open an issue describing the problem
  • Suggest improvements via pull request
  • Ask questions in discussions

Version Information

  • Version: 0.1.0
  • Python: 3.12+

License

See LICENSE file in project root.


Getting Help

Documentation Issues

  • Found a mistake? Open an issue
  • Need clarification? Start a discussion
  • Have suggestions? Submit a pull request

Related Documentation


Quick Reference Card

# Installation
pip install -e .

# Configuration
si-cli validate-settings

# Database
si-cli db-init

# Server
si-cli serve --reload                    # Development
si-cli serve --host 0.0.0.0 --workers 4  # Production

# Workers
si-cli worker                            # Start worker

# Inspection
si-cli list-workflows                    # List workflows
si-cli dump-workflow batch               # View workflow
si-cli list-param-sets                   # List parameters
si-cli validate-haiku 1                  # Validate batch

# Diagnostics (si-diag)
si-diag batch list                       # List batches
si-diag document find "quarterly"        # Search documents by URI
si-diag document info sha256-abc...      # Document details
si-diag config workflows                 # List workflow definitions
si-diag config params                    # List parameter sets
si-diag run-group list --batch-id 1      # List run groups
si-diag workflow list 1                  # Workflow runs for a run group
si-diag status running                   # Currently running steps
si-diag status recent hour               # Recent activity
si-diag status details 1                 # Aggregated details (PostgreSQL)

# API
curl http://localhost:8000/docs          # Swagger UI
curl http://localhost:8000/api/v1/batch/ # List batches
curl http://localhost:8000/api/v1/document/ingest-document # Load document into database
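
For scripted access, the batch listing call above can also be made from Python using only the standard library. The following is a minimal sketch rather than an official client: it assumes the server is reachable at http://localhost:8000, as in the curl examples, and that the endpoint returns JSON. The helper names `api_url` and `list_batches` are illustrative.

```python
import json
import urllib.request
from urllib.parse import urljoin

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def api_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL with an API path from the quick reference."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def list_batches(base_url: str = BASE_URL, opener=urllib.request.urlopen):
    """GET /api/v1/batch/ and decode the body as JSON.

    Assumption: the response is JSON; its exact shape is defined in the
    API reference, not here. The opener argument exists so the HTTP call
    can be replaced in tests.
    """
    with opener(api_url("/api/v1/batch/", base_url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `list_batches()` returns the decoded response body. See the API reference for the response fields.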
