A comprehensive Rust toolkit for extracting and processing documentation from multiple file formats into a universal JSON structure, optimized for vector stores and RAG (Retrieval-Augmented Generation) systems.
Current Version: 0.3.1
Status: Production Ready
Python Bindings: Fully Functional
Documentation: Complete
- Universal JSON Output: Consistent format across all document types
- Multiple Format Support: PDF, TXT, JSON, CSV, DOCX
- Python Bindings: Full PyO3 integration with native performance
- Intelligent Text Processing: Smart chunking, cleaning, and metadata extraction
- Modular Architecture: A dedicated processor for each document type
- Vector Store Ready: Optimized output for embedding and indexing
- CLI Tools: Both a universal processor and format-specific binaries
- Rich Metadata: Comprehensive document- and chunk-level metadata
- Language Detection: Automatic language detection capabilities
- Performance Optimized: Fast processing with detailed timing information
 
- Rust 1.70+ (for compilation)
- Cargo (comes with Rust)
 
git clone https://github.com/WillIsback/doc_loader.git
cd doc_loader
cargo build --release

After building, you'll have access to these CLI tools:

- doc_loader - Universal document processor
- pdf_processor - PDF-specific processor
- txt_processor - Plain text processor
- json_processor - JSON document processor
- csv_processor - CSV file processor
- docx_processor - DOCX document processor
Process any supported document type with the main binary:
# Basic usage
./target/release/doc_loader --input document.pdf
# With custom options
./target/release/doc_loader \
    --input document.pdf \
    --output result.json \
    --chunk-size 1500 \
    --chunk-overlap 150 \
    --detect-language \
    --pretty

Use specialized processors for specific formats:
# Process a PDF
./target/release/pdf_processor --input report.pdf --pretty
# Process a CSV with analysis
./target/release/csv_processor --input data.csv --output analysis.json
# Process a JSON document
./target/release/json_processor --input config.json --detect-language

All processors support these common options (see the scripting sketch after the list):
- --input <FILE> - Input file path (required)
- --output <FILE> - Output JSON file (optional, defaults to stdout)
- --chunk-size <SIZE> - Maximum chunk size in characters (default: 1000)
- --chunk-overlap <SIZE> - Overlap between chunks (default: 100)
- --no-cleaning - Disable text cleaning
- --detect-language - Enable language detection
- --pretty - Pretty print JSON output
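Since the JSON is written to stdout when --output is omitted, the CLI is easy to drive from other tools. Here is a minimal Python sketch; the binary path and input filename are illustrative:

import json
import subprocess

# Invoke the universal processor and capture the JSON printed to stdout.
proc = subprocess.run(
    ["./target/release/doc_loader", "--input", "document.pdf",
     "--chunk-size", "1500", "--detect-language"],
    capture_output=True, text=True, check=True,
)

result = json.loads(proc.stdout)
print(f"Extracted {len(result['chunks'])} chunks")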
All processors generate a standardized JSON structure:
{
  "document_metadata": {
    "filename": "document.pdf",
    "filepath": "/path/to/document.pdf", 
    "document_type": "PDF",
    "file_size": 1024000,
    "created_at": "2025-01-01T12:00:00Z",
    "modified_at": "2025-01-01T12:00:00Z",
    "title": "Document Title",
    "author": "Author Name",
    "format_metadata": {
      // Format-specific metadata
    }
  },
  "chunks": [
    {
      "id": "pdf_chunk_0",
      "content": "Extracted text content...",
      "chunk_index": 0,
      "position": {
        "page": 1,
        "line": 10,
        "start_offset": 0,
        "end_offset": 1000
      },
      "metadata": {
        "size": 1000,
        "language": "en",
        "confidence": 0.95,
        "format_specific": {
          // Chunk-specific metadata
        }
      }
    }
  ],
  "processing_info": {
    "processor": "PdfProcessor",
    "processor_version": "1.0.0",
    "processed_at": "2025-01-01T12:00:00Z",
    "processing_time_ms": 150,
    "total_chunks": 5,
    "total_content_size": 5000,
    "processing_params": {
      "max_chunk_size": 1000,
      "chunk_overlap": 100,
      "text_cleaning": true,
      "language_detection": true
    }
  }
}
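Because this structure is the same for every format, downstream indexing code only needs plain JSON parsing. A minimal Python sketch, assuming the CLI was run with --output result.json (the embedding step itself is out of scope here), that collects chunk records ready for a vector store:

import json

# Load a result produced by e.g.: doc_loader --input document.pdf --output result.json
with open("result.json", encoding="utf-8") as f:
    doc = json.load(f)

meta = doc["document_metadata"]
print(f"{meta['filename']} ({meta['document_type']}): "
      f"{doc['processing_info']['total_chunks']} chunks")

# One record per chunk: id, text, and a little metadata for filtering.
records = [
    {
        "id": chunk["id"],
        "text": chunk["content"],
        "page": chunk["position"].get("page"),
        "language": chunk["metadata"].get("language"),
    }
    for chunk in doc["chunks"]
]

The project follows a modular architecture: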
src/
├── lib.rs              # Main library interface
├── main.rs             # Universal CLI
├── error.rs            # Error handling
├── core/               # Core data structures
│   └── mod.rs          # Universal output format
├── utils/              # Utility functions
│   └── mod.rs          # Text processing utilities
├── processors/         # Document processors
│   ├── mod.rs          # Common processor traits
│   ├── pdf.rs          # PDF processor
│   ├── txt.rs          # Text processor
│   ├── json.rs         # JSON processor
│   ├── csv.rs          # CSV processor
│   └── docx.rs         # DOCX processor
└── bin/                # Individual CLI binaries
    ├── pdf_processor.rs
    ├── txt_processor.rs
    ├── json_processor.rs
    ├── csv_processor.rs
    └── docx_processor.rs
Test the functionality with the provided sample files:
# Test text processing
./target/debug/doc_loader --input test_sample.txt --pretty
# Test JSON processing
./target/debug/json_processor --input test_sample.json --pretty
# Test CSV processing  
./target/debug/csv_processor --input test_sample.csv --pretty

PDF processing:
- Text extraction with lopdf
- Page-based chunking
- Metadata extraction (title, author, creation date)
- Position tracking (page, line, offset)
 
CSV processing:
- Header detection and analysis
- Column statistics (data types, fill rates, unique values)
- Row-by-row or batch processing
- Data completeness analysis
 
JSON processing:
- Hierarchical structure analysis
- Key extraction and statistics
- Nested object flattening
- Schema inference
 
DOCX processing:
- Document structure parsing
- Style and formatting preservation
- Section and paragraph extraction
- Metadata extraction
 
Plain text (TXT) processing:
- Encoding detection
- Line and paragraph preservation
- Language detection
- Character and word counting
 
Use doc_loader as a library in your Rust projects:
use doc_loader::{UniversalProcessor, ProcessingParams};
use std::path::Path;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let processor = UniversalProcessor::new();
    let params = ProcessingParams::default()
        .with_chunk_size(1500)
        .with_language_detection(true);
    
    let result = processor.process_file(
        Path::new("document.pdf"), 
        Some(params)
    )?;
    
    println!("Extracted {} chunks", result.chunks.len());
    Ok(())
}

Performance characteristics:
- Fast Processing: Optimized for large documents
- Memory Efficient: Streaming processing for large files
- Detailed Metrics: Processing time and statistics
- Concurrent Support: Thread-safe processors
 
- Enhanced PDF text extraction (pdfium integration)
- Complete DOCX XML parsing
- Unit test coverage
- Performance benchmarks
 
- Additional formats (XLSX, PPTX, HTML, Markdown)
- Advanced language detection
- Web interface/API
- Vector store integrations
- OCR support for scanned documents
- Parallel processing optimizations
 
Contributions are welcome:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
 
[Add your license information here]
Report issues on the project's issue tracker. Include:
- File format and size
- Command used
- Error messages
- Expected vs actual behavior
 
Doc Loader - Making document processing simple, fast, and universal!
Doc Loader provides fully functional Python bindings through PyO3, offering the same performance as the native Rust library with a clean Python API.
# Via PyPI (recommended)
pip install extracteur-docs-rs
# Or build from source
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install maturin build tool
pip install maturin
# Build and install Python bindings (Python 3.9+ supported)
venv/bin/maturin develop --features python --release

import extracteur_docs_rs as doc_loader
# Quick start - process any supported file format
result = doc_loader.process_file("document.pdf", chunk_size=500)
print(f"Chunks: {result.chunk_count()}")
print(f"Words: {result.total_word_count()}")
print(f"Supported formats: {doc_loader.supported_extensions()}")
# Advanced usage with custom parameters
processor = doc_loader.PyUniversalProcessor()
params = doc_loader.PyProcessingParams(
    chunk_size=400,
    overlap=60,
    clean_text=True,
    extract_metadata=True
)
result = processor.process_file("document.txt", params)
# Process text content directly
text_result = processor.process_text_content("Your text here...", params)
# Export to JSON
json_output = result.to_json()

Typical use cases (a short embedding sketch follows the list):
- RAG/Embedding Pipeline: Direct integration with sentence-transformers
- Data Analysis: Export to pandas DataFrames
- REST API: Flask/FastAPI endpoints
- Batch Processing: Process directories of documents
- Jupyter Notebooks: Interactive document analysis
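For example, a minimal RAG-style sketch, assuming sentence-transformers is installed and that to_json() returns the universal structure shown earlier; the model name and docs/ directory are illustrative:

from pathlib import Path
import json

from sentence_transformers import SentenceTransformer
import extracteur_docs_rs as doc_loader

SUPPORTED = {".pdf", ".txt", ".json", ".csv", ".docx"}
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

ids, texts = [], []
for path in Path("docs/").iterdir():  # batch-process a directory of documents
    if path.suffix.lower() not in SUPPORTED:
        continue
    result = doc_loader.process_file(str(path), chunk_size=500)
    # Read individual chunks through the universal JSON export.
    for chunk in json.loads(result.to_json())["chunks"]:
        ids.append(f"{path.name}:{chunk['id']}")
        texts.append(chunk["content"])

embeddings = model.encode(texts)  # one vector per chunk, ready for a vector store
print(f"Embedded {len(ids)} chunks, shape {embeddings.shape}")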
 
The Python bindings are fully tested and functional with:
- All file formats supported (PDF, TXT, JSON, CSV, DOCX)
- Complete API coverage matching Rust functionality
- Proper error handling with Python exceptions
- Full parameter customization
- Comprehensive documentation and examples
 
Run the demo: venv/bin/python python_demo.py
For complete Python documentation, see docs/python_usage.md.