veridock/enclose

Enclose

A comprehensive document processing pipeline for Markdown to PDF/SVG/PNG conversion with OCR capabilities.

graph LR
    A[Markdown] -->|Parse| B[HTML]
    B -->|Convert| C[PDF]
    C -->|Embed| D[SVG]
    D -->|Extract| E[PNG]
    E -->|Process| F[OCR]
    F -->|Index| G[Search]
    G -->|Visualize| H[Dashboard]

🚀 Features

graph LR
    A[Input Formats] --> B[Markdown]
    A --> C[PDF]
    
    B --> D[Converters]
    C --> D
    
    D --> E[Output Formats]
    E --> F[PDF]
    E --> G[SVG]
    E --> H[PNG]
    E --> I[HTML]
    
    style A fill:#f9f,stroke:#333
    style E fill:#9f9,stroke:#333
  • Multi-format conversion: Convert between Markdown, PDF, SVG, and PNG
  • SVG embedding: Embed PDFs as base64 data URIs in SVG containers
  • Image extraction: Extract high-quality images from PDFs
  • OCR processing: Extract text with confidence scoring
  • Metadata tracking: Preserve and enhance metadata throughout processing
  • Interactive dashboard: View and search processed documents

✅ File Format Validation

Enclose includes comprehensive file format validation to ensure the integrity and correctness of all converted files:

graph TD
    A[Input Validation] --> B[Conversion]
    B --> C[Output Validation]
    C --> D[Verification]
    
    style A fill:#d4f1f9,stroke:#333
    style C fill:#d4f1f9,stroke:#333
    style D fill:#d4f1f9,stroke:#333

Validation Checks

PDF Files

  • ✅ Valid PDF signature (%PDF header)
  • ✅ Correct MIME type (application/pdf)
  • ✅ File integrity verification

SVG Files

  • ✅ Valid XML structure
  • ✅ Correct MIME type (image/svg+xml)
  • ✅ Basic SVG tag validation

PNG Files

  • ✅ Valid PNG signature (magic bytes)
  • ✅ Correct MIME type (image/png)
  • ✅ Image data integrity check
  • ✅ PIL verification of image data
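
The signature and structure checks listed above can be sketched with the standard library alone. This is a minimal illustration, not Enclose's actual validator: the function name `detect_format` is hypothetical, and the MIME-type and PIL checks from the list are left out.

```python
import xml.etree.ElementTree as ET

PDF_MAGIC = b"%PDF"                # every PDF file starts with this header
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # fixed 8-byte PNG signature
SVG_NS = "{http://www.w3.org/2000/svg}"


def detect_format(data: bytes) -> str:
    """Return 'pdf', 'png', 'svg', or 'unknown' based on file content."""
    if data.startswith(PDF_MAGIC):
        return "pdf"
    if data.startswith(PNG_MAGIC):
        return "png"
    try:
        # SVG has no magic bytes: it must be well-formed XML with an <svg> root.
        root = ET.fromstring(data)
    except ET.ParseError:
        return "unknown"
    # Accept both namespaced and bare <svg> root elements.
    if root.tag in (SVG_NS + "svg", "svg"):
        return "svg"
    return "unknown"
```

Checking content rather than the file extension catches the case where a converter writes an error page or an empty file under the expected name.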

Example Validation Output

# When running tests, you'll see validation output like:
PASSED tests/test_file_formats.py::test_pdf_conversion
PASSED tests/test_file_formats.py::test_svg_conversion
PASSED tests/test_file_formats.py::test_png_conversion

📚 Documentation

For complete documentation, please visit our documentation site.

🛠️ Quick Start

Prerequisites

  • Python 3.8+
  • Poetry (for development)

Installation

  1. Clone the repository:

    git clone https://github.com/veridock/enclose.git
    cd enclose
  2. Install the dependencies and the package itself in development mode:

    poetry install

Basic Usage

  1. List supported formats:

    enclose --list
  2. Convert a Markdown file to another format:

    # Basic conversion (outputs to current directory with default name)
    enclose example.md pdf
    
    # Specify output filename
    enclose example.md pdf -o output.pdf
    
    # Convert to SVG
    enclose example.md svg -o output.svg
    
    # Convert to PNG
    enclose example.md png -o output.png
    
    # Convert to HTML
    enclose example.md html -o output.html

Example

  1. First, create a test Markdown file or use the provided example.md.

  2. Convert it to different formats:

    # Convert to PDF
    enclose example.md pdf -o example.pdf
    
    # Convert to SVG
    enclose example.md svg -o example.svg

Important Notes

  • The -o or --output flag requires a full file path with extension (e.g., output.pdf, ./output.svg)
  • If no output is specified, the output will be saved in the current directory with a default name based on the input file
  • The output directory must exist before running the command
  • For example, -o output/example.pdf saves the result to output/example.pdf

Command Line Options

usage: enclose [-h] [--version] [--list] [-o OUTPUT] [input] [{pdf,png,svg,html}]

A document processing tool for format conversion.

positional arguments:
  input                 Input file path (markdown, pdf, etc.)
  {pdf,png,svg,html}    Output format

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --list                List supported formats and conversions
  -o OUTPUT, --output OUTPUT
                        Output directory (default: current directory)

Development

To run tests:

make test

To run linting:

make lint

To run type checking:

make typecheck

Installation

# Clone the repository
git clone https://github.com/veridock/enclose.git
cd enclose

# Install the package
make install

Basic Usage

# Process a document
enclose process example.md -o output/

# View the results
open output/dashboard.html  # macOS
# or
xdg-open output/dashboard.html  # Linux

📖 Documentation Structure

🌟 Features in Detail

Document Conversion

  • Markdown to PDF with custom styling
  • PDF to SVG with embedded fonts
  • High-quality image extraction

Advanced Processing

  • OCR text extraction with confidence scoring
  • Metadata extraction and management
  • Batch processing support
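
As a rough sketch of what "confidence scoring" can mean, the function below averages per-word confidences. It assumes input shaped like the dictionary returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, where non-word boxes carry a confidence of -1. The function name `summarize_ocr` is hypothetical, the output keys mirror the fields shown under Metadata Structure, and whether Enclose aggregates confidence this way is an assumption.

```python
from statistics import mean
from typing import Dict, List


def summarize_ocr(data: Dict[str, List]) -> Dict[str, object]:
    """Collapse per-word OCR results into text, mean confidence, and word count."""
    words: List[str] = []
    confs: List[float] = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # confidences may arrive as strings or numbers
        # Skip layout boxes (conf == -1) and empty or whitespace-only entries.
        if conf < 0 or not str(text).strip():
            continue
        words.append(str(text).strip())
        confs.append(conf)
    return {
        "ocr_text": " ".join(words),
        "ocr_confidence": round(mean(confs), 1) if confs else 0.0,
        "word_count": len(words),
    }
```

A plain mean treats short and long words alike; weighting by word length is a possible alternative if short tokens turn out to be noisy.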

Command Line Interface

  • Intuitive command structure
  • Configurable output formats
  • Progress tracking

📊 Example Workflow

sequenceDiagram
    participant User
    participant CLI
    participant Processor
    
    User->>CLI: enclose process doc.md
    CLI->>Processor: Process document
    Processor->>Processor: Convert Markdown to PDF
    Processor->>Processor: Generate SVG with embedded PDF
    Processor->>Processor: Extract images
    Processor->>Processor: Process OCR
    Processor-->>CLI: Processing complete
    CLI-->>User: Results in output/

📦 Project Structure

enclose/
├── docs/                   # Documentation
├── processor/              # Main package
│   ├── __init__.py
│   ├── __main__.py         # CLI entry point
│   ├── core/               # Core processing logic
│   ├── converters/         # Format converters
│   └── utils/              # Utility functions
├── scripts/                # Helper scripts
├── tests/                  # Test suite
└── pyproject.toml          # Project configuration

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🧪 Testing

To run the test suite:

make test
make lint

🔄 Development Workflow

  1. Set up development environment

    make install
  2. Run tests

    make test
  3. Format and check code

    make format
    make lint
  4. Run the development server

    make dev
    • Dashboard opens automatically in browser
    • Access: output/dashboard.html

🛠️ CLI Commands

Basic Usage

# Convert a document
enclose convert input.md pdf -o output/

# List available formats
enclose list-formats

# Show help
enclose --help

Command Structure

flowchart TD
    A[enclose] --> B[convert]
    A --> C[list-formats]
    A --> D[--help]
    
    B --> E[input_file]
    B --> F[output_format]
    B --> G[options]
    
    G --> H[-o/--output]
    G --> I[--dpi]
    G --> J[--quality]
    
    style B fill:#9f9,stroke:#333
    style C fill:#99f,stroke:#333

Common Examples

# Convert Markdown to PDF
enclose convert document.md pdf -o output/

# Convert PDF to high-quality PNG
enclose convert document.pdf png --dpi 300 -o images/


# List all supported formats
enclose list-formats

Advanced Options

# Set output DPI for images
enclose convert input.pdf png --dpi 150

# Set image quality (1-100)
enclose convert input.pdf jpg --quality 90

# Process multiple files
for f in *.md; do enclose convert "$f" pdf -o output/; done
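
The shell loop above can also be driven from Python, which makes it easier to stop on the first failure. This is a sketch under assumptions: `build_commands` and `run_batch` are hypothetical helper names, and the only CLI syntax used is the `enclose convert <input> <format> -o <dir>` form shown in this section.

```python
import subprocess
from pathlib import Path
from typing import List


def build_commands(src_dir: Path, out_dir: Path, fmt: str = "pdf") -> List[List[str]]:
    """Build one `enclose convert` command per Markdown file in src_dir."""
    return [
        ["enclose", "convert", str(md), fmt, "-o", str(out_dir)]
        for md in sorted(src_dir.glob("*.md"))
    ]


def run_batch(src_dir: Path, out_dir: Path, fmt: str = "pdf") -> None:
    """Run the conversions in order, raising on the first failing command."""
    out_dir.mkdir(parents=True, exist_ok=True)  # make sure the target directory exists
    for cmd in build_commands(src_dir, out_dir, fmt):
        subprocess.run(cmd, check=True)
```

For example, `run_batch(Path("."), Path("output"))` mirrors the loop above. Passing the command as a list avoids shell quoting problems with file names that contain spaces.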

📁 Project Structure

graph TD
    A[Project Root] --> B[Source Code]
    A --> C[Documentation]
    A --> D[Build System]
    A --> E[Tests]
    
    B --> F[enclose/]:::dir
    F --> G[__init__.py]:::file
    F --> H[__main__.py]:::file
    F --> I[core/]:::dir
    F --> J[converters/]:::dir
    F --> K[utils/]:::dir
    
    C --> L[docs/]:::dir
    C --> M[README.md]:::file
    
    D --> N[pyproject.toml]:::file
    D --> O[Makefile]:::file
    
    E --> P[tests/]:::dir
    
    classDef dir fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef file fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px

Key Directories

  • enclose/ - Main Python package

    • core/ - Core processing logic and document handling
    • converters/ - File format conversion modules
    • utils/ - Utility functions and helpers
    • __main__.py - CLI entry point
  • docs/ - Comprehensive documentation

    • getting-started/ - Installation and setup guides
    • architecture/ - System design and components
    • usage/ - User guides and examples
  • tests/ - Test suite

    • Unit tests
    • Integration tests
    • Test fixtures

Key Files

  • pyproject.toml - Project configuration and dependencies
  • Makefile - Common development tasks
  • scripts/enclose - Global CLI wrapper script
  • .github/workflows/ - CI/CD pipelines

🔄 Workflow

flowchart LR
    A[Input] -->|Markdown/PDF| B(enclose)
    B --> C{Format?}
    
    C -->|Markdown| D[Parse Markdown]
    D --> E[Generate HTML]
    E --> F[Convert to PDF]
    
    C -->|PDF| G[Process PDF]
    G --> H[Extract Content]
    
    F & H --> I[Generate Outputs]
    I --> J[SVG/PNG/HTML]
    I --> K[Metadata]
    J --> L[Dashboard]
    
    style A fill:#e3f2fd,stroke:#1565c0
    style B fill:#e8f5e9,stroke:#2e7d32
    style L fill:#fff3e0,stroke:#e65100

Processing Steps

  1. Input Handling

    • Accepts Markdown or PDF files
    • Validates input format and content
  2. Conversion

    • Markdown → HTML → PDF
    • PDF → Images/Text
  3. Output Generation

    • Generate SVG/PNG/HTML outputs
    • Extract and process metadata
  4. Visualization

    • Create interactive dashboard
    • Enable search and filtering

🔄 Pipeline Workflow

Step 1: CREATE
├── Generate example markdown file (invoice)
└── Output: invoice_example.md

Step 2: MARKDOWN → PDF
├── Convert markdown to styled HTML
├── Generate PDF with CSS styling
└── Output: invoice_example.pdf

Step 3: PDF → SVG
├── Embed PDF as base64 data URI
├── Add SVG metadata (RDF/Dublin Core)
└── Output: invoice_example.svg + metadata.json

Step 4: PDF → PNG
├── Extract PDF pages as PNG images
├── Convert PNG to base64 encoding
└── Output: page_*.png + updated metadata

Step 5: OCR PROCESSING
├── Extract text from PNG images
├── Calculate confidence scores
└── Output: updated metadata with OCR data

Step 6: FILESYSTEM SEARCH
├── Scan for all SVG files
├── Parse SVG metadata
└── Output: svg_search_results.json

Step 7: DASHBOARD CREATION
├── Generate HTML table with thumbnails
├── Embed SVG previews
└── Output: dashboard.html (opens in browser)
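
Step 3 can be illustrated with a short sketch that wraps PDF bytes in an SVG container as a base64 data URI and attaches Dublin Core metadata. The template, the use of an `<image>` element to carry the payload, the default A4 size in points, and the function name `embed_pdf_in_svg` are all assumptions made for illustration; the layout Enclose actually writes may differ.

```python
import base64
from xml.sax.saxutils import escape

SVG_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description>
        <dc:title>{title}</dc:title>
        <dc:format>application/pdf</dc:format>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <image width="{width}" height="{height}" href="data:application/pdf;base64,{payload}"/>
</svg>
"""


def embed_pdf_in_svg(pdf_bytes: bytes, title: str, width: int = 595, height: int = 842) -> str:
    """Return SVG markup carrying the PDF as a base64 data URI plus RDF metadata."""
    payload = base64.b64encode(pdf_bytes).decode("ascii")
    # Escape the title so characters such as & or < keep the XML well-formed.
    return SVG_TEMPLATE.format(
        width=width, height=height, title=escape(title), payload=payload
    )
```

Base64 inflates the payload by about one third, so the SVG is always larger than the PDF it wraps. Most viewers will not render a PDF inside `<image>`; in this sketch the SVG acts as a self-describing container rather than a preview.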

📊 Output Files

Metadata Structure

{
  "file": "path/to/file.svg",
  "type": "svg_with_pdf",
  "created": "2025-06-25T10:30:00",
  "pdf_embedded": true,
  "total_pages": 1,
  "pages": [
    {
      "page": 1,
      "file": "page_1.png",
      "base64": "iVBORw0KGgoAAAANSU...",
      "ocr_text": "Invoice #INV-2025-001...",
      "ocr_confidence": 95.7,
      "word_count": 45
    }
  ]
}
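
A metadata file with this structure is straightforward to consume. The sketch below uses only the field names from the example above; the helper names and the 90.0 threshold are hypothetical.

```python
import json
from typing import Any, Dict, List


def low_confidence_pages(metadata: Dict[str, Any], threshold: float = 90.0) -> List[int]:
    """Return page numbers whose OCR confidence falls below the threshold."""
    return [
        page["page"]
        for page in metadata.get("pages", [])
        if page.get("ocr_confidence", 0.0) < threshold
    ]


def full_text(metadata: Dict[str, Any]) -> str:
    """Join the OCR text of all pages in page order."""
    pages = sorted(metadata.get("pages", []), key=lambda p: p["page"])
    return "\n".join(page.get("ocr_text", "") for page in pages)


# Example: load a metadata document from a JSON string.
sample = json.loads(
    '{"file": "path/to/file.svg", "total_pages": 2, "pages": ['
    '{"page": 1, "ocr_text": "Invoice", "ocr_confidence": 95.7},'
    '{"page": 2, "ocr_text": "Totals", "ocr_confidence": 71.2}]}'
)
```

Pages without an `ocr_confidence` field are treated as confidence 0.0 here, so they are flagged for review rather than silently accepted.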

Dashboard Features

  • SVG Thumbnails: Direct embedding of SVG files
  • File Information: Path, size, modification date
  • PDF Detection: Indicates embedded PDF data
  • Metadata Status: Shows RDF metadata presence
  • Interactive Links: Click to open files
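
The file information, PDF detection, and metadata status columns could be gathered along these lines. This is a sketch under assumptions: it expects the PDF to sit in an `href` (or `xlink:href`) attribute as a `data:application/pdf;base64,` URI, and the names `inspect_svg` and `scan_svgs` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, List

RDF_TAG = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
PDF_URI_PREFIX = "data:application/pdf;base64,"


def inspect_svg(path: Path) -> Dict[str, object]:
    """Collect file information and embedded-content flags for one SVG file."""
    root = ET.parse(path).getroot()
    # A PDF counts as embedded if any element links to a PDF data URI.
    has_pdf = any(
        (el.get("href") or el.get(XLINK_HREF) or "").startswith(PDF_URI_PREFIX)
        for el in root.iter()
    )
    stat = path.stat()
    return {
        "path": str(path),
        "size": stat.st_size,
        "modified": stat.st_mtime,
        "pdf_embedded": has_pdf,
        "has_rdf_metadata": root.find(f".//{RDF_TAG}") is not None,
    }


def scan_svgs(root_dir: Path) -> List[Dict[str, object]]:
    """Inspect every .svg file below root_dir, sorted by path."""
    return [inspect_svg(p) for p in sorted(root_dir.rglob("*.svg"))]
```

Each returned dictionary maps onto one dashboard row. Malformed SVG files raise `ET.ParseError` in this sketch; a real scanner would probably catch that and report the file as invalid.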

🛠️ Makefile Targets

Target     Description
install    Install dependencies in virtual environment
create     Create example markdown file
process    Run conversion pipeline (steps 2-5)
search     Search filesystem for SVG files
enclose    Create HTML dashboard
clean      Remove generated files
clean-all  Remove everything including venv
help       Show available commands

🔧 Configuration

OCR Language Support

# Install additional languages
sudo apt-get install tesseract-ocr-pol  # Polish
sudo apt-get install tesseract-ocr-deu  # German

# Configure in processor.py
pytesseract.image_to_string(image, lang='pol+eng')

PDF Styling

Modify the CSS in the markdown_to_pdf() method:

styled_html = f"""
<style>
    body {{ font-family: 'Your Font', sans-serif; }}
    /* Add custom styles */
</style>
"""

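The doubled braces in the snippet above matter: inside an f-string, `{{` and `}}` produce literal CSS braces, while single braces interpolate Python values. A minimal, self-contained sketch of that pattern follows; the function name `build_styled_html` and its parameters are hypothetical and not part of Enclose.

```python
def build_styled_html(body_html: str, font_family: str = "sans-serif") -> str:
    """Wrap converted Markdown HTML in a page with an inline stylesheet."""
    # {{ and }} emit literal braces for CSS; {font_family} and {body_html}
    # are replaced with the argument values.
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
    body {{ font-family: {font_family}; }}
</style>
</head>
<body>
{body_html}
</body>
</html>"""
```

For example, `build_styled_html("<h1>Invoice</h1>", "'DejaVu Sans', sans-serif")` returns a complete HTML page. Assuming WeasyPrint is the PDF backend, as the troubleshooting notes suggest, that string could then be rendered with `weasyprint.HTML(string=...).write_pdf(...)`.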
🐛 Troubleshooting

Common Issues

OCR Not Working:

# Check tesseract installation
tesseract --version

# Install language packs
sudo apt-get install tesseract-ocr-eng

PDF Conversion Fails:

# Check weasyprint dependencies
pip install --upgrade weasyprint

SVG Rendering Issues:

# Install cairo development libraries
sudo apt-get install libcairo2-dev

Debug Mode

# Enable verbose output
python processor.py --step process --verbose

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📞 Support

  • Issues: GitHub Issues
  • Documentation: This README
  • Examples: Check output/ directory after running pipeline

🎉 Quick Demo

# Complete setup and demo
make install
make all

# View results
open output/dashboard.html  # macOS
xdg-open output/dashboard.html  # Linux

The dashboard will show your processed documents with interactive thumbnails and metadata!

About

enclose - A comprehensive document processing system that converts Markdown files through a complete pipeline: Markdown → PDF → SVG → PNG → OCR → Search → Dashboard.
