A comprehensive document processing pipeline for Markdown to PDF/SVG/PNG conversion with OCR capabilities.
graph LR
A[Markdown] -->|Parse| B[HTML]
B -->|Convert| C[PDF]
C -->|Embed| D[SVG]
D -->|Extract| E[PNG]
E -->|Process| F[OCR]
F -->|Index| G[Search]
G -->|Visualize| H[Dashboard]
graph LR
A[Input Formats] --> B[Markdown]
A --> C[PDF]
B --> D[Converters]
C --> D
D --> E[Output Formats]
E --> F[PDF]
E --> G[SVG]
E --> H[PNG]
E --> I[HTML]
style A fill:#f9f,stroke:#333
style E fill:#9f9,stroke:#333
- Multi-format conversion: Convert between Markdown, PDF, SVG, and PNG
- SVG embedding: Embed PDFs as base64 data URIs in SVG containers
- Image extraction: Extract high-quality images from PDFs
- OCR processing: Extract text with confidence scoring
- Metadata tracking: Preserve and enhance metadata throughout processing
- Interactive dashboard: View and search processed documents
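
The SVG embedding feature can be pictured with a short Python sketch. This is not taken from the Enclose source: the function names `embed_pdf_in_svg` and `extract_pdf_from_svg`, the `foreignObject`/`object` wrapper, and the A4 page size are all assumptions chosen for illustration. It only shows the general idea of carrying a PDF inside an SVG as a base64 data URI and reading it back.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PDF_URI_PREFIX = "data:application/pdf;base64,"


def embed_pdf_in_svg(pdf_bytes: bytes, title: str = "document") -> str:
    """Wrap PDF bytes in a minimal SVG container as a base64 data URI."""
    payload = base64.b64encode(pdf_bytes).decode("ascii")
    # 595x842 (A4 at 72 dpi) is an arbitrary default for this sketch.
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="595" height="842">\n'
        f"  <title>{escape(title)}</title>\n"
        '  <foreignObject width="100%" height="100%">\n'
        '    <object xmlns="http://www.w3.org/1999/xhtml" type="application/pdf"\n'
        f'            data="{PDF_URI_PREFIX}{payload}"></object>\n'
        "  </foreignObject>\n"
        "</svg>\n"
    )


def extract_pdf_from_svg(svg_text: str) -> bytes:
    """Return the first embedded PDF found in any `data` attribute."""
    root = ET.fromstring(svg_text)
    for element in root.iter():
        data = element.get("data", "")
        if data.startswith(PDF_URI_PREFIX):
            return base64.b64decode(data[len(PDF_URI_PREFIX):])
    raise ValueError("no embedded PDF found")
```

Because base64 grows the payload by roughly a third, an SVG built this way is noticeably larger than the PDF it carries.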
Enclose includes comprehensive file format validation to ensure the integrity and correctness of all converted files:
graph TD
A[Input Validation] --> B[Conversion]
B --> C[Output Validation]
C --> D[Verification]
style A fill:#d4f1f9,stroke:#333
style C fill:#d4f1f9,stroke:#333
style D fill:#d4f1f9,stroke:#333
PDF files:
- ✅ Valid PDF signature (`%PDF` header)
- ✅ Correct MIME type (`application/pdf`)
- ✅ File integrity verification

SVG files:
- ✅ Valid XML structure
- ✅ Correct MIME type (`image/svg+xml`)
- ✅ Basic SVG tag validation

PNG files:
- ✅ Valid PNG signature (magic bytes)
- ✅ Correct MIME type (`image/png`)
- ✅ Image data integrity check
- ✅ PIL verification of image data
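
As a rough illustration of the signature checks listed above, the following standard-library sketch recognises the three formats from their leading bytes or root element. The function names are hypothetical and the project's own validators may work differently. The sketch does not cover the MIME type or PIL verification steps.

```python
import xml.etree.ElementTree as ET
from typing import Optional

PDF_MAGIC = b"%PDF-"                # every PDF starts with this header
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # 8-byte PNG file signature
SVG_TAGS = ("{http://www.w3.org/2000/svg}svg", "svg")


def is_pdf(data: bytes) -> bool:
    return data.startswith(PDF_MAGIC)


def is_png(data: bytes) -> bool:
    return data.startswith(PNG_MAGIC)


def is_svg(data: bytes) -> bool:
    """Well-formed XML whose root element is <svg>."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False
    return root.tag in SVG_TAGS


def detect_format(data: bytes) -> Optional[str]:
    """Return 'pdf', 'png', 'svg', or None if nothing matches."""
    if is_pdf(data):
        return "pdf"
    if is_png(data):
        return "png"
    if is_svg(data):
        return "svg"
    return None
```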
# When running tests, you'll see validation output like:
PASSED tests/test_file_formats.py::test_pdf_conversion
PASSED tests/test_file_formats.py::test_svg_conversion
PASSED tests/test_file_formats.py::test_png_conversion

For complete documentation, please visit our documentation site.
- Python 3.8+
- Poetry (for development)
1. Clone the repository:

       git clone https://github.com/yourusername/enclose.git
       cd enclose
2. Install dependencies:

       poetry install
3. Install the package in development mode:

       poetry install
1. List supported formats:

       enclose --list
2. Convert a Markdown file to another format:

       # Basic conversion (outputs to current directory with default name)
       enclose example.md pdf
       # Specify output filename
       enclose example.md pdf -o output.pdf
       # Convert to SVG
       enclose example.md svg -o output.svg
       # Convert to PNG
       enclose example.md png -o output.png
       # Convert to HTML
       enclose example.md html -o output.html
1. First, create a test Markdown file or use the provided `example.md`.
2. Convert it to different formats:

       # Convert to PDF
       enclose example.md pdf -o example.pdf
       # Convert to SVG
       enclose example.md svg -o example.svg
- The `-o` or `--output` flag requires a full file path with extension (e.g., `output.pdf`, `./output.svg`)
- If no output is specified, the output will be saved in the current directory with a default name based on the input file
- The output directory must exist before running the command
- The output will be saved to `output/example.pdf`
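
The output rules above can be sketched in Python as follows. The helper name `resolve_output_path` is hypothetical. So is the assumption that the default name is the input file's stem plus the target format's extension; the README only says the name is "based on the input file".

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(input_path: str, fmt: str,
                        output: Optional[str] = None) -> Path:
    """Apply the documented -o/--output rules to pick a target file."""
    if output is None:
        # Assumed default: <input stem>.<format> in the current directory.
        return Path.cwd() / f"{Path(input_path).stem}.{fmt}"
    target = Path(output)
    if not target.suffix:
        raise ValueError(f"output must include a file extension: {output!r}")
    if not target.parent.is_dir():
        # The output directory is not created automatically.
        raise FileNotFoundError(f"output directory does not exist: {target.parent}")
    return target
```

For example, `resolve_output_path("docs/example.md", "pdf")` would yield `example.pdf` in the current working directory under these assumptions.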
usage: enclose [-h] [--version] [--list] [-o OUTPUT] [input] [{pdf,png,svg,html}]
A document processing tool for format conversion.
positional arguments:
input Input file path (markdown, pdf, etc.)
{pdf,png,svg,html} Output format
options:
-h, --help show this help message and exit
--version show program's version number and exit
--list List supported formats and conversions
-o OUTPUT, --output OUTPUT
Output directory (default: current directory)
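
The help text above has the shape produced by Python's `argparse`. A parser along the following lines would print a very similar usage line. This is a reconstruction from the help output, not the project's actual code. The `build_parser` name, the `format` destination, and the version string are placeholders.

```python
import argparse

FORMATS = ("pdf", "png", "svg", "html")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="enclose",
        description="A document processing tool for format conversion.",
    )
    # Placeholder version string; the real value is not given in this README.
    parser.add_argument("--version", action="version",
                        version="%(prog)s (version placeholder)")
    parser.add_argument("--list", action="store_true",
                        help="List supported formats and conversions")
    parser.add_argument("-o", "--output",
                        help="Output directory (default: current directory)")
    # Both positionals are optional so that `enclose --list` works on its own.
    parser.add_argument("input", nargs="?",
                        help="Input file path (markdown, pdf, etc.)")
    parser.add_argument("format", nargs="?", choices=FORMATS,
                        help="Output format")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```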
To run tests:
make test

To run linting:

make lint

To run type checking:

make typecheck

- Python 3.8.1+
- Poetry for dependency management
- System dependencies (see Installation Guide)
# Clone the repository
git clone https://github.com/veridock/enclose.git
cd enclose
# Install the package
make install

# Process a document
enclose process example.md -o output/
# View the results
open output/dashboard.html # macOS
# or
xdg-open output/dashboard.html # Linux

- Getting Started - Installation and setup
- User Guide - Command reference and usage examples
- Architecture - System design and components
- Development - Contributing and development setup
- Markdown to PDF with custom styling
- PDF to SVG with embedded fonts
- High-quality image extraction
- OCR text extraction with confidence scoring
- Metadata extraction and management
- Batch processing support
- Intuitive command structure
- Configurable output formats
- Progress tracking
sequenceDiagram
participant User
participant CLI
participant Processor
User->>CLI: enclose process doc.md
CLI->>Processor: Process document
Processor->>Processor: Convert Markdown to PDF
Processor->>Processor: Generate SVG with embedded PDF
Processor->>Processor: Extract images
Processor->>Processor: Process OCR
Processor-->>CLI: Processing complete
CLI-->>User: Results in output/
enclose/
├── docs/ # Documentation
├── processor/ # Main package
│ ├── __init__.py
│ ├── __main__.py # CLI entry point
│ ├── core/ # Core processing logic
│ ├── converters/ # Format converters
│ └── utils/ # Utility functions
├── scripts/ # Helper scripts
├── tests/ # Test suite
└── pyproject.toml # Project configuration
Contributions are welcome! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.
To run the test suite:
make test
make lint

1. Set up development environment

       make install
2. Run tests

       make test
3. Format and check code

       make format
       make lint
4. Run the development server

       make dev
   - Dashboard opens automatically in browser
   - Access: `output/dashboard.html`
# Convert a document
enclose convert input.md pdf -o output/
# List available formats
enclose list-formats
# Show help
enclose --help

flowchart TD
A[enclose] --> B[convert]
A --> C[list-formats]
A --> D[--help]
B --> E[input_file]
B --> F[output_format]
B --> G[options]
G --> H[-o/--output]
G --> I[--dpi]
G --> J[--quality]
style B fill:#9f9,stroke:#333
style C fill:#99f,stroke:#333
# Convert Markdown to PDF
enclose convert document.md pdf -o output/
# Convert PDF to high-quality PNG
enclose convert document.pdf png --dpi 300 -o images/
# List all supported formats
enclose list-formats

# Set output DPI for images
enclose convert input.pdf png --dpi 150
# Set image quality (1-100)
enclose convert input.pdf jpg --quality 90
# Process multiple files
for f in *.md; do enclose convert "$f" pdf -o output/; done

graph TD
A[Project Root] --> B[Source Code]
A --> C[Documentation]
A --> D[Build System]
A --> E[Tests]
B --> F[enclose/]:::dir
F --> G[__init__.py]:::file
F --> H[__main__.py]:::file
F --> I[core/]:::dir
F --> J[converters/]:::dir
F --> K[utils/]:::dir
C --> L[docs/]:::dir
C --> M[README.md]:::file
D --> N[pyproject.toml]:::file
D --> O[Makefile]:::file
E --> P[tests/]:::dir
classDef dir fill:#e1f5fe,stroke:#01579b,stroke-width:2px
classDef file fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px
- `enclose/` - Main Python package
  - `core/` - Core processing logic and document handling
  - `converters/` - File format conversion modules
  - `utils/` - Utility functions and helpers
  - `__main__.py` - CLI entry point
- `docs/` - Comprehensive documentation
  - `getting-started/` - Installation and setup guides
  - `architecture/` - System design and components
  - `usage/` - User guides and examples
- `tests/` - Test suite
  - Unit tests
  - Integration tests
  - Test fixtures
- `pyproject.toml` - Project configuration and dependencies
- `Makefile` - Common development tasks
- `scripts/enclose` - Global CLI wrapper script
- `.github/workflows/` - CI/CD pipelines
flowchart LR
A[Input] -->|Markdown/PDF| B(enclose)
B --> C{Format?}
C -->|Markdown| D[Parse Markdown]
D --> E[Generate HTML]
E --> F[Convert to PDF]
C -->|PDF| G[Process PDF]
G --> H[Extract Content]
F & H --> I[Generate Outputs]
I --> J[SVG/PNG/HTML]
I --> K[Metadata]
J --> L[Dashboard]
style A fill:#e3f2fd,stroke:#1565c0
style B fill:#e8f5e9,stroke:#2e7d32
style L fill:#fff3e0,stroke:#e65100
1. Input Handling
   - Accepts Markdown or PDF files
   - Validates input format and content
2. Conversion
   - Markdown → HTML → PDF
   - PDF → Images/Text
3. Output Generation
   - Generate SVG/PNG/HTML outputs
   - Extract and process metadata
4. Visualization
   - Create interactive dashboard
   - Enable search and filtering
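
The branching in this flow (Markdown goes through HTML to PDF, while PDF input is processed directly) can be sketched as a small planning function. The step names below are labels borrowed from the flowchart, not real function names in the package, and `plan_steps` itself is hypothetical.

```python
from pathlib import Path
from typing import List

# Step labels mirror the flowchart; they are illustrative, not real APIs.
MARKDOWN_STEPS = ["parse_markdown", "generate_html", "convert_to_pdf"]
PDF_STEPS = ["process_pdf", "extract_content"]
SHARED_STEPS = ["generate_outputs", "extract_metadata", "build_dashboard"]


def plan_steps(input_path: str) -> List[str]:
    """Choose the processing branch from the input file's extension."""
    suffix = Path(input_path).suffix.lower()
    if suffix in {".md", ".markdown"}:
        return MARKDOWN_STEPS + SHARED_STEPS
    if suffix == ".pdf":
        return PDF_STEPS + SHARED_STEPS
    raise ValueError(f"unsupported input format: {input_path!r}")
```

Both branches converge on the same shared steps, which is why the diagram joins them before output generation.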
Step 1: CREATE
├── Generate example markdown file (invoice)
└── Output: invoice_example.md
Step 2: MARKDOWN → PDF
├── Convert markdown to styled HTML
├── Generate PDF with CSS styling
└── Output: invoice_example.pdf
Step 3: PDF → SVG
├── Embed PDF as base64 data URI
├── Add SVG metadata (RDF/Dublin Core)
└── Output: invoice_example.svg + metadata.json
Step 4: PDF → PNG
├── Extract PDF pages as PNG images
├── Convert PNG to base64 encoding
└── Output: page_*.png + updated metadata
Step 5: OCR PROCESSING
├── Extract text from PNG images
├── Calculate confidence scores
└── Output: updated metadata with OCR data
Step 6: FILESYSTEM SEARCH
├── Scan for all SVG files
├── Parse SVG metadata
└── Output: svg_search_results.json
Step 7: DASHBOARD CREATION
├── Generate HTML table with thumbnails
├── Embed SVG previews
└── Output: dashboard.html (opens in browser)
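
Steps 3 to 5 build up a metadata record page by page. The sketch below shows one way such a record could be assembled with dataclasses, using the field names from the sample metadata in this README. The class names are hypothetical, and averaging per-word confidences into `ocr_confidence` is an assumption, since the README does not say how the score is computed.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PageRecord:
    page: int
    file: str
    base64: str
    ocr_text: str
    ocr_confidence: float
    word_count: int


@dataclass
class SvgMetadata:
    file: str
    type: str = "svg_with_pdf"
    created: str = field(
        default_factory=lambda: datetime.now().isoformat(timespec="seconds")
    )
    pdf_embedded: bool = True
    total_pages: int = 0
    pages: List[PageRecord] = field(default_factory=list)

    def add_page(self, png_file: str, png_base64: str, ocr_text: str,
                 word_confidences: List[float]) -> PageRecord:
        # Assumption: page confidence is the mean of per-word confidences.
        confidence = (
            round(sum(word_confidences) / len(word_confidences), 1)
            if word_confidences else 0.0
        )
        record = PageRecord(
            page=len(self.pages) + 1,
            file=png_file,
            base64=png_base64,
            ocr_text=ocr_text,
            ocr_confidence=confidence,
            word_count=len(ocr_text.split()),
        )
        self.pages.append(record)
        self.total_pages = len(self.pages)
        return record

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```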
{
"file": "path/to/file.svg",
"type": "svg_with_pdf",
"created": "2025-06-25T10:30:00",
"pdf_embedded": true,
"total_pages": 1,
"pages": [
{
"page": 1,
"file": "page_1.png",
"base64": "iVBORw0KGgoAAAANSU...",
"ocr_text": "Invoice #INV-2025-001...",
"ocr_confidence": 95.7,
"word_count": 45
}
]
}

- SVG Thumbnails: Direct embedding of SVG files
- File Information: Path, size, modification date
- PDF Detection: Indicates embedded PDF data
- Metadata Status: Shows RDF metadata presence
- Interactive Links: Click to open files
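
The per-file information shown in the dashboard could be gathered with a scan like the one below. It is a standard-library sketch under stated assumptions: `describe_svg` and `scan_svgs` are hypothetical names, an embedded PDF is detected by looking for a `data:application/pdf;base64,` attribute value, and RDF presence is detected by the standard `rdf:RDF` element. Malformed SVG files would raise a parse error here and are not handled.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from pathlib import Path
from typing import Dict, List

RDF_TAG = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"
PDF_URI_PREFIX = "data:application/pdf;base64,"


def describe_svg(path: Path) -> Dict[str, object]:
    """Collect file info plus PDF and RDF flags for one SVG file."""
    stat = path.stat()
    root = ET.parse(path).getroot()
    has_pdf = any(
        value.startswith(PDF_URI_PREFIX)
        for element in root.iter()
        for value in element.attrib.values()
    )
    has_rdf = root.find(f".//{RDF_TAG}") is not None
    return {
        "path": str(path),
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime).isoformat(timespec="seconds"),
        "pdf_embedded": has_pdf,
        "has_rdf_metadata": has_rdf,
    }


def scan_svgs(root_dir: str) -> List[Dict[str, object]]:
    """Recursively describe every *.svg file below root_dir."""
    return [describe_svg(p) for p in sorted(Path(root_dir).rglob("*.svg"))]
```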
| Target | Description |
|---|---|
| `install` | Install dependencies in virtual environment |
| `create` | Create example markdown file |
| `process` | Run conversion pipeline (steps 2-5) |
| `search` | Search filesystem for SVG files |
| `enclose` | Create HTML dashboard |
| `clean` | Remove generated files |
| `clean-all` | Remove everything including venv |
| `help` | Show available commands |
# Install additional languages
sudo apt-get install tesseract-ocr-pol # Polish
sudo apt-get install tesseract-ocr-deu # German
# Configure in processor.py
pytesseract.image_to_string(image, lang='pol+eng')

Modify CSS in the markdown_to_pdf() method:
styled_html = f"""
<style>
body {{ font-family: 'Your Font', sans-serif; }}
/* Add custom styles */
</style>
"""

OCR Not Working:
# Check tesseract installation
tesseract --version
# Install language packs
sudo apt-get install tesseract-ocr-eng

PDF Conversion Fails:
# Check weasyprint dependencies
pip install --upgrade weasyprint

SVG Rendering Issues:
# Install cairo development libraries
sudo apt-get install libcairo2-dev

# Enable verbose output
python processor.py --step process --verbose

This project is open source. See LICENSE file for details.
- Fork the repository
- Create feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open Pull Request
- Issues: GitHub Issues
- Documentation: This README
- Examples: Check the `output/` directory after running the pipeline
# Complete setup and demo
make install
make all
# View results
open output/dashboard.html # macOS
xdg-open output/dashboard.html # Linux

The dashboard will show your processed documents with interactive thumbnails and metadata!