A Python tool for converting PDF invoices to well-formatted Markdown files. The tool uses OCR when needed and preserves the structure and formatting of the original invoice.
- Converts PDF invoices to clean, well-formatted Markdown
- Extracts and preserves all important invoice information:
  - Company details
  - Customer information
  - Invoice details
  - Project information
  - Line items with pricing
  - Payment details
- Uses OCR (Optical Character Recognition) when needed
- Supports both English and Norwegian invoices
- Follows Markdown best practices and linting rules
- Handles both text-based and image-based PDFs
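
The "uses OCR when needed" behaviour can be pictured as a simple fallback: keep the text embedded in the PDF when there is enough of it, and run OCR otherwise. The sketch below is an illustration only, not the tool's actual logic. The function name `extract_text`, the `run_ocr` callable, and the `min_chars` threshold are all hypothetical.

```python
def extract_text(embedded_text, run_ocr, min_chars=50):
    """Return (text, source), where source is "text" or "ocr".

    embedded_text: text pulled directly from the PDF (may be empty for scans).
    run_ocr: zero-argument callable that performs OCR and returns a string.
    min_chars: assumed threshold; below it the PDF is treated as image-based.
    """
    # Ignore whitespace so a page of blank lines does not count as content.
    meaningful = "".join(embedded_text.split())
    if len(meaningful) >= min_chars:
        return embedded_text, "text"
    # Too little embedded text: assume an image-based PDF and fall back to OCR.
    return run_ocr(), "ocr"
```

In a real pipeline, `run_ocr` would presumably render pages with `pdf2image` and pass them to `pytesseract`, with both English and Norwegian language data installed for Tesseract.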
- Clone the repository:

      git clone https://github.com/yourusername/pdf-to-markdown.git
      cd pdf-to-markdown

- Create and activate a virtual environment:

      python -m venv venv
      source venv/bin/activate # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Place your PDF invoices in the `input_pdfs` directory.
- Run the conversion script:

      python convert_pdf.py

- Find the converted Markdown files in the `output_markdown` directory.
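
The workflow above amounts to a batch loop from `input_pdfs` to `output_markdown`. The following is a minimal sketch of such a loop, not the contents of `convert_pdf.py`. The `convert_all` function and the `convert_one` callable (which takes a PDF path and returns Markdown text) are hypothetical names used for illustration.

```python
from pathlib import Path


def convert_all(convert_one, input_dir="input_pdfs", output_dir="output_markdown"):
    """Convert every PDF in input_dir and write one .md file per invoice.

    convert_one: callable taking a Path to a PDF and returning Markdown text
    (a stand-in for the real conversion step).
    Returns the list of Markdown files that were written.
    """
    in_path = Path(input_dir)
    out_path = Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    # Sort for a deterministic processing order.
    for pdf in sorted(in_path.glob("*.pdf")):
        target = out_path / (pdf.stem + ".md")
        target.write_text(convert_one(pdf), encoding="utf-8")
        written.append(target)
    return written
```

Passing the converter in as a parameter keeps the directory handling separate from the extraction logic, which makes the loop easy to try out with a stub.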
The converted markdown files follow a consistent structure:
# Company Name
## Company Details
* Address: ...
* Phone: ...
* Email: ...
* Organization Number: ...
* Website: ...
## Customer Details
* Company: ...
* Address: ...
* Postal Code: ...
## Invoice Details
* Invoice Number: ...
* Invoice Date: ...
* Due Date: ...
* Customer Number: ...
* Reference: ...
## Project Details
* Project: ...
* Contact: ...
* Delivery Date: ...
* Delivery Address: ...
## Line Items
| Description | Amount (excl. VAT) | VAT (25%) | Amount (incl. VAT) |
|------------|-------------------|-----------|------------------|
| Item 1 | 1000,00 | 250,00 | 1250,00 |
## Total
Total Amount: NOK 1250,00
## Payment Details
* Account Number: ...
* Reference: ...
* Due Date: ...
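
To make the Line Items and Total sections concrete, here is a minimal sketch of how a row such as `1000,00 | 250,00 | 1250,00` could be produced from a net amount. It assumes a flat 25% VAT rate and a comma as the decimal separator, as in the example above. The names `format_amount` and `render_line_items` are hypothetical and not taken from the tool.

```python
from decimal import Decimal, ROUND_HALF_UP

VAT_RATE = Decimal("0.25")  # assumed flat rate, matching the "VAT (25%)" column
CENT = Decimal("0.01")


def format_amount(value):
    """Format a Decimal with two decimals and a comma separator, e.g. 1250,00."""
    rounded = value.quantize(CENT, rounding=ROUND_HALF_UP)
    return f"{rounded:.2f}".replace(".", ",")


def render_line_items(items):
    """Render (description, net_amount) pairs as the Line Items and Total sections."""
    lines = [
        "## Line Items",
        "| Description | Amount (excl. VAT) | VAT (25%) | Amount (incl. VAT) |",
        "|------------|-------------------|-----------|------------------|",
    ]
    total = Decimal("0")
    for description, net in items:
        # Round VAT per line so each row adds up exactly.
        vat = (net * VAT_RATE).quantize(CENT, rounding=ROUND_HALF_UP)
        gross = net + vat
        total += gross
        lines.append(
            f"| {description} | {format_amount(net)} | "
            f"{format_amount(vat)} | {format_amount(gross)} |"
        )
    lines += ["## Total", f"Total Amount: NOK {format_amount(total)}"]
    return "\n".join(lines) + "\n"
```

`Decimal` is used instead of `float` so that currency amounts are not subject to binary rounding errors.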
- opencv-python: Image processing
- pytesseract: OCR text extraction
- pdf2image: PDF to image conversion
- beautifulsoup4: HTML parsing
- fastapi: API framework (for future web interface)
- uvicorn: ASGI server
- Added proper URL formatting in markdown output
- Improved markdown formatting to follow best practices
- Fixed issues with markdown linting:
  - Proper URL formatting
  - Consistent heading structure
  - Single trailing newline
- Enhanced invoice data extraction and organization
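
Two of the linting fixes listed above, URL formatting and the single trailing newline, are easy to illustrate. The sketch below shows one possible approach and is not the project's implementation: it wraps bare URLs in angle brackets and normalizes the end of the file. Both function names are hypothetical.

```python
import re

# Match http(s) URLs that are not already inside <...> or a (...) link target.
_BARE_URL = re.compile(r"(?<![<(])\bhttps?://[^\s<>()]+")


def wrap_bare_urls(text):
    """Wrap bare URLs in angle brackets so Markdown linters accept them."""
    def repl(match):
        url = match.group(0)
        trailing = ""
        # Keep sentence punctuation outside the link.
        while url and url[-1] in ".,;:!?":
            trailing = url[-1] + trailing
            url = url[:-1]
        return f"<{url}>{trailing}"
    return _BARE_URL.sub(repl, text)


def ensure_single_trailing_newline(text):
    """Strip extra newlines at the end of the file and leave exactly one."""
    return text.rstrip("\n") + "\n"
```

For example, `wrap_bare_urls("* Website: https://example.com")` returns `* Website: <https://example.com>`, while URLs that are already bracketed or used as link targets are left unchanged.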
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.