pydigitize is a simple command-line based tool to scan/archive documents.
It does the following steps:
- Scan a document with any scanner that supports SANE (ADF supported)
- Straightening and cleaning of scanned documents
- Run OCR on PDF so that it becomes searchable
- Generate PDF/A file for archival
- Add keywords to the PDF file
Because you don't want to type the same arguments for every piece of paper that
you scan, pydigitize supports profiles: A profile pre-configures settings like
the output directory, the resolution, whether to run OCR, additional PDF
keywords, etc. You can create a profile for all your invoices for example. Then
every time you get an invoice, you scan it with ./scan.py -p invoice
, done.
- Python 3.x
- OCRmyPDF 3.x
- libtiff 4.x
- sane 1.x
- unpaper
See ./scan.py --help
.
If you want to use profiles, create a profiles.toml
file in the current
directory.
For every profile you can specify the following parameters:
path
: The output directoryname
: Set a string that will be included in every filename in slugified formocr
: Whether to run OCR, straightening and cleanup on the scanned documentkeywords
: List of keywords that will be added to PDF Metadata
You can also create sub-profiles. They inherit the settings from the parent.
Example:
[bill]
path = "/home/user/bills/"
name = "bill"
ocr = true
keywords = ["bill"]
[bill.dentist]
name = "dentist"
keywords = ["bill", "dentist"]
[drawing]
path = "/home/user/drawings/"
ocr = false
Then pass the name of the profile to the scan.py
command using the -p
parameter.
./scan.py -p bill.dentist
You can of course override your parameters:
./scan.py -p bill -n amazon
If you want to scan a specific number of pages, use the -c
argument.
pydigitize will prompt you to confirm before scanning every page. This is very useful for example when scanning double-sided documents on a scanner that does not have a duplex unit, or when scanning a document partially in the ADF and partially on the flatbed.
If you don't want manual confirmation, but just want the scanner to scan as
fast as it can, use the --nowait
argument.