autoocr
is a small tool which watches a directory for PDF files and then uses pdfsandwich
to do OCR on them. After processing the files are moved to an output location.
You can build the tool yourself if you have Go installed. Generally using the provided docker image is much easier though.
The docker image has two pre-defined volumes:
/data/input
/data/output
The input directory is watched by the process for new files and the resulting files are written to the output directory. Running the image should generally be as easy as:
docker run --name autoocr \
-v /path/to/input:/data/input \
-v /path/to/output:/data/output \
xperimental/autoocr:latest
The autoocr
executable has a few options that can also be passed to the container:
Usage of autoocr:
--delay duration Processing delay after receiving watch events. (default 5s)
-i, --input string Directory to use for input. (default "input")
--keep-original Keep backup of original file. (default true)
--languages string OCR Languages to use. (default "deu+eng")
--log-format string Logging format to use. (default "plain")
--log-level string Logging level to show. (default "info")
-o, --output string Directory to use for output. (default "output")
--pdf-sandwich string Path to pdfsandwich utility. (default "pdfsandwich")
This software would not be possible without the existence of pdfsandwich
, tesseract
, ImageMagick
and others. Thanks for providing those!