Convert pdf preprocessor #108

xf0e · 2018-12-12T02:07:43Z

The tesseract engine can now handle PDF files. This is achieved by a new ConvertPdf preprocessor.

Usage:

The preprocessor binary should be started with "-preprocessor convert-pdf".
Afterwards it can be tested with:
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://localhost:8000/test.pdf","engine":"tesseract", "preprocessors":["convert-pdf"]}' http://localhost:8080/ocr
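
The same request can also be built from a script. This is a minimal sketch in Python using only the standard library; the helper name `build_ocr_request` is hypothetical, and the URLs are the ones from the curl example above.

```python
import json
import urllib.request


def build_ocr_request(ocr_endpoint, pdf_url):
    """Build (but do not send) the POST request from the curl example."""
    payload = {
        "img_url": pdf_url,
        "engine": "tesseract",
        # Route the document through the new preprocessor first.
        "preprocessors": ["convert-pdf"],
    }
    return urllib.request.Request(
        ocr_endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ocr_request("http://localhost:8080/ocr", "http://localhost:8000/test.pdf")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` sends it; that only works while the open-ocr HTTP server and a convert-pdf preprocessor worker are running.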

Internally we call gs (Ghostscript) to create a multi-page TIFF from the input. ImageMagick won't work for this purpose because it creates a single-page image file, which tesseract can't handle because the image becomes too large, e.g.:

Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
Image too large: (2480, 77176)
Error during processing.
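
For illustration, a PDF to multi-page TIFF conversion with Ghostscript could look like the sketch below. The exact flags used by this PR are not shown in the thread, so the device (`tiff24nc`), the 300 dpi resolution, and both function names are assumptions. Ghostscript's TIFF devices write all pages into one file when the output name contains no `%d` page placeholder.

```python
import subprocess


def gs_pdf_to_tiff_args(pdf_path, tiff_path, dpi=300, device="tiff24nc"):
    """Return a Ghostscript argument list that renders all PDF pages into one TIFF."""
    return [
        "gs",
        "-dQUIET",    # suppress startup messages
        "-dNOPAUSE",  # do not pause between pages
        "-dBATCH",    # exit after the last input file
        f"-sDEVICE={device}",  # assumed device, not confirmed by the PR
        f"-r{dpi}",            # assumed resolution
        f"-sOutputFile={tiff_path}",  # no %d, so pages go into a single file
        pdf_path,
    ]


def convert_pdf_to_tiff(pdf_path, tiff_path):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(gs_pdf_to_tiff_args(pdf_path, tiff_path), check=True)
```

Calling `convert_pdf_to_tiff("test.pdf", "test.tiff")` requires the `gs` binary on the PATH.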

Regards!

tleyden · 2018-12-12T02:47:29Z

Thanks for the contribution! I verified that it builds locally, and triggered new Docker image builds on Docker Hub (still processing).

OSevangelist · 2019-02-17T13:03:05Z

Hi guys, great work! I tried this feature, but even for very small PDFs (e.g. 2 pages) I got

Unable to perform OCR decode. Error: Timeout waiting for RPC response

Any ideas why this happens? I use tesseract3 inside the containers.

tleyden · 2019-02-19T20:38:14Z

Any logs on the containers? I'm guessing it failed with some sort of error that didn't get propagated back.

darmanovic · 2019-02-27T14:51:06Z

@tleyden

I have the same issue. Logs:

OCR_HTTP: serveHttp called
OCR_CLIENT: dialing "amqp://admin:Phaish9ohbaidei6oole@rabbitmq/"
OCR_CLIENT: callbackQueue name: amq.gen-Y6bVZfgmdLjnzjZrj5_gsQ
OCR_CLIENT: looping over deliveries..
OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: [convert-pdf]
OCR_CLIENT: publishing with routing key "convert-pdf"
OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40

tleyden · 2019-02-27T18:48:03Z

Can you get logs on the worker container? Or maybe there isn't one running, which would explain the timeout.

What does docker ps return?

darmanovic · 2019-02-27T22:23:42Z

Worker container log is:

22:14:11.302272 OCR_WORKER: Creating new OCR Worker
22:14:11.302392 OCR_WORKER: Run() called...
22:14:11.302409 OCR_WORKER: dialing "amqp://admin:Phaish9ohbaidei6oole@rabbitmq/"
22:14:11.320177 OCR_WORKER: got Connection, getting Channel
22:14:11.322389 OCR_WORKER: binding to: decode-ocr
22:14:11.323148 OCR_WORKER: Queue bound to Exchange, starting Consume (consumer tag "foo")

I have 4 containers running. docker ps output (some columns removed for clarity):

b0055fbecbde    tleyden5iwx/open-ocr-2              docker-compose_openocr_1
b8be2302936c    tleyden5iwx/open-ocr-preprocessor   docker-compose_strokewidthtransform_1
ae51cccc7094    tleyden5iwx/open-ocr-2              docker-compose_openocrworker_1
9904e5507ac7    rabbitmq:3.6.5-management           docker-compose_rabbitmq_1

Line

command: "/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ -preprocessor stroke-width-transform"

of docker-compose.yml should be changed to:

command: "/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ **-preprocessor convert-pdf"

if I am right?
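
The reasoning behind this can be sketched as a small check. The client log shows the request being published with routing key "convert-pdf", while the docker ps listing above shows only an OCR worker (bound to "decode-ocr") and a preprocessor started with "stroke-width-transform". The sketch assumes that each preprocessor worker consumes from a routing key equal to its preprocessor name; the function name is hypothetical.

```python
def first_unserved_routing_key(preprocessors, bound_routing_keys):
    """Return the first routing key in the request chain with no consumer, else None."""
    # Assumed chain: each requested preprocessor in order, then the OCR worker.
    chain = list(preprocessors) + ["decode-ocr"]
    for key in chain:
        if key not in bound_routing_keys:
            return key
    return None


# Consumers implied by the running containers listed above.
bound = {"decode-ocr", "stroke-width-transform"}
print(first_unserved_routing_key(["convert-pdf"], bound))  # prints convert-pdf
```

Under that assumption nothing consumes the "convert-pdf" message, which would be consistent with the RPC timeout seen on the HTTP side.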

xf0e · 2019-02-28T15:49:44Z

Hello darmanovic,
sorry, I edited the first post. The preprocessor args should be
"-preprocessor convert-pdf" and should not contain "**". The stars are just typos.

darmanovic · 2019-02-28T15:53:32Z

I suspected that the stars were typos, but when I remove them, the container won't run at all.

LINE:

    command: "/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ -preprocessor convert-pdf"

LOG:

15:52:17.985590 PREPROCESSOR_WORKER: Creating new Preprocessor Worker
15:52:17.986118 PANIC: Could not create rpc worker: No preprocessor found for: "convert-pdf" -- main.main() at main.go:47
panic: Could not create rpc worker: No preprocessor found for: "convert-pdf"
2019-02-28T15:52:17.990229700Z 
goroutine 1 [running]:
runtime.panic(0x627e80, 0xc210042940)
/usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/couchbaselabs/logg.LogPanic(0x7374d0, 0x1f, 0x7efe16e9ae78, 0x1, 0x1)
/opt/go/src/github.com/couchbaselabs/logg/logg.go:136 +0xec
main.main()
/opt/go/src/github.com/tleyden/open-ocr/cli-preprocessor/main.go:47 +0x200

bplukasz · 2019-03-13T16:11:27Z

Same error as @darmanovic. Has anyone solved it?

nevvermind · 2019-04-10T06:26:26Z

Hi, all. Please have a look at #117 for a follow-up on this error.

xf0e added 3 commits December 12, 2018 02:53

[FEATURE,CGL] new preprocessor for converting incoming pdf files to tiff

ad4220e

[TASK] commented in defer calls to delete temp files

afb96bc

[TASK] removed unneeded commentary

170cc70

xf0e mentioned this pull request Dec 12, 2018

PDF support #87

Closed

Point back to primary open-ocr repo

9f08fbf

tleyden merged commit 2257ae8 into tleyden:master Dec 12, 2018

nevvermind mentioned this pull request Mar 19, 2019

Using the PDF pre-processor #117

Open

xf0e deleted the convert-pdf-preprocessor branch July 31, 2019 14:06
