
Convert pdf preprocessor #108

Merged
merged 4 commits into from
Dec 12, 2018


Contributor

@xf0e xf0e commented Dec 12, 2018

The Tesseract engine can now process PDF files. This is done by a new ConvertPdf preprocessor.

Usage:
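
A request that selects the new preprocessor might look like the sketch below. It assumes the JSON field names from the open-ocr README (`img_url`, `engine`, `preprocessors`); the preprocessor name `convert-pdf` matches the routing key seen in the logs later in this thread. The endpoint URL and the helper names are hypothetical.

```python
import json
import urllib.request


def build_ocr_request(pdf_url):
    """Build the JSON body for an OCR request routed through convert-pdf.

    Field names are assumed from the open-ocr README, not from this PR.
    """
    return {
        "img_url": pdf_url,
        "engine": "tesseract",
        "preprocessors": ["convert-pdf"],
    }


def post_ocr_request(endpoint, pdf_url, timeout=120):
    """POST the request to an open-ocr HTTP endpoint and return the response text.

    `endpoint` is a placeholder such as "http://localhost:9292/ocr" (assumed).
    """
    body = json.dumps(build_ocr_request(pdf_url)).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Note that the request only succeeds if a preprocessor worker is actually consuming the `convert-pdf` routing key; otherwise the HTTP handler waits until its RPC timeout expires.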

Internally, we call gs (Ghostscript) to create a multi-page TIFF from the input. ImageMagick won't work for this purpose because it creates a single-page image file, which Tesseract can't handle, e.g.:

Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
Image too large: (2480, 77176)
Error during processing.
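
The Ghostscript step can be sketched roughly as follows. This is not the preprocessor's actual code: the device (`tiffgray`), the resolution, and the function names are assumptions for illustration. The flags themselves are standard Ghostscript options, and a TIFF device writes all pages into one multi-page file when the output name contains no `%d` page placeholder.

```python
import shutil
import subprocess


def build_gs_command(pdf_path, tiff_path, dpi=300):
    """Return a Ghostscript argument list that renders a PDF to one multi-page TIFF.

    Device and resolution are assumed values, not taken from the PR.
    """
    return [
        "gs",
        "-dQUIET",             # suppress startup banner
        "-dNOPAUSE",           # do not pause between pages
        "-dBATCH",             # exit after the last page
        "-sDEVICE=tiffgray",   # 8-bit grayscale TIFF output (assumption)
        f"-r{dpi}",            # rendering resolution in DPI
        f"-sOutputFile={tiff_path}",  # no %d, so all pages go into one file
        pdf_path,
    ]


def convert_pdf_to_tiff(pdf_path, tiff_path, dpi=300):
    """Run Ghostscript; raises if gs is missing or exits with an error."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) not found on PATH")
    subprocess.run(build_gs_command(pdf_path, tiff_path, dpi), check=True)
```

The resulting TIFF keeps each PDF page as its own image frame, instead of a single very tall image like the 2480 x 77176 one rejected in the error above.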

Regards!

@xf0e xf0e mentioned this pull request Dec 12, 2018
@tleyden tleyden merged commit 2257ae8 into tleyden:master Dec 12, 2018
@tleyden
Owner

tleyden commented Dec 12, 2018

Thanks for the contribution! I verified that it builds locally, and triggered new docker images on dockerhub. (still processing)

@OSevangelist

OSevangelist commented Feb 17, 2019

Hi guys, great work! I tried this feature, but even for very small PDFs (e.g. 2 pages) I got

Unable to perform OCR decode. Error: Timeout waiting for RPC response

Any ideas why this happens? I use tesseract3 inside the containers.

@tleyden
Owner

tleyden commented Feb 19, 2019

Any logs on the containers? I'm guessing it failed with some sort of error that didn't get propagated back.

@darmanovic

@tleyden

I have the same issue. Logs:

  • OCR_HTTP: serveHttp called
  • OCR_CLIENT: dialing "amqp://admin:Phaish9ohbaidei6oole@rabbitmq/"
  • OCR_CLIENT: callbackQueue name: amq.gen-Y6bVZfgmdLjnzjZrj5_gsQ
  • OCR_CLIENT: looping over deliveries..
  • OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: [convert-pdf]
  • OCR_CLIENT: publishing with routing key "convert-pdf"
  • OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
  • ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
  • ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40

@tleyden
Owner

tleyden commented Feb 27, 2019

Can you get logs on the worker container? Or maybe there isn't one running, which would explain the timeout.

What does docker ps return?

@darmanovic

Worker container log is:

27T22:14:11.302615900Z 22:14:11.302272 OCR_WORKER: Creating new OCR Worker
22:14:11.302392 OCR_WORKER: Run() called...
22:14:11.302409 OCR_WORKER: dialing "amqp://admin:Phaish9ohbaidei6oole@rabbitmq/"
22:14:11.320177 OCR_WORKER: got Connection, getting Channel
22:14:11.322389 OCR_WORKER: binding to: decode-ocr
22:14:11.323148 OCR_WORKER: Queue bound to Exchange, starting Consume (consumer tag "foo")

I have 4 containers running; docker ps outputs (some columns removed for clarity):

b0055fbecbde    tleyden5iwx/open-ocr-2              docker-compose_openocr_1
b8be2302936c    tleyden5iwx/open-ocr-preprocessor   docker-compose_strokewidthtransform_1
ae51cccc7094    tleyden5iwx/open-ocr-2              docker-compose_openocrworker_1
9904e5507ac7    rabbitmq:3.6.5-management           docker-compose_rabbitmq_1

The line

command: "/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ -preprocessor stroke-width-transform"

of docker-compose.yml should be changed to:

command: "/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ **-preprocessor convert-pdf"

if I am right?

@xf0e
Contributor Author

xf0e commented Feb 28, 2019

Hello darmanovic,
sorry, I edited the first post. The preprocessor args should be
"-preprocessor convert-pdf" and should not contain "**". The stars are just typos.

@darmanovic

I suspected that the stars were typos, but when I remove them, the container won't run at all.

LINE:

    command: "/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ -preprocessor convert-pdf"

LOG:

15:52:17.985590 PREPROCESSOR_WORKER: Creating new Preprocessor Worker
15:52:17.986118 PANIC: Could not create rpc worker: No preprocessor found for: "convert-pdf" -- main.main() at main.go:47
panic: Could not create rpc worker: No preprocessor found for: "convert-pdf"
2019-02-28T15:52:17.990229700Z 
goroutine 1 [running]:
runtime.panic(0x627e80, 0xc210042940)
/usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/couchbaselabs/logg.LogPanic(0x7374d0, 0x1f, 0x7efe16e9ae78, 0x1, 0x1)
/opt/go/src/github.com/couchbaselabs/logg/logg.go:136 +0xec
main.main()
/opt/go/src/github.com/tleyden/open-ocr/cli-preprocessor/main.go:47 +0x200

@bplukasz

Same error as @darmanovic. Has anyone solved it?

@nevvermind

Hi, all. Please have a look at #117 for a follow-up on this error.

@xf0e xf0e deleted the convert-pdf-preprocessor branch July 31, 2019 14:06