[FEATURE] Tika with Tesseract OCR #164

notaus123 · 2021-07-21T07:45:41Z

Hi there,

I installed Tika as a standalone server and added OCR support with Tesseract.
(https://cwiki.apache.org/confluence/display/tika/tikaocr)

In ext/tika/Classes/Service/Tika/ServerService.php I had to add several request headers to enable OCR (it would be great if these could become configurable, by the way).

OCRing is working, but my log is full of warnings:
Core: Error handler (BE): PHP Warning: file_get_contents(http://tika.xxx:9998/tika): failed to open stream: Connection refused in /var/www/html_relaunch/site/releases/initial/typo3conf/ext/tika/Classes/Service/Tika/ServerService.php line 238

It seems the file_get_contents(..) call is running into a timeout. I tried increasing the socket timeout setting in php.ini, but that made no difference.
When I disable the additional OCR headers, it runs fine.

Manually calling the Tika server with the OCR headers works fine.
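
For reference, a manual call of this kind might look roughly like the sketch below. It is an illustration under assumptions, not the code used by EXT:tika. The header names `X-Tika-OCRLanguage` and `X-Tika-PDFOcrStrategy` are taken from the Tika OCR wiki page linked above, since the headers actually added to ServerService.php are not listed in this thread. The server address `localhost:9998` and the helper names `build_ocr_headers` and `extract_text` are hypothetical.

```python
import urllib.request

# Hypothetical server address; the thread uses tika.xxx:9998.
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_headers(language="eng", pdf_strategy=None):
    """Return request headers asking Tika Server to OCR the document.

    Header names are assumed from the Tika OCR wiki page; they may not be
    the same headers the reporter added to ServerService.php.
    """
    headers = {
        "Accept": "text/plain",
        "X-Tika-OCRLanguage": language,
    }
    if pdf_strategy is not None:
        # For example "ocr_only" to force OCR on PDFs.
        headers["X-Tika-PDFOcrStrategy"] = pdf_strategy
    return headers


def extract_text(path, url=TIKA_URL, timeout=120.0, **header_options):
    """PUT the file to the /tika endpoint and return the extracted text."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url,
        data=body,
        headers=build_ocr_headers(**header_options),
        method="PUT",
    )
    # An explicit timeout gives slow OCR runs room to finish; a refused
    # connection still fails immediately with URLError.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call such as `extract_text("scan.png", language="deu", timeout=300)` (with a hypothetical file name) would send the image with a German OCR language hint. Setting the timeout per request makes it easier to tell a slow OCR run apart from a server that refuses the connection outright.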

Maybe you have an idea?

dkd-kaehm · 2021-08-09T07:50:24Z

@notaus123 Thanks for reporting.
Could you please prepare a pull request with the headers you mentioned and tell us which versions of TYPO3, Solr server, Tika server, and EXT:solr you are running?

This change demonstrates that EXT:tika can extract text from images by using the OCR feature of Apache Tika. See: TYPO3-Solr/ext-tika#164. How to see the demo: run `ddev enable tika`, then navigate in the "File list" module to fileadmin/TIKA_OCR_DEMO, click on the file icons, and choose "Tika Preview".

dkd-kaehm · 2021-08-27T21:46:10Z

I can't reproduce the issue with the log message "file_get_contents(http://tika.xxx:9998/tika): failed to open stream" on EXT:tika v10.0.0.
EXT:tika v10 introduced the configurable PSR-18 HTTP client from TYPO3.

No headers are required to extract text from images. That already works with the Tika "full" Docker image.
See the significant change in:

TYPO3-Solr/solr-ddev-site@dda2afd#diff-0af25d60a5d43139a91176ff71c3c0bc162408b1df0736dd6e4c0f701cbc750eR5

dkd-kaehm changed the title ~~[BUG] Tika with Tesseract OCR~~ [FEATURE] Tika with Tesseract OCR Aug 27, 2021

dkd-kaehm closed this as completed Aug 27, 2021
