[FEATURE] Tika with Tesseract OCR #164

Closed
notaus123 opened this issue Jul 21, 2021 · 2 comments

@notaus123

Hi there,

I installed Tika as a standalone server and added OCR support with Tesseract.
(https://cwiki.apache.org/confluence/display/tika/tikaocr)

In ext/tika/Classes/Service/Tika/ServerService.php I had to add several headers to enable OCR (it would be great if this could become configurable, by the way).

OCR is working, but my log is full of warnings:
Core: Error handler (BE): PHP Warning: file_get_contents(http://tika.xxx:9998/tika): failed to open stream: Connection refused in /var/www/html_relaunch/site/releases/initial/typo3conf/ext/tika/Classes/Service/Tika/ServerService.php line 238

It seems file_get_contents(..) is running into a timeout. I tried increasing the socket timeout setting in php.ini, but that changed nothing.
With the additional OCR headers disabled, it runs fine.

Calling the Tika server manually with the OCR header works fine.

Maybe you have an idea?
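
For reference, a manual call like the one described above might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the header names `X-Tika-OCRLanguage` and `X-Tika-PDFOcrStrategy` are taken from the Tika OCR wiki page linked above, not from this issue (the exact headers added to ServerService.php are not listed here), and the helper names, host, and timeout value are hypothetical.

```python
import urllib.request


def ocr_headers(language="eng", pdf_strategy="ocr_only"):
    """Build request headers that ask Tika Server to run OCR.

    Assumption: header names as documented on the Tika OCR wiki page;
    the issue itself does not say which headers were used.
    """
    return {
        "X-Tika-OCRLanguage": language,
        "X-Tika-PDFOcrStrategy": pdf_strategy,
        "Accept": "text/plain",
    }


def build_request(url, payload, headers):
    """Create a PUT request carrying the raw file bytes as the body."""
    return urllib.request.Request(url, data=payload, headers=headers, method="PUT")


def extract_text(url, payload, timeout_seconds=300):
    """Send the file to the Tika endpoint and return the extracted text.

    OCR can take far longer than plain text extraction, so the timeout is
    passed explicitly instead of relying on a global default.
    """
    request = build_request(url, payload, ocr_headers())
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8")
```

With a Tika server reachable at a hypothetical `http://localhost:9998/tika`, `extract_text(url, open("scan.pdf", "rb").read())` would send the file for OCR. A refused connection raises an error immediately regardless of the timeout, which helps to tell it apart from a request that is merely slow.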

@dkd-kaehm
Contributor

dkd-kaehm commented Aug 9, 2021

@notaus123 Thanks for reporting.
Could you please prepare a pull request with the mentioned headers and tell us which versions of TYPO3, Solr server, Tika server, and EXT:solr you are running?

@dkd-kaehm dkd-kaehm changed the title [BUG] Tika with Tesseract OCR [FEATURE] Tika with Tesseract OCR Aug 27, 2021
dkd-kaehm added a commit to TYPO3-Solr/solr-ddev-site that referenced this issue Aug 27, 2021
This change demonstrates that EXT:tika can extract text from images
by using the OCR feature of Apache Tika.

See: TYPO3-Solr/ext-tika#164

How to see the demo:

Run
`ddev enable tika`

Then, in the "File list" module, navigate to fileadmin/TIKA_OCR_DEMO,
click on the file icons and choose "Tika Preview".
@dkd-kaehm
Contributor

I can't reproduce the issue with the log entry "file_get_contents(http://tika.xxx:9998/tika): failed to open stream" on EXT:tika v10.0.0.
EXT:tika v10 introduced the configurable PSR-18 HTTP client from TYPO3.

No headers are required to extract text from images. That already works with the Tika "full" Docker image.
See the significant change in:

TYPO3-Solr/solr-ddev-site@dda2afd#diff-0af25d60a5d43139a91176ff71c3c0bc162408b1df0736dd6e4c0f701cbc750eR5
