[FEATURE] Tika with Tesseract OCR #164
Comments
@notaus123 Thanks for reporting.
This change demonstrates that EXT:tika can extract text from images by using the OCR feature of Apache Tika. See: TYPO3-Solr/ext-tika#164. To see the demo, run `ddev enable tika`, then in the "File list" module navigate to fileadmin/TIKA_OCR_DEMO, click a file's icon, and choose "Tika Preview".
I can't reproduce the issue with the log entry "file_get_contents(http://tika.xxx:9998/tika): failed to open stream" on EXT:tika v10.0.0. No headers are required to extract text from images; that already works with the Tika "full" Docker image.
Hi there,
I installed Tika as a standalone server and added OCR support with Tesseract.
(https://cwiki.apache.org/confluence/display/tika/tikaocr)
In ext/tika/Classes/Service/Tika/ServerService.php I had to add several headers to enable OCR (it would be great if this could become configurable, btw).
OCR is working, but my log is full of warnings:
Core: Error handler (BE): PHP Warning: file_get_contents(http://tika.xxx:9998/tika): failed to open stream: Connection refused in /var/www/html_relaunch/site/releases/initial/typo3conf/ext/tika/Classes/Service/Tika/ServerService.php line 238
It seems the file_get_contents(..) call is running into a timeout. I tried increasing the socket timeout setting in php.ini, but that changed nothing.
When I disable the additional OCR headers, it runs fine.
Manually calling the Tika server with the OCR headers works fine.
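For reference, here is a minimal sketch of the kind of request described above: a PUT to the Tika server through `file_get_contents()` with extra headers and an explicit per-request timeout. It is not the actual code in ServerService.php. The function names `buildTikaContextOptions()` and `extractTextViaTika()` are hypothetical, and `X-Tika-OCRLanguage` is only an example header, since the headers actually added are not listed here.

```php
<?php
declare(strict_types=1);

/**
 * Build stream-context options for a PUT request to a Tika server.
 * Hypothetical helper; header names and values are supplied by the caller.
 */
function buildTikaContextOptions(string $body, array $headers, float $timeoutSeconds): array
{
    $headerLines = [];
    foreach ($headers as $name => $value) {
        $headerLines[] = $name . ': ' . $value;
    }

    return [
        'http' => [
            'method' => 'PUT',
            'header' => implode("\r\n", $headerLines),
            'content' => $body,
            // Per-request timeout in seconds; used instead of default_socket_timeout.
            'timeout' => $timeoutSeconds,
        ],
    ];
}

/**
 * Send a file to the Tika server and return the extracted text, or null on failure.
 * Hypothetical helper for illustration only.
 */
function extractTextViaTika(string $tikaUrl, string $filePath, array $headers, float $timeoutSeconds): ?string
{
    $body = file_get_contents($filePath);
    if ($body === false) {
        return null;
    }

    $context = stream_context_create(buildTikaContextOptions($body, $headers, $timeoutSeconds));
    // Suppress the PHP warning here and inspect error_get_last() instead.
    $response = @file_get_contents($tikaUrl, false, $context);

    return $response === false ? null : $response;
}

// Example call (assumed values, adjust to your setup):
// $text = extractTextViaTika(
//     'http://localhost:9998/tika',
//     '/path/to/scan.png',
//     ['Accept' => 'text/plain', 'X-Tika-OCRLanguage' => 'eng'],
//     120.0
// );
```

Note that "Connection refused" normally means the TCP connection itself was rejected, which is different from a read timeout, so a longer timeout alone may not change the outcome.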
Maybe you have an idea?