can't ocr anything with 2.6.2 #337

Closed
starenka opened this issue Nov 20, 2023 · 2 comments
starenka commented Nov 20, 2023

(tmp-42dc3f1969e972a) starenka /data/.envs/tmp-42dc3f1969e972a % ipython
Python 3.11.6 (main, Oct  8 2023, 05:06:43) [GCC 13.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.11.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tesserocr

In [2]: tesserocr.file_to_text('/tmp/test.jpg')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 1
----> 1 tesserocr.file_to_text('/tmp/test.jpg')

File tesserocr.pyx:2621, in tesserocr.file_to_text()

RuntimeError: Failed to read picture

In [3]: !pip list | grep tesserocr
tesserocr         2.6.2

In [5]: print(tesserocr.tesseract_version())
tesseract 5.3.3
 leptonica-1.83.1
  libpng 1.6.34 : zlib 1.2.11

It works fine with tesserocr < 2.6:

In [1]: import tesserocr

In [2]: tesserocr.file_to_text('/tmp/test.jpg')
Out[2]: ">&EoASCADE\n\nEn\n\nEmoji Meaning Emoji Designs Technical Information\n\nRobot\n\nThe head of a classic robot. Commonly depicted as a\nvintage, tin toy robot with circular eyes, a triangular\nnose, knobs for ears, a light and/or antennae atop its\n\nLearn More About This Emoji\n\nGoes Great With\n@®e. =\n¢ b&\n\nUpcoming Events\n\n@ Thanksgiving ff Black Friday Emoji List\n\nHanukkah\n\n2. Christmas\n\nLatest News\n\n@9RF MOV OSREA\n\nShow More\n\nShow More\n\nMicrosoft Windows Samsung One UI What's New in\n\n11 23H2 Emoji 6.0 Emoji Unicode 15.1 &\nChangelog Changelog Emoji 15.1\nMicrosoft have begun Samsung has begun The latest list of emoji\nto roll out their latest rolling out the latest recommendations\n\nversion of its Android\nsoftware layer, One UI\n6.0. This update\n\nintroduces a brand new\nvisual style for the va...\n\nupdate to Windows 11,\nadding Emoji 15.0\nsupport and debuting\nthe glossy 3D Fluent\ndesigns in select appl...\n\ndrafted by the Unicode\nConsortium - Emoji\n15.1 - has been\nformally approved. This\nmeans that 118 new\n\nemojis s...\n\nVendors & Platforms Emojipedia Updates & Releases\n\nAbout Emojipedia Latest Approved Emojis\n\nle Noto Color Emoji Contact Emaji Kitchen\n\nLatest Draft Emolis\n\nsung Emoji Wr\n\nEmojipedia Shop All Emoji Version\n\nFacebook Licensing All Unicode Ve\nTwitter / x ings Emoji Prope\nWhatsApp Information Emoji Reau\nJoyPixels Privacy Palioy\n\nSnapchat Terms of Service\n\nTikTok How To Change Language\n\nAll Vendors & Platforms AL Art Master\n\n‘All emoji names are official Unicode: Character Database or CLDR names. Gode points fisted\n\nare part of the Lnicode Standard.\n\nAdditional emoji descriptions and definitions are copyright © Emojipedia. 
Emoji images\ndisplayed on Emojipedia are copyright © their respective creators, unless otherwise noted.\n\nEmojipedia® is a member of the Unicode Consortium,\n\nZEDGE’\n\n(reer ar\n\nFacts, Figures & Guides\n\na Emojipedia is brought to you by Zedge, the\n\nEmoji Statistics ‘world's #1 phone personalization app\nEmoji Sequence\n\nGoogle Play\n\n© App st\n\nGender Neutral\n\nCan 1 Email?\n\nEmojipedia® is a registered trademark of Zedye, Inc; Apple® is a registered trademark of\nApple Inc; Microsoft® and Windows® are registered trademarks of Microsoft Corporation;\nGoogle® and Android™ are registered trademarks or trademarks of Google Inc in the United\nStates and/or other countries.\n\nFollow Emojipedia on Twitter, Facebook, Instagram, Mastodon, or TikTok. Da Not Sell My\nPersonal Information. Change Consent. Read our Terms of Service here.\n\nRun a retail store? Check out the NRSPlus.com Point of Sale (POS) system, and low-rate\nNRSPay.com credit card processing from our partner, National Retail Solutions (NFS).\n"

In [3]: !pip list | grep tesserocr
/home/starenka/.local/lib/python3.11/site-packages/IPython/core/interactiveshell.py:2559: UserWarning: You executed the system command !pip which may not work as expected. Try the IPython magic %pip instead.
  warnings.warn(
tesserocr         2.5.2

In [4]: print(tesserocr.tesseract_version())
tesseract 5.3.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.3.2 : libopenjp2 2.5.0

//edit

I guess it needs a newer tesseract-ocr? If so, tesserocr could warn about it.

root ~ # apt show tesseract-ocr
Package: tesseract-ocr
Version: 5.3.0-2
Priority: optional
Section: graphics
Source: tesseract
Maintainer: Alexander Pozdnyakov <almipo@mail.ru>
Installed-Size: 2,186 kB
Depends: libarchive13 (>= 3.2.1), libc6 (>= 2.34), libcairo2 (>= 1.2.4), libcurl4 (>= 7.16.2), libfontconfig1 (>= 2.12.6), libgcc-s1 (>= 3.0), libglib2.0-0 (>= 2.12.0), libharfbuzz0b (>= 1.2.6), libicu72 (>= 72.1~rc-1~), liblept5 (>= 1.75.3), libpango-1.0-0 (>= 1.44.3), libpangocairo-1.0-0 (>= 1.22.0), libstdc++6 (>= 11), libtesseract5 (= 5.3.0-2), tesseract-ocr-eng (>= 4.0.9~), tesseract-ocr-osd (>= 4.0.9~)
Replaces: tesseract-ocr-data
Homepage: https://github.com/tesseract-ocr/
Tag: accessibility::ocr, implemented-in::c++, interface::commandline,
 role::program
Download-Size: 402 kB
APT-Manual-Installed: yes
APT-Sources: https://ftp.debian.org/debian trixie/main amd64 Packages
Description: Tesseract command line OCR tool
 Tesseract is an open source Optical Character Recognition (OCR)
 Engine. It can be used directly, or (for programmers) using an API to
 extract printed text from images. It supports a wide variety of
 languages. This package includes the command line tool.

sirfz (Owner) commented Nov 20, 2023

It looks like the pre-compiled binaries weren't built with JPEG support. Try rebuilding from source by installing as follows:

pip install --no-binary tesserocr tesserocr
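
The missing JPEG support is visible in the two version banners posted above: the 2.6.2 session lists only libpng and zlib, while the 2.5.2 session also lists libjpeg, libtiff and others. A minimal sketch of checking this programmatically follows; the helper name `image_libs` is hypothetical (it is not part of tesserocr), and the parsing assumes the banner keeps the " : "-separated layout shown in this issue.

```python
def image_libs(version_text: str) -> set[str]:
    """Return the library names listed in a tesseract version banner.

    Hypothetical helper: assumes libraries appear on lines such as
    "libgif 5.2.1 : libjpeg 6b (...) : libpng 1.6.40", as in this issue.
    """
    libs = set()
    for line in version_text.splitlines():
        line = line.strip()
        # Skip the "tesseract x.y.z" and "leptonica-x.y.z" header lines.
        if not line or line.startswith(("tesseract", "leptonica")):
            continue
        for entry in line.split(" : "):
            parts = entry.split()
            if parts:
                libs.add(parts[0].lower())
    return libs


# Version banners copied from the two sessions in this issue.
BROKEN = "tesseract 5.3.3\n leptonica-1.83.1\n  libpng 1.6.34 : zlib 1.2.11"
WORKING = (
    "tesseract 5.3.0\n leptonica-1.82.0\n"
    "  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng 1.6.40 : "
    "libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.3.2 : libopenjp2 2.5.0"
)

print("libjpeg" in image_libs(BROKEN))   # False
print("libjpeg" in image_libs(WORKING))  # True
```

With tesserocr installed, the same check can be run on the live build via `image_libs(tesserocr.tesseract_version())`.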

catileptic added a commit to alephdata/ingest-file that referenced this issue Feb 19, 2024
This is done due to the fact that version 2.6.2 doesn't bring any useful functionalities to Aleph and, instead, has the side-effect of breaking OCR for JPEG files as per: sirfz/tesserocr#337 .

sirfz (Owner) commented Mar 28, 2024

tesserocr v2.6.3 binaries are now built with jpeg (as well as tiff and webp) support
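
For projects that want to warn about the affected release, a rough sketch of a version check is below. The function `jpeg_status` is hypothetical and encodes only what this thread states for the pre-built wheels: 2.6.2 fails on JPEG, versions below 2.6 were reported working, and 2.6.3 is built with JPEG support. The thread does not say how 2.6.0 and 2.6.1 behave, so those are reported as unknown. Builds from source depend on the local leptonica and are not covered.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def jpeg_status(tesserocr_version: str) -> str:
    """Classify a tesserocr wheel version as 'ok', 'broken' or 'unknown'.

    Hypothetical helper based solely on the reports in this issue.
    """
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", tesserocr_version)
    if m is None:
        return "unknown"
    parsed = tuple(int(part or 0) for part in m.groups())
    if parsed == (2, 6, 2):
        return "broken"   # wheels built without JPEG support
    if parsed < (2, 6, 0) or parsed >= (2, 6, 3):
        return "ok"       # reported working (< 2.6) or fixed (>= 2.6.3)
    return "unknown"      # 2.6.0 and 2.6.1 are not discussed in the thread


try:
    print(jpeg_status(version("tesserocr")))
except PackageNotFoundError:
    print("tesserocr is not installed")
```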

sirfz closed this as completed Mar 28, 2024