Missing support for Tesseract5? #338

Closed
dickreuter opened this issue Dec 7, 2023 · 18 comments

Comments

@dickreuter

Is there no support for tesseract 5?

In this pipeline I install tesseract with chocolatey. That works fine, and it installs tesseract 5, but then tesserocr gives the following error:
Supporting tesseract v3.04.00

Collecting tesserocr (from -r requirements.txt (line 31))
Downloading tesserocr-2.6.2.tar.gz (58 kB)
---------------------------------------- 58.9/58.9 kB 3.0 MB/s eta 0:00:00
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'
error: subprocess-exited-with-error

Getting requirements to build wheel did not run successfully.
exit code: 1

[54 lines of output]
Failed to extract tesseract version number from: tesseract v5.3.3.20231005

leptonica-1.83.1

libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.4) : libpng 1.6.40 : libtiff 4.6.0 : zlib 1.2.13 : libwebp 1.3.2 : libopenjp2 2.5.0

Found AVX2

Found AVX

Found FMA

Found SSE4.1

Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.4 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5

Found libcurl/8.3.0 Schannel zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.11.0
Supporting tesseract v3.04.00
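
For illustration, here is a minimal sketch of pulling a version number out of a `tesseract -v` banner like the one in the log above. This is not tesserocr's actual setup.py logic; the helper name `extract_tesseract_version` and the regular expression are assumptions made for this example. It only shows that a four-part version string such as 5.3.3.20231005 is parseable once the banner text reaches the parser.

```python
import re

# Hypothetical pattern, not taken from tesserocr's setup.py.
_VERSION_RE = re.compile(r"tesseract\s+v?(\d+(?:\.\d+)*)")


def extract_tesseract_version(banner: str):
    """Return the version as a tuple of ints, or None if nothing matches."""
    match = _VERSION_RE.search(banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


banner = "tesseract v5.3.3.20231005\n leptonica-1.83.1"
version = extract_tesseract_version(banner)
print(version)
```

Comparing the resulting tuple against something like `(5,)` or `(3, 4)` avoids string comparisons on version numbers.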

@winstxnhdw

winstxnhdw commented Dec 29, 2023

The maintainer is long gone. Anyways, since you are on Windows, you shouldn't need to pre-install Tesseract. For Windows, the Tesseract model is bundled with the tesserocr wheel. See here. You may still need to install the relevant tessdata though.
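
A quick way to check whether the language data is somewhere Tesseract is likely to look, as a minimal sketch. It assumes the data location is given by the `TESSDATA_PREFIX` environment variable and that English (`eng.traineddata`) is the language needed; the helper name `find_traineddata` is made up for this example.

```python
import os
from pathlib import Path


def find_traineddata(lang="eng", prefix=None):
    """Return the path to <lang>.traineddata, or None if it is not found."""
    prefix = prefix or os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        return None
    base = Path(prefix)
    # Depending on the Tesseract version, the prefix is either the tessdata
    # directory itself or its parent, so check both layouts.
    for candidate in (base / f"{lang}.traineddata",
                      base / "tessdata" / f"{lang}.traineddata"):
        if candidate.is_file():
            return candidate
    return None


print(find_traineddata("eng"))
```

If this prints None, the traineddata file for that language still has to be downloaded and the prefix pointed at it.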

@zdenop
Contributor

zdenop commented Dec 29, 2023

tesserocr supports tesseract 5 - see the tesserocr code.

Building tesserocr from source (tesserocr-2.6.2.tar.gz) also requires the tesseract development files (or leptonica and tesseract built from source), otherwise the tesserocr build fails. Details are in the README.

@winstxnhdw

winstxnhdw commented Dec 29, 2023

He clearly isn't building tesserocr from source, so there's no need for him to install leptonica and tesseract.

@dickreuter
Author

dickreuter commented Dec 29, 2023 via email

@winstxnhdw

@dickreuter I have sent you a PR regarding the pipeline.

@winstxnhdw

winstxnhdw commented Dec 29, 2023

Also, I noticed that you have libleptonica and libtesseract in your Ubuntu Docker builds. You can remove them safely for faster builds and a smaller image size as they are now bundled into the tesserocr installation.

@zdenop
Contributor

zdenop commented Dec 29, 2023

If this is correct:

Downloading tesserocr-2.6.2.tar.gz

then he is 100% building from source. Maybe not intentionally, but this is source code - not a wheel (binary build)...
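
To make the distinction concrete: the file name in a pip log shows whether pip fetched a prebuilt wheel or a source archive that it has to build locally. A minimal sketch, where the helper name `distribution_kind` is invented for this example:

```python
def distribution_kind(filename: str) -> str:
    """Classify a downloaded distribution file by its extension."""
    name = filename.lower()
    if name.endswith(".whl"):
        return "wheel"  # prebuilt binary, nothing to compile
    if name.endswith((".tar.gz", ".zip")):
        return "sdist"  # source archive, pip has to build it locally
    return "unknown"


print(distribution_kind("tesserocr-2.6.2.tar.gz"))  # sdist
```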

@winstxnhdw

winstxnhdw commented Dec 29, 2023

Collecting tesserocr (from -r requirements.txt (line 31))

The log here already tells you that he is doing a pip install from requirements.txt. Also, circling back to your earlier point, there's no need to install leptonica and tesseract anymore. The README is outdated.

I am using tesserocr without installing those dependencies in my Examplify app.

@zdenop
Contributor

zdenop commented Dec 29, 2023

And??? pip invokes a build from source if it does not find a wheel... Are you familiar with the tools you are trying to use?

@zdenop
Contributor

zdenop commented Dec 29, 2023

What exactly is outdated in README?

@winstxnhdw

winstxnhdw commented Dec 29, 2023

And??? pip invokes a build from source if it does not find a wheel...

Why does this matter? OP is using Windows and installing with pip, obviously expecting a binary build, which there is. Just that the maintainer's setup.py doesn't pull the wheels for Windows for whatever reason.

What exactly is outdated in README?

The entire requirements section. Instead, he should add that to a section specifically for building from source / development.

@dickreuter
Author

dickreuter commented Dec 30, 2023 via email

@zdenop
Contributor

zdenop commented Dec 30, 2023

The entire requirements section.

Seriously?? This one?

pip
Download the wheel file corresponding to your Windows platform and Python installation from [simonflueckiger/tesserocr-windows_build/releases](https://github.com/simonflueckiger/tesserocr-windows_build/releases) and install them via:

> pip install <package_name>.whl

Do you understand that text? What is outdated there? Please state facts, not vague accusations.

Just that the maintainer's setup.py doesn't pull

tesserocr (this project where the issue was created) NEVER produced a Windows binary version. It was always created externally.

the wheels for Windows for whatever reason.

whatever the reason => the latest Windows wheel is 2.6.0
And it is not a problem if somebody knows how to write requirements.txt correctly.

@winstxnhdw

winstxnhdw commented Dec 30, 2023

It is truly amazing how you missed this entire part

Requires libtesseract (>=3.04) and libleptonica (>=1.71).

On Debian/Ubuntu:

$ apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
You may need to manually compile tesseract for a more recent version. Note that you may need to update your LD_LIBRARY_PATH environment variable to point to the right library versions in case you have multiple tesseract/leptonica installations.

tesserocr (this project where the issue was created) NEVER produced a Windows binary version. It was always created externally.

Exactly, and that's the problem. If you are going to commit to supporting a platform, the maintainer should do it well.

@zdenop
Contributor

zdenop commented Dec 30, 2023

It is truly amazing how you missed this entire part

I did not miss it. It is correct and relevant. Or do you claim you can run tesserocr on Debian without these libraries???

Exactly, and that's the problem. If you are going to commit to supporting a platform, the maintainer should do it well.

It is not a problem. E.g. tesseract and leptonica support many platforms but they never provide binary packages, just source code.

@winstxnhdw

winstxnhdw commented Dec 30, 2023

Or do you claim you can run tesserocr on Debian without these libraries???

I am just saying that there is no longer a need to explicitly install these dependencies. You were even a participant on the PR for this change.

It is not a problem. E.g. tesseract and leptonica support many platforms but they never provide binary packages, just source code.

We can agree to disagree then. I believe it's the maintainer's responsibility to ensure that the DX for installing their libraries should always be seamless. In one of my projects, I made sure to bundle the nvidia cublas and cudnn libraries along with the wheel. I know some people may argue that it could be a redundant install if the user already has the dependencies installed in the machine, but relying on the user's PATH to properly resolve these dependencies, in my experience and many others, usually just leads to pain.

To reiterate, the only reason why I, and many others are using this library instead of pytesseract is because the OCR engine is bundled within the installation. That can lead to many advantages. For one, I don't have to add a layer to my docker image for installing these dependencies and I don't have to worry about whether my OS has or has not installed the dependencies in the PATH that tesserocr is expecting.

@zdenop
Contributor

zdenop commented Jan 1, 2024

am just saying that there is no longer a need to explicitly install

... until you start to face the problems - see e.g. #337. Other problems were reported for Mac. Distributing your own binary libraries on Linux is not a good idea. The Linux philosophy is to use system shared libraries => tesserocr should be linked against the system leptonica and tesseract and not against a custom build.
pip install --no-binary tesserocr tesserocr is the right way to install tesserocr on Linux and similar systems (macOS, FreeBSD). Windows is the other problem because ... it is Windows.
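
In a script or CI step, the same command can be issued through the running interpreter so the build targets the intended environment. A minimal sketch; the helper name `build_from_source_command` is invented here, and the actual install call is left commented out because it only succeeds when the tesseract and leptonica development files are present.

```python
import subprocess
import sys


def build_from_source_command(package="tesserocr"):
    """Return the pip command that forces a source build of `package`."""
    return [sys.executable, "-m", "pip", "install",
            "--no-binary", package, package]


cmd = build_from_source_command()
print(" ".join(cmd))
# Uncomment to actually run the install:
# subprocess.run(cmd, check=True)
```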

...pytesseract is because the OCR engine is bundled within the installation

pytesseract does not bundle OCR - it wraps tesseract executable (e.g. you need to install tesseract separately) while tesserocr wraps (and links) tesseract library. As far as I understand, pytesseract decided to go this way to avoid problems with distributing binary libraries, dependencies, security etc. (i.e. it leaves all problems to tesseract packagers)...
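
The practical difference shows up in what has to be present at runtime: a wrapper around the executable needs `tesseract` on the PATH, while a binding to the library needs the compiled Python module to be importable. A minimal sketch using only the standard library; the helper name `ocr_backends_available` is made up for this example.

```python
import importlib.util
import shutil


def ocr_backends_available():
    """Report which prerequisite each wrapper style would find."""
    return {
        # pytesseract shells out to this executable.
        "tesseract_executable": shutil.which("tesseract") is not None,
        # tesserocr is a compiled extension linked against libtesseract.
        "tesserocr_module": importlib.util.find_spec("tesserocr") is not None,
    }


print(ocr_backends_available())
```

Note that finding the module only shows it is installed; a missing shared library would still surface as an error at import time.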

I believe it's the maintainer's responsibility to ensure that the DX for installing their libraries should always be seamless

No. It is a packager's responsibility. Packager != maintainer. There is a split of tasks and responsibilities and it is right.
GTK, pango, gnome, KDE maintainers do not care if you are able to install their products/libraries on Windows etc... The same problem is with Windows or Mac OS apps&libs.

@winstxnhdw

pytesseract does not bundle OCR - it wraps tesseract executable (e.g. you need to install tesseract separately) while tesserocr wraps (and links) tesseract library.

You misread me. I am saying that I prefer tesserocr over pytesseract because it links the tesseract library.

... until you start to face the problems - see e.g. #337.

Is this issue not because the maintainer failed to properly pre-compile tesseract in the proper environment?

GTK, pango, gnome, KDE maintainers do not care if you are able to install their products/libraries on Windows etc..

And you're right, they don't have to because they do not explicitly support these platforms. This is unlike tesserocr which explicitly mentions support for these platforms in the README. In this case, this library is playing the role of the Packager.

All I am saying is that tesserocr's DX is almost there. Just update the README and fix the automated CIs that pre-compile the tesseract library so that everyone gets the full-feature set.

@sirfz sirfz closed this as completed Feb 5, 2024