tesserocr fails to detect Tesseract 4 #60

Closed
Belval opened this issue Jul 6, 2017 · 12 comments

Comments

@Belval
Contributor

Belval commented Jul 6, 2017

I compiled Tesseract 4 from source and tried it from the command line (it works). Executable is in /usr/local/bin, library is in /usr/local/lib.

export LD_LIBRARY_PATH=/usr/local/lib

tesserocr still loads Tesseract 3, which according to apt-get should no longer be on my system.

@Belval
Contributor Author

Belval commented Jul 6, 2017

Digging a bit deeper: the Tesseract version is fixed when setup.py runs, so reinstalling tesserocr was necessary. Unfortunately, that only produced another error:

/tesserocr.cpython-35m-x86_64-linux-gnu.so: undefined symbol: omp_get_thread_num

@Belval
Contributor Author

Belval commented Jul 6, 2017

Switching to the right branch (my bad) solved that problem, now I get:

('/usr/local/', ['fra', 'osd', 'eng'])
Traceback (most recent call last):
    File "test.py", line 11, in <module>
      with PyTessBaseAPI() as api:
    File "tesserocr.pyx", line 1144, in tesserocr.PyTessBaseAPI.__cinit__ (tesserocr.cpp:9953)
    File "tesserocr.pyx", line 1157, in tesserocr.PyTessBaseAPI._init_api (tesserocr.cpp:10129)
 RuntimeError: Failed to init API, possibly an invalid tessdata path?

As the first line shows, I do have fra, eng, osd in my tessdata path...

@Belval
Contributor Author

Belval commented Jul 6, 2017

OK, I did some troubleshooting and I think I've figured a few things out.

The path parameter of PyTessBaseAPI seems to be overridden by the environment variable TESSDATA_PREFIX, even if that variable is an empty string.

For example, both

api = PyTessBaseAPI()

and

api = PyTessBaseAPI(path='/usr/local/') 

will raise RuntimeError: Failed to init API, possibly an invalid tessdata path?

Setting the environment variable to /usr/local/, on the other hand, makes it work (if that's where your tessdata/ folder is).

I'll try to make a Pull Request that fixes this tonight.
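
The workaround described above (setting TESSDATA_PREFIX rather than relying on the path argument) could be sketched roughly as follows. This is a minimal sketch, not part of tesserocr: the helper name tessdata_prefix is hypothetical, and it assumes Tesseract reads the environment variable when the API is initialised, which is what the behaviour reported in this thread suggests.

```python
import os
from contextlib import contextmanager


@contextmanager
def tessdata_prefix(prefix):
    """Temporarily set TESSDATA_PREFIX and restore the previous value on exit.

    `tessdata_prefix` is a hypothetical helper, not a tesserocr function.
    """
    previous = os.environ.get("TESSDATA_PREFIX")
    os.environ["TESSDATA_PREFIX"] = prefix
    try:
        yield prefix
    finally:
        if previous is None:
            # The variable was unset before, so remove it again.
            os.environ.pop("TESSDATA_PREFIX", None)
        else:
            os.environ["TESSDATA_PREFIX"] = previous


# Usage (requires tesserocr; '/usr/local/' is the prefix used in this thread):
# with tessdata_prefix("/usr/local/"):
#     from tesserocr import PyTessBaseAPI
#     with PyTessBaseAPI() as api:
#         ...
```

Restoring the old value keeps the change local to the block, so other code in the same process that relies on a different tessdata location is not affected.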

@Belval
Contributor Author

Belval commented Jul 7, 2017

Here's what I did:

  1. Clean Arch Linux installation (the problem was first spotted on an Ubuntu machine; hopefully that has no effect on the issue)
  2. Git clone & build the latest version of Leptonica
  3. Git clone & build the latest version of Tesseract
  4. Git clone tesserocr & python3 setup.py install
  5. Edit setup.py to make it use Tesseract v4
  6. python3 setup.py install

Observations:

  • If tessdata is in the default folder, everything works
  • If tessdata is in another folder, and TESSDATA_PREFIX is set to said folder, it works
  • If tessdata is in another folder, TESSDATA_PREFIX is empty, and no path parameter is given, it doesn't work (obviously)
  • If tessdata is in another folder, TESSDATA_PREFIX is empty, but the path parameter is given, it doesn't work (the problem)

Unfortunately, working through the Cython code seems to indicate a problem on Tesseract's end rather than tesserocr's. The path argument is passed correctly in the

cdef int ret = self._baseapi.Init(path, lang, oem, configs, configs_size, vars_vec, vars_vals, set_only_non_debug_params)

call.

@eromoe

eromoe commented Jul 14, 2017

Similar error on Ubuntu 14.04.

The apt-get versions do not meet the requirements, libtesseract (>=3.04) and libleptonica (>=1.71), so I built the latest versions of Leptonica and Tesseract.

@Belval
Contributor Author

Belval commented Jul 14, 2017

@eromoe You have to

  1. Uninstall all previous Tesseract and Leptonica versions
  2. Build Leptonica 1.73
  3. Build Tesseract 4
  4. pip uninstall tesserocr, if it was already installed
  5. pip install tesserocr <- This is the important bit (and the one I didn't understand): you have to reinstall tesserocr for it to use Tesseract 4.
  6. Make sure you have a tessdata folder and declare an environment variable for it. For me it was export TESSDATA_PREFIX=/usr/share, but the path could be different for you.

Hopefully that helps
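
After reinstalling, one way to confirm which Tesseract build tesserocr actually picked up is to ask the module itself. The sketch below uses tesserocr.tesseract_version(); the wrapper linked_tesseract_version is a hypothetical helper added here so the check degrades gracefully when tesserocr cannot be imported.

```python
def linked_tesseract_version():
    """Return the version string reported by tesserocr, or None if it cannot be imported.

    `linked_tesseract_version` is a hypothetical helper name; only
    `tesserocr.tesseract_version()` comes from the library.
    """
    try:
        import tesserocr
    except ImportError:
        # Not installed, or the compiled extension failed to load.
        return None
    return tesserocr.tesseract_version()


print(linked_tesseract_version())
```

If the printed string still mentions Tesseract 3, the extension was most likely built before Tesseract 4 was installed and needs to be rebuilt as in step 5.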

@eromoe

eromoe commented Jul 17, 2017

@Kankroc My problem was that eng.traineddata was missing: there is nothing under /usr/local/share/tessdata/ after building Tesseract. Tesseract's build guide includes a command, make install-langs, which does not work.

Also, tesserocr did not give me the right error message (it gave the same message as yours), but pytesseract did. So I manually downloaded eng.traineddata and moved it to /usr/local/share/tessdata/, after which both tesserocr and pytesseract worked fine.
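
Since the init error does not say which file is missing, a quick check for the language data before constructing the API can save some time. This is a minimal sketch using only the standard library: find_traineddata is a hypothetical helper, and checking both <prefix>/tessdata/ and <prefix>/ is an assumption made here because different Tesseract versions interpret the prefix differently.

```python
from pathlib import Path


def find_traineddata(prefix, lang="eng"):
    """Return the path to <lang>.traineddata under `prefix`, or None if it is absent.

    `find_traineddata` is a hypothetical helper, not part of tesserocr.
    """
    prefix = Path(prefix)
    candidates = [
        prefix / "tessdata" / f"{lang}.traineddata",  # prefix is the parent of tessdata/
        prefix / f"{lang}.traineddata",               # prefix is the tessdata/ folder itself
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


# Example: the location mentioned in this thread.
print(find_traineddata("/usr/local/share", "eng"))
```

A result of None for the language you intend to load points to missing data rather than a wrong path argument.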

@Belval
Contributor Author

Belval commented Jul 17, 2017

@eromoe Yeah, I had that problem too. It's probably worth looking into (make install-langs not working) or adding a specific exception in tesserocr.

@sirfz
Owner

sirfz commented Jul 22, 2017

Note that installing Tesseract from source does not include tessdata, which should be downloaded separately from https://github.com/tesseract-ocr/tessdata. I've pushed a change to show the current tessdata path in the runtime exception to make it easier to debug.

@sirfz sirfz closed this as completed Jul 22, 2017
@redstoneleo

redstoneleo commented Jun 26, 2018

In my experience, one has to specify both the path and lang parameters in order to use PyTessBaseAPI. The default value of lang is said to be eng, but the parameter cannot be omitted either.

A working example:

from tesserocr import PyTessBaseAPI

images = [r'C:\Users\i\AppData\Local\Programs\Python\115.jpg']
with PyTessBaseAPI(path='.', lang='eng') as api:
    for img in images:
        api.SetImageFile(img)
        print(api.GetBoxText(0))

@Adblu

Adblu commented Jun 28, 2018

I got the same problem. When I updated and upgraded Ubuntu (sudo apt-get upgrade/update), this error appeared: RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/share/tesseract-ocr/

flip111 added a commit to flip111/tesserocr that referenced this issue Jul 12, 2018
Add information about tessdata. There are a lot of issues about this and nothing in the readme yet. The information is just what i gathered from these issues and get from my own experience. Related issues:

sirfz#101
sirfz#28
sirfz#60
sirfz#100
@vivekav-96

Switching to the right branch (my bad) solved that problem, now I get:

('/usr/local/', ['fra', 'osd', 'eng'])
Traceback (most recent call last):
    File "test.py", line 11, in <module>
      with PyTessBaseAPI() as api:
    File "tesserocr.pyx", line 1144, in tesserocr.PyTessBaseAPI.__cinit__ (tesserocr.cpp:9953)
    File "tesserocr.pyx", line 1157, in tesserocr.PyTessBaseAPI._init_api (tesserocr.cpp:10129)
 RuntimeError: Failed to init API, possibly an invalid tessdata path?

As the first line shows, I do have fra, eng, osd in my tessdata path...

Can you please explain how you solved the error undefined symbol: omp_get_thread_num?

sirfz added a commit that referenced this issue Jun 19, 2021
* Update README.rst

Add information about tessdata. There are a lot of issues about this and nothing in the readme yet. The information is just what i gathered from these issues and get from my own experience. Related issues:

#101
#28
#60
#100

Co-authored-by: Fayez <iamfayez@gmail.com>