good accuracy but too slow, how to improve Tesseract speed #263

Open
ychtioui opened this issue Mar 10, 2016 · 61 comments

@ychtioui
commented Mar 10, 2016

I integrated Tesseract C/C++, version 3.x, to read English OCR on images.

It’s working pretty well, but it is very slow. It takes close to 1000ms (1 second) to read the attached image (00060.jpg) on my quad-core laptop.

I’m not using the Cube engine, and I’m feeding only binary images to the OCR reader.

Is there any way to make it faster? Any ideas on how to make Tesseract read faster?
thanks
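[image: 00060] https://cloud.githubusercontent.com/assets/9968625/13674495/ac261db4-e6ab-11e5-9b4a-ad91d5b4ff87.jpg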

@stweil

Contributor

commented Mar 10, 2016

You can already run 4 parallel instances of Tesseract on your quad core, then it will read 4 images in about the same time. Introducing multi threading would not help to reduce the time needed for an OCR of many images. I am working on a project where OCR with Tesseract would take nearly 7 years on a single core, but luckily I can try to get many computers and use their cores, so the time can be reduced to a few days.
Using compiler settings which are optimized for your CPU helps to gain a few percent, but I am afraid that for a larger gain different algorithms in Tesseract and its libraries would be needed.
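The "one process per core" approach can be sketched in Python as below. This is only an illustration, not part of Tesseract: `run_tesseract` and `ocr_many` are hypothetical helper names, and the sketch assumes a `tesseract` executable on the PATH that accepts `stdout` as its output base.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tesseract(image_path):
    """Run one Tesseract process on one image and return the recognized text."""
    # Assumption: passing "stdout" as the output base prints the text.
    done = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return done.stdout


def ocr_many(image_paths, workers=None, run=run_tesseract):
    """OCR several images at once, keeping up to `workers` processes busy."""
    image_paths = list(image_paths)
    workers = workers or os.cpu_count() or 1
    # The threads only wait for child processes, so the OCR work itself
    # runs in parallel in the separate tesseract processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(image_paths, pool.map(run, image_paths)))
```

Calling `ocr_many(["a.png", "b.png", "c.png", "d.png"])` on a quad core would keep four Tesseract processes running until the list is exhausted; the `run` parameter exists only so the scheduling can be tried without Tesseract installed.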

@ychtioui

Author

commented Mar 10, 2016

Besides the OCR, we have other things that need to run on the other cores.
I believe, the main issue that's slowing down Tesseract is the way memory is managed.
Too many memory allocations (new) and releases (delete or delete[]) slow down the reader.
In the past, I used a different OCR engine that allocated large buffers up front to store all the needed data (a large buffer of blobs, a large buffer of lines, a large buffer of words and their corresponding data). The buffers were simply indexed as we read the data from an image. They were allocated only once, upon OCR engine initialization, and released only once, upon OCR engine shutdown. This memory management scheme was very efficient in terms of computation time.
Are there any settings for Tesseract that are known to be computationally intensive?
any tricks to speed up Tesseract?

@tfmorris

Contributor

commented Mar 10, 2016

What evidence is your memory management speculation based on?

@ychtioui

Author

commented Mar 10, 2016

I'm not speculating anything. The reality is that TesseRact takes more than 3 seconds to read the above image that I initially attached (I use VS2010). When I use the console test application that comes with the TesseRact, it takes about the same time (more than 3 seconds).

Anyone would speculate a lot in 3 seconds

I have more than 20 years in machine vision. I used several OCR engines in the past. Actually I have one -in house- that reads the same image in less than 100ms, but our engine is designed more for reading a single line of text (i.e. it returns a single line of text).

TesseRact database is not that large. Most of the techniques used by TesseRact are quite standard in the OCR-area (page layout, line extraction, possible character extraction, word forming, and then several phases of classification). However, the TesseRact manages very badly memory usage. why? it takes more than 3 seconds to read a typical texted-image.

please if you're not bringing any meaningful ideas to my posting, just spare me your comment.

@stweil

Contributor

commented Mar 11, 2016

@ychtioui, as you have spent many years in machine vision, you know quite well that there are lots of reasons why programs can be slow. Memory management is just one of them. Even with a lot of experience, I'd start by running performance analyzers to investigate performance issues. Of course I can guess what the possible reasons might be and try to improve the software based on those guesses, but improvements based on evidence (like the result of a performance analysis) are more efficient. Don't you think so, too? Do you have a chance to run a performance analysis?

@zdenop

Contributor

commented Mar 11, 2016

You can try to use version 3.02 if you need only English. AFAIR it was significantly faster on my (old) computer.

Zdenko

@ychtioui

Author

commented Mar 14, 2016

I'm running version 3.02
I'm going through different sections of the reader, and checking which section is taking the most time.

is it typical to read images (such as mine attached above) in a few seconds?

thanks for your comments.

@amitdo

Contributor

commented Mar 18, 2016

... 3.02 version ... AFAIR it was significantly faster on my (old) computer.

3.02.02 is compiled with '-O3' by default.
https://github.com/tesseract-ocr/tesseract/blob/3.02.02/configure.ac#L161

3.03 and 3.04 are compiled with '-O2' by default.
https://github.com/tesseract-ocr/tesseract/blob/3.03-rc1/configure.ac#L201
https://github.com/tesseract-ocr/tesseract/blob/3.04.01/configure.ac#L300

2.04 and 3.01 are compiled with '-O2' by default.
https://github.com/tesseract-ocr/tesseract/blob/2.04/configure.ac
https://github.com/tesseract-ocr/tesseract/blob/3.01/configure.ac
The 'configure.ac' script in these versions does not explicitly set the '-O' level, so autotools will use '-O2' as default.

@ychtioui

Author

commented Mar 18, 2016

thanks amitdo.
I'm using 3.02 but the C/C++ version of Tesseract.
I couldn't find the setting -O3 in the source files. where is it?

@amitdo

Contributor

commented Mar 18, 2016

What I linked to was actually 3.02.02

I think this is 3.02:
https://github.com/tesseract-ocr/tesseract/blob/d581ab7e12a2fac4a73ac0af4ce7ec522b8f3e42/configure.ac

You are right. It does not contain any '-On' flag, so if you are using autotools to build Tesseract, it will instruct the compiler to use '-O2'.

@amitdo

Contributor

commented Mar 18, 2016

I assume you are using Tesseract on Linux / FreeBSD / Mac. On Windows + MS Visual C++ the configure.ac file is irrelevant.

@Shreeshrii

Contributor

commented Mar 19, 2016

@ychtioui said in a post above "I use VS2010" so using Windows.

@amitdo

Contributor

commented Mar 19, 2016

Thanks Shree.

I don't know which optimization level is used for Visual C++.

@ychtioui

Author

commented Mar 19, 2016

I use VS2010 on a Windows 7 PC.
Project settings or build options won't change the read speed much.
Tesseract was designed in research labs. Most of the key sections of the reader were written without regard for speed.
I used some performance tools to analyze where most of the computation time is spent.
In the page layout section, the blob analyzer does a lot of new/delete, which is very time consuming. The attached image above has more than 3600 blobs, and a number of processing steps are run on each blob (distance transform, finding the enclosing rectangle, measuring blob parameters, etc.). Allocating (new) and releasing (delete) all these blobs is very time consuming.
Instead, we could use a global array of blobs (BLOBNBOX objects, to be exact) allocated up front, and whenever we need a blob, just take the next index from the array. The array would be released only once, when we shut down the engine.
I used this concept in another single-line OCR reader and it's super fast.
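The indexing scheme described above can be sketched roughly as follows. This is not Tesseract code: `BlobPool` is a hypothetical name, each record is a plain list standing in for a blob's bounding box, and Python will not show the speed difference that the same idea would have in C++. It only illustrates "allocate once, hand out indices, reset between images".

```python
class BlobPool:
    """Fixed-capacity pool: records are allocated once and reused by index."""

    def __init__(self, capacity):
        # One up-front allocation. Each record is [left, top, right, bottom],
        # a stand-in for whatever data a real blob object would carry.
        self._records = [[0, 0, 0, 0] for _ in range(capacity)]
        self._used = 0

    def acquire(self):
        """Return the index of the next free record."""
        if self._used == len(self._records):
            raise MemoryError("blob pool exhausted")
        index = self._used
        self._used += 1
        return index

    def record(self, index):
        """Return the record stored at an index handed out by acquire()."""
        return self._records[index]

    def reset(self):
        """Make all records reusable for the next image without freeing them."""
        self._used = 0
```

A caller would create the pool at engine start-up with a capacity above the largest expected blob count (the image above has more than 3600), call `acquire()` for each blob found, and call `reset()` before the next image.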

@zdenop

Contributor

commented Mar 19, 2016

VS2010 uses the optimization flag /O2 (Maximize speed); other flags are set to their defaults.
In the past there were warnings in the forum against changing compiler optimization flags, as they can also affect OCR results. This is the reason why the standard optimization flags are used (-O2 in autotools and /O2 in VS).

I tried to run the perf tool on Linux:
perf record tesseract eurotext.tif eurotext
and I got this report (perf report):

  39,77%  tesseract  libtesseract.so.3.0.4  [.] tesseract::SquishedDawg::edge_char_of
  13,98%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
  13,09%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
   4,22%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   2,66%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
   1,48%  tesseract  libtesseract.so.3.0.4  [.] ELIST_ITERATOR::forward
   1,16%  tesseract  libc-2.19.so           [.] _int_malloc
   1,15%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ShapeTable::MaxNumUnichars
   1,01%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ExpandShapesAndApplyCorrections
   0,87%  tesseract  liblept.so.5.0.0       [.] rasteropLow
   0,79%  tesseract  libm-2.19.so           [.] __mul
   0,72%  tesseract  libtesseract.so.3.0.4  [.] FPCUTPT::assign
   0,71%  tesseract  libc-2.19.so           [.] _int_free
   0,71%  tesseract  libtesseract.so.3.0.4  [.] ELIST::add_sorted_and_find
   0,61%  tesseract  libtesseract.so.3.0.4  [.] tesseract::AmbigSpec::compare_ambig_specs
   0,57%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeNormMatch
   0,52%  tesseract  libc-2.19.so           [.] memset
   0,49%  tesseract  libc-2.19.so           [.] vfprintf
   0,45%  tesseract  libc-2.19.so           [.] malloc
   0,36%  tesseract  libtesseract.so.3.0.4  [.] SegmentLLSQ
   0,31%  tesseract  libm-2.19.so           [.] __ieee754_atan2_sse2
   0,31%  tesseract  libc-2.19.so           [.] malloc_consolidate
   0,30%  tesseract  libtesseract.so.3.0.4  [.] LLSQ::add
   0,29%  tesseract  libtesseract.so.3.0.4  [.] GenericVector<tesseract::ScoredFont>::operator+=
   0,29%  tesseract  libtesseract.so.3.0.4  [.] _ZN14ELIST_ITERATOR7forwardEv@plt
   0,28%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ComputeFeatures
   0,25%  tesseract  liblept.so.5.0.0       [.] pixScanForForeground
   0,24%  tesseract  libtesseract.so.3.0.4  [.] GenericVector<tesseract::ScoredFont>::reserve
   0,20%  tesseract  libtesseract.so.3.0.4  [.] C_OUTLINE::increment_step
   0,20%  tesseract  [kernel.kallsyms]      [k] clear_page

According to this report, the top 3 functions consumed about 67% of the "time".
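As a quick check of that sum, here is a small sketch that adds up the overhead column of the first rows of the report. `top_share` is a hypothetical helper, and it assumes the decimal-comma format shown in the report above.

```python
def top_share(report_lines, n=3):
    """Sum the overhead percentages of the first n rows of a perf report."""
    shares = []
    for line in report_lines:
        fields = line.split()
        if not fields or not fields[0].endswith("%"):
            continue  # skip headers and blank lines
        # The report uses a decimal comma, so normalize it before parsing.
        shares.append(float(fields[0].rstrip("%").replace(",", ".")))
    return round(sum(shares[:n]), 2)


report = [
    "  39,77%  tesseract  libtesseract.so.3.0.4  [.] tesseract::SquishedDawg::edge_char_of",
    "  13,98%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays",
    "  13,09%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature",
    "   4,22%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses",
]
print(top_share(report))  # 66.84
```

The memory-related rows (`_int_malloc`, `_int_free`, `malloc`, `malloc_consolidate`) can be summed the same way by filtering on the symbol name instead of taking the first rows.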

Then I tried a 4-page (A4) TIFF (G4 compressed):

  52,24%  tesseract  libtesseract.so.3.0.4  [.] tesseract::SquishedDawg::edge_char_of
  12,06%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
  10,06%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
   3,57%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   1,90%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
...

Then I tried a non-English image: perf record tesseract hebrew.png hebrew -l heb:

  27,79%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
  27,34%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
   4,40%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   3,98%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
   3,05%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeNormMatch
   2,36%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ShapeTable::MaxNumUnichars
   2,05%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ExpandShapesAndApplyCorrections
...
@zdenop

Contributor

commented Sep 13, 2016

Just for the record, as a possible improvement for this issue: there was interesting information posted in the scantailor project: OpenCL alone only brings ~2x speed-up. Another ~6x speed-up comes from multi-threaded processing.

@anantthebiker

commented Oct 21, 2016

Hi @ychtioui, I am a newbie and saw in your first comment that you are able to get pretty accurate results from Tesseract. For your image itself I am not able to get any results; it says: Can't recognize image. Can you please provide the code snippet showing how you are processing the image?
Thanks - Anant.

@amitdo

Contributor

commented Nov 28, 2016

@theraysmith
What do you use in the internal Google build, -O2 or -O3?

@paladini

commented Apr 8, 2017

I'm interested in the same answer, @amitdo . Can you answer the question, @theraysmith ? It really can help us :)

@stweil

Contributor

commented Apr 8, 2017

Don't expect much difference between -O2 and -O3. I tried different optimizations, and they only have small effects on the time needed for OCR of a page. Higher optimization levels can even result in slower code because the code gets larger (because of loop unrolling), so CPU caches become less effective. It is much more important to write good code.

@theraysmith

Contributor

commented Apr 14, 2017

@stweil

Contributor

commented Apr 15, 2017

The improvement by using -fopenmp is useful when you want "realtime" OCR – running OCR for a single page and waiting for the result. Then it is fast because it uses more than one CPU core for some time consuming parts of the OCR process.

For mass OCR, it does not help. If many pages have to be processed, it is better to use single threaded Tesseract and run several Tesseract processes in parallel.

@amitdo

Contributor

commented Apr 15, 2017

Stefan, what about using OpenMP for training?

@stweil

Contributor

commented Apr 15, 2017

Yes, for training a single new model OpenMP could perhaps speed up the training process. Up to now, OpenMP is only used in ccmain/ and in lstm/. I don't know how much that part is used during training, and I have never run a performance evaluation for the training process (in fact I have only run LSTM training once for Fraktur, and as I already said, it was not really successful).

@theraysmith

Contributor

commented Apr 17, 2017

@xlight

commented Apr 19, 2017

Can I set more than 4 threads for training LSTM?

@theraysmith

Contributor

commented Apr 19, 2017

@amitdo

Contributor

commented Apr 19, 2017

What about machines that have only 2 cores?
Shouldn't 'num_threads' be lowered to 2 in that case?

@theraysmith

Contributor

commented Apr 19, 2017

@ychtioui

Author

commented Apr 25, 2018

ShounakCy
Our in-house OCR reader is super fast at reading single lines in multiple fonts. It's proprietary (not open source).
Tesseract 4.x is much more accurate than 3.x since it uses neural networks.
I believe the key to improving Tesseract's speed is to use OpenCL.

@ShounakCy

commented Apr 26, 2018

@MattyCi

commented Apr 27, 2018

Hi, sorry if this is the wrong place to ask, but how are some users achieving very fast speeds compared to what I am getting? It takes me close to 4 seconds to run the OP's image. This user seems to run a 6-page PDF through tesseract in a matter of seconds, whereas it takes me minutes to run through that many pages of similar text. I have a Ryzen 3 1200 and 8 GB RAM. I have installed versions 3.02, 3.04, 3.05, and 4.00, all with the same results.

@zdenop

Contributor

commented Apr 30, 2018

Yes, this is the wrong place to post questions. As you can see, that user is using the version provided by his distribution. His speed is related to:

  • the power of his computer
  • the complexity of the input document.

@AbdelsalamHaa

commented May 10, 2018

I'm using tesseract 3.04 with ara.traineddata, and of course I also use the cube files. Initialization takes too much time: it takes 15 min just to initialize. Any idea how to improve that?

I'm using Visual Studio 2013.

@Shreeshrii

Contributor

commented May 10, 2018

@AbdelsalamHaa

commented May 10, 2018

I have tested 4.0; it's very good and fast.
The reason I'm using 3.04 is that I have many other libraries built with Visual Studio 2013, and tesseract 4 does not support VS 2013. That means if I want to use 4.0 I have to rebuild all the libraries again.

If you have any suggestions, please let me know.

@SandeepShaw2017

commented Jun 11, 2018

I am also having a similar issue. I have more than 50K documents; I ran OCR and it took 12 hours to process only 1000 PDFs. How can I make Tesseract fast? Can using Hadoop make it fast?

@raffopazzo

commented Jun 11, 2018

Have you tried OMP_NUM_THREADS=1 tesseract ... as described in #898 ?

@SandeepShaw2017

commented Jun 11, 2018

How to use "OMP_NUM_THREADS=1 tesseract" in R

@amitdo

Contributor

commented Jun 11, 2018

Have you tried OMP_NUM_THREADS=1 tesseract ... as described in #898 ?

OMP_NUM_THREADS=1 will have no impact.

#898 (comment)

Something that DOES work:
#898 (comment)

@raffopazzo

commented Jun 11, 2018

@amitdo oops I copied from the wrong comment. Indeed OMP_THREAD_LIMIT=1 tesseract... is what worked for me

@SandeepShaw2017

commented Jun 11, 2018

I am still not clear how to improve the speed .... my code is in R and I used ocr function .... where should I use "OMP_THREAD_LIMIT=1 tesseract..."

@raffopazzo

commented Jun 11, 2018

@SandeepShaw2017 I'm not sure I can help. I don't know much about R, so I can only give some general advice. If you are calling tesseract's functions directly from your R code, then you probably have to set the variable when running your own app, e.g. from the command line with OMP_THREAD_LIMIT=1 ./my-R-script, or via some System.setEnv("OMP_THREAD_LIMIT", 1). If you use tesseract as an application that your R code executes (e.g. via System.exec() or something similar), then you need to set the environment variable OMP_THREAD_LIMIT=1 for that process in whatever way R does it, or maybe via the same method as in the former case if the child process inherits the environment variables. You should do your own googling; this seems to be an R-specific issue rather than a tesseract one.
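For readers who are not tied to R, the same "set the variable for the child process" idea looks roughly like this in Python. It uses OMP_THREAD_LIMIT, the variable named earlier in the thread; `single_thread_env` and `run_limited` are hypothetical helper names, and a stand-in child process is used so the sketch runs without Tesseract.

```python
import os
import subprocess
import sys


def single_thread_env(base=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"  # environment values must be strings
    return env


def run_limited(command):
    """Run a command so that it (and its own children) inherit the limit."""
    return subprocess.run(command, env=single_thread_env(),
                          capture_output=True, text=True, check=True)


# Stand-in child that just reports the variable it received. For real use
# the command would be something like ["tesseract", "page.tif", "page"].
child = [sys.executable, "-c",
         "import os; print(os.environ['OMP_THREAD_LIMIT'])"]
print(run_limited(child).stdout.strip())  # 1
```

Passing a modified copy through `env=` leaves the parent process's own environment unchanged, which matters if other code in the same application relies on multi-threaded OpenMP.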

@SandeepShaw2017

commented Jun 13, 2018

Setting Sys.setenv(OMP_THREAD_LIMIT= 1) is still taking more than 20 sec ..... can processing in R hadoop rmr2 help to reduce process time

@amitdo

Contributor

commented Jun 13, 2018

Use multi-threading in your application. Initialize N instances of TessBaseAPI. N should be the number of CPU cores. Each instance should handle a different image.
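A rough Python sketch of the "one instance per worker" pattern follows. It deliberately does not name a specific Tesseract binding: `make_engine` and `recognise` are caller-supplied placeholders (for instance a wrapper that constructs one TessBaseAPI object and one that feeds it an image), and `ocr_pool` is a hypothetical name. Whether threads give a real speed-up in Python depends on the binding releasing the interpreter lock during recognition, which is an assumption here.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def ocr_pool(make_engine, recognise, image_paths, workers=None):
    """OCR images with one long-lived engine per worker thread.

    make_engine() builds an engine; recognise(engine, path) returns text.
    Both are supplied by the caller.
    """
    workers = workers or os.cpu_count() or 1
    local = threading.local()

    def init():
        # Runs once in each worker thread, so engines are never shared.
        local.engine = make_engine()

    def work(path):
        return recognise(local.engine, path)

    with ThreadPoolExecutor(max_workers=workers, initializer=init) as pool:
        return list(pool.map(work, image_paths))
```

The point of the initializer is that the expensive engine set-up happens at most N times, not once per image, while each image is still handled by exactly one instance.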

@SandeepShaw2017

commented Jun 14, 2018

Dear Amit .... i am having 4 cores ... so does that mean I will be using the ocr tool in 4 consoles of RStudio ....

@amitdo

Contributor

commented Jun 14, 2018

I don't know R. Just try and see.

@stweil

Contributor

commented Jun 15, 2018

Adding more consoles that run R with Tesseract until all CPU cores are fully used is one way to get maximum throughput.

@WaltPeter

commented Jul 2, 2018

Hi, I have a similar but slightly different problem here.
I am using Python 3.7 with Tesseract 3.02. And I am new to Tesseract.
I used the pytesseract.image_to_string function, and it took a very long time on the "first run".

result for first run:

'Cuz my associate professor 
at college advises the club. 

Duration: 259.72785544395447

result for second run:

'Cuz my associate professor
at college advises the club.

Duration: 0.9130520820617676

Can anyone please explain to me why this happens? This is the 2nd day I am using Tesseract.
Thank you.


complete python code:

import time

import pytesseract  # needed for the pytesseract.* calls below
from PIL import Image

# Point pytesseract at the installed Tesseract executable.
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"

start_time = time.time()

img = Image.open("Pic5.png")
print("start")
result = pytesseract.image_to_string(img, config='--tessdata-dir "C:/Program Files (x86)/Tesseract-OCR/tessdata"')
print(result)

# Wall-clock time for loading the image plus one OCR call.
duration = time.time() - start_time
print("\nDuration:", duration)

@stweil

Contributor

commented Jul 2, 2018

@WaltPeter, you are obviously running on Windows, so anything can happen in the background and delay your test, for example AV scans, disk defragmentation or software updates. Try running your test many times to see how times vary.

A Python program name.py is compiled at the first run into a name.pyc, but that should not take more than a second. You can remove all *.pyc files to force a new compilation.
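To see how the times vary, the measurement can be repeated in a loop. Below is a minimal sketch with hypothetical helper names (`time_runs`, `summarise`); `task` would be a function that performs one OCR call.

```python
import statistics
import time


def time_runs(task, repeats=5):
    """Call task() several times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarise(durations):
    """Separate the first (cold) run from the later (warm) runs."""
    warm = durations[1:] or durations
    return {
        "first": durations[0],
        "warm_median": statistics.median(warm),
        "warm_max": max(warm),
    }
```

If only the first run is slow and the warm runs stay close together, the delay is more likely a one-off start-up cost or background activity than something in the OCR itself.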

@burinov

commented Jul 6, 2018

So, guys... How do we speed things up? Any practical ideas?

@Wesley-Li

commented Sep 28, 2018

I get the same issue with Tesseract 4.0.0 beta on my CentOS 7.3 setup.
It takes 0.91 seconds to detect one character.
Are there any updates on this issue?

@dagnelies

commented Jan 18, 2019

Just a detail, but I recommend using OMP_THREAD_LIMIT=1 so that tesseract runs in single thread mode.

By default, tesseract runs in multithreaded mode, but apparently this just burns CPU cycles without much benefit. Here is an example on a 4-core machine:

root@ubuntu-16gb-nbg1-1:/fv# export OMP_THREAD_LIMIT=1
root@ubuntu-16gb-nbg1-1:/fv# time tesseract 2.tif 2.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
...
Page 12

real    0m34.300s
user    0m33.682s
sys     0m0.617s
root@ubuntu-16gb-nbg1-1:/fv# export OMP_THREAD_LIMIT=4
root@ubuntu-16gb-nbg1-1:/fv# time tesseract 2.tif 2.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
...
Page 12

real    0m31.943s
user    1m19.374s
sys     0m1.346s

This consumes about 2.4 times as much CPU time (1m19s vs. 34s of user time) while being less than 10% faster (about 7% in wall-clock time).
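The ratio can be worked out directly from the two `time` outputs above. A small sketch, where `to_seconds` is a hypothetical helper and only the user and real rows are compared:

```python
def to_seconds(stamp):
    """Convert a `time` value such as '1m19.374s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Values copied from the two runs above (limit 1 vs. limit 4).
real_1, user_1 = to_seconds("0m34.300s"), to_seconds("0m33.682s")
real_4, user_4 = to_seconds("0m31.943s"), to_seconds("1m19.374s")

cpu_ratio = user_4 / user_1             # extra CPU time spent with 4 threads
wall_gain = (real_1 - real_4) / real_1  # fraction of wall-clock time saved

print(f"CPU time ratio: {cpu_ratio:.2f}x")          # 2.36x
print(f"wall-clock reduction: {wall_gain:.1%}")     # 6.9%
```

So the four-thread run spends roughly 2.4 times the CPU time to finish about 7% sooner, which is why the single-thread setting is attractive when several pages are processed side by side.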

@stweil

Contributor

commented Jan 19, 2019

Yes, I can confirm that. For mass production I'd even build an executable without OpenMP support (configure --disable-openmp ...) to remove the remaining overhead.

@tfmorris

Contributor

commented Jan 19, 2019

Sounds like we should change the build defaults if OpenMP is providing no real benefit.

@dagnelies

commented Jan 21, 2019

I have no idea how the multithreading takes place, but I have a feeling it's too low-level, resulting in more overhead than gains. If the document's pages as a whole were processed in parallel, that would probably be a real boost!

zdenop added a commit that referenced this issue Jan 24, 2019

@zdenop

Contributor

commented Jan 24, 2019

I turned off the default OpenMP usage for cmake.
A patch for autotools is welcome (I will not be able to get to my Linux machine soon).

@noyessie

commented Feb 1, 2019

I'm not speculating anything. The reality is that TesseRact takes more than 3 seconds to read the above image that I initially attached (I use VS2010). When I use the console test application that comes with the TesseRact, it takes about the same time (more than 3 seconds).

Anyone would speculate a lot in 3 seconds

I have more than 20 years in machine vision. I used several OCR engines in the past. Actually I have one -in house- that reads the same image in less than 100ms, but our engine is designed more for reading a single line of text (i.e. it returns a single line of text).

TesseRact database is not that large. Most of the techniques used by TesseRact are quite standard in the OCR-area (page layout, line extraction, possible character extraction, word forming, and then several phases of classification). However, the TesseRact manages very badly memory usage. why? it takes more than 3 seconds to read a typical texted-image.

please if you're not bringing any meaningful ideas to my posting, just spare me your comment.

Hi @ychtioui,

I am in the same situation as you. I have many single-line text images, and I would like to know if you can suggest a fast and good OCR engine like the one you mention in your example.

Thanks in advance.
