Feature Request: Utility to unpack lstmf files #2669

Shreeshrii · 2019-09-23T04:44:38Z

lstmf files contain the image information, ground truth text (and box file information?).

Lines 196 to 204 in c40159a

    
    private:
      STRING imagefilename_;             // File to read image from.
      int32_t page_number_;              // Page number if multi-page tif or -1.
      GenericVector<char> image_data_;   // PNG/PNM file data.
      STRING language_;                  // Language code for image.
      STRING transcription_;             // UTF-8 ground truth of image.
      GenericVector<TBOX> boxes_;        // If non-empty boxes of the image.
      GenericVector<STRING> box_texts_;  // String for text in each box.
      bool vertical_text_;               // Image has been rotated from vertical.

It will be useful to have a utility (similar to combine_tessdata used for traineddata files) which can be used to extract all the components of lstmf files.

This will be useful for verifying the correctness of the files, and it would also remove the need to keep the original input files (tif/groundtruth/box).

stweil · 2019-10-06T20:45:00Z

Which image format would you prefer? PNG? TIFF?

stweil · 2019-10-06T20:49:19Z

I think that the box file information is implicitly there, because box files for LSTM only need line texts and the bounding box for each line (so you won't get character boxes which might have existed in the initial box files).

Shreeshrii · 2019-10-07T01:16:10Z

lstmf files can be made from multi-page tifs also, so I would say tif as the output format.

Shreeshrii · 2019-10-07T08:09:24Z

Related request, I can open another issue, if you prefer...

Enhance combine_tessdata to also output info from the lstm files. It will be useful to know the network spec and whether the lstm model is compressed or not (integer vs float). Any start model given for training could then be interrogated so that an appropriate error can be reported.

Ray had sent info via email on tessdata_fast models - see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification-for-tessdata_fast and #1404 (comment)

The network configuration is stored in the lstm data in the traineddata.
With a small change to combine_tessdata, I produced the attached.

AyushP123 · 2019-11-04T09:59:33Z

I would love to work on creating a new utility or expanding combine_tessdata if no one has taken this up yet.

Shreeshrii · 2019-11-27T13:41:39Z

@stweil Any progress on this? Even a single-line png version without box information would be useful. I am especially interested in checking how the RTL text is being stored in it.

stweil · 2019-11-27T13:56:24Z

It's less a technical problem than a question of where to add it in a user-friendly way. Technically the unpack feature would fit well into the tesseract executable because the relevant code parts are already there. But how can we extend the command line syntax for that program (which is already a mess today)? One possibility would be a syntax similar to the one used by git and others:

tesseract [<command>] ...

So tesseract can be followed by a command (for example recognize, lstmf info or lstmf unpack). That command is optional for backward compatibility.

@AyushP123, have you already started working on a utility?
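The optional-command syntax proposed above can be illustrated with a small dispatcher. This is only a sketch: the command names are assumptions based on the examples in this comment, and none of this is existing tesseract code. If the first argument is not a known command, the whole argument list is treated as a classic invocation, which is what keeps old command lines working.

```python
# Hypothetical command names, chosen for illustration only.
KNOWN_COMMANDS = {"recognize", "unpack", "info"}
DEFAULT_COMMAND = "recognize"


def split_command(args):
    """Split a git-style argument list into (command, remaining_args).

    If the first argument is not a known command, the default command is
    assumed and all arguments are passed through unchanged.
    """
    if args and args[0] in KNOWN_COMMANDS:
        return args[0], list(args[1:])
    return DEFAULT_COMMAND, list(args)


if __name__ == "__main__":
    print(split_command(["unpack", "eng.lstmf"]))
    print(split_command(["image.png", "-", "-l", "ara"]))
```

One consequence of this design is that an input image literally named like a command (for example a file called unpack) would be misread as a command, so a real implementation would need a rule for that case.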

Shreeshrii · 2019-11-27T14:18:13Z

I came across https://github.com/OpenArabic/OCR_GS_Data (used for Kraken Arabic models) and wanted to test training with it. But it looks like the wordstrbox file is not in the right format, so I wanted to check. (The text in the file created by the python script and in the one created by tesseract is in a different order.)

cat /home/ubuntu/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.gt.txt
الحسن: مصر عمر سبعة أمصار: المدينة، والبحرين، والبصرة، والكوفة،

python script - gt text copied as is

cat /home/ubuntu/OCR_GS_Data/ara/ground-truth/book_IbnFaqihHamadhani.Buldan_7_final_a_000004.box
WordStr 0 0 2977 170 0 #الحسن: مصر عمر سبعة أمصار: المدينة، والبحرين، والبصرة، والكوفة،
         0 0 2977 170 0

tesseract - order of text is reversed in wordstrbox

OMP_THREAD_LIMIT=1 tesseract /home/ubuntu/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.png - -l ara --tessdata-dir ~/tessdata_best wordstrbox
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 889
WordStr 1 39 2976 169 0 #.ةفوكلاو »ةرصبلاو »نيرحبلاو »ةنيدملا :راصمأ ةعبس رمع رِّصم :نسحلا
         2977 39 2981 169 0

tesseract - order of recognized text same as gt text

OMP_THREAD_LIMIT=1 tesseract /home/ubuntu/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.png - -l ara --tessdata-dir ~/tessdata_best
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 889
الحسن: مصِّر عمر سبعة أمصار: المدينة» والبحرين» والبصرة» والكوفة.
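To compare the two WordStr files above programmatically, the first line of each entry can be split into its box and its text. The sketch below assumes the layout seen in the samples (the keyword WordStr, then left, bottom, right, top and page, then # followed by the line text); the class and function names are made up for this example.

```python
from typing import NamedTuple


class WordStrBox(NamedTuple):
    left: int
    bottom: int
    right: int
    top: int
    page: int
    text: str


def parse_wordstr(line):
    """Parse one 'WordStr l b r t page #text' line as seen in the samples."""
    # Everything after the first '#' is the transcription of the line.
    head, sep, text = line.rstrip("\n").partition("#")
    fields = head.split()
    if not sep or len(fields) != 6 or fields[0] != "WordStr":
        raise ValueError("not a WordStr line: %r" % line)
    left, bottom, right, top, page = (int(f) for f in fields[1:])
    return WordStrBox(left, bottom, right, top, page, text)


if __name__ == "__main__":
    box = parse_wordstr("WordStr 0 0 2977 170 0 #sample text")
    print(box.right - box.left, box.top - box.bottom, box.text)
```

With both files parsed, checking whether one text is the character-wise reverse of the other is a plain string comparison.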

stweil · 2019-11-27T14:55:28Z

That looks interesting. https://github.com/OpenITI/ seems to include OCR_GS_Data and could be used to improve the Arabic model (or create a new one) for Tesseract.

A comparison of Tesseract with the published results from Kraken would also be interesting.

Shreeshrii · 2019-11-29T17:17:24Z

> A comparison of Tesseract with the published results from Kraken would also be interesting.

My first trial shows Tesseract with worse results compared to Kraken. However, the accuracy is getting pulled down by a subset of lines that are being recognized as multiple lines.

example.

gt =
البصرة والكوفة، وقد تفعل العرب هذا فتسمي الاثنين باسم الجميع، وقال

tess =
البصرة والكوفة
لكوفة، وقذ ره
تفعل العرب هذا فتسم
فتسمي الأاثنين با
مه ه ‎١‏
‏سم لجميع؛ وقال

Shreeshrii · 2019-11-30T03:58:47Z

Works with --psm 7 or --psm 13. I will rerun the reports.

$ tesseract ara-err.png - -l ara --tessdata-dir ~/tessdata_best --psm 6  -c page_separator=''
Warning: Invalid resolution 0 dpi. Using 70 instead.
البصرة والكوفة
لكوفة,؛ وقد ن
تفعل العربة هذا فتسمٌ
فتسمّي الأثنين با
م ‎٠‏ '
سم لجميع؛ وقال

$ tesseract ara-err.png - -l ara --tessdata-dir ~/tessdata_best --psm 7  -c page_separator=''
Warning: Invalid resolution 0 dpi. Using 70 instead.
البصرة والكوفة» وقد تفعل العربة هذا فتسمّي الأثنين باسم الجميعم؛ وقال

$ tesseract ara-err.png - -l ara --tessdata-dir ~/tessdata_best --psm 13  -c page_separator=''
Warning: Invalid resolution 0 dpi. Using 70 instead.
البصرة والكوفة» وقد تفعل العربة هذا فتسمّي الأثنين باسم الجميعم؛ وقال

Shreeshrii · 2019-11-30T07:51:36Z

Buldan: accuracy reports per character class, Kraken vs. Tesseract

7_final_a
                          Kraken                     Tesseract
Character class             Count  Missed  %Right     Count  Missed  %Right
ASCII Spacing Characters     8188      45   99.45      8188      16   99.80
ASCII Special Symbols        1033     190   81.61      1033      54   94.77
ASCII Digits                  203      26   87.19       203     203    0.00
Latin1 Spacing Characters      40      40    0.00        40      40    0.00
Latin1 Special Symbols          7       7    0.00         7       0  100.00
Basic Arabic                39179     124   99.68     39179     168   99.57
Arabic Extended              1675      92   94.51         -       -       -
Total                       50325     524   98.96     48650     481   99.01

7_final_a_200
                          Kraken                     Tesseract
Character class             Count  Missed  %Right     Count  Missed  %Right
ASCII Spacing Characters     6865      44   99.36      6865      13   99.81
ASCII Special Symbols         841     159   81.09       841      41   95.12
ASCII Digits                  158      42   73.42       158     158    0.00
Latin1 Spacing Characters      33      33    0.00        33      33    0.00
Latin1 Special Symbols          6       6    0.00         6       0  100.00
Basic Arabic                33142     104   99.69     33142     148   99.55
Arabic Extended              1385      94   93.21         -       -       -
Total                       42430     482   98.86     41045     393   99.04

7_final_b
                          Kraken                     Tesseract
Character class             Count  Missed  %Right     Count  Missed  %Right
ASCII Spacing Characters     8409      44   99.48      8409      11   99.87
ASCII Special Symbols         702     132   81.20       702      33   95.30
ASCII Digits                  103      13   87.38       103     103    0.00
Latin1 Spacing Characters      13      13    0.00        13      13    0.00
Latin1 Special Symbols          8       8    0.00         8       0  100.00
Basic Arabic                39509     121   99.69     39509    1471   96.28
Arabic Extended              1623      84   94.82      1330    1330    0.00
Total                       50367     415   99.18     50074    2961   94.09

7_final_b_200
                          Kraken                     Tesseract
Character class             Count  Missed  %Right     Count  Missed  %Right
ASCII Spacing Characters     8409      43   99.49      8409      11   99.87
ASCII Special Symbols         702     142   79.77       702      33   95.30
ASCII Digits                  103      16   84.47       103     103    0.00
Latin1 Spacing Characters      13      13    0.00        13      13    0.00
Latin1 Special Symbols          8       8    0.00         8       0  100.00
Basic Arabic                39509     131   99.67     39509    1471   96.28
Arabic Extended              1623     110   93.22      1330    1330    0.00
Total                       50367     463   99.08     50074    2961   94.09
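For readers checking the reports above: the %Right column appears to be (Count - Missed) / Count as a percentage, and the Total row is the sum of the class rows. The sketch below recomputes the Kraken total for 7_final_a under that assumption; the function name is made up for this example.

```python
def percent_right(count, missed):
    """Share of correctly recognized characters, in percent (2 decimals)."""
    if count <= 0:
        raise ValueError("count must be positive")
    return round(100.0 * (count - missed) / count, 2)


# (Count, Missed) per character class, Kraken, 7_final_a, from the report.
kraken_7_final_a = [
    (8188, 45),    # ASCII Spacing Characters
    (1033, 190),   # ASCII Special Symbols
    (203, 26),     # ASCII Digits
    (40, 40),      # Latin1 Spacing Characters
    (7, 7),        # Latin1 Special Symbols
    (39179, 124),  # Basic Arabic
    (1675, 92),    # Arabic Extended
]

total_count = sum(count for count, _ in kraken_7_final_a)
total_missed = sum(missed for _, missed in kraken_7_final_a)

if __name__ == "__main__":
    print(total_count, total_missed, percent_right(total_count, total_missed))
```

Because the Tesseract columns for 7_final_a and 7_final_a_200 have no Arabic Extended row, their totals are computed over fewer characters than the Kraken totals, so the two Total lines are not directly comparable.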

stweil · 2019-11-30T08:26:42Z

So Tesseract missed all digits and Arabic extended?

Shreeshrii · 2019-11-30T08:46:33Z

In their groundtruth files they have used 0-9 digits for the Arabic script digits. I substituted those so that the image and text would match. examples:

حدثنا بشر بن محمد بن أبان عن داود بن المخير عن الصلت [89 أ] بن دينار عن

https://github.com/OpenArabic/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final_a/a_000120.gt.txt

I haven't looked at Arabic extended to see what characters are there.

These results are based on using the finetuned traineddata on the training set (similar to what they have done in their research). So, it could be overfitted. I haven't tried it with other 'unseen' books for testing yet.

Shreeshrii · 2019-11-30T08:48:28Z

> 203 203 0.00 ASCII Digits

It is possible that my substitution also changed some 0-9 digits that were 0-9 in the images.
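The substitution described above can be sketched as follows. It assumes the target digits are the Arabic-Indic digits U+0660 to U+0669, which is what the recognized output elsewhere in this thread shows; the helper names are made up. A blanket replacement like this cannot tell which ASCII digits were really ASCII in the image, which is the risk just mentioned, so the second helper only flags lines for manual review.

```python
import re

# Map ASCII '0'..'9' to Arabic-Indic digits U+0660..U+0669 (assumed target).
ASCII_TO_ARABIC_INDIC = {ord(str(d)): chr(0x0660 + d) for d in range(10)}


def to_arabic_indic(text):
    """Replace every ASCII digit in text with its Arabic-Indic counterpart."""
    return text.translate(ASCII_TO_ARABIC_INDIC)


def lines_with_ascii_digits(lines):
    """Return the ground-truth lines that contain ASCII digits.

    These are the lines where a blanket substitution might be wrong if the
    image really shows 0-9, so they are candidates for a manual check.
    """
    return [line for line in lines if re.search(r"[0-9]", line)]


if __name__ == "__main__":
    print(to_arabic_indic("[89 a]"))
```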

Shreeshrii · 2019-12-01T04:46:05Z

> So Tesseract missed all digits and Arabic extended?

grep into Accuracy reports per image shows that Arabic extended is referring to digits in Arabic script. While they are not getting detected correctly with default psm, they are recognized with --psm 13.


   Count   Missed   %Right
       3        0   100.00   Arabic Extended
       3        0   100.00   Total

  Errors   Marked   Correct-Generated
       1        0   {}-{<\n>}

   Count   Missed   %Right
       1        0   100.00   {٠}
       1        0   100.00   {٤}
       1        0   100.00   {٦}

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 878
Empty page!!
Estimating resolution as 878
Empty page!!

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 3
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 878
Empty page!!
Estimating resolution as 878
Empty page!!

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 4
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 878
٤٠٦

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
٤٠٦

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 13
Warning: Invalid resolution 0 dpi. Using 70 instead.
٤٠٦

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 7
Warning: Invalid resolution 0 dpi. Using 70 instead.
٤٠٦

stweil · 2019-12-30T20:09:55Z

A first implementation is available in my unpack branch. It also introduces a new command line syntax. Usage:

tesseract unpack [LSTMF_FILE ...]

This writes two files (unpacked.gt.txt and unpacked.png, overwritten for each lstmf file, so currently not very useful) and shows the transcription and the first box information for each lstmf file.

Only simple lstmf files are currently handled. Do you have a more complex example (multi page tiff)?

Shreeshrii · 2019-12-31T05:39:03Z

Attached is a zip file with a sample of lstmf files in different languages, including multi-page tiff. Do you also need the corresponding images and transcription?

lstmf-samples.zip

Shreeshrii · 2020-01-01T11:47:29Z

@stweil The build is failing. https://travis-ci.org/stweil/tesseract/jobs/631055250#L552

libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../.. -O2 -DNDEBUG -I../../include -I./include -I../../src/arch -I../../src/ccmain -I../../src/ccstruct -I../../src/ccutil -I../../src/classify -I../../src/cutil -I../../src/dict -I../../src/lstm -I../../src/opencl -I../../src/textord -I../../src/training -I../../src/viewer -I../../src/wordrec -I/usr/include/leptonica -O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp -std=c++17 -MT src/api/libtesseract_la-hocrrenderer.lo -MD -MP -MF src/api/.deps/libtesseract_la-hocrrenderer.Tpo -c ../../src/api/hocrrenderer.cpp -o src/api/libtesseract_la-hocrrenderer.o
../../src/api/tesseractmain.cpp:342:10: fatal error: filesystem: No such file or directory
 #include <filesystem>
          ^~~~~~~~~~~~
compilation terminated.
Makefile:4011: recipe for target 'src/api/tesseract-tesseractmain.o' failed
make[2]: *** [src/api/tesseract-tesseractmain.o] Error 1
make[2]: *** Waiting for unfinished jobs....
mv -f src/api/.deps/libtesseract_la-lstmboxrenderer.Tpo src/api/.deps/libtesseract_la-lstmboxrenderer.Plo
mv -f src/api/.deps/libtesseract_la-hocrrenderer.Tpo src/api/.deps/libtesseract_la-hocrrenderer.Plo
mv -f src/api/.deps/libtesseract_la-baseapi.Tpo src/api/.deps/libtesseract_la-baseapi.Plo
make[2]: Leaving directory '/home/ubuntu/unpack/bin/master'
Makefile:4119: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/ubuntu/unpack/bin/master'
Makefile:1354: recipe for target 'all' failed
make: *** [all] Error 2

stweil · 2020-01-01T12:08:39Z

It requires C++17. Try to build without that include statement.

stweil · 2020-01-02T20:55:39Z

> Attached is a zip file with a sample of lstmf files in different languages, including multi-page tiff.

The latest test code should fix the build problem with compilers that don't support C++17, and it adds support for lstmf files with more than one text line. It handles all files in the sample, but I'm not sure that the result is as expected. The results are now written to individual files.

Shreeshrii · 2020-01-03T05:22:36Z

Still not able to build with -std=c++17

Making all in .
g++ -DHAVE_CONFIG_H -I. -I../..  -I../../src/arch -I../../src/ccstruct -I../../src/ccutil -I../../src/dict -I../../src/viewer -O2 -DNDEBUG -I../../include -I./include    -I/usr/include/leptonica  -O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp -std=c++17 -MT src/api/tesseract-tesseractmain.o -MD -MP -MF src/api/.deps/tesseract-tesseractmain.Tpo -c -o src/api/tesseract-tesseractmain.o `test -f 'src/api/tesseractmain.cpp' || echo '../../'`src/api/tesseractmain.cpp
make[2]: Entering directory '/home/ubuntu/tesseract/bin/power8'
g++ -DHAVE_CONFIG_H -I. -I../..  -I../../src/arch -I../../src/ccstruct -I../../src/ccutil -I../../src/dict -I../../src/viewer -O2 -DNDEBUG -I../../include -I./include    -I/usr/include/leptonica  -O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp -std=c++17 -MT src/api/tesseract-tesseractmain.o -MD -MP -MF src/api/.deps/tesseract-tesseractmain.Tpo -c -o src/api/tesseract-tesseractmain.o `test -f 'src/api/tesseractmain.cpp' || echo '../../'`src/api/tesseractmain.cpp
../../src/api/tesseractmain.cpp: In function ‘bool std::filesystem::exists(const char*)’:

../../src/api/tesseractmain.cpp: In function ‘bool std::filesystem::exists(const char*)’:
../../src/api/tesseractmain.cpp:408:10: error: ‘access’ was not declared in this scope
   return access(filename, 0) == 0;
          ^~~~~~
../../src/api/tesseractmain.cpp:408:10: note: suggested alternative: ‘acosl’
   return access(filename, 0) == 0;
          ^~~~~~
          acosl
../../src/api/tesseractmain.cpp: In function ‘bool std::filesystem::exists(const char*)’:
../../src/api/tesseractmain.cpp:408:10: error: ‘access’ was not declared in this scope
   return access(filename, 0) == 0;
          ^~~~~~
../../src/api/tesseractmain.cpp:408:10: note: suggested alternative: ‘acosl’
   return access(filename, 0) == 0;
          ^~~~~~
          acosl
Makefile:4013: recipe for target 'src/api/tesseract-tesseractmain.o' failed
make[1]: *** [src/api/tesseract-tesseractmain.o] Error 1
make[1]: Leaving directory '/home/ubuntu/tesseract/bin/power8'
Makefile:4121: recipe for target 'check-recursive' failed
make: *** [check-recursive] Error 1

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc64le-linux-gnu/7/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=powerpc64le-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-objc-gc=auto --enable-secureplt --with-cpu=power8 --enable-targets=powerpcle-linux --disable-multilib --enable-multiarch --disable-werror --with-long-double-128 --enable-checking=release --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

stweil · 2020-01-03T08:31:17Z

That's fixed now. An include statement was missing.

Shreeshrii · 2020-01-03T16:09:39Z

@stweil Thank you for adding this functionality. It will be a very useful feature.

I tested with a multi-page tiff generated by text2image for langdata/eng/eng.training_text. tesseract unpack creates single-line images and their ground truth transcriptions.

While all lines in the .tif are converted to .png line images, the order of the images does not match the order of the text lines in the original training_text, tif and box files, e.g. line 50 in the text file was image number 71.

The line image numbering is continuous for the lstmf file, there is no indication of page numbering of original tif.

Line-level box files are not generated.

engeval.zip - Multipage tif

I tested for Devanagari script, using both text2image generated box files and also using the wordstrbox files with a single text line as input. In both cases the output .png and .gt.txt were created correctly.

hineval.zip - text2image

saneval.zip - wordstrbox

For RTL, I tested with Arabic text. While the generated png and original tif match, the training text and generated gt.txt files do not match. I have not tested with Hebrew text yet.

araeval.zip - RTL

Shreeshrii · 2020-01-03T16:17:00Z

hebeval.zip

training text

אחרי אחת אבל מידע כמה במסגרת נולד לו של למרות ב' ב־4

generated gt.txt

4־ב 'ב תורמל לש ול דלונ תרגסמב המכ עדימ לבא תחא ירחא

Shreeshrii · 2020-01-03T16:34:20Z

The reversal for RTL languages is probably to be expected since order of text is reversed in the box files.
I will test further regarding the reversal in case of digits, punctuation etc.

stweil · 2020-01-03T17:01:23Z

> the order of images does not match the order of text lines in original training_text

That's correct. Tesseract shuffles the original data, see https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/linerec.cpp#L69. As it uses a pseudo random sequence which is derived from the document name, it might be possible to reverse that shuffling.
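The idea of reversing a shuffle that is seeded by the document name can be illustrated in general terms. The sketch below is not Tesseract's algorithm: it uses Python's random module and a CRC32 of the name as the seed, purely as stand-ins, to show that any shuffle driven by a reproducible seed can be undone by regenerating the same permutation.

```python
import random
import zlib


def _permutation(name, n):
    """Deterministic permutation of range(n) derived from a document name."""
    rng = random.Random(zlib.crc32(name.encode("utf-8")))
    order = list(range(n))
    rng.shuffle(order)
    return order


def shuffle_lines(name, lines):
    """Shuffle lines in an order that depends only on name and len(lines)."""
    order = _permutation(name, len(lines))
    return [lines[i] for i in order]


def unshuffle_lines(name, shuffled):
    """Undo shuffle_lines by regenerating the same permutation."""
    order = _permutation(name, len(shuffled))
    original = [None] * len(shuffled)
    for position, source_index in enumerate(order):
        original[source_index] = shuffled[position]
    return original


if __name__ == "__main__":
    lines = ["line %d" % i for i in range(5)]
    mixed = shuffle_lines("eng.training_text", lines)
    print(mixed)
    print(unshuffle_lines("eng.training_text", mixed) == lines)
```

Restoring the original order this way needs both the exact name used as the seed and the number of lines, so it only works if the unpack step knows both.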

stweil · 2020-01-03T17:04:04Z

> line level box files are not generated.

The internal data has character level boxes. Writing that data to box files must still be implemented.

Shreeshrii · 2020-01-03T17:05:33Z

> The reversal for RTL languages is probably to be expected since order of text is reversed in the box files.

Well, thinking further about it, reversed text is OK if tesseract unpack is generating box files. But if it is generating ground truth, then the text should be in the same order as the original text / recognized text.

Shreeshrii · 2020-01-04T05:10:06Z

kur_araeval.zip

original training_text

غير الموقع أن مركز برامج حتى الرمزية من يكون 24 - يوم

wordstrbox

WordStr 0 0 905 82 0 #موي - 24 نوكي نم ةيزمرلا ىتح جمارب زكرم نأ عقوملا ريغ
0 0 905 82 0

text generated by tesseract unpack (matches BOX not GT)

موي - 24 نوكي نم ةيزمرلا ىتح جمارب زكرم نأ عقوملا ريغ

Shreeshrii · 2020-01-04T05:19:46Z

Also tested with jpn and chi_sim - one line of training_text. Works as expected.

chi_simeval.zip
jpneval.zip

Shreeshrii · 2020-02-13T16:21:16Z

@stweil This is a useful feature. I would suggest merging it into the tesseract repo, with a comment about RTL as a TODO.

stweil · 2020-02-26T16:54:08Z

I am still looking for a way to output RTL text correctly, either with the ICU library or with existing Tesseract functions.

stweil · 2020-02-26T19:03:44Z

pybidi (part of the Python package python-bidi) can be used to fix the generated GT text.

Shreeshrii · 2020-02-27T01:49:25Z

Pybidi is what is used in the PR in tesstrain.

An earlier poster regarding RTL training had suggested fribidi
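For simple lines like the kur_ara sample earlier in this thread, where the whole line is reversed but the digit run 24 keeps its order, a naive approximation is to reverse the string and then reverse each digit run back. The sketch below does only that. It is not an implementation of the Unicode bidirectional algorithm and will mishandle embedded Latin words, brackets and possibly combining marks, which is why a real bidi library is the more robust route. The function name is made up.

```python
import re

_DIGIT_RUN = re.compile(r"\d+")


def visual_to_logical(text):
    """Naively convert a reversed (visual order) RTL line to logical order.

    Step 1 reverses the whole string; step 2 restores digit runs, which
    read left to right even inside RTL text.
    """
    reversed_text = text[::-1]
    return _DIGIT_RUN.sub(lambda match: match.group(0)[::-1], reversed_text)


if __name__ == "__main__":
    # Hebrew letters written with escapes to avoid display-order confusion:
    # visual "dalet gimel 24 bet alef" -> logical "alef bet 24 gimel dalet".
    visual = "\u05d3\u05d2 24 \u05d1\u05d0"
    print(visual_to_logical(visual) == "\u05d0\u05d1 24 \u05d2\u05d3")
```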

Shreeshrii · 2020-04-20T02:03:41Z

@stweil I am using this feature to convert lstmf files from text2image generated box and multi page tiffs to single line png images and gt.txt files that can be used with tesstrain.

Please consider committing this to the master branch.

Shreeshrii · 2021-02-22T05:39:42Z

@stweil Any plans to merge your changes?
I am not able to use the commits from your repo because of merge conflicts.

stweil · 2021-02-26T11:03:42Z

I rebased https://github.com/stweil/tesseract/tree/unpack now, so it should be possible to use it again.

The branch not only adds the unpack command but also the info command. Before that gets merged, I'd like to improve the code further.

Shreeshrii · 2021-02-26T12:51:15Z

Thanks, @stweil.

I will add another unpack feature request since you are looking to improve this further.

While creating the starter traineddata (proto model), tesseract outputs a readable version of the recoder (created by the combine_lang_model command, I think). The files are named [MODEL_NAME].charset_size=NNN.txt.

The info currently produced from a traineddata file by combine_tessdata does not include this readable format. It would be good to have an option to do so.

EDIT: In some cases I have seen two entries for NULL in these files and I wonder if that is correct.

0	 
1	<nul>
2	<nul>
3	/
4	(
5	व
6	ि
7	श
8	ल
9	य
10	े
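Whether duplicate entries such as the two <nul> lines above occur elsewhere in a file can be checked with a few lines of code. The sketch assumes the two-column layout of the excerpt (an index, a tab, then the label); the function name is made up, and lines that do not fit that layout are skipped.

```python
from collections import defaultdict


def duplicate_entries(lines):
    """Return {label: [indexes]} for labels that appear more than once."""
    seen = defaultdict(list)
    for line in lines:
        index, sep, label = line.rstrip("\n").partition("\t")
        # Skip anything that is not 'index<TAB>label'.
        if not sep or not index.strip().isdigit():
            continue
        seen[label].append(int(index))
    return {label: idx for label, idx in seen.items() if len(idx) > 1}


if __name__ == "__main__":
    sample = ["0\t ", "1\t<nul>", "2\t<nul>", "3\t/", "4\t("]
    print(duplicate_entries(sample))
```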

Shreeshrii · 2021-11-13T13:39:45Z

@stweil Any plans to include this in 5.0.0?

stweil · 2021-11-27T08:53:29Z

No, it won't be included in 5.0.0, but maybe in some later version (5.1.0?).

stweil added the feature request label Oct 6, 2019

stweil mentioned this issue Oct 9, 2019

Fine tuning Training related questions from forum tesseract-ocr/tesstrain#91

Closed

Shreeshrii mentioned this issue Nov 30, 2019

Add script to generate reversed RTL text wordstrbox files tesseract-ocr/tesstrain#127

Closed

Shreeshrii mentioned this issue Dec 1, 2019

Report on RTL training with OCR_GS_Data for Arabic tesseract-ocr/tesstrain#128

Open

amitdo added the training label May 18, 2020

Shreeshrii mentioned this issue Dec 22, 2020

obscure OCR after fine tuning? tesseract-ocr/tesstrain#222

Closed

Shreeshrii mentioned this issue Nov 27, 2021

Feature Request: Have combine_tessdata output readable version of recoder file #3660

Open
