Feature Request: Utility to unpack lstmf files #2669
Comments
Which image format would you prefer? PNG? TIFF? |
I think that the box file information is implicitly there, because box files for LSTM only need line texts and the bounding box for each line (so you won't get character boxes which might have existed in the initial box files). |
lstmf files can be made from multi-page tifs also, so I would say tif as the output format. |
Related request, I can open another issue, if you prefer... Enhance Ray had sent info via email on tessdata_fast models - see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification-for-tessdata_fast and #1404 (comment)
|
I would love to work on creating a new utility or expanding combine_tessdata if anyone hasn't taken this up. |
@stweil Any progress on this? Even a single-line PNG version without box information would be useful. I am especially interested in checking how RTL text is stored in it. |
It's less a technical problem but a question where to add it in a user friendly way. Technically the unpack feature would fit well into the
So @AyushP123, have you already started working on a utility? |
I came across https://github.com/OpenArabic/OCR_GS_Data (used for Kraken Arabic models) and wanted to test training with it. But it looks like the wordstrbox is not in the right format, hence I wanted to check. (The file created using a Python script and the one created using tesseract are in a different order.)
|
That looks interesting. https://github.com/OpenITI/ seems to include OCR_GS_Data and could be used to improve the Arabic model (or create a new one) for Tesseract. A comparison of Tesseract with the published results from Kraken would also be interesting. |
My first trial shows Tesseract with worse results compared to Kraken. However, the accuracy is getting pulled down by a subset of lines that are being recognized as multiple lines. example. gt = tess = |
Works with --psm 7 or --psm 13. I will rerun the reports.
|
|
So Tesseract missed all digits and Arabic extended? |
It is possible that my substitution also changed some 0-9 digits that were 0-9 in the images. |
Grepping the per-image accuracy reports shows that "Arabic extended" refers to digits in Arabic script. They are not detected correctly with the default psm, but they are recognized with --psm 13.
tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./
tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 3
tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 4
tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 6
tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 13
tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 7 |
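The same comparison across page segmentation modes can be scripted. This is only a minimal sketch, assuming `tesseract` is on the PATH; the helper names `build_psm_commands` and `run_all` are made up for illustration, while the options (`-l`, `--tessdata-dir`, `--psm`, and `-` for stdout output) are the ones used in the commands above.

```python
import subprocess
from typing import List, Optional, Sequence


def build_psm_commands(image: str, lang: str, tessdata_dir: str,
                       psm_values: Sequence[Optional[int]]) -> List[List[str]]:
    """Return one tesseract command per page segmentation mode.

    A value of None means "do not pass --psm", i.e. use the default mode.
    """
    commands = []
    for psm in psm_values:
        cmd = ["tesseract", image, "-", "-l", lang,
               "--tessdata-dir", tessdata_dir]
        if psm is not None:
            cmd += ["--psm", str(psm)]
        commands.append(cmd)
    return commands


def run_all(commands: List[List[str]]):
    """Run each command and collect (command, recognized text) pairs."""
    results = []
    for cmd in commands:
        done = subprocess.run(cmd, capture_output=True, text=True, check=False)
        results.append((cmd, done.stdout))
    return results
```

Calling `run_all(build_psm_commands(image, "araKraken", "./", [None, 3, 4, 6, 13, 7]))` would reproduce the six runs listed above and make the outputs easy to compare side by side.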
A first implementation is available in my unpack branch. It also introduces a new command line syntax. Usage:
This writes two files ( Only simple lstmf files are currently handled. Do you have a more complex example (multi page tiff)? |
Attached is a zip file with a sample of lstmf files in different languages, including multi-page tiff. Do you also need the corresponding images and transcription? |
@stweil The build is failing. https://travis-ci.org/stweil/tesseract/jobs/631055250#L552
|
It requires C++17. Try to build without that include statement. |
The latest test code should fix the build problem with C++ before C++17 and adds support for lstmf files with more than one text line. It handles all files in the sample, but I'm not sure that the result is as expected. The results are now written to individual files. |
Still not able to build with -std=c++17
|
That's fixed now. An include statement was missing. |
@stweil Thank you for adding this functionality. It will be a very useful feature.
I tested with a multi-page tiff generated by text2image for langdata/eng/eng.training_text. All lines in the .tif are converted to .png line images, but the order of the images does not match the order of the text lines in the original training_text, tif and box files, e.g. line 50 in the text file was image number 71. The line image numbering is continuous for the lstmf file; there is no indication of the page numbering of the original tif. Line-level box files are not generated.
I tested Devanagari script, using both text2image-generated box files and wordstrbox files with a single text line as input. In both cases the output .png and .gt.txt were created correctly.
For RTL, I tested with Arabic text. The generated png and the original tif match, but the training text and the generated gt.txt files do not match. I have not tested with Hebrew text yet. |
training text
generated gt.txt
|
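One way to check the ordering of the unpacked lines is to match each training text line against the unpacked ground truth texts. A rough sketch follows; the function name `map_lines_to_images` is hypothetical, and both lists are assumed to have been read into memory already. Exact matching will not work for the RTL case described above, where the texts differ.

```python
from collections import defaultdict
from typing import Dict, List


def map_lines_to_images(training_lines: List[str],
                        unpacked_texts: List[str]) -> Dict[int, List[int]]:
    """Map each training-text line index to the indexes of the unpacked
    images whose ground truth matches it exactly (ignoring outer whitespace).

    Duplicate lines map to several image indexes; unmatched lines map to [].
    """
    positions = defaultdict(list)
    for image_index, text in enumerate(unpacked_texts):
        positions[text.strip()].append(image_index)
    return {
        line_index: positions.get(line.strip(), [])
        for line_index, line in enumerate(training_lines)
    }
```

Lines that map to an empty list are the ones worth inspecting by hand.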
The reversal for RTL languages is probably to be expected since order of text is reversed in the box files. |
That's correct. Tesseract shuffles the original data, see https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/linerec.cpp#L69. As it uses a pseudo random sequence which is derived from the document name, it might be possible to reverse that shuffling. |
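To illustrate why such a shuffle could be undone: any permutation that depends only on the document name can be regenerated and inverted. The sketch below does not replicate Tesseract's own random generator or hashing; it uses Python's `random` and `hashlib` as stand-ins, and the names `shuffle_order` and `unshuffle` are invented for this example.

```python
import hashlib
import random
from typing import List


def shuffle_order(document_name: str, count: int) -> List[int]:
    """Return a permutation of range(count) that depends only on the name."""
    digest = hashlib.sha256(document_name.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    order = list(range(count))
    random.Random(seed).shuffle(order)
    return order


def unshuffle(shuffled_items: list, document_name: str) -> list:
    """Undo the shuffle: shuffled_items[i] came from original index order[i]."""
    order = shuffle_order(document_name, len(shuffled_items))
    original = [None] * len(shuffled_items)
    for shuffled_pos, original_pos in enumerate(order):
        original[original_pos] = shuffled_items[shuffled_pos]
    return original
```

Reversing the real shuffling would need the exact generator and seed derivation from linerec.cpp; only the inversion step carries over from this sketch.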
The internal data has character level boxes. Writing that data to box files must still be implemented. |
Well, thinking further about it, reversed text is ok if generating box files via |
original training_text غير الموقع أن مركز برامج حتى الرمزية من يكون 24 - يوم
wordstrbox WordStr 0 0 905 82 0 #موي - 24 نوكي نم ةيزمرلا ىتح جمارب زكرم نأ عقوملا ريغ
text generated by موي - 24 نوكي نم ةيزمرلا ىتح جمارب زكرم نأ عقوملا ريغ |
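In the sample above the characters are reversed while the digits "24" keep their order. That relation can be approximated with a naive reversal that restores digit runs afterwards. This is not the Unicode bidirectional algorithm, just a small sketch of the pattern seen in this one example; the function name `visual_reverse` is hypothetical.

```python
import re


def visual_reverse(line: str) -> str:
    """Reverse all code points, then flip each run of digits back
    so that numbers still read left to right."""
    reversed_line = line[::-1]
    return re.sub(r"\d+", lambda m: m.group(0)[::-1], reversed_line)


# Logical order: "24 - " followed by the Arabic word for "day".
logical = "24 - \u064a\u0648\u0645"
visual = visual_reverse(logical)
```

Applying the function twice returns the original string, so for simple lines like this one it could also be used to turn an unpacked gt.txt back into logical order. Mixed-direction text, punctuation mirroring and combining marks need a real bidi implementation.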
Also tested with |
@stweil This is a useful feature. I would suggest merging it into the tesseract repo, with a comment about RTL as a TODO. |
I am still looking for a way to output RTL text, either with the ICU library or with existing Tesseract functions. |
|
Pybidi is what is used in the PR in tesstrain. An earlier poster regarding RTL training had suggested |
@stweil I am using this feature to convert lstmf files, made from text2image-generated box files and multi-page tiffs, to single-line png images and gt.txt files that can be used with tesstrain. Please consider committing this to the master branch. |
@stweil Any plans to merge your changes? |
I rebased https://github.com/stweil/tesseract/tree/unpack now, so it should be possible to use it again. The branch not only adds the |
Thanks, @stweil. I will add another unpack feature request since you are looking to improve this further. While creating the starter traineddata (proto model) tesseract outputs a readable version of the recoder (created by The current info produced from the traineddata file using combine_tessdata does not create this readable format. It will be good to have an option to do so. EDIT: In some cases I have seen two entries for NULL in these files and I wonder if that is correct.
|
@stweil Any plans to include this in 5.0.0? |
No, it won't be included in 5.0.0, but maybe in some later version (5.1.0?). |
lstmf files contain the image information, ground truth text (and box file information?).
tesseract/src/ccstruct/imagedata.h
Lines 196 to 204 in c40159a
It will be useful to have a utility (similar to combine_tessdata used for traineddata files) which can be used to extract all the components of lstmf files.
This will be useful for verifying the correctness of the files, and it would remove the need to save the original input files (tif/groundtruth/box).