Unicharset — Incomplete properties #318
|
This is the case, the file I get from
I have some questions:
Is it normal?
|
To follow up on #316, I added the line
/home/ggdhines/github/tesseract/training/set_unicharset_properties -U unicharset -O new_unicharset --script_dir=/home/ggdhines/github/langdata/Latin.unicharset
after which new_unicharset looks like:
1 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0
(I'm only trying 3 characters right now.) This looks better than before (no null values), but I'm still getting the error:
@ne0zer0's questions are good ones. |
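For readers checking their own files, here is a minimal Python sketch that splits a line like the sample quoted above into its fields and flags entries whose glyph metrics still match the `0,255,0,255,0,0,0,0,0,0` pattern from that sample. The field names follow my reading of the unicharset(5) page linked elsewhere in this thread. Treating that exact pattern as "not yet filled in" is an assumption, and `parse_entry` and `has_placeholder_metrics` are hypothetical helper names, not part of Tesseract.

```python
from typing import NamedTuple, Optional, Tuple

# Metrics pattern seen in the sample line in this thread.
# Assumption: this exact pattern means "metrics not filled in yet".
PLACEHOLDER_METRICS = (0, 255, 0, 255, 0, 0, 0, 0, 0, 0)


class UnicharEntry(NamedTuple):
    char: str
    properties: str          # kept as raw text; not interpreted here
    metrics: Tuple[int, ...]  # the 10 comma-separated glyph metrics
    script: str
    other_case: str
    direction: str
    mirror: str
    normed_form: str


def parse_entry(line: str) -> Optional[UnicharEntry]:
    """Parse one unicharset line with glyph metrics.

    Returns None when the line does not have 8 leading fields or when
    the third field is not a list of 10 integers (e.g. an older format).
    """
    fields = line.split()
    if len(fields) < 8:
        return None
    try:
        metrics = tuple(int(x) for x in fields[2].split(","))
    except ValueError:
        return None
    if len(metrics) != 10:
        return None
    # Anything after the first 8 fields is treated as a trailing comment.
    return UnicharEntry(fields[0], fields[1], metrics, *fields[3:8])


def has_placeholder_metrics(entry: UnicharEntry) -> bool:
    """True if the entry's metrics match the placeholder pattern."""
    return entry.metrics == PLACEHOLDER_METRICS


sample = "1 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0"
entry = parse_entry(sample)
if entry is not None:
    print(entry.char, entry.script, has_placeholder_metrics(entry))
```

Running a check like this over every line of a generated unicharset gives a quick count of how many entries still carry placeholder metrics.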
Also, I just realized that the example unicharset file in the "Compute the Character Set" section of the official documentation appears to be out of date (I think that's from Tesseract version 2). |
Try this:
Since your unicharset file has glyphs which belong to the |
Yes, you should read unicharset(5) doc: And for
Or is there a way to compute these xheights for every kind of font? Maybe it is a problem with the wctype functions on some systems? As I read in the documentation:
|
You need to pass this file to
Download these files:
Let's say you put these files in
Now, run this:
|
Thanks @amitdo for the help. I'm a little confused, though, as to why we need to use Latin.unicharset and Common.unicharset. Shouldn't we be teaching Tesseract new fonts based on the actual examples (and box files)? Using some pre-existing unicharset file makes it seem as if we're not actually training Tesseract on the new font. |
It is exactly what I do.
Already done.
Done after reading your post addressed to ggdhines
What I did, as I wrote in the first post (current directory):
where Common.unicharset is now located. But if I put Latin.unicharset there, what I get is not adapted to my fonts. As ggdhines says, according to the documentation, we have to compute the actual sizes of our fonts:
What you suggest is likely to produce such a degraded result (which is what I seem to be experiencing). Again from the above link:
So, since it is stated that "assign default values to missing fields, the accuracy will be degraded" and "is likely to have dire consequences", your proposal cannot be accepted, because it leads to what I experience if I do as you say: strange results. Unless I have misunderstood everything, in which case, as I am not the only one, the documentation needs to be reviewed. |
Here is some output following your recommendation: tesseract training:
set_unicharset_properties
shapeclustering
mftraining
The result is as strange as before, but now I have this warning in mftraining:
And why these failures?
|
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#shapeclustering-new-in-302
|
Yes, I know. At first, I tried without
Anyway, here is the output:
For the same result… |
And:
As long as you get only a few of these failures, it's probably OK.
This is also normal. Overall, the output of the commands looks OK. |
Thank you for your reply. But what about these questions:
|
ne0zer0:
It's the opposite of what you said. You are interpreting the above paragraph wrongly. I'll give you more answers later. |
The meaning of this 'CAVEAT': Starting from Tesseract version 3.02 the unicharset file should look like this:
If you use the old format:
the accuracy will be degraded, because the training tool will assign default (suboptimal) values to missing fields. |
|
What I understood is: an incomplete v3.02-format unicharset file (like the one I get) will, in the end, be equivalent to the old format, since some fields are set to zero. These zeros lead to default values. As information is missing, because 0 is put in place of the actual values, the accuracy will be degraded. |
@amitdo - why is this necessary at all? Shouldn't Tesseract be learning based on the training examples we provide? Pre-existing data isn't going to be helpful with new fonts. |
@ne0zer0 You should both be patient and wait for my further answers which may clear things up for you. |
Ok, I will wait. Thanks for your patience. |
When you run
In this stage all the fields except the
So we run
The tool will take our
More details:
What is left to fill are the 10
Tesseract does not provide a training tool that generates the correct values for the
Instead, there is a pre-made unicharset file for each script (the unicharset files in the
The tool will look for a few files in the directory you told it to search,
The tool will scan the lines in the
A second file that the tool will search for is a
According to Ray Smith, the lead developer of Tesseract:
('using' = 'training') The tool will process the info in |
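As a rough mental model of the step described above, the idea of reusing a pre-made per-script unicharset can be pictured as a lookup: for each character in our own unicharset, take the glyph metrics from the script's reference file when that file knows the character. The sketch below only illustrates that idea and is not the code of `set_unicharset_properties`. The function name `fill_metrics_from_script` is hypothetical, and the reference numbers are made up for illustration.

```python
from typing import Dict, Tuple

Metrics = Tuple[int, ...]

# Placeholder pattern taken from the sample line earlier in this thread.
PLACEHOLDER: Metrics = (0, 255, 0, 255, 0, 0, 0, 0, 0, 0)


def fill_metrics_from_script(
    ours: Dict[str, Metrics],
    script_reference: Dict[str, Metrics],
) -> Dict[str, Metrics]:
    """Prefer the reference metrics when the reference knows the character.

    Characters missing from the reference keep whatever metrics they had.
    This is a conceptual sketch, not the real tool's algorithm.
    """
    return {
        char: script_reference.get(char, metrics)
        for char, metrics in ours.items()
    }


# Our freshly extracted characters, all still carrying placeholders.
ours = {"1": PLACEHOLDER, "N": PLACEHOLDER, "§": PLACEHOLDER}

# Illustrative numbers only; not real measurements from any font.
reference = {
    "1": (59, 69, 203, 255, 45, 128, 0, 66, 74, 173),
    "N": (59, 68, 216, 255, 87, 236, 0, 27, 104, 227),
}

merged = fill_metrics_from_script(ours, reference)
for char, metrics in merged.items():
    status = "placeholder" if metrics == PLACEHOLDER else "filled"
    print(char, status)
```

Under this model, a character that does not appear in the reference file keeps its placeholder metrics, which would be one way an output unicharset could still look incomplete after the tool runs.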
That's it. I did my best to explain the things you asked about. You can now respond... :) |
Hi, Thank you for this clarification. I am somewhat disappointed, and I have more questions than before:
but the generated xheights file is always blank, and the file Latin.xheights seems to do nothing (I already tried this), and I always get the same output_unicharset file, with or without Latin.xheights (located in the langdata folder, or in the current directory). What can be done with an xheights file filled with default values? Thank you |
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
If you develop such a tool (or hire someone to do so) we can add a link in the wiki to your site... My answer for your other questions: Your last question - you probably did something wrong if you get an empty file. I will try to test it later. Two last notes: |
The |
You seem to think that |
https://github.com/tesseract-ocr/tesseract/blob/master/doc/mftraining.1.asc |
Indeed. In fact, I expected too much from Tesseract.
I will wait for your test.
Sorry, but I did not want to hurt anybody.
No, just a mistake when I wrote.
Thanks for the link. Anyway, thank you for everything. |
I finally resolved my problem with xheights. I made a spelling mistake :s Anyway, thanks for everything. |
Nick White @nickjwhite tried to build the tool you want. |
More recently I made the addmetrics and xheights tools, which are in the tools directory of the git repo https://ancientgreekocr.org/grctraining.git |
Hi Nick! |
Pinging @nickjwhite ... |
Hi,
I get some strange result when I try to train Tesseract.
Some parts are much improved compared to the default eng.tessdata, while other parts are strangely added or modified, even though the image quality is very good ("24" becomes "eat"; uppercase letters become lowercase; some words are cut into two words; etc.).
I think it may be caused by the unicharset.
Indeed, when I try to generate a unicharset file with the following command:
unicharset_extractor eng.palladio-regular.exp8.box
I get an incomplete file. Here is the result:
When I try to fix it with set_unicharset_properties:
set_unicharset_properties --F font_properties -U unicharset -O output_unicharset --script_dir=/
I get these warnings:
And this incomplete file:
We can see that some parts are missing.
Here is the documentation, to see the difference and thus the "missing" part:
https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc