output true CER for checkpoints (at least the final one) #3560
Comments
That's a problem which might have been less important before tesstrain was used: the initial training used lines with artificial ground truth. As those lines were long (and of similar length), CER and bag-of-character error would not differ much. Both numbers only diverge noticeably if the CER depends on the line length. Calculating a weighted mean error might be sufficient to get a CER. I expect that WER has the same issue.
It's true that the bag-of-character error rate (BCER) gets more "stable" with line length, and in that sense (only) CER and BCER tend to be more similar on longer lines. But BCER also always systematically underestimates CER, and both measures diverge (become more and more uncorrelated) the larger the total error is. (Thus, no weighting scheme for shorter lines can make up for this bias.) And as I said, my empirical data suggest that there's a huge difference. (But I'll still have to compute CER measurements of the training error for a direct comparison.)

Now, CER vs. BCER is quite a blow, but it's (relatively) easy to fix: just incorporate some C++ implementation of global sequence alignment (Needleman-Wunsch algorithm) and distance measure (Damerau-Levenshtein metric) into Tesseract and use that for the evaluation data in a checkpoint. The other problem mentioned above, i.e. training error instead of test error, is even bigger I believe: even the checkpointing itself merely runs on training error minima (since …)
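To illustrate the systematic underestimation mentioned above, here is a minimal C++ sketch (not Tesseract's actual `ComputeCharError`; the bag measure here is just a simplified stand-in for the histogram idea): an OCR result can have a bag-of-character error of zero and still a substantial CER.

```cpp
// Minimal sketch: contrast a "bag of characters" (histogram) error with a
// true CER based on Levenshtein distance. Plain std::string (bytes) is used
// for simplicity; real text would need code points or grapheme clusters.
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Plain Levenshtein distance (insert/delete/substitute, cost 1 each).
static size_t levenshtein(const std::string& a, const std::string& b) {
  std::vector<size_t> prev(b.size() + 1), curr(b.size() + 1);
  for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;
  for (size_t i = 1; i <= a.size(); ++i) {
    curr[0] = i;
    for (size_t j = 1; j <= b.size(); ++j) {
      size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, sub});
    }
    std::swap(prev, curr);
  }
  return prev[b.size()];
}

// Simplified "bag of characters" error: count surplus/missing characters,
// ignoring their order entirely.
static size_t bag_error(const std::string& gt, const std::string& ocr) {
  std::map<char, long> diff;
  for (char c : gt) ++diff[c];
  for (char c : ocr) --diff[c];
  size_t err = 0;
  for (const auto& kv : diff) err += static_cast<size_t>(std::abs(kv.second));
  return err;
}

int main() {
  std::string gt = "east", ocr = "tase";
  std::printf("bag error:   %zu\n", bag_error(gt, ocr));    // 0 -- looks perfect
  std::printf("Levenshtein: %zu\n", levenshtein(gt, ocr));  // 2 -- CER = 2/4 = 50%
}
```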
For avoidance of doubt: CER is usually computed based on Levenshtein distance in the narrow sense, i.e. with the allowed edit operations insert, delete and substitute, each with a cost of 1. Damerau is used as the term for additionally allowing transpositions. One of the formulas for CER (resp. WER, LER) is

CER = (i + d + s) / n

with i insertions, d deletions, s substitutions and n the length of the ground truth. It's easy to show that this formula can result in values > 1, e.g. for GRT = 'm' and OCR = 'iii', CER is 3. If we instead divide by the average length of GRT and OCR, CER is 1.5 in the above example.

Second, I couldn't find a definition for BCER, only one for BWER. The term 'bag' is used in IT for the mathematical term 'multiset'. If this is what is meant, measuring BCER (IMHO) only makes sense for mismatch tables, i.e. how often e.g. the character 'e' is mismatched with 'c' and how much this contributes to CER. Or does it make a difference for the training process to use BCER versus CER?

NB: An end-to-end measure should compare glyphs to graphemes, which should be the same as graphemes to graphemes in case of a well-transliterated GRT. The difference between character-based and grapheme-based is negligible for most scripts, but may be important for some Asian scripts. Also, a good global alignment needs to take splits ('h' -> 'li') and merges ('li' -> 'h') into account, which can be m:n.
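To make the arithmetic of the example explicit (my own rendering of the two normalisations discussed above, with i, d, s the numbers of insertions, deletions and substitutions):

```latex
% Levenshtein distance between GRT = 'm' and OCR = 'iii':
% one substitution (m -> i) plus two insertions (i, i), so i + d + s = 3.
\[
  \mathrm{CER}_{\mathrm{GT}}  = \frac{i + d + s}{|\mathrm{GRT}|} = \frac{3}{1} = 3,
  \qquad
  \mathrm{CER}_{\mathrm{avg}} = \frac{i + d + s}{\tfrac{1}{2}\,(|\mathrm{GRT}| + |\mathrm{OCR}|)}
                              = \frac{3}{\tfrac{1}{2}(1 + 3)} = 1.5 .
\]
```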
Sure (but you might want to be able to use weighted metrics later on, say for certain confusions that are semantically more/ir/relevant). This was merely to contrast with simple histogram metrics, which are not based on alignment.
The correct denominator is not the real issue here, but please have a look at this comment (I believe the natural choice is the length of the alignment path).
Yes, it's the same beast, only on character level.
Not really. For a true confusion/edit table you need the actual alignment.
The training process itself does not use either measure, but the CTC error (which gives a correct gradient but is not interpretable). For checkpointing and evaluation, usually CER on the test set is used, which is very different to BCER on the train set (see above).
True. You can get closer to graphemes if you glue codepoints to grapheme clusters (before or after alignment) and get the edit distance of those sequences instead. The quality of the alignment itself is another minor issue (but the best global path is usually unique except for left/right preference). But let's first try to get the basics right for Tesseract (to catch up with the rest of the OCR world).
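A minimal sketch of the "glue codepoints to grapheme clusters" step, assuming ICU is available (the function name `split_graphemes` is mine); the resulting cluster sequence can then be fed into the same edit-distance routine instead of raw codepoints:

```cpp
// Sketch: split a UTF-8 string into grapheme clusters with ICU, so that the
// edit distance can be computed over clusters instead of single codepoints.
// Error handling is kept minimal for brevity.
#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <memory>
#include <string>
#include <vector>

std::vector<icu::UnicodeString> split_graphemes(const std::string& utf8) {
  icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);
  UErrorCode status = U_ZERO_ERROR;
  std::unique_ptr<icu::BreakIterator> bi(
      icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
  std::vector<icu::UnicodeString> clusters;
  if (U_FAILURE(status)) return clusters;
  bi->setText(text);
  int32_t start = bi->first();
  for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
       start = end, end = bi->next()) {
    // Copy the cluster [start, end) into its own string.
    clusters.emplace_back(text, start, end - start);
  }
  return clusters;
}
// A Levenshtein routine then compares clusters for equality
// (icu::UnicodeString::operator==) instead of single code units.
```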
No one seems to care that their trained models are selected from suboptimal checkpoints and reported with way-too-optimistic error rates? As long as lstmtraining and lstmeval report figures the way they do, I consider this a bug (not a feature request). (Changing the messages should not be too difficult for a start, and would get the user's attention where it belongs.)
I don't think that we need changes for release 5.0.0. After all, the current models are not so bad, it is already possible to train even better models, and the required changes would delay the release further.

It is sufficient to enhance the current training process to write an additional new model after each epoch and use that model to calculate normal CER / WER values. That's what other OCR software like Calamari or Kraken does. And it is already possible to do that now by using an extra script which watches the latest checkpoint. I don't expect that those checkpoints at the end of an epoch will be much better, but we'll see.

Any error rate is only an indicator. It is only an objective number for a certain validation set. As long as people don't expect that the error rates in Tesseract training are the same as they will get with their real images, they are good enough to show the training progress.

There are other training aspects which I consider more important. One is continuation of an interrupted training. That should start with the line following the last one which was used for training. I'm afraid that it currently starts with the first line, so a training which was interrupted differs from a training which runs without interruption. And the second thing is performance. Thanks to …

And for later releases it would also be useful to support models from other OCR engines.
they could be even better if an appropriate checkpoint was selected – which makes this a huge waste
yes, but only if you know about the problem, and if you rely on external tools for CER calculation
as I said in my last post, fixing the messaging would already be a good start, and won't be much effort at all – i.e. instead of claiming … Also, instead of the occasional …
Yes, one can already do …
Of course it is only an indicator. But there are well-established standards here, and Tesseract not only ignores them, it's not even honest about that. (And again, it's not just about objective measurement, but also model selection.)
I agree and vote for it. Better sooner than later.
IMHO, as an observation, the training data itself has an important influence. The usual method of splitting a corpus into training and validation is a sort of incest. It is also self-fulfilling, because rare glyphs (e.g. the capital letters X or Y in German; Y appears in AustrianNewspapers with p = 29/2200309 ~ 0.00001) contribute little to the overall CER if recognised wrongly. My theories are that the models could be improved by emphasising rare glyphs a little more in the training sets, using more different fonts, and partitioning the corpus for historic OCR into periods of 50 years, which can be combined into models spanning 100 (or 150) years. I already do this for my dictionaries: 1750-99, 1800-49, etc.
@stweil wrote:
@bertsky wrote:
My problem is that I need a way to report CER in scientific contexts. And if Tesseract's self-evaluation is not to be trusted, then I consider this a severe problem (i.e. major bug). @bertsky wrote:
Agreed! IMHO, we cannot rely on people's awareness of this problem, which in turn may lead to incorrect training results getting published or used in project proposals etc. So, as a first step, the status messages should be adjusted so that they correctly state what they actually mean, to prevent misinterpretation (and this should naturally be done before a major release); as a second step, we should discuss how we can apply scientifically sound evaluation metrics to Tesseract's training procedure.
@wrznr, the problem is that people have now been waiting for a new stable release for two years. The current code in … It would be possible to add a final text which is printed when …
The old messages could wrongly be interpreted as CER / WER values, but Tesseract training currently uses simple bag of characters / bag of words error rates (see LSTMTrainer::ComputeCharError, LSTMTrainer::ComputeWordError). Signed-off-by: Stefan Weil <sw@weilnetz.de>
Maybe the RapidFuzz C++ library can be used. It supports several different metrics, and the license is compatible with Tesseract.
`Levenshtein language:c++ stars:>50 -license:gpl`

Edit: Indeed, RapidFuzz seems to be the best option.
What exactly are the requirements for a Levenshtein library?

Must have:

- For CER: …
- For normalized CER: …

How often is it called, e.g. on a journal page with 5,000 characters, and how long are the strings (line average ~60)?

Don't need IMHO (feature bloat is slow): …
With state-of-the-art bit-parallel algorithms (Myers 1999, Hyyrö 2004-2006) I get, for strings of length ~10, about 10 million comparisons per second in UTF-8 (distance or length of LCS), and 35 million for 8-bit. This is 700 times faster than edlib (which also uses bit-parallelism). Alignment adds the complexity of backtracking, which is linear. I would estimate (from my implementations in Perl) that this would reach 5 million/s including construction of the alignment array.
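For reference, a compact sketch of the Myers/Hyyrö bit-vector distance in C++, under the usual restriction that the pattern fits into one machine word (1–64 code points); this is my own transcription of the published algorithm, not code from any of the libraries mentioned here:

```cpp
// Bit-parallel Levenshtein distance (Myers 1999, in Hyyrö's edit-distance
// variant) for a pattern of at most 64 code points. Sketch only: longer
// patterns would need the blocked version or a plain DP fallback.
#include <cstdint>
#include <unordered_map>
#include <vector>

uint64_t bitparallel_levenshtein(const std::vector<char32_t>& pattern,
                                 const std::vector<char32_t>& text) {
  const size_t m = pattern.size();  // must be 1..64 in this sketch
  std::unordered_map<char32_t, uint64_t> peq;
  for (size_t i = 0; i < m; ++i) peq[pattern[i]] |= uint64_t{1} << i;

  uint64_t vp = ~uint64_t{0};  // vertical positive deltas
  uint64_t vn = 0;             // vertical negative deltas
  uint64_t score = m;
  const uint64_t msb = uint64_t{1} << (m - 1);

  for (char32_t c : text) {
    const auto it = peq.find(c);
    const uint64_t eq = (it == peq.end()) ? 0 : it->second;
    const uint64_t d0 = (((eq & vp) + vp) ^ vp) | eq | vn;
    const uint64_t hp = vn | ~(d0 | vp);
    const uint64_t hn = vp & d0;
    if (hp & msb) ++score;
    if (hn & msb) --score;
    const uint64_t x = (hp << 1) | 1;  // the |1 encodes the first DP column
    vp = (hn << 1) | ~(d0 | x);
    vn = x & d0;
  }
  return score;  // Levenshtein distance between pattern and text
}
```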
At least in theory, we can use a tool written in Python/Perl or any other language and use unix sockets to communicate with it. The visual debugger already uses this technique (C++ <-> Java).
Not tagged with that topic, but https://github.com/seqan/seqan3 could be an option, too. @wollmers, agreeing with most of your comment, but I don't think you can adequately measure line pairs per second – it completely depends on the (average and worst-case) distance of these lines. Also, different libraries behave very differently regarding worst-case vs. best/average case performance (as the latter leaves lots of room for clever optimisations). Do we really need the alignment path itself, though (not just its total length and distance)?
Sure, different algorithms are sensitive to different parameters. Most important is the length of the strings. The "snake" algorithms of Myers/Ukkonen (~1985/86) are sensitive to the distance. ISRI uses "optimised" Ukkonen. But these and diff algorithms measure LCS (maximise matches), whereas Levenshtein minimises distance. Another factor is the size of the alphabet, which for Unicode is theoretically very large; but the alphabet size actually used in a comparison is <= length(string1). The largest influence on speed, besides the algorithm (bit-parallel is fastest), is the implementation, i.e. the skills of the developer.
You can save the time of creating the alignment array and just count edit operations during backtracking. You can implement both methods. It's not complicated. But it does not solve a basic problem of backtracking: …

It's rare, and in OCR it appears more often at high CERs. For training we can IMHO ignore this small and rare inaccuracy.
So basically we can split this into two separate tasks:

1. fix the misleading messages (BCER / BWER reported as if they were CER / WER)
2. compute true CER / WER for checkpoints (at least the final one)

If someone has the knowledge, time and motivation for this, I recommend starting with the first task.
Regarding the second issue/task, BCER->CER: until this is fixed in Tesseract itself, perhaps it could be mitigated in tesstrain with a Python/Perl script?
I concur. Let me add that while spending time in that area of the code, it would make sense to also write out the statistics about the training process in a controlled and re-usable way. For example, plotting train/test error curves is very difficult if you have to parse log output meant for humans to read. At least some CSV file interface. (It would be very cool to have something like a TensorBoard interface, but that's probably out of scope for Tesseract.)
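For the CSV idea, even something this simple would help (a sketch; the function name, file handling and column set are made up for illustration, not an existing Tesseract interface):

```cpp
// Sketch: append one CSV row per checkpoint/evaluation so that error curves
// can be plotted without parsing human-readable log output.
#include <fstream>

void AppendTrainingStats(const char* csv_path, int iteration,
                         double train_bcer, double eval_cer) {
  // Write a header only if the file does not exist yet.
  const bool exists = std::ifstream(csv_path).good();
  std::ofstream out(csv_path, std::ios::app);
  if (!exists) out << "iteration,train_bcer,eval_cer\n";
  out << iteration << ',' << train_bcer << ',' << eval_cer << '\n';
}
```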
Hardly. You'd need to convert the checkpoint to a model file, use that for prediction (of the training or test dataset) with the API, and then align with GT and calculate CER. (My local workaround is to do these steps ex post via makefile/shell means.)
It is already possible to run …
@stweil, what you describe are external means, though. But the question raised by @amitdo was whether there might be some script solution to address the CER/BCER calculation problem from within tesstrain. And the answer is definitely no IMO (since there's no way to directly access the weights to make predictions). That's why I asked for a solution in Tesseract itself. (So I hope you are not suggesting to add script files on top of tesstrain/Makefile now. This would only complicate matters and deflect development efforts. The goal should be to become as comfortable as Calamari/Kraken, training-wise.)
For CER we just need distance / length(GT). Yes, this can be higher than 100%. Distance alone needs only ~50% of the CPU time compared to a full alignment. I know one C implementation that's very fast, even via a Perl binding. It's the implementation used in PostgreSQL, with many precompiler options. The license is to be investigated. For testing correctness I can use the Perl binding, as I have ~200 test cases for all corner cases of LCS and Levenshtein for my own implementations. https://fastapi.metacpan.org/source/MBETHKE/Text-Levenshtein-Flexible-0.09/
What @stweil suggested is quite close to what I wanted to know. Your local workaround seems similar. Is @stweil's/your workaround 'good enough'? Tesseract's C++ developer pool is quite small, and I have a hunch this feature (especially the second task) won't be implemented in Tesseract itself within the next 12 months at least. A workaround written in a scripting language seems more likely to be implemented. Actually, it was already implemented (according to your comment).
My tests all pass for … and …

I wrote extra tests for Unicode using the ranges ASCII, Latin-1, Hindi and Meroitic Hieroglyphs (U+10980 - U+1099F), also testing up to lengths (in characters = code points) of 1028, because the documentation says the length limit is 256. That's in the code.
It has more features than we need. But it works and is fast for quadratic time complexity O(m*n). I did not find the latest code at https://github.com/postgres/postgres. The licence is MIT-like. https://github.com/postgres/postgres/blob/master/COPYRIGHT
https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/levenshtein.c
Thanks. Then I can play around with it. It does not use the prefix/suffix optimisation, which is easy to implement and saves 50% of the time on average; the smaller the differences, the more it saves, and in case of O(m*n) the effect is quadratic. Also it does not catch the special case of …
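The prefix/suffix optimisation really is just a few lines – a sketch, operating on code point vectors (the function name is mine):

```cpp
// Sketch of the common prefix/suffix optimisation: strip equal ends before
// running the O(m*n) distance; the trimmed parts cannot contribute any edits.
#include <vector>

// Trims `a` and `b` in place down to the differing middle parts.
void trim_common_affixes(std::vector<char32_t>& a, std::vector<char32_t>& b) {
  size_t prefix = 0;
  while (prefix < a.size() && prefix < b.size() && a[prefix] == b[prefix]) {
    ++prefix;
  }
  size_t suffix = 0;
  while (suffix < a.size() - prefix && suffix < b.size() - prefix &&
         a[a.size() - 1 - suffix] == b[b.size() - 1 - suffix]) {
    ++suffix;
  }
  a = std::vector<char32_t>(a.begin() + prefix, a.end() - suffix);
  b = std::vector<char32_t>(b.begin() + prefix, b.end() - suffix);
}
// The distance of the trimmed vectors equals the distance of the original
// strings, but the O(m*n) part only sees the differing middle.
```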
@wollmers Please share the scripts that you are using. I am interested to try them for Hindi as well as for a custom IAST traineddata that I am training for Roman transliteration of Sanskrit. Thanks.
I will publish them on GitHub soon (in the next days), because they need some minimal comments to be usable by others. The reason for using Hindi as a test case was my simple first guess to use something outside Latin with heavy use of combining characters. It could also have been Hangul or Arabic, but Arabic is right-to-left and boring to handle in a source code editor.
Do we also need a true WER instead of Bag-WER? IMHO it has lower priority, because the probability of false-positive equal words is very low, and I don't even know what impact this has on training quality.
@wollmers Less priority: I agree, and yes, no potential influence on the training process. But at least for common/short words, both FP and FN are realistic – so if we have a good implementation, why stop at character level? (CER is often used to judge models in comparison to each other, while WER is often associated with a practical interpretation...)
Now I have my own fast version of Levenshtein distance in C, https://github.com/wollmers/Text-Levenshtein-Uni, ported from my bit-vector implementation in Perl, https://github.com/wollmers/Text-Levenshtein-BV. It works via a Perl XS binding and compiles under clang as C++11 and C99. The other implementations named in this thread are either not bug-free or slower. The traditional (simple) implementation of PostgreSQL has complicated code (400 lines) and is 50% slower than a version stripped down (94 lines) to the needed features (fixed edit costs, working on U32 code points). Maybe edlib is still interesting, but it is optimised for long (1 M) DNA sequence alignment and was 700 times slower than my implementation for short strings. AFAIK it calculates the alignment and only afterwards the distance from the alignment. The Perl XS version using it (old version?) didn't pass my tests. Here are the benchmarks via Perl XS:

…

TL::Flex uses the Postgres implementation. l52 means two lines of length 52 characters (something typical for books). The others are "words" of length ~10. TL::Uni is ours. Called from pure C:

…

For some reason, calling from C++ is always a little bit faster:

…
Implementing distance for a list of words (= array of strings) should be straightforward. I have similar code for LCS in C and for XS. This will be slower, as it needs string comparison and hashes, but an average line of text usually has only a few words. I wonder if distance is faster than bag of characters: a bag needs a hash with >32 instructions for each insert or lookup, whereas I use a lookup table (1 instruction) for ASCII, which covers more than 90% in European languages.
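For the word level, one way to avoid repeated string comparisons inside the DP is to hash each distinct token to an integer ID once and then run the same distance routine over the ID sequences – a sketch (helper name and naive whitespace tokenisation are mine, for illustration only):

```cpp
// Sketch: word-level Levenshtein by mapping tokens to integer IDs first, so
// the inner DP loop compares ints instead of strings.
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<uint32_t> tokens_to_ids(
    const std::string& line,
    std::unordered_map<std::string, uint32_t>& dict) {
  std::vector<uint32_t> ids;
  std::istringstream in(line);
  std::string word;
  while (in >> word) {
    // Insert the word with a fresh ID if unseen, otherwise reuse its ID.
    auto inserted = dict.emplace(word, static_cast<uint32_t>(dict.size()));
    ids.push_back(inserted.first->second);
  }
  return ids;
}

// The ID sequences of the ground-truth line and the OCR line can then be fed
// into the same (sequence-agnostic) Levenshtein routine used for characters;
// WER is distance / number of ground-truth tokens.
```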
AFAICS, `lstmtraining` produces two types of figures for measuring the error:

1. the error on the training data (`list.train`): this is shown as `char train=%.3f%%` every 100 iterations and as `Finished! Error rate = %.2f` at the end
2. the error on the evaluation data (`list.eval`): this is shown as `Eval Char error rate=%f` (but also in percent) whenever (IIUC) training error reached a new minimum

Call-tree for 1:
- `tesseract/src/training/unicharset/lstmtrainer.cpp`, lines 945 to 947 in 7a308ed
- `tesseract/src/training/unicharset/lstmtrainer.cpp`, lines 1261 to 1286 in 7a308ed
- `tesseract/src/training/unicharset/lstmtrainer.cpp`, lines 1201 to 1221 in 7a308ed
- `tesseract/src/training/unicharset/lstmtrainer.cpp`, lines 1325 to 1339 in 7a308ed
- `tesseract/src/training/unicharset/lstmtrainer.cpp`, lines 369 to 378 in 7a308ed
Call-tree for 2:
- `tesseract/src/training/unicharset/lstmtester.cpp`, lines 48 to 134 in 7a308ed
- `tesseract/src/training/unicharset/lstmtrainer.h`, lines 151 to 154 in 7a308ed
- `tesseract/src/training/lstmtraining.cpp`, lines 188 to 197 in 7a308ed
- `tesseract/src/training/lstmtraining.cpp`, line 217 in 7a308ed
However, IMHO this is highly misleading, for two reasons:

1. everyone would understand `error` to mean character error rate (CER) – not the bag-of-character error (average character count confusion per line)
2. checkpointing (and thus model selection) is driven by the training error, not by the error on the evaluation data

Or am I missing something very large and obvious?
(I do usually get an order of magnitude larger error rate when measuring via prediction and Levenshtein distance compared to lstmtraining. So at least there's empirical evidence something is not right with lstmtraining's figures. I can provide them when needed here.)