
output true CER for checkpoints (at least the final one) #3560

Open
bertsky opened this issue Jun 6, 2021 · 39 comments

@bertsky
Contributor

bertsky commented Jun 6, 2021

AFAICS, lstmtraining produces two types of figures for measuring the error:

  1. bag-of-character training error (on list.train): this is shown as
    • char train=%.3f%% every 100 iterations
    • Finished! Error rate = %.2f at the end
    • first figure in the file name of the checkpoint files
  2. bag-of-character testing error (on list.eval): this is shown as
    • Eval Char error rate=%f (but also in percent) whenever (IIUC) training error reached a new minimum
Call-tree for 1:

  • double char_error = ComputeCharError(truth_labels, ocr_labels);
    double word_error = ComputeWordError(&truth_text, &ocr_text);
    double delta_error = ComputeErrorRates(*targets, char_error, word_error);

  • // Computes a very simple bag of chars char error rate.
    double LSTMTrainer::ComputeCharError(const std::vector<int> &truth_str,
                                         const std::vector<int> &ocr_str) {
      std::vector<int> label_counts;
      label_counts.resize(NumOutputs(), 0);
      int truth_size = 0;
      for (auto ch : truth_str) {
        if (ch != null_char_) {
          ++label_counts[ch];
          ++truth_size;
        }
      }
      for (auto ch : ocr_str) {
        if (ch != null_char_) {
          --label_counts[ch];
        }
      }
      int char_errors = 0;
      for (auto label_count : label_counts) {
        char_errors += abs(label_count);
      }
      if (truth_size == 0) {
        return (char_errors == 0) ? 0.0 : 1.0;
      }
      return static_cast<double>(char_errors) / truth_size;
    }

  • // Computes network errors, and stores the results in the rolling buffers,
    // along with the supplied text_error.
    // Returns the delta error of the current sample (not running average.)
    double LSTMTrainer::ComputeErrorRates(const NetworkIO &deltas, double char_error,
                                          double word_error) {
      UpdateErrorBuffer(ComputeRMSError(deltas), ET_RMS);
      // Delta error is the fraction of timesteps with >0.5 error in the top choice
      // score. If zero, then the top choice characters are guaranteed correct,
      // even when there is residue in the RMS error.
      double delta_error = ComputeWinnerError(deltas);
      UpdateErrorBuffer(delta_error, ET_DELTA);
      UpdateErrorBuffer(word_error, ET_WORD_RECERR);
      UpdateErrorBuffer(char_error, ET_CHAR_ERROR);
      // Skip ratio measures the difference between sample_iteration_ and
      // training_iteration_, which reflects the number of unusable samples,
      // usually due to unencodable truth text, or the text not fitting in the
      // space for the output.
      double skip_count = sample_iteration_ - prev_sample_iteration_;
      UpdateErrorBuffer(skip_count, ET_SKIP_RATIO);
      return delta_error;
    }

  • // Updates the error buffer and corresponding mean of the given type with
    // the new_error.
    void LSTMTrainer::UpdateErrorBuffer(double new_error, ErrorTypes type) {
      int index = training_iteration_ % kRollingBufferSize_;
      error_buffers_[type][index] = new_error;
      // Compute the mean error.
      int mean_count = std::min<int>(training_iteration_ + 1, error_buffers_[type].size());
      double buffer_sum = 0.0;
      for (int i = 0; i < mean_count; ++i) {
        buffer_sum += error_buffers_[type][i];
      }
      double mean = buffer_sum / mean_count;
      // Trim precision to 1/1000 of 1%.
      error_rates_[type] = IntCastRounded(100000.0 * mean) / 1000.0;
    }

  • // Builds a string containing a progress message with current error rates.
    void LSTMTrainer::PrepareLogMsg(std::string &log_msg) const {
      LogIterations("At", log_msg);
      log_msg += ", Mean rms=" + std::to_string(error_rates_[ET_RMS]);
      log_msg += "%, delta=" + std::to_string(error_rates_[ET_DELTA]);
      log_msg += "%, char train=" + std::to_string(error_rates_[ET_CHAR_ERROR]);
      log_msg += "%, word train=" + std::to_string(error_rates_[ET_WORD_RECERR]);
      log_msg += "%, skip ratio=" + std::to_string(error_rates_[ET_SKIP_RATIO]);
      log_msg += "%, ";
    }

Call-tree for 2:

  • // Runs an evaluation asynchronously on the stored data and returns a string
    // describing the results of the previous test.
    std::string LSTMTester::RunEvalAsync(int iteration, const double *training_errors,
                                         const TessdataManager &model_mgr, int training_stage) {
      std::string result;
      if (total_pages_ == 0) {
        result += "No test data at iteration " + std::to_string(iteration);
        return result;
      }
      if (!LockIfNotRunning()) {
        result += "Previous test incomplete, skipping test at iteration " + std::to_string(iteration);
        return result;
      }
      // Save the args.
      std::string prev_result = test_result_;
      test_result_ = "";
      if (training_errors != nullptr) {
        test_iteration_ = iteration;
        test_training_errors_ = training_errors;
        test_model_mgr_ = model_mgr;
        test_training_stage_ = training_stage;
        std::thread t(&LSTMTester::ThreadFunc, this);
        t.detach();
      } else {
        UnlockRunning();
      }
      return prev_result;
    }

    // Runs an evaluation synchronously on the stored data and returns a string
    // describing the results.
    std::string LSTMTester::RunEvalSync(int iteration, const double *training_errors,
                                        const TessdataManager &model_mgr, int training_stage,
                                        int verbosity) {
      LSTMTrainer trainer;
      trainer.InitCharSet(model_mgr);
      TFile fp;
      if (!model_mgr.GetComponent(TESSDATA_LSTM, &fp) || !trainer.DeSerialize(&model_mgr, &fp)) {
        return "Deserialize failed";
      }
      int eval_iteration = 0;
      double char_error = 0.0;
      double word_error = 0.0;
      int error_count = 0;
      while (error_count < total_pages_) {
        const ImageData *trainingdata = test_data_.GetPageBySerial(eval_iteration);
        trainer.SetIteration(++eval_iteration);
        NetworkIO fwd_outputs, targets;
        Trainability result = trainer.PrepareForBackward(trainingdata, &fwd_outputs, &targets);
        if (result != UNENCODABLE) {
          char_error += trainer.NewSingleError(tesseract::ET_CHAR_ERROR);
          word_error += trainer.NewSingleError(tesseract::ET_WORD_RECERR);
          ++error_count;
          if (verbosity > 1 || (verbosity > 0 && result != PERFECT)) {
            tprintf("Truth:%s\n", trainingdata->transcription().c_str());
            std::vector<int> ocr_labels;
            std::vector<int> xcoords;
            trainer.LabelsFromOutputs(fwd_outputs, &ocr_labels, &xcoords);
            std::string ocr_text = trainer.DecodeLabels(ocr_labels);
            tprintf("OCR :%s\n", ocr_text.c_str());
            if (verbosity > 2 || (verbosity > 1 && result != PERFECT)) {
              tprintf("Line Char error rate=%f, Word error rate=%f\n\n",
                      trainer.NewSingleError(tesseract::ET_CHAR_ERROR),
                      trainer.NewSingleError(tesseract::ET_WORD_RECERR));
            }
          }
        }
      }
      char_error *= 100.0 / total_pages_;
      word_error *= 100.0 / total_pages_;
      std::string result;
      result += "At iteration " + std::to_string(iteration);
      result += ", stage " + std::to_string(training_stage);
      result += ", Eval Char error rate=" + std::to_string(char_error);
      result += ", Word error rate=" + std::to_string(word_error);
      return result;
    }

    // Helper thread function for RunEvalAsync.
    // LockIfNotRunning must have returned true before calling ThreadFunc, and
    // it will call UnlockRunning to release the lock after RunEvalSync completes.
    void LSTMTester::ThreadFunc() {
      test_result_ =
          RunEvalSync(test_iteration_, test_training_errors_, test_model_mgr_, test_training_stage_,
                      /*verbosity*/ 0);
      UnlockRunning();
    }

  • // Returns the error that was just calculated by PrepareForBackward.
    double NewSingleError(ErrorTypes type) const {
      return error_buffers_[type][training_iteration() % kRollingBufferSize_];
    }

  • tesseract::LSTMTester tester(static_cast<int64_t>(FLAGS_max_image_MB) * 1048576);
    tesseract::TestCallback tester_callback = nullptr;
    if (!FLAGS_eval_listfile.empty()) {
      using namespace std::placeholders; // for _1, _2, _3...
      if (!tester.LoadAllEvalData(FLAGS_eval_listfile.c_str())) {
        tprintf("Failed to load eval data from: %s\n", FLAGS_eval_listfile.c_str());
        return EXIT_FAILURE;
      }
      tester_callback = std::bind(&tesseract::LSTMTester::RunEvalAsync, &tester, _1, _2, _3, _4);
    }

  • trainer.MaintainCheckpoints(tester_callback, log_str);

However, IMHO this is highly misleading, for two reasons:

  • when you pass a validation set, you expect checkpoint and final error rate to refer to that – not the training error
  • today you expect error to mean character error rate (CER) – not the bag-of-character error (average character count confusion per line)

Or am I missing something very large and obvious?

(I do usually get an order of magnitude larger error rate when measuring via prediction and Levenshtein distance compared to lstmtraining. So at least there's empirical evidence something is not right with lstmtraining's figures. I can provide them when needed here.)


@stweil
Contributor

stweil commented Jun 7, 2021

That's a problem which might have been less important before using tesstrain: the initial training used lines with artificial ground truth. As those lines were long (with similar lengths), CER and bag-of-character error would not differ.

Both numbers will only differ much if the CER depends on the line length. Calculating a weighted mean error might be sufficient to get a CER.

I expect that WER has the same issue.

@bertsky
Contributor Author

bertsky commented Jun 7, 2021

That's a problem which might have been less important before using tesstrain: the initial training used lines with artificial ground truth. As those lines were long (with similar lengths), CER and bag-of-character error would not differ.

Both numbers will only differ much if the CER depends on the line length. Calculating a weighted mean error might be sufficient to get a CER.

It's true that the bag-of-character error rate (BCER) gets more "stable" with line length, and in that sense (only) CER and BCER tend to be more similar on longer lines. But BCER also always systematically underestimates CER, and the two measures diverge (become more and more uncorrelated) the larger the total error is. (Thus, no weighting scheme for shorter lines can make up for this bias.)

And as I said, my empirical data suggest that there's a huge difference. (But I'll still have to compute CER measurements of the training error for a direct comparison.)

BTW, lstmeval is no better than lstmtraining in this regard, as it delegates to the same error metric.

Now, CER vs. BCER is quite a blow, but it's (relatively) easy to fix: just incorporate some C++ implementation of global sequence alignment (Needleman-Wunsch algorithm) and distance measure (Damerau-Levenshtein metric) into Tesseract and use that for the evaluation data in a checkpoint.

The other problem mentioned above, i.e. training error instead of test error, is even bigger I believe: Even the checkpointing itself merely runs on training error minima (since MaintainCheckpoints runs on the trainer instance and merely calls the tester instance for log messages). So we simply don't know which checkpoints (or even which iterations) performed best on the test set. I'm afraid this means that all models must be retrained after implementing proper test error checkpointing / early-stopping.


stweil transferred this issue from tesseract-ocr/tesstrain Sep 14, 2021
@wollmers

@bertsky

Now, CER vs. BCER is quite a blow, but it's (relatively) easy to fix: just incorporate some C++ implementation of global sequence alignment (Needleman-Wunsch algorithm) and distance measure (Damerau-Levenshtein metric) into Tesseract and use that for the evaluation data in a checkpoint.

For avoidance of doubt: CER is usually computed based on Levenshtein distance in the narrow sense, which means with allowed edit operations of insert, delete and substitute each with a cost of 1. Damerau is used as the term for additionally allowing transpositions.

One of the formulas for CER (resp. WER, LER) is:

ErrorRate = ( inserts + deletions + substitutions ) / items(GRT)

It's easy to show that this formula can yield values > 1: e.g. for GRT = 'm' and OCR = 'iii', CER is 3.

If we divide by the average length

ErrorRate = ( inserts + deletions + substitutions ) / (0.5 * ( items(GRT) + items(OCR) ) )

CER is 1.5 in the above example.

Second, I couldn't find a definition for BCER, only one for BWER. The term 'bag' is used in IT for the mathematical term 'multiset'. If that is what's meant, measuring BCER (IMHO) only makes sense for mismatch tables, i.e. how often e.g. the character 'e' is mismatched with 'c' and how much this contributes to CER. Or does it make a difference for the training process to use BCER versus CER?

NB: An end-to-end measure should compare glyphs to graphemes, which should be the same as graphemes to graphemes in case of a well transliterated GRT. The difference between character-based and grapheme-based is negligible for most scripts, but may be important for some Asian scripts. Also, a good global alignment needs to take split ('h' -> 'li') and merge ('li' -> 'h') into account, which can be m:n:

GRT: communicato
OCR: eonnnnnicaio

split/merge alignment:

c o mm  u n i c a t o
e o nnn n n i c a i o

@bertsky
Contributor Author

bertsky commented Sep 15, 2021

@wollmers

For avoidance of doubt: CER is usually computed based on Levenshtein distance in the narrow sense, which means with allowed edit operations of insert, delete and substitute each with a cost of 1. Damerau is used as the term for additionally allowing transpositions.

Sure (but you might want to be able to use weighted metrics later on, say for certain confusions that are semantically more/ir/relevant). This was merely to contrast with simple histogram metrics, which are not based on alignment.

One of the formulas for CER (resp. WER, LER) is:

ErrorRate = ( inserts + deletions + substitutions ) / items(GRT)

It's easy to show that this formula can yield values > 1: e.g. for GRT = 'm' and OCR = 'iii', CER is 3.

If we divide by the average length

ErrorRate = ( inserts + deletions + substitutions ) / (0.5 * ( items(GRT) + items(OCR) ) )

CER is 1.5 in the above example.

The correct denominator is not the real issue here, but please have a look at this comment (I believe the natural choice is the length of the alignment path).

Second, I couldn't find a definition for BCER, only one for BWER. The term 'bag' is used in IT for the mathematical term 'multiset'.

Yes, it's the same beast, only on character level.

If that is what's meant, measuring BCER (IMHO) only makes sense for mismatch tables, i.e. how often e.g. the character 'e' is mismatched with 'c' and how much this contributes to CER.

Not really. For a true confusion/edit table you need the actual alignment.

Or does it make a difference for the training process to use BCER versus CER?

The training process itself does not use either measure, but the CTC error (which gives a correct gradient but is not interpretable).

For checkpointing and evaluation, usually CER on the test set is used, which is very different to BCER on the train set (see above).

NB: An end-to-end measure should compare glyphs to graphemes, which should be the same as graphemes to graphemes in case of a well transliterated GRT. The difference between character-based and grapheme-based is negligible for most scripts, but may be important for some Asian scripts. Also, a good global alignment needs to take split ('h' -> 'li') and merge ('li' -> 'h') into account, which can be m:n:

True. You can get closer to graphemes if you glue codepoints to grapheme clusters (before or after alignment) and get the edit distance of those sequences instead. The quality of the alignment itself is another minor issue (but the best global path is usually unique except left/right preference).

But let's first try to get the basics right for Tesseract (to catch up with everyone else OCR).

@bertsky
Contributor Author

bertsky commented Nov 11, 2021

No one seems to care that their trained models are selected from suboptimal checkpoints and reported with way-too-optimistic error rates?

As long as lstmtraining and lstmeval report figures the way they do, I consider this a bug (not a feature request). (Changing the messages should not be too difficult for a start, and would get the user's attention where it belongs.)

@stweil
Contributor

stweil commented Nov 11, 2021

I don't think that we need changes for release 5.0.0. After all, the current models are not so bad, it is already possible to train even better models, and the required changes would delay the release further.

It is sufficient to enhance the current training process to write an additional new model after each epoch and use that model to calculate normal CER / WER values. That's what other OCR software like Calamari or Kraken does. And it is already possible to do that now by using an extra script which watches the latest checkpoint. I don't expect that those checkpoints at the end of an epoch will be much better, but we'll see.

Any error rate is only an indicator. It is only an objective number for a certain validation set. As long as people don't expect that the error rates in Tesseract training are the same which they will get with their real images, they are good enough to show the training progress.

There are other training aspects which I consider more important. One is continuation of an interrupted training. That should start with the line following the last one which was used for training. I'm afraid that it currently starts with the first line, so a training which was interrupted differs from a training which runs without any interrupt. And the second thing is performance. Thanks to float, training is already much faster than before, but it still takes too much time.

And for later releases it would also be useful to support models from other OCR engines.


@bertsky
Contributor Author

bertsky commented Nov 12, 2021

@stweil

the current models are not so bad,

they could be even better if an appropriate checkpoint was selected – which is a huge waste

it is already possible to train even better models,

yes, but only if you know about the problem, and if you rely on external tools for CER calculation

and the required changes would delay the release further.

as I said in my last post, fixing the messaging would already be a good start, and won't be much effort at all – i.e. instead of claiming Finished! Error rate = %f the tool could state Finished! Selected model with minimal training error rate (BCER) = %f

Also, instead of the occasional char train=%f, word train=%f, ... Eval Char error rate=%f message we could say BCER train=%f, BWER train=%f, ... BCER eval=%f.

It is sufficient to enhance the current training process to write an additional new model after each epoch and use that model to calculate normal CER / WER values. That's what other OCR software like Calamari or Kraken does. And it is already possible to do that now by using an extra script which watches the latest checkpoint. I don't expect that those checkpoints at the end of an epoch will be much better, but we'll see.

Yes, one can already do --stop_training for each checkpoint (as in tesstrain's traineddata target). But then you'd still have to orchestrate prediction of list.eval on each checkpoint model, and then evaluation (with true CER/WER) using an external tool. That's very different from modern engines like Calamari or Kraken, which already encapsulate that (and offer test-error-based early stopping etc.). I do put up with that procedure for Tesseract regularly, and believe me, the CER results usually vary widely across checkpoints.

Any error rate is only an indicator. It is only an objective number for a certain validation set. As long as people don't expect that the error rates in Tesseract training are the same which they will get with their real images, they are good enough to show the training progress.

Of course it is only an indicator. But there are well established standards here, and Tesseract not only ignores them, it's not even honest about that. (And again, it's not just about objective measurement, but also model selection.)

@wollmers

@stweil

I don't think that we need changes for release 5.0.0. After all, the current models are not so bad, it is already possible to train even better models, and the required changes would delay the release further.

I agree and vote for it. Better sooner than later.

Any error rate is only an indicator. It is only an objective number for a certain validation set. As long as people don't expect that the error rates in Tesseract training are the same which they will get with their real images, they are good enough to show the training progress.

IMHO, as an observation, the training data itself has an important influence. The usual method of splitting a corpus into training and validation is a sort of incest. It is also self-fulfilling, because rare glyphs (e.g. the capital letters X or Y in German; Y appears in AustrianNewspapers with p = 29/2200309 ~ 0.00001) contribute little to the overall CER if wrongly recognised. My theories are that the models could be improved by emphasising rare glyphs a little more in the training sets, using more different fonts, and partitioning the corpus for historic OCR into periods of 50 years, which can be combined into models spanning 100 (or 150) years. I already do this for my dictionaries: 1750-99, 1800-49, etc.

@wrznr

wrznr commented Nov 15, 2021

@stweil wrote:

Any error rate is only an indicator. It is only an objective number for a certain validation set.

@bertsky wrote:

Of course it is only an indicator. But there are well established standards here, and Tesseract not only ignores them, it's not even honest about that.

My problem is that I need a way to report CER in scientific contexts. And if Tesseract's self-evaluation is not to be trusted, then I consider this a severe problem (i.e. a major bug).

@bertsky wrote:

yes, but only if you know about the problem, and if you rely on external tools for CER calculation

Agreed! IMHO, we cannot rely on people's awareness of this problem. Which in turn may lead to incorrect training results getting published or used in project proposals etc.

So in a first step, the status messages should be adjusted so that they correctly state what they actually mean, to prevent their misinterpretation (and this should naturally be done before a major release); in a second step, we should discuss how we can apply scientifically sound evaluation metrics to Tesseract's training procedure.

@stweil
Contributor

stweil commented Nov 15, 2021

@wrznr, the problem is that people have now been waiting for a new stable release for two years. The current code in main is much better than release 4.1.1. So any change which delays the new release further must have a very good reason.

It would be possible to add a final text which is printed when lstmtraining terminates, either in lstmtraining or in the tesstrain Makefile. Adding it in tesstrain would work with any version of lstmtraining. Would that be sufficient for now? Which text would you propose?

stweil added a commit to stweil/tesseract that referenced this issue Nov 15, 2021
The old messages could wrongly be interpreted as CER / WER values,
but Tesseract training currently uses simple bag of characters /
bag of words error rates (see LSTMTrainer::ComputeCharError,
LSTMTrainer::ComputeWordError).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Nov 17, 2021
stweil added a commit to stweil/tesseract that referenced this issue Nov 17, 2021
amitdo pushed a commit that referenced this issue Nov 17, 2021
@stweil
Contributor

stweil commented Dec 11, 2021

Maybe the RapidFuzz C++ library can be used. It supports several different metrics, and the license is compatible with Tesseract.

@amitdo
Collaborator

amitdo commented Dec 12, 2021

Levenshtein language:c++ stars:>50 -license:gpl

Edit: Indeed, RapidFuzz seems to be the best option.

@wollmers

What exactly are the requirements for a Levenshtein library?

Must have:

  • appropriate license
  • Unicode/UTF-8
  • C/C++

For CER:

  • Levenshtein alignment (needs the 4 edit operations)

For normalized CER:

  • same as 1 - accuracy (easier and faster to calculate)

How often is it called on e.g. a journal page with 5,000 characters, and how long are the strings (line average ~60)?

Not needed IMHO (feature bloat is slow):

  • weighted costs
  • pluggable comparison
  • threshold for max distance

With state-of-the-art bit-parallel algorithms (Myers 1999, Hyyrö 2004-2006) I can get, for length ~10 in UTF-8, 10 million comparisons per second (distance or length of LCS), and 35 million for 8-bit. This is 700 times faster than edlib (which also uses bit-parallelism).

Alignment adds the complexity of backtracking, which is linear. I would estimate (from my implementations in Perl) that this would reach 5 million/s including construction of the alignment array.

@amitdo
Collaborator

amitdo commented Dec 13, 2021

  • C/C++

At least in theory, we can use a tool written in Python/Perl or any other language and use unix sockets to communicate with it. The visual debugger already uses this technique (C++<->Java).

@bertsky
Contributor Author

bertsky commented Dec 13, 2021

Levenshtein language:c++ stars:>50 -license:gpl

Edit: Indeed, RapidFuzz seems to be the best option.

Not tagged with that topic, but https://github.com/seqan/seqan3 could be an option, too.

@wollmers, agreeing with most of your comment, but I don't think you can adequately measure line pairs per second – it completely depends on the (average and worst-case) distance of these lines. Also, different libraries behave very differently regarding worst-case vs. best/average-case performance (as the latter leaves lots of room for clever optimisations).

Do we really need the alignment path itself, though (not just its total length and distance)?

@wollmers

@bertsky

@wollmers, agreeing with most of your comment, but I don't think you can adequately measure line pairs per second – it completely depends on the (average and worst-case) distance of these lines. Also, different libraries behave very differently regarding worst-case vs. best/average-case performance (as the latter leaves lots of room for clever optimisations).

Sure, different algorithms are sensitive to different parameters. The most important one is the length of the strings. The "snake" algorithms of Myers/Ukkonen (~1985/86) are sensitive to the distance. ISRI uses an "optimised" Ukkonen. But this and diff algorithms measure the LCS (they maximise matches), whereas Levenshtein minimises distance. Another factor is the size of the alphabet, which for Unicode is theoretically very large. But the alphabet size actually used in a comparison is <= length(string1).

Besides the algorithm (bit-parallel is fastest), the largest influence on speed is the implementation, i.e. the skills of the developer.

Do we really need the alignment path itself, though (not just its total length and distance)?

You can save the time of creating the alignment array and just count edit operations during backtracking. You can implement both methods. It's not complicated.

But it does not solve a basic problem of backtracking:

LCS alignment (1 match, distance 4):

ABC--
--CDA
     
Lev alignment (0 matches, distance 3):

ABC
CDA

This is rare, and in OCR it appears mostly at high CERs. For training, we can IMHO ignore this small and rare inaccuracy.

@amitdo
Collaborator

amitdo commented Dec 19, 2021

  • when you pass a validation set, you expect checkpoint and final error rate to refer to that – not the training error
  • today you expect error to mean character error rate (CER) – not the bag-of-character error (average character count confusion per line)

So basically we can split this into two separate tasks:

  1. Make Tesseract use the eval set to select checkpoints and to determine when to finish training.
  2. Use an external library to measure CER instead of BCER.

If someone has the knowledge, time and motivation for this, I recommend to start with the first task.

@amitdo
Collaborator

amitdo commented Dec 19, 2021

Regarding the second issue/task, BCER->CER. Until this issue is fixed in Tesseract itself, perhaps the issue could be mitigated in tesstrain with a Python/Perl script?

@bertsky
Contributor Author

bertsky commented Dec 19, 2021

So basically we can split this into two separate tasks:

  1. Make Tesseract use the eval set to select checkpoints and to determine when to finish training.
  2. Use an external library to measure CER instead of BCER.

If someone has the knowledge, time and motivation for this, I recommend to start with the first task.

I concur.

Let me add that while spending time in that area of the code, it would make sense to also write out the statistics about the training process in a controlled and re-usable way. For example, plotting train/test error curves is very difficult if you have to parse log output meant for humans to read. At least some CSV file interface. (It would be very cool to have something like a TensorBoard interface, but that's probably out of scope for Tesseract.)

Regarding the second issue/task, BCER->CER. Until this issue is fixed in Tesseract itself, perhaps the issue could be mitigated in tesstrain with a Python/Perl script?

Hardly. You'd need to convert the checkpoint to a model file, use that for prediction (of the training or test dataset) with the API, and then align with GT and calculate CER.

(My local workaround is to do these steps ex-post via makefile/shell means.)

@stweil
Contributor

stweil commented Dec 19, 2021

It is already possible to run for n in $(seq 1 100); do make training EPOCHS=$n [...] && calculate_cer && break; done in tesstrain. All you have to do is write a script calculate_cer which converts the final checkpoint of each epoch into a traineddata model, use that model to run OCR on the validation data set, calculate the CER and return 0 as soon as it is below the desired margin. So it is possible to have a training process which is very similar to other trainings (calamari, kraken, ...) just by adding a small script and using one of the existing tools which measure CER for a set of GT and OCR lines.

@bertsky
Contributor Author

bertsky commented Dec 19, 2021

@stweil, what you describe are external means, though. But the question raised by @amitdo was whether there might be some script solution to address the CER/BCER calculation problem from within tesstrain. And the answer is definitely no, IMO (since there's no way to directly access the weights to make predictions). That's why I asked for a solution in Tesseract itself.

(So I hope you are not suggesting to add script files on top of tesstrain/Makefile now. This would only complicate matters and deflect development efforts. The goal should be to become as comfortable as Calamari/Kraken, training-wise.)

@wollmers

@amitdo

Regarding the second issue/task, BCER->CER. Until this issue is fixed in Tesseract itself, perhaps the issue could be mitigated in tesstrain with a Python/Perl script?

For CER we just need distance / length(GT). Yes, this can be higher than 100%. Computing only the distance needs ~50% of the CPU time compared to a full alignment. I know one C implementation that is very fast, even via a Perl binding: the one used in PostgreSQL, with many precompiler options. The license still needs to be investigated. For testing correctness I can use the Perl binding, as I have ~200 test cases covering all corner cases of LCS and Levenshtein for my own implementations.

https://fastapi.metacpan.org/source/MBETHKE/Text-Levenshtein-Flexible-0.09/
https://fastapi.metacpan.org/source/MBETHKE/Text-Levenshtein-Flexible-0.09/levenshtein_internal.c
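As a minimal illustration of that formula (a plain one-row O(m*n) DP in Python, not the PostgreSQL code), including how insertions push CER above 100%:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain O(m*n) dynamic-programming edit distance, keeping only one row."""
    if a == b:
        return 0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(gt: str, ocr: str) -> float:
    """Character error rate = distance / length(GT); > 1.0 is possible."""
    return levenshtein(gt, ocr) / len(gt) if gt else float(ocr != "")

print(cer("abc", "abc"))     # 0.0
print(cer("ab", "axbyczd"))  # 2.5 -- five insertions against two GT characters
```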

@amitdo
Collaborator

amitdo commented Dec 20, 2021

@bertsky,

What @stweil suggested is quite close to what I wanted to know. Your local workaround seems similar.

Is @stweil's/your workaround 'good enough'?

Tesseract's C++ developers pool is quite small, and I have a hunch this feature (especially the second task) won't be implemented in Tesseract itself in the next 12 months at least.

A workaround written in a scripting language seems more likely to be implemented. Actually, it was already implemented (according to your comment).

@amitdo
Collaborator

amitdo commented Dec 20, 2021

@wollmers

wollmers commented Dec 21, 2021

@amitdo

All my tests pass for

use Text::Levenshtein::Flexible qw( levenshtein);

and

    my $distance = levenshtein($A,$B);

    is(
      Levenshtein::Simple->distance(\@a,\@b),
      $distance,

      "[$a] m: $m, [$b] n: $n -> " . $distance
    );

I wrote extra tests for Unicode using the ranges ASCII, Latin-1, Hindi and Meroitic Hieroglyphs (U+10980 - U+1099F), and also tested lengths (in characters = code points) up to 1028, because the documentation says the length limit is 256.

That limit is in the code:

#define MAX_LEVENSHTEIN_STRLEN	16384

It has more features than we need. But it works and is fast for an algorithm with quadratic time complexity O(m*n).

I did not find the latest code at https://github.com/postgres/postgres.

The licence is MIT-like. https://github.com/postgres/postgres/blob/master/COPYRIGHT

@amitdo
Collaborator

amitdo commented Dec 21, 2021

@wollmers

https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/levenshtein.c

Thanks. Then I can play around with it. It does not use the prefix/suffix optimisation, which is easy to implement and saves ~50% of the time on average: the smaller the differences, the more it saves, and with O(m*n) the effect is quadratic. It also does not catch the special case of string1 eq string2 => distance = 0.
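A sketch of those two shortcuts, in Python for brevity (the function names are mine, not from any of the libraries above):

```python
def levenshtein_core(a: str, b: str) -> int:
    """Plain one-row DP distance, O(m*n)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def levenshtein(a: str, b: str) -> int:
    """Trim the common prefix and suffix before running the quadratic DP.

    Equal strings never reach the DP at all, and near-equal strings
    (the common case when comparing OCR output to GT) shrink dramatically."""
    if a == b:                      # the special case the PostgreSQL code misses
        return 0
    i = 0                           # strip common prefix
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    a, b = a[i:], b[i:]
    j = 0                           # strip common suffix
    while j < len(a) and j < len(b) and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    if j:
        a, b = a[:-j], b[:-j]
    if not a:
        return len(b)
    if not b:
        return len(a)
    return levenshtein_core(a, b)
```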

@Shreeshrii
Collaborator

@wollmers Please share the scripts that you are using. I am interested to try them for Hindi as well as for a custom IAST traineddata that I am training for Roman transliteration of Sanskrit. Thanks.

@wollmers

@Shreeshrii

@wollmers Please share the scripts that you are using. I am interested to try them for Hindi as well as for a custom IAST traineddata that I am training for Roman transliteration of Sanskrit. Thanks.

I will publish them on GitHub soon (in the next few days), because they need some minimal comments to be usable by others.

The reason for using Hindi as a test case was a simple first guess: use something outside Latin with heavy use of combining characters. It could also have been Hangul or Arabic, but Arabic is right-to-left and tedious to handle in a source code editor.

@wollmers

@bertsky

Do we also need a true WER instead of bag-WER? IMHO it has lower priority, because the probability of false-positive equal words is very low, and I don't even know what impact this has on training quality.

@bertsky
Contributor Author

bertsky commented Feb 20, 2022

@wollmers lower priority, I agree, and yes, no potential influence on the training process. But at least for common/short words, both FP and FN are realistic – so if we have a good implementation, why stop at the character level? (CER is often used to judge models in comparison to each other, while WER is more often given a practical interpretation...)
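For what it's worth, a true WER is just the same edit distance computed over word sequences instead of characters; a quick Python sketch (names are mine) also shows the false positive a bag metric produces on reordered words:

```python
from collections import Counter

def edit_distance(a, b) -> int:
    """Levenshtein distance over arbitrary sequences (here: lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def wer(gt: str, ocr: str) -> float:
    """True WER: word-level edit distance over GT length in words."""
    gt_words, ocr_words = gt.split(), ocr.split()
    if not gt_words:
        return float(bool(ocr_words))
    return edit_distance(gt_words, ocr_words) / len(gt_words)

def bag_wer(gt: str, ocr: str) -> float:
    """Order-insensitive bag-of-words error, for contrast."""
    gt_words, ocr_words = gt.split(), ocr.split()
    if not gt_words:
        return float(bool(ocr_words))
    diff = Counter(gt_words)
    diff.subtract(Counter(ocr_words))
    return sum(abs(n) for n in diff.values()) / len(gt_words)

# Reordered words: the bag sees a perfect match, the true WER does not.
print(bag_wer("the cat sat", "sat cat the"))  # 0.0
print(wer("the cat sat", "sat cat the"))      # 0.666... (two substitutions)
```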

@wollmers

wollmers commented Mar 10, 2022

Now I have my own fast version for Levenshtein distance in C https://github.com/wollmers/Text-Levenshtein-Uni ported from my bit-vector implementation in Perl https://github.com/wollmers/Text-Levenshtein-BV.

It works via a Perl XS binding and compiles under clang as C++11 and C99.

The other implementations named in this thread are either not bug-free or are slower. The traditional (simple) implementation in PostgreSQL has complicated code (400 lines) and is 50% slower than a version stripped down (94 lines) to the features we need (fixed edit costs, working on U32 code points).

Maybe edlib is still interesting, but it is optimised for long (1 M) DNA sequence alignments and was 700 times slower than my implementation for short strings. AFAIK it calculates the alignment and then derives the distance from it. The Perl XS version using it (an old version?) didn't pass my tests.

Here the benchmarks via Perl XS:

Text-Levenshtein-Uni$ perl xt/50_distance_bench.t
                    Rate TL::Flex_l52 TL::Uni_l52 TL::Flex_uni TL::Flex_ascii TL::Uni_uni TL::Uni_ascii
TL::Flex_l52    225467/s           --        -90%         -91%           -94%        -97%          -97%
TL::Uni_l52    2184532/s         869%          --         -15%           -41%        -73%          -74%
TL::Flex_uni   2572440/s        1041%         18%           --           -30%        -68%          -69%
TL::Flex_ascii 3674916/s        1530%         68%          43%             --        -54%          -56%
TL::Uni_uni    8065969/s        3477%        269%         214%           119%          --           -4%
TL::Uni_ascii  8385413/s        3619%        284%         226%           128%          4%            --

TL::Flex uses the PostgreSQL implementation. l52 means two lines of 52 characters each (typical for books); the others are "words" of length ~10. TL::Uni is ours.

Called from pure C:

Text-Levenshtein-Uni/src$ ./levtest
[dist_utf8_ucs] distance: 4 expect: 4
[dist_hybrid]   distance: 4 expect: 4
[dist_mixed]    distance: 4 expect: 4
[dist_simple]   distance: 4 expect: 4
[dist_utf8_ucs] iters: 20 M Elapsed: 1.888164 s Rate: 10.6 (M/sec) 4
[dist_hybrid]   iters: 20 M Elapsed: 1.016698 s Rate: 19.7 (M/sec) 4
[dist_mixed]    iters: 20 M Elapsed: 1.027602 s Rate: 19.5 (M/sec) 4
[dist_simple]   iters: 20 M Elapsed: 2.977173 s Rate: 6.7 (M/sec) 4
Total: 6.909637 seconds

For some reason, calling it from C++ is always a little bit faster:

Text-Levenshtein-Uni/src$ ./levtestcpp
[dist_utf8_ucs] distance: 4 expect: 4
[dist_hybrid]   distance: 4 expect: 4
[dist_mixed]    distance: 4 expect: 4
[dist_simple]   distance: 4 expect: 4
[dist_utf8_ucs] iters: 20 M Elapsed: 1.888937 s Rate: 10.6 (M/sec) 4 
[dist_hybrid]   iters: 20 M Elapsed: 0.989074 s Rate: 20.2 (M/sec) 4
[dist_hybrid]   iters: 20 M Elapsed: 0.995838 s Rate: 20.1 (M/sec) 4
[dist_mixed]    iters: 20 M Elapsed: 1.037983 s Rate: 19.3 (M/sec) 4
[dist_simple]   iters: 20 M Elapsed: 2.336609 s Rate: 8.6 (M/sec) 4
Total: 7.248441 seconds

mixed is the way to go at the moment, as sequences of length >= 64 (or 32 on 32-bit architectures) still fail my tests with the bit-vector implementation for a reason I have not yet debugged. They pass in my Perl version, which took me some days to get correct, because the author of the papers is unclear and ambiguous about handling integer overflows (the carry bit); IMHO he never tested it. I run ~900 tests covering all the possible problem cases.
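For reference, here is a minimal Python sketch of the bit-vector distance (Myers' bit-parallel algorithm, in a common textbook formulation; names are mine, not wollmers' code). The (eq & pv) + pv addition is exactly the carry/overflow step discussed above. Python's big integers hide the 32/64-bit word limit, so this only illustrates the logic, not the C port:

```python
def myers_distance(a: str, b: str) -> int:
    """Bit-parallel Levenshtein distance; in C, len(a) is capped at the word size."""
    m = len(a)
    if m == 0:
        return len(b)
    mask = (1 << m) - 1
    # peq[c]: bitmask of the positions where character c occurs in a
    peq = {}
    for i, c in enumerate(a):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = mask, 0, m      # positive/negative vertical deltas
    high = 1 << (m - 1)
    for c in b:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq   # carry propagation ("overflow") step
        ph = mv | (~(xh | pv) & mask)       # positive horizontal delta
        mh = pv & xh                        # negative horizontal delta
        if ph & high:
            score += 1
        if mh & high:
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
    return score
```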

Implementing distance for a list of words (= array of strings) should be straightforward. I have similar code for LCS in C and for XS. It will be slower, as it needs string comparisons and hashes, but an average line of text has only a few words.

I wonder if distance is faster than bag of characters. A bag needs a hash, with >32 instructions for each insert or lookup. I use a lookup table (1 instruction) for ASCII, which covers more than 90% of the characters in European languages.
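To make the metric difference concrete, here is a small Python sketch contrasting a bag-of-characters error (in the spirit of LSTMTrainer::ComputeCharError, not a port of it) with the true edit distance; a mere reordering is invisible to the bag:

```python
from collections import Counter

def bag_char_errors(truth: str, ocr: str) -> int:
    """Bag-of-characters error: per-character surpluses/deficits, order ignored."""
    counts = Counter(truth)
    counts.subtract(Counter(ocr))
    return sum(abs(n) for n in counts.values())

def levenshtein(a: str, b: str) -> int:
    """Plain one-row DP edit distance, for comparison."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

# A permutation is invisible to the bag metric but not to the true distance:
print(bag_char_errors("abc", "cba"))  # 0 -- BCER reports a perfect match
print(levenshtein("abc", "cba"))      # 2 -- true CER sees two substitutions
```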

@amitdo

This comment was marked as resolved.

@stweil

This comment was marked as resolved.
