Many bugs in training the legacy engine #3925

Closed
SpaceView opened this issue Sep 22, 2022 · 25 comments

@SpaceView

I doubt anybody has successfully trained custom data with tesseract 5.2.0 or 5.1.0; the latest version I could train with is 5.0.0-alpha-20201224.
Below are some of the bugs I hit when running tesseract 5.2.0 for custom data training. There are too many of them, so I could not finish the whole training in the limited time I have at the moment; the ones below are only a few of those I found, for reference.

// Deletes all samples with zero features marked by KillSample.
void TrainingSampleSet::DeleteDeadSamples() {
  using namespace std::placeholders; // for _1
  auto old_it = samples_.begin();
  for (; old_it < samples_.end(); ++old_it) {
    if (*old_it == nullptr || (*old_it)->class_id() < 0) {
      break;
    }
  }
  auto new_it = old_it;
  for (; old_it < samples_.end(); ++old_it) {
    if (*old_it == nullptr || (*old_it)->class_id() < 0) {
      delete *old_it;
    } else {
      *new_it = *old_it;
      ++new_it;
    }
  }
  // samples_.resize(new_it - samples_.begin() + 1);  // <-- crashes the program when samples_.size() is 0
  samples_.resize(new_it - samples_.begin());
  num_raw_samples_ = samples_.size();
  // Samples must be re-organized now we have deleted a few.
}
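
A minimal sketch of the same compaction pattern, assuming a hypothetical `Sample` struct and a free function `DeleteDead` (neither is a Tesseract name). It only illustrates why the new size is the iterator distance without the extra `+ 1`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a training sample; a negative class_id marks it as dead.
struct Sample {
  int class_id;
};

// Removes null and dead entries in place and returns the number of survivors.
// The vector owns its pointers, so dead samples are deleted here.
std::size_t DeleteDead(std::vector<Sample *> &samples) {
  auto new_it = samples.begin();
  for (auto old_it = samples.begin(); old_it != samples.end(); ++old_it) {
    if (*old_it == nullptr || (*old_it)->class_id < 0) {
      delete *old_it;  // deleting nullptr is a no-op
    } else {
      *new_it = *old_it;
      ++new_it;
    }
  }
  // new_it points one past the last survivor, so the distance is already the count.
  // Adding 1 would keep one stale (possibly already deleted) pointer at the end.
  const auto survivors = static_cast<std::size_t>(new_it - samples.begin());
  assert(survivors <= samples.size());
  samples.resize(survivors);
  return survivors;
}
```

With an empty vector the function returns 0 and leaves the vector empty, which is the case the `+ 1` variant gets wrong.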



/**
 * This routine converts from the old floating point format
 * to the new integer format.
 * @param FloatProtos prototypes in old floating pt format
 * @param target_unicharset the UNICHARSET to use
 * @return New set of training templates in integer format.
 * @note Globals: none
 */
INT_TEMPLATES_STRUCT *Classify::CreateIntTemplates(CLASSES FloatProtos,
                                           const UNICHARSET &target_unicharset) {
  CLASS_TYPE FClass;
  INT_CLASS_STRUCT *IClass;
  int ProtoId;
  int ConfigId;

  auto IntTemplates = new INT_TEMPLATES_STRUCT;

  for (unsigned ClassId = 0; ClassId < target_unicharset.size(); ClassId++) {
    FClass = &(FloatProtos[ClassId]);
    if (FClass->NumProtos == 0 && FClass->NumConfigs == 0 &&
        strcmp(target_unicharset.id_to_unichar(ClassId), " ") != 0) {
      tprintf("Warning: no protos/configs for %s in CreateIntTemplates()\n",
              target_unicharset.id_to_unichar(ClassId));
    }
    assert(UnusedClassIdIn(IntTemplates, ClassId));
    IClass = new INT_CLASS_STRUCT(FClass->NumProtos, FClass->NumConfigs);
    // FontSet fs{FClass->font_set.size()};  // <-- braces push one element into the vector instead of setting its size
    int fsize = FClass->font_set.size();
    FontSet fs(fsize);
    for (unsigned i = 0; i < fs.size(); ++i) {
      fs[i] = FClass->font_set.at(i);
    }
    IClass->font_set_id = this->fontset_table_.push_back(fs);  // <-- see the push_back function below
    AddIntClass(IntTemplates, ClassId, IClass);

    for (ProtoId = 0; ProtoId < FClass->NumProtos; ProtoId++) {
      AddIntProto(IClass);
      ConvertProto(ProtoIn(FClass, ProtoId), ProtoId, IClass);
      AddProtoToProtoPruner(ProtoIn(FClass, ProtoId), ProtoId, IClass,
                            classify_learning_debug_level >= 2);
      AddProtoToClassPruner(ProtoIn(FClass, ProtoId), ClassId, IntTemplates);
    }

    for (ConfigId = 0; ConfigId < FClass->NumConfigs; ConfigId++) {
      AddIntConfig(IClass);
      ConvertConfig(FClass->Configurations[ConfigId], ConfigId, IClass);
    }
  }
  return (IntTemplates);
} /* CreateIntTemplates */
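
The difference between `FontSet fs{n}` and `FontSet fs(n)` can be shown in isolation. This sketch assumes that `FontSet` behaves like `std::vector<int>`; the helper names `MakeWithBraces` and `MakeWithParens` are made up for the illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: FontSet behaves like std::vector<int>.
using FontSet = std::vector<int>;

// Braces select the initializer_list constructor:
// the result has exactly one element, and that element holds the value n.
FontSet MakeWithBraces(int n) {
  return FontSet{n};
}

// Parentheses select the size constructor:
// the result has n value-initialized (zero) elements.
FontSet MakeWithParens(std::size_t n) {
  return FontSet(n);
}
```

For `n == 0` the braced form yields a set of size 1 and the parenthesized form yields an empty set.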

//ref. unicity_table.h
  /// Add an element in the table
  int push_back(T object)  {
    auto idx = get_index(object);
    if (idx == -1) {
      // table_.push_back(object);  // <-- crashes the program: idx becomes 1 when size() is 1, but the index should be 0 for a size of 1
      // idx = size();
      idx = table_.push_back(object);
    }
    return idx;
  }
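
A minimal sketch of the index handling, assuming the table is backed by a `std::vector` (whose `push_back` returns nothing). `MiniUnicityTable` is a hypothetical, simplified class and not the Tesseract implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified table that stores each distinct value once.
template <typename T>
class MiniUnicityTable {
public:
  // Returns the index of object, or -1 if it is not in the table.
  int get_index(const T &object) const {
    auto it = std::find(table_.begin(), table_.end(), object);
    return it == table_.end() ? -1 : static_cast<int>(it - table_.begin());
  }

  // Adds object if it is missing and returns its zero-based index.
  int push_back(const T &object) {
    int idx = get_index(object);
    if (idx == -1) {
      // Take the index before appending: the new element lands at the old size.
      idx = static_cast<int>(table_.size());
      table_.push_back(object);
    }
    return idx;
  }

  // Bounds-checked access; idx must be a value returned by push_back.
  const T &at(int idx) const {
    assert(idx >= 0);
    return table_.at(static_cast<std::size_t>(idx));
  }

private:
  std::vector<T> table_;
};
```

Reading `size()` after the append would return 1 for the first element, which is one past the valid index.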


bool write_set(FILE *f, const FontSet &fs) {
  int size = fs.size();
  // return tesseract::Serialize(f, &size) && tesseract::Serialize(f, &fs[0], size);  // <-- crashes the program when fs.size() is 0
  return tesseract::Serialize(f, &size) && tesseract::Serialize(f, (size ? &fs[0] : nullptr), size);
}
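
The empty-vector case can be sketched without the Tesseract helpers. This example is an assumption-laden illustration: it writes to a `std::ostream` instead of a `FILE *`, and `WriteIntVector` is a hypothetical name, not `tesseract::Serialize`:

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <vector>

// Writes a 32-bit element count followed by the raw elements.
// Returns true if the stream is still in a good state afterwards.
bool WriteIntVector(std::ostream &out, const std::vector<std::int32_t> &values) {
  const auto size = static_cast<std::int32_t>(values.size());
  out.write(reinterpret_cast<const char *>(&size), sizeof(size));
  if (size > 0) {
    // Only touch the element storage when there is something to write.
    // values.data() is also valid for an empty vector, whereas values[0] is not.
    out.write(reinterpret_cast<const char *>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(std::int32_t)));
  }
  return out.good();
}
```

An empty set then serializes to just the 4-byte count.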



/*---------------------------------------------------------------------------*/
// TODO(rays) This is now used only by cntraining. Convert cntraining to use
// the new method or get rid of it entirely.
/**
 * This routine reads training samples from a file and
 * places them into a data structure which organizes the
 * samples by FontName and CharName.  It then returns this
 * data structure.
 * @param file open text file to read samples from
 * @param feature_definitions
 * @param feature_name
 * @param max_samples
 * @param unicharset
 * @param training_samples
 */
void ReadTrainingSamples(const FEATURE_DEFS_STRUCT &feature_definitions, const char *feature_name,
                         int max_samples, UNICHARSET *unicharset, FILE *file,
                         LIST *training_samples) {
  char buffer[2048];
  char unichar[UNICHAR_LEN + 1];
  LABELEDLIST char_sample;
  FEATURE_SET feature_samples;
  uint32_t feature_type = ShortNameToFeatureType(feature_definitions, feature_name);

  // Zero out the font_sample_count for all the classes.
  LIST it = *training_samples;
  iterate(it) {
    char_sample = reinterpret_cast<LABELEDLIST>(it->first_node());
    char_sample->font_sample_count = 0;
  }

  while (fgets(buffer, 2048, file) != nullptr) {
    if (buffer[0] == '\n') {
      continue;
    }

    sscanf(buffer, "%*s %s", unichar);
    if (unicharset != nullptr && !unicharset->contains_unichar(unichar)) {
      unicharset->unichar_insert(unichar);
      if (unicharset->size() > MAX_NUM_CLASSES) {
        tprintf(
            "Error: Size of unicharset in training is "
            "greater than MAX_NUM_CLASSES\n");
        exit(1);
      }
    }
    char_sample = FindList(*training_samples, unichar);
    if (char_sample == nullptr) {
      char_sample = new LABELEDLISTNODE(unichar);
      *training_samples = push(*training_samples, char_sample);
    }
    auto char_desc = ReadCharDescription(feature_definitions, file);
    feature_samples = char_desc->FeatureSets[feature_type];
    if (char_sample->font_sample_count < max_samples || max_samples <= 0) {
      char_sample->List = push(char_sample->List, feature_samples);
      char_sample->SampleCount++;
      char_sample->font_sample_count++;
    } else {
      delete feature_samples;
    }
    for (size_t i = 0; i < char_desc->NumFeatureSets; i++) {
      if (feature_type != i) {
        delete char_desc->FeatureSets[i];
        char_desc->FeatureSets[i] = nullptr;  // <-- newly added; otherwise "delete char_desc;" crashes when the destructor deletes this pointer again
      }
    }
    delete char_desc;
  }
} // ReadTrainingSamples
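
A sketch of the ownership problem behind the duplicate delete. It assumes that the description object deletes every non-null feature set in its destructor; `FeatureSet`, `CharDesc` and `ExtractFeatureSet` are hypothetical stand-ins, not the Tesseract types:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a feature set.
struct FeatureSet {
  int num_features = 0;
};

// Hypothetical stand-in for a character description that owns its feature sets.
struct CharDesc {
  static constexpr std::size_t kNumSets = 4;
  std::array<FeatureSet *, kNumSets> sets{};  // owned pointers, initially null
  ~CharDesc() {
    for (auto *set : sets) {
      delete set;  // deleting nullptr is safe, deleting the same object twice is not
    }
  }
};

// Keeps the feature set at index keep, frees the others, and destroys desc.
// Every slot is cleared before the destructor runs, so nothing is freed twice
// and the returned pointer stays valid for the caller.
FeatureSet *ExtractFeatureSet(CharDesc *desc, std::size_t keep) {
  assert(keep < CharDesc::kNumSets);
  FeatureSet *kept = desc->sets[keep];
  for (std::size_t i = 0; i < CharDesc::kNumSets; ++i) {
    if (i != keep) {
      delete desc->sets[i];
    }
    desc->sets[i] = nullptr;
  }
  delete desc;
  return kept;
}
```

Under that assumption the kept slot has to be cleared as well, otherwise the destructor frees the set that the caller still uses.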

With the above changes, "shapeclustering.exe" and "mftraining.exe" run smoothly, and all training material such as "inttemp" and "pffmtable" is generated correctly.
cntraining.exe currently still crashes, but I don't have any more time to test.

@SpaceView SpaceView changed the title BUGs in training TOO MANY MANY BUGs in training with tesseract 5.2.0 Sep 22, 2022
@stweil
Contributor

stweil commented Sep 22, 2022

Thank you for this detailed report. So you are training a legacy model? That is indeed rarely done, as most people (including myself) typically train LSTM models.

It would help if you could describe the individual steps needed to reproduce the failures. Ideally we should then create unit tests to avoid future regressions.

@amitdo
Collaborator

amitdo commented Sep 22, 2022

Unit testing is not enough, we should do real world testing on thousands of pages to test the layout analysis and the two OCR engines.

BTW, there was a report by @tfmorris about a huge drop in speed and accuracy that occurred between version 3.02 and version 3.03 (and some later versions). Nobody did anything to find out the source of the regression.

I also read a report on a drop in accuracy of the layout analysis that occurred between 3.04 and 4.0. I don't have a reference to that report.

There were also some general reports (without much detail) about a drop in accuracy that occurred between 4.x and 5.0.

@amitdo amitdo changed the title TOO MANY MANY BUGs in training with tesseract 5.2.0 Many bugs in training the legacy engine Sep 22, 2022
@SpaceView
Author

I'm not sure which method is the legacy one and which is not. I work on an industrial application, so I MUST use C++ only. I don't use tesstrain since it seems to work only in a Python environment and cannot be deployed in C++. I also could not find any real step-by-step training guide for the latest versions.
Please let me know if you have different training methods.
My method is as follows:

(1) add path to environment (windows 10)
E:\pkg_ocr\tesseract\tesseract520
 
(2) edit your image with jTessBoxEditor
cd  E:\pkg_ocr\tesstrain\jTessBoxEditor231
train.bat ----> jTessBoxEditor  ---> merge TIFF ---> save it as myfontlab.normal.exp0.tif
 
(3) do the following operation,
tesseract  myfontlab.normal.exp0.tif   myfontlab.normal.exp0   batch.nochop   makebox
tesseract   myfontlab.normal.exp0.tif    myfontlab.normal.exp0   nobatch   box.train
NOTE: you have to adjust the image contrast or brightness if these commands report "empty ..."
 
(4)
unicharset_extractor myfontlab.normal.exp0.box
 
(5)
echo normal 0 0 0 0 0 > font_properties
NOTE: the file name is font_properties (it also works if you use font_properties.txt) and its content is normal 0 0 0 0 0. The word "normal" must be the same word as in the file name "myfontlab.normal.exp0.tif".
 
(6)
shapeclustering -F font_properties -U unicharset myfontlab.normal.exp0.tr
OR
shapeclustering -F font_properties.txt -U unicharset myfontlab.normal.exp0.tr
 
(7)
mftraining  -F font_properties -U unicharset -O train.unicharset myfontlab.normal.exp0.tr
NOTE: this step generates inttemp and pffmtable. If it doesn't work, use the command below:
mftraining -F font_properties.txt -U unicharset -O train.unicharset myfontlab.normal.exp0.tr
 
(8)
cntraining myfontlab.normal.exp0.tr
 
(9)
combine_tessdata normal
 
(10) the generated result is t_7B-normal.txt
tesseract E:\test_images\ocr\t_7B.png  E:\test_images\ocr\t_7B-normal -l normal

@amitdo
Collaborator

amitdo commented Sep 24, 2022

(1)

samples_.resize(new_it - samples_.begin() + 1);

cac116d

(2)

IClass = new INT_CLASS_STRUCT(FClass->NumProtos, FClass->NumConfigs);
FontSet fs{FClass->font_set.size()};

bool write_set(FILE *f, const FontSet &fs) {
int size = fs.size();
return tesseract::Serialize(f, &size) && tesseract::Serialize(f, &fs[0], size);

a7f938d

(3)

table_.push_back(object);
idx = size();

1d3d1fb

@amitdo
Collaborator

amitdo commented Sep 25, 2022

@stweil, @egorpugin,

Can you please see if the suggested changes can be applied?

@stweil
Contributor

stweil commented Sep 25, 2022

I am still trying to reproduce the bugs locally.

@amitdo
Collaborator

amitdo commented Oct 27, 2022

@stweil,

Were you able to reproduce the reported bugs?

@stweil
Contributor

stweil commented Oct 27, 2022

No, not up to now.

@zdenop
Contributor

zdenop commented Nov 24, 2022

I made a miniature reproduction of part of the problem (the crash of shapeclustering) for those who want to dig into it:
i3925_test_case.zip

I also found an old version of tesseract, 3.05.02, which is able to create a shapetable from this example.

The steps for reproducing are quite simple:

tesseract num.ocra.exp0.png num.ocra.exp0 nobatch box.train
unicharset_extractor num.ocra.exp0.box
shapeclustering -F font_properties -U unicharset num.ocra.exp0.tr

@amitdo
Collaborator

amitdo commented Nov 25, 2022

If the training tools for the legacy engine are broken and nobody fixes them in time for the 5.3.0 release, I suggest modifying cmake, sw and autotools to not compile and install the legacy training tools.

@stweil
Contributor

stweil commented Nov 25, 2022

Thank you, @zdenop, for the test code. git bisect finds commit cac116d which caused the regression.

@amitdo
Collaborator

amitdo commented Nov 25, 2022

I already pointed to that commit in #3925 (comment)

stweil added a commit to stweil/tesseract that referenced this issue Nov 25, 2022
Fixes: cac116d ("Replace more PointerVector by std::vector [...]")
Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor

stweil commented Nov 25, 2022

@SpaceView, pull request #3970 fixes the issue in my test. Perhaps you can try it and confirm whether it works for you, too.

@stweil stweil added this to In progress in Tesseract next Nov 25, 2022
amitdo pushed a commit that referenced this issue Nov 30, 2022
Fixes: cac116d ("Replace more PointerVector by std::vector [...]")
Signed-off-by: Stefan Weil <sw@weilnetz.de>

@amitdo
Collaborator

amitdo commented Nov 30, 2022

Fixed in #3970.

@amitdo amitdo closed this as completed Nov 30, 2022
@zdenop
Contributor

zdenop commented Nov 30, 2022

I am afraid this issue is not fully solved. This set of commands works for me with tesseract 3.05.02 (to show what the process should look like):

tesseract num.ocra.exp0.png num.ocra.exp0 nobatch box.train
unicharset_extractor num.ocra.exp0.box
set_unicharset_properties -U unicharset -O num.unicharset --script_dir=langdata/
shapeclustering -F font_properties -U num.unicharset num.ocra.exp0.tr
mftraining -F font_properties -U num.unicharset -O num.unicharset num.ocra.exp0.tr
cntraining num.ocra.exp0.tr
mv inttemp num.inttemp
mv pffmtable num.pffmtable
mv normproto num.normproto
mv shapetable num.shapetable
combine_tessdata num.
mkdir tessdata
mv num.traineddata tessdata
tesseract num.ocra.exp0.png - --psm 7 -l num --tessdata-dir .

However, mftraining from the current code has a problem reading the created shapetable:

Error: Failed to read shape table shapetable
Reading num.ocra.exp0.tr ...
Flat shape table summary: Number of shapes = 10 max unichars = 1 number with multiple unichars = 0

Unfortunately, I do not have time to test the other version mentioned by the reporter.

@amitdo amitdo reopened this Dec 1, 2022
@zdenop
Contributor

zdenop commented Dec 1, 2022

I found some spare time for testing, and here are some observations:

  • tesseract-ocr-w64-setup-v4.1.0-elag2019 and Tesseract-OCR-5.0.0-alpha.20201127 work for me => the problem seems to be related to code modernization
  • here are the outputs from training: i3925_legacy_training_outputs.zip. I realized that the current version of tesseract produces (significantly) different output (num.ocra.exp0.tr) from box training (the first step). It uses different rounding (6 decimal places instead of 8) and a different number type (float instead of integer). Not sure if this is a problem. Anyway, the shapetable is smaller (84b vs 184b)

IMO it would be good to also create a small test case for LSTM training to check whether the output is similar to that of 5.0.0-alpha.

@stweil
Contributor

stweil commented Dec 3, 2022

I'm afraid that the changes 51909d5...36f9131 at least contribute to the regression.

Extract from old num.ocra.exp0.tr:

if 84
 80 82 192
 80 96 192
 80 109 192
 80 123 192

Extract from new num.ocra.exp0.tr:

if 84
 80.000000 82.000000 192.000000
 80.000000 96.000000 192.000000
 80.000000 109.000000 192.000000
 80.000000 123.000000 192.000000

The old code used add_str_double(), the new code uses std::to_string() which obviously gives a different string. In addition, std::to_string() writes a decimal comma instead of a decimal point with a German locale.

Related functions: tesseract::WriteCharDescription (with Type==2) and tesseract::WriteFeatureSet.
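
One locale-independent way to format such values is a string stream pinned to the classic "C" locale. This is only a sketch of the idea, not necessarily what the fix in Tesseract does; `FormatNumber` is a hypothetical helper and the precision of 8 significant digits is an assumption:

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Formats value independently of the global locale,
// with at most 8 significant digits and no trailing zeros.
std::string FormatNumber(double value) {
  std::ostringstream stream;
  stream.imbue(std::locale::classic());  // always use '.' as the decimal point
  stream.precision(8);
  stream << value;  // default float format, comparable to printf's %.8g
  return stream.str();
}
```

With this helper, whole numbers such as 80.0 are written as `80` and fractions such as 0.5 as `0.5`, whatever the global locale is.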

stweil added a commit to stweil/tesseract that referenced this issue Dec 3, 2022
Fixes: 3b07599 ("Replace more STRING by std::string")
Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 11, 2022
mftraining crashed because the returned value was 1 instead of 0
for the first call of UnicityTable::push_back.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 11, 2022
It crashed when running mftraining with fs.size() == 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 11, 2022
It crashed when running mftraining because unicharset_size in file
"inttemp" was written with 8 bytes instead of 4 bytes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 11, 2022
This fixes duplicate delete when running cntraining.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 12, 2022
…ract-ocr#3925)

It is required for mftraining which otherwise writes a wrong shapetable.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 12, 2022
The old code did not work correctly if FClass->font_set.size() was 0.
It created the FontSet fs with size 1 instead of 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 12, 2022
It was triggered by mftraining.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 12, 2022
mftraining crashed if the search did not find anything.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 12, 2022
mftraining crashed because the returned value was 1 instead of 0
for the first call of UnicityTable::push_back.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 12, 2022
It crashed when running mftraining with fs.size() == 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 12, 2022
It crashed when running mftraining because unicharset_size in file
"inttemp" was written with 8 bytes instead of 4 bytes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit to stweil/tesseract that referenced this issue Dec 12, 2022
This fixes duplicate delete when running cntraining.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor

stweil commented Dec 12, 2022

@SpaceView, hopefully the many bugs which you found and reported are fixed by the commits in pull request #3977. Some of those commits are nearly identical to your proposed code changes.
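
One of the commit messages above notes that unicharset_size in "inttemp" was written with 8 bytes instead of 4. A minimal sketch of how a fixed-width type keeps such a field at 4 bytes; it uses a stream instead of the real file code, and `WriteCount32` and `ReadCount32` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <sstream>
#include <string>

// Writes count as a fixed 4-byte field, regardless of the width of size_t.
// Returns false if count does not fit or the stream fails.
bool WriteCount32(std::ostream &out, std::size_t count) {
  if (count > std::numeric_limits<std::uint32_t>::max()) {
    return false;
  }
  const auto value = static_cast<std::uint32_t>(count);
  out.write(reinterpret_cast<const char *>(&value), sizeof(value));
  return out.good();
}

// Reads the 4-byte field back into count.
bool ReadCount32(std::istream &in, std::uint32_t *count) {
  in.read(reinterpret_cast<char *>(count), sizeof(*count));
  return in.good();
}
```

Writing a `size_t` directly would emit 8 bytes on typical 64-bit platforms, so a reader that expects 4 bytes would misread everything that follows.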

stweil added a commit that referenced this issue Dec 13, 2022
It is required for mftraining which otherwise writes a wrong shapetable.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit that referenced this issue Dec 13, 2022
The old code did not work correctly if FClass->font_set.size() was 0.
It created the FontSet fs with size 1 instead of 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit that referenced this issue Dec 13, 2022
It was triggered by mftraining.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit that referenced this issue Dec 13, 2022
mftraining crashed if the search did not find anything.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit that referenced this issue Dec 13, 2022
mftraining crashed because the returned value was 1 instead of 0
for the first call of UnicityTable::push_back.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit that referenced this issue Dec 13, 2022
It crashed when running mftraining with fs.size() == 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit that referenced this issue Dec 13, 2022
It crashed when running mftraining because unicharset_size in file
"inttemp" was written with 8 bytes instead of 4 bytes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit that referenced this issue Dec 13, 2022
This fixes duplicate delete when running cntraining.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@amitdo amitdo closed this as completed Dec 18, 2022
GerHobbelt pushed a commit to GerHobbelt/tesseract that referenced this issue Jan 27, 2023
…ract-ocr#3925)

It is required for mftraining which otherwise writes a wrong shapetable.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

# Conflicts:
#	src/ccutil/helpers.h
GerHobbelt pushed a commit to GerHobbelt/tesseract that referenced this issue Jan 27, 2023
The old code did not work correctly if FClass->font_set.size() was 0.
It created the FontSet fs with size 1 instead of 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
GerHobbelt pushed a commit to GerHobbelt/tesseract that referenced this issue Jan 27, 2023
It was triggered by mftraining.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
GerHobbelt pushed a commit to GerHobbelt/tesseract that referenced this issue Jan 27, 2023
mftraining crashed if the search did not find anything.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
GerHobbelt pushed a commit to GerHobbelt/tesseract that referenced this issue Jan 27, 2023
mftraining crashed because the returned value was 1 instead of 0
for the first call of UnicityTable::push_back.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
GerHobbelt pushed a commit to GerHobbelt/tesseract that referenced this issue Jan 27, 2023
It crashed when running mftraining with fs.size() == 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
GerHobbelt pushed a commit to GerHobbelt/tesseract that referenced this issue Jan 27, 2023
It crashed when running mftraining because unicharset_size in file
"inttemp" was written with 8 bytes instead of 4 bytes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

# Conflicts:
#	src/classify/intproto.cpp
GerHobbelt pushed a commit to GerHobbelt/tesseract that referenced this issue Jan 27, 2023
This fixes duplicate delete when running cntraining.

Signed-off-by: Stefan Weil <sw@weilnetz.de>