
load pretrained model assert failed #106

Open
striversist opened this issue Nov 7, 2016 · 21 comments

@striversist

striversist commented Nov 7, 2016

Loading a pretrained model to retrain on new samples causes an assert failure in Codec::encode, but when training from scratch this problem does not happen.
See related issue #83.

After digging into the code a little, I found this clue in main1 of clstmocrtrain.cc:

if (load_name != "") {
    clstm.load(load_name);
  } else {
    Codec codec;
    trainingset.getCodec(codec);
    print("got", codec.size(), "classes");

If training from scratch, load_name is empty, so execution goes to trainingset.getCodec(codec);. Inside that function the chain codec.build(gtnames, charsep); -> Codec::set is executed, so all the characters occurring in the training samples are inserted into the encoder map.

If a pretrained model is loaded to retrain on new samples, load_name is not empty, and clstm.load(load_name); loads the pretrained codec into the encoder map. Later, in Codec::encode, if a new sample string contains a character that is not in the pretrained encoder map, assert(encoder->count(c) > 0); fails.
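
To make the failure path concrete, here is a tiny self-contained sketch (not the actual clstm code, just the logic described above): the encoder map is fixed by whatever codec was loaded with the model, so any character in the new ground truth that the old model never saw trips the assert.

#include <cassert>
#include <map>
#include <string>
#include <vector>

// Minimal stand-in for the behaviour of Codec::encode described above;
// the real implementation lives in clstm.cc.
struct MiniCodec {
  std::map<wchar_t, int> encoder;  // filled from the pretrained model's codec

  std::vector<int> encode(const std::wstring &s) {
    std::vector<int> classes;
    for (wchar_t c : s) {
      assert(encoder.count(c) > 0);  // fires for chars the old model never saw
      classes.push_back(encoder[c]);
    }
    return classes;
  }
};

int main() {
  MiniCodec codec;
  codec.encoder = {{L'a', 1}, {L'b', 2}};  // pretend this came from clstm.load()
  codec.encode(L"ab");   // fine: both chars are in the pretrained codec
  codec.encode(L"abc");  // asserts: 'c' only exists in the new training samples
}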

Hope contributors fix this problem ASAP.

@wanghaisheng

+1

@striversist
Author

striversist commented Nov 7, 2016

The following change temporarily fixes the assert failure. @wanghaisheng

if (load_name != "") {
    clstm.load(load_name);
    trainingset.getCodec(clstm.net->codec);    // Add this line
  } else {
    Codec codec;
    trainingset.getCodec(codec);
    print("got", codec.size(), "classes");

I don't know whether there are other side effects.

@wanghaisheng

I looked into the source code you refer to. It seems that after loading an existing model we should first get the codec vector for that model and the codec for the training set, then combine the two vectors into one and apply it with Codec::set (https://github.com/tmbdev/clstm/blob/master/clstm.cc).
At first I wanted to train for Chinese based on the Japanese model. I have not tried it yet, but my training data and the data used for the existing Japanese model are definitely not the same.
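
A rough sketch of that merge, assuming clstm's types live in the ocropus namespace, that Codec keeps its codepoint table in a public vector<int> member named codec, and that Codec::set rebuilds the encoder map; all of these names should be checked against clstm.h/clstm.cc.

#include <set>
#include <vector>
#include "clstm.h"

// Merge the pretrained model's codec with the training set's codec, keeping
// the old class order so existing output indices stay valid, then append any
// new codepoints at the end.
ocropus::Codec merge_codecs(const ocropus::Codec &model_codec,
                            const ocropus::Codec &dataset_codec) {
  std::vector<int> merged = model_codec.codec;       // preserve pretrained order
  std::set<int> seen(merged.begin(), merged.end());
  for (int cp : dataset_codec.codec)
    if (seen.insert(cp).second)                      // only codepoints not seen yet
      merged.push_back(cp);
  ocropus::Codec out;
  out.set(merged);                                   // rebuilds the encoder map
  return out;
}

Even then, appending classes to the codec alone would not be sufficient, because the loaded network's output layer was sized for the old codec (see the discussion below).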

@kba added the bug label Nov 7, 2016
@amitdo
Contributor

amitdo commented Nov 7, 2016

Thanks for reporting.

I don't think it's a good solution to read the whole dataset on every load.

@amitdo
Contributor

amitdo commented Nov 7, 2016

The best solution IMHO would be to encode all ~128,000 Unicode codepoints at the first load.

Update: It's not a good idea, see comments below.

@striversist
Author

@amitdo In my experience, the more Unicode codepoints you put in the codec, the slower the training process becomes.
So I don't think that's the best solution, no offence.

@amitdo
Contributor

amitdo commented Nov 7, 2016

I don't suggest actually doing training on all those chars...

@kba changed the title from "load pretrainned model assert failed" to "load pretrained model assert failed" Nov 7, 2016
@striversist
Author

striversist commented Nov 7, 2016

OK, looking forward to your solution.

@amitdo
Contributor

amitdo commented Nov 7, 2016

In the meantime, don't use your temporary solution. I believe it will mess up your model.

@jbaiter

jbaiter commented Nov 7, 2016

Hope contributors fix this problem ASAP.

I don't think this will be easy. The codec determines the size of the network's layers, i.e. there will be weights/connections in the network for each of the characters in the codec. To add new characters not in the original training data during re-training, you would have to modify the structure of the network before training, which is pretty complicated: you'd have to add extra dimensions to a lot of the weight/bias matrices. Is this what you're suggesting, @amitdo?
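
To illustrate what that structural change would involve, here is a toy Eigen sketch that grows just a final weight matrix and bias by a few output rows. The (classes x hidden) layout, the helper name, and the small random initialisation are illustrative assumptions, not clstm's actual layout, and in clstm the codec size influences more than this one layer, which is exactly why the surgery is complicated.

#include <Eigen/Dense>

// Grow an output layer by n_new classes: keep the trained rows untouched and
// give the new rows small random weights and zero bias.
void grow_output_layer(Eigen::MatrixXf &W,   // (n_classes x n_hidden)
                       Eigen::VectorXf &b,   // (n_classes)
                       int n_new) {
  const int old_rows = W.rows();
  W.conservativeResize(old_rows + n_new, Eigen::NoChange);
  b.conservativeResize(old_rows + n_new);
  W.bottomRows(n_new) = 0.01f * Eigen::MatrixXf::Random(n_new, W.cols());
  b.tail(n_new).setZero();
}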

@amitdo
Contributor

amitdo commented Nov 7, 2016

Is there a problem with registering chars in the model's codec at build time (first time only), even if some of them won't be trained? For example, for Chinese, registering 6,000-10,000 symbols.

@amitdo
Contributor

amitdo commented Nov 7, 2016

Is this what you're suggesting, @amitdo?

I missed that sentence.

My answer: Certainly not!

@amitdo
Contributor

amitdo commented Nov 7, 2016

My suggested solution:
The user will have an option to point to a file which will contain all the chars he thinks he will ever need for a specific model. Some of the chars might not appear in the dataset given as input for the network. Later on, the user can find another dataset and train the existing model on it. The codec won't be updated.

What do you think about that?
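
A very rough sketch of how that could look at first build time, assuming the Codec interface from clstm.h (a codepoint table plus a set() method, in the ocropus namespace) and a file format of one run of UTF-8 characters with newlines ignored; the blank-class handling is also an assumption to verify against Codec::build.

#include <codecvt>
#include <fstream>
#include <locale>
#include <set>
#include <sstream>
#include <string>
#include <vector>
#include "clstm.h"

// Seed the codec from a UTF-8 file listing every character the model should
// ever support, instead of scanning only the current training set.
ocropus::Codec codec_from_charfile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  std::stringstream buf;
  buf << in.rdbuf();
  std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
  std::wstring chars = conv.from_bytes(buf.str());

  std::set<int> codepoints;
  for (wchar_t c : chars)
    if (c != L'\n' && c != L'\r') codepoints.insert(int(c));

  std::vector<int> table;
  table.push_back(0);  // keep index 0 free, as the codec seems to expect for the blank class
  table.insert(table.end(), codepoints.begin(), codepoints.end());

  ocropus::Codec codec;
  codec.set(table);
  return codec;
}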

@kba
Collaborator

kba commented Nov 8, 2016

I also do not think that there is a sensible approach to extending a trained model for symbols the network was not originally aware of. It is possible to adapt the data structures (e.g. just adding new code points to the codec) but it will result in an inconsistent model unless you fully retrain - which is what you do not want, obviously.

The user will have an option to point to a file which will contain all the chars he thinks he will ever need for a specific model. Some of the chars might not appear in the dataset given as input for the network. #106 (comment)

This seems a straightforward approach depending on how much providing all possible chars in the codec degrades training performance.

In my experience, the more Unicode codepoints you put in the codec, the slower the training process becomes. #106 (comment)

Is it such a performance hit to have a large codec size even if the training data contains only a subset of those characters?

Implementing some form of "pre-loading" of e.g. full Unicode code pages instead of building the codec from the training set (as @amitdo suggests) is doable but I'm at a loss on the consequences wrt performance and network consistency. If the number and frequency of new chars is small (e.g. a few new variants of letters), it will take a long time to accurately predict them, but it seems plausible. If it's a completely independent training set (like extending a Japanese model with Chinese training data), wouldn't that effectively require un-learning the old model and creating a new one?

Also, enabling such pre-loading would require retraining from scratch with the extended codec which can be very time-consuming, depending on the actual number of chars in the training set:

I have been running training on Chinese characters for 5 months; the iteration count is 700,000 and
the error rate is still above 3.0. #81 (comment) @wanghaisheng

I am trying a 2,492-char subset. It seems to take several weeks (hidden=200, this time).
(No, nhidden = 200 seems to be hopeless; it seems to learn one char by forgetting another.)
Now trying 3,700 chars (a little bigger than the tesseract jp-dataset) with nhidden = 800 and nhidden = 1200.
Unless my PC breaks, I will see the result next spring. #49 @isaomatsunami

@amitdo
Contributor

amitdo commented Nov 8, 2016

The issue is mostly with Chinese and Japanese.

@amitdo
Contributor

amitdo commented Nov 9, 2016

Training both Chinese and Japanese in the same model is not a good idea.

@striversist
Author

Chinese has so many characters that we usually train only the commonly used ones.
But sometimes that's not enough: we want to add some uncommon characters, and that's when the problem happens.
If we retrain from scratch, that will without a doubt take a very long time. I think it's the same story for Japanese training.

@striversist
Author

striversist commented Nov 9, 2016

I think your external codec file solution is good. We can prepare some codec entries for future use. @amitdo

@wanghaisheng

We often come across multilingual documents such as English-Chinese or Japanese-Chinese. The characters of both languages are valuable for our use case.

@mittagessen
Contributor

mittagessen commented Nov 9, 2016

Resizing the output layer of the network after training is generally not possible, although it would be possible to pre-create unused nodes and make up codec entries for them afterwards. On the other hand, this is a less than smart idea, as the performance impact is rather high even for small scripts and their combinations, e.g. Greek and Latin (codec size < 300).

IMHO just retrain your models and invest some time to streamline the process. It's something you should be doing anyway and is quite a bit more straightforward than trying to repurpose already existing models.

Finally, with Unihan it is actually quite a neat idea to train combined CJK models, as it shouldn't increase the output layer size for the vast majority of glyphs in either Hanzi script. On the other hand, finding a network configuration that works for this multi-font model may take some hyperparameter exploration.

@amitdo
Contributor

amitdo commented Nov 9, 2016

@striversist, I decided not to implement what I suggested before. It seems not to be such a good idea after all. Sorry.
