
load pretrained model assert failed #106

Open
striversist opened this issue Nov 7, 2016 · 21 comments

@striversist

striversist commented Nov 7, 2016

Loading a pretrained model to retrain on new samples causes an assert failure in Codec::encode, but when training from scratch this problem does not happen.
See related issue #83.

After digging into the code a little, I found this clue in main1 of clstmocrtrain.cc:

if (load_name != "") {
    clstm.load(load_name);
  } else {
    Codec codec;
    trainingset.getCodec(codec);
    print("got", codec.size(), "classes");

If training from scratch, load_name is empty, so execution goes to trainingset.getCodec(codec);. Inside that function the chain codec.build(gtnames, charsep); -> Codec::set is executed, so all the characters occurring in the training samples are inserted into the encoder map.

If a pretrained model is loaded to retrain on new samples, load_name is not empty, and clstm.load(load_name); loads the pretrained codec into the encoder map. Later, in Codec::encode, if a new sample string contains a character that is not in the pretrained encoder map, assert(encoder->count(c) > 0); fails.
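
To make the failure path concrete, here is a tiny self-contained sketch (not the actual clstm code, just the logic described above): the encoder map is fixed by whatever codec was loaded with the model, so any character in the new ground truth that the old model never saw trips the assert.

#include <cassert>
#include <map>
#include <string>
#include <vector>

// Minimal stand-in for the behaviour of Codec::encode described above;
// the real implementation lives in clstm.cc.
struct MiniCodec {
  std::map<wchar_t, int> encoder;  // filled from the pretrained model's codec

  std::vector<int> encode(const std::wstring &s) {
    std::vector<int> classes;
    for (wchar_t c : s) {
      assert(encoder.count(c) > 0);  // fires for chars the old model never saw
      classes.push_back(encoder[c]);
    }
    return classes;
  }
};

int main() {
  MiniCodec codec;
  codec.encoder = {{L'a', 1}, {L'b', 2}};  // pretend this came from clstm.load()
  codec.encode(L"ab");   // fine: both chars are in the pretrained codec
  codec.encode(L"abc");  // asserts: 'c' only exists in the new training samples
}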

Hope contributors fix this problem ASAP.

@wanghaisheng

+1

@striversist
Author

striversist commented Nov 7, 2016

The following change temporarily fixes the assert failure. @wanghaisheng

if (load_name != "") {
    clstm.load(load_name);
    trainingset.getCodec(clstm.net->codec);    // Add this line
  } else {
    Codec codec;
    trainingset.getCodec(codec);
    print("got", codec.size(), "classes");

I don't know whether there are other side effects.

@wanghaisheng

I looked into the source code you refer to. It seems that after loading an existing model we should first get the codec vector for that model and the codec for the training set, then combine the two vectors into one and apply it with Codec::set (https://github.com/tmbdev/clstm/blob/master/clstm.cc).
At first I wanted to train for Chinese based on the Japanese model. I have not tried it yet, but my training data and the data used for the existing Japanese model are definitely not the same.
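
A rough sketch of that merge, assuming clstm's types live in the ocropus namespace, that Codec keeps its codepoint table in a public vector<int> member named codec, and that Codec::set rebuilds the encoder map; all of these names should be checked against clstm.h/clstm.cc.

#include <set>
#include <vector>
#include "clstm.h"

// Merge the pretrained model's codec with the training set's codec, keeping
// the old class order so existing output indices stay valid, then append any
// new codepoints at the end.
ocropus::Codec merge_codecs(const ocropus::Codec &model_codec,
                            const ocropus::Codec &dataset_codec) {
  std::vector<int> merged = model_codec.codec;       // preserve pretrained order
  std::set<int> seen(merged.begin(), merged.end());
  for (int cp : dataset_codec.codec)
    if (seen.insert(cp).second)                      // only codepoints not seen yet
      merged.push_back(cp);
  ocropus::Codec out;
  out.set(merged);                                   // rebuilds the encoder map
  return out;
}

Even then, appending classes to the codec alone would not be sufficient, because the loaded network's output layer was sized for the old codec (see the discussion below).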

@kba added the bug label Nov 7, 2016
@amitdo
Contributor

amitdo commented Nov 7, 2016

Thanks for reporting.

I don't think it's a good solution to read the whole dataset on every load.

@amitdo
Contributor

amitdo commented Nov 7, 2016

The best solution IMHO would be to encode all ~128,000 Unicode codepoints at the first load.

Update: It's not a good idea, see comments below.

@striversist
Author

@amitdo In my experience, the more Unicode codepoints you put in the codec, the slower the training process becomes.
So I don't think that's the best solution, no offence.

@amitdo
Contributor

amitdo commented Nov 7, 2016

I don't suggest actually doing training on all those chars...

@kba changed the title from "load pretrainned model assert failed" to "load pretrained model assert failed" Nov 7, 2016
@striversist
Author

striversist commented Nov 7, 2016

OK, looking forward to your solution.

@amitdo
Contributor

amitdo commented Nov 7, 2016

In the meantime, don't use your temporary solution. I believe it will mess up your model.

@jbaiter

jbaiter commented Nov 7, 2016

Hope contributors fix this problem ASAP.

I don't think this will be easy. The codec determines the size of the network's layers, i.e. there will be weights/connections in the network for each of the characters in the codec. To add new characters not in the original training data during re-training, you would have to modify the structure of the network before training, which is pretty complicated: you'd have to add extra dimensions to a lot of the weight/bias matrices. Is this what you're suggesting, @amitdo?
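
To illustrate what that structural change would involve, here is a toy Eigen sketch that grows just a final weight matrix and bias by a few output rows. The (classes x hidden) layout, the helper name, and the small random initialisation are illustrative assumptions, not clstm's actual layout, and in clstm the codec size influences more than this one layer, which is exactly why the surgery is complicated.

#include <Eigen/Dense>

// Grow an output layer by n_new classes: keep the trained rows untouched and
// give the new rows small random weights and zero bias.
void grow_output_layer(Eigen::MatrixXf &W,   // (n_classes x n_hidden)
                       Eigen::VectorXf &b,   // (n_classes)
                       int n_new) {
  const int old_rows = W.rows();
  W.conservativeResize(old_rows + n_new, Eigen::NoChange);
  b.conservativeResize(old_rows + n_new);
  W.bottomRows(n_new) = 0.01f * Eigen::MatrixXf::Random(n_new, W.cols());
  b.tail(n_new).setZero();
}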

@amitdo
Contributor

amitdo commented Nov 7, 2016

Is there a problem with registering chars in the model's codec at build time (first time only), even if some of them won't be trained? For example, for Chinese, registering 6,000-10,000 symbols.

@amitdo
Contributor

amitdo commented Nov 7, 2016

Is this what you're suggesting, @amitdo?

I missed that sentence.

My answer: Certainly not!

@amitdo
Contributor

amitdo commented Nov 7, 2016

My suggested solution:
The user will have an option to point to a file which will contain all the chars he thinks he will ever need for a specific model. Some of the chars might not appear in the dataset given as input for the network. Later on, the user can find another dataset and train the existing model on it. The codec won't be updated.

What do you think about that?
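
A very rough sketch of how that could look at first build time, assuming the Codec interface from clstm.h (a codepoint table plus a set() method, in the ocropus namespace) and a file format of one run of UTF-8 characters with newlines ignored; the blank-class handling is also an assumption to verify against Codec::build.

#include <codecvt>
#include <fstream>
#include <locale>
#include <set>
#include <sstream>
#include <string>
#include <vector>
#include "clstm.h"

// Seed the codec from a UTF-8 file listing every character the model should
// ever support, instead of scanning only the current training set.
ocropus::Codec codec_from_charfile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  std::stringstream buf;
  buf << in.rdbuf();
  std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
  std::wstring chars = conv.from_bytes(buf.str());

  std::set<int> codepoints;
  for (wchar_t c : chars)
    if (c != L'\n' && c != L'\r') codepoints.insert(int(c));

  std::vector<int> table;
  table.push_back(0);  // keep index 0 free, as the codec seems to expect for the blank class
  table.insert(table.end(), codepoints.begin(), codepoints.end());

  ocropus::Codec codec;
  codec.set(table);
  return codec;
}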

@kba
Collaborator

kba commented Nov 8, 2016

I also do not think that there is a sensible approach to extending a trained model for symbols the network was not originally aware of. It is possible to adapt the data structures (e.g. just adding new code points to the codec) but it will result in an inconsistent model unless you fully retrain - which is what you do not want, obviously.

The user will have an option to point to a file which will contain all the chars he thinks he will ever need for a specific model. Some of the chars might not appear in the dataset given as input for the network. #106 (comment)

This seems a straightforward approach depending on how much providing all possible chars in the codec degrades training performance.

In my experience, the more Unicode codepoints you put in the codec, the slower the training process becomes. #106 (comment)

Is it such a performance hit to have a large codec size even if the training data contains only a subset of those characters?

Implementing some form of "pre-loading" of e.g. full Unicode code pages instead of building the codec from the training set (as @amitdo suggests) is doable but I'm at a loss on the consequences wrt performance and network consistency. If the number and frequency of new chars is small (e.g. a few new variants of letters), it will take a long time to accurately predict them, but it seems plausible. If it's a completely independent training set (like extending a Japanese model with Chinese training data), wouldn't that effectively require un-learning the old model and creating a new one?

Also, enabling such pre-loading would require retraining from scratch with the extended codec which can be very time-consuming, depending on the actual number of chars in the training set:

I have been running training on Chinese characters for 5 months; the iteration count is 700,000 and
the error rate is still above 3.0. #81 (comment) @wanghaisheng

I am trying a 2,492-char subset. It seems to take several weeks (hidden=200, this time).
(No, nhidden = 200 seems to be hopeless; it seems to learn one char by forgetting another.)
Now trying 3,700 chars (a little bigger than the tesseract jp-dataset) with nhidden = 800 and nhidden = 1200.
Unless my PC breaks, I will see the result next spring. #49 @isaomatsunami

@amitdo
Contributor

amitdo commented Nov 8, 2016

The issue is mostly with Chinese and Japanese.

@amitdo
Contributor

amitdo commented Nov 9, 2016

Training both Chinese and Japanese in the same model is not a good idea.

@striversist
Author

Chinese has so many characters that we usually train only the commonly used ones.
But sometimes that's not enough: we want to add some uncommon characters, and that's when the problem happens.
If we retrain from scratch, that will without a doubt take a very long time. I think it's the same story for Japanese training.

@striversist
Author

striversist commented Nov 9, 2016

I think your external codec file solution is good. We can prepare some codec entries for future use. @amitdo

@wanghaisheng

We often come across multilingual documents such as English-Chinese or Japanese-Chinese. The characters of both languages are valuable for our use case.

@mittagessen
Contributor

mittagessen commented Nov 9, 2016

Resizing the output layer of the network after training is generally not possible, although it would be possible to pre-create unused nodes and make up codec entries for them afterwards. On the other hand, this is a less than smart idea, as the performance impact is rather high even for small scripts and their combinations, e.g. Greek and Latin (codec size < 300).

IMHO just retrain your models and invest some time to streamline the process. It's something you should be doing anyway and is quite a bit more straightforward than trying to repurpose already existing models.

Finally, with Unihan it is actually quite a neat idea to train combined CJK models, as it shouldn't increase the output layer size for the vast majority of glyphs in either Hanzi script. On the other hand, finding a network configuration that works for this multi-font model may take some hyperparameter exploration.

@amitdo
Contributor

amitdo commented Nov 9, 2016

@striversist, I decided not to implement what I suggested before. It seems not to be such a good idea after all. Sorry.
