Multiple models in one session #212
Is this possible? I'm doing something where I need the outputs of multiple distinct models to be compared. To do that, I'm batching up inputs and running them over each model. It's unclear to me how I can do this in one session.

Comments
If you create both of your graphs within the same session, and as long as they are distinct subgraphs within the graph, you should be able to run both models within the same session. pseudo-example:
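(The pseudo-example itself did not survive the page export. Below is a minimal sketch of the idea in the TF 0.x-era API the thread uses, with two trivial stand-in "models" in place of real seq2seq models; the scope names are made up.)

```python
import tensorflow as tf

# Two distinct subgraphs in the same default graph, kept apart by scope.
with tf.variable_scope("model_a"):
    a_w = tf.get_variable("w", shape=[4], initializer=tf.constant_initializer(1.0))
    a_out = a_w * 2.0

with tf.variable_scope("model_b"):
    b_w = tf.get_variable("w", shape=[4], initializer=tf.constant_initializer(1.0))
    b_out = b_w * 3.0

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    # The same session can run either subgraph.
    print(sess.run(a_out))
    print(sess.run(b_out))
```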
Let me know if that doesn't work for some reason.
Thanks! Will try this and report back.
I'm using a function very similar to the pseudo-example above, and I'm running into an error. Both of the models were built at a previous time. They were made with the same parameters (50,000 vocab size, 400-unit layers, 3 layers), and if I run them in interactive decoding mode, they both work just fine. In addition, the first model that I specify loads just fine, regardless of which one I put first. I know this because if I run it as is, the first one loads and the second one chokes. If I then remove the second one, the program runs without a hiccup. And if I switch them so that the second model comes before the first, the same thing happens, but in reverse. This leads me to think that there is something up with the graph loading, but I'm not quite sure what.
@lukaszkaiser: is this a matter of wrapping the model creation inside a name or var scope?
I'm not 100% sure, but I think that the problem might be that the Seq2SeqModel class has some non-sharable things, like global_step and a separate saver. So if you just make 2 of them, you'll have 2 global_steps, 2 savers, and so on. I think you need to move the common things out of the class. And yes, it's best to wrap the 2 models you're creating in different tf.variable_scopes so they don't share parameters (or in the same variable_scope with reuse=True to share them). Let us know if that helps or if there are more problems!
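(A tiny sketch of the two scoping options, using a stand-in build function rather than the real Seq2SeqModel constructor.)

```python
import tensorflow as tf

def build_model():
    # Stand-in for the real model construction.
    return tf.get_variable("w", shape=[10, 10])

# Separate parameters per model: distinct scopes.
with tf.variable_scope("model_en"):
    m1 = build_model()
with tf.variable_scope("model_de"):
    m2 = build_model()

# Shared parameters: the same scope, with reuse=True the second time.
with tf.variable_scope("shared"):
    s1 = build_model()
with tf.variable_scope("shared", reuse=True):
    s2 = build_model()

print(m1.name, m2.name)  # model_en/w:0 and model_de/w:0 (different variables)
print(s1.name, s2.name)  # shared/w:0 twice (the same variable)
```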
Thanks for the fast replies guys. I changed it so that the model is created under a separate scope when I build it.
Then I trained a couple of test models for a few hundred steps. Afterwards, I tried loading them again. They still load fine individually and decode, etc. So I tried loading them both at the same time:
The call to restore the second model fails. I thought this had to be just a coding error, but debugging shows that the model directory it's restoring from and the model_name it's using for the scope are aligned. What am I missing? I don't think it's to do with the saver, because saving and restoring work fine when using just one model.
Any thoughts on this? I looked through the embedding_rnn_seq2seq function, and it looks like it builds from get_variable and variable_scope throughout. Even though there's a reuse_variables call in, for example, the "RNN" scope, I would think that this would live in a different namespace given that the top-level scope is different ("en-10000-256-2-10000" vs "de-10000-256-2-10000").
Hi cinjon,

Sorry for the delay. I was thinking a bit about what you're doing and it took me a while to wrap my mind around it, but I think I can see why the first model loads ok but the second does not. Here is how I see it. Both model1 and model2 have a saver, right? Created like this: self.saver = tf.train.Saver(tf.all_variables()). This is fine for the first model, as it saves only the variables belonging to it. But, because of the use of tf.all_variables(), the second model will save both itself and model1. Does that make sense?

I see 2 ways to correct that. For one, you could filter the variables saved in model2 to those that start with the prefix you're giving it. Another, and I think better, way is to remove the saver from the 2 models entirely and have it only once in your main loop, after you've created both models. Then you'll have only a single checkpoint file, you can still use tf.all_variables() without the risk of forgetting anything, and it should all work.

Hope that helps!
Lukasz
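(A sketch of the second suggestion: one saver created in the main loop after both models exist. The stand-in models and checkpoint name are hypothetical.)

```python
import tensorflow as tf

def build_model(scope_name):
    # Stand-in for a Seq2SeqModel that no longer owns a saver.
    with tf.variable_scope(scope_name):
        return tf.get_variable("w", shape=[10, 10])

model1 = build_model("model1")
model2 = build_model("model2")

# One saver, created after both models, covers every variable exactly once.
saver = tf.train.Saver(tf.all_variables())

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    # ... training steps for both models would go here ...
    saver.save(sess, "joint.ckpt")  # hypothetical checkpoint name
```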
Thanks Lukasz! I had a chance to try it tonight and I still can't get this right. Are you suggesting that I have one file containing all of the models? What I'm trying to do uses two models, M_1 and M_2, and then a third, J_12, to join them. But there could be a lot more M_k, and for each two of them there would be a J. Ideally, I would put each of the M_k in its own model directory so that I can build J_pq by loading only M_p and M_q. If I had just one checkpoint file, I'd be breaking a lot of the modularity involved, and it would also be difficult to keep track of experiments using different hyper-parameters. Taking your advice, I removed the saver from the Seq2Seq model. The training for the M_k now looks like this:
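(The actual snippet was not preserved; this is only a guess at its shape under those constraints, with a stand-in model, a toy training op, and hypothetical names. The saver lives in the training script rather than in the model class, and each M_k gets its own checkpoint.)

```python
import tensorflow as tf

scope = "M_1"           # hypothetical model name, doubling as the variable scope
ckpt_path = "M_1.ckpt"  # hypothetical per-model checkpoint

with tf.variable_scope(scope):
    # Stand-in for the scoped Seq2SeqModel (internal saver removed).
    w = tf.get_variable("w", shape=[10, 10],
                        initializer=tf.constant_initializer(0.0))
    train_op = w.assign_add(tf.ones([10, 10]))  # toy stand-in for a training step

# Saver created outside the model class.
saver = tf.train.Saver(tf.all_variables())

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for step in range(100):
        sess.run(train_op)
    saver.save(sess, ckpt_path)
```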
And for the J_pq it looks like this:
This didn't work and failed with similar errors (lots of them):
I got it to work! (I think) By following your other suggestion to set the variables explicitly, it seems to be working:
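(The working snippet was also lost in the export; the sketch below shows the explicit-variable-list idea it describes: each M model gets a saver restricted to the variables under its own scope, so restoring one never touches the other model's, or J's, parameters. Stand-in models and hypothetical checkpoint paths again.)

```python
import tensorflow as tf

def build_model(scope_name):
    # Stand-in for one scoped Seq2SeqModel.
    with tf.variable_scope(scope_name):
        return tf.get_variable("w", shape=[10, 10])

m_p = build_model("M_p")
m_q = build_model("M_q")

with tf.variable_scope("J_pq"):
    # Variables that belong only to the joining model.
    j_w = tf.get_variable("w", shape=[10, 10])

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    # Restore each M from its own checkpoint, passing its variables explicitly.
    for scope, ckpt_path in [("M_p", "M_p.ckpt"), ("M_q", "M_q.ckpt")]:
        var_list = [v for v in tf.all_variables()
                    if v.name.startswith(scope + "/")]
        tf.train.Saver(var_list).restore(sess, ckpt_path)
```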
Thanks so much for your help guys!
I am running into a similar error. I have trained two models, en-fr and fr-en. The fr-en model has a different vocab size compared to the en-fr model. The first model loads properly, but it crashes while loading the second one. I have trained both models separately. Please check the attached log file to see the error logs. Kindly tell me the right way to go ahead.
Below are the two methods, create_model and fr_en_create_model.
You're creating both models so that they share variables, which is not possible when vocabulary sizes differ. If you want separate variables for each model, use, e.g., "with tf.variable_scope('enfr'):" when creating one and "with tf.variable_scope('fren'):" when creating the other.
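(Roughly, and with made-up vocabulary sizes and a stand-in projection variable in place of the real model-building calls, the scoping looks like this; the two "proj" variables can even have different shapes because they live under different prefixes.)

```python
import tensorflow as tf

with tf.variable_scope("enfr"):
    # en->fr model, built with its own vocabulary sizes (stand-in variable).
    enfr_proj = tf.get_variable("proj", shape=[40000, 1024])

with tf.variable_scope("fren"):
    # fr->en model with a different vocabulary size: no clash, because its
    # variables live under the "fren/" prefix.
    fren_proj = tf.get_variable("proj", shape=[35000, 1024])

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    print(enfr_proj.name, fren_proj.name)  # enfr/proj:0 fren/proj:0
```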