Multiple models in one session #212

Closed · cinjon opened this issue Nov 13, 2015 · 12 comments

cinjon commented Nov 13, 2015

Is this possible? I'm working on something where I need to compare the outputs of multiple distinct models. To do that, I'm batching up inputs and running them through each model. It's unclear to me how to do this in one session.


vrv commented Nov 13, 2015

If you build both models as distinct subgraphs within the same graph, you should be able to run both within the same session.

pseudo-example:

with tf.Session() as sess:
  # Build both models as distinct subgraphs of the same (default) graph.
  model1_output_node = build_model_1()
  model2_output_node = build_model_2()

  model1_output = sess.run(model1_output_node, feed_dict={...})
  model2_output = sess.run(model2_output_node, feed_dict={...})

  # ... compare the results ...
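
For concreteness, here's a minimal runnable version of the same idea (the two toy linear models below are just stand-ins for whatever build_model_1/build_model_2 actually construct):

import tensorflow as tf

def build_model_1():
  x = tf.placeholder(tf.float32, [None, 4], name='input1')
  w = tf.Variable(tf.random_normal([4, 2]), name='weights1')
  return x, tf.matmul(x, w)

def build_model_2():
  x = tf.placeholder(tf.float32, [None, 4], name='input2')
  w = tf.Variable(tf.random_normal([4, 2]), name='weights2')
  return x, tf.matmul(x, w)

with tf.Session() as sess:
  # Both models live in the same default graph as disjoint subgraphs.
  input1, output1 = build_model_1()
  input2, output2 = build_model_2()
  sess.run(tf.initialize_all_variables())

  batch = [[1.0, 2.0, 3.0, 4.0]]
  result1 = sess.run(output1, feed_dict={input1: batch})
  result2 = sess.run(output2, feed_dict={input2: batch})
  # ... compare result1 and result2 ...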

Let me know if that doesn't work for some reason.


cinjon commented Nov 14, 2015

Thanks! Will try this and report back.

vrv self-assigned this Nov 14, 2015

cinjon commented Nov 19, 2015

I'm using a function very similar to create_model in models/rnn/translate/translate.py (reproduced below) to load two RNN encoder-decoders like this:

with tf.Session() as session:
    print 'Creating model one from directory %s.' % FLAGS.model_dir_one
    model_one = create_model(session, forward_only=True, model_dir=FLAGS.model_dir_one)

    print 'Creating model two from directory %s.' % FLAGS.model_dir_two
    model_two = create_model(session, forward_only=True, model_dir=FLAGS.model_dir_two)

And I'm running into an error: tensorflow.python.framework.errors.NotFoundError: Tensor name "Variable_2" not found in checkpoint files <...>/model.ckpt-<###> (full traceback at the bottom).

Both models were built previously with the same parameters (50,000 vocab size, layer size 400, 3 layers), and if I run them in interactive decoding mode, they both work just fine.

In addition, whichever model I specify first loads just fine. If I run the code as is, the first one loads and the second one chokes; if I then remove the second one, the program runs without a hiccup; and if I swap their order, the same thing happens in reverse.

This leads me to think that something is up with the graph loading, but I'm not quite sure what.

def create_model(session, forward_only, model_dir=None):                                                             
  """Create translation model and initialize or load parameters in session."""                       
  model = seq2seq_model.Seq2SeqModel(                                                                
      FLAGS.en_vocab_size, FLAGS.fr_vocab_size, _buckets,                                            
      FLAGS.size, FLAGS.num_layers, FLAGS.max_gradient_norm, FLAGS.batch_size,                       
      FLAGS.learning_rate, FLAGS.learning_rate_decay_factor,                                         
      forward_only=forward_only)

  ckpt = tf.train.get_checkpoint_state(model_dir)
  if ckpt and gfile.Exists(ckpt.model_checkpoint_path):
    print("Reading model parameters from %s" % ckpt.model_checkpoint_path)                           
    model.saver.restore(session, ckpt.model_checkpoint_path)
  else:
    print("Created model with fresh parameters.")                                                    
    session.run(tf.initialize_all_variables())
  return model
...
 File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 867, in restore
    sess.run([self._restore_op_name], {self._filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 349, in run
    results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 423, in _do_run
    e.code)
tensorflow.python.framework.errors.NotFoundError: Tensor name "Variable_2" not found in checkpoint files /home/ubuntu/tensorflow/research/lm/iwslt-train-tags-de-50000/translate.ckpt-59200  
         [[Node: save_1/restore_slice_2 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/restore_slice_2/tensor_name, save_1/restore_slice_2/shape_and_slice)]]  
         [[Node: save_1/restore_slice_12/_99 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3360_save_1/restore_slice_12", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'save_1/restore_slice_2', defined at:

...

  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 691, in __init__
    restore_sequentially=restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 411, in build
    filename_tensor, vars_to_save, restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 170, in _AddRestoreOps
    values = self.restore_op(filename_tensor, vs, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 87, in restore_op
    preferred_shard=preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 173, in _restore_slice
    preferred_shard, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 271, in _restore_slice
    preferred_shard=preferred_shard, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 638, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1733, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1008, in __init__
    self._traceback = _extract_stack()

vrv assigned lukaszkaiser and unassigned vrv Nov 19, 2015

vrv commented Nov 19, 2015

@lukaszkaiser: is this a matter of wrapping the model creation inside a name or var scope?

lukaszkaiser (Contributor) commented

I'm not 100% sure, but I think the problem might be that the Seq2SeqModel class has some non-sharable things, like a global_step and a separate saver. So if you just make two of them, you'll have two global_steps, two savers, and so on. I think you need to move the common things out of the class. And yes, it's best to wrap the two models you're creating in different tf.variable_scopes so they don't share parameters (or in the same variable_scope with reuse=True to share them).
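
To illustrate the scoping (a minimal sketch; the names and sizes are placeholders):

import tensorflow as tf

def build_embedding(vocab_size):
  # Stand-in for model construction: anything created with tf.get_variable
  # picks up the enclosing variable_scope as a name prefix.
  return tf.get_variable('embedding', [vocab_size, 8])

# Separate parameters per model (distinct variable names):
with tf.variable_scope('model_one'):
  emb1 = build_embedding(100)  # named 'model_one/embedding'
with tf.variable_scope('model_two'):
  emb2 = build_embedding(100)  # named 'model_two/embedding'

# Or shared parameters (the very same underlying variable):
with tf.variable_scope('shared'):
  emb_a = build_embedding(100)
with tf.variable_scope('shared', reuse=True):
  emb_b = build_embedding(100)  # reuses 'shared/embedding'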

Let us know if that helps or if there are more problems!


cinjon commented Nov 20, 2015

Thanks for the fast replies, guys.

I changed it so that each model is created under a separate scope when I build it:

    with tf.variable_scope(model_name):
      model = Seq2SeqModel(...)

    ...

    class Seq2SeqModel(object):
        def __init__(...):
            ...
            self.learning_rate = tf.get_variable(
                'learning_rate', shape=[],
                initializer=tf.constant_initializer(float(learning_rate)),
                trainable=False)
            self.global_step = tf.get_variable(
                'global_step', shape=[],
                initializer=tf.constant_initializer(0),
                trainable=False)
            ...

Then I trained a couple of test models for a few hundred steps. Afterwards, I tried loading them again. They still load and decode fine individually.

So I tried loading them both at the same time:

print 'Creating model one from directory %s with scope %s.' % (                              
    FLAGS.model_one_dir, FLAGS.model_one_scope)                                                    
model_one = get_rnn_model(session, model_dir=FLAGS.model_one_dir,                                  
                          model_name=FLAGS.model_one_scope)                                  
model_one.batch_size = bridge_batch_size                                                     

print 'Creating model two from directory %s with scope %s.' % (                              
    FLAGS.model_two_dir, FLAGS.model_two_scope)                                                    
model_two = get_rnn_model(session, model_dir=FLAGS.model_two_dir,                                  
                          model_name=FLAGS.model_two_scope)                                  
model_two.batch_size = bridge_batch_size   

The call to get_rnn_model just encapsulates the model restoration in a with tf.variable_scope(model_name) block. However, when the models are loaded together, I run into the following error, where the scope for the first model is used in place of the scope for the second model (...Tensor name "de-10000-256-2-10000/embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding" not found in checkpoint files model-en-10000-256-2-10000/translate.ckpt-300).

I thought this had to be a coding error, but debugging shows that the model directory it's restoring from and the model_name it's using for the scope are aligned.

What am I missing? I don't think it has to do with the saver, because saving and restoring work fine with just one model.

Creating model one from directory <...>/model-de-10000-256-2-10000 with scope de-10000-256-2-10000.
Reading model parameters from model-de-10000-256-2-10000/translate.ckpt-400

Creating model two from directory <...>/model-en-10000-256-2-10000 with scope en-10000-256-2-10000.
Reading model parameters from model-en-10000-256-2-10000/translate.ckpt-300

W tensorflow/core/common_runtime/executor.cc:1052] 0xbec32b0 Compute status: Not found: Tensor name "de-10000-256-2-10000/embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding" not found in checkpoint files model-en-10000-256-2-10000/translate.ckpt-300


cinjon commented Nov 23, 2015

Any thoughts on this? I looked through the embedding_rnn_seq2seq function, and it builds with get_variable and variable_scope throughout. Even though there's a reuse_variables() in, for example, the "RNN" scope, I would think it would be in a different namespace, given that the top-level scopes differ ("en-10000-256-2-10000" vs "de-10000-256-2-10000").

lukaszkaiser (Contributor) commented

Hi cinjon,

Sorry for the delay. I was thinking a bit about what you're doing, and it took me a while to wrap my mind around it, but I think I can see why the first model loads OK but the second does not.

Here is how I see it. Both model1 and model2 have a saver, right? Created like this:

  self.saver = tf.train.Saver(tf.all_variables())

This is OK for the first model, as it saves only the variables belonging to it. But, because of the use of tf.all_variables(), the second model will save both itself and model1. Does that make sense?

I see two ways to correct that. For one, you could filter the variables saved in model2 down to those starting with the prefix you're giving it. Another, and I think better, way is to remove the saver from the two models entirely, and have it only once in your main loop, after you've created both models. Then you'll have only a single checkpoint file, you can still use tf.all_variables() without the risk of forgetting anything, and it should all work.
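
In code, the two options might look roughly like this (a sketch; the scope prefix and checkpoint paths are placeholders):

# Option 1: one saver per model, restricted to that model's variables.
model_two_saver = tf.train.Saver(
    [v for v in tf.all_variables() if v.name.startswith('model_two')])
model_two_saver.restore(session, model_two_checkpoint_path)

# Option 2: a single saver over everything, created in the main loop
# after both models exist, reading/writing one combined checkpoint.
saver = tf.train.Saver(tf.all_variables())
saver.save(session, checkpoint_path)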

Hope that helps!

Lukasz


cinjon commented Nov 25, 2015

Thanks, Lukasz! I had a chance to try it tonight, and I still can't get this right.

Are you suggesting that I have one checkpoint file containing all of the models? What I'm trying to do utilizes two models, M_1 and M_2, and then a third, J_12, to join them. But there could be many more M_k, and for each pair of them there would be a J. Ideally, I would put each of the M_k in its own model directory so that I can build J_pq by loading only M_p and M_q.

If I had just one checkpoint file, I'd be breaking a lot of the modularity involved. It would also be difficult to keep track of experiments using different hyperparameters.

Taking your advice, I removed the saver from the Seq2Seq model. The training for the M_k now looks like this:

def create_model(..., do_restore=True):
  ...
  with tf.variable_scope(model_name):
    model = Seq2SeqModel(
        vocab_size, vocab_size, bucket_sizes, layer_size, num_layers,
        max_gradient_norm, batch_size, learning_rate, learning_rate_decay_factor,
        forward_only=forward_only)

  ...
  saver = tf.train.Saver(tf.all_variables())
  if model_checkpoint_path:
    if do_restore:
      saver.restore(session, model_checkpoint_path)
    else:
      return model_checkpoint_path, model, saver
  else:
    session.run(tf.initialize_all_variables())
  return None, model, saver

ckpt_path, model, saver = create_model(..., model_name=model_name, do_restore=True)
<Train Train Train>
if current_step % steps_per_checkpoint == 0:
    saver.save(session, ckpt_path, global_step=model.global_step)

And for the J_pq it looks like this:

ckpt_path_one, model_one, _ = get_rnn(
            ..., model_dir=model_dir_one, model_name=model_one_scope, do_restore=False)
ckpt_path_two, model_two, _ = get_rnn(
            ..., model_dir=model_dir_two, model_name=model_two_scope, do_restore=False)

with tf.variable_scope('j_pq'):
    j_pq = ...

saver = tf.train.Saver(tf.all_variables())
saver.restore(session, ckpt_path_one)
saver.restore(session, ckpt_path_two)

This didn't work and failed with similar errors (lots of them):

Tensor name "j_pq/weights" not found in checkpoint files model-de-10000-256-2-10000/translate.ckpt-300
Tensor name "model-en-10000-256-2-10000/embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/MultiRNNCell/Cell1/GRUCell/Candidate/Linear/Matrix" not found in checkpoint files model-de-10000-256-2-10000/translate.ckpt-300


cinjon commented Nov 25, 2015

I got it to work! (I think)

By following your other suggestion to set the variables explicitly, it seems to be working:

all_vars = tf.all_variables()
model_one_vars = [k for k in all_vars if k.name.startswith(FLAGS.model_one_scope)]
model_two_vars = [k for k in all_vars if k.name.startswith(FLAGS.model_two_scope)]
j_pq_vars    = [k for k in all_vars if k.name.startswith('j_pq')]

tf.train.Saver(model_one_vars).restore(sess, model_one_checkpoint)
tf.train.Saver(model_two_vars).restore(sess, model_two_checkpoint)

saver = tf.train.Saver(j_pq_vars)
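
The j_pq saver can then checkpoint just the join model (the path here is illustrative):

# Save only the join model's variables; the two base models keep
# their own separate checkpoints.
saver.save(sess, 'j_pq_dir/j_pq.ckpt')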

Thanks so much for your help, guys!

cinjon closed this as completed Nov 28, 2015

suraj-vantigodi commented Jul 11, 2016

I am running into a similar error. I have trained two models, en-fr and fr-en; the fr-en model has a different vocab size than the en-fr model. The first model loads properly, but it crashes while loading the second. I trained both models separately. Please check the attached log file for the error output, and kindly tell me the right way to proceed.

global en_fr_sess
global en_fr_model
global en_fr_en_vocab_path
global en_fr_fr_vocab_path

if en_fr_sess is None:
  en_fr_sess = tf.Session()
  en_fr_model = create_model(en_fr_sess, True)
  en_fr_model.batch_size = 1
  fr_en_model = fr_en_create_model(en_fr_sess, True)
  fr_en_model.batch_size = 1

Below are the two methods, create_model and fr_en_create_model:

def create_model(session, forward_only):
    model = seq2seq_model.Seq2SeqModel(
        FLAGS.en_vocab_size, FLAGS.fr_vocab_size, _buckets,
        FLAGS.size, FLAGS.num_layers, FLAGS.max_gradient_norm, FLAGS.batch_size,
        FLAGS.learning_rate, FLAGS.learning_rate_decay_factor,
        forward_only=forward_only)
    ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
    if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):
        print("Reading model parameters from %s" % ckpt.model_checkpoint_path)
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        print("Created model with fresh parameters.")
        session.run(tf.initialize_all_variables())
    return model

def fr_en_create_model(session, forward_only):
    model = seq2seq_model.Seq2SeqModel(
        380, 380, _buckets,
        FLAGS.size, FLAGS.num_layers, FLAGS.max_gradient_norm, FLAGS.batch_size,
        FLAGS.learning_rate, FLAGS.learning_rate_decay_factor,
        forward_only=forward_only)
    ckpt = tf.train.get_checkpoint_state("./password_data/train/")
    if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):
        print("Reading model parameters from %s" % ckpt.model_checkpoint_path)
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        print("Created model with fresh parameters.")
        session.run(tf.initialize_all_variables())
    return model

Attachment: out.txt

lukaszkaiser (Contributor) commented

You're creating both models so that they share variables, which is not possible when vocabulary sizes differ. If you want separate variables for each model, use, e.g., "with tf.variable_scope('enfr'):" when creating one and "with tf.variable_scope('fren'):" when creating the other.
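
A minimal sketch of that fix, reusing the create_model functions above (note, per the discussion earlier in this thread, that each model's saver may then also need to be restricted to its own scope's variables):

# Give each model its own variable scope so their variables get
# distinct, non-shared names.
with tf.variable_scope('enfr'):
  en_fr_model = create_model(en_fr_sess, True)
with tf.variable_scope('fren'):
  fr_en_model = fr_en_create_model(en_fr_sess, True)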
