
Pretrained model for img2txt? #466

Closed
ludazhao opened this issue Sep 28, 2016 · 115 comments

Comments

@ludazhao

Please let us know which model this issue is about (specify the top-level directory)

models/img2txt

Can someone release a pre-trained model for the img2txt model trained on COCO? It would be great for those here who don't have the computational resources to do a full training run. Thanks!

@concretevitamin

@cshallue: could you comment on this? Thanks.

@concretevitamin concretevitamin added the stat:awaiting model gardener Waiting on input from TensorFlow model gardener label Sep 28, 2016
@siavashk

+1

@cshallue
Contributor

Sorry, we're not releasing a pre-trained version of this model at this time.

@cshallue cshallue removed the stat:awaiting model gardener Waiting on input from TensorFlow model gardener label Sep 30, 2016
@psycharo-zz
Contributor

psycharo-zz commented Oct 5, 2016

here are links to a pre-trained model:

@cshallue
Contributor

@psycharo thanks for sharing! Perhaps you could also share your word_counts.txt file. Different versions of the tokenizer can yield different results, so your model is specific to the word_counts.txt file that you used.

@siavashk

@psycharo My model is still training on our GPU instance; it seems it will take another two weeks to finish. I would appreciate it if you would also release the fine-tuned model.

@ProgramItUp

@psycharo Thanks for sharing your checkpoint!

When I try to use it I'm getting the error: "ValueError: No checkpoint file found in: None".
I have no trouble running run_inference on my own checkpoint files, but I can't run it on yours. I've tried lots of things: adding a trailing "/", using absolute paths, relative paths... Nothing seems to work.

Suggestions welcomed.
@cshallue - Any thoughts?

Thanks all.

Last login: Sat Oct 15 07:10:56 2016 from 3.202.121.241
user123@myhost:~$ ls -l /tmp/checkpoint_tmp/
total 175356
-rw-r--r-- 1 user123 user123  19629588 Oct 15 07:04 graph.pbtxt
-rw-r--r-- 1 user123 user123 149088120 Oct 15 07:04 model.ckpt-2000000
-rw-r--r-- 1 user123 user123  10675545 Oct 15 07:04 model.ckpt-2000000.meta
-rw-rw-r-- 1 user123 user123    156438 Oct 15 07:08 word_counts.txt
user123@myhost:~$  /data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference   --checkpoint_path=/tmp/checkpoint_tmp   --vocab_file=/tmp/checkpoint_tmp/word_counts.txt   --input_files=${IMAGE_FILE}
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
  File "/data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 49, in main
    FLAGS.checkpoint_path)
  File "/data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/inference_wrapper_base.py", line 118, in build_graph_from_config
    return self._create_restore_fn(checkpoint_path, saver)
  File "/data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/inference_wrapper_base.py", line 92, in _create_restore_fn
    raise ValueError("No checkpoint file found in: %s" % checkpoint_path)
ValueError: No checkpoint file found in: None
user123@myhost:~$

@cshallue
Contributor

cshallue commented Oct 15, 2016

@ProgramItUp Try the following: --checkpoint_path=/tmp/checkpoint_tmp/model.ckpt-2000000

When you pass a directory, it looks for a "checkpoint state" file in that directory, which is an index of all checkpoints in the directory. Your directory doesn't have a checkpoint state file, but you can just pass it the explicit filename.
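If you want directory-passing to work, you can also write that checkpoint state file yourself. A minimal sketch in plain Python (directory and checkpoint name taken from the session above; adjust to your own paths):

```python
import os

# Directory and checkpoint name from the session above; adjust as needed.
checkpoint_dir = "/tmp/checkpoint_tmp"
ckpt_name = "model.ckpt-2000000"

if not os.path.isdir(checkpoint_dir):
    os.makedirs(checkpoint_dir)

# The "checkpoint" state file is a small text proto that indexes the
# checkpoints in a directory; tf.train.latest_checkpoint reads it.
state = ('model_checkpoint_path: "%s"\n'
         'all_model_checkpoint_paths: "%s"\n' % (ckpt_name, ckpt_name))

with open(os.path.join(checkpoint_dir, "checkpoint"), "w") as f:
    f.write(state)
```

After that, `--checkpoint_path=/tmp/checkpoint_tmp` should resolve to the checkpoint without naming the file explicitly.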

@PredragBoksic

PredragBoksic commented Oct 15, 2016

Getting better, but...

Traceback (most recent call last):
  File "/home/gamma/bin/models-master/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/home/gamma/bin/models-master/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 53, in main
    vocab = vocabulary.Vocabulary(FLAGS.vocab_file)
  File "/home/gamma/bin/models-master/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/vocabulary.py", line 50, in __init__
    assert start_word in reverse_vocab
AssertionError

@cshallue
Contributor

Looks like the word_counts.txt file above is not formatted as expected:

b'a' 969108
b'</S>' 586368
b'<S>' 586368
b'.' 440479
b'on' 213612
b'of' 202290
b'the' 196219
b'in' 182598
b'with' 152984
...

vocabulary.py expects:

a 969108
</S> 586368
<S> 586368
. 440479
on 213612
of 202290
the 196219
in 182598
with 152984
...

A quick fix is to reformat the word_counts.txt in that way. Or, you could replace line 49 of vocabulary.py with

reverse_vocab = [eval(line.split()[0]) for line in reverse_vocab]

In the long run, I'll come up with a way to make sure word_counts.txt is output the same for everyone.
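Until then, a small script along these lines can do the reformatting. (The inline sample stands in for the real file; point the paths at your own copy.)

```python
import ast

# A tiny inline sample in the broken format from the thread; replace
# this with your actual word_counts.txt.
with open("word_counts.txt", "w") as f:
    f.write("b'a' 969108\nb'</S>' 586368\nb'<S>' 586368\n")

# Convert lines like  b'a' 969108  into  a 969108 .
with open("word_counts.txt") as f:
    lines = f.readlines()

with open("word_counts_fixed.txt", "w") as f:
    for line in lines:
        token, count = line.split()
        # ast.literal_eval turns the string "b'a'" into the bytes b'a',
        # which we decode back to a plain token.
        f.write("%s %s\n" % (ast.literal_eval(token).decode(), count))
```

ast.literal_eval is a safer choice than eval here, since the file contents come from an untrusted source.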

@PredragBoksic

PredragBoksic commented Oct 15, 2016

It works!

http://stablescoop.horseradionetwork.com/wp-content/uploads/2013/10/ep271.jpg

Captions for image cb340488986cc40f8ec610348b7f5a24.jpg:
  0) a woman is standing next to a horse . (p=0.000726)
  1) a woman is standing next to a horse (p=0.000638)
  2) a woman is standing next to a brown horse . (p=0.000373)

@cshallue
Contributor

@PredragBoksic great!

@psycharo , what version of python did you use to generate the word_counts.txt file?

I expect the script to output lines of the form:

a 969108
</S> 586368
<S> 586368

not:

b'a' 969108
b'</S>' 586368
b'<S>' 586368

@PredragBoksic

I didn't generate the word_counts.txt file. I changed line 49 as you suggested, to:

    # WORKAROUND for vocabulary file
    # reverse_vocab = [line.split()[0] for line in reverse_vocab]
    reverse_vocab = [eval(line.split()[0]) for line in reverse_vocab]

I have Python 2.7.12 on Kubuntu 16.04 with CUDA 8.0, cuDNN 5.1 and a GTX 970. I would not know how to do it in Python, because I usually program in Java. Do you need some code to change that file?

@cshallue
Contributor

@PredragBoksic I'm asking the creator of that file. You can just keep using the workaround :)

@psycharo-zz
Contributor

@cshallue python 3.5. I had to make a couple of dirty hacks to make it work on that version of python, which is why word_counts.txt looks different.

@PredragBoksic

@psycharo How many hours did this take to train? I think that people would appreciate what you shared more if you mentioned this.

@psycharo-zz
Contributor

@PredragBoksic
Initial training took about 2-3 days; finetuning for 1M iterations took around 5-6 days. I used a single GPU, a Tesla P100.

@ProgramItUp

@cshallue Thanks for the prompt replies. Your suggestions worked.

I was not able to follow the full execution path of the code:

Where would be the right place to put a bit of error checking to make sure that the files
--checkpoint_path, --vocab_file, --input_files exist and throw an error if they don't?

In the case of the checkpoint file it would be helpful to throw an error if "checkpoint state" is not found.
Where would this happen?

Thanks.

@cshallue
Contributor

There are already error checks for all those things.

If no checkpoint state or no checkpoint file is found in --checkpoint_path, it will fail the check here.

If --vocab_file doesn't exist it will fail the check here.

If no files match --input_files then you will get the message "Running caption generation on 0 files matching..." and inference will exit: see here.

@PredragBoksic

I did not notice any meaningful error messages, for example when the image file was missing. I suppose that this functionality will be completed in the future.

@siavashk

@cshallue: I am running the finetuning step of the optimization. What I noticed is that the loss is not changing much over the initial 22,000 steps; it is pretty much stuck at 2.40.

I have attached the log file, captured by piping stderr to a text file. Is the loss going to go down significantly in the remaining iterations? Or am I missing some "gotcha"?
log_finetune.txt

@cshallue
Contributor

cshallue commented Oct 25, 2016

@siavashk The loss reported by the training script is expected to be really noisy: it reports on single batches of only 32 examples.

Are you running the evaluation script on the validation files? We expect to see validation perplexity decreasing slowly. It decreases slowly because the model is already near optimal and because we use a smaller learning rate during finetuning.
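As far as I can tell, the perplexity the evaluation script reports is just the exponential of the average per-word cross-entropy loss (in nats), so a training loss around 2.40 corresponds to a perplexity of roughly 11. A toy sketch of the relationship:

```python
import math

def perplexity(avg_cross_entropy_loss):
    """Perplexity from an average per-word cross-entropy loss (in nats)."""
    return math.exp(avg_cross_entropy_loss)

print(perplexity(2.40))  # roughly 11.02
```

That mapping is why small, slow decreases in loss late in finetuning still show up as measurable perplexity improvements.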

@siavashk

siavashk commented Oct 25, 2016

@cshallue Maybe I am overly anxious; 22,000 steps is about 1% of the optimization. I am just worried because it has been three weeks since I started training this model, and it seems it will take another two weeks to converge.
I am not running the validation script, since the training itself is taking so long (it's been three weeks now and I am at 1 million iterations). I thought running an additional validation step would make this even longer.

@cshallue
Contributor

You won't be able to tell much from the training losses for a single batch any more. They will keep jumping around.

You could always just use the model in its current form. It will probably be sensible. There is not much improvement after 1M steps of fine tuning.

Or you could use the model shared in this thread above.

@iAInNet

iAInNet commented Sep 6, 2017

@psycharo Thanks for sharing your checkpoint! Excellent work!

@RBirkeland

RBirkeland commented Sep 7, 2017

I have successfully used the 1M model (model.ckpt-1000000). However, I'm still struggling to use the fine-tuned 2M or 3M models posted here. I've tried the solutions already discussed, but with no luck.

I'm using: TensorFlow 1.3 (for GPU), CUDA 8, cuDNN 5.1. (I have yet to try downgrading to TF 1.0; could that work?)

When using the fine-tuned 2M model, as posted by @psycharo, I get the errors discussed earlier:

NotFoundError (see above for traceback): Tensor name "lstm/basic_lstm_cell/kernel" not found in checkpoint files /home/ubuntu/im2txt/data/model.ckpt-2000000

I can fix this issue by running the following code:

OLD_CHECKPOINT_FILE = "model.ckpt-2000000"
NEW_CHECKPOINT_FILE = "model2.ckpt-2000000"

import tensorflow as tf
vars_to_rename = {
    "lstm/BasicLSTMCell/Linear/Matrix": "lstm/basic_lstm_cell/weights",
    "lstm/BasicLSTMCell/Linear/Bias": "lstm/basic_lstm_cell/biases",
}
new_checkpoint_vars = {}
reader = tf.train.NewCheckpointReader(OLD_CHECKPOINT_FILE)
for old_name in reader.get_variable_to_shape_map():
  if old_name in vars_to_rename:
    new_name = vars_to_rename[old_name]
  else:
    new_name = old_name
  new_checkpoint_vars[new_name] = tf.Variable(reader.get_tensor(old_name))

init = tf.global_variables_initializer()
saver = tf.train.Saver(new_checkpoint_vars)

with tf.Session() as sess:
  sess.run(init)
  saver.save(sess, NEW_CHECKPOINT_FILE)

However, when I try to run the evaluation using the new model2, I get the following error:

NotFoundError (see above for traceback): Key lstm/basic_lstm_cell/kernel not found in checkpoint

Here is the full stacktrace

INFO:tensorflow:Loading model from checkpoint: /home/ubuntu/im2txt/data/model2.ckpt-2000000
INFO:tensorflow:Restoring parameters from /home/ubuntu/im2txt/data/model2.ckpt-2000000
2017-09-07 14:38:17.078647: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key lstm/basic_lstm_cell/kernel not found in checkpoint
2017-09-07 14:38:17.100193: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key lstm/basic_lstm_cell/bias not found in checkpoint
Traceback (most recent call last):
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 89, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 66, in main
    restore_fn(sess)
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/inference_wrapper_base.py", line 96, in _restore_fn
    saver.restore(sess, checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1560, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key lstm/basic_lstm_cell/kernel not found in checkpoint
  [[Node: save/RestoreV2_381 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_381/tensor_names, save/RestoreV2_381/shape_and_slices)]]

Caused by op u'save/RestoreV2_381', defined at:
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 89, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 52, in main
    FLAGS.checkpoint_path)
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/inference_wrapper_base.py", line 116, in build_graph_from_config
    saver = tf.train.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1140, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1172, in build
    filename=self._filename)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 688, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 663, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key lstm/basic_lstm_cell/kernel not found in checkpoint
  [[Node: save/RestoreV2_381 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_381/tensor_names, save/RestoreV2_381/shape_and_slices)]]

Anyone have any idea how to make the fine-tuned model work?
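One possible culprit: TensorFlow 1.2 renamed the BasicLSTMCell variables again, from "weights"/"biases" to "kernel"/"bias", so on TF 1.3 a rename targeting the 1.0-era names still misses. A sketch of the mapping that would presumably be needed (target names inferred from the error message, not verified against every TF version):

```python
# Hypothetical mapping for TF 1.2+: the cell variables were renamed from
# "weights"/"biases" (TF 1.0/1.1) to "kernel"/"bias" (TF 1.2 onward).
VARS_TO_RENAME = {
    "lstm/BasicLSTMCell/Linear/Matrix": "lstm/basic_lstm_cell/kernel",
    "lstm/BasicLSTMCell/Linear/Bias": "lstm/basic_lstm_cell/bias",
}

def new_name(old_name):
    """Map an old im2txt checkpoint variable name to its TF 1.2+ name,
    leaving all other variables untouched."""
    return VARS_TO_RENAME.get(old_name, old_name)
```

Swapping this mapping into the checkpoint-rewriting script above and rerunning it against the original model.ckpt-2000000 should produce variable names that a TF 1.3 Saver can restore.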

@Giribushan

Thank you..

@RazinShaikh

RazinShaikh commented Oct 15, 2017

If someone is looking for the reformatted words_count file, here it is words_count.txt

@y734451909

y734451909 commented Oct 15, 2017 via email

@tyler-lanigan-hs

Has anyone figured out how to export the trained im2txt model as a TensorFlow SavedModelBundle to be served by TensorFlow Serving?

@yh0903

yh0903 commented Dec 9, 2017

Has anyone run into the problem of UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte?

Traceback (most recent call last):
  File "/Users/hanyu/Downloads/models-master2/research/im2txt/im2txt/run_inference.py", line 153, in <module>
    im2txt()
  File "/Users/hanyu/Downloads/models-master2/research/im2txt/im2txt/run_inference.py", line 140, in im2txt
    image = f.read()
  File "/Users/hanyu/anaconda/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 125, in read
    pywrap_tensorflow.ReadFromStream(self._read_buf, length, status))
  File "/Users/hanyu/anaconda/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 93, in _prepare_value
    return compat.as_str_any(val)
  File "/Users/hanyu/anaconda/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 106, in as_str_any
    return as_str(value)
  File "/Users/hanyu/anaconda/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 84, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

@LogicHolmes

I don't know why, but when I run the command:
bazel-bin/im2txt/train --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" --train_dir="${MODEL_DIR}/train" --train_inception=false --number_of_steps=1000000

I get an error:
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcufft.so.8.0. LD_LIBRARY_PATH:
I tensorflow/stream_executor/cuda/cuda_fft.cc:344] Unable to load cuFFT DSO.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
*** Error in `/usr/bin/python': double free or corruption (!prev): 0x000000000231f8e0 ***
I don't understand this error.

@NavneethS

NavneethS commented Dec 21, 2017

@yh0903 To solve the unicode error, make sure the file is being read in binary mode in run_inference.py :
with tf.gfile.GFile(filename, "rb") as f:
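The error arises because a JPEG begins with the byte 0xff, which is not a valid UTF-8 start byte, so a text-mode read tries and fails to decode it. A plain-Python illustration (the temp-file path and fake JPEG bytes are just for demonstration):

```python
import os
import tempfile

# A JPEG begins with the byte 0xff, which is not a valid UTF-8 start byte.
path = os.path.join(tempfile.gettempdir(), "fake.jpg")
with open(path, "wb") as f:
    f.write(b"\xff\xd8\xff\xe0")  # JPEG magic bytes

try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()                  # text mode tries to UTF-8-decode...
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False            # ...and fails on 0xff

with open(path, "rb") as f:       # binary mode returns the raw bytes
    data = f.read()
```

The same distinction applies to tf.gfile.GFile, which is why "rb" fixes it.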

@vanpersie32

@psycharo Hi, thank you for providing us with such a great model. I want to ask a question: have you noticed how performance changes when you finetune the model? Does performance (CIDEr or BLEU) increase throughout, or does it drop a little at first and then gradually increase?

@ksenyakor

ksenyakor commented Feb 11, 2018

Hi all,
I am trying to use the pretrained model shared by @psycharo. When I test the model to get the softmax output and LSTM states, I get the error: "Key lstm/logits/biases not found in checkpoint".

Tensorflow version is 1.0.1, python 2.7

This is console output:

universal@universal-ubuntu:~/anaconda3/envs/MyGAN$ python test.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: universal-ubuntu
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: universal-ubuntu
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.111.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.111 Tue Dec 19 23:51:45 PST 2017
GCC version: gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.111.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.111.0
INFO:tensorflow:Loading model from checkpoint: /home/universal/anaconda3/envs/MyGAN/im2txt/model/pre-trained/model-new-renamed.ckpt-2000000
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key lstm/logits/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key lstm/logits/weights not found in checkpoint
Traceback (most recent call last):
  File "test.py", line 185, in <module>
    restore_fn(sess)
  File "test.py", line 64, in _restore_fn
    saver.restore(sess, checkpoint_path)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1428, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key lstm/logits/biases not found in checkpoint
  [[Node: save/RestoreV2_379 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_379/tensor_names, save/RestoreV2_379/shape_and_slices)]]

Caused by op u'save/RestoreV2_379', defined at:
  File "test.py", line 173, in <module>
    restore_fn = _create_restore_fn(checkpoint_path)  # (inception_variables, inception_checkpoint_file)
  File "test.py", line 55, in _create_restore_fn
    saver = tf.train.Saver()
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1040, in __init__
    self.build()
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1070, in build
    restore_sequentially=self._restore_sequentially)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 675, in build
    restore_sequentially, reshape)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 242, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
    dtypes=dtypes, name=name)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Key lstm/logits/biases not found in checkpoint
  [[Node: save/RestoreV2_379 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_379/tensor_names, save/RestoreV2_379/shape_and_slices)]]

And this is my code for testing:


from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import os.path
import time


import numpy as np
import tensorflow as tf
import image_embedding
import image_processing
import inputs as input_ops
tf.logging.set_verbosity(tf.logging.INFO)

# Dimensions of Inception v3 input images.
image_height = 299
image_width = 299
image_format = "jpeg"
train_inception = False
embedding_size = 512
vocab_size = 12000
num_lstm_units = 512
# To match the "Show and Tell" paper we initialize all variables with a
# random uniform initializer.
# Scale used to initialize model variables.
initializer_scale = 0.08
initializer = tf.random_uniform_initializer(
        minval=-initializer_scale,
        maxval=initializer_scale)
# Collection of variables from the inception submodel.
inception_variables = []
inception_checkpoint_file="/home/universal/anaconda3/envs/MyGAN/im2txt/model/inception_v3.ckpt"
checkpoint_path="/home/universal/anaconda3/envs/MyGAN/im2txt/model/pre-trained/model-new-renamed.ckpt-2000000"


def _create_restore_fn(checkpoint_path):
    """Creates a function that restores a model from checkpoint.

    Args:
      checkpoint_path: Checkpoint file or a directory containing a checkpoint
        file.
      saver: Saver for restoring variables from the checkpoint file.

    Returns:
      restore_fn: A function such that restore_fn(sess) loads model variables
        from the checkpoint file.

    Raises:
      ValueError: If checkpoint_path does not refer to a checkpoint file or a
        directory containing a checkpoint file.
    """

    saver = tf.train.Saver()

    if tf.gfile.IsDirectory(checkpoint_path):
        checkpoint_path = tf.train.latest_checkpoint(checkpoint_path)
        if not checkpoint_path:
            raise ValueError("No checkpoint file found in: %s" % checkpoint_path)

    def _restore_fn(sess):
        tf.logging.info("Loading model from checkpoint: %s", checkpoint_path)
        saver.restore(sess, checkpoint_path)
        tf.logging.info("Successfully loaded checkpoint: %s",
                        os.path.basename(checkpoint_path))

    return _restore_fn

def process_image(encoded_image, thread_id=0):
    """Decodes and processes an image string.

    Args:
      encoded_image: A scalar string Tensor; the encoded image.
      thread_id: Preprocessing thread id used to select the ordering of color
        distortions.

    Returns:
      A float32 Tensor of shape [height, width, 3]; the processed image.
    """
    return image_processing.process_image(encoded_image,
                                      is_training=False,
                                      height=image_height,
                                      width=image_width,
                                      thread_id=thread_id,
                                      image_format=image_format)


g = tf.Graph()
with g.as_default():
    image_feed = tf.placeholder(dtype=tf.string, shape=[], name="image_feed")
    input_feed = tf.placeholder(dtype=tf.int64,
                                shape=[None],  # batch_size
                                name="input_feed")
    # Process image and insert batch dimensions.
    # build_inputs
    images = tf.expand_dims(process_image(image_feed), 0)
    input_seqs = tf.expand_dims(input_feed, 1)

    # """Builds the image model subgraph and generates image embeddings.
    inception_output = image_embedding.inception_v3(
        images,
        trainable=train_inception,
        is_training=False)
    inception_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="InceptionV3")

    # Map inception output into embedding space.
    with tf.variable_scope("image_embedding") as scope:
        image_embeddings = tf.contrib.layers.fully_connected(
            inputs=inception_output,
            num_outputs=embedding_size,
            activation_fn=None,
            weights_initializer=initializer,
            biases_initializer=None,
            scope=scope)

    # Save the embedding size in the graph.
    tf.constant(embedding_size, name="embedding_size")

    with tf.variable_scope("seq_embedding"), tf.device("/cpu:0"):
        embedding_map = tf.get_variable(
            name="map",
            shape=[vocab_size, embedding_size],
            initializer=initializer)
        seq_embeddings = tf.nn.embedding_lookup(embedding_map, input_seqs)

    # This LSTM cell has biases and outputs tanh(new_c) * sigmoid(o), but the
    # modified LSTM in the "Show and Tell" paper has no biases and outputs
    # new_c * sigmoid(o).
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(
        num_units=num_lstm_units, state_is_tuple=True)

    with tf.variable_scope("lstm", initializer=initializer) as lstm_scope:
        # Feed the image embeddings to set the initial LSTM state.
        zero_state = lstm_cell.zero_state(
            batch_size=image_embeddings.get_shape()[0], dtype=tf.float32)
        _, initial_state = lstm_cell(image_embeddings, zero_state)

        # Allow the LSTM variables to be reused.
        lstm_scope.reuse_variables()

        # In inference mode, use concatenated states for convenient feeding and
        # fetching.
        tf.concat(axis=1, values=initial_state, name="initial_state")

        # Placeholder for feeding a batch of concatenated states.
        state_feed = tf.placeholder(dtype=tf.float32,
                                    shape=[None, sum(lstm_cell.state_size)],
                                    name="state_feed")
        state_tuple = tf.split(value=state_feed, num_or_size_splits=2, axis=1)

        # Run a single LSTM step.
        lstm_outputs, state_tuple = lstm_cell(
            inputs=tf.squeeze(seq_embeddings, axis=[1]),
            state=state_tuple)

        # Concatentate the resulting state.
        tf.concat(axis=1, values=state_tuple, name="state")

        # Stack batches vertically.
        lstm_outputs = tf.reshape(lstm_outputs, [-1, lstm_cell.output_size])

        with tf.variable_scope("logits") as logits_scope:
            logits = tf.contrib.layers.fully_connected(
                inputs=lstm_outputs,
                num_outputs=vocab_size,
                activation_fn=None,
                weights_initializer=initializer,
                scope=logits_scope)

        softmax = tf.nn.softmax(logits, name="softmax")

    restore_fn = _create_restore_fn(checkpoint_path)  # (inception_variables, inception_checkpoint_file)

g.finalize()


input_files= "/media/universal/264CB8084CB7D0B3/MSCOCO/raw-data/train2014/COCO_train2014_000000000009.jpg"
filenames = []
for file_pattern in input_files.split(","):
    filenames.extend(tf.gfile.Glob(file_pattern))

with tf.Session(graph=g) as sess:
    # Load the model from checkpoint.
    restore_fn(sess)
    for filename in filenames:
        with tf.gfile.GFile(filename, "rb") as f:
            image = f.read()

            #partial_captions_list = partial_captions.extract()
            #input_feed = np.array([c.sentence[-1] for c in partial_captions_list])
            # build_inputs
            # Test feeding a batch of inputs and LSTM states to get softmax output and
            # LSTM states.
            input_feed = np.random.randint(0, 10, size=3)
            state_feed = np.random.rand(3, 1024)
            feed_dict = {"input_feed:0": input_feed, "lstm/state_feed:0": state_feed, "image_feed:0": image}

            lstm_outputs_out = sess.run([softmax, lstm_outputs], feed_dict=feed_dict)
    print(lstm_outputs_out)

What has gone wrong?
Are there any ckpt files with these variables?

When I generate captions by running the run_inference.py file, everything is OK. But I need to create my own model based on im2txt, so I want to know how it works.

Thank you in advance
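For reference, run_inference.py drives a graph like the one above with a decode loop: feed the image once to get the initial LSTM state, then repeatedly feed the previous word ("input_feed:0") and state ("lstm/state_feed:0") while fetching the softmax output and new state. A minimal pure-Python sketch of that loop, with the sess.run call abstracted into a hypothetical `step_fn`:

```python
import numpy as np

def greedy_decode(step_fn, start_id, end_id, max_len=20):
    """Greedy caption decoding.

    step_fn(word_id, state) -> (word_probs, new_state) stands in for a
    sess.run call that feeds the previous word and LSTM state and fetches
    the softmax output and new state; in the real model the initial state
    comes from feeding the image through the graph once.
    """
    words, state = [start_id], None
    for _ in range(max_len):
        probs, state = step_fn(words[-1], state)
        next_id = int(np.argmax(probs))
        if next_id == end_id:
            break
        words.append(next_id)
    return words[1:]  # drop the start token
```

The real run_inference.py uses a beam search rather than this greedy argmax, but the feed/fetch pattern per step is the same.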

@JZakraoui

Hello,

I am running the script "bazel-bin\im2txt\run_inference --checkpoint_path=${CHECKPOINT_DIR} --vocab_file=${VOCAB_FILE} --input_files=${IMAGE_FILE}" using Python 3.5.2 under Windows 7.
Python crashes with the message "Python has stopped working". Could you please advise what is wrong?
Thank you

@JZakraoui

@victoriastuart
@cshallue

I am running the script "bazel-bin\im2txt\run_inference --checkpoint_path=${CHECKPOINT_DIR} --vocab_file=${VOCAB_FILE} --input_files=${IMAGE_FILE}" using Python 3.5.2 (downloaded with Anaconda 3) under Windows 7.
Python crashes with the message "Python has stopped working". Could you please advise what is wrong?
I only need to caption some images using a pretrained model.
Thank you

@victoriastuart

victoriastuart commented May 4, 2018

@JZakraoui

  1. I work in Linux, not Windows. ;-)

  2. I don't know your level of experience, but as a general suggestion I would suggest reading up on creating and using Python virtual environments (venv) anytime you are installing and working with new software/projects. In my opinion, it will save you a lot of headaches in the long run (preserving, e.g., your system and its "base" Python installation ...).

  3. Not to be dismissive, but "Python crashes with the message 'Python has stopped working' ...", by itself, is not very helpful:

    • "My power went out! Why?" << Blown fuse? Powerline down? Hurricane? ...
    • "My stomach hurts -- why?" << Indigestion? Hunger pangs? Stress? Ulcers? ...

    Again -- as a general practice -- include the exact error message and the preceding 10 or 50 or 100 lines of code/messages (whatever concisely encapsulates the issue, in your opinion) whenever you describe a problem, plus relevant system details: operating system (as you did), programming language / environment, program versions ... anything relevant.

  4. Not to ask the obvious, but did you "Google" this issue? Although often very arcane, error messages usually indicate the precise nature of the issue, so searching on that topic leads to greater understanding of the problem.

    Again (my opinion), indicating that you tried to understand your problem and that you searched for a solution carries much weight when finally asking for help.

  5. NEVER give up! Seriously: we ALL start somewhere! Things that seem really complicated at the time often seem much less complicated in hindsight, with acquired knowledge and experience.

Just my thoughts; I do hope you sort this out! Post back here with additional detail, and perhaps someone can help. :-)

@JZakraoui

@victoriastuart thank you
@psycharo @KranthiGV @cshallue
I am running the script
bazel-bin\im2txt\run_inference --checkpoint_path=%CHECKPOINT_PATH% --vocab_file=%VOCAB_FILE% --input_files=%IMAGE_FILE%

Python 3.5.5
tensorflow 1.8.0
windows 7(64 bit), CPU
@psycharo pre-trained model

I got the following error:
tensorflow.python.framework.errors_impl.NotFoundError: Tensor name "lstm/basic_lstm_cell/bias" not found in checkpoint files C:\Users\USER\Documents\models\pretrained1\model.ckpt-2000000

Any advice? thank you

@vpaharia

vpaharia commented May 23, 2018

@JZakraoui Seems like the variable names for basic_lstm_cell were changed again. You can rename the variables as pointed out by @cshallue. I'm copying his code below; note the variable names:

OLD_CHECKPOINT_FILE = ".../model.ckpt-2000000"
NEW_CHECKPOINT_FILE = ".../model.ckpt-2000000"

import tensorflow as tf
vars_to_rename = {
    "lstm/BasicLSTMCell/Linear/Matrix": "lstm/basic_lstm_cell/kernel",
    "lstm/BasicLSTMCell/Linear/Bias": "lstm/basic_lstm_cell/bias",
}
new_checkpoint_vars = {}
reader = tf.train.NewCheckpointReader(OLD_CHECKPOINT_FILE)
for old_name in reader.get_variable_to_shape_map():
  if old_name in vars_to_rename:
    new_name = vars_to_rename[old_name]
  else:
    new_name = old_name
  new_checkpoint_vars[new_name] = tf.Variable(reader.get_tensor(old_name))

init = tf.global_variables_initializer()
saver = tf.train.Saver(new_checkpoint_vars)

with tf.Session() as sess:
  sess.run(init)
  saver.save(sess, NEW_CHECKPOINT_FILE)

It works for me with
Python 3.6.4
tensorflow 1.7.0
@psycharo's pre-trained model
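Stripped of the TensorFlow session machinery, the script above is just a key remap over the checkpoint's variable map. The core logic, sketched in pure Python with an invented variable dict standing in for the checkpoint contents:

```python
vars_to_rename = {
    "lstm/BasicLSTMCell/Linear/Matrix": "lstm/basic_lstm_cell/kernel",
    "lstm/BasicLSTMCell/Linear/Bias": "lstm/basic_lstm_cell/bias",
}

def rename_vars(old_vars, rename_map):
    """Return a new {name: tensor} dict with old LSTM variable names
    remapped; names not in the map pass through unchanged."""
    return {rename_map.get(name, name): value
            for name, value in old_vars.items()}

# Invented checkpoint contents for illustration:
old_vars = {
    "lstm/BasicLSTMCell/Linear/Matrix": "matrix-tensor",
    "lstm/BasicLSTMCell/Linear/Bias": "bias-tensor",
    "image_embedding/weights": "w-tensor",
}
new_vars = rename_vars(old_vars, vars_to_rename)
```

In the real script, `reader.get_tensor(old_name)` supplies the values and a `tf.train.Saver` writes the remapped dict back out.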

@ds2268

ds2268 commented Jun 7, 2018

Can confirm that @vpaharia's latest fix works. Steps to follow on Python 3.5.5, TF-gpu 1.8:

  • Download @psycharo's pre-trained model (e.g. fine-tuned) and word_counts.txt
  • fix the word_counts.txt parsing by replacing line 49 in vocabulary.py with:
    reverse_vocab = [eval(line.split()[0]).decode() for line in reverse_vocab]
    or use the already-fixed file provided by @RazinShaikh without changing the code
  • use the script provided by @vpaharia, replacing the checkpoint paths correctly (e.g. ./model.ckpt-2000000 if the files are in the current directory).
  • run inference, for example:
    python3 im2txt/run_inference.py --checkpoint_path=models/model.ckpt-2000000 --vocab_file=models/word_counts.txt --input_files images/image1.jpg

In my case I created a models directory where I extracted @psycharo's pre-trained model; I also put the above-mentioned script in this directory to fix the models (replacing the paths with ./model.ckpt-2000000). I hope this helps others, so that they don't have to look through all the posts :)
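The vocabulary.py fix is needed because under Python 3 each word in word_counts.txt ends up written as a bytes literal (e.g. b'a'); `eval(...).decode()` recovers the plain string. A small illustration with invented sample lines in that format:

```python
def parse_vocab_word(line):
    """Recover the plain word from a line like "b'a' 969108",
    where the word was written as a Python bytes literal."""
    return eval(line.split()[0]).decode()

# Invented sample lines (word, count) in the Python-3-written format:
sample_lines = ["b'a' 969108", "b'</S>' 586368"]
words = [parse_vocab_word(line) for line in sample_lines]  # -> ['a', '</S>']
```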

@Gharibim

Gharibim commented Jul 16, 2018

@cshallue Thank you so much for your help!

Here is a 5000000 step model using TF 1.9:
https://github.com/Gharibim/Tensorflow_im2txt_5M_Step

@coliinkc

Hey, thank you for the checkpoint files! I was wondering if anyone has managed to use one of them to fine-tune the model with new data. What would the word counts file need to look like? Does the newly created word counts file from the new dataset need to be merged with the one from MSCOCO?

I am currently running the fine-tuning with a merged word counts file but encountering two problems:

1.) the captions after 20,000 steps just consist of the same word repeated over and over, despite a very small loss of 0.2
1) day day day day day day day day day day day day . <S> . <S> <S> . (p=0.011221)

2.) I let the model fine-tune overnight and somehow only the last 5 checkpoints got saved. Does anyone know how to prevent the overwriting of checkpoints and keep all of them?

Thank you in advance!

@sandipan

sandipan commented Dec 4, 2019

> (quoting @vpaharia's rename script above, which works with Python 3.6.4, tensorflow 1.7.0, and @psycharo's pre-trained model)

With this I could load the checkpoint file too. Note the variable names/values in the checkpoint file model.ckpt-2000000:

tensor_name: lstm/BasicLSTMCell/Linear/Bias
[-0.89432126 -0.34625703 0.16128121 ... 0.48277333 -0.5986251 1.2891939 ]
tensor_name: lstm/BasicLSTMCell/Linear/Matrix
[[ 0.16781631 -0.04221911 0.24709763 ... 0.04963883 -0.08704979 0.03227773]

@idmontie

idmontie commented May 8, 2020

The captions after 20,000 steps just consist of the same word repeated over and over, despite a very small loss of 0.2

I've noticed that a small number of steps generates pretty bad results. I get one-word responses and I'm around 366906 steps. I'm going to continue running and see how the results improve.

I additionally had to follow the comment here: #7204 (comment)

@Rawan19

Rawan19 commented May 31, 2020

The captions after 20,000 steps just consist of the same word repeated over and over, despite a very small loss of 0.2

I've noticed that a small number of steps generates pretty bad results. I get one-word responses and I'm around 366906 steps. I'm going to continue running and see how the results improve.

I additionally had to follow the comment here: #7204 (comment)

Hey! Can you please tell us if the results improved, and how many steps it took? I would be really grateful if you could share your log file as well! Thank you in advance!

@szhang12345

Hi,
I tried to run from cmd

python im2txt/run_inference.py --checkpoint_path="im2txt/model/train/newmodel.ckpt-2000000" --vocab_file="im2txt/data

but got the error below:
Traceback (most recent call last):
File "im2txt/run_inference.py", line 27, in <module>
from im2txt import configuration
ModuleNotFoundError: No module named 'im2txt'

Can anyone help?

@fontes99

got the same issue

@Vishwanath-Ayyappan

HI, I tried to run from cmd

python im2txt/run_inference.py --checkpoint_path="im2txt/model/train/newmodel.ckpt-2000000" --vocab_file="im2txt/data

but get the error below Traceback (most recent call last): File "im2txt/run_inference.py", line 27, in from im2txt import configuration ModuleNotFoundError: No module named 'im2txt'

can anyone help?

Resolved?

@TamaraAtanasoska

Hi @szhang12345, @fontes99 and @Vishwanath-Ayyappan, where are you running the code? If you are running it locally as a package, you need to add the directory to the PYTHONPATH like this: export PYTHONPATH="${PYTHONPATH}:/home/users/thenameofyouruser/anotherlocation/Im2txt/im2txt". I suppose this is not very relevant at the moment, but maybe someone else can make use of it in the future.
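The same effect as the PYTHONPATH export can also be had from inside Python, before the im2txt imports run (the repo path below is hypothetical; point it at the directory containing the im2txt package in your own clone):

```python
import os
import sys

# Hypothetical checkout location; adjust to the directory that
# contains the im2txt package in your clone.
repo_root = os.path.expanduser("~/Im2txt")

# Equivalent of export PYTHONPATH="${PYTHONPATH}:...":
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)
```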
