Pretrained model for img2txt? #466

Closed
ludazhao opened this Issue Sep 28, 2016 · 107 comments

Comments

ludazhao commented Sep 28, 2016

Please let us know which model this issue is about (specify the top-level directory)

models/im2txt

Can someone release a pre-trained model for the im2txt model trained on COCO? It would be great for those here who don't yet have the computational resources to do a full training run. Thanks!

concretevitamin commented Sep 28, 2016

@cshallue: could you comment on this? Thanks.

siavashk commented Sep 30, 2016

+1

cshallue commented Sep 30, 2016

Sorry, we're not releasing a pre-trained version of this model at this time.

psycharo commented Oct 5, 2016

here are links to a pre-trained model:

cshallue commented Oct 11, 2016

@psycharo thanks for sharing! Perhaps you could also share your word_counts.txt file. Different versions of the tokenizer can yield different results, so your model is specific to the word_counts.txt file that you used.

siavashk commented Oct 12, 2016

@psycharo my model is still training on our GPU instance, and it looks like it will take another two weeks to finish. I would appreciate it if you would also release the fine-tuned model.

ProgramItUp commented Oct 15, 2016

@psycharo Thanks for sharing your checkpoint!

When I try to use it I'm getting the error: "ValueError: No checkpoint file found in: None".
I don't have any trouble running run_inference on my own checkpoint files, but I can't run it on yours. I've tried lots of things: adding a trailing "/", using absolute paths, relative paths, ... Nothing seems to work.

Suggestions welcomed.
@cshallue - Any thoughts?

Thanks all.

user123@myhost:~$ ls -l /tmp/checkpoint_tmp/
total 175356
-rw-r--r-- 1 user123 user123  19629588 Oct 15 07:04 graph.pbtxt
-rw-r--r-- 1 user123 user123 149088120 Oct 15 07:04 model.ckpt-2000000
-rw-r--r-- 1 user123 user123  10675545 Oct 15 07:04 model.ckpt-2000000.meta
-rw-rw-r-- 1 user123 user123    156438 Oct 15 07:08 word_counts.txt
user123@myhost:~$  /data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference   --checkpoint_path=/tmp/checkpoint_tmp   --vocab_file=/tmp/checkpoint_tmp/word_counts.txt   --input_files=${IMAGE_FILE}
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
  File "/data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 49, in main
    FLAGS.checkpoint_path)
  File "/data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/inference_wrapper_base.py", line 118, in build_graph_from_config
    return self._create_restore_fn(checkpoint_path, saver)
  File "/data/home/user123/tensorflow_models/models/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/inference_wrapper_base.py", line 92, in _create_restore_fn
    raise ValueError("No checkpoint file found in: %s" % checkpoint_path)
ValueError: No checkpoint file found in: None
user123@myhost:~$

cshallue commented Oct 15, 2016

@ProgramItUp Try the following: --checkpoint_path=/tmp/checkpoint_tmp/model.ckpt-2000000

When you pass a directory, it looks for a "checkpoint state" file in that directory, which is an index of all checkpoints in the directory. Your directory doesn't have a checkpoint state file, but you can just pass it the explicit filename.
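
If you'd rather keep passing the directory, here is a minimal sketch of writing the missing checkpoint state file with tf.train.update_checkpoint_state (untested here; the paths are taken from the listing above):

import tensorflow as tf

# Paths taken from the directory listing above.
checkpoint_dir = "/tmp/checkpoint_tmp"
checkpoint_file = "/tmp/checkpoint_tmp/model.ckpt-2000000"

# Writes a "checkpoint" state file into the directory so that passing
# --checkpoint_path=/tmp/checkpoint_tmp works again; tf.train.latest_checkpoint
# can then resolve the newest checkpoint from the directory.
tf.train.update_checkpoint_state(checkpoint_dir, checkpoint_file)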

PredragBoksic commented Oct 15, 2016

Getting better, but...

Traceback (most recent call last):
  File "/home/gamma/bin/models-master/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/home/gamma/bin/models-master/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 53, in main
    vocab = vocabulary.Vocabulary(FLAGS.vocab_file)
  File "/home/gamma/bin/models-master/im2txt/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/vocabulary.py", line 50, in __init__
    assert start_word in reverse_vocab
AssertionError

cshallue commented Oct 15, 2016

Looks like the word_counts.txt file above is not formatted as expected:

b'a' 969108
b'</S>' 586368
b'<S>' 586368
b'.' 440479
b'on' 213612
b'of' 202290
b'the' 196219
b'in' 182598
b'with' 152984
...

vocabulary.py expects:

a 969108
</S> 586368
<S> 586368
. 440479
on 213612
of 202290
the 196219
in 182598
with 152984
...

A quick fix is to reformat the word_counts.txt in that way. Or, you could replace line 49 of vocabulary.py with

reverse_vocab = [eval(line.split()[0]) for line in reverse_vocab]

In the long run, I'll come up with a way to make sure word_counts.txt is output the same way for everyone.
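
If it's easier, here is a minimal sketch of reformatting the shared word_counts.txt itself, assuming every token is a Python bytes literal such as b'a' (ast.literal_eval is used rather than eval for safety; the filenames are placeholders):

import ast

with open("word_counts.txt") as f_in, open("word_counts_fixed.txt", "w") as f_out:
  for line in f_in:
    token, count = line.split()
    # Turn the bytes literal (e.g. b'</S>') back into a plain string.
    word = ast.literal_eval(token).decode("utf-8")
    f_out.write("%s %s\n" % (word, count))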

PredragBoksic commented Oct 15, 2016

It works!

http://stablescoop.horseradionetwork.com/wp-content/uploads/2013/10/ep271.jpg

Captions for image cb340488986cc40f8ec610348b7f5a24.jpg:
  0) a woman is standing next to a horse . (p=0.000726)
  1) a woman is standing next to a horse (p=0.000638)
  2) a woman is standing next to a brown horse . (p=0.000373)

cshallue commented Oct 15, 2016

@PredragBoksic great!

@psycharo , what version of python did you use to generate the word_counts.txt file?

I expect the script to output lines of the form:

a 969108
</S> 586368
<S> 586368

not:

b'a' 969108
b'</S>' 586368
b'<S>' 586368

PredragBoksic commented Oct 15, 2016

I didn't generate the word_counts.txt file. I changed line 49 as you suggested, to:

    """ WORKAROUND for vocabulary file """
    """reverse_vocab = [line.split()[0] for line in reverse_vocab]"""
    reverse_vocab = [eval(line.split()[0]) for line in reverse_vocab]

I have Python 2.7.12 on Kubuntu 16.04 with CUDA 8.0, cuDNN 5.1, and a GTX 970. I wouldn't know how to do it in Python, because I usually program in Java. Do you need some code to change that file?

cshallue commented Oct 15, 2016

@PredragBoksic I'm asking the creator of that file. You can just keep using the workaround :)

psycharo commented Oct 15, 2016

@cshallue Python 3.5. I had to make a couple of dirty hacks to make it work on that version of Python, which is why word_counts.txt looks different.

PredragBoksic commented Oct 16, 2016

@psycharo How many hours did this take to train? I think that people would appreciate what you shared more if you mentioned this.

psycharo commented Oct 16, 2016

@PredragBoksic
Initial training took about 2-3 days; finetuning for 1M iterations took around 5-6 days. I used a single GPU, a Tesla P100.

ProgramItUp commented Oct 16, 2016

@cshallue Thanks for the prompt replies. Your suggestions worked.

I was not able to follow the full execution path of the code:

Where would be the right place to put a bit of error checking to make sure that the files
--checkpoint_path, --vocab_file, --input_files exist and throw an error if they don't?

In the case of the checkpoint file it would be helpful to throw an error if "checkpoint state" is not found.
Where would this happen?

Thanks.

cshallue commented Oct 16, 2016

There are already error checks for all those things.

If no checkpoint state or no checkpoint file is found in --checkpoint_path, it will fail the check here.

If --vocab_file doesn't exist it will fail the check here.

If no files match --input_files then you will get the message "Running caption generation on 0 files matching..." and inference will exit: see here.

PredragBoksic commented Oct 16, 2016

I did not notice any meaningful error messages, for example when the image file was missing. I suppose that this functionality will be completed in the future.

siavashk commented Oct 25, 2016

@cshallue: I am running the finetuning step of the optimization. What I noticed was that the loss function is not changing much for the initial 22000 steps. The loss is pretty much stuck at 2.40.

I have attached the log file, captured by piping stderr to a text file. Is the loss going to go down significantly in the remaining iterations? Or am I missing some "gotcha"?
log_finetune.txt

cshallue commented Oct 25, 2016

@siavashk The loss reported by the training script is expected to be really noisy: it reports on single batches of only 32 examples.

Are you running the evaluation script on the validation files? We expect to see validation perplexity decreasing slowly. It decreases slowly because the model is already near optimal and because we use a smaller learning rate during finetuning.

siavashk commented Oct 25, 2016

@cshallue Maybe I am overly anxious; 22000 steps is about 1% of the optimization. I am just worried that it has been three weeks since I started training this model, and it seems it is going to take another two weeks for it to converge.
I am not running the validation script, since the training itself is taking too long (it's been three weeks now and I am at 1 million iterations). I thought running an additional validation step would make this even longer.

cshallue commented Oct 25, 2016

You won't be able to tell much from the training losses for a single batch any more. They will keep jumping around.

You could always just use the model in its current form. It will probably be sensible. There is not much improvement after 1M steps of fine tuning.

Or you could use the model shared in this thread above.

pcnfernando commented Apr 30, 2017

When upgrading the checkpoint file for compatibility with TF 1.0 using the code @cshallue posted above, use relative paths. Absolute paths give an error at tf.train.NewCheckpointReader(OLD_CHECKPOINT_FILE):

OLD_CHECKPOINT_FILE = "./model.ckpt-2000000"
NEW_CHECKPOINT_FILE = "./model.ckpt-2000000"

withyou1771 commented May 11, 2017

I trained on TF 1.0.1 and Python 2.7, without finetuning.
https://github.com/withyou1771/im2txt

Captions for image 1.jpg:
  0) a cat laying on top of a grass covered field . (p=0.002806)
  1) a black and white cat laying on top of a grass covered field . (p=0.000498)
  2) a black and white cat laying on top of a green field . (p=0.000412)

KranthiGV commented May 13, 2017

I have released a version trained on the latest TF 1.0 on a GPU.
It has both a 1M-step checkpoint without finetuning and a 2M-step checkpoint with finetuning.
https://github.com/KranthiGV/Pretrained-Show-and-Tell-model
Open an issue on the repository page or email me at kranthi.gv@gmail.com in case you have a problem setting it up.
Thank you!

LEAAN commented May 29, 2017

@begongyal The latest version of TensorFlow creates three files per checkpoint by default. Do not delete or remove anything in your train_dir. By setting:

tf.flags.DEFINE_string("train_dir", "YOUR TRAIN DIR (seems to be ~/im2txt/model/train/)",
                       "Directory for saving and loading model checkpoints.")

in im2txt/train.py, I managed to get rid of the error.

pvthuy commented Aug 10, 2017

Hi, does anyone know how to convert the above pre-trained models to protobuf models (.pb)?
I want to use them with TensorFlow Mobile.

Also, I need some information (because I did not train the models) as follows:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/ios/camera/CameraExampleViewController.mm

From line 37 to line 44:

// These dimensions need to match those the model was trained with.
const int wanted_input_width = 224;
const int wanted_input_height = 224;
const int wanted_input_channels = 3;
const float input_mean = 117.0f;
const float input_std = 1.0f;
const std::string input_layer_name = "input";
const std::string output_layer_name = "softmax1";

Thanks in advance!
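
For reference, a minimal, untested sketch of one common approach: build the im2txt inference graph, restore the checkpoint, and freeze it with tf.graph_util.convert_variables_to_constants. The paths and output node names below are assumptions to verify against your graph (note also that im2txt feeds Inception v3 at 299x299, not 224x224):

import tensorflow as tf
from im2txt import configuration
from im2txt import inference_wrapper

checkpoint_path = "model.ckpt-2000000"  # assumed path to one of the shared checkpoints
output_pb = "im2txt_frozen.pb"

g = tf.Graph()
with g.as_default():
  model = inference_wrapper.InferenceWrapper()
  restore_fn = model.build_graph_from_config(configuration.ModelConfig(),
                                             checkpoint_path)

with tf.Session(graph=g) as sess:
  restore_fn(sess)
  # run_inference.py reads the "softmax", "lstm/initial_state" and "lstm/state"
  # tensors, so those are used as the output nodes here.
  frozen_graph_def = tf.graph_util.convert_variables_to_constants(
      sess, g.as_graph_def(), ["softmax", "lstm/initial_state", "lstm/state"])

with tf.gfile.GFile(output_pb, "wb") as f:
  f.write(frozen_graph_def.SerializeToString())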

iAInNet commented Sep 6, 2017

@psycharo Thanks for sharing your checkpoint! Excellent work!

RBirkeland commented Sep 7, 2017

I have successfully used the 1M model (model.ckpt-1000000). However, I'm still struggling to use the fine-tuned 2M or 3M models posted here. I've tried the solutions already discussed, but with no luck.

I'm using TensorFlow 1.3 (GPU), CUDA 8, and cuDNN 5.1. (I have yet to try downgrading to TF 1.0; could that work?)

When using the fine-tuned 2M model posted by @psycharo, I get the errors discussed earlier:

NotFoundError (see above for traceback): Tensor name "lstm/basic_lstm_cell/kernel" not found in checkpoint files /home/ubuntu/im2txt/data/model.ckpt-2000000

I can fix this issue by running the following code:

OLD_CHECKPOINT_FILE = "model.ckpt-2000000"
NEW_CHECKPOINT_FILE = "model2.ckpt-2000000"

import tensorflow as tf
vars_to_rename = {
    "lstm/BasicLSTMCell/Linear/Matrix": "lstm/basic_lstm_cell/weights",
    "lstm/BasicLSTMCell/Linear/Bias": "lstm/basic_lstm_cell/biases",
}
new_checkpoint_vars = {}
reader = tf.train.NewCheckpointReader(OLD_CHECKPOINT_FILE)
for old_name in reader.get_variable_to_shape_map():
  if old_name in vars_to_rename:
    new_name = vars_to_rename[old_name]
  else:
    new_name = old_name
  new_checkpoint_vars[new_name] = tf.Variable(reader.get_tensor(old_name))

init = tf.global_variables_initializer()
saver = tf.train.Saver(new_checkpoint_vars)

with tf.Session() as sess:
  sess.run(init)
  saver.save(sess, NEW_CHECKPOINT_FILE)

However, when I try to run the evaluation using the new model2, I get the following error:

NotFoundError (see above for traceback): Key lstm/basic_lstm_cell/kernel not found in checkpoint

Here is the full stacktrace

INFO:tensorflow:Loading model from checkpoint: /home/ubuntu/im2txt/data/model2.ckpt-2000000
INFO:tensorflow:Restoring parameters from /home/ubuntu/im2txt/data/model2.ckpt-2000000
2017-09-07 14:38:17.078647: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key lstm/basic_lstm_cell/kernel not found in checkpoint
2017-09-07 14:38:17.100193: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key lstm/basic_lstm_cell/bias not found in checkpoint
Traceback (most recent call last):
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 89, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 66, in main
    restore_fn(sess)
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/inference_wrapper_base.py", line 96, in _restore_fn
    saver.restore(sess, checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1560, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key lstm/basic_lstm_cell/kernel not found in checkpoint
  [[Node: save/RestoreV2_381 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_381/tensor_names, save/RestoreV2_381/shape_and_slices)]]

Caused by op u'save/RestoreV2_381', defined at:
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 89, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/run_inference.py", line 52, in main
    FLAGS.checkpoint_path)
  File "/home/ubuntu/bazel-bin/im2txt/run_inference.runfiles/im2txt/im2txt/inference_utils/inference_wrapper_base.py", line 116, in build_graph_from_config
    saver = tf.train.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1140, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1172, in build
    filename=self._filename)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 688, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 663, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key lstm/basic_lstm_cell/kernel not found in checkpoint
  [[Node: save/RestoreV2_381 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_381/tensor_names, save/RestoreV2_381/shape_and_slices)]]

Anyone have any idea how to make the fine-tuned model work?

Giribushan commented Oct 8, 2017

Thank you.

RazinShaikh commented Oct 15, 2017

If someone is looking for the reformatted word counts file, here it is: words_count.txt

tyler-lanigan-hs commented Dec 5, 2017

Has anyone figured out how to export the im2txt trained model as a TensorFlow SavedModelBundle to be served by Tensorflow Serving?

yh0903 commented Dec 9, 2017

Has anyone met the problem of UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte?

Traceback (most recent call last):
  File "/Users/hanyu/Downloads/models-master2/research/im2txt/im2txt/run_inference.py", line 153, in <module>
    im2txt()
  File "/Users/hanyu/Downloads/models-master2/research/im2txt/im2txt/run_inference.py", line 140, in im2txt
    image = f.read()
  File "/Users/hanyu/anaconda/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 125, in read
    pywrap_tensorflow.ReadFromStream(self._read_buf, length, status))
  File "/Users/hanyu/anaconda/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 93, in _prepare_value
    return compat.as_str_any(val)
  File "/Users/hanyu/anaconda/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 106, in as_str_any
    return as_str(value)
  File "/Users/hanyu/anaconda/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 84, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

LogicHolmes commented Dec 10, 2017

I don't know why, but when I run the command:
bazel-bin/im2txt/train --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" --train_dir="${MODEL_DIR}/train" --train_inception=false --number_of_steps=1000000

I get this error:
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcufft.so.8.0. LD_LIBRARY_PATH:
I tensorflow/stream_executor/cuda/cuda_fft.cc:344] Unable to load cuFFT DSO.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
*** Error in `/usr/bin/python': double free or corruption (!prev): 0x000000000231f8e0 ***
I don't know what this error means.

NavneethS commented Dec 21, 2017

@yh0903 To solve the unicode error, make sure the file is being read in binary mode in run_inference.py:
with tf.gfile.GFile(filename, "rb") as f:
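
For context, a tiny self-contained sketch of why the mode matters (the image path is a placeholder):

import tensorflow as tf

filename = "images/image1.jpg"  # any local JPEG

# "rb" returns the raw bytes; the default text mode tries to decode UTF-8 and
# fails on the JPEG header byte 0xff, which is exactly the error above.
with tf.gfile.GFile(filename, "rb") as f:
  image = f.read()

print("read %d bytes" % len(image))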

vanpersie32 commented Dec 25, 2017

@psycharo Hi, thank you for providing us with such a great model. I want to ask you a question: have you noticed how the performance changes when you finetune the model? Does the model's performance increase throughout, or does the performance (CIDEr or BLEU) first drop a little and then gradually increase?

ksenyakor commented Feb 11, 2018

Hi all,
I am trying to use the pretrained models by @psycharo. When I test the model to get the softmax output and LSTM states, I get the error: "Key lstm/logits/biases not found in checkpoint".

TensorFlow version is 1.0.1, Python 2.7

This is console output:

universal@universal-ubuntu:~/anaconda3/envs/MyGAN$ python test.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: universal-ubuntu
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: universal-ubuntu
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.111.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.111 Tue Dec 19 23:51:45 PST 2017
GCC version: gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.111.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.111.0
INFO:tensorflow:Loading model from checkpoint: /home/universal/anaconda3/envs/MyGAN/im2txt/model/pre-trained/model-new-renamed.ckpt-2000000
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key lstm/logits/biases not found in checkpoint
W tensorflow/core/framework/op_kernel.cc:993] Not found: Key lstm/logits/weights not found in checkpoint
Traceback (most recent call last):
  File "test.py", line 185, in <module>
    restore_fn(sess)
  File "test.py", line 64, in _restore_fn
    saver.restore(sess, checkpoint_path)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1428, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key lstm/logits/biases not found in checkpoint
  [[Node: save/RestoreV2_379 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_379/tensor_names, save/RestoreV2_379/shape_and_slices)]]

Caused by op u'save/RestoreV2_379', defined at:
  File "test.py", line 173, in <module>
    restore_fn = _create_restore_fn(checkpoint_path)  # (inception_variables, inception_checkpoint_file)
  File "test.py", line 55, in _create_restore_fn
    saver = tf.train.Saver()
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1040, in __init__
    self.build()
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1070, in build
    restore_sequentially=self._restore_sequentially)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 675, in build
    restore_sequentially, reshape)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 242, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
    dtypes=dtypes, name=name)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/universal/anaconda3/envs/MyGAN/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Key lstm/logits/biases not found in checkpoint
  [[Node: save/RestoreV2_379 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_379/tensor_names, save/RestoreV2_379/shape_and_slices)]]

And this is my code for testing:


from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import os.path
import time


import numpy as np
import tensorflow as tf
import image_embedding
import image_processing
import inputs as input_ops
tf.logging.set_verbosity(tf.logging.INFO)

        # Dimensions of Inception v3 input images.
image_height = 299
image_width = 299
image_format = "jpeg"
train_inception=False
embedding_size = 512
vocab_size = 12000
num_lstm_units = 512
# To match the "Show and Tell" paper we initialize all variables with a
# random uniform initializer.
    # Scale used to initialize model variables.
initializer_scale = 0.08
initializer = tf.random_uniform_initializer(
        minval=-initializer_scale,
        maxval=initializer_scale)
    # Collection of variables from the inception submodel.
inception_variables = []
inception_checkpoint_file="/home/universal/anaconda3/envs/MyGAN/im2txt/model/inception_v3.ckpt"
checkpoint_path="/home/universal/anaconda3/envs/MyGAN/im2txt/model/pre-trained/model-new-renamed.ckpt-2000000"


def _create_restore_fn(checkpoint_path):
    """Creates a function that restores a model from checkpoint.

    Args:
      checkpoint_path: Checkpoint file or a directory containing a checkpoint
        file.
      saver: Saver for restoring variables from the checkpoint file.

    Returns:
      restore_fn: A function such that restore_fn(sess) loads model variables
        from the checkpoint file.

    Raises:
      ValueError: If checkpoint_path does not refer to a checkpoint file or a
        directory containing a checkpoint file.
    """

    saver = tf.train.Saver()

    if tf.gfile.IsDirectory(checkpoint_path):
        checkpoint_path = tf.train.latest_checkpoint(checkpoint_path)
        if not checkpoint_path:
            raise ValueError("No checkpoint file found in: %s" % checkpoint_path)

    def _restore_fn(sess):
        tf.logging.info("Loading model from checkpoint: %s", checkpoint_path)
        saver.restore(sess, checkpoint_path)
        tf.logging.info("Successfully loaded checkpoint: %s",
                        os.path.basename(checkpoint_path))

    return _restore_fn

def process_image(encoded_image, thread_id=0):
    """Decodes and processes an image string.

    Args:
      encoded_image: A scalar string Tensor; the encoded image.
      thread_id: Preprocessing thread id used to select the ordering of color
        distortions.

    Returns:
      A float32 Tensor of shape [height, width, 3]; the processed image.
    """
    return image_processing.process_image(encoded_image,
                                      is_training=False,
                                      height=image_height,
                                      width=image_width,
                                      thread_id=thread_id,
                                      image_format=image_format)


g = tf.Graph()
with g.as_default():
    image_feed = tf.placeholder(dtype=tf.string, shape=[], name="image_feed")
    input_feed = tf.placeholder(dtype=tf.int64,
                                shape=[None],  # batch_size
                                name="input_feed")
    # Process image and insert batch dimensions.
    # build_inputs
    images = tf.expand_dims(process_image(image_feed), 0)
    input_seqs = tf.expand_dims(input_feed, 1)

    # """Builds the image model subgraph and generates image embeddings.
    inception_output = image_embedding.inception_v3(
        images,
        trainable=train_inception,
        is_training=False)
    inception_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="InceptionV3")

    # Map inception output into embedding space.
    with tf.variable_scope("image_embedding") as scope:
        image_embeddings = tf.contrib.layers.fully_connected(
            inputs=inception_output,
            num_outputs=embedding_size,
            activation_fn=None,
            weights_initializer=initializer,
            biases_initializer=None,
            scope=scope)

    # Save the embedding size in the graph.
    tf.constant(embedding_size, name="embedding_size")

    with tf.variable_scope("seq_embedding"), tf.device("/cpu:0"):
        embedding_map = tf.get_variable(
            name="map",
            shape=[vocab_size, embedding_size],
            initializer=initializer)
        seq_embeddings = tf.nn.embedding_lookup(embedding_map, input_seqs)

    # This LSTM cell has biases and outputs tanh(new_c) * sigmoid(o), but the
    # modified LSTM in the "Show and Tell" paper has no biases and outputs
    # new_c * sigmoid(o).
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(
        num_units=num_lstm_units, state_is_tuple=True)

    with tf.variable_scope("lstm", initializer=initializer) as lstm_scope:
        # Feed the image embeddings to set the initial LSTM state.
        zero_state = lstm_cell.zero_state(
            batch_size=image_embeddings.get_shape()[0], dtype=tf.float32)
        _, initial_state = lstm_cell(image_embeddings, zero_state)

        # Allow the LSTM variables to be reused.
        lstm_scope.reuse_variables()

        # In inference mode, use concatenated states for convenient feeding and
        # fetching.
        tf.concat(axis=1, values=initial_state, name="initial_state")

        # Placeholder for feeding a batch of concatenated states.
        state_feed = tf.placeholder(dtype=tf.float32,
                                    shape=[None, sum(lstm_cell.state_size)],
                                    name="state_feed")
        state_tuple = tf.split(value=state_feed, num_or_size_splits=2, axis=1)

        # Run a single LSTM step.
        lstm_outputs, state_tuple = lstm_cell(
            inputs=tf.squeeze(seq_embeddings, axis=[1]),
            state=state_tuple)

        # Concatentate the resulting state.
        tf.concat(axis=1, values=state_tuple, name="state")

        # Stack batches vertically.
        lstm_outputs = tf.reshape(lstm_outputs, [-1, lstm_cell.output_size])

        with tf.variable_scope("logits") as logits_scope:
            logits = tf.contrib.layers.fully_connected(
                inputs=lstm_outputs,
                num_outputs=vocab_size,
                activation_fn=None,
                weights_initializer=initializer,
                scope=logits_scope)

        tf.nn.softmax(logits, name="softmax")

    restore_fn = _create_restore_fn(checkpoint_path)  # (inception_variables, inception_checkpoint_file)

g.finalize()


input_files= "/media/universal/264CB8084CB7D0B3/MSCOCO/raw-data/train2014/COCO_train2014_000000000009.jpg"
filenames = []
for file_pattern in input_files.split(","):
    filenames.extend(tf.gfile.Glob(file_pattern))

with tf.Session(graph=g) as sess:
    # Load the model from checkpoint.
    restore_fn(sess)
    for filename in filenames:
        with tf.gfile.GFile(filename, "rb") as f:
            image = f.read()

            #partial_captions_list = partial_captions.extract()
            #input_feed = np.array([c.sentence[-1] for c in partial_captions_list])
            # build_inputs
            # Test feeding a batch of inputs and LSTM states to get softmax output and
            # LSTM states.
            input_feed = np.random.randint(0, 10, size=3)
            state_feed = np.random.rand(3, 1024)
            feed_dict = {"input_feed:0": input_feed, "lstm/state_feed:0": state_feed, "image_feed:0": image}

            lstm_outputs_out = sess.run([softmax, lstm_outputs], feed_dict=feed_dict)
    print(lstm_outputs_out)
    """"""


What has gone wrong?
Are there any ckpt files that contain these variables?

When I generate captions by running the run_inference.py file, everything is OK. But I need to create my own model based on im2txt, so I want to know how it works.

Thank you in advance

JZakraoui commented May 4, 2018

Hello,

I am running the script "bazel-bin\im2txt\run_inference --checkpoint_path=${CHECKPOINT_DIR} --vocab_file=${VOCAB_FILE} --input_files=${IMAGE_FILE}" using Python 3.5.2 under Windows 7.
Python crashes with the message "Python has stopped working"; could you please advise what is wrong?
Thank you

JZakraoui commented May 4, 2018

@victoriastuart
@cshallue

I am running the script "bazel-bin\im2txt\run_inference --checkpoint_path=${CHECKPOINT_DIR} --vocab_file=${VOCAB_FILE} --input_files=${IMAGE_FILE}" using Python 3.5.2 (downloaded with Anaconda 3) under Windows 7.
Python crashes with the message "Python has stopped working"; could you please advise what is wrong?
I only need to caption some images using a pretrained model.
Thank you

victoriastuart commented May 4, 2018

@JZakraoui

  1. I work in Linux, not Windows. ;-)

  2. I don't know your level of experience, but as a general suggestion I would recommend reading up on creating and using Python virtual environments (venv) any time you are installing and working with new software/projects. In my opinion, it will save you a lot of headaches in the long run (preserving, e.g., your system and its "base" Python installation) ...

  3. Not to be dismissive, but "Python crashes with the message 'Python has stopped working' ...", by itself, is not very helpful:

    • "My power went out! Why?" << Blown fuse? Powerline down? Hurricane? ...
    • "My stomach hurts -- why?" << Indigestion? Hunger pangs? Stress? Ulcers? ...

    Again -- as a general practice -- include the exact error message, and the preceding 10 or 50 or 100 lines of code/messages (whatever concisely encapsulates the issue, in your opinion) whenever you describe a problem plus relevant system details: operating system (as you did), programming language / environment, program versions ... anything relevant.

  4. Not to ask the obvious, but did you "Google" this issue? Although often very arcane, error messages usually indicate the precise nature of the issue, so searching on that topic leads to a greater understanding of the problem.

    Again (my opinion), indicating that you tried to understand your problem and that you searched for a solution carries much weight, when finally asking for help.

  5. NEVER give up! Seriously: we ALL start somewhere! Things that seem really complicated at the time often seem much less complicated in hindsight, with acquired knowledge and experience.

Just my thoughts; I do hope you sort this out! Post back here with additional detail, and perhaps someone can help. :-)

JZakraoui commented May 9, 2018

@victoriastuart thank you
@psycharo @KranthiGV @cshallue
I am running the script
bazel-bin\im2txt\run_inference --checkpoint_path=%CHECKPOINT_PATH% --vocab_file=%VOCAB_FILE% --input_files=%IMAGE_FILE%

Python 3.5.5
tensorflow 1.8.0
windows 7(64 bit), CPU
@psycharo pre-trained model

I got the following error:
tensorflow.python.framework.errors_impl.NotFoundError: Tensor name "lstm/basic_lstm_cell/bias" not found in checkpoint files C:\Users\USER\Documents\models\pretrained1\model.ckpt-2000000

Any advice? thank you

vpaharia commented May 23, 2018

@JZakraoui It seems like the variable names for basic_lstm_cell were changed again. You can rename the variables as pointed out by @cshallue. I'm copying his code below; however, notice the variable names:

OLD_CHECKPOINT_FILE = ".../model.ckpt-2000000"
NEW_CHECKPOINT_FILE = ".../model.ckpt-2000000"

import tensorflow as tf
vars_to_rename = {
    "lstm/BasicLSTMCell/Linear/Matrix": "lstm/basic_lstm_cell/kernel",
    "lstm/BasicLSTMCell/Linear/Bias": "lstm/basic_lstm_cell/bias",
}
new_checkpoint_vars = {}
reader = tf.train.NewCheckpointReader(OLD_CHECKPOINT_FILE)
for old_name in reader.get_variable_to_shape_map():
  if old_name in vars_to_rename:
    new_name = vars_to_rename[old_name]
  else:
    new_name = old_name
  new_checkpoint_vars[new_name] = tf.Variable(reader.get_tensor(old_name))

init = tf.global_variables_initializer()
saver = tf.train.Saver(new_checkpoint_vars)

with tf.Session() as sess:
  sess.run(init)
  saver.save(sess, NEW_CHECKPOINT_FILE)

It works for me with
Python 3.6.4
tensorflow 1.7.0
@psycharo's pre-trained model

ds2268 commented Jun 7, 2018

Can confirm that @vpaharia's latest fix works. Steps to follow on Python 3.5.5, TF-gpu 1.8:

  • Download @psycharo's pre-trained model (e.g. the fine-tuned one) and word_counts.txt.
  • Work around the word_counts.txt format by replacing line 49 in vocabulary.py with:
    reverse_vocab = [eval(line.split()[0]).decode() for line in reverse_vocab]
    or use the already-fixed file provided by @RazinShaikh without changing the code.
  • Use the script provided by @vpaharia, with the checkpoint file paths replaced correctly (e.g. ./model.ckpt-2000000 if the files are in the current directory).
  • Run inference, for example:
    python3 im2txt/run_inference.py --checkpoint_path=models/model.ckpt-2000000 --vocab_file=models/word_counts.txt --input_files=images/image1.jpg

In my case I created a models directory where I extracted @psycharo's trained models, and I also put the above-mentioned script in this directory to fix the models (with the paths replaced by ./model.ckpt-2000000). I hope this helps others, so that they don't have to look through all the posts :)

Gharibim commented Jul 16, 2018

@cshallue Thank you so much for your help!

Here is a 5000000 step model using TF 1.9:
https://github.com/Gharibim/Tensorflow_im2txt_5M_Step
