Description
System information
- Top-level directory of the model you are using: models/Im2Txt
- OS Platform and Distribution: Linux CentOS 7
- TensorFlow installed from: binary
- TensorFlow version: ('v1.0.0-rc1-102-g1536a84-dirty', '1.0.0-rc2')
I am trying to use the im2txt model with the MSCOCO dataset, modified to use Hindi captions. When I preprocess the dataset with the command bazel-bin/im2txt/download_and_preprocess_mscoco "${MSCOCO_DIR}", it works until the word_count file has been generated. The step that generates the TFRecord files then throws the following error:
Exception in thread Thread-14:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/root/.cache/bazel/_bazel_root/5d96493270bd616258f6b3292edd375b/execroot/im2txt/bazel-out/local-fastbuild/bin/im2txt/build_mscoco_data.runfiles/im2txt/im2txt/data/build_mscoco_data.py", line 281, in _process_image_files
sequence_example = _to_sequence_example(image, decoder, vocab)
File "/root/.cache/bazel/_bazel_root/5d96493270bd616258f6b3292edd375b/execroot/im2txt/bazel-out/local-fastbuild/bin/im2txt/build_mscoco_data.runfiles/im2txt/im2txt/data/build_mscoco_data.py", line 234, in _to_sequence_example
"image/caption": _bytes_feature_list(caption),
File "/root/.cache/bazel/_bazel_root/5d96493270bd616258f6b3292edd375b/execroot/im2txt/bazel-out/local-fastbuild/bin/im2txt/build_mscoco_data.runfiles/im2txt/im2txt/data/build_mscoco_data.py", line 202, in _bytes_feature_list
return tf.train.FeatureList(feature=[_bytes_feature(v) for v in values])
File "/root/.cache/bazel/_bazel_root/5d96493270bd616258f6b3292edd375b/execroot/im2txt/bazel-out/local-fastbuild/bin/im2txt/build_mscoco_data.runfiles/im2txt/im2txt/data/build_mscoco_data.py", line 192, in _bytes_feature
return tf.train.Feature(bytes_list=tf.train.BytesList(value=str(value)))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
The error seems to occur while encoding the caption data into the TFRecord. I have also tried replacing str(value) with an explicit unicode encode call.
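For reference, in Python 2 calling str() on a unicode object implicitly encodes it with the ascii codec, which is what fails on the Hindi caption tokens. Below is a minimal sketch of the kind of explicit encoding I mean. The helper name to_utf8_bytes is hypothetical and not part of im2txt, and the commented-out use inside _bytes_feature is an assumption about how it would be wired in, not a confirmed fix.

```python
# -*- coding: utf-8 -*-
import sys

# Text type for the running interpreter: `unicode` on Python 2, `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_utf8_bytes(value):
    """Return `value` as UTF-8 encoded bytes (hypothetical helper)."""
    if isinstance(value, text_type):
        # Explicit UTF-8 instead of the implicit ascii codec used by str().
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # Already a byte string; pass it through unchanged.
        return value
    # Anything else (e.g. an int): convert to text first, then encode.
    return text_type(value).encode("utf-8")


# Assumed use inside _bytes_feature in build_mscoco_data.py:
#   tf.train.Feature(bytes_list=tf.train.BytesList(value=[to_utf8_bytes(value)]))

if __name__ == "__main__":
    word = u"\u0915\u0941\u0924\u094d\u0924\u093e"  # a Hindi token
    encoded = to_utf8_bytes(word)
    # Round trip back to text to confirm nothing was lost.
    print(encoded.decode("utf-8") == word)
```

Anything that reads the captions back out of the TFRecord would then have to decode the bytes as UTF-8 as well.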
Sample caption file is available at: https://drive.google.com/file/d/0B8Rng3ofk0uAbDBJekpPR3BXVzA/view?usp=sharing