
Loading a large model fails #24

Closed
doubler opened this issue Mar 25, 2016 · 20 comments


@doubler commented Mar 25, 2016

When I start my application, serving fails to load one big model. The error message is clear, but I don't know how to resolve it in tensorflow_serving:

```
I tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:149] Aspiring 1 versions for servable common1
[libprotobuf ERROR external/tf/google/protobuf/src/google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
```
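
For context, the limit the error refers to is CodedInputStream's default 64MB total-bytes cap, which applies when a message is parsed, not when it is written (which is why exporting can succeed while loading fails). A minimal C++ sketch of parsing a GraphDef with a raised limit; this is not TF-Serving's actual loader, and the 512MB/256MB values are illustrative:

```cpp
#include <fcntl.h>
#include <unistd.h>

#include "google/protobuf/io/coded_stream.h"
#include "google/protobuf/io/zero_copy_stream_impl.h"
#include "tensorflow/core/framework/graph.pb.h"

// Parse a GraphDef that may exceed protobuf's default 64MB read limit.
bool ReadLargeGraphDef(const char* path, tensorflow::GraphDef* graph_def) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return false;
  google::protobuf::io::FileInputStream raw_input(fd);
  google::protobuf::io::CodedInputStream coded_input(&raw_input);
  // Default limit is 64 << 20 (64MB); raise it to 512MB, warning at 256MB.
  // (Two-argument form of the protobuf-3.0-era API; newer releases take one.)
  coded_input.SetTotalBytesLimit(512 << 20, 256 << 20);
  const bool ok = graph_def->ParseFromCodedStream(&coded_input) &&
                  coded_input.ConsumedEntireMessage();
  close(fd);
  return ok;
}
```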

@chrisolston (Contributor) commented Mar 25, 2016

Hi there,

I believe the workaround described at tensorflow/tensorflow#582, i.e. increasing the protocol message size limit, will work for TF-Serving as well.

I'm not sure exactly what caused you to exceed the limit, but FYI a common cause of large models is serializing the model weights as part of the graph-def (vs. a separate parameter saver file).

-Chris
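
As a quick way to see whether weights ended up inside the graph-def, one can walk the parsed GraphDef and measure each node's embedded tensor bytes. A hypothetical diagnostic sketch (the 1MB reporting threshold is arbitrary; it only counts packed tensor_content, not the typed value fields small constants use; and parsing a >64MB file would itself need a raised limit, e.g. the helper above):

```cpp
#include <cstdio>
#include <fstream>

#include "tensorflow/core/framework/graph.pb.h"

// Report nodes whose weights are embedded in the GraphDef via the
// "value" attribute's tensor_content bytes (typical of Const nodes).
int main(int argc, char** argv) {
  if (argc != 2) return 1;
  tensorflow::GraphDef graph_def;
  std::ifstream in(argv[1], std::ios::binary);
  if (!graph_def.ParseFromIstream(&in)) {
    std::fprintf(stderr, "parse failed (possibly over the proto size limit)\n");
    return 1;
  }
  for (const auto& node : graph_def.node()) {
    const auto it = node.attr().find("value");
    if (it == node.attr().end()) continue;
    const std::size_t bytes = it->second.tensor().tensor_content().size();
    if (bytes > (1 << 20))  // flag anything over 1MB of inlined weights
      std::printf("%s: %zu bytes embedded\n", node.name().c_str(), bytes);
  }
  return 0;
}
```

If most of the file's bytes show up here, exporting with a separate saver/variables file, as Chris suggests, is a cleaner fix than raising parse limits.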

@doubler (Author) commented Mar 26, 2016

Thanks @chrisolston.
I exported the model with TF-Serving successfully; the error appears only when loading the exported model. Yes, I serialize all the model weights as part of the graph, and the exported model file is less than 200MB.
I don't understand why exporting works but loading fails without changing the protobuf limit.

@kirilg (Member) commented Apr 7, 2016

TensorFlow uses use_fast_cpp_protos=true and allow_oversize_protos=true by default. You can try running:

```
bazel build -c opt --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true tensorflow_serving/...
```

which should work with protos larger than 64MB.

Since they put these flags in their bazel.rc by default, we'll probably do the same, but I need to double-check.

@vrv (Contributor) commented Apr 7, 2016

@kirilg: Users need to install the protobuf package we've prepared to get >64MiB Python protobuf support: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#protobuf-library-related-issues.

Also, those flags only apply to Python protobuf parsing, not C++, which has a higher limit.

@doubler (Author) commented Apr 19, 2016

@vrv After installing the >64MB Python protobuf (https://storage.googleapis.com/tensorflow/linux/cpu/protobuf-3.0.0b2.post2-cp27-none-linux_x86_64.whl), tensorflow (v0.8.0 RC0) can no longer be used: import tensorflow dumps core after installing that specific protobuf version.

The core backtrace is:

```
(gdb) bt
#0  0x00007f9606e2a2f1 in std::__detail::_Map_base<google::protobuf::Descriptor const*, std::pair<google::protobuf::Descriptor const* const, google::protobuf::DynamicMessage::TypeInfo const*>, std::allocator<std::pair<google::protobuf::Descriptor const* const, google::protobuf::DynamicMessage::TypeInfo const*> >, std::__detail::_Select1st, std::equal_to<google::protobuf::Descriptor const*>, google::protobuf::hash<google::protobuf::Descriptor const*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](google::protobuf::Descriptor const* const&) () from /usr/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so
#1  0x00007f9606e2a3d3 in google::protobuf::DynamicMessageFactory::GetPrototypeNoLock(google::protobuf::Descriptor const*) () from /usr/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so
#2  0x00007f9606e2b02a in google::protobuf::DynamicMessageFactory::GetPrototype(google::protobuf::Descriptor const*) () from /usr/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so
#3  0x00007f95ee4f5129 in google::protobuf::python::cmessage::New (cls=<optimized out>, unused_args=<optimized out>, unused_kwargs=<optimized out>) at google/protobuf/pyext/message.cc:1255
#4  0x00007f9618131d23 in type_call () from /lib64/libpython2.7.so.1.0
#5  0x00007f96180dc0b3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#6  0x00007f961817025c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#7  0x00007f96181740bd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#8  0x00007f96181741c2 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#9  0x00007f9618183fac in PyImport_ExecCodeModuleEx () from /lib64/libpython2.7.so.1.0
#10 0x00007f9618184228 in load_source_module () from /lib64/libpython2.7.so.1.0
```

@vrv (Contributor) commented Apr 20, 2016

@keveman: any ideas?

@doubler (Author) commented Apr 21, 2016

By the way, if I upgrade protobuf on my Mac, import tensorflow also fails:

```
In [1]: import tensorflow

KeyError                                  Traceback (most recent call last)
in <module>()
----> 1 import tensorflow

/Library/Python/2.7/site-packages/tensorflow/__init__.py in <module>()
     21 from __future__ import print_function
     22
---> 23 from tensorflow.python import *

/Library/Python/2.7/site-packages/tensorflow/python/__init__.py in <module>()
     47
     48 try:
---> 49   from tensorflow.core.framework.graph_pb2 import *
     50 except ImportError:
     51   msg = """%s\n\nError importing tensorflow. Unless you are using bazel,

/Library/Python/2.7/site-packages/tensorflow/core/framework/graph_pb2.py in <module>()
      8 from google.protobuf import reflection as _reflection
      9 from google.protobuf import symbol_database as _symbol_database
---> 10 from google.protobuf import descriptor_pb2
     11 # @@protoc_insertion_point(imports)
     12

/Library/Python/2.7/site-packages/google/protobuf/descriptor_pb2.py in <module>()
   1493     message_type=None, enum_type=None, containing_type=None,
   1494     is_extension=False, extension_scope=None,
-> 1495     options=None),
   1496   _descriptor.FieldDescriptor(
   1497     name='source_file', full_name='google.protobuf.GeneratedCodeInfo.Annotation.source_file', index=1,

/Library/Python/2.7/site-packages/google/protobuf/descriptor.pyc in __new__(cls, name, full_name, index, number, type, cpp_type, label, default_value, message_type, enum_type, containing_type, is_extension, extension_scope, options, has_default_value, containing_oneof)
    503       return _message.default_pool.FindExtensionByName(full_name)
    504     else:
--> 505       return _message.default_pool.FindFieldByName(full_name)
    506
    507   def __init__(self, name, full_name, index, number, type, cpp_type, label,

KeyError: "Couldn't find field google.protobuf.GeneratedCodeInfo.Annotation.path"
```

@dzhyeon commented Jun 29, 2016

I've tried the solutions listed in tensorflow/tensorflow#582: removing tensorflow and protobuf and reinstalling from source, then reinstalling protobuf with `pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/protobuf-3.0.0b2.post2-cp27-none-linux_x86_64.whl`. Importing tensorflow then works fine. However, when following the serving tutorial at https://tensorflow.github.io/serving/serving_basic with my own network, the same error occurred:

```
[libprotobuf ERROR external/protobuf/src/google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
```

@ank286 commented Jul 15, 2016

I agree with @dzhyeon; I reinstalled protobuf-3.0.0b2.post2 but that did not help.

The way I fixed the issue was to open external/protobuf/src/google/protobuf/io/coded_stream.h and change the constant "kDefaultTotalBytesLimit" from 64 to 256 (i.e. from 64MB to 256MB). The file coded_stream.h is located within the ~/.cache/bazel/_bazel_root//execroot/serving/ folder.

Now you can bazel build the inference server and run it from bazel-bin according to the tutorials.

Ankur
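
For anyone hunting for the exact spot: in the protobuf sources of that era, the constant sits inside class CodedInputStream in google/protobuf/io/coded_stream.h. A sketch of the edit Ankur describes (not a verbatim diff):

```cpp
// google/protobuf/io/coded_stream.h, inside class CodedInputStream:
// before: static const int kDefaultTotalBytesLimit = 64 << 20;  // 64MB
static const int kDefaultTotalBytesLimit = 256 << 20;  // raised to 256MB
```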

@haibarasiao commented Aug 19, 2016

@ank286 I changed "kDefaultTotalBytesLimit" from 64 to 256 and reinstalled protobuf from source, but it didn't work.
Do you know which libprotobuf directory tensorflow actually uses? And what did you mean by "the file coded_stream.h is located within the ~/.cache/bazel/_bazel_root//execroot/serving/ folder"? Please help, I'm going crazy... :(

@haibarasiao commented Aug 19, 2016

@dzhyeon Have you fixed the issue? I'm facing the same problem as you... I reinstalled protobuf after changing the limit to 256 << 20 in coded_stream.h, but it didn't work.

@ank286 commented Aug 19, 2016

Find all instances of coded_stream.h on the machine. If you installed Tensorflow, it will be linked against one version of coded_stream.h, but you may have changed another. In my experience, Tensorflow builds with bazel, and bazel places a copy of coded_stream.h in a (temporary) cache folder, so that is the one whose limit needs to be changed to 256MB.

@vrv closed this Aug 19, 2016

@haibarasiao commented Aug 20, 2016

@vrv I tried to update protobuf as described at https://www.tensorflow.org/versions/r0.10/get_started/os_setup.html#protobuf-library-related-issues, but I get a segfault like the others. My steps were pip install tensorflow, then pip install --upgrade protobuf.
Is something wrong there?

@haibarasiao commented Aug 20, 2016

@ank286 Do you mean that I have to reinstall tensorflow from source with bazel after changing the limit to 256 in coded_stream.h? I only reinstalled protobuf from source after changing the limit in every coded_stream.h.

@ank286 commented Aug 20, 2016

I would find all instances of coded_stream.h on your machine and see if the value has been changed to 256. For me, TF was reading coded_stream.h from a different location that was not in the protobuf source.


@haibarasiao commented Aug 22, 2016

@ank286 I changed coded_stream.h in all of these locations:

```
/home/usr/.cache/bazel/_bazel_scw4150/9f70318cfa7ecd7a7b579a16191209d1/external/protobuf/src/google/protobuf/io/coded_stream.h
/home/usr/.cache/bazel/_bazel_scw4150/9f70318cfa7ecd7a7b579a16191209d1/external/grpc/third_party/protobuf/src/google/protobuf/io/coded_stream.h
/home/usr/.local/share/Trash/files/protobuf-3.0.0/src/google/protobuf/io/coded_stream.h
/home/scw4150/.local/share/Trash/files/Untitled Folder.2/protobuf-3.0.0/protobuf-3.0.0/src/google/protobuf/io/coded_stream.h
/usr/lib/python2.7/site-packages/tensorflow/include/google/protobuf/io/coded_stream.h
/usr/include/google/protobuf/io/coded_stream.h
```

but nothing helped. I think I need to uninstall tensorflow and reinstall it from source, but I don't know whether that will work.
Sad.

@ank286 commented Aug 24, 2016

Did the reinstallation work?


@dzhyeon commented Nov 3, 2016

As @ank286 suggested, changing coded_stream.h under the bazel .cache directory solved the problem.
Just to be safe, I also ran bazel clean and rebuilt.

@laotao commented Dec 14, 2017

The latest protobuf version has already raised the hard limit to 2GB, but there's another place to modify if the model is over 1GB: (tfserving root)/tensorflow/tensorflow/core/platform/env.cc, line 422:

```
coded_stream.SetTotalBytesLimit(1024LL << 20, 512LL << 20);
```

I changed 1024LL to 1500LL and my model loaded successfully. (But 2048LL caused a 0-limit error; you may try 2047LL or INT_MAX if your model is as big as 2GB.)
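
For reference, a sketch of the adjusted call (values illustrative; the first argument is the total limit, the second the warning threshold). The reason exactly 2048LL << 20 misbehaves is that it equals 2^31 bytes, which overflows the function's int parameter, hence the advice to stay just under 2GB or use INT_MAX:

```cpp
// tensorflow/core/platform/env.cc (line number varies by version):
// before: coded_stream.SetTotalBytesLimit(1024LL << 20, 512LL << 20);
coded_stream.SetTotalBytesLimit(1500LL << 20, 512LL << 20);  // ~1.5GB limit
```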
