protobuf message overflow on trying distributed #2233

Closed
gaoteng-git opened this issue May 5, 2016 · 8 comments

gaoteng-git commented May 5, 2016

I'm trying to build an RNN on multiple machines following the Distributed TensorFlow tutorial.
When I use "with sv.managed_session(server.target) as sess:", it raises this error:
AttributeError: 'Supervisor' object has no attribute 'managed_session'
So I followed the code of "Inception" instead:
with sv.prepare_or_wait_for_session(server.target, config=sess_config) as sess:

Then it starts to run, but hangs immediately after reporting the following error:

[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:569] Reading dangerously large protocol message. If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf ERROR google/protobuf/src/google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 67108864
E tensorflow/core/framework/tensor.cc:105] Input size was 67108839 and expected 72000800

Would you please help me with this?
Thanks a lot in advance!
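
For context, the setup described above looks roughly like this (a minimal sketch assuming the TF 0.8-era distributed API; host addresses, logdir, and the graph body are placeholders, not taken from the reporter's code):

    import tensorflow as tf

    # Placeholder cluster definition; real host:port values come from the job flags.
    cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"], "worker": ["worker0:2223"]})
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # ... build the RNN graph here ...

    sv = tf.train.Supervisor(is_chief=True, logdir="/tmp/train_logs")
    # Older releases have no Supervisor.managed_session(), hence this call instead:
    with sv.prepare_or_wait_for_session(server.target) as sess:
        pass  # sess.run(...) training loop goes here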

@mrry mrry added the bug label May 5, 2016
@mrry mrry self-assigned this May 5, 2016

mrry commented May 5, 2016

Thanks for letting us know about this - it sounds like a bug in the tensor serialization code, but to figure out what's going wrong, we're going to need more information:

  1. Can you pinpoint the line of Python on which it hangs? (Does it hang on the sv.prepare_or_wait_for_session() or does it hang when you try to run something?)
  2. Is there an obviously large tensor in your program? Does it exist as a tf.constant(), a numpy array that is implicitly converted to a tf.constant(), a value that is fed into the graph, or could it be a result that you're fetching? (Looking at the expected size, 72000800 bytes, I'm guessing that it's a tensor with size 90001 in one dimension, maybe a 200 x 90001 matrix of floats; see the quick check after this list.) All of these cases should be handled, but this will help to construct a minimal failing test case.
  3. Can you try running the code in the following tests from server_lib_test.py in your environment: testLargeConstant(), testLargeFetch(), and testLargeFeed().
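
As a quick arithmetic check of the size guess in point 2 (an aside, assuming float32 values at 4 bytes each):

    # 200 x 90001 matrix of float32 values, 4 bytes per value:
    print(90001 * 200 * 4)  # 72000800, matching the "expected" size in the error log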


gaoteng-git commented May 6, 2016

Thanks very much for your quick response!

  1. It passes sv.prepare_or_wait_for_session() and hangs in session.run(...)

  2. Yes. I just followed the tutorial's ptb_word_lm.py RNN model and made only a small modification so it does classification work. The "embedding" is 90001*200 floats. The embedding definition is the same as in ptb_word_lm.py:

    with tf.device("/cpu:0"):
        embedding = tf.get_variable("embedding", [vocab_size, size])  # [90001, 200] in my code
        inputs = tf.nn.embedding_lookup(embedding, self._input_data)

  3. I ran your tests with "python ./server_lib_test.py"; after a lot of log messages, it shows the following:

.......

Ran 12 tests in 4.028s

OK

When I decrease the "embedding" matrix to a smaller size such as [101, 200], it never hangs and runs correctly.
I ran the [90001, 200] case again today; the "Input size was 67108839 and expected 72000800" message still shows, but it doesn't hang anymore. I don't know why...

BTW, is there any RNN sample code that can run on multiple cards (multi-tower style) or on distributed multi-machine setups? I tried to port the multi-tower approach from the CIFAR-10 example code to my RNN code, and after a week's work I failed (I'll report the bugs in another issue). Then I gave up and turned to the distributed style. The "Inception" example code needs the big ILSVRC2012 data set, and it's time-consuming to download (especially from China...). Since RNNs are especially useful in many application domains, would you mind pointing me to existing multi-card or multi-machine code if you know of any?

When I decrease the embedding size to [101, 200], it starts running, but it still can't fully use the 4 GPU cards in the machine; only 1 card shows more than 0% usage. Would you mind giving me an email address so I can send you my code?

Thanks a lot again for your time!


gaoteng-git commented May 18, 2016

@mrry
To make it easier for you to reproduce the bug, I took ptb_word_lm.py as an example and added async-distributed TensorFlow code to it.
I run it with 1 ps server and 2 workers, all on the same node:

CUDA_VISIBLE_DEVICES='' python ./ptb_word_lm.py --ps_hosts="10.141.33.61:2222" --worker_hosts="10.141.33.61:2223,10.141.33.61:2224" --job_name=ps --task_index=0
python ./ptb_word_lm.py --ps_hosts="10.141.33.61:2222" --worker_hosts="10.141.33.61:2223,10.141.33.61:2224" --job_name=worker --task_index=0 --data_path=.
python ./ptb_word_lm.py --ps_hosts="10.141.33.61:2222" --worker_hosts="10.141.33.61:2223,10.141.33.61:2224" --job_name=worker --task_index=1 --data_path=.

When I set "vocab_size" to bigger than 83887, (838872004>67108864), it shows the following error(job crash by most cases, but not always crash):

I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2223
Epoch: 1 Learning rate: 0.000
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:569] Reading dangerously large protocol message.  If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf ERROR google/protobuf/src/google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 67108864
E tensorflow/core/framework/tensor.cc:105] Input size was 67108839 and expected 67109600
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:569] Reading dangerously large protocol message.  If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf ERROR google/protobuf/src/google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 67108864
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:569] Reading dangerously large protocol message.  If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf ERROR google/protobuf/src/google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 67108864
E tensorflow/core/framework/tensor.cc:105] Input size was 67108839 and expected 67109600
Traceback (most recent call last):
  File "./ptb_word_lm.py", line 358, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "./ptb_word_lm.py", line 355, in main
    run_worker(server, cluster)
  File "./ptb_word_lm.py", line 335, in run_worker
    verbose=True)
  File "./ptb_word_lm.py", line 264, in run_epoch
    m.lr : cur_lr })
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 340, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 564, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 637, in _do_run
    target_list, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 659, in _do_call
    e.code)
tensorflow.python.framework.errors.AbortedError: Step 79814040532157923
     [[Node: model/clip_by_global_norm/model/clip_by_global_norm/_5_S67 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=67222584234197827, tensor_name="edge_9213_model/clip_by_global_norm/model/clip_by_global_norm/_5", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/cpu:0"]()]]

When I set "vocab_size" to less than but near to 83886, (838862004<67108864), it shows the warning:

[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 67108834
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:569] Reading dangerously large protocol message.  If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
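
Those two thresholds line up with protobuf's default message-size limit (an aside, again assuming float32 embeddings at 4 bytes per value):

    limit = 64 * 1024 * 1024   # 67108864, protobuf's default total-bytes limit
    print(83887 * 200 * 4)     # 67109600 -> just over the limit: message rejected
    print(83886 * 200 * 4)     # 67108800 -> just under the limit: only a warning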

In our large-scale web mining job, we need to increase the word-embedding vocabulary size to millions of words, so we have no choice but to solve this problem.

My code is attached below (reader.py is just the one from GitHub, unchanged):
ptb_word_lm.py.txt


swlsw commented May 19, 2016

Did you solve this problem? I have a similar problem with my async-distributed word2vec (79840 vocab, 300 dims).

I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {localhost:2422}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {50.1.100.102:2432, 50.1.100.108:2432}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2422
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:569] Reading dangerously large protocol message.  If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf ERROR google/protobuf/src/google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 67108864
E tensorflow/core/framework/tensor.cc:105] Input size was 67108839 and expected 95808000

VDalibard commented:

I think the solution is in #582; the protobuf Python package solution at the bottom of that thread doesn't seem to work for everyone, as mentioned in #2046.


swlsw commented May 24, 2016

Does it work for the distributed version?
I have checked that the protobuf warning/error messages are gone for the word2vec_optimized example,
but the same problem occurs when running my own distributed word2vec_optimized.


mrry commented May 24, 2016

Hi folks, sorry for the delay on this one. We've tracked down the issue: the generated code for gRPC uses a protobuf parsing routine that doesn't override the default 64MB limit.

Generally speaking, if your trainer is transferring large tensors in every step, there might be a more efficient way to implement it at the Python level. For example, instead of fetching a large fully connected layer to compute the logits, you might use a sampled loss function, and you can use tf.nn.embedding_lookup() instead of tf.gather() to more efficiently access sparse rows in an embedding matrix. Alternatively, you can shard your variables manually to avoid the limit.
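
For illustration, manually sharding the embedding might look roughly like this (a sketch assuming the TF 0.8-era graph API; the sizes and variable names are illustrative, not taken from the attached code):

    import tensorflow as tf

    vocab_size, embed_dim, num_shards = 90001, 200, 4  # illustrative sizes

    # Split one big [90001, 200] variable into 4 smaller ones so that no single
    # tensor sent between ps and worker exceeds the 64MB protobuf limit.
    shards = []
    for i in range(num_shards):
        rows = vocab_size // num_shards + (1 if i < vocab_size % num_shards else 0)
        shards.append(tf.get_variable("embedding_shard_%d" % i, [rows, embed_dim]))

    input_ids = tf.placeholder(tf.int32, [None])
    # embedding_lookup accepts a list of variables and gathers only the rows it
    # needs, so only small slices of the embedding cross the network each step.
    inputs = tf.nn.embedding_lookup(shards, input_ids)

The same idea applies to any variable or fetched tensor whose serialized size approaches that limit.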

Clearly, this isn't ideal, and we're working on a more robust fix.


hbhuang commented Oct 21, 2016

@swlsw, I used your code (ptb_word_lm.py.txt) and ran it on two machines, also with 1 ps server and 2 workers, but I get this error: RuntimeError: Graph is finalized and cannot be modified. I hope you can help.

@aselle aselle added type:bug Bug and removed bug labels Feb 9, 2017