protobuf message overflow on trying distributed #2233
Thanks for letting us know about this - it sounds like a bug in the tensor serialization code, but to figure out what's going wrong, we're going to need more information:
Thanks very much for your quick response!
.......
Ran 12 tests in 4.028s
OK

When I decrease the "embedding" matrix to a smaller size such as [101, 200], it never hangs and runs correctly.

BTW, is there any RNN sample code that can run on multiple cards (multi-tower style) or distributed across multiple machines? I tried to port the multi-tower setup from the CIFAR-10 example code to my RNN code; after a week's work I failed (I'll report the bugs in another issue). Then I gave up and turned to the distributed style. The "Inception" example code needs the big ILSVRC2012 dataset, and it's time-consuming to download (especially from China...). Since RNNs are especially useful in many application domains, would you mind pointing me to existing multi-card or multi-machine code if you know of any?

When I decrease the embedding size to [101, 200], it starts running, but it still can't fully use the 4 GPU cards in my machine; only 1 card shows more than 0% usage. Would you mind giving me an email address so I can send my code to you?

Thanks a lot again for your time!
@mrry
When I set "vocab_size" to 83887 or bigger (83887 × 200 × 4 = 67109600 > 67108864), it shows the following error (the job crashes in most cases, but not always):
When I set "vocab_size" to 83886 or less (83886 × 200 × 4 = 67108800 < 67108864), it shows only the warning:
In our large-scale web-mining job we need to grow the word-embedding vocabulary to millions of words, so we have no choice but to solve this problem. My code is here: (reader.py is taken from GitHub unchanged)
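For context, a back-of-envelope size check (a sketch, assuming float32 embeddings at 4 bytes per element, which is what the numbers above imply) reproduces the 64 MB boundary reported in this comment:

```python
# Back-of-envelope check (assumes float32 embeddings, 4 bytes per element).
# A dense [vocab_size, 200] variable serializes to roughly
# vocab_size * 200 * 4 bytes of tensor payload, and protobuf's default
# CodedInputStream limit is 64 MiB.
PROTOBUF_LIMIT = 64 * 1024 * 1024  # 67108864 bytes

def embedding_bytes(vocab_size, dim, bytes_per_elem=4):
    """Approximate serialized payload of a dense float32 embedding matrix."""
    return vocab_size * dim * bytes_per_elem

print(embedding_bytes(83887, 200))  # 67109600 -> over the limit, message rejected
print(embedding_bytes(83886, 200))  # 67108800 -> under the limit, warning only
```

This matches the boundary seen in practice: one extra vocabulary row tips the serialized tensor over protobuf's 67108864-byte default.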
Did you solve this problem? I have a similar problem with my async-distributed word2vec (79840 vocab, 300 dims).
Is it working for the distributed version?
Hi folks, sorry for the delay on this one. We've tracked down the issue: the generated code for gRPC uses a protobuf parsing routine that doesn't override the default 64 MB limit.

Generally speaking, if your trainer is transferring large tensors in every step, there may be a more efficient way to implement it at the Python level. For example, instead of fetching a large fully connected layer to compute the logits, you might use a sampled loss function.

Clearly, this isn't ideal, and we're working on a more robust fix.
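As a rough illustration of the saving from a sampled loss such as TensorFlow's tf.nn.sampled_softmax_loss (a sketch with assumed sizes, not figures from this thread): each step only needs the output-layer rows for the true and sampled classes, instead of the full vocab_size × dim matrix.

```python
# Rough illustration of why a sampled loss shrinks per-step traffic.
# Sizes are hypothetical; 4 bytes per float32 element.
vocab_size, dim, num_sampled = 1_000_000, 200, 64

full_matrix_bytes = vocab_size * dim * 4          # full softmax weight matrix
sampled_rows_bytes = (num_sampled + 1) * dim * 4  # sampled rows + true class

print(full_matrix_bytes)   # 800000000 bytes, far over the 64 MB protobuf limit
print(sampled_rows_bytes)  # 52000 bytes, trivially small
```

With the full matrix, a million-word vocabulary blows past the 64 MB message limit in a single transfer; the sampled rows stay orders of magnitude below it.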
@swlsw, I used your code (ptb_word_lm.py.txt) and ran it on two machines, but I get the error: RuntimeError: Graph is finalized and cannot be modified. I also ran it with 1 ps-server and 2 workers. Hoping for your help!
I'm trying to build an RNN on multiple machines following the Distributed TensorFlow how-to.
When I use "with sv.managed_session(server.target) as sess:", it shows the error:
AttributeError: 'Supervisor' object has no attribute 'managed_session'
So I followed the code from the "Inception" example:
with sv.prepare_or_wait_for_session(server.target, config=sess_config) as sess:
Then it starts to run, but hangs immediately after reporting the following error:
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:569] Reading dangerously large protocol message. If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf ERROR google/protobuf/src/google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 67108864
E tensorflow/core/framework/tensor.cc:105] Input size was 67108839 and expected 72000800
Would you please help me on this?
Thanks a lot in advance!