gRPC: terminate called after throwing an instance of 'std::bad_alloc' #31245
This is not a memory allocation issue and therefore not related to issue #9487. Consider the memory profile for a single node run:
Since you are able to run the parallel version with a smaller image size but encounter the error at larger image sizes, it looks like it's a failure to allocate memory.
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
@ymodak Can we reopen this issue? I took over this work from @karkadad and I'm facing this issue. With the same image size (1,900,900,900,1), a single node can complete the run, but the parallel version with 2 nodes can't because of std::bad_alloc. I think this means that there's enough memory on 1 node and the error happens somewhere in the communication.
Can you run the code under gdb and share the backtrace?
#0  0x00007ffff71281f7 in raise () from /lib64/libc.so.6

@mrry Thanks for jumping on this issue. Do you think this backtrace is helpful? Please do let me know if you need any further info.
Thank you! This backtrace is perfect, and I think it gives us enough information to find a fix. I'll ping this thread when we have more information.
Yes, this is still an issue. @mrry Any updates?
Still waiting. Any help would be appreciated. |
I saw the same error on CPU, and reducing batch_size addressed the problem. |
@asterisk37n Thanks for your comment. Unfortunately, we are facing this issue with batch_size 1 because the input size is really big. |
We are closing this issue for now due to lack of activity. Please comment if this is still an issue for you. Thanks! |
I am facing the same issue. When I tried to train my input embedding with shape [20K, 128], the training phase looked fine, but when the embedding shape was changed to [20M, 128], the std::bad_alloc error occurred in the parameter server. At that time, the memory used was less than 50% of the total.
#6  0x00007fffed2490c1 in operator new (sz=18446744069414584364)

I think there is a bug in the method
System information
• OS Platform and Distribution: CentOS Linux release 7.4.1708 (Core)
• TensorFlow version: TensorFlow 1.12.0, built from source
Describe the problem:
Running a model-parallel implementation results in the following error:

terminate called after throwing an instance of 'std::bad_alloc'
The code runs fine on a single node, i.e. with a single worker, but distributing across 2 nodes/2 workers results in the above error. I suspect it is related to gRPC.
Source code / logs
On the chief worker:
On the non-chief worker (on another node with a different IP address):