SyncBatchNormalization layer segfaults on multi-worker with NCCL #41113
Comments
I have tried this in Colab with TF version 2.2 and I am seeing the error message.
@ravikyram Oops, sorry about that. I removed the line that was the problem and it should work now. Please keep in mind that the code must run on multiple workers to reproduce the issue. As far as I know, this is not possible on Google Colab.
Hi @MinasTyuru, thanks for providing very detailed debugging information. BTW, you can use virtual GPUs in Colab to simulate a multi-worker environment, but NCCL is not supported in this case, so that is probably not helpful for debugging this particular issue. Can you confirm that you do not see this issue if the CollectiveCommunication is set to AUTO? Additionally, can you try running on TF nightly and let me know what happens?
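For reference, here is a minimal sketch of what splitting a single Colab GPU into two virtual GPUs can look like; the device index and memory limits are illustrative assumptions, and as noted above this setup does not support NCCL:

```python
# Sketch: split one physical GPU into two virtual (logical) GPUs so that a
# distribution strategy sees multiple devices. Memory limits are illustrative.
# This must run before the GPUs are initialized by any other TensorFlow op.
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048),
         tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
    logical_gpus = tf.config.experimental.list_logical_devices("GPU")
    print("Physical GPUs:", len(gpus), "Logical GPUs:", len(logical_gpus))
```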
@nikitamaia Thank you for the information! Yes, that's correct. Using AUTO or RING seems to work fine. Running on TF nightly may be a bit tricky on our infrastructure; I will get back to you on that.
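For context, the only difference between the failing and working configurations described above is the communication argument passed to the strategy; a sketch of the AUTO/RING variants that reportedly do not segfault:

```python
import tensorflow as tf

# Reportedly works: let TensorFlow choose the collective implementation.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.AUTO)

# Also reportedly works: force ring-based collectives instead of NCCL.
# strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
#     communication=tf.distribute.experimental.CollectiveCommunication.RING)
```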
@nikitamaia Hi Nikita, I was able to reproduce on TF-nightly.
@MinasTyuru thank you for the report. It seems to be the case from the title, but I would like to confirm: does this issue only show up when using the SyncBatchNormalization layer (and go away when not using it, or when using a regular batch norm)? Getting a segfault is terrible; we will look into it. We have used SyncBatchNormalization with multi-worker for some models before, so I am trying to see what's different here, since your repro is quite simple.

Would you mind trying one thing: can you make the batch size bigger? It currently seems to be 1. Usually we try to split the batch across the replicas, but since a batch size of 1 cannot be split, the other replicas will end up getting empty batches. I am wondering if that's what is triggering this, although it would not explain why it works for some steps and then fails.
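To illustrate the batch-splitting concern: the global batch is divided across replicas, so a global batch size of 1 leaves some replicas with empty batches. Below is a sketch of deriving the global batch size from a per-replica size; the numbers are illustrative assumptions.

```python
import tensorflow as tf

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

# Scale the global batch so every replica receives a non-empty shard;
# a global batch of 1 cannot be split across multiple replicas.
per_replica_batch_size = 8  # illustrative
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

dataset = tf.data.Dataset.range(1024).batch(global_batch_size)
```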
After some more investigation, it looks like we can reproduce the issue in some contexts but not others, so it's possible that it is some sort of configuration issue. I thought that I was able to reproduce on TF-nightly, but I believe that was due to a mistake I made when installing CUDA, and now I think it may work fine on TF-nightly. It may be related to how we build TensorFlow from source or something of that nature. I will investigate some more and let you know if it seems to be a problem on our end or not. |
Thanks for the update @MinasTyuru. Re: nightly, yes, your logs indicated that the issue with the nightly was that it was not able to find the GPUs and hence failed right away. I would guess that once you fix that issue, it will probably run into the segfault at some point later, like with 2.2.
Yes, we are seeing something similar. When I used a debugger, it changed the behavior; for example, adding a breakpoint after line 176 of collective_nccl_reducer.cc seems to change the behavior. We tried to reproduce on TF 2.2 and TF-nightly, and on a clean build (i.e. installing TensorFlow from pip, etc.) it seems like it trains for a long time without any problems. But in our production environment (which changes some of the dependencies in the TensorFlow build, among other build differences), we encounter the segfault, so perhaps subtle timing differences are triggering the underlying issue.
Attached is a stack trace from when the segfault occurred.
@MinasTyuru I just submitted a change that should help with this issue. Feel free to reopen if you encounter the segfault again. |
Should …
Thanks Ayush! I will test this out and see if it works.
@dubey It seems like the change fixes the … I think that maybe … Also, I think I don't have the required permissions to reopen this issue, so I'm just commenting on it.
Thanks for the update. Let me follow up internally. |
Shimin implemented the change that he suggested with … I'm not sure what the relationship is between …
Thanks, I was working on a similar change internally, this time with a unit test that can reproduce the issue. Yes, please do keep me posted.
Apologies for the delay, it looks like everything works without segfaulting now. |
System information
Describe the current behavior
When training models with the tf.keras.layers.experimental.SyncBatchNormalization layer, using tf.distribute.experimental.MultiWorkerMirroredStrategy to train across multiple workers with tf.distribute.experimental.CollectiveCommunication.NCCL communication, the model trains for some amount of time (e.g. several thousand steps) and then crashes with a segfault.

Describe the expected behavior
The model should train without segfaulting.
Standalone code to reproduce the issue
An example is below. Please note that this code must run on multiple workers. The TF_CONFIG environment variable must be set appropriately for your specific multi-worker configuration.
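The original script is not preserved in this copy of the issue, so the following is only a minimal sketch of a reproduction along the lines described; the model, synthetic data, and step count are illustrative assumptions.

```python
# Sketch of a multi-worker reproduction; run the same script on every worker
# with TF_CONFIG set appropriately for that worker.
import tensorflow as tf

# The configuration that reportedly segfaults: multi-worker with NCCL collectives.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
        # The layer under test.
        tf.keras.layers.experimental.SyncBatchNormalization(),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

# Synthetic data; batch size of 1 matches what was noted in the discussion above.
x = tf.random.normal((1024, 16))
y = tf.random.normal((1024, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).repeat().batch(1)

# The crash reportedly appears only after several thousand steps.
model.fit(dataset, epochs=1, steps_per_epoch=10000)
```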
This is reproducible across a wide array of contexts, for example a Keras model, an Estimator model, different GPU types, etc.
Other info / logs
I used gdb to inspect a coredump from the crashed process and looked at the backtrace. Disassembling the function showed the offending instruction, and printing out the registers shows rax 0x0 0, so some sort of pointer is set to 0. It therefore looks like there is some sort of null pointer dereference at line 185 of collective_nccl_reducer.cc, which I believe is the line col_ctx_->col_exec->UnblockDependencies(*col_params_);. I don't have any idea why it would segfault there, however. The same line appears shortly above on line 176, so it's strange that it would segfault the second time.

Also, a log is attached here; however, it is not very interesting, as it just runs for a while and then segfaults.