New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ROCm] This change replaces the original assert for detecting multiple #49232
[ROCm] This change replaces the original assert for detecting multiple #49232
Conversation
NCCL managers in favor of a warning log message. The original assert was added to protect against a potential deadlock scenario (that is specific to ROCm) when multiple hosts are emulated using a single host. This can happen in this scenario because two different NCCL streams can map to the same hardware queue and block each other from making forward progress. This deadlock scenario seems to only occur in this specific scenario. However, the assert was preventing other scenarios involving multiple NCCL managers from executing even though they did not cause a deadlock. For example, TF creates a single eager context (which creates a NCCL manager) prior to running a model in eager mode. However, other eager contexts can be created for single use purposes which results in additional NCCL managers being created but not used duing the model run. In these scenarios, deadlock is impossible and thus the creation of multiple NCCL managers should be allowed to proceed. The specific case which is prompting this change is that in addition to the regular eager context, an additional context is created by the MLIR constant folding optimization to compute the constant at compile time so that it can be used as a replacement later on the compilation process. This additional context is only used once and cannot cause a deadlock. This change enables these cases to execute successfully. In the future, the original assert will have to be reinstated with additional logic restricting its execution to the specific scenario(s) which cause the deadlock.
@chsigg Gentle ping. |
1 similar comment
@chsigg Gentle ping. |
@cheshire gentle ping. |
@jurahul any thoughts on this? |
@rsanthanam-amd sorry I am confused. In general, we try to avoid warning log messages: if it's an error, we should check at least dynamically, and if it's not an error, we should not print anything. @jpienaar can we avoid creating the NCCL manager from the TF eager context used by the MLIR bridge? Could you clarify the connection with ROCm? |
I'm surprised a NCCL manager is created given GPU count is set to 0 there - at least I can't think of where a NCCL manager would be useful without any GPU devices enabled. I think we should be able to avoid that. |
This is a PR from ~21 days ago, is it possible that the bug is no longer there? |
@cheshire, @jpienaar, @rsanthanam-amd Any update on this PR? Please. Thanks! |
@rsanthanam-amd do you still hit the original issue? |
@cheshire I recently tried to reproduce the problem but the model I am using is experiencing problems because of the keras decoupling. I will look into this as soon as I can sort out that problem. |
@rsanthanam-amd is this the issue about checking |
@cheshire Yes, exactly! |
@rsanthanam-amd could you please describe exactly how did you hit it? |
@cheshire the model i am working with is using some lars optimizer. Here is the error message: "TypeError: "inner_optimizer" must be an instance of OptimizerV2, but got: <lars_optimizer.LARSOptimizer object at 0x7f6b390250f0>" |
so the underlying issue is that you are mixing Keras classes from |
@cheshire Thank you so much! Your suggestion worked and I got the model running. This Larsoptimizer seems to be part of the model source. I synced our ROCm fork with this repo on Monday and I can confirm that in our synced repo, this multiple nccl manager problem still exists. |
Could you report the call stacks of where the NcclManagers are created? |
Call stack No. 1#0 0x00007f2ef0b2cfb7 in raise () from /lib/x86_64-linux-gnu/libc.so.6 Call stack No. 2#0 0x00007f2ef0b2cfb7 in raise () from /lib/x86_64-linux-gnu/libc.so.6 |
Pending change here for review to avoid the init where there are 0 GPU devices configured, just waiting for reviews to complete. |
fc158eb should avoid creating these now where we have 0 GPUs set. |
@rsanthanam-amd Can you please check @jpienaar's comments and keep us posted ? Thanks! |
NCCL managers in favor of a warning log message.
The original assert was added to protect against a potential deadlock
scenario (that is specific to ROCm) when multiple hosts are emulated
using a single host. This can happen in this scenario because two
different NCCL streams can map to the same hardware queue and block
each other from making forward progress.
This deadlock scenario seems to only occur in this specific scenario.
However, the assert was preventing other scenarios involving multiple
NCCL managers from executing even though they did not cause a deadlock.
For example, TF creates a single eager context (which creates a NCCL manager)
prior to running a model in eager mode. However, other eager contexts can be
created for single use purposes which results in additional NCCL managers being
created but not used duing the model run. In these scenarios, deadlock is
impossible and thus the creation of multiple NCCL managers should be allowed
to proceed.
The specific case which is prompting this change is that in addition to the
regular eager context, an additional context is created by the MLIR constant
folding optimization to compute the constant at compile time so that it can
be used as a replacement later on the compilation process. This additional
context is only used once and cannot cause a deadlock.
This change enables these cases to execute successfully.
In the future, the original assert will have to be reinstated with
additional logic restricting its execution to the specific scenario(s)
which cause the deadlock.