
Fix Pipe + DDP for unused parameters, static graph #60118

Closed
wants to merge 5 commits

Commits on Jun 16, 2021

  1. Fix Pipe + DDP for unused parameters, static graph

    Pipe + DDP has a few issues:
    
    1) With static graph, gradients are not synchronized on the first backward pass (i.e. the delay allreduce is not run). Broken since #55248.
    2) With find_unused_parameters=True, gradient synchronization also does not happen. Broken since #57081.
    
    
    The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not invoke the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP with `Pipe`. This is because we run `backward` on the `rref.local_value()`, which does not have this autograd recording.
    
    To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in #49908.
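    
    A minimal sketch of that unwrap/rewrap idea, assuming a stand-in `DDPSink` autograd `Function` and a hypothetical `apply_ddp_sink` helper (only `DDPSink.apply`, `RRef`, and `local_value()` come from the description above; this is not the PR's actual code):
    
    ```python
    import torch
    from torch.distributed.rpc import RRef  # requires torch built with distributed/RPC
    
    
    class DDPSink(torch.autograd.Function):
        """Illustrative stand-in for DDP's internal sink; the real one triggers
        the delayed allreduce / unused-parameter handling in its backward."""
    
        @staticmethod
        def forward(ctx, tensor):
            return tensor.clone()
    
        @staticmethod
        def backward(ctx, grad):
            # The real DDPSink kicks off gradient synchronization here.
            return grad
    
    
    def apply_ddp_sink(output):
        # Hypothetical helper: if Pipe handed us an (Owner)RRef, applying
        # DDPSink to the RRef object itself is never seen by the autograd
        # pass that later runs backward() on rref.local_value(), so unwrap
        # the local value first ...
        if isinstance(output, RRef):
            result = DDPSink.apply(output.local_value())
            # ... then rewrap so downstream code still receives an RRef.
            # (Constructing a local RRef requires rpc.init_rpc to have run.)
            return RRef(result)
        return DDPSink.apply(output)
    ```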
    
    To test: all tests in pipe_with_ddp_test pass.
    The reason these tests did not catch the errors earlier is that all ranks received the same model inputs, so even when gradient synchronization did not occur, the grads still matched because the model is identical on all ranks (guaranteed by DDP). Fixed the tests to use different inputs across ranks.
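    
    For illustration, a hedged sketch of per-rank inputs (the helper name and shapes are made up, not taken from pipe_with_ddp_test):
    
    ```python
    import torch
    
    
    def make_rank_inputs(rank: int, batch_size: int = 8, dim: int = 16) -> torch.Tensor:
        # Seed a generator with the rank so every rank draws a different batch;
        # with identical inputs (and identical models), a missing gradient
        # synchronization is invisible because the grads already match.
        gen = torch.Generator().manual_seed(rank)
        return torch.randn(batch_size, dim, generator=gen)
    ```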
    
    Differential Revision: [D29167283](https://our.internmc.facebook.com/intern/diff/D29167283/)
    
    [ghstack-poisoned]
    rohan-varma committed Jun 16, 2021
    00b3804
  2. Update on "Fix Pipe + DDP for unused parameters, static graph"

    rohan-varma committed Jun 16, 2021
    4e0aa23
  3. Update on "Fix Pipe + DDP for unused parameters, static graph"

    rohan-varma committed Jun 16, 2021
    12360bf

Commits on Jun 17, 2021

  1. Update on "Fix Pipe + DDP for unused parameters, static graph"

    rohan-varma committed Jun 17, 2021
    5db4162
  2. Update on "Fix Pipe + DDP for unused parameters, static graph"

    rohan-varma committed Jun 17, 2021
    1d9a066