
backward() on multiple streams and devices is not recording streams properly #33909

@dmitrivainbrand

Description


🐛 Bug

@mcarilli recently fixed a bug in streaming backward with cross-device synchronization. However, running my multi-GPU, multi-stream code for model pipelining, I encountered the following behavior: producer-consumer dependencies all seemed to be kept and respected, but the diffs between the no-stream and multi-stream versions of the same code were consistently large.

@mcarilli suspected that the problem is with the caching allocator and proposed adding .recordStream() to synchronize the allocator with the consumer stream. It worked like a charm! The no-stream vs. with-streams results are now bit-exact... almost always. Except for roughly once in 20-30 runs, I see some non-deterministic behavior that causes diffs on the order of 1e-9. I'm not sure it's even related...
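For context, here is a minimal Python-level sketch of the hazard @mcarilli described, using the public torch.cuda stream API (the tensor names and sizes are illustrative, not taken from the attached test): a tensor allocated on one stream is consumed on another, and without record_stream() the caching allocator is free to recycle its block as soon as the allocation stream is done with it.

  import torch

  # Two streams on the same device; in the real case these are the
  # producer/consumer streams the autograd engine picks in input_buffer.cpp.
  producer = torch.cuda.Stream()
  consumer = torch.cuda.Stream()

  with torch.cuda.stream(producer):
      # The caching allocator associates this block with `producer`.
      grad = torch.randn(1024, 1024, device="cuda")

  # Order the kernels: the consumer stream waits for the producer's work.
  consumer.wait_stream(producer)

  with torch.cuda.stream(consumer):
      acc = grad * 2.0  # `grad` is read on a stream it was not allocated on

  # Tell the allocator that `grad`'s storage is in use on `consumer`; without
  # this, once `grad` is freed the block can be handed to a new allocation on
  # `producer` while the consumer-stream kernel may still be reading it.
  grad.record_stream(consumer)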

The change: added the line
c10::cuda::CUDACachingAllocator::recordStream(var.storage().data_ptr(), c10::cuda::CUDAStream(*opt_accumulate_stream));
after each wait(event); call in autograd/input_buffer.cpp:

  if (device_of(var)->is_cuda()) {
    const auto on_producer = opt_producer_stream
                        && device_of(var) == opt_producer_stream->device();
    const auto on_consumer = opt_consumer_stream
                        && device_of(var) == opt_consumer_stream->device();
    if (on_producer && on_consumer) {
      // (2a)
      opt_accumulate_stream = opt_consumer_stream;
      if (opt_accumulate_stream != opt_producer_stream) {
        // (2b)
        auto event = c10::Event{c10::DeviceType::CUDA};
        event.record(*opt_producer_stream);
        opt_accumulate_stream->wait(event);
        c10::cuda::CUDACachingAllocator::recordStream(var.storage().data_ptr(), c10::cuda::CUDAStream(*opt_accumulate_stream));
      }
    } else {
      c10::optional<c10::Stream> opt_sync_stream = c10::nullopt;
      const auto guard = c10::impl::VirtualGuardImpl{c10::DeviceType::CUDA};
      if (on_consumer && !on_producer) {
        // (3a)
        opt_accumulate_stream = opt_consumer_stream;
        opt_sync_stream = guard.getDefaultStream(opt_consumer_stream->device());
        TORCH_INTERNAL_ASSERT(opt_sync_stream == guard.getStream(*device_of(var)));
      } else if (on_producer && !on_consumer) {
        // (4a)
        opt_accumulate_stream = guard.getDefaultStream(opt_producer_stream->device());
        opt_sync_stream = opt_producer_stream;
      } else {
        // (5)
        opt_accumulate_stream = guard.getDefaultStream(*device_of(var));
      }
      if (opt_sync_stream && (opt_accumulate_stream != opt_sync_stream)) {
        // (3b), (4b)
        c10::OptionalDeviceGuard device_guard{opt_sync_stream->device()};
        auto event = c10::Event{c10::DeviceType::CUDA};
        event.record(*opt_sync_stream);
        opt_accumulate_stream->wait(event);
        c10::cuda::CUDACachingAllocator::recordStream(var.storage().data_ptr(), c10::cuda::CUDAStream(*opt_accumulate_stream));
      }
    }
  }
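
If my understanding of the allocator is right, the recordStream call marks var's storage as in use on opt_accumulate_stream, so the caching allocator will not return the block to its pool (where it could be handed to an allocation on another stream) until the accumulation work queued on that stream has finished. The cross-stream event only orders the kernels; on its own it does not keep the memory alive for the consumer.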

To Reproduce

Attaching my test pipeline_mlp_bug.py, which runs the same network with and without streams and compares the results.
To reproduce the behavior:

  1. run: python pipeline_mlp_bug.py --num-chunks=8 --num-microbatches=5 --num-mp-gpus 4 --num-batches=4

Expected behavior

Before the change, expect to see significant diffs at the end:

______________Pipelined w/o Streams vs. With Streams ____________
out equals? tensor(False)
out max diff: tensor(0.0015, grad_fn=)
weight 0 equals?: tensor(False, device='cuda:0')
weight 0 max diff? tensor(3.7253e-09, device='cuda:0')
weight last equals?: tensor(False, device='cuda:0')
weight last max diff tensor(6.4563e-05, device='cuda:0')
weight gradient 0 equals?: tensor(False, device='cuda:0')
weight gradient 0 max diff? tensor(1.1085e-08, device='cuda:0')
weight gradient last equals?: tensor(False, device='cuda:0')
weight gradient last max diff tensor(0.0215, device='cuda:0')
loss pipelined 7.636113166809082, loss w streams 7.63759708404541

After the change, expect to see no differences between the two runs:

______________Pipelined w/o Streams vs. With Streams ____________
out equals? tensor(True)
out max diff: tensor(0., grad_fn=)
weight 0 equals?: tensor(True, device='cuda:0')
weight 0 max diff? tensor(0., device='cuda:0')
weight last equals?: tensor(True, device='cuda:0')
weight last max diff tensor(0., device='cuda:0')
weight gradient 0 equals?: tensor(True, device='cuda:0')
weight gradient 0 max diff? tensor(0., device='cuda:0')
weight gradient last equals?: tensor(True, device='cuda:0')
weight gradient last max diff tensor(0., device='cuda:0')
loss pipelined 7.636113166809082, loss w streams 7.636113166809082

Environment

(also discussed in #7601 (comment))
Test attached:
pipeline_mlp_bug.zip

cc @ezyang @ssnl @albanD @zou3519 @gqchen @ngimel

Labels

module: autograd (Related to torch.autograd, and the autograd engine in general)
module: cuda (Related to torch.cuda, and CUDA support in general)
module: determinism
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
