
backward() on multiple streams and devices is not recording streams properly #33909

@dmitrivainbrand

Description


🐛 Bug

@mcarilli recently fixed a bug in streaming backward with cross-device synchronization. However, running my multi-GPU, multi-stream code for model pipelining, I encountered the following behavior: producer-consumer dependencies all seemed to be kept and respected, but the diffs between the no-stream and multi-stream versions of the same code were consistently large.

@mcarilli suspected that the problem is with the caching allocator and proposed adding .recordStream() to synchronize the allocator with the consumer stream. It worked like a charm! The no-stream vs. with-streams results are now bit-exact... almost always. Except for roughly once in 20-30 runs, I see some non-deterministic behavior that causes diffs on the order of 1e-9. I'm not sure it's even related...
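For context, here is a minimal Python-level sketch of the hazard @mcarilli described, using the public torch.cuda stream API (the tensor names and sizes are illustrative, not taken from the attached test): a tensor allocated on one stream is consumed on another, and without record_stream() the caching allocator is free to recycle its block as soon as the allocation stream is done with it.

  import torch

  # Two streams on the same device; in the real case these are the
  # producer/consumer streams the autograd engine picks in input_buffer.cpp.
  producer = torch.cuda.Stream()
  consumer = torch.cuda.Stream()

  with torch.cuda.stream(producer):
      # The caching allocator associates this block with `producer`.
      grad = torch.randn(1024, 1024, device="cuda")

  # Order the kernels: the consumer stream waits for the producer's work.
  consumer.wait_stream(producer)

  with torch.cuda.stream(consumer):
      acc = grad * 2.0  # `grad` is read on a stream it was not allocated on

  # Tell the allocator that `grad`'s storage is in use on `consumer`; without
  # this, once `grad` is freed the block can be handed to a new allocation on
  # `producer` while the consumer-stream kernel may still be reading it.
  grad.record_stream(consumer)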

The change: added the line
c10::cuda::CUDACachingAllocator::recordStream(var.storage().data_ptr(), c10::cuda::CUDAStream(*opt_accumulate_stream));
after each wait(event); call in autograd/input_buffer.cpp:

  if (device_of(var)->is_cuda()) {
    const auto on_producer = opt_producer_stream
                        && device_of(var) == opt_producer_stream->device();
    const auto on_consumer = opt_consumer_stream
                        && device_of(var) == opt_consumer_stream->device();
    if (on_producer && on_consumer) {
      // (2a)
      opt_accumulate_stream = opt_consumer_stream;
      if (opt_accumulate_stream != opt_producer_stream) {
        // (2b)
        auto event = c10::Event{c10::DeviceType::CUDA};
        event.record(*opt_producer_stream);
        opt_accumulate_stream->wait(event);
        c10::cuda::CUDACachingAllocator::recordStream(var.storage().data_ptr(), c10::cuda::CUDAStream(*opt_accumulate_stream));
      }
    } else {
      c10::optional<c10::Stream> opt_sync_stream = c10::nullopt;
      const auto guard = c10::impl::VirtualGuardImpl{c10::DeviceType::CUDA};
      if (on_consumer && !on_producer) {
        // (3a)
        opt_accumulate_stream = opt_consumer_stream;
        opt_sync_stream = guard.getDefaultStream(opt_consumer_stream->device());
        TORCH_INTERNAL_ASSERT(opt_sync_stream == guard.getStream(*device_of(var)));
      } else if (on_producer && !on_consumer) {
        // (4a)
        opt_accumulate_stream = guard.getDefaultStream(opt_producer_stream->device());
        opt_sync_stream = opt_producer_stream;
      } else {
        // (5)
        opt_accumulate_stream = guard.getDefaultStream(*device_of(var));
      }
      if (opt_sync_stream && (opt_accumulate_stream != opt_sync_stream)) {
        // (3b), (4b)
        c10::OptionalDeviceGuard device_guard{opt_sync_stream->device()};
        auto event = c10::Event{c10::DeviceType::CUDA};
        event.record(*opt_sync_stream);
        opt_accumulate_stream->wait(event);
        c10::cuda::CUDACachingAllocator::recordStream(var.storage().data_ptr(), c10::cuda::CUDAStream(*opt_accumulate_stream));
      }
    }
  }
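
If my understanding of the allocator is right, the recordStream call marks var's storage as in use on opt_accumulate_stream, so the caching allocator will not return the block to its pool (where it could be handed to an allocation on another stream) until the accumulation work queued on that stream has finished. The cross-stream event only orders the kernels; on its own it does not keep the memory alive for the consumer.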

To Reproduce

Attaching my test pipeline_mlp_bug.py, which runs the same network with and without streams and compares the results.
To reproduce the behavior:

  1. run: python pipeline_mlp_bug.py --num-chunks=8 --num-microbatches=5 --num-mp-gpus 4 --num-batches=4

Expected behavior

Before the change, expect to see significant diffs at the end:

______________Pipelined w/o Streams vs. With Streams ____________
out equals? tensor(False)
out max diff: tensor(0.0015, grad_fn=)
weight 0 equals?: tensor(False, device='cuda:0')
weight 0 max diff? tensor(3.7253e-09, device='cuda:0')
weight last equals?: tensor(False, device='cuda:0')
weight last max diff tensor(6.4563e-05, device='cuda:0')
weight gradient 0 equals?: tensor(False, device='cuda:0')
weight gradient 0 max diff? tensor(1.1085e-08, device='cuda:0')
weight gradient last equals?: tensor(False, device='cuda:0')
weight gradient last max diff tensor(0.0215, device='cuda:0')
loss pipelined 7.636113166809082, loss w streams 7.63759708404541

After the change, expect to see no differences between the two runs:

______________Pipelined w/o Streams vs. With Streams ____________
out equals? tensor(True)
out max diff: tensor(0., grad_fn=)
weight 0 equals?: tensor(True, device='cuda:0')
weight 0 max diff? tensor(0., device='cuda:0')
weight last equals?: tensor(True, device='cuda:0')
weight last max diff tensor(0., device='cuda:0')
weight gradient 0 equals?: tensor(True, device='cuda:0')
weight gradient 0 max diff? tensor(0., device='cuda:0')
weight gradient last equals?: tensor(True, device='cuda:0')
weight gradient last max diff tensor(0., device='cuda:0')
loss pipelined 7.636113166809082, loss w streams 7.636113166809082

Environment

(also discussed in #7601 (comment))
Test attached:
pipeline_mlp_bug.zip

cc @ezyang @ssnl @albanD @zou3519 @gqchen @ngimel

Labels

module: autograd (Related to torch.autograd, and the autograd engine in general)
module: cuda (Related to torch.cuda, and CUDA support in general)
module: determinism
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
