Description
🐛 Bug
@mcarilli recently fixed a bug in the streaming backward pass related to cross-device synchronization.
However, when running my multi-GPU, multi-stream model-pipelining code, I encountered the following behavior:
It seems that all producer-consumer dependencies were kept and respected, but the diffs between the no-stream and multi-stream versions of the same code were consistently large. @mcarilli suspected the problem was with the caching allocator and proposed adding .recordStream() to synchronize the allocator with the consumer stream. It worked like a charm! The no-stream and with-streams runs are now bit-exact... almost always. Once in 20-30 runs or so, I still see some non-deterministic behavior that causes diffs on the order of 1e-9; I'm not sure it's even related.
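For context, here is a minimal Python-level sketch of the pattern the fix addresses (this is not the attached test; the stream names and tensor sizes are illustrative): a tensor is produced on one stream and consumed on another, so the consumer stream must wait on the producer's work, and the caching allocator must be told via record_stream() that the tensor is still in use on the consumer stream.

import torch

assert torch.cuda.is_available(), "requires a CUDA device"
producer = torch.cuda.Stream()
consumer = torch.cuda.Stream()

# Produce a tensor on the producer stream.
with torch.cuda.stream(producer):
    grad = torch.randn(1024, 1024, device='cuda')

# Order the streams, analogous to event.record(...) / wait(event) in the patch.
event = torch.cuda.Event()
event.record(producer)
consumer.wait_event(event)

# Tell the caching allocator that `grad` is now used on the consumer stream, so
# its memory is not returned to the pool and reused before those kernels finish.
grad.record_stream(consumer)

# Consume the tensor on the consumer stream.
with torch.cuda.stream(consumer):
    accumulated = grad + grad

torch.cuda.synchronize()

Without the record_stream() call, the allocator only tracks the stream a tensor was allocated on; if the producer-side reference is dropped while the consumer's kernels are still in flight, the same memory can be recycled for a new allocation and silently corrupt the result, which is consistent with the diffs seen before adding the call.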
The change: I added the line
c10::cuda::CUDACachingAllocator::recordStream(var.storage().data_ptr(), c10::cuda::CUDAStream(*opt_accumulate_stream));
after each wait(event) in autograd/input_buffer.cpp:
if (device_of(var)->is_cuda()) {
  const auto on_producer = opt_producer_stream
      && device_of(var) == opt_producer_stream->device();
  const auto on_consumer = opt_consumer_stream
      && device_of(var) == opt_consumer_stream->device();
  if (on_producer && on_consumer) {
    // (2a)
    opt_accumulate_stream = opt_consumer_stream;
    if (opt_accumulate_stream != opt_producer_stream) {
      // (2b)
      auto event = c10::Event{c10::DeviceType::CUDA};
      event.record(*opt_producer_stream);
      opt_accumulate_stream->wait(event);
      c10::cuda::CUDACachingAllocator::recordStream(var.storage().data_ptr(), c10::cuda::CUDAStream(*opt_accumulate_stream));
    }
  } else {
    c10::optional<c10::Stream> opt_sync_stream = c10::nullopt;
    const auto guard = c10::impl::VirtualGuardImpl{c10::DeviceType::CUDA};
    if (on_consumer && !on_producer) {
      // (3a)
      opt_accumulate_stream = opt_consumer_stream;
      opt_sync_stream = guard.getDefaultStream(opt_consumer_stream->device());
      TORCH_INTERNAL_ASSERT(opt_sync_stream == guard.getStream(*device_of(var)));
    } else if (on_producer && !on_consumer) {
      // (4a)
      opt_accumulate_stream = guard.getDefaultStream(opt_producer_stream->device());
      opt_sync_stream = opt_producer_stream;
    } else {
      // (5)
      opt_accumulate_stream = guard.getDefaultStream(*device_of(var));
    }
    if (opt_sync_stream && (opt_accumulate_stream != opt_sync_stream)) {
      // (3b), (4b)
      c10::OptionalDeviceGuard device_guard{opt_sync_stream->device()};
      auto event = c10::Event{c10::DeviceType::CUDA};
      event.record(*opt_sync_stream);
      opt_accumulate_stream->wait(event);
      c10::cuda::CUDACachingAllocator::recordStream(var.storage().data_ptr(), c10::cuda::CUDAStream(*opt_accumulate_stream));
    }
  }
To Reproduce
I am attaching my test, pipeline_mlp_bug.py, which runs the same network with and without streams and compares the results (a minimal sketch of that comparison follows the repro command below).
To reproduce the behavior, run:
python pipeline_mlp_bug.py --num-chunks=8 --num-microbatches=5 --num-mp-gpus 4 --num-batches=4
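The attached test itself is not reproduced here; the following is only a minimal sketch of the kind of no-stream vs. with-stream comparison it performs, with an assumed toy model and tensor sizes:

import copy
import torch

# Minimal sketch (not the attached test): run the same model and data with and
# without a side stream, then compare gradients for bit-exactness.
torch.manual_seed(0)
model_ref = torch.nn.Linear(1024, 1024).cuda()
model_stream = copy.deepcopy(model_ref)
x = torch.randn(64, 1024, device='cuda')

# Reference run on the default stream.
model_ref(x).sum().backward()

# Same computation launched on a side stream.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    model_stream(x).sum().backward()
torch.cuda.current_stream().wait_stream(side)
torch.cuda.synchronize()

g_ref, g_stream = model_ref.weight.grad, model_stream.weight.grad
print('weight gradient equals?', torch.equal(g_ref, g_stream))
print('weight gradient max diff', (g_ref - g_stream).abs().max())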
Expected behavior
Before the change, expect to see significant diffs at the end:
______________Pipelined w/o Streams vs. With Streams ____________
out equals? tensor(False)
out max diff: tensor(0.0015, grad_fn=)
weight 0 equals?: tensor(False, device='cuda:0')
weight 0 max diff? tensor(3.7253e-09, device='cuda:0')
weight last equals?: tensor(False, device='cuda:0')
weight last max diff tensor(6.4563e-05, device='cuda:0')
weight gradient 0 equals?: tensor(False, device='cuda:0')
weight gradient 0 max diff? tensor(1.1085e-08, device='cuda:0')
weight gradient last equals?: tensor(False, device='cuda:0')
weight gradient last max diff tensor(0.0215, device='cuda:0')
loss pipelined 7.636113166809082, loss w streams 7.63759708404541
After the change, expect to see no differences between the two runs:
______________Pipelined w/o Streams vs. With Streams ____________
out equals? tensor(True)
out max diff: tensor(0., grad_fn=)
weight 0 equals?: tensor(True, device='cuda:0')
weight 0 max diff? tensor(0., device='cuda:0')
weight last equals?: tensor(True, device='cuda:0')
weight last max diff tensor(0., device='cuda:0')
weight gradient 0 equals?: tensor(True, device='cuda:0')
weight gradient 0 max diff? tensor(0., device='cuda:0')
weight gradient last equals?: tensor(True, device='cuda:0')
weight gradient last max diff tensor(0., device='cuda:0')
loss pipelined 7.636113166809082, loss w streams 7.636113166809082
Environment
(Also discussed in #7601 (comment).)
Test attached:
pipeline_mlp_bug.zip