
Make CUDAFuture remember and restore current device in callback #48789

Closed
wants to merge 4 commits into gh/lw/101/base from gh/lw/101/head

Conversation

@lw (Contributor) commented Dec 3, 2020

Stack from ghstack:

CUDAFuture aims to "capture" the current state of CUDA-related stuff when the future is marked complete (e.g., by looking at current streams and recording events on them) and then "replicate" a similar state when users synchronize with the result of the future (by synchronizing the current streams with these events).

However, one "contextual" aspect of CUDA that we weren't capturing/replicating was the current device. This diff tries to fix that. Note that we can only do this for callbacks, not for the wait() method. I don't know whether such a discrepancy between the two actually makes the overall behavior worse. I'd love to hear people's opinions on this.

Differential Revision: D25210335
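
To make the intent concrete, here is a minimal, hedged C++ sketch of the capture-and-restore idea. The struct and helper names (DeviceCapturingFuture, onMarkCompleted, wrapCallback) are hypothetical and not the actual CUDAFuture code; only c10::cuda::current_device() and c10::cuda::CUDAGuard are real APIs.

```cpp
// Minimal sketch of the capture/restore idea described above.
// Hypothetical type and method names; the real change lives in CUDAFuture.
#include <c10/cuda/CUDAFunctions.h>
#include <c10/cuda/CUDAGuard.h>

#include <functional>
#include <utility>

struct DeviceCapturingFuture {
  c10::DeviceIndex currentDevice_ = 0;

  // Mirrors the postMarkCompletedHook step: remember which device was
  // current at the moment the future was marked complete.
  void onMarkCompleted() {
    currentDevice_ = c10::cuda::current_device();
  }

  // Wrap a user callback so that it runs with the captured device set as
  // the current device; the guard restores the previous device on exit.
  std::function<void()> wrapCallback(std::function<void()> callback) const {
    const c10::DeviceIndex capturedDevice = currentDevice_;
    return [capturedDevice, callback = std::move(callback)]() {
      c10::cuda::CUDAGuard deviceGuard(capturedDevice);
      callback();
    };
  }
};
```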

@codecov

codecov bot commented Dec 3, 2020

Codecov Report

Merging #48789 (ae4e7f3) into gh/lw/101/base (b726a1b) will increase coverage by 0.02%.
The diff coverage is 64.86%.

@@                Coverage Diff                 @@
##           gh/lw/101/base   #48789      +/-   ##
==================================================
+ Coverage           80.79%   80.81%   +0.02%     
==================================================
  Files                1865     1863       -2     
  Lines              201074   200922     -152     
==================================================
- Hits               162456   162383      -73     
+ Misses              38618    38539      -79     

@dr-ci

dr-ci bot commented Dec 4, 2020

💊 CI failures summary and remediations

As of commit 4f76484 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang7_onnx_ort_test2 (1/1)

Step: "Run tests"

Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_keypoint_rcnn FAILED [ 55%]
Dec 09 17:55:11 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_index_select_scaler_index PASSED [ 53%] 
Dec 09 17:55:11 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_inplace_arithmetic PASSED [ 54%] 
Dec 09 17:55:11 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_inplace_fill PASSED [ 54%] 
Dec 09 17:55:11 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_inplace_list PASSED [ 54%] 
Dec 09 17:55:11 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_inplace_zero PASSED [ 54%] 
Dec 09 17:55:11 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_interpolate_adaptive_pooling_error PASSED [ 54%] 
Dec 09 17:55:12 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_interpolate_downsample PASSED [ 55%] 
Dec 09 17:55:12 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_interpolate_function_substitution PASSED [ 55%] 
Dec 09 17:55:12 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_interpolate_no_shape PASSED [ 55%] 
Dec 09 17:55:13 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_interpolate_upsample PASSED [ 55%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_keypoint_rcnn FAILED [ 55%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_kldiv_loss PASSED [ 56%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_l1_norm PASSED [ 56%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_l2_norm PASSED [ 56%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_layer_norm PASSED [ 56%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_le PASSED [ 56%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_le_scalar PASSED [ 56%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_len PASSED [ 57%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_len_list PASSED [ 57%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_list PASSED [ 57%] 
Dec 09 17:55:31 test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset12_onnx_shape_inference::test_list_pass PASSED [ 57%] 

This comment was automatically generated by Dr. CI.

@@ -31,6 +31,8 @@ struct TORCH_CUDA_API CUDAFuture : at::ivalue::Future {
   }
 
   void postMarkCompletedHook(const at::IValue& value) override {
+    currentDevice_ = c10::cuda::current_device();
Contributor

Hmm, why are we recording the device when the future is marked as completed, instead of remembering the device when the callback was inserted (through then/addCallback)?

Contributor Author

I guess those are two different approaches to this. The "philosophy" I was following is this: then() and addCallback() are used to perform a computation after another one is complete. If we were dealing with sync operations, one would do this:

do_sth_sync()
do_sth_later()

but with async ops this needs to change and become

fut = do_sth_async()
fut.then(do_sth_later)

In the sync scenario, do_sth_later() runs in the same "environment"/"context" as do_sth_sync() (same current device, same current streams, ...). In this diff I was trying to recreate this in the async case, by "recording" the environment's state at the end of the async operation, and then "recreating" it in the callback.

Note that the approach we have for streams is somewhat similar. We do not run the callback in the streams that were current when the callback was inserted. (To be fair, we also do not run it in the streams that were current when the async op finished, as that would mean running computations in the I/O streams, but we do synchronize the "fresh" streams with those streams.)

Note also that I am not opposed to changing this behavior and replicating the "environment" that was current when the user inserted the callback. That has its own set of advantages. For streams, for example, it allows the user to very precisely control what streams the callback will use.

Contributor

> For streams, for example, it allows the user to very precisely control what streams the callback will use.

Yep, the same goes for the device, especially when the callback is an imported function that users cannot easily change. The current behavior should be sufficient to unblock RPC use cases, so I am OK to land this PR and modify it later if necessary.
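
For reference, here is a minimal sketch of the alternative policy discussed in this thread: capturing the device when the callback is inserted via then()/addCallback() rather than when the future is marked complete. The helper name bindCurrentDevice is hypothetical and not part of this PR.

```cpp
// Hypothetical sketch of the insertion-time alternative: snapshot the device
// that is current on the thread calling then()/addCallback(), and restore it
// around the callback wherever it eventually runs.
#include <c10/cuda/CUDAFunctions.h>
#include <c10/cuda/CUDAGuard.h>

#include <functional>
#include <utility>

inline std::function<void()> bindCurrentDevice(std::function<void()> callback) {
  const c10::DeviceIndex insertionDevice = c10::cuda::current_device();
  return [insertionDevice, callback = std::move(callback)]() {
    // The guard makes insertionDevice current and restores the previous
    // device when the callback returns.
    c10::cuda::CUDAGuard deviceGuard(insertionDevice);
    callback();
  };
}
```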

@facebook-github-bot (Contributor)

This pull request has been merged in 5ab90b2.

@facebook-github-bot deleted the gh/lw/101/head branch December 14, 2020 15:17
Labels: cla signed, Merged, oncall: distributed