Completely remove FutureMessage type #50029

mrshenli · 2021-01-04T02:52:32Z

Stack from ghstack:

Apply clang-format to rpc cpp files #50236 Apply clang-format to rpc cpp files
Completely remove FutureMessage type #50029 Completely remove FutureMessage type
Completely Remove FutureMessage from RPC agents #50028 Completely Remove FutureMessage from RPC agents
Completely Remove FutureMessage from RPC cpp tests #50027 Completely Remove FutureMessage from RPC cpp tests
Remove FutureMessage from RPC request callback logic #50026 Remove FutureMessage from RPC request callback logic
Completely remove FutureMessage from FaultyProcessGroupAgent #50025 Completely remove FutureMessage from FaultyProcessGroupAgent
Remove FutureMessage from sender TensorPipeAgent #50024 Remove FutureMessage from sender TensorPipeAgent
Remove FutureMessage from sender ProcessGroupAgent #50023 Remove FutureMessage from sender ProcessGroupAgent
Completely remove FutureMessage from distributed autograd #50020 Completely remove FutureMessage from distributed autograd
Completely remove FutureMessage from RPC TorchScript implementations #50005 Completely remove FutureMessage from RPC TorchScript implementations
Completely remove FutureMessage from RRef Implementations #50004 Completely remove FutureMessage from RRef Implementations
Replace FutureMessage with ivalue::Future in RpcAgent retry logic #49995 Replace FutureMessage with ivalue::Future in RpcAgent retry logic
Replace FutureMessage with ivalue::Future in RRefContext #49960 Replace FutureMessage with ivalue::Future in RRefContext
Replace FutureMessage with ivalue::Future in distributed/autograd/utils.* #49927 Replace FutureMessage with ivalue::Future in distributed/autograd/utils.*
Let RpcAgent::send() return JitFuture #49906 Let RpcAgent::send() return JitFuture

Differential Revision: D25774915

[ghstack-poisoned]

facebook-github-bot · 2021-01-04T02:52:45Z

💊 CI failures summary and remediations

As of commit fccd346 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

This comment has been revised 12 times.

ghstack-source-id: 07beb5b3c97682ab8218e25ed9947b7a43e13738 Pull Request resolved: #50029

[ghstack-poisoned]

ghstack-source-id: 4ab7002feaf7989018d79b10c55e45455a95de32 Pull Request resolved: #50029

lw

Awesome work! Thanks so much for this undertaking!

And thanks for such a well split stack, it made reviewing it a pleasure!

I left some minor comments sprinkled around, but I'm listing here some broader thoughts that apply to the whole stack. They can all be addressed as follow-up commits.

Does it make sense to change the addCallback method of ivalue::Future so that it passes a reference to itself as first argument to the callback? This would remove all those times we create, capture and then lock a weak_ptr. This is already what we do in the Python wrapper for ivalue::Future, so it also brings a benefit of API alignment.
There's some mix-up of shared_ptr and intrusive_ptr. I wonder if we should try to consolidate on the latter?
JitFuture is an alias to ivalue::Future, which I guess was introduced for clarity to better separate it from FutureMessage. Does it make sense now to remove that alias and avoid the extra level of indirection that may hurt readability?
Futures now are created with type at::AnyClassType::get(), although they then only contain Message instances. I know nothing about JIT but is there a way to use a narrower type, both for type safety and readability?

mrshenli · 2021-01-06T20:17:33Z

@lw thanks a lot for reviewing!

Does it make sense to change the addCallback method of ivalue::Future so that it passes a reference to itself as first argument to the callback? This would remove all those times we create, capture and then lock a weak_ptr. This is already what we do in the Python wrapper for ivalue::Future, so it also brings a benefit of API alignment.

Yep, this makes sense to me and I can add that. cc @wanchaol please let us know if there are concerns.

There's some mix-up of shared_ptr and intrusive_ptr. I wonder if we should try to consolidate on the latter?

Yep, this is on my TODO list

JitFuture is an alias to ivalue::Future, which I guess was introduced for clarity to better separate it from FutureMessage. Does it make sense now to remove that alias and avoid the extra level of indirection that may hurt readability?

Yep, also added this to my TODOs.

Futures now are created with type at::AnyClassType::get(), although they then only contain Message instances. I know nothing about JIT but is there a way to use a narrower type, both for type safety and readability?

Good point. @wanchaol if we want more explicit IValue type on Message, does it mean we need to make Message an IValue type. Or is there any lighter solutions for custom classes?

Differential Revision: [D25774915](https://our.internmc.facebook.com/intern/diff/D25774915) [ghstack-poisoned]

ghstack-source-id: 67b8cf76dcda356f98fefd1af42d3ece4ac1858c Pull Request resolved: #50029

mrshenli · 2021-01-07T02:34:25Z

Micro benchmark results:

Process Group Agent (After):

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_heavy_rpc 
Rank 3 finished testing 20 times in 0.09874725341796875 seconds.
Rank 2 finished testing 20 times in 0.11173415184020996 seconds.                                                          
Rank 1 finished testing 20 times in 0.11910843849182129 seconds.                                                           
Rank 0 finished testing 20 times in 0.11921072006225586 seconds. 

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_heavy_rpc_torchscript
Rank 1 finished testing 20 times in 0.028783082962036133 seconds.
Rank 3 finished testing 20 times in 0.030823945999145508 seconds.
Rank 0 finished testing 20 times in 0.031084299087524414 seconds.
Rank 2 finished testing 20 times in 0.032097578048706055 seconds.

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_light_rpc
Rank 3 finished testing 1000 times in 0.4663074016571045 seconds.
Rank 2 finished testing 1000 times in 0.5232148170471191 seconds.
Rank 1 finished testing 1000 times in 0.5746622085571289 seconds.
Rank 0 finished testing 1000 times in 0.5750923156738281 seconds.

Process Group Agent (Before):

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_heavy_rpc
Rank 2 finished testing 20 times in 0.12031865119934082 seconds. 
Rank 1 finished testing 20 times in 0.12074995040893555 seconds.
Rank 0 finished testing 20 times in 0.12139439582824707 seconds.
Rank 3 finished testing 20 times in 0.12353968620300293 seconds. 

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_heavy_rpc_torchscript
Rank 3 finished testing 20 times in 0.035836219787597656 seconds.
Rank 1 finished testing 20 times in 0.03717660903930664 seconds.
Rank 0 finished testing 20 times in 0.03726530075073242 seconds.
Rank 2 finished testing 20 times in 0.038858652114868164 seconds.

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_light_rpc
Rank 3 finished testing 1000 times in 0.530799150466919 seconds.
Rank 2 finished testing 1000 times in 0.5381615161895752 seconds.
Rank 1 finished testing 1000 times in 0.5415563583374023 seconds.
Rank 0 finished testing 1000 times in 0.5458028316497803 seconds.

TensorPipe Agent (After):

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_heavy_rpc 
Rank 1 finished testing 20 times in 0.11632132530212402 seconds.
Rank 0 finished testing 20 times in 0.12914705276489258 seconds.
Rank 3 finished testing 20 times in 0.13401508331298828 seconds.
Rank 2 finished testing 20 times in 0.13787078857421875 seconds.

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_heavy_rpc_torchscript 
Rank 0 finished testing 20 times in 0.032578468322753906 seconds.
Rank 3 finished testing 20 times in 0.035418033599853516 seconds.
Rank 1 finished testing 20 times in 0.039852142333984375 seconds.
Rank 2 finished testing 20 times in 0.04088330268859863 seconds.

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_light_rpc
Rank 3 finished testing 1000 times in 0.1573326587677002 seconds.
Rank 2 finished testing 1000 times in 0.2936713695526123 seconds.
Rank 1 finished testing 1000 times in 0.2966134548187256 seconds.
Rank 0 finished testing 1000 times in 0.2967543601989746 seconds.

TensorPipe Agent (Before):

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_heavy_rpc
Rank 1 finished testing 20 times in 0.12186193466186523 seconds.
Rank 0 finished testing 20 times in 0.12729144096374512 seconds.
Rank 2 finished testing 20 times in 0.1270914077758789 seconds.
Rank 3 finished testing 20 times in 0.13142681121826172 seconds.

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_heavy_rpc_torchscript
Rank 1 finished testing 20 times in 0.04239463806152344 seconds.
Rank 3 finished testing 20 times in 0.04551196098327637 seconds.
Rank 2 finished testing 20 times in 0.04797840118408203 seconds.
Rank 0 finished testing 20 times in 0.06020164489746094 seconds.

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_light_rpc
Rank 1 finished testing 1000 times in 0.26510000228881836 seconds.
Rank 3 finished testing 1000 times in 0.3550393581390381 seconds.
Rank 2 finished testing 1000 times in 0.35509538650512695 seconds.
Rank 0 finished testing 1000 times in 0.36699509620666504 seconds.

mrshenli · 2021-01-07T18:43:20Z

ci-all tests in #50216

facebook-github-bot · 2021-01-08T03:51:52Z

@mrshenli merged this pull request in c480eeb.

Summary: Pull Request resolved: pytorch#50029 Test Plan: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_april_2020 -- local-preset --flow-entitlement pytorch_ftw_gpu --secure-group oncall_pytorch_distributed Before: ``` ... I0107 11:03:10.434000 3831111 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|total_examples 14000.0 I0107 11:03:10.434000 3831111 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|window_qps 74.60101318359375 I0107 11:03:10.434000 3831111 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|lifetime_qps 74.60101318359375 ... I0107 11:05:12.132000 3831111 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|total_examples 20000.0 I0107 11:05:12.132000 3831111 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|window_qps 64.0 I0107 11:05:12.132000 3831111 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|lifetime_qps 64.64917755126953 ... ``` After: ``` ... I0107 11:53:03.858000 53693 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|total_examples 14000.0 I0107 11:53:03.858000 53693 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|window_qps 72.56404876708984 I0107 11:53:03.858000 53693 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|lifetime_qps 72.56404876708984 ... I0107 11:54:24.612000 53693 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|total_examples 20000.0 I0107 11:54:24.612000 53693 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|window_qps 73.07617950439453 I0107 11:54:24.612000 53693 print_publisher.py:23 master ] Publishing batch metrics: qps-qps|lifetime_qps 73.07617950439453 ... ``` Reviewed By: lw Differential Revision: D25774915 Pulled By: mrshenli fbshipit-source-id: 1128c3c2df9d76e36beaf171557da86e82043eb9

Completely remove FutureMessage type

fb8596a

[ghstack-poisoned]

mrshenli requested review from mingzhe09088, pritamdamania87, rohan-varma and zhaojuanmao as code owners January 4, 2021 02:52

facebook-github-bot added the cla signed label Jan 4, 2021

This was referenced Jan 4, 2021

Completely remove FutureMessage from RPC TorchScript implementations #50005

Closed

Completely remove FutureMessage from distributed autograd #50020

Closed

facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Jan 4, 2021

facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jan 4, 2021

mrshenli added a commit that referenced this pull request Jan 4, 2021

Completely remove FutureMessage type

70db96c

ghstack-source-id: 07beb5b3c97682ab8218e25ed9947b7a43e13738 Pull Request resolved: #50029

Update on "Completely remove FutureMessage type"

6ed1bd3

[ghstack-poisoned]

mrshenli added a commit that referenced this pull request Jan 5, 2021

Completely remove FutureMessage type

c1e1d41

ghstack-source-id: 4ab7002feaf7989018d79b10c55e45455a95de32 Pull Request resolved: #50029

lw approved these changes Jan 5, 2021

View reviewed changes

Update on "Completely remove FutureMessage type"

d87f5fa

Differential Revision: [D25774915](https://our.internmc.facebook.com/intern/diff/D25774915) [ghstack-poisoned]

Update on "Completely remove FutureMessage type"

fccd346

Differential Revision: [D25774915](https://our.internmc.facebook.com/intern/diff/D25774915) [ghstack-poisoned]

mrshenli added a commit that referenced this pull request Jan 6, 2021

Completely remove FutureMessage type

a6eb3a1

ghstack-source-id: 67b8cf76dcda356f98fefd1af42d3ece4ac1858c Pull Request resolved: #50029

mrshenli mentioned this pull request Jan 7, 2021

Apply clang-format to rpc cpp files #50236

Closed

facebook-github-bot closed this in c480eeb Jan 8, 2021

facebook-github-bot added the Merged label Jan 8, 2021

mrshenli mentioned this pull request Jan 10, 2021

DISABLED test_rpc_profiling_remote_record_function (__main__.TensorPipeRpcTestWithSpawn) #45067

Closed

facebook-github-bot deleted the gh/mrshenli/281/head branch January 11, 2021 15:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Completely remove FutureMessage type #50029

Completely remove FutureMessage type #50029

mrshenli commented Jan 4, 2021 •

edited

Loading

facebook-github-bot commented Jan 4, 2021 •

edited

Loading

lw left a comment

mrshenli commented Jan 6, 2021

mrshenli commented Jan 7, 2021

mrshenli commented Jan 7, 2021

facebook-github-bot commented Jan 8, 2021

Completely remove FutureMessage type #50029

Completely remove FutureMessage type #50029

Conversation

mrshenli commented Jan 4, 2021 • edited Loading

facebook-github-bot commented Jan 4, 2021 • edited Loading

💊 CI failures summary and remediations

lw left a comment

Choose a reason for hiding this comment

mrshenli commented Jan 6, 2021

mrshenli commented Jan 7, 2021

mrshenli commented Jan 7, 2021

facebook-github-bot commented Jan 8, 2021

mrshenli commented Jan 4, 2021 •

edited

Loading

facebook-github-bot commented Jan 4, 2021 •

edited

Loading