Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Completely remove FutureMessage type #50029

Closed
wants to merge 4 commits into from

Conversation

mrshenli
Copy link
Contributor

@mrshenli mrshenli commented Jan 4, 2021

Stack from ghstack:

Differential Revision: D25774915

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Jan 4, 2021

💊 CI failures summary and remediations

As of commit fccd346 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

This comment has been revised 12 times.

mrshenli added a commit that referenced this pull request Jan 5, 2021
ghstack-source-id: 4ab7002feaf7989018d79b10c55e45455a95de32
Pull Request resolved: #50029
Copy link
Contributor

@lw lw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! Thanks so much for this undertaking!

And thanks for such a well split stack, it made reviewing it a pleasure!

I left some minor comments sprinkled around, but I'm listing here some broader thoughts that apply to the whole stack. They can all be addressed as follow-up commits.

  • Does it make sense to change the addCallback method of ivalue::Future so that it passes a reference to itself as first argument to the callback? This would remove all those times we create, capture and then lock a weak_ptr. This is already what we do in the Python wrapper for ivalue::Future, so it also brings a benefit of API alignment.
  • There's some mix-up of shared_ptr and intrusive_ptr. I wonder if we should try to consolidate on the latter?
  • JitFuture is an alias to ivalue::Future, which I guess was introduced for clarity to better separate it from FutureMessage. Does it make sense now to remove that alias and avoid the extra level of indirection that may hurt readability?
  • Futures now are created with type at::AnyClassType::get(), although they then only contain Message instances. I know nothing about JIT but is there a way to use a narrower type, both for type safety and readability?

@mrshenli
Copy link
Contributor Author

mrshenli commented Jan 6, 2021

@lw thanks a lot for reviewing!

Does it make sense to change the addCallback method of ivalue::Future so that it passes a reference to itself as first argument to the callback? This would remove all those times we create, capture and then lock a weak_ptr. This is already what we do in the Python wrapper for ivalue::Future, so it also brings a benefit of API alignment.

Yep, this makes sense to me and I can add that. cc @wanchaol please let us know if there are concerns.

There's some mix-up of shared_ptr and intrusive_ptr. I wonder if we should try to consolidate on the latter?

Yep, this is on my TODO list

JitFuture is an alias to ivalue::Future, which I guess was introduced for clarity to better separate it from FutureMessage. Does it make sense now to remove that alias and avoid the extra level of indirection that may hurt readability?

Yep, also added this to my TODOs.

Futures now are created with type at::AnyClassType::get(), although they then only contain Message instances. I know nothing about JIT but is there a way to use a narrower type, both for type safety and readability?

Good point. @wanchaol if we want more explicit IValue type on Message, does it mean we need to make Message an IValue type. Or is there any lighter solutions for custom classes?

mrshenli added a commit that referenced this pull request Jan 6, 2021
ghstack-source-id: 67b8cf76dcda356f98fefd1af42d3ece4ac1858c
Pull Request resolved: #50029
@mrshenli
Copy link
Contributor Author

mrshenli commented Jan 7, 2021

Micro benchmark results:

Process Group Agent (After):

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_heavy_rpc 
Rank 3 finished testing 20 times in 0.09874725341796875 seconds.
Rank 2 finished testing 20 times in 0.11173415184020996 seconds.                                                          
Rank 1 finished testing 20 times in 0.11910843849182129 seconds.                                                           
Rank 0 finished testing 20 times in 0.11921072006225586 seconds. 

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_heavy_rpc_torchscript
Rank 1 finished testing 20 times in 0.028783082962036133 seconds.
Rank 3 finished testing 20 times in 0.030823945999145508 seconds.
Rank 0 finished testing 20 times in 0.031084299087524414 seconds.
Rank 2 finished testing 20 times in 0.032097578048706055 seconds.

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_light_rpc
Rank 3 finished testing 1000 times in 0.4663074016571045 seconds.
Rank 2 finished testing 1000 times in 0.5232148170471191 seconds.
Rank 1 finished testing 1000 times in 0.5746622085571289 seconds.
Rank 0 finished testing 1000 times in 0.5750923156738281 seconds.

Process Group Agent (Before):

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_heavy_rpc
Rank 2 finished testing 20 times in 0.12031865119934082 seconds. 
Rank 1 finished testing 20 times in 0.12074995040893555 seconds.
Rank 0 finished testing 20 times in 0.12139439582824707 seconds.
Rank 3 finished testing 20 times in 0.12353968620300293 seconds. 

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_heavy_rpc_torchscript
Rank 3 finished testing 20 times in 0.035836219787597656 seconds.
Rank 1 finished testing 20 times in 0.03717660903930664 seconds.
Rank 0 finished testing 20 times in 0.03726530075073242 seconds.
Rank 2 finished testing 20 times in 0.038858652114868164 seconds.

test/distributed/rpc/test_process_group_agent.py::ProcessGroupRpcTestWithSpawn::test_stress_light_rpc
Rank 3 finished testing 1000 times in 0.530799150466919 seconds.
Rank 2 finished testing 1000 times in 0.5381615161895752 seconds.
Rank 1 finished testing 1000 times in 0.5415563583374023 seconds.
Rank 0 finished testing 1000 times in 0.5458028316497803 seconds.

TensorPipe Agent (After):

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_heavy_rpc 
Rank 1 finished testing 20 times in 0.11632132530212402 seconds.
Rank 0 finished testing 20 times in 0.12914705276489258 seconds.
Rank 3 finished testing 20 times in 0.13401508331298828 seconds.
Rank 2 finished testing 20 times in 0.13787078857421875 seconds.

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_heavy_rpc_torchscript 
Rank 0 finished testing 20 times in 0.032578468322753906 seconds.
Rank 3 finished testing 20 times in 0.035418033599853516 seconds.
Rank 1 finished testing 20 times in 0.039852142333984375 seconds.
Rank 2 finished testing 20 times in 0.04088330268859863 seconds.

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_light_rpc
Rank 3 finished testing 1000 times in 0.1573326587677002 seconds.
Rank 2 finished testing 1000 times in 0.2936713695526123 seconds.
Rank 1 finished testing 1000 times in 0.2966134548187256 seconds.
Rank 0 finished testing 1000 times in 0.2967543601989746 seconds.

TensorPipe Agent (Before):

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_heavy_rpc
Rank 1 finished testing 20 times in 0.12186193466186523 seconds.
Rank 0 finished testing 20 times in 0.12729144096374512 seconds.
Rank 2 finished testing 20 times in 0.1270914077758789 seconds.
Rank 3 finished testing 20 times in 0.13142681121826172 seconds.

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_heavy_rpc_torchscript
Rank 1 finished testing 20 times in 0.04239463806152344 seconds.
Rank 3 finished testing 20 times in 0.04551196098327637 seconds.
Rank 2 finished testing 20 times in 0.04797840118408203 seconds.
Rank 0 finished testing 20 times in 0.06020164489746094 seconds.

test/distributed/rpc/test_tensorpipe_agent.py::TensorPipeRpcTestWithSpawn::test_stress_light_rpc
Rank 1 finished testing 1000 times in 0.26510000228881836 seconds.
Rank 3 finished testing 1000 times in 0.3550393581390381 seconds.
Rank 2 finished testing 1000 times in 0.35509538650512695 seconds.
Rank 0 finished testing 1000 times in 0.36699509620666504 seconds.

@mrshenli
Copy link
Contributor Author

mrshenli commented Jan 7, 2021

ci-all tests in #50216

@facebook-github-bot
Copy link
Contributor

@mrshenli merged this pull request in c480eeb.

@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/281/head branch January 11, 2021 15:17
hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this pull request Jan 14, 2021
Summary: Pull Request resolved: pytorch#50029

Test Plan:
buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_april_2020 -- local-preset --flow-entitlement pytorch_ftw_gpu --secure-group oncall_pytorch_distributed

Before:

```
...

I0107 11:03:10.434000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|total_examples 14000.0
I0107 11:03:10.434000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|window_qps 74.60101318359375
I0107 11:03:10.434000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|lifetime_qps 74.60101318359375

...

I0107 11:05:12.132000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|total_examples 20000.0
I0107 11:05:12.132000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|window_qps 64.0
I0107 11:05:12.132000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|lifetime_qps 64.64917755126953

...
```

After:

```
...

I0107 11:53:03.858000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|total_examples 14000.0
I0107 11:53:03.858000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|window_qps 72.56404876708984
I0107 11:53:03.858000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|lifetime_qps 72.56404876708984

...

I0107 11:54:24.612000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|total_examples 20000.0
I0107 11:54:24.612000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|window_qps 73.07617950439453
I0107 11:54:24.612000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|lifetime_qps 73.07617950439453

...
```

Reviewed By: lw

Differential Revision: D25774915

Pulled By: mrshenli

fbshipit-source-id: 1128c3c2df9d76e36beaf171557da86e82043eb9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue oncall: jit Add this issue/PR to JIT oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants