Skip to content

Conversation

@rohan-varma
Copy link
Contributor

@rohan-varma rohan-varma commented Apr 7, 2022

Stack from ghstack (oldest at bottom):

All load_state_dict implementations call synchronize(), so put this into hook for preparation of doing all FSDP-specific state_dict loading logic in pre/post hooks.

Differential Revision: D35439351

All load_state_dict implementations call synchronize(), so put this into hook for preparation of doing all FSDP-specific state_dict loading logic in pre/post hooks.

Differential Revision: [D35439351](https://our.internmc.facebook.com/intern/diff/D35439351/)

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Apr 7, 2022
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Apr 7, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 8e84c5d (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge) (1/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-04-15T22:57:36.7782138Z AssertionError: 17 unit test(s) failed:
2022-04-15T22:57:36.5946024Z FAILED (errors=1, expected failures=3)
2022-04-15T22:57:36.5946255Z 
2022-04-15T22:57:36.5946400Z Generating XML reports...
2022-04-15T22:57:36.5969756Z Generated XML report: test-reports/python-unittest/distributed.rpc.test_tensorpipe_agent/TEST-TensorPipeThreeWorkersRemoteModuleTest-20220415225731.xml
2022-04-15T22:57:36.7778385Z Test exited with non-zero exitcode 1. Command to reproduce: /opt/conda/bin/python distributed/rpc/test_tensorpipe_agent.py --import-disabled-tests --import-slow-tests -v TensorPipeThreeWorkersRemoteModuleTest.test_send_remote_module_over_the_wire_script_not_supported
2022-04-15T22:57:36.7779311Z Traceback (most recent call last):
2022-04-15T22:57:36.7779685Z   File "distributed/rpc/test_tensorpipe_agent.py", line 38, in <module>
2022-04-15T22:57:36.7780291Z     run_tests()
2022-04-15T22:57:36.7780843Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 634, in run_tests
2022-04-15T22:57:36.7781380Z     len(failed_tests), '\n\t'.join(failed_tests))
2022-04-15T22:57:36.7782138Z AssertionError: 17 unit test(s) failed:
2022-04-15T22:57:36.7782556Z 	TensorPipeDdpComparisonTest.test_ddp_dist_autograd_local_vs_remote
2022-04-15T22:57:36.7782995Z 	TensorPipeRemoteModuleTest.test_bad_module
2022-04-15T22:57:36.7783389Z 	TensorPipeRemoteModuleTest.test_forward_async
2022-04-15T22:57:36.7783826Z 	TensorPipeRemoteModuleTest.test_forward_async_script
2022-04-15T22:57:36.7784230Z 	TensorPipeRemoteModuleTest.test_forward_sync
2022-04-15T22:57:36.7784666Z 	TensorPipeRemoteModuleTest.test_forward_sync_script
2022-04-15T22:57:36.7785100Z 	TensorPipeRemoteModuleTest.test_forward_with_kwargs
2022-04-15T22:57:36.7785557Z 	TensorPipeRemoteModuleTest.test_get_module_rref
2022-04-15T22:57:36.7786044Z 	TensorPipeRemoteModuleTest.test_remote_module_py_pickle_not_supported
2022-04-15T22:57:36.7786595Z 	TensorPipeRemoteModuleTest.test_remote_module_py_pickle_not_supported_script

See GitHub Actions build pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) (2/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-04-15T23:21:08.2776776Z AssertionError: 4 unit test(s) failed:
2022-04-15T23:21:08.0158651Z 
2022-04-15T23:21:08.0158768Z OK (skipped=1)
2022-04-15T23:21:08.0158925Z 
2022-04-15T23:21:08.0159054Z Generating XML reports...
2022-04-15T23:21:08.0203862Z Generated XML report: test-reports/python-unittest/distributed.rpc.cuda.test_tensorpipe_agent/TEST-TensorPipeTensorPipeCudaDistAutogradTest-20220415232104.xml
2022-04-15T23:21:08.2771950Z Traceback (most recent call last):
2022-04-15T23:21:08.2772781Z   File "distributed/rpc/cuda/test_tensorpipe_agent.py", line 34, in <module>
2022-04-15T23:21:08.2773484Z     run_tests()
2022-04-15T23:21:08.2774614Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 634, in run_tests
2022-04-15T23:21:08.2776187Z     len(failed_tests), '\n\t'.join(failed_tests))
2022-04-15T23:21:08.2776776Z AssertionError: 4 unit test(s) failed:
2022-04-15T23:21:08.2777496Z 	TensorPipeCudaRemoteModuleTest.test_input_moved_to_cuda_device
2022-04-15T23:21:08.2778338Z 	TensorPipeCudaRemoteModuleTest.test_input_moved_to_cuda_device_script
2022-04-15T23:21:08.2779111Z 	TensorPipeCudaRemoteModuleTest.test_invalid_devices
2022-04-15T23:21:08.2779875Z 	TensorPipeCudaRemoteModuleTest.test_valid_device
2022-04-15T23:21:08.4188878Z Traceback (most recent call last):
2022-04-15T23:21:08.4189252Z   File "test/run_test.py", line 1058, in <module>
2022-04-15T23:21:08.4193163Z     main()
2022-04-15T23:21:08.4194076Z   File "test/run_test.py", line 1036, in main
2022-04-15T23:21:08.4196264Z     raise RuntimeError(err_message)
2022-04-15T23:21:08.4196983Z RuntimeError: distributed/rpc/cuda/test_tensorpipe_agent failed!

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

Copy link
Collaborator

@awgu awgu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactor!

Copy link
Contributor

@zhaojuanmao zhaojuanmao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some lint error

All load_state_dict implementations call synchronize(), so put this into hook for preparation of doing all FSDP-specific state_dict loading logic in pre/post hooks.

Differential Revision: [D35439351](https://our.internmc.facebook.com/intern/diff/D35439351/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 15, 2022
Pull Request resolved: #75424

All load_state_dict implementations call synchronize(), so put this into hook for preparation of doing all FSDP-specific state_dict loading logic in pre/post hooks.
ghstack-source-id: 154042197

Differential Revision: [D35439351](https://our.internmc.facebook.com/intern/diff/D35439351/)
@facebook-github-bot
Copy link
Contributor

@pytorchbot merge this

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Copy link
Collaborator

Merge failed due to Matched rule superuser, but it was not reviewed yet by any of:kyulee-com,kausv,jackm321,erichan1,jingsh, ...
Raised by https://github.com/pytorch/pytorch/actions/runs/2177652167

@malfet
Copy link
Contributor

malfet commented Apr 18, 2022

@pytorchbot merge this

@pytorchmergebot
Copy link
Collaborator

Merge failed due to Matched rule superuser, but it was not reviewed yet by any of:kumpera,tvalentius,ZolotukhinM,xinyang0,ccongge, ...
Raised by https://github.com/pytorch/pytorch/actions/runs/2184509704

malfet pushed a commit that referenced this pull request Apr 18, 2022
Pull Request resolved: #75424

All load_state_dict implementations call synchronize(), so put this into hook for preparation of doing all FSDP-specific state_dict loading logic in pre/post hooks.
ghstack-source-id: 154042197

Differential Revision: [D35439351](https://our.internmc.facebook.com/intern/diff/D35439351/)
@malfet
Copy link
Contributor

malfet commented Apr 18, 2022

Manually edited orig branch to match with internal ordering changes

@malfet
Copy link
Contributor

malfet commented Apr 18, 2022

@pytorchbot merge this

@github-actions
Copy link
Contributor

Hey @rohan-varma.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Apr 18, 2022
Summary:
Pull Request resolved: #75424

All load_state_dict implementations call synchronize(), so put this into hook for preparation of doing all FSDP-specific state_dict loading logic in pre/post hooks.
ghstack-source-id: 154042197

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D35439351

fbshipit-source-id: 7236de6af9aaf2d91c4fe9d29b600a56b520897b
malfet pushed a commit that referenced this pull request Apr 20, 2022
Pull Request resolved: #75424

All load_state_dict implementations call synchronize(), so put this into hook for preparation of doing all FSDP-specific state_dict loading logic in pre/post hooks.

Differential Revision: [D35439351](https://our.internmc.facebook.com/intern/diff/D35439351/)

Approved by: https://github.com/awgu, https://github.com/zhaojuanmao, https://github.com/malfet

(cherry picked from commit 9e864f4)
@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/542/head branch April 22, 2022 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants