Refactor State API's persistence methods #4

kiukchung · 2019-11-24T20:19:10Z

Summary:
Makes the State API's persistence (rollback and serialization) more coherent, consistent, and natural. Does the following:

* Renames `deep_copy` and `rollback` to `snapshot` and `apply`.
* The semantics of `snapshot` and `apply` are that the state is recoverable by:

      any_user_defined_snapshot_obj = state.snapshot()
      modify_state(state)
      state.apply(any_user_defined_snapshot_obj)
      state.sync()

* Renames `serialize` and `deserialize` to `save` and `load` (to be consistent with torch).
* `State` provides a default implementation of `save` and `load` using `snapshot` and `apply`.
* Removes the redundant `supports_rollback()` method from `State`. By not implementing `snapshot`/`apply`, the user indicates that rollback is not supported on the `State` object. A user who wants checkpointing but not rollback can implement `save`/`load` and leave `snapshot`/`apply` unimplemented. A user who wants rollback support loses no performance (in comparison) by also doing checkpoints, so they might as well get checkpointing for free.
* Makes changes to the `test_mock` and `elastic classy_vision` code to be compliant with the new API.
* Makes the imagenet example compliant with the new API.
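
The persistence contract described in the list above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `SketchState`, `CounterState`, `CheckpointOnlyState`, the stream-based `save`/`load` signatures, the use of `pickle` and `json`, and the no-argument `sync()` are hypothetical and are not taken from the torchelastic source.

```python
import copy
import io
import json
import pickle


class SketchState:
    """Hypothetical base class (not the real torchelastic State)."""

    def snapshot(self):
        # Leaving this unimplemented signals "rollback not supported".
        raise NotImplementedError("rollback not supported")

    def apply(self, snapshot):
        raise NotImplementedError("rollback not supported")

    def sync(self):
        # Placeholder: a real implementation would synchronize across workers.
        pass

    def save(self, stream):
        # Default checkpointing built on top of snapshot().
        pickle.dump(self.snapshot(), stream)

    def load(self, stream):
        # Default restore built on top of apply().
        self.apply(pickle.load(stream))


class CounterState(SketchState):
    """Supports rollback, so it gets save/load for free."""

    def __init__(self):
        self.epoch = 0
        self.history = []

    def snapshot(self):
        return copy.deepcopy({"epoch": self.epoch, "history": self.history})

    def apply(self, snapshot):
        self.epoch = snapshot["epoch"]
        self.history = list(snapshot["history"])


class CheckpointOnlyState(SketchState):
    """Implements save/load only; snapshot/apply stay unimplemented."""

    def __init__(self):
        self.step = 0

    def save(self, stream):
        stream.write(json.dumps({"step": self.step}).encode("utf-8"))

    def load(self, stream):
        self.step = json.loads(stream.read().decode("utf-8"))["step"]


def modify_state(state):
    state.epoch += 1
    state.history.append("step")


# Rollback: the recovery sequence from the description.
state = CounterState()
snap = state.snapshot()
modify_state(state)
state.apply(snap)
state.sync()
rolled_back = (state.epoch, list(state.history))

# Checkpoint: the default save/load round trip through a byte stream.
modify_state(state)
buffer = io.BytesIO()
state.save(buffer)
buffer.seek(0)
restored = CounterState()
restored.load(buffer)
```

In this sketch, a subclass that only overrides `snapshot` and `apply` inherits working `save` and `load`, while a subclass that only overrides `save` and `load` raises `NotImplementedError` on any rollback attempt, which takes the place of a separate `supports_rollback()` query.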

NOTE: This change breaks the imagenet example under //fblearner/flow/projects/pytorch/elastic/imagenet. However, this example was already broken and has zero users. The task to fix it is T57831531.

Differential Revision: D18672302

fbshipit-source-id: 2da47843f74b324adb4e620cc7927feff02e7145

Summary: This is my first pass at describing the torchelastic rendezvous concept. Obviously it's not finished, but getting this out earlier for initial RFC.

Reviewed By: kiukchung

Differential Revision: D18613818

fbshipit-source-id: 235281ad965a42caa0b477a444b790b18fe7dc40

…to collectives_test not destroying process groups properly

Summary:
1. Sets up etcd server properly in circle ci so that etcd_server_fixture can use the binary to spin up a standalone etcd server for running end-to-end unit tests.
2. Properly destroys process groups in `collectives_test.py`, which was causing the `etcd_elastic_trainer_test` to hang.

Reviewed By: vladbelous

Differential Revision: D18650293

fbshipit-source-id: a1c16ab7ab2a4d116ce205bf7fc4f5f2379777c2

facebook-github-bot · 2019-11-24T20:19:29Z

This pull request was exported from Phabricator. Differential Revision: D18672302

Summary:
Pull Request resolved: pytorch/elastic#4

Makes state API's persistence (rollback and serialization) more coherent, consistent, and natural. Does the following:

* Renames `deep_copy` and `rollback` to `snapshot` and `apply`
* The semantics of `snapshot` and `apply` is that the state is recoverable by:
```
any_user_defined_snapshot_obj = state.snapshot()
modify_state(state)
state.apply(any_user_defined_snapshot_obj)
state.sync()
```
* Renames `serialize` and `deserialize` to `save` and `load` (to be consistent with torch)
* `State` provides a default implementation of `save` and `load` using `snapshot` and `apply`.
* Removes the redundant `supports_rollback()` method from `State`. By not implementing `snapshot/apply` the user indicates that rollback is not supported on the `State` object. If the user wants to checkpoint but not rollback they can implement the `save/load` and not implement `snapshot/apply`. If the user wants rollback support, they lose no performance (in comparison) in doing checkpoints so they might as well get checkpoint for free.
* Makes changes to the `test_mock` and `elastic classy_vision` code to be compliant with the new API.
* Makes imagenet example compliant with the new API.

NOTE: This change renders the imagenet example under `//fblearner/flow/projects/pytorch/elastic/imagenet` broken. However this example was already broken and has zero users. The task to fix this is T57831531.

(Note: this ignores all push blocking failures!)

Reviewed By: vreis

Differential Revision: D18672302

fbshipit-source-id: 88718b790f8f0fe7ae0381a2e40455af7e2ba0ce

…nt, fix bug in petctl setup where None was being passed to cfn param, pump docker logs to cloudwatch

Summary:
1. Uses docker log-driver `awslogs` to make docker output go to CloudWatch.
2. Item 1 creates a log group called `torchelastic/$USER` in CloudWatch and creates log streams (one per worker) called `$job_name/$instance_id`.
3. Fixes a bug in `petctl setup` where, if no EFS and S3 buckets are specified, `None` is passed to the cfn template param, which throws a validation error because it expects a string.
4. Fixes an issue with the cfn template where the CloudWatch IAM managed policy was being created with a specific name, preventing multiple stacks from being created in the same account.

Thanks to Vinicius Reis for testing `petctl` and reporting bugs 3 and 4.

Reviewed By: vreis

Differential Revision: D18826855

fbshipit-source-id: 2d75f607734135ab6d5301fc636501a38cfee9d9

facebook-github-bot and others added 8 commits November 20, 2019 08:00

Initial commit

85beb6b

fbshipit-source-id: 2da47843f74b324adb4e620cc7927feff02e7145

download etcd server to /tmp/etcd since path expansion does not work …

5dbd924

…in circleci

Summary: Title says it all

Reviewed By: vladbelous

Differential Revision: D18652743

fbshipit-source-id: b7732f80a037d322aed337a33ec9f2987b114927

Fix CircleCI badge in README.md

c468aaf

Merge pull request #2 from pytorch/kiukchung-patch-1

19c9a4d

Fix CircleCI badge in README.md

Add VERSION file, change setup.py to read the version from this file.…

8db77bd

… Add docker_build script

Summary: See title.

Reviewed By: mehta-vikas

Differential Revision: D18654271

fbshipit-source-id: 37ca77fb399dd3fd333dc3aa4162706c6d423376

facebook-github-bot added the fb-exported label Nov 24, 2019

kiukchung closed this Nov 26, 2019

kiukchung force-pushed the master branch from 8db77bd to 7e35f9b Compare November 26, 2019 05:45

dev777-create mentioned this pull request Dec 30, 2019

training hang when remove/add instances #25

Closed

11 tasks
