This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Refactor State API's persistence methods #4

Closed
wants to merge 8 commits into from


kiukchung
Contributor

Summary:
Makes the State API's persistence (rollback and serialization) more coherent, consistent, and natural. This change does the following:

  • Renames deep_copy and rollback to snapshot and apply
  • The semantics of snapshot and apply are that the state is recoverable by:
      any_user_defined_snapshot_obj = state.snapshot()
      modify_state(state)
      state.apply(any_user_defined_snapshot_obj)
      state.sync()
  • Renames serialize and deserialize to save and load (to be consistent with torch)
  • State provides a default implementation of save and load using snapshot and apply.
  • Removes the redundant supports_rollback() method from State. By not implementing snapshot/apply, the user indicates that rollback is not supported on the State object. A user who wants to checkpoint but not roll back can implement save/load without implementing snapshot/apply. A user who wants rollback support loses no performance (in comparison) when checkpointing, so they get checkpointing for free through the default save/load.
  • Makes changes to the test_mock and elastic classy_vision code to be compliant with the new API.
  • Makes imagenet example compliant with the new API.

NOTE: This change breaks the imagenet example under //fblearner/flow/projects/pytorch/elastic/imagenet. However, this example was already broken and has zero users. The task to fix this is T57831531.

Differential Revision: D18672302

fbshipit-source-id: 2da47843f74b324adb4e620cc7927feff02e7145
Summary:
This is my first pass at describing the torchelastic rendezvous concept.

It is not finished, but I am getting this out early for an initial RFC.

Reviewed By: kiukchung

Differential Revision: D18613818

fbshipit-source-id: 235281ad965a42caa0b477a444b790b18fe7dc40
…to collectives_test not destroying process groups properly

Summary:
1. Sets up the etcd server properly in CircleCI so that etcd_server_fixture can use the binary to spin up a standalone etcd server for running end-to-end unit tests.

2. Properly destroys process groups in `collectives_test.py`. Not destroying them was causing `etcd_elastic_trainer_test` to hang.

Reviewed By: vladbelous

Differential Revision: D18650293

fbshipit-source-id: a1c16ab7ab2a4d116ce205bf7fc4f5f2379777c2
…in circleci

Summary: Title says it all

Reviewed By: vladbelous

Differential Revision: D18652743

fbshipit-source-id: b7732f80a037d322aed337a33ec9f2987b114927
… Add docker_build script

Summary: See title.

Reviewed By: mehta-vikas

Differential Revision: D18654271

fbshipit-source-id: 37ca77fb399dd3fd333dc3aa4162706c6d423376
Summary:
Makes the State API's persistence (rollback and serialization) more coherent, consistent, and natural. This change does the following:

* Renames `deep_copy` and `rollback` to `snapshot` and `apply`
* The semantics of `snapshot` and `apply` are that the state is recoverable by:
  ```
  any_user_defined_snapshot_obj = state.snapshot()
  modify_state(state)
  state.apply(any_user_defined_snapshot_obj)
  state.sync()
  ```
* Renames `serialize` and `deserialize` to `save` and `load` (to be consistent with torch)
* `State` provides a default implementation of `save` and `load` using `snapshot` and `apply`.
* Removes the redundant `supports_rollback()` method from `State`. By not implementing `snapshot/apply`, the user indicates that rollback is not supported on the `State` object. A user who wants to checkpoint but not roll back can implement `save/load` without implementing `snapshot/apply`. A user who wants rollback support loses no performance (in comparison) when checkpointing, so they get checkpointing for free through the default `save/load`.
* Makes changes to the `test_mock` and `elastic classy_vision` code to be compliant with the new API.
* Makes imagenet example compliant with the new API.
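
The contract described in the list above can be illustrated with a minimal sketch. This is not the torchelastic implementation: the `State` base class body, the `CounterState` subclass, the stream-based `save`/`load` signatures, and the use of `pickle` for the default serialization are all assumptions made for illustration. Only the method names `snapshot`, `apply`, `sync`, `save`, and `load` come from the summary.

```python
import copy
import io
import pickle


class State:
    """Hypothetical base class mirroring the API names in the summary."""

    def snapshot(self):
        # Leaving this unimplemented signals that rollback is unsupported.
        raise NotImplementedError

    def apply(self, snapshot):
        raise NotImplementedError

    def sync(self):
        # Placeholder: a real sync would reconcile state across workers.
        pass

    def save(self, stream):
        # Assumed default: serialize whatever snapshot() returns.
        pickle.dump(self.snapshot(), stream)

    def load(self, stream):
        # Assumed default: restore by applying the deserialized snapshot.
        self.apply(pickle.load(stream))


class CounterState(State):
    """Illustrative user-defined state with rollback support."""

    def __init__(self):
        self.step = 0
        self.history = []

    def snapshot(self):
        # Deep copy so later mutations do not leak into the snapshot.
        return copy.deepcopy({"step": self.step, "history": self.history})

    def apply(self, snapshot):
        self.step = snapshot["step"]
        self.history = list(snapshot["history"])


def modify_state(state):
    state.step += 1
    state.history.append(state.step)


state = CounterState()
modify_state(state)

# Rollback: snapshot, modify, then apply and sync to recover.
snap = state.snapshot()
modify_state(state)
state.apply(snap)
state.sync()

# Checkpoint: the default save/load come for free from snapshot/apply.
buf = io.BytesIO()
state.save(buf)
buf.seek(0)
restored = CounterState()
restored.load(buf)
```

A state that only needs checkpointing could instead override `save` and `load` directly and leave `snapshot` and `apply` unimplemented, which is how this sketch would express "rollback not supported".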

NOTE: This change breaks the imagenet example under `//fblearner/flow/projects/pytorch/elastic/imagenet`. However, this example was already broken and has zero users. The task to fix this is T57831531.

Differential Revision: D18672302

fbshipit-source-id: 28ecb437b8308f97f6839e43cf51823c6b45fda4
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D18672302

@kiukchung closed this on Nov 26, 2019
facebook-github-bot pushed a commit to facebookresearch/ClassyVision that referenced this pull request Nov 26, 2019
Summary:
Pull Request resolved: pytorch/elastic#4

(Note: this ignores all push blocking failures!)

Reviewed By: vreis

Differential Revision: D18672302

fbshipit-source-id: 88718b790f8f0fe7ae0381a2e40455af7e2ba0ce
facebook-github-bot pushed a commit to facebookresearch/ClassyVision that referenced this pull request Dec 1, 2019
Summary:
Pull Request resolved: pytorch/elastic#4

(Note: this ignores all push blocking failures!)

Reviewed By: vreis

Differential Revision: D18672302

fbshipit-source-id: 849b6cdcc5cb21e95406b42fd22d5b3d6d9a6f66
facebook-github-bot pushed a commit that referenced this pull request Dec 5, 2019
…nt, fix bug in petctl setup where None was being passed to cfn param, pump docker logs to cloudwatch

Summary:
1. Uses the docker log-driver `awslogs` to send docker output to CloudWatch (see screenshots below)
2. As a result of item 1, a log group called `torchelastic/$USER` is created in CloudWatch, with one log stream per worker called `$job_name/$instance_id`
3. Fixes a bug in `petctl setup` where, if no efs and s3 buckets are specified, `None` is passed to the cfn template param, which throws a validation error because the param expects a string
4. Fixes an issue with the cfn template where the CloudWatch IAM managed policy was being created with a specific name, which prevented multiple stacks from being created in the same account.

#thanks Vinicius Reis for testing `petctl` and reporting bugs #3 and #4.

{F223965947}
{F223965943}

Reviewed By: vreis

Differential Revision: D18826855

fbshipit-source-id: 2d75f607734135ab6d5301fc636501a38cfee9d9