This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
Refactor State API's persistence methods #4
Closed
fbshipit-source-id: 2da47843f74b324adb4e620cc7927feff02e7145
Summary: This is my first pass at describing the torchelastic rendezvous concept. Obviously it's not finished, but getting this out earlier for initial RFC. Reviewed By: kiukchung Differential Revision: D18613818 fbshipit-source-id: 235281ad965a42caa0b477a444b790b18fe7dc40
…to collectives_test not destroying process groups properly Summary: 1. Sets up etcd server properly in circle ci so that etcd_server_fixture can use the binary to spin up a standalone etcd server for running end-to-end unittests. 2. Properly destroys process groups in `collectives_test.py` which was causing the `etcd_elastic_trainer_test` to hang. Reviewed By: vladbelous Differential Revision: D18650293 fbshipit-source-id: a1c16ab7ab2a4d116ce205bf7fc4f5f2379777c2
…in circleci Summary: Title says it all Reviewed By: vladbelous Differential Revision: D18652743 fbshipit-source-id: b7732f80a037d322aed337a33ec9f2987b114927
Fix CircleCI badge in README.md
… Add docker_build script Summary: See title. Reviewed By: mehta-vikas Differential Revision: D18654271 fbshipit-source-id: 37ca77fb399dd3fd333dc3aa4162706c6d423376
Summary: Makes state API's persistence (rollback and serialization) more coherent, consistent, and natural. Does the following:

* Renames `deep_copy` and `rollback` to `snapshot` and `apply`
* The semantics of `snapshot` and `apply` is that the state is recoverable by:
  ```
  any_user_defined_snapshot_obj = state.snapshot()
  modify_state(state)
  state.apply(any_user_defined_snapshot_obj)
  state.sync()
  ```
* Renames `serialize` and `deserialize` to `save` and `load` (to be consistent with torch)
* `State` provides a default implementation of `save` and `load` using `snapshot` and `apply`.
* Removes the redundant `supports_rollback()` method from `State`. By not implementing `snapshot/apply` the user indicates that rollback is not supported on the `State` object. If the user wants to checkpoint but not rollback they can implement the `save/load` and not implement `snapshot/apply`. If the user wants rollback support, they lose no performance (in comparison) in doing checkpoints so they might as well get checkpoint for free.
* Makes changes to the `test_mock` and `elastic classy_vision` code to be compliant with the new API.
* Makes imagenet example compliant with the new API.

NOTE: This change renders the imagenet example under `//fblearner/flow/projects/pytorch/elastic/imagenet` broken. However this example was already broken and has zero users. The task to fix this is T57831531.

Differential Revision: D18672302

fbshipit-source-id: 28ecb437b8308f97f6839e43cf51823c6b45fda4
This pull request was exported from Phabricator. Differential Revision: D18672302
facebook-github-bot pushed a commit to facebookresearch/ClassyVision that referenced this pull request on Nov 26, 2019
Summary: Pull Request resolved: pytorch/elastic#4 Makes state API's persistence (rollback and serialization) more coherent, consistent, and natural. Does the following: * Renames `deep_copy` and `rollback` to `snapshot` and `apply` * The semantics of `snapshot` and `apply` is that the state is recoverable by: ``` any_user_defined_snapshot_obj = state.snapshot() modify_state(state) state.apply(any_user_defined_snapshot_obj) state.sync() ``` * Renames `serialize` and `deserialize` to `save` and `load` (to be consistent with torch) * `State` provides a default implementation of `save` and `load` using `snapshot` and `apply`. * Removes the redundant `supports_rollback()` method from `State`. By not implementing `snapshot/apply` the user indicates that rollback is not supported on the `State` object. If the user wants to checkpoint but not rollback they can implement the `save/load` and not implement `snapshot/apply`. If the user wants rollback support, they lose no performance (in comparison) in doing checkpoints so they might as well get checkpoint for free. * Makes changes to the `test_mock` and `elastic classy_vision` code to be compliant with the new API. * Makes imagenet example compliant with the new API. NOTE: This change renders the imagenet example under `//fblearner/flow/projects/pytorch/elastic/imagenet` broken. However this example was already broken and has zero users. The task to fix this is T57831531. (Note: this ignores all push blocking failures!) Reviewed By: vreis Differential Revision: D18672302 fbshipit-source-id: 88718b790f8f0fe7ae0381a2e40455af7e2ba0ce
facebook-github-bot pushed a commit to facebookresearch/ClassyVision that referenced this pull request on Dec 1, 2019
Summary: Pull Request resolved: pytorch/elastic#4 Makes state API's persistence (rollback and serialization) more coherent, consistent, and natural. Does the following: * Renames `deep_copy` and `rollback` to `snapshot` and `apply` * The semantics of `snapshot` and `apply` is that the state is recoverable by: ``` any_user_defined_snapshot_obj = state.snapshot() modify_state(state) state.apply(any_user_defined_snapshot_obj) state.sync() ``` * Renames `serialize` and `deserialize` to `save` and `load` (to be consistent with torch) * `State` provides a default implementation of `save` and `load` using `snapshot` and `apply`. * Removes the redundant `supports_rollback()` method from `State`. By not implementing `snapshot/apply` the user indicates that rollback is not supported on the `State` object. If the user wants to checkpoint but not rollback they can implement the `save/load` and not implement `snapshot/apply`. If the user wants rollback support, they lose no performance (in comparison) in doing checkpoints so they might as well get checkpoint for free. * Makes changes to the `test_mock` and `elastic classy_vision` code to be compliant with the new API. * Makes imagenet example compliant with the new API. NOTE: This change renders the imagenet example under `//fblearner/flow/projects/pytorch/elastic/imagenet` broken. However this example was already broken and has zero users. The task to fix this is T57831531. (Note: this ignores all push blocking failures!) Reviewed By: vreis Differential Revision: D18672302 fbshipit-source-id: 849b6cdcc5cb21e95406b42fd22d5b3d6d9a6f66
facebook-github-bot pushed a commit that referenced this pull request on Dec 5, 2019
…nt, fix bug in petctl setup where None was being passed to cfn param, pump docker logs to cloudwatch

Summary:
1. Uses docker log-driver == awslogs to make docker output go to CloudWatch (see screenshots below).
2. #1 creates a log group called `torchelastic/$USER` in CW and creates log streams (one per worker) called `$job_name/$instance_id`.
3. Fixes a bug in `petctl setup` where, if no efs and s3 buckets are specified, `None` is passed to the cfn template param, which throws a validation error because it expects a string.
4. Fixes an issue with the cfn template where the CloudWatch IAM managed policy was being created with a specific name, preventing multiple stacks from being created in the same account.

#thanks Vinicius Reis for testing `petctl` and reporting bugs #3 and #4.

{F223965947} {F223965943}

Reviewed By: vreis

Differential Revision: D18826855

fbshipit-source-id: 2d75f607734135ab6d5301fc636501a38cfee9d9
Summary:
Makes state API's persistence (rollback and serialization) more coherent, consistent, and natural. Does the following:

* Renames `deep_copy` and `rollback` to `snapshot` and `apply`
* The semantics of `snapshot` and `apply` is that the state is recoverable by:

      any_user_defined_snapshot_obj = state.snapshot()
      modify_state(state)
      state.apply(any_user_defined_snapshot_obj)
      state.sync()

* Renames `serialize` and `deserialize` to `save` and `load` (to be consistent with torch)
* `State` provides a default implementation of `save` and `load` using `snapshot` and `apply`.
* Removes the redundant `supports_rollback()` method from `State`. By not implementing `snapshot/apply` the user indicates that rollback is not supported on the `State` object. If the user wants to checkpoint but not rollback they can implement the `save/load` and not implement `snapshot/apply`. If the user wants rollback support, they lose no performance (in comparison) in doing checkpoints so they might as well get checkpoint for free.
* Makes changes to the `test_mock` and `elastic classy_vision` code to be compliant with the new API.
* Makes imagenet example compliant with the new API.

NOTE: This change renders the imagenet example under `//fblearner/flow/projects/pytorch/elastic/imagenet` broken. However this example was already broken and has zero users. The task to fix this is T57831531.

Differential Revision: D18672302
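
To make the contract in the summary concrete, here is a minimal sketch in plain Python. It is not the torchelastic implementation. The method names `snapshot`, `apply`, `sync`, `save`, and `load` come from the summary above; everything else is an assumption made for illustration: the `stream` argument on `save`/`load`, the use of `pickle` for the default serialization, and the hypothetical `CounterState` and `CheckpointOnlyState` subclasses.

```python
import copy
import io
import pickle


class State:
    """Hypothetical base class; only the method names come from the PR summary."""

    def snapshot(self):
        # Leaving this unimplemented signals that rollback is not supported.
        raise NotImplementedError

    def apply(self, snapshot):
        raise NotImplementedError

    def sync(self):
        # No-op in this single-process sketch.
        pass

    def save(self, stream):
        # Assumed default: serialize whatever snapshot() returns.
        pickle.dump(self.snapshot(), stream)

    def load(self, stream):
        # Assumed default: restore by applying the deserialized snapshot.
        self.apply(pickle.load(stream))


class CounterState(State):
    """Supports rollback, so it gets checkpointing through the defaults."""

    def __init__(self):
        self.epoch = 0
        self.losses = []

    def snapshot(self):
        # Deep copy so later mutations do not leak into the snapshot.
        return copy.deepcopy({"epoch": self.epoch, "losses": self.losses})

    def apply(self, snapshot):
        self.epoch = snapshot["epoch"]
        self.losses = list(snapshot["losses"])


class CheckpointOnlyState(State):
    """Implements save/load only: checkpointing without rollback."""

    def __init__(self):
        self.step = 0

    def save(self, stream):
        pickle.dump(self.step, stream)

    def load(self, stream):
        self.step = pickle.load(stream)


# Rollback: snapshot, modify, apply, sync.
state = CounterState()
state.epoch, state.losses = 3, [0.9, 0.5]
snap = state.snapshot()
state.epoch = 4
state.losses.append(0.1)
state.apply(snap)
state.sync()

# Checkpoint round trip through the default save/load.
buf = io.BytesIO()
state.save(buf)
buf.seek(0)
restored = CounterState()
restored.load(buf)
```

In this sketch, `CounterState` never defines `save` or `load`, yet it can be checkpointed because the base class builds them on `snapshot` and `apply`. `CheckpointOnlyState` shows the opposite choice: calling `snapshot()` on it raises `NotImplementedError`, which is how the absence of rollback support would surface without a separate `supports_rollback()` method.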