This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Redefine should_save_checkpoint in state #8

@kiukchung

Description

🚀 Feature

Currently, the state API has a should_save_checkpoint method, which has a few issues:

  1. CheckpointUtil assumes that all workers will return the same value from should_save_checkpoint.
  2. CheckpointUtil chooses the worker with rank == 0 as the "representative" to load the checkpoint, then leans on sync() to broadcast that state to the other workers (see the sketch after this list).
  3. Rank 0 (the choice in point 2) may not be the correct choice across different use-cases. The "correct" logic would be to choose the worker with the "most-tenured" (i.e., most up-to-date) state to broadcast its state.
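
For reference, here is a minimal sketch of the rank-0 pattern described in point 2. It is illustrative only, not the actual torchelastic code; the load_checkpoint_pattern name and the checkpoint argument are assumptions made for this example:

import torch.distributed as dist

def load_checkpoint_pattern(state, checkpoint):
    # Rank 0 is hard-coded as the "representative" that loads the
    # checkpoint from storage.
    if dist.get_rank() == 0:
        state.load(checkpoint)
    # sync() is then expected to broadcast rank 0's state to every
    # other worker.
    state.sync()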

Motivation

The checkpoint feature in torchelastic has several caveats (see above). Cleaning this logic up would make it clear to users how to implement their state objects, and would make it easier to reason about how checkpoints are loaded and saved and how that interacts with implementing sync(), load(), and save() in the state class.

Pitch

Here's one way we could achieve this:

  1. Define a get_most_tenured API that the user has to implement, returning the rank of the worker with the most up-to-date state, which should be shared with the other workers on a rendezvous event.

  2. Add a helper to broadcast state objects to the workers; this helper can be called in the sync() method. For instance:

def get_most_tenured_rank() -> int:
    # Return the rank that has the most up-to-date state
    # (or simply a consistent rank, e.g. 0).
    pass

def sync(state):
    most_tenured_rank = get_most_tenured_rank()
    # Proposed helper: broadcast the state object from
    # most_tenured_rank to all other workers.
    dist_util.broadcast_state(state, most_tenured_rank)
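
As a concrete example, get_most_tenured_rank could all-gather a per-worker progress counter and pick the rank with the largest value. A minimal sketch, assuming torch.distributed is already initialized and assuming a hypothetical num_trained_batches progress metric:

import torch
import torch.distributed as dist

def get_most_tenured_rank(num_trained_batches: int) -> int:
    # All-gather every worker's progress counter; the worker that has
    # trained the most batches holds the most up-to-date state.
    progress = torch.tensor([num_trained_batches])
    gathered = [torch.zeros_like(progress) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, progress)
    # max() returns the first maximal element, so ties break toward the
    # lowest rank and every worker computes the same representative.
    return max(range(len(gathered)), key=lambda r: gathered[r].item())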

Alternatives

  1. The most_tenured_rank concept could be baked directly into the checkpoint util, as sketched below.
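
A rough sketch of this alternative, assuming a hypothetical State.get_most_tenured_rank() method (the load_checkpoint name and control flow below are illustrative, not the actual CheckpointUtil API):

import torch.distributed as dist

def load_checkpoint(state, checkpoint):
    # The checkpoint util asks the state for the representative rank
    # instead of hard-coding rank 0.
    rank = state.get_most_tenured_rank()
    if dist.get_rank() == rank:
        state.load(checkpoint)
    # sync() then broadcasts the representative's state to the others.
    state.sync()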

Additional context

  1. https://github.com/pytorch/elastic/blob/master/torchelastic/checkpoint/api.py#L146
  2. https://github.com/pytorch/elastic/blob/master/torchelastic/state.py#L153
  3. https://github.com/pytorch/elastic/blob/master/torchelastic/checkpoint/api.py#L103
