[Train] Update run status and actor status for train runs. #46395

woshiyyya · 2024-07-02T23:26:31Z

Why are these changes needed?

Added two entries to TrainRunInfo and one to WorkerInfo:

TrainRunInfo.run_status: STARTED, FINISHED, ERRORED
TrainRunInfo.controller_actor_status: ALIVE, DEAD
WorkerInfo.status: ALIVE, DEAD

Update the Train run status when training finishes or errors.

Update controller_actor_status and worker status on demand when the TrainHead endpoint is called.
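
A minimal sketch of how these status fields could be modeled. The enum members are taken from the description above; the plain dataclasses (instead of the schema classes in python/ray/train/_internal/state/schema.py) and the trimmed field lists are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RunStatusEnum(str, Enum):
    STARTED = "STARTED"
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"


class ActorStatusEnum(str, Enum):
    ALIVE = "ALIVE"
    DEAD = "DEAD"


@dataclass
class WorkerInfo:
    # Only the fields needed for this illustration; the real schema has more.
    actor_id: str
    status: ActorStatusEnum = ActorStatusEnum.ALIVE


@dataclass
class TrainRunInfo:
    # Only the fields needed for this illustration; the real schema has more.
    id: str
    controller_actor_id: str
    run_status: RunStatusEnum = RunStatusEnum.STARTED
    controller_actor_status: ActorStatusEnum = ActorStatusEnum.ALIVE
    workers: List[WorkerInfo] = field(default_factory=list)


run = TrainRunInfo(
    id="run-1",
    controller_actor_id="actor-0",
    workers=[WorkerInfo(actor_id="actor-1")],
)
# The run status is updated once training finishes.
run.run_status = RunStatusEnum.FINISHED
print(run.run_status.value, run.workers[0].status.value)
```

Deriving the enums from str keeps the values directly comparable to the plain strings that a JSON endpoint would return.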

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

matthewdeng · 2024-07-12T00:24:02Z

python/ray/dashboard/modules/train/train_head.py

+        try:
+            from ray.train._internal.state.schema import ActorStatusEnum, RunStatusEnum
+        except ImportError:
+            logger.exception(
+                "Train is not installed. Please run `pip install ray[train]` "
+                "when setting up Ray on your cluster."
+            )


Do we need this in this method since it's already checked at the start of get_train_runs? Also if this ever does happen it'll just error out when trying to use these imports later in the code.

matthewdeng · 2024-07-12T00:39:49Z

python/ray/dashboard/modules/train/train_head.py

+            # If the controller died but the run status is not updated,
+            # mark the train run as aborted
+            controller_actor_status = actor_status_table.get(
+                train_run.controller_actor_id, None
+            )
+            if (
+                controller_actor_status == ActorStatusEnum.DEAD
+                and train_run.run_status == RunStatusEnum.STARTED
+            ):
+                train_run.run_status = RunStatusEnum.ABORTED


nit: Feels a bit messy to update the run status here. If we keep this here I'd at least rename the method to _add_actor_and_update_run_status or add a docstring that documents this behavior.

This one is doing some post processing to handle some abnormally terminated cases. Let me update the function name here.
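
A minimal sketch of the post-processing discussed in this thread, assuming the actor status table is a plain dict keyed by actor id. The function name and the SimpleNamespace stand-ins for train runs are hypothetical and not taken from the PR.

```python
from enum import Enum
from types import SimpleNamespace
from typing import Dict, Iterable


class ActorStatusEnum(str, Enum):
    ALIVE = "ALIVE"
    DEAD = "DEAD"


class RunStatusEnum(str, Enum):
    STARTED = "STARTED"
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"
    ABORTED = "ABORTED"


def add_actor_status_and_update_run_status(
    train_runs: Iterable[SimpleNamespace],
    actor_status_table: Dict[str, ActorStatusEnum],
) -> None:
    """Attach the controller status to each run and mark orphaned runs ABORTED.

    Hypothetical helper: a run whose controller is dead but whose status is
    still STARTED never reported an outcome, so it is treated as aborted.
    """
    for train_run in train_runs:
        controller_actor_status = actor_status_table.get(
            train_run.controller_actor_id, None
        )
        train_run.controller_actor_status = controller_actor_status
        if (
            controller_actor_status == ActorStatusEnum.DEAD
            and train_run.run_status == RunStatusEnum.STARTED
        ):
            train_run.run_status = RunStatusEnum.ABORTED


runs = [
    SimpleNamespace(controller_actor_id="a1", run_status=RunStatusEnum.STARTED),
    SimpleNamespace(controller_actor_id="a2", run_status=RunStatusEnum.STARTED),
    SimpleNamespace(controller_actor_id="a3", run_status=RunStatusEnum.FINISHED),
]
actor_status_table = {
    "a1": ActorStatusEnum.ALIVE,
    "a2": ActorStatusEnum.DEAD,
    "a3": ActorStatusEnum.DEAD,
}
add_actor_status_and_update_run_status(runs, actor_status_table)
print([r.run_status.value for r in runs])
```

Only the second run changes: its controller is dead while the run is still STARTED. A finished run with a dead controller keeps its FINISHED status.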

matthewdeng · 2024-07-12T00:40:40Z

python/ray/dashboard/modules/train/train_head.py

+            # If the controller died but the run status is not updated,
+            # mark the train run as aborted
+            controller_actor_status = actor_status_table.get(
+                train_run.controller_actor_id, None
+            )
+            if (
+                controller_actor_status == ActorStatusEnum.DEAD
+                and train_run.run_status == RunStatusEnum.STARTED
+            ):
+                train_run.run_status = RunStatusEnum.ABORTED


Should there be a status_detail for this one?

matthewdeng · 2024-07-12T00:41:40Z

python/ray/train/_internal/state/schema.py

+    ERRORED = "ERRORED"
+    ABORTED = "ABORTED"


What is the difference between ERRORED and ABORTED?

We only mark it as ERRORED when the Trainer detects an error and actively reports it to the StateActor.

We mark it as ABORTED when the TrainRun fails abnormally (e.g. due to a node failure) and we passively set the status in TrainHead.
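
A minimal sketch of the active path described here, where the trainer itself observes the failure and can report it. The run_and_report helper is hypothetical and only returns the values that would be sent to the state actor; the passive ABORTED path cannot be shown this way, because in that case the reporting process is already gone.

```python
from enum import Enum
from typing import Callable, Tuple


class RunStatusEnum(str, Enum):
    STARTED = "STARTED"
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"
    ABORTED = "ABORTED"


def run_and_report(train_fn: Callable[[], None]) -> Tuple[RunStatusEnum, str]:
    """Run train_fn and return the (run_status, status_detail) to report."""
    try:
        train_fn()
    except Exception as exc:
        # The trainer saw the error, so it can actively report ERRORED.
        return RunStatusEnum.ERRORED, f"{type(exc).__name__}: {exc}"
    return RunStatusEnum.FINISHED, ""


def failing_train_fn() -> None:
    raise ValueError("bad batch")


print(run_and_report(lambda: None))
print(run_and_report(failing_train_fn))
```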

matthewdeng · 2024-07-12T00:42:55Z

python/ray/train/_internal/state/schema.py

+    status_detail: str = Field(
+        description="Detailed information about the current run status, "
+        "such as error messages."
+    )


What's the purpose of this? We only ever have one "User Error" message right now.

This is for tracking the error reason for now, and can be extended in the future to track the details of the current run status (e.g. scaling up, scaling down, or recovering during elastic training).

matthewdeng · 2024-07-12T00:58:20Z

python/ray/train/_internal/state/state_manager.py

+        self.train_run_info_dict = dict(
            id=run_id,
            job_id=job_id,
            name=run_name,
            controller_actor_id=controller_actor_id,
            workers=worker_info_list,
            datasets=dataset_info_list,
            start_time_ms=start_time_ms,
+            run_status=run_status,
+            status_detail=status_detail,
        )
+        train_run_info = TrainRunInfo(**self.train_run_info_dict)
+        ray.get(self.state_actor.register_train_run.remote(train_run_info))

+    def update_train_run_info(self, updates: Dict[str, Any]) -> None:
+        """Update specific fields of a registered TrainRunInfo instance."""
+        self.train_run_info_dict.update(updates)
+        train_run_info = TrainRunInfo(**self.train_run_info_dict)
        ray.get(self.state_actor.register_train_run.remote(train_run_info))


This can cause problems in the future if we're not careful, e.g. if somewhere we call update_train_run_info with a different id. One way to limit this is to restrict the public API and use a private method instead.

Something like this:

Suggested change

-        self.train_run_info_dict = dict(
-            id=run_id,
-            job_id=job_id,
-            name=run_name,
-            controller_actor_id=controller_actor_id,
-            workers=worker_info_list,
-            datasets=dataset_info_list,
-            start_time_ms=start_time_ms,
-            run_status=run_status,
-            status_detail=status_detail,
-        )
-        train_run_info = TrainRunInfo(**self.train_run_info_dict)
-        ray.get(self.state_actor.register_train_run.remote(train_run_info))
-
-    def update_train_run_info(self, updates: Dict[str, Any]) -> None:
-        """Update specific fields of a registered TrainRunInfo instance."""
-        self.train_run_info_dict.update(updates)
-        train_run_info = TrainRunInfo(**self.train_run_info_dict)
-        ray.get(self.state_actor.register_train_run.remote(train_run_info))
+        updates = dict(
+            id=run_id,
+            job_id=job_id,
+            name=run_name,
+            controller_actor_id=controller_actor_id,
+            workers=worker_info_list,
+            datasets=dataset_info_list,
+            start_time_ms=start_time_ms,
+            run_status=run_status,
+            status_detail=status_detail,
+        )
+        # you can first assert that the info dict is empty here I guess
+        _update_train_run_info(updates)
+
+    def end_train_run(self, run_status, status_detail, end_time_ms):
+        updates = dict(
+            run_status=run_status,
+            status_detail=status_detail,
+            end_time_ms=end_time_ms,
+        )
+        _update_train_run_info(updates)
+
+    def _update_train_run_info(self, updates: Dict[str, Any]) -> None:
+        """Update specific fields of a registered TrainRunInfo instance."""
+        self.train_run_info_dict.update(updates)
+        train_run_info = TrainRunInfo(**self.train_run_info_dict)
+        ray.get(self.state_actor.register_train_run.remote(train_run_info))

This is a good point! Let me update this
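
A runnable sketch of the restricted API suggested above, under stated assumptions: an in-memory stand-in replaces the Ray state actor, plain dicts replace TrainRunInfo, and the class names and the trimmed argument list are hypothetical.

```python
from typing import Any, Dict


class InMemoryStateActor:
    """Stand-in for the Ray state actor; keeps the latest info per run id."""

    def __init__(self) -> None:
        self.runs: Dict[str, Dict[str, Any]] = {}

    def register_train_run(self, info: Dict[str, Any]) -> None:
        self.runs[info["id"]] = dict(info)


class RunStateManager:
    """Hypothetical manager exposing only register and end operations."""

    def __init__(self, state_actor: InMemoryStateActor) -> None:
        self.state_actor = state_actor
        self.train_run_info_dict: Dict[str, Any] = {}

    def register_train_run(
        self, run_id: str, run_name: str, start_time_ms: int
    ) -> None:
        # Registration is the only place where the run id can be set.
        assert not self.train_run_info_dict, "run is already registered"
        self._update_train_run_info(
            dict(
                id=run_id,
                name=run_name,
                start_time_ms=start_time_ms,
                run_status="STARTED",
                status_detail="",
            )
        )

    def end_train_run(
        self, run_status: str, status_detail: str, end_time_ms: int
    ) -> None:
        # Callers can change only the end-of-run fields, never the id.
        self._update_train_run_info(
            dict(
                run_status=run_status,
                status_detail=status_detail,
                end_time_ms=end_time_ms,
            )
        )

    def _update_train_run_info(self, updates: Dict[str, Any]) -> None:
        """Merge updates into the cached info and re-register the run."""
        self.train_run_info_dict.update(updates)
        self.state_actor.register_train_run(self.train_run_info_dict)


actor = InMemoryStateActor()
manager = RunStateManager(actor)
manager.register_train_run("run-1", "my_run", start_time_ms=1000)
manager.end_train_run("FINISHED", "", end_time_ms=2000)
print(actor.runs["run-1"]["run_status"])
```

Because the generic update method is private, callers outside the class cannot overwrite id or other registration-time fields by accident.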

matthewdeng · 2024-07-13T01:06:25Z

python/ray/train/_internal/state/state_manager.py

        )

-        ray.get(self.state_actor.register_train_run.remote(train_run_info))
+        # Clear the cached info to avoid registering the same run twice
+        self.train_run_info_dict[run_id].clear()


Will this populate the dictionary on the first call? Since the first line in _update_train_run_info will check if the run_id is present.

Yes it will populate the dict. But yeah I think it's a bit implicit, let me update it to

self.train_run_info_dict[run_id] = {}
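
A small illustration of why the explicit assignment reads more clearly. It assumes train_run_info_dict is a collections.defaultdict(dict), which the diff above does not show but which would explain why calling .clear() on a missing key populates the dictionary.

```python
from collections import defaultdict
from typing import Any, Dict

# Implicit: looking up a missing key on a defaultdict creates the entry,
# so .clear() "populates" the dict only as a side effect of the lookup.
implicit: Dict[str, Dict[str, Any]] = defaultdict(dict)
implicit["run-1"].clear()

# The same call on a plain dict fails instead of creating the entry.
plain: Dict[str, Dict[str, Any]] = {}
try:
    plain["run-1"].clear()
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Explicit: the assignment states the intent and works for both dict types.
explicit: Dict[str, Dict[str, Any]] = {}
explicit["run-1"] = {}

print(dict(implicit), raised_key_error, explicit)
```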

woshiyyya added 4 commits July 2, 2024 23:25

update

5ce0cc1

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

update

8057587

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

udpate

54c1106

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

add tests

1ed6114

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

woshiyyya changed the title ~~[Train] Update TrainRunStatus Upon Finish or Error~~ [Train] Update run status and actor status for Train State API. Jul 5, 2024

woshiyyya changed the title ~~[Train] Update run status and actor status for Train State API.~~ [Train] Update run status and actor status for train runs. Jul 5, 2024

woshiyyya requested a review from alanwguo July 5, 2024 06:38

woshiyyya commented Jul 5, 2024

update

8312b2b

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

woshiyyya marked this pull request as ready for review July 5, 2024 06:48

woshiyyya requested review from hongpeng-guo, justinvyu, matthewdeng and raulchen as code owners July 5, 2024 06:48

woshiyyya assigned alanwguo and matthewdeng Jul 5, 2024

woshiyyya added 4 commits July 5, 2024 20:03

fix tests

28f2437

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

update

bf0cdeb

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

udpate

423d650

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

Merge remote-tracking branch 'upstream/master' into train/update_run_…

5478d35

…state_on_finish_and_error

alanwguo reviewed Jul 8, 2024

woshiyyya and others added 7 commits July 8, 2024 14:44

Apply suggestions from code review

cc44bb4

Co-authored-by: Alan Guo <aguo@aguo.software> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>

address comments

e09f818

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

add GCS call

1abe4f0

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

change test small to medium

ef866e8

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

test

7687e61

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

fix ci

1e6fb9c

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

udpate to endpoint v2

c7caa36

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

alanwguo reviewed Jul 9, 2024

Merge remote-tracking branch 'upstream/master' into train/update_run_…

2820020

…state_on_finish_and_error

alanwguo reviewed Jul 9, 2024

woshiyyya added 2 commits July 10, 2024 05:28

mark abnormally terminated run as aborted

e129ab4

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

Merge remote-tracking branch 'upstream/master' into train/update_run_…

578f746

…state_on_finish_and_error

woshiyyya added the go add ONLY when ready to merge, run all tests label Jul 10, 2024

fix import

15af2bc

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

alanwguo reviewed Jul 10, 2024

check controller status

e350ef4

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

alanwguo approved these changes Jul 11, 2024

matthewdeng reviewed Jul 12, 2024

woshiyyya added 2 commits July 12, 2024 06:28

address comments

31ab9a7

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

update docstring

17c4c1c

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

woshiyyya requested a review from matthewdeng July 12, 2024 20:18

address more comments

a17415a

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

matthewdeng reviewed Jul 13, 2024

upddate

cd0965f

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>

matthewdeng approved these changes Jul 13, 2024

matthewdeng merged commit 1a369e1 into ray-project:master Jul 13, 2024
5 checks passed
