
[Core] Put pg state to kv store when pg rescheduling #34467

Merged
rkooo567 merged 1 commit into ray-project:master from larrylian:pg_fo_when_gcs_fo
Apr 27, 2023

Conversation

@larrylian
Contributor

@larrylian larrylian commented Apr 17, 2023

Why are these changes needed?

If GCS restarts while a PG is failing over and has not yet been rescheduled successfully, the PG will never be rescheduled.

  1. A node goes down, triggering rescheduling of the PG bundles on that node.
  2. Due to insufficient resources, the PG bundle cannot be rescheduled yet.
  3. The GCS server fails over (restarts).
  4. Even after resources become sufficient, the PG bundle is never rescheduled.

Reproduce command:

pytest -sv python/ray/tests/test_placement_group_failover.py::test_gcs_restart_when_placement_group_failover

Root cause: the PG's rescheduling state is lost when GCS restarts.
Solution:
Persist the PG to the KV store whenever its state changes to RESCHEDULING.
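
For illustration, here is a minimal, self-contained sketch of the idea. These are hypothetical stand-ins, not Ray's actual classes or API (KvStore, PgManager, and MarkRescheduling do not exist in the codebase); the point is only that the RESCHEDULING transition is persisted before the retry is enqueued, so a restarted GCS can reload the state and resume rescheduling:

#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the GCS-backed KV store.
class KvStore {
 public:
  void Put(const std::string &key, const std::string &value,
           std::function<void()> on_done) {
    table_[key] = value;
    on_done();  // In the real system this fires after the write is flushed.
  }
  bool Get(const std::string &key, std::string *value) const {
    auto it = table_.find(key);
    if (it == table_.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  std::unordered_map<std::string, std::string> table_;
};

// Hypothetical manager: the RESCHEDULING transition is written to the store
// before the scheduling retry is enqueued, so it survives a restart.
class PgManager {
 public:
  explicit PgManager(KvStore *store) : store_(store) {}

  void MarkRescheduling(const std::string &pg_id) {
    store_->Put(pg_id, "RESCHEDULING", [pg_id] {
      // Only after persistence do we enqueue the PG for a scheduling retry.
      std::cout << pg_id << " persisted as RESCHEDULING\n";
    });
  }

  // Called on restart: rebuild in-memory state from the store and resume.
  void RecoverState(const std::string &pg_id) {
    std::string value;
    if (store_->Get(pg_id, &value) && value == "RESCHEDULING") {
      std::cout << pg_id << " re-queued for rescheduling after restart\n";
    }
  }

 private:
  KvStore *store_;
};

int main() {
  KvStore store;
  {
    PgManager manager(&store);
    manager.MarkRescheduling("pg_1");  // a node died; the bundle must be rescheduled
  }  // "GCS restart": the in-memory manager is destroyed, the store survives.

  PgManager restarted(&store);
  restarted.RecoverState("pg_1");  // state recovered, rescheduling resumes
}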


Related issue number

When a PG fails over but has not been scheduled successfully, the restart of gcs will cause the PG to no longer be rescheduled. #34468

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@larrylian larrylian requested a review from a team as a code owner April 17, 2023 12:20
@larrylian larrylian self-assigned this Apr 17, 2023
@rkooo567 rkooo567 self-assigned this Apr 17, 2023
@larrylian larrylian requested a review from jjyao April 18, 2023 06:06
Signed-off-by: LarryLian <554538252@qq.com>
Contributor

@rkooo567 rkooo567 left a comment


This makes sense! Btw can you also add a cpp test to simulate both cases?

  1. rescheduling -> node dead -> GCS restarts -> rescheduling succeeds
  2. rescheduling started -> GCS restarts -> rescheduling succeeds

@rkooo567 rkooo567 added the @author-action-required label Apr 25, 2023
@larrylian
Contributor Author

Btw can you also add a cpp test to simulate both cases?

@rkooo567
In fact, my personal opinion is that we should not add this case to gcs_placement_group_manager_test:

  1. My Python test test_gcs_restart_when_placement_group_failover already fully covers this scenario, and it is a genuinely effective test.
  2. Many operations in gcs_placement_group_manager_test are mocked and quite complicated. Adding a "GCS restart" case there would make the test even more costly to maintain.
  3. Because so many operations in gcs_placement_group_manager_test are mocked, even if I added the case you mentioned, it would protect against regressions less well than the Python test.

@rkooo567
Copy link
Contributor

I don't think it's complicated to test. It seems we already have a way to test GCS restart:

TEST_F(GcsPlacementGroupManagerTest, TestSchedulerReinitializeAfterGcsRestart) {
  // Create a placement group and make sure it has been created successfully.
  auto request = Mocker::GenCreatePlacementGroupRequest();
  std::atomic<int> registered_placement_group_count(0);
  RegisterPlacementGroup(request, [&registered_placement_group_count](Status status) {
    ++registered_placement_group_count;
  });
  ASSERT_EQ(registered_placement_group_count, 1);
  WaitForExpectedPgCount(1);

  auto placement_group = mock_placement_group_scheduler_->placement_groups_.back();
  placement_group->GetMutableBundle(0)->set_node_id(NodeID::FromRandom().Binary());
  placement_group->GetMutableBundle(1)->set_node_id(NodeID::FromRandom().Binary());
  mock_placement_group_scheduler_->placement_groups_.pop_back();
  OnPlacementGroupCreationSuccess(placement_group);
  ASSERT_EQ(placement_group->GetState(), rpc::PlacementGroupTableData::CREATED);
  // Reinitialize the placement group manager and test the node dead case.
  auto gcs_init_data = LoadDataFromDataStorage();
  ASSERT_EQ(1, gcs_init_data->PlacementGroups().size());
  EXPECT_TRUE(
      gcs_init_data->PlacementGroups().find(placement_group->GetPlacementGroupID()) !=
      gcs_init_data->PlacementGroups().end());
  EXPECT_CALL(*mock_placement_group_scheduler_, ReleaseUnusedBundles(_)).Times(1);
  EXPECT_CALL(
      *mock_placement_group_scheduler_,
      Initialize(testing::Contains(testing::Key(placement_group->GetPlacementGroupID()))))
      .Times(1);
  gcs_placement_group_manager_->Initialize(*gcs_init_data);
}

The pro of testing it in C++ is that we can test the exact edge cases. The Python-level test depends on timing, so it's easy to miss edge cases.

@larrylian
Contributor Author

The pro of testing it in C++ is that we can test the exact edge cases. The Python-level test depends on timing, so it's easy to miss edge cases.

@rkooo567
I agree with that point of view.
Thank you for the example. I'll submit a follow-up PR to add this C++ test case.
This PR has already passed CI, so I think it would be more efficient to merge it first.
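
For reference, a rough sketch of what that follow-up C++ test might look like, reusing the fixture from the example above. The OnNodeDead hook and the exact post-restart expectations are assumptions here, not verified against the current codebase:

TEST_F(GcsPlacementGroupManagerTest, TestReschedulingStateSurvivesGcsRestart) {
  // Create a placement group and mark it as created on a single node.
  auto request = Mocker::GenCreatePlacementGroupRequest();
  std::atomic<int> registered_placement_group_count(0);
  RegisterPlacementGroup(request, [&registered_placement_group_count](Status status) {
    ++registered_placement_group_count;
  });
  ASSERT_EQ(registered_placement_group_count, 1);
  WaitForExpectedPgCount(1);

  auto placement_group = mock_placement_group_scheduler_->placement_groups_.back();
  auto node_id = NodeID::FromRandom();
  placement_group->GetMutableBundle(0)->set_node_id(node_id.Binary());
  placement_group->GetMutableBundle(1)->set_node_id(node_id.Binary());
  mock_placement_group_scheduler_->placement_groups_.pop_back();
  OnPlacementGroupCreationSuccess(placement_group);
  ASSERT_EQ(placement_group->GetState(), rpc::PlacementGroupTableData::CREATED);

  // Simulate the node dying so the PG flips to RESCHEDULING (hook name assumed).
  gcs_placement_group_manager_->OnNodeDead(node_id);

  // Simulate a GCS restart: the persisted RESCHEDULING state should be reloaded
  // and the PG handed back to the scheduler.
  auto gcs_init_data = LoadDataFromDataStorage();
  ASSERT_EQ(1, gcs_init_data->PlacementGroups().size());
  EXPECT_CALL(
      *mock_placement_group_scheduler_,
      Initialize(testing::Contains(testing::Key(placement_group->GetPlacementGroupID()))))
      .Times(1);
  gcs_placement_group_manager_->Initialize(*gcs_init_data);
}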

@rkooo567
Contributor

We will merge this PR first and @larrylian will submit a follow-up PR for the unit test!

@rkooo567
Contributor

The failures are likely unrelated.

@rkooo567 rkooo567 merged commit af018f6 into ray-project:master Apr 27, 2023
rkooo567 added a commit to rkooo567/ray that referenced this pull request May 1, 2023
rkooo567 added a commit that referenced this pull request May 2, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
