[Core] Put pg state to kv store when pg rescheduling #34467
rkooo567 merged 1 commit into ray-project:master
Signed-off-by: LarryLian <554538252@qq.com>
8f683c5 to 81cd85c
rkooo567 left a comment
This makes sense! Btw can you also add a cpp test to simulate both cases?
- rescheduling -> node dead -> GCS restarts -> rescheduling succeeds
- rescheduling started -> GCS restarts -> rescheduling succeeds
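The two cases above can be replayed against a toy model to see what each one should assert. This is only a Python sketch under stated assumptions, not the requested C++ test and not Ray code: `run_scenario`, the event names, and the single-PG state are all hypothetical.

```python
def run_scenario(events, persist=True):
    """Replay events against a toy GCS; return True if the PG bundle ends up placed.

    Hypothetical model: one PG, one bundle. `kv` survives a GCS restart,
    `mem` (the in-memory view) does not.
    """
    kv = {"pg": "CREATED"}   # persistent store
    mem = dict(kv)           # in-memory state, rebuilt from kv on restart
    placed = True            # whether the bundle currently sits on a live node
    free_nodes = 0
    for ev in events:
        if ev == "node_dead":
            placed = False
            mem["pg"] = "RESCHEDULING"
            if persist:      # the behaviour this PR adds
                kv["pg"] = "RESCHEDULING"
        elif ev == "gcs_restart":
            mem = dict(kv)   # everything not written to kv is lost
        elif ev == "node_added":
            free_nodes += 1
        # Scheduler tick: only PGs known to be rescheduling are retried.
        if mem["pg"] == "RESCHEDULING" and free_nodes > 0:
            free_nodes -= 1
            placed = True
            mem["pg"] = kv["pg"] = "CREATED"
    return placed


# Case 1: rescheduling -> node dead -> GCS restarts -> rescheduling succeeds
CASE_1 = ["node_dead", "node_dead", "gcs_restart", "node_added"]
# Case 2: rescheduling started -> GCS restarts -> rescheduling succeeds
CASE_2 = ["node_dead", "gcs_restart", "node_added"]
```

With `persist=False` the same event sequences leave the bundle unplaced, which is the regression a deterministic test would guard against.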
I don't think it is complicated to test? It seems we already have a way to test GCS restart. The advantage of testing it in C++ is that we can test the exact edge cases. The Python-level test depends on timing, so it is easy to miss edge cases.
We will merge this PR first, and @larrylian will submit a follow-up PR for the unit test!
Failures are unlikely to be related.
…ject#34467)" This reverts commit af018f6.
When a PG fails over but has not yet been rescheduled successfully, a GCS restart causes the PG to never be rescheduled.
- A node goes down, triggering rescheduling of the PG bundle on that node.
- Due to insufficient resources, this PG bundle cannot be scheduled successfully.
- The GCS server fails over (FO).
- In the end, even when resources become sufficient, the PG bundle is still not rescheduled.
Reproduce command: pytest -sv python/ray/tests/test_placement_group_failover.py::test_gcs_restart_when_placement_group_failover
Root cause: the rescheduling state of the PG is lost when GCS restarts.
Solution: save the PG to the KV store when the PG changes to the rescheduling state.
Signed-off-by: Jack He <jackhe2345@gmail.com>
…ject#34467)" (ray-project#34914) This reverts commit af018f6.
Why are these changes needed?
When a PG fails over but has not yet been rescheduled successfully, a GCS restart causes the PG to never be rescheduled.
Reproduce command:
pytest -sv python/ray/tests/test_placement_group_failover.py::test_gcs_restart_when_placement_group_failover
Root cause: the rescheduling state of the PG is lost when GCS restarts.
Solution:
Save the PG to the KV store when the PG changes to the rescheduling state.
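The idea can be illustrated with a minimal sketch, assuming a toy manager whose in-memory state is rebuilt from a key-value store on restart. `ToyGcs`, `PGState`, and every method name here are hypothetical stand-ins and do not mirror the actual GCS classes.

```python
from dataclasses import dataclass, field
from enum import Enum


class PGState(Enum):
    CREATED = "CREATED"
    RESCHEDULING = "RESCHEDULING"


@dataclass
class ToyGcs:
    """Hypothetical stand-in for a placement group manager; not Ray code."""
    kv_store: dict                 # shared dict, survives "restarts"
    persist_on_reschedule: bool    # True models the behaviour after this PR
    states: dict = field(default_factory=dict)  # in-memory only

    def create_pg(self, pg_id):
        self.states[pg_id] = PGState.CREATED
        self.kv_store[pg_id] = PGState.CREATED.value

    def on_node_dead(self, pg_id):
        # The bundle's node died: the PG must be rescheduled.
        self.states[pg_id] = PGState.RESCHEDULING
        if self.persist_on_reschedule:
            # Write the new state so a restart can recover it.
            self.kv_store[pg_id] = PGState.RESCHEDULING.value

    def pending_reschedule(self):
        return sorted(p for p, s in self.states.items()
                      if s is PGState.RESCHEDULING)

    @classmethod
    def restart_from(cls, kv_store, persist_on_reschedule):
        # A restarted instance only knows what was written to the store.
        gcs = cls(kv_store, persist_on_reschedule)
        gcs.states = {p: PGState(v) for p, v in kv_store.items()}
        return gcs


def pending_after_restart(persist):
    kv = {}
    gcs = ToyGcs(kv, persist)
    gcs.create_pg("pg-1")
    gcs.on_node_dead("pg-1")
    restarted = ToyGcs.restart_from(kv, persist)
    return restarted.pending_reschedule()
```

In this sketch, `pending_after_restart(False)` returns an empty list because the restarted instance still sees the PG as `CREATED`, while `pending_after_restart(True)` returns the PG that still needs rescheduling.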
Related issue number
When a PG fails over but has not been scheduled successfully, the restart of gcs will cause the PG to no longer be rescheduled. #34468