Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Placement Group] Make the creation of placement group sync #13858

Merged
merged 31 commits into from
Feb 24, 2021

Conversation

clay4444
Copy link
Contributor

@clay4444 clay4444 commented Feb 2, 2021

Why are these changes needed?

  1. Make the creation of placement group sync, prevent the core worker returning that the creation was successful but the actual creation failed.
  2. fix ut test_automatic_cleanup_detached_actors

Related issue number

#13859

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we'd like to have this change, we should make sure the HandleCreatePlacementGroup will not have this code;

pending_register_iter->second = std::move(callback);

I am not sure if this is necessary.

@clay4444
Copy link
Contributor Author

clay4444 commented Feb 3, 2021

If we'd like to have this change, we should make sure the HandleCreatePlacementGroup will not have this code;

pending_register_iter->second = std::move(callback);

I am not sure if this is necessary.

I think this is necessary, image this case: CoreWorker return success, but its creation is actually failed, then the subsequent scheduling of task or actor with PG is meaningless, and if a failover occurs, the placement group info will be lost forever

@rkooo567
Copy link
Contributor

rkooo567 commented Feb 3, 2021

That's why we have .wait and .ready right? Also again, if we'd like to make this change, we shouldn't store the callback for CreatePlacementGroup. We should always reply right away like actor cases. Otherwise, there's no meaning of .wait and .ready APIs.

@clay4444
Copy link
Contributor Author

clay4444 commented Feb 3, 2021

That's why we have .wait and .ready right? Also again, if we'd like to make this change, we shouldn't store the callback for CreatePlacementGroup. We should always reply right away like actor cases. Otherwise, there's no meaning of .wait and .ready APIs.

I think the .wait and .ready are for detecting whether create successful instead of registration successful right? if the register is already failed, I think the pg.wait() and pg.ready() Is meaningless

@rkooo567
Copy link
Contributor

rkooo567 commented Feb 3, 2021

if we'd like to make this change, we shouldn't store the callback for CreatePlacementGroup. We should always reply right away like actor cases

Currently, the creation is replied "after" registration is done. I am not against this idea, but then please fix this issue.

@clay4444
Copy link
Contributor Author

clay4444 commented Feb 3, 2021

discussed with @rkooo567 offline, we should return success immediately when HandleCreatePlacementGroup instead of returning the result after successful creation.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 4, 2021
/// Callbacks of pending `RegisterPlacementGroup` requests.
/// Maps placement group ID to placement group registration callbacks, which is used to
/// filter duplicated messages from a driver/worker caused by some network problems.
absl::flat_hash_map<PlacementGroupID, std::vector<StatusCallback>>
Copy link
Contributor

@ffbin ffbin Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change it to placement_group_to_register_callbacks_? I'm afraid the design is a bit redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just like the registration of Actor, there will be the deconstruction problem if we just ignore the previous callback, I've checked this from @raulchen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clay4444 Can you describe the specific problem in detail? thx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm... I'm not sure whether I can explain this clearly, the general reason is that the GrpcClient will not be deconstructed since we just ignore the callback and no response was sent to the client

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is a network exception, reply also cannot be returned.

@clay4444 clay4444 assigned ffbin, clay4444 and rkooo567 and unassigned ffbin and rkooo567 Feb 20, 2021
@rkooo567
Copy link
Contributor

Lint failure

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of last comments! After you address all of them, and if lint passes, I will merge the PR :)!

iter->second(Status::NotFound(stream.str()));
placement_group_to_register_callback_.erase(iter);
}
// The placement group registration is synchronous, so if we found the placement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great comment!

src/ray/gcs/gcs_server/gcs_placement_group_manager.cc Outdated Show resolved Hide resolved
python/ray/tests/test_placement_group.py Show resolved Hide resolved
SchedulePendingPlacementGroups();
} else {
// The placement group registration is synchronous, so if we found the placement
Copy link
Contributor Author

@clay4444 clay4444 Feb 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussed with @ffbin , we don't need to invoke RAY_CHECK(placement_group_to_register_callback_.count(placement_group_id) == 0) here, because we will remove the placement group from registered_placement_groups_ and placement_group_to_register_callbacks_ only in the RemovePlacementGroup @rkooo567

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm can you at least make the log message to ERROR and say "This is a bug that should be addressed. Please report to Ray Github issue if you see this message"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still prefer to have a RAY_CHECK for strong consistency. What's the reason you guys prefer the log message? If we see the check failure, that means we just need to fix the issue rather than ignoring with the log messages.

@clay4444 clay4444 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Feb 22, 2021
@rkooo567 rkooo567 merged commit 911b028 into ray-project:master Feb 24, 2021
@rkooo567
Copy link
Contributor

Wait I probably needed to have a API change approval.

@rkooo567 rkooo567 added the core-interface-change-approval-required This changes the Ray core behavior / API and requires broader approvals. label Feb 24, 2021
@rkooo567
Copy link
Contributor

Okay approved :)

@rkooo567 rkooo567 removed the core-interface-change-approval-required This changes the Ray core behavior / API and requires broader approvals. label Feb 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants