[core] Fix bug in task dependency management for duplicate args #16365

Merged (5 commits) on Jun 22, 2021

stephanie-wang (Contributor) commented Jun 11, 2021

Why are these changes needed?

We overestimated the number of missing args for tasks that have duplicate args. This can lead to the task never being scheduled.

Closes #16556.
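
To see why an overcounted total can leave a task stuck, here is a minimal Python sketch. It is a toy model, not Ray's C++ code: `NaiveTaskDependencies` and `handle_object_local` are hypothetical names, and it assumes the counter starts at the length of the raw argument list while each distinct object produces only one "became local" notification.

```python
class NaiveTaskDependencies:
    """Toy model of the pre-fix counting (hypothetical, not Ray's code)."""

    def __init__(self, args):
        # Bug being illustrated: every entry in `args` is counted,
        # including repeats of the same object.
        self.num_missing = len(args)

    def handle_object_local(self):
        # Assumption: called once per distinct object that becomes available.
        self.num_missing -= 1

    def ready(self):
        return self.num_missing == 0


args = ["obj_a", "obj_a", "obj_b"]  # the same object is passed twice
task = NaiveTaskDependencies(args)
for _ in set(args):  # only two distinct objects ever arrive
    task.handle_object_local()

print(task.num_missing, task.ready())  # prints "1 False"
```

Under these assumptions the counter settles at 1 instead of 0, so the task would never be considered ready.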

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ericl (Contributor) left a comment

Interesting. In PullManager this is solved by de-duping the args, but I guess the problem also occurs at a higher level.

@@ -192,10 +192,10 @@ class DependencyManager : public TaskDependencyManagerInterface {

   /// A struct to represent the object dependencies of a task.
   struct TaskDependencies {
-    TaskDependencies(const std::vector<rpc::ObjectReference> &deps)
-        : num_missing_dependencies(deps.size()) {
+    TaskDependencies(const std::vector<rpc::ObjectReference> &deps) {

Consider changing this to absl::flat_hash_set so that duplicate edge cases can't happen by construction.
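
A rough Python analogue of this suggestion, using the built-in `set` as a stand-in for `absl::flat_hash_set`. The class and method names are illustrative only and are not taken from Ray; the point is that the count is derived from a de-duplicated container rather than tracked separately.

```python
class TaskDependencies:
    """Toy model: dependencies are de-duplicated by construction."""

    def __init__(self, args):
        # A set collapses duplicate object IDs, so the missing count
        # can never exceed the number of distinct dependencies.
        self.missing = set(args)

    def handle_object_local(self, object_id):
        # discard() is a no-op for unknown or already-local IDs, so a
        # repeated notification cannot push the count below zero.
        self.missing.discard(object_id)

    @property
    def num_missing(self):
        return len(self.missing)

    def ready(self):
        return not self.missing


task = TaskDependencies(["obj_a", "obj_a", "obj_b"])
assert task.num_missing == 2
task.handle_object_local("obj_a")
task.handle_object_local("obj_a")  # duplicate notification is harmless
task.handle_object_local("obj_b")
assert task.ready()
```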

ericl added the @author-action-required label Jun 11, 2021

ericl (Contributor) commented Jun 14, 2021

Rerunning tests:
[windows]
//python/ray/tests:test_scheduling FAILED in 3 out of 3 in 96.6s

ericl removed the @author-action-required label Jun 14, 2021

@@ -455,6 +455,37 @@ def f(x):
assert len(ray.state.objects()) == 0, ray.state.objects()


def test_many_args(ray_start_cluster):

::test_lease_request_leak SKIPPED [ 92%]
::test_many_args Windows fatal exception: access violation

This seems to generate a memory access violation on Windows. Is it possible this is actually a real memory corruption bug?

ericl added the @author-action-required label Jun 15, 2021

      const auto dep_ids = ObjectRefsToIds(deps);
      dependencies.insert(dep_ids.begin(), dep_ids.end());
    }
    TaskDependencies(absl::flat_hash_set<ObjectID> &&deps)

ericl (Contributor) commented Jun 15, 2021

The r-value ref, along with the move above/below, is a bit suspicious and may explain the Windows segfault.

Consider a normal ref or just a plain value type (std::move() should be efficient in either case).

AmeerHajAli (Contributor) commented

Can you please open an issue for this and label it 1.4.1 to make sure it is in the branch cut?

stephanie-wang (Contributor, Author) commented

Yeah sorry for the delay too, I don't know what's up with the memory corruption issue...

ericl (Contributor) commented Jun 19, 2021

Could try to "bisect" the changes to see what causes the memory issue.

ericl (Contributor) commented Jun 20, 2021

@stephanie-wang per #16566 the test is crashing even sans the c++ change, seems some serializer memory safety issue. How about skipping it on windows to unblock the fix? We can separately try to figure out why this test causes this.

stephanie-wang (Contributor, Author) commented

> @stephanie-wang per #16566 the test is crashing even sans the c++ change, seems some serializer memory safety issue. How about skipping it on windows to unblock the fix? We can separately try to figure out why this test causes this.

Oh thanks, was just trying this too. Yeah, sounds good.
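
One way to express such a platform skip, sketched with the standard library's `unittest.skipIf` so it runs without extra dependencies. Ray's tests use pytest, where `pytest.mark.skipif` plays the same role. The test body below is a placeholder, not the real `test_many_args`, and `SchedulingTests` and `run_suite` are hypothetical names.

```python
import sys
import unittest


class SchedulingTests(unittest.TestCase):
    @unittest.skipIf(sys.platform == "win32", "crashes on Windows; investigated separately")
    def test_many_args(self):
        # Placeholder body; the real test lives in Ray's test suite.
        args = ["obj"] * 3
        self.assertEqual(len(set(args)), 1)


def run_suite():
    # Load and run only the test case defined above.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SchedulingTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
```

On Windows the test is reported as skipped rather than failed, so the rest of the suite can still gate the merge.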

DmitriGekhtman (Contributor) commented

cpp tests are unhappy

stephanie-wang (Contributor, Author) commented

Hmm I'm having trouble reproducing the C++ failure locally, which is weird (these are usually pretty reliable). I just merged, let's see what happens.

DmitriGekhtman (Contributor) commented

stephanie-wang (Contributor, Author) commented

I wonder what this means:
https://travis-ci.com/github/ray-project/ray/builds/230265115#L37906

This issue seems to be in master too, I don't think it's related to this PR.

stephanie-wang (Contributor, Author) commented

test_multi_node_3 looks unrelated.

stephanie-wang merged commit e7b752c into ray-project:master Jun 22, 2021
DmitriGekhtman pushed a commit that referenced this pull request Jun 22, 2021
stephanie-wang deleted the fix-dup-args branch June 22, 2021 15:20
Labels: @author-action-required, do-not-merge

Successfully merging this pull request may close these issues:

[core] Tasks with duplicate args sometimes don't get scheduled

5 participants