-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Fix bug in task dependency management for duplicate args #16365
[core] Fix bug in task dependency management for duplicate args #16365
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. In pullmanager this is solved by de-duping the args, but guess the problem also occurs at a higher level.
src/ray/raylet/dependency_manager.h
Outdated
@@ -192,10 +192,10 @@ class DependencyManager : public TaskDependencyManagerInterface { | |||
|
|||
/// A struct to represent the object dependencies of a task. | |||
struct TaskDependencies { | |||
TaskDependencies(const std::vector<rpc::ObjectReference> &deps) | |||
: num_missing_dependencies(deps.size()) { | |||
TaskDependencies(const std::vector<rpc::ObjectReference> &deps) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider changing this to absl::flat_hash_set so that duplicate edge cases can't happen by construction.
Rerunning tests: |
@@ -455,6 +455,37 @@ def f(x): | |||
assert len(ray.state.objects()) == 0, ray.state.objects() | |||
|
|||
|
|||
def test_many_args(ray_start_cluster): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
::test_lease_request_leak SKIPPED [ 92%]
::test_many_args Windows fatal exception: access violation
This seems to generate a memory access violation on Windows--- is it possible this is actually a real memory corruption bug?
src/ray/raylet/dependency_manager.h
Outdated
const auto dep_ids = ObjectRefsToIds(deps); | ||
dependencies.insert(dep_ids.begin(), dep_ids.end()); | ||
} | ||
TaskDependencies(absl::flat_hash_set<ObjectID> &&deps) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R-value ref along with the move above / below is a bit suspicious, may explain the windows segfault.
Consider a normal ref or just plain value type (std::move() should be efficient in either case).
Can you please open an issue for this and label it 1.4.1 to make sure it is in the branch cut? |
Yeah sorry for the delay too, I don't know what's up with the memory corruption issue... |
Could try to "bisect" the changes to see what causes the memory issue. |
@stephanie-wang per #16566 the test is crashing even sans the c++ change, seems some serializer memory safety issue. How about skipping it on windows to unblock the fix? We can separately try to figure out why this test causes this. |
4b6c946
to
f3124d2
Compare
Oh thanks, was just trying this too. Yeah, sounds good. |
cpp tests are unhappy |
Hmm I'm having trouble reproducing the C++ failure locally, which is weird (these are usually pretty reliable). I just merged, let's see what happens. |
I wonder what this means: |
This issue seems to be in master too, I don't think it's related to this PR. |
test_multi_node_3 looks unrelated. |
* Pytest * Skip on windows * C++
Why are these changes needed?
We overestimated the number of missing args for tasks that have duplicate args. This can lead to the task never being scheduled.
Closes #16556.
Checks
scripts/format.sh
to lint the changes in this PR.