Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix raylet bug in driver cleanup #3962

Merged

Conversation

stephanie-wang
Copy link
Contributor

What do these changes do?

When a driver exits, we clean up all of the task and object dependencies for related tasks on each node. However, there was a bug in the backend that caused us to miss some of the dependencies. Then, the raylet would crash here because a task that had been removed from all of the queues was still tracked by the TaskDependencyManager.

Related issue number

Not sure if there are any open related issues, but this can cause crashes when a driver exits early.

} else {
it++;
}
required_tasks_.erase(creating_task_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be erasing creating_task_id from required_tasks_ or just erasing one element of required_tasks_[creating_task_id][object_id]?

E.g., if there are some other tasks that still require object_id, then creating_task_id might still be required, right?

I may just be confusing myself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the best way would be to erase from required_tasks[creating_task_id], but this really should be removing all the objects anyway, so it should be okay.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment explaining that (or a check checking that that is the case)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry I missed this. But I did add a check to the end of the function that checks whether all of the tasks were erased from required_tasks_.

it++;
}
required_tasks_.erase(creating_task_id);
RAY_CHECK(!CheckObjectRequired(object_id));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check will trivially pass because of the line required_tasks_.erase(creating_task_id); above, but it's not clear to me why we would expect object_id to not be required by any tasks. Couldn't object_id still be required by some task that wasn't in task_ids? Or are we making assumptions about when/how RemoveTasksAndRelatedObjects is called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're making the assumption that if RemoveTasksAndRelatedObjects(task_ids) is called, then it should include all local tasks that depend on tasks in task_ids. Does that make sense? I can update the header to make this clear.

}
required_tasks_.erase(creating_task_id);
RAY_CHECK(!CheckObjectRequired(object_id));
HandleRemoteDependencyCanceled(object_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is a no-op given the above check, right? Since the implementation is

void TaskDependencyManager::HandleRemoteDependencyCanceled(const ObjectID &object_id) {
  bool required = CheckObjectRequired(object_id);
  // If the object is no longer required, then cancel the object.
  if (!required) {
    auto it = required_objects_.find(object_id);
    if (it != required_objects_.end()) {
      object_manager_.CancelPull(object_id);
      reconstruction_policy_.Cancel(object_id);
      required_objects_.erase(it);
    }
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it cancels the object with the object_manager_, etc.

task_dependencies_.erase(*it);
required_tasks_.erase(*it);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not do this anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It happens inside HandleRemoteDependencyCanceled.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11606/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11608/
Test PASSed.

Copy link
Contributor

@pcmoritz pcmoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@stephanie-wang stephanie-wang merged commit 49e9bec into ray-project:master Feb 6, 2019
@stephanie-wang stephanie-wang deleted the fix-driver-cleanup branch February 6, 2019 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants