[Core] [runtime env] Add RuntimeEnvHash and JobID to SchedulingKey #16766
Conversation
Looks good! Should we also have a test for the case where they use the same env and a new worker is started? (I remember there being a similar issue there)
@@ -74,7 +74,8 @@ Status CoreWorkerDirectTaskSubmitter::SubmitTask(TaskSpecification task_spec) {
   const SchedulingKey scheduling_key(
       task_spec.GetSchedulingClass(), task_spec.GetDependencyIds(),
       task_spec.IsActorCreationTask() ? task_spec.ActorCreationId()
-                                      : ActorID::Nil());
+                                      : ActorID::Nil(),
+      task_spec.GetRuntimeEnvHash(), task_spec.JobId());
Can we confirm that it makes sense to add JobID here?
Hmmm, probably best not to have job id since it doesn't really do anything. Workers belong to jobs, so job id will always be the same.
Sorry, there are still some pretty blatant compiler errors, not sure how it compiled locally... let me fix them
      serialized_runtime_env(serialized_runtime_env) {}

bool WorkerCacheKey::operator==(const WorkerCacheKey &k) const {
  return Hash() == k.Hash();
}
This has the potential to return True when the env vars aren't actually the same, doesn't it? Is that intentional? If so, can we add a comment explaining why?
Good point, I'm fine with comparing all the fields for equality. I think @edoakes you had some input on this earlier, do you remember? Was it for speed or something?
@architkulkarni can't we remove the override_environment_variables very soon? In that case they'd just be included in the hashed env. We should probably include everything.
Oh yeah, the env vars are already always included in the hash. Alex's concern was just about comparing hashes in operator== as opposed to comparing all the fields.
I'll soon make a PR to remove override_environment_variables, and in that PR I'll switch out the implementation of operator== to compare the full serialized_runtime_env instead of the hashes.
You're right, I'll add a test for that
Tests look fine. Merging.
…16766)
* add python integration test
* improve readability
* remove unnecessary ray start --head
* add shutdown_only
* move RuntimeEnvHash from worker_pool to task_spec
* lint
* Add runtimeEnvHash and JobID to SchedulingKey
* remove JobID from key and hopefully fix compile
* add test for same env
* lint
Why are these changes needed?
A worker can only be assigned to a task if its runtime env hash matches the one in the task spec. (https://github.com/ray-project/ray/blob/master/src/ray/raylet/worker_pool.cc#L929-L942)
If a remote task is called that requires a runtime env to be installed, the worker lease request will remain unfulfilled for the duration of the installation, which can take ~50 seconds. This blocks other worker lease requests, because we rate-limit to a single lease request per SchedulingKey. (https://sourcegraph.com/github.com/ray-project/ray/-/blob/src/ray/core_worker/transport/direct_task_transport.cc?L441)
The issue is that if we subsequently call another remote task with a different runtime env (or the same runtime env), it won't be scheduled until the first worker lease request is fulfilled.
The fix is to add RuntimeEnvHash to the SchedulingKey, which this PR does.
Related issue number
Closes #16226.
Closes #16537 by adding a test which fails before this PR, but passes with the fix in this PR. The test is a single-driver version of the reproduction in #16226, which required two drivers.
Checks
I've run scripts/format.sh to lint the changes in this PR.