Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] move function and actor importer away from pubsub #24132

Merged
merged 13 commits into from
Apr 26, 2022

Conversation

fishbone
Copy link
Contributor

@fishbone fishbone commented Apr 23, 2022

Why are these changes needed?

This PR moves function import to a lazy way. Several benefits of this:

  • worker start up is faster since it doesn't need to go through all functions exported
  • gcs pressure is smaller since 1) we don't need to export key and 2) all loads are done when needed.
  • get rid of function table channel

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone marked this pull request as ready for review April 23, 2022 18:46
@fishbone fishbone changed the title move function and actor descriptor away from pubsub [wip]move function and actor descriptor away from pubsub Apr 23, 2022
@fishbone fishbone changed the title [wip]move function and actor descriptor away from pubsub [wip]move function and actor importer away from pubsub Apr 24, 2022
@fishbone fishbone changed the title [wip]move function and actor importer away from pubsub [core] move function and actor importer away from pubsub Apr 24, 2022
@fishbone
Copy link
Contributor Author

@ericl could you help check the PR and let me know whether I missed something? Hopefully, it should work.

If this is working, I'll get rid of run_on_all_nodes and then import_thread.py is not necessary anymore.

Comment on lines 141 to -147
def init_func(worker_info):
a = ray.get_actor("recorder", namespace="n")
a.record.remote(worker_info['worker'].worker_id)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code has a race condition. Before the worker calls run in loop, we can't call any remote function.

If init_func is called before the worker calls run in loop, it'll crash.
If init_func is called after the worker executes tasks, it'll be wrong.

@fishbone
Copy link
Contributor Author

This PR also makes me aware it's not that easy to support create new worker when GCS is down.

Copy link
Member

@mwtian mwtian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we ran nightly tests on this? Moving importing to lazy may have unknown risks.

)
if self.fetch_and_register_remote_function(key) is True:
break
elif not self._worker.actor_id.is_nil():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just else?

job_id,
function_descriptor.function_id.binary(),
)
if self.fetch_and_register_remote_function(key) is True:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this busy loop if fetching fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok since the worker will not make progress anyway before

python/ray/_private/import_thread.py Outdated Show resolved Hide resolved
@mwtian
Copy link
Member

mwtian commented Apr 25, 2022

Great, really hope this can work!

@fishbone
Copy link
Contributor Author

@mwtian it's running here https://buildkite.com/ray-project/release-tests-pr/builds/1232
So far it looks good.

@ericl
Copy link
Contributor

ericl commented Apr 25, 2022

+1 to trigger nightly tests on this. I'm fine with merging this if nightly tests pass 3x in a row with no import related issues or hangs, and also making sure the unit tests are not flaky.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 25, 2022
@fishbone
Copy link
Contributor Author

Win tests are broken but not related to the flow change. Most are because the CI is not written well. I'll fix that and run nightly test again.

@fishbone
Copy link
Contributor Author

@fishbone
Copy link
Contributor Author

@fishbone
Copy link
Contributor Author

Merge to master and trigger the third run here.

https://buildkite.com/ray-project/release-tests-pr/builds/1232
https://buildkite.com/ray-project/release-tests-pr/builds/1370
https://buildkite.com/ray-project/release-tests-pr/builds/1387

I feel there might not be enough resources. Let's wait and see.

@fishbone
Copy link
Contributor Author

Copy link
Member

@mwtian mwtian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One comment is that we are still importing all dependencies in their export order, by calling _do_import() on worker startup and when a dependency is not found. This PR is not changing to per-function lazy import yet, so there is no import ordering issue. Maybe it is useful to clarify in the PR description.

@mwtian
Copy link
Member

mwtian commented Apr 26, 2022

Chatted with Yi offline, the only purpose of _do_import() now is running functions that need to run on all workers. We are indeed importing directly from GCS KV with fetch_and_register_remote_function().

@fishbone fishbone removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 26, 2022
@fishbone
Copy link
Contributor Author

@ericl it seems like lacking of resources and queues forever. The first one has completely finished. The second/third one is partially finished. Do you think it's good to let it go?

@ericl
Copy link
Contributor

ericl commented Apr 26, 2022

We can give it a try and keep an eye out.

@fishbone fishbone merged commit f112b52 into ray-project:master Apr 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants