
[serve] Remove BackendConfig broadcasting #19154

Merged · 39 commits · Oct 13, 2021

Conversation

@edoakes (Contributor) commented Oct 6, 2021

Why are these changes needed?

Depends on: #19145

Instead of broadcasting max_concurrent_queries at the deployment level, this PR attaches it as additional metadata on each individual replica.
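
A minimal sketch of the resulting shape. ReplicaInfo is a real structure in Ray Serve, but the fields and the can_assign helper below are illustrative assumptions, not the actual definitions:

from dataclasses import dataclass

@dataclass(frozen=True)
class ReplicaInfo:
    replica_tag: str
    actor_handle: object          # would be a Ray ActorHandle in practice
    max_concurrent_queries: int   # carried per replica instead of broadcast

def can_assign(replica: ReplicaInfo, in_flight: int) -> bool:
    # The router reads the limit directly off each replica it discovers;
    # no separate long-poll channel for deployment-level config is needed.
    return in_flight < replica.max_concurrent_queries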

Related issue number

Closes #19147

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@edoakes (Contributor, Author) commented Oct 6, 2021

@simon-mo review #19145 first

@edoakes force-pushed the dont-poll-max-concurrent-queries branch from b37a0cd to 6bd3f6a on October 6, 2021 at 18:55
@edoakes force-pushed the dont-poll-max-concurrent-queries branch from 6bd3f6a to 9743854 on October 6, 2021 at 19:18
@simon-mo (Contributor) commented Oct 6, 2021

Yup, totally agree this is the right track. For the Java backend prototype, I ended up adding a similar structure to ReplicaInfo.


# Set the hash seed to ensure the hashes are consistent across processes.
# Note: hash randomization is fixed at interpreter startup, so this setting
# only takes effect in child processes that inherit the environment.
# We should probably use a purpose-built hashing algorithm like CRC32 instead.
os.environ["PYTHONHASHSEED"] = "0"
@edoakes (Contributor, Author) commented on the diff above:

cc @jiaodong: discovered this while working on this PR -- the user_config hashes differ across processes, so the recovery logic is likely not working for these...
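
A minimal illustration of the failure mode (an editor's sketch, not code from the PR): each fresh interpreter picks its own hash seed, so the same string hashes to a different value in each process.

import subprocess, sys

# Each child interpreter gets its own random hash seed (unless
# PYTHONHASHSEED is already set in the environment), so these two
# printed values will almost certainly differ.
cmd = [sys.executable, "-c", "print(hash('user_config'))"]
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())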

A project member replied:

Wow, this was quite subtle, but great catch. Let's follow up more in Slack, as it seems we need more thorough testing and hardening around the FT stuff.

A contributor replied:

Let's categorize this as tech debt and add an issue to the tech-debt milestone? Setting PYTHONHASHSEED is generally bad, since it might also affect whatever user code depends on it:

Hash randomization is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst-case performance of dict construction, O(n²) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
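
A small sketch of the leak being warned about (illustrative, not PR code): the override is process-global, so it is inherited by any subprocess that user code spawns afterwards.

import os, subprocess, sys

os.environ["PYTHONHASHSEED"] = "0"  # the override under discussion

# Any subprocess spawned from this point inherits the fixed seed,
# disabling hash-randomization DoS protection for user code too.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['PYTHONHASHSEED'])"],
    capture_output=True,
    text=True,
)
print(child.stdout.strip())  # prints: 0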

@edoakes (Contributor, Author) replied:

@simon-mo take a look at the updated pickle.dumps version?
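
For context, a minimal sketch of what a pickle.dumps-based hash could look like, echoing the CRC32 suggestion in the code comment above; the PR's actual helper may differ:

import pickle, zlib

def stable_config_hash(user_config) -> int:
    # pickle.dumps does not consult the built-in hash() for most objects,
    # so the digest is stable across processes. Caveat: the bytes are only
    # deterministic when serialization order is (sets of strings, for
    # example, may pickle in different orders under hash randomization).
    return zlib.crc32(pickle.dumps(user_config))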

@edoakes force-pushed the dont-poll-max-concurrent-queries branch from dfe0b94 to 3570f58 on October 8, 2021 at 18:07
@edoakes (Contributor, Author) commented Oct 12, 2021

@simon-mo this is ready for review now

@simon-mo (Contributor) left a review:

LGTM. Two minor comments about replica.py code changes.

(Two review threads on python/ray/serve/replica.py, both outdated and resolved.)
@edoakes merged commit 2ac81f3 into ray-project:master on Oct 13, 2021
Linked issue closed by this PR: Don't use long poll client for max_concurrent_queries (#19147)