-
-
Notifications
You must be signed in to change notification settings - Fork 11.6k
[NIXL] Add compatibility checking to NIXL KV connector handshake #29503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a compatibility check mechanism for the NIXL KV connector to prevent issues arising from mismatched configurations between prefill and decode instances. The compatibility is verified using a hash composed of the vLLM version, a new NIXL connector version, and a hash of the vLLM configuration. This check is performed during the handshake, allowing for a fast-fail with an informative error message if a mismatch is detected. The implementation involves wrapping the NIXL agent metadata in a new NixlHandshakePayload that includes the compatibility hash.
The changes are well-structured and address the problem effectively. My review focuses on improving the robustness of error handling in the handshake process. I've suggested replacing broad except Exception blocks with more specific exception handling to avoid masking unrelated issues and to make the code more resilient.
|
Good suggestion from @tlrmchlsmth - we should have an escape hatch flag to disable this in case users have valid config mismatch use cases |
|
Also need to add an explicit test for compat hash mismatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should have an escape hatch flag to disable this in case users have valid config mismatch use cases
Yes I think checking vLLM+nixl version is generally safe.
vllm_config check is too strict unless we take a subset of fields: eg parallel config can safely differ (heterogeneous deployments), as well as block_size (more compat checks are currently in _validate_handshake).
bbd5cd1 to
cfc0f14
Compare
|
Updates:
This is ready for more detailed review now, I think |
cfc0f14 to
75c6db5
Compare
|
Rebased to pick up #29588 |
Related to llm-d/llm-d#499 When prefill and decode instances have incompatible versions or configurations, it is possible that the handshake will fail with unclear errors, NIXL transfers will fail, or even server crashes might occur. This change adds explicit compatibility checks during handshake to fail fast with actionable error messages. The compatibility check is based on the vLLM version, a NIXL connector version number, and a hash of relevant config. All of these must match exactly. We will likely evolve these compatibility factors to be more or less permissive based on feedback. Compatibility metadata is checked before decoding NIXL agent metadata so that agent metadata schema incompatibilities are caught by the compat hash, rather than as a decoding error. An escape hatch is evailable to disable compat hash checking: ``` --kv-transfer-config `{"kv_connector_extra_config": {"enforce_handshake_compat": false}}` ``` Signed-off-by: Mark McLoughlin <markmc@redhat.com>
75c6db5 to
33d2c0b
Compare
|
Rebased to pick up #29619 |
Related to llm-d/llm-d#499
When prefill and decode instances have incompatible versions or configurations, it is possible that the handshake will fail with unclear errors, NIXL transfers will fail, or even server crashes might occur. This change adds explicit compatibility checks during handshake to fail fast with actionable error messages.
The compatibility check is based on the vLLM version, a NIXL connector version number, and a hash of relevant config. All of these must match exactly. We will likely evolve these compatibility factors to be more or less permissive based on feedback.
Compatibility metadata is checked before decoding NIXL agent metadata so that agent metadata schema incompatibilities are caught by the compat hash, rather than as a decoding error.
An escape hatch is evailable to disable compat hash checking: