Skip to content

Conversation

@markmc
Copy link
Member

@markmc markmc commented Nov 26, 2025

Related to llm-d/llm-d#499

When prefill and decode instances have incompatible versions or configurations, it is possible that the handshake will fail with unclear errors, NIXL transfers will fail, or even server crashes might occur. This change adds explicit compatibility checks during handshake to fail fast with actionable error messages.

The compatibility check is based on the vLLM version, a NIXL connector version number, and a hash of relevant config. All of these must match exactly. We will likely evolve these compatibility factors to be more or less permissive based on feedback.

Compatibility metadata is checked before decoding NIXL agent metadata so that agent metadata schema incompatibilities are caught by the compat hash, rather than as a decoding error.

An escape hatch is evailable to disable compat hash checking:

--kv-transfer-config '{"kv_connector_extra_config": {"enforce_handshake_compat": false}}'

@chatgpt-codex-connector
Copy link

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a compatibility check mechanism for the NIXL KV connector to prevent issues arising from mismatched configurations between prefill and decode instances. The compatibility is verified using a hash composed of the vLLM version, a new NIXL connector version, and a hash of the vLLM configuration. This check is performed during the handshake, allowing for a fast-fail with an informative error message if a mismatch is detected. The implementation involves wrapping the NIXL agent metadata in a new NixlHandshakePayload that includes the compatibility hash.

The changes are well-structured and address the problem effectively. My review focuses on improving the robustness of error handling in the handshake process. I've suggested replacing broad except Exception blocks with more specific exception handling to avoid masking unrelated issues and to make the code more resilient.

@markmc
Copy link
Member Author

markmc commented Nov 26, 2025

Good suggestion from @tlrmchlsmth - we should have an escape hatch flag to disable this in case users have valid config mismatch use cases

@markmc
Copy link
Member Author

markmc commented Nov 26, 2025

Also need to add an explicit test for compat hash mismatch

Copy link
Collaborator

@NickLucche NickLucche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have an escape hatch flag to disable this in case users have valid config mismatch use cases

Yes I think checking vLLM+nixl version is generally safe.
vllm_config check is too strict unless we take a subset of fields: eg parallel config can safely differ (heterogeneous deployments), as well as block_size (more compat checks are currently in _validate_handshake).

@markmc markmc force-pushed the nixl-handshake-compat-check branch from bbd5cd1 to cfc0f14 Compare November 27, 2025 14:39
@markmc
Copy link
Member Author

markmc commented Nov 27, 2025

Updates:

  1. Greatly restricted what is included in the config hash
  2. Added an escape hatch
  3. Added testing

This is ready for more detailed review now, I think

@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 27, 2025
@markmc markmc force-pushed the nixl-handshake-compat-check branch from cfc0f14 to 75c6db5 Compare November 27, 2025 15:53
@markmc
Copy link
Member Author

markmc commented Nov 27, 2025

Rebased to pick up #29588

Related to llm-d/llm-d#499

When prefill and decode instances have incompatible versions or configurations,
it is possible that the handshake will fail with unclear errors, NIXL transfers
will fail, or even server crashes might occur. This change adds explicit
compatibility checks during handshake to fail fast with actionable error messages.

The compatibility check is based on the vLLM version, a NIXL connector version
number, and a hash of relevant config. All of these must match exactly. We will
likely evolve these compatibility factors to be more or less permissive based on
feedback.

Compatibility metadata is checked before decoding NIXL agent metadata so that
agent metadata schema incompatibilities are caught by the compat hash, rather
than as a decoding error.

An escape hatch is evailable to disable compat hash checking:

```
--kv-transfer-config `{"kv_connector_extra_config": {"enforce_handshake_compat": false}}`
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the nixl-handshake-compat-check branch from 75c6db5 to 33d2c0b Compare November 27, 2025 21:24
@markmc
Copy link
Member Author

markmc commented Nov 27, 2025

Rebased to pick up #29619

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kv-connector ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants