Add RADIO Vision Encoder Support to vLLM #24595
Conversation
Code Review
This pull request adds support for the RADIO vision encoder, enabling its use in multimodal models like Nano Nemotron VL. The changes include a new RadioModel implementation, integration into NanoNemotronVL, and corresponding tests. While the implementation is comprehensive, a few critical issues need to be addressed. A potential crash due to unsafe dictionary access in the configuration helper needs to be fixed. The vLLM implementation of RadioInternVisionModel is missing a final normalization layer present in the original model, which will lead to incorrect outputs. Additionally, a bug in the test file could lead to incorrect or inefficient test execution. There are also opportunities to make the weight loading logic more robust by handling unexpected weights.
vllm/model_executor/models/radio.py
Outdated
The RadioInternVisionModel implementation is missing the final normalization layer that is present in the original HuggingFace RadioInternVisionModel. The original model applies a norm layer after the encoder. This omission will lead to incorrect model outputs.
Additionally, the load_weights method in RadioModel silently ignores weights that it doesn't recognize, including the weights for this missing normalization layer (model.norm.weight and model.norm.bias). This makes the issue harder to detect.
You should add the final normalization layer to RadioInternVisionModel and update RadioModel.load_weights to handle its weights.
@DarkLight1337 Fixed the comments.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 2d7ea3b to ae5b38d
LGTM now, thanks
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com> Co-authored-by: root <root@cw-dfw-h100-001-305-026.cm.cluster>
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com> Co-authored-by: root <root@cw-dfw-h100-001-305-026.cm.cluster> Signed-off-by: charlifu <charlifu@amd.com>
This PR implements support for the C-RADIO (RADIO: Reduce All Domains Into One) vision encoder in vLLM, enabling its use with multimodal models like Nano Nemotron VL.
Changes
- New Radio Model Implementation (vllm/model_executor/models/radio.py): RadioInternVisionModel, the core vision model using the InternVision encoder architecture
- Integration Updates (vllm/model_executor/models/nano_nemotron_vl.py)
- Testing (tests/models/multimodal/pooling/test_radio.py): tests against nvidia/C-RADIOv2-H
Technical Notes
- Hardcoded Values: The implementation preserves hardcoded values from the original timm package implementation, including OpenAI CLIP normalization constants and predefined ViT model dimensions, ensuring compatibility and reproducibility.
- Configuration: A new configuration approach instantiates the Radio model based on the InternVision model architecture, with dynamic parameter mapping for different ViT variants.
- Weight Loading: A custom weight loader handles mapping between HuggingFace and vLLM parameter names, supporting models with the radio_model. prefix while skipping unused parameters.
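The name mapping described above might look roughly like the following hedged sketch (the parameter names and the skip-prefix list are illustrative assumptions, not the PR's actual code); note it raises on unexpected names rather than silently dropping them, per the review feedback.

```python
def map_weight_names(
    hf_names,
    vllm_params,
    skip_prefixes=("input_conditioner.",),  # hypothetical known-unused weights
):
    """Map HuggingFace checkpoint names to vLLM parameter names.

    Strips the optional 'radio_model.' prefix, skips explicitly listed
    unused weights, and raises on anything unexpected instead of
    silently ignoring it.
    """
    mapping = {}
    for hf_name in hf_names:
        name = hf_name.removeprefix("radio_model.")
        if name in vllm_params:
            mapping[hf_name] = name
        elif name.startswith(skip_prefixes):
            continue  # known-unused parameter: skip deliberately
        else:
            raise ValueError(f"Unexpected weight name: {hf_name!r}")
    return mapping
```

For example, a checkpoint name `radio_model.model.norm.weight` maps to the vLLM parameter `model.norm.weight`, while an unlisted stray name fails loudly.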