Description
The model to consider.
I'm trying to port this model - https://huggingface.co/vikp/surya_rec2 - to vLLM. I'm hitting a few roadblocks and need guidance.
The closest model vllm already supports.
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py is the closest model, as it consumes image inputs via cross-attention, unlike LLaVA-style models.
What's your difficulty of supporting the model you want?
The model has the following architecture: Image Encoder (Swin-based) -> Text Encoder -> Decoder. The Text Encoder performs cross-attention over `encoder_input_ids` and the `image_embeds` produced by the image encoder. The Decoder in turn performs cross-attention over `decoder_input_ids` and `text_encoder_hidden_states`.
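For clarity, here is a minimal sketch of the data flow as I understand it. Module and argument names are illustrative only, not the actual HF class names:

```python
import torch.nn as nn


class SuryaRecSketch(nn.Module):
    """Illustrative wiring of the three stages; submodules are assumed to
    accept the keyword arguments shown below."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder  # Swin-based, windowed attention
        self.text_encoder = text_encoder    # cross-attends to image features
        self.decoder = decoder              # cross-attends to text-encoder states

    def forward(self, pixel_values, encoder_input_ids, decoder_input_ids):
        # 1) Image encoder: pixels -> image_embeds
        image_embeds = self.image_encoder(pixel_values)

        # 2) Text encoder: self-attention over encoder_input_ids,
        #    cross-attention over image_embeds
        text_hidden = self.text_encoder(
            input_ids=encoder_input_ids,
            encoder_hidden_states=image_embeds,
        )

        # 3) Decoder: self-attention over decoder_input_ids,
        #    cross-attention over text_encoder_hidden_states
        return self.decoder(
            input_ids=decoder_input_ids,
            encoder_hidden_states=text_hidden,
        )
```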
Implementing the text encoder and decoder mostly follows mllama. However, porting the image encoder fully to vLLM is tricky due to the Swin Transformer-style attention windowing and related details.
Is there a way to use the original image encoder to produce image embeddings and explicitly cache them for use by the (Text Encoder + Decoder) stack? I know this is roughly how LLaVA-style models work, but I am unclear on how the image encoder outputs are cached rather than recomputed on every forward call.
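To frame the question, this is the kind of caching I have in mind on the image-encoder side. It is a rough sketch that assumes the original HF Swin encoder can be run as-is and returns an embedding tensor; how those cached embeddings would then be handed to the vLLM-side text encoder is exactly the part I'm unsure about:

```python
import hashlib

import torch


class CachedImageEncoder:
    """Memoizes image-encoder outputs so identical images are encoded once.

    `image_encoder` is assumed to be the original Swin-based encoder, loaded
    however the surya codebase normally loads it, with a forward that maps
    pixel_values -> image embeddings.
    """

    def __init__(self, image_encoder: torch.nn.Module, device: str = "cuda"):
        self.encoder = image_encoder.to(device).eval()
        self.device = device
        self._cache: dict[str, torch.Tensor] = {}

    def _key(self, pixel_values: torch.Tensor) -> str:
        # Hash the raw pixels so the same image maps to the same cache entry.
        return hashlib.sha256(pixel_values.detach().cpu().numpy().tobytes()).hexdigest()

    @torch.inference_mode()
    def __call__(self, pixel_values: torch.Tensor) -> torch.Tensor:
        key = self._key(pixel_values)
        if key not in self._cache:
            self._cache[key] = self.encoder(pixel_values.to(self.device))
        return self._cache[key]
```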
Any help would be appreciated. Thank you!
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.