Description
The model to consider.
I'm trying to port this model - https://huggingface.co/vikp/surya_rec2 - to vLLM. I'm hitting a few roadblocks and need guidance.
The closest model vllm already supports.
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py is the closest model, as it consumes image inputs via cross-attention, unlike LLaVA-style models.
What's your difficulty of supporting the model you want?
The model has the following architecture: Image Encoder (Swin-based) -> Text Encoder -> Decoder. The Text Encoder performs cross-attention over `encoder_input_ids` and the `image_embeds` produced by the image encoder. The Decoder in turn performs cross-attention over `decoder_input_ids` and `text_encoder_hidden_states`.
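For clarity, here is a minimal sketch of the data flow as I understand it. Module and argument names are illustrative only, not the actual HF class names:

```python
import torch.nn as nn


class SuryaRecSketch(nn.Module):
    """Illustrative wiring of the three stages; submodules are assumed to
    accept the keyword arguments shown below."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder  # Swin-based, windowed attention
        self.text_encoder = text_encoder    # cross-attends to image features
        self.decoder = decoder              # cross-attends to text-encoder states

    def forward(self, pixel_values, encoder_input_ids, decoder_input_ids):
        # 1) Image encoder: pixels -> image_embeds
        image_embeds = self.image_encoder(pixel_values)

        # 2) Text encoder: self-attention over encoder_input_ids,
        #    cross-attention over image_embeds
        text_hidden = self.text_encoder(
            input_ids=encoder_input_ids,
            encoder_hidden_states=image_embeds,
        )

        # 3) Decoder: self-attention over decoder_input_ids,
        #    cross-attention over text_encoder_hidden_states
        return self.decoder(
            input_ids=decoder_input_ids,
            encoder_hidden_states=text_hidden,
        )
```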
Implementing the text encoder and decoder mostly follows mllama. However, porting the image encoder fully to vLLM is tricky due to the Swin Transformer-style attention windowing and related details.
Is there a way to use the original image encoder to produce image embeddings and explicitly cache them for use by the (Text Encoder + Decoder) stack? I know this is roughly how LLaVA-style models work, but I am unclear on how the image encoder outputs are cached rather than recomputed on every forward call.
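To frame the question, this is the kind of caching I have in mind on the image-encoder side. It is a rough sketch that assumes the original HF Swin encoder can be run as-is and returns an embedding tensor; how those cached embeddings would then be handed to the vLLM-side text encoder is exactly the part I'm unsure about:

```python
import hashlib

import torch


class CachedImageEncoder:
    """Memoizes image-encoder outputs so identical images are encoded once.

    `image_encoder` is assumed to be the original Swin-based encoder, loaded
    however the surya codebase normally loads it, with a forward that maps
    pixel_values -> image embeddings.
    """

    def __init__(self, image_encoder: torch.nn.Module, device: str = "cuda"):
        self.encoder = image_encoder.to(device).eval()
        self.device = device
        self._cache: dict[str, torch.Tensor] = {}

    def _key(self, pixel_values: torch.Tensor) -> str:
        # Hash the raw pixels so the same image maps to the same cache entry.
        return hashlib.sha256(pixel_values.detach().cpu().numpy().tobytes()).hexdigest()

    @torch.inference_mode()
    def __call__(self, pixel_values: torch.Tensor) -> torch.Tensor:
        key = self._key(pixel_values)
        if key not in self._cache:
            self._cache[key] = self.encoder(pixel_values.to(self.device))
        return self._cache[key]
```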
Any help would be appreciated. Thank you!
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.