
[New Model]: Surya OCR #13172

Closed as not planned

Description

@tarun-menta

The model to consider.

I'm trying to port this model - https://huggingface.co/vikp/surya_rec2 - to vLLM. I'm hitting a few roadblocks and need guidance on them.

The closest model vLLM already supports.

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py is the closest model, as it consumes image inputs via cross-attention rather than as inline embeddings the way LLaVA-style models do.

What's your difficulty of supporting the model you want?

The model has the following architecture: Image Encoder (Swin-based) -> Text Encoder -> Decoder.

The Text Encoder performs cross-attention between encoder_input_ids and the image_embeds produced by the image encoder.
The Decoder in turn performs cross-attention between decoder_input_ids and text_encoder_hidden_states.
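For reference, a rough sketch of the forward data flow as I understand it (the module and argument names here are placeholders for the original HF modules, not actual Surya or vLLM class names):

```python
def forward(pixel_values, encoder_input_ids, decoder_input_ids,
            image_encoder, text_encoder, decoder):
    # 1) Swin-based image encoder: pixel_values -> image_embeds
    image_embeds = image_encoder(pixel_values)            # (B, N_img, D)

    # 2) Text encoder: self-attention over encoder_input_ids,
    #    cross-attention over image_embeds
    text_encoder_hidden_states = text_encoder(
        input_ids=encoder_input_ids,
        encoder_hidden_states=image_embeds,
    )                                                      # (B, N_txt, D)

    # 3) Decoder: self-attention over decoder_input_ids,
    #    cross-attention over text_encoder_hidden_states
    logits = decoder(
        input_ids=decoder_input_ids,
        encoder_hidden_states=text_encoder_hidden_states,
    )
    return logits
```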

Implementing the text encoder and decoder mostly follows mllama. However, porting the image encoder fully to vLLM is tricky because of its SwinTransformer-style windowed attention (window partitioning, shifted windows, etc.); see the sketch below.
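To show what I mean by the windowing, this is roughly what Swin's window partitioning looks like (simplified from the standard Swin implementation, not code from this repo). Attention runs independently inside each window, with the windows shifted in alternating blocks, so it doesn't map cleanly onto a single flat token sequence:

```python
def window_partition(x, window_size):
    # x: (B, H, W, C) feature map from a Swin stage
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # -> (num_windows * B, window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows

# e.g. x = torch.randn(1, 56, 56, 96); window_partition(x, 7).shape == (64, 7, 7, 96)
```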

Is there a way to use the (original) image encoder to produce the image embeds, and explicitly cache those for use by the (Text Encoder + Decoder) stack? I know this is roughly how it works for LLaVA-style models, but I'm unclear on how the image encoder outputs are cached rather than recomputed on every forward call.
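To make the question concrete, this is the kind of thing I have in mind: run the original encoder once per image and reuse the result for every decode step. The cache key and helper below are purely illustrative, not an existing vLLM mechanism:

```python
import hashlib
import torch

# Hypothetical cache of image embeds, keyed by the raw pixel bytes.
_embed_cache = {}

def get_image_embeds(pixel_values: torch.Tensor, image_encoder) -> torch.Tensor:
    key = hashlib.sha256(pixel_values.cpu().numpy().tobytes()).hexdigest()
    if key not in _embed_cache:
        with torch.no_grad():
            # Compute once with the original (HF) Swin encoder; reuse afterwards.
            _embed_cache[key] = image_encoder(pixel_values)
    return _embed_cache[key]
```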

Any help would be appreciated. Thank you!

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.


Labels: new-model, stale
