
Add support for DeepseekAI's DeepseekVL #36248

Open · wants to merge 83 commits into main

Conversation

@geetu040 (Contributor) commented Feb 18, 2025

What does this PR do?

Fixes #36110

This PR adds DeepseekAI's DeepseekVL model to Hugging Face Transformers.

DeepseekVL is an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. DeepSeek-VL has general multimodal understanding capabilities and can process logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.

Relevant Links

CC: @Benjamin-eecs, @RERV (GitHub contributors of deepseek-ai/DeepSeek-VL)
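
The kind of usage this PR is meant to enable might look roughly like the sketch below; the checkpoint id, prompt format, and the choice of AutoModelForImageTextToText are assumptions based on this PR's auto mapping, not the final documented API.

# Sketch only: illustrative usage, not the final documented example.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "deepseek-ai/deepseek-vl-1.3b-chat"  # assumed checkpoint id, may differ from the final one
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/diagram.png", stream=True).raw)  # placeholder URL
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this diagram."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])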

Before submitting

Who can review?

@ArthurZucker, @Rocketknight1, @Cyrilvallez, @zucchini-nlp

TODOs

  • Documentation
  • Import Statements and Auto Modeling
  • Modeling
    • VisionModel
    • TextModel
    • AlignerModel
  • Tests
  • Integration Tests
  • Weight Conversion Script
  • Model Cards
  • Tokenizer/Processor
  • Fix CI/CD tests


@geetu040 (Contributor, Author)

@zucchini-nlp, @Rocketknight1, @Cyrilvallez

Deepseek-VL uses Sam as the backbone for encoding high-resolution images.
To be more specific, the backbone is SamVisionEncoder rather than SamModel, and SamVisionEncoder is not available as a public class. In other words, you can do the following with SamModel but not with SamVisionEncoder:

from transformers import SamConfig, SamModel
config = SamConfig()
model = SamModel(config)

I think we should rename SamVisionEncoder -> SamVisionModel, inherit it from SamPreTrainedModel, and make it accessible to the user. I don't think this breaks backward compatibility in any way.

Otherwise, we would have to copy all the classes that build SamVisionEncoder for Deepseek. There is nothing wrong with that either, but having a SamVisionModel alongside SamModel makes sense, since it might benefit someone else as well.

If you think having a SamVisionModel makes sense, should that be done in a separate PR?

By the way, the final result would look like this:

from transformers import SamVisionConfig, SamVisionModel
config = SamVisionConfig()
model = SamVisionModel(config)

and SamVisionConfig is already available publicly.
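
For illustration, a rough sketch of what the proposed public wrapper could look like; the class layout and forward signature here are assumptions, not existing transformers code:

# Sketch only: wrap the (currently non-public) SamVisionEncoder in a PreTrainedModel subclass.
from transformers import SamVisionConfig
from transformers.models.sam.modeling_sam import SamPreTrainedModel, SamVisionEncoder

class SamVisionModel(SamPreTrainedModel):
    config_class = SamVisionConfig
    main_input_name = "pixel_values"

    def __init__(self, config: SamVisionConfig):
        super().__init__(config)
        self.vision_encoder = SamVisionEncoder(config)
        self.post_init()

    def forward(self, pixel_values, output_attentions=None, output_hidden_states=None, return_dict=None):
        # Delegate to the existing encoder so behavior matches SamModel's vision tower.
        return self.vision_encoder(
            pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )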

@zucchini-nlp (Member)

@geetu040 we had a similar situation with IdeficsVision, as far as I recall. Yes, in that case we can just make it public and add it to the docs. Renaming, though, would be breaking; IMO we can leave the name as is.

@geetu040 (Contributor, Author)

@zucchini-nlp is it okay to do this in the same PR, or should I create a new one?

@zucchini-nlp (Member)

@geetu040 IMO a new PR will make it easier for us to iterate and review.

@geetu040 (Contributor, Author)

Hi @zucchini-nlp, I am working on the SamVisionEncoder (going to create the PR soon) and I have a quick question.
I realized that SamVisionAttention and SamVisionSdpaAttention produce attn_weights of different shapes when output_attentions=True.

Can you please answer these two questions:

  1. Is it allowed in transformers for the two attention implementations to produce outputs of different shapes?
  2. And suppose we do something that changes the shape of output_attentions; does that break backward compatibility?

@zucchini-nlp (Member)

@geetu040 no, they are not expected to have different shapes. Usually, using SDPA attention means that no attn_weights are returned, so they should be available only through the 'eager' attention modules.

I see that the weights are calculated on top of SDPA by a manual matmul of key and query, which IMO defeats the purpose of using SDPA in the first place. Can you remove the returned attention and raise a warning, similar to what is done in ViT?
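
For context, a minimal, self-contained sketch of the ViT-style pattern being suggested (a toy single-head attention module, not the actual SamVision attention classes):

# Sketch only: use the fused SDPA kernel when attention weights are not requested,
# and fall back to an eager computation (which can return weights) with a warning otherwise.
import warnings
import torch
import torch.nn as nn
import torch.nn.functional as F

class SdpaAttentionWithFallback(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False):
        q, k, v = self.q(hidden_states), self.k(hidden_states), self.v(hidden_states)
        if output_attentions:
            warnings.warn(
                "scaled_dot_product_attention cannot return attention weights; "
                "falling back to the eager implementation."
            )
            attn_weights = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
            return attn_weights @ v, attn_weights
        # Fast path: no attention weights are materialized.
        return F.scaled_dot_product_attention(q, k, v), None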

@geetu040 (Contributor, Author)

@zucchini-nlp sure, I'll do that.

Shakib-IO and others added 7 commits March 25, 2025 21:31
@geetu040 geetu040 marked this pull request as ready for review March 26, 2025 09:08
@geetu040 (Contributor, Author)

@zucchini-nlp, @Rocketknight1, @Cyrilvallez

Hi everyone, I've moved this PR out of draft. Everything is complete except for the model cards. It's now ready for review. Thanks!

@zucchini-nlp (Member)

Thanks @geetu040, I will take a look tomorrow.

@zucchini-nlp (Member) left a comment

Very clean work, thanks for adding the model! 💛

Overall this looks very much ready for the core maintainer's review. The only thing I want to update is the modeling part with Auto backbones; I left a comment below 👇🏻

Also, we could use modular to copy parts of the processing and image processing. Not required, though.

geetu040 and others added 12 commits March 27, 2025 19:14
- remove model_id comments in examples
- remove from pre-trained auto mapping
- move to image-text-to-text from vision-to-seq in auto mapping
- add image_token_index to __init__ for config
- remove outdated temporary config in conversion script
- update example to use chat_template in docstring example
- update license 2021->2025
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
@geetu040 (Contributor, Author)

Hi @zucchini-nlp, thanks for the review; I have updated accordingly.

Also, we could use modular to copy parts of the processing and image processing. Not required, though.

I wanted to apply modular but can't really find much reusable code. We can reuse the two ModelOutput classes, but apart from that I cannot think of anything else that would benefit from modular. I was using the preprocess method from ViT, but that has also changed now since #36248 (comment). Do you still suggest using modular to keep everything under one file and reuse the few lines we can?

PS: The failing tests are unrelated.

@geetu040 geetu040 requested a review from zucchini-nlp March 28, 2025 10:22
@zucchini-nlp (Member) left a comment

Great! I think we can add SamNeck with "Copied from", which is more readable than copy.deepcopy(mylayer).

Overall LGTM, approving. I left two open-for-discussion comments for the core maintainer's decision 😉

Comment on lines +394 to +400
pixel_values = F.interpolate(
    pixel_values.float(),
    size=self.config.low_res_vision_config.image_size,
    mode="bilinear",
    antialias=True,
)
pixel_values = (pixel_values - self.low_res_vision_mean) / self.low_res_vision_std
@zucchini-nlp (Member)

Not really a fan of applying image processing in model code, but I see why this is the case for DeepSeek.

We could technically have a workaround by returning pixel_values and pixel_values_high_res, which is also not the best option. I'll leave this to Arthur to decide.
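
For reference, a minimal sketch of the processor-side workaround mentioned above; the output key names and the dual-resolution layout follow this discussion and are assumptions:

# Sketch only: do the low-res resize and normalization in the image processor and return
# both tensors, so the model itself no longer interpolates pixel values.
import torch
import torch.nn.functional as F

def preprocess_dual_resolution(images: torch.Tensor, low_res_size: int, mean: torch.Tensor, std: torch.Tensor):
    # images: float tensor of shape (batch, channels, height, width) at the high resolution
    pixel_values = F.interpolate(images.float(), size=low_res_size, mode="bilinear", antialias=True)
    pixel_values = (pixel_values - mean) / std
    return {"pixel_values": pixel_values, "pixel_values_high_res": images}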

# TODO: update this when https://github.com/huggingface/transformers/pull/36493 is merged
# self.high_res_vision_model = AutoModel.from_config(config.high_res_vision_config)
self.high_res_vision_model = SamVisionEncoder(config.high_res_vision_config)
self.high_res_vision_neck = deepcopy(self.high_res_vision_model.neck)
@zucchini-nlp (Member)

Thanks for explaining. I think copying is still not a good option if the aim is to have the same module with possibly different weights. We can add a DeepSeekVisionNeck module, copied from Sam, and use it here.
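
A minimal sketch of that suggestion; the class name follows the comment above, and the simplified neck body (the real SAM neck details may differ) is an assumption:

# Sketch only: a standalone neck duplicated at source level via the "Copied from" convention,
# instead of copy.deepcopy-ing the SAM encoder's neck at runtime.
import torch.nn as nn

# Copied from transformers.models.sam.modeling_sam.SamVisionNeck with Sam->DeepSeek
class DeepSeekVisionNeck(nn.Module):
    def __init__(self, hidden_size: int, output_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(hidden_size, output_channels, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(output_channels, output_channels, kernel_size=3, padding=1, bias=False)

    def forward(self, hidden_states):
        # (batch, height, width, channels) -> (batch, channels, height, width), as in SAM
        hidden_states = hidden_states.permute(0, 3, 1, 2)
        return self.conv2(self.conv1(hidden_states))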

Comment on lines +340 to +342
if self.use_high_res_vision:
    self.output_size = config.low_res_vision_config.image_size // config.low_res_vision_config.patch_size
    self.global_attn_index = config.high_res_vision_config.global_attn_indexes[0]
@zucchini-nlp (Member)

Another concern is about adding a whole backbone depending on a flag; it's not the first time I'm seeing this trend in VLMs.

An option more aligned with transformers, though with code duplication, would be to have two separate encoders: DeepSeekVisionEncoder for Siglip and DeepSeekVisionEncoderHighRes for Siglip+Sam. Then here we add one of the two depending on the config. Leaving this open for discussion as well.
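
A brief sketch of that two-encoder layout; all class and attribute names here are illustrative assumptions:

# Sketch only: two concrete encoder variants chosen once from the config, instead of one
# encoder that conditionally builds a SAM backbone behind a flag.
import torch.nn as nn

class DeepSeekVisionEncoder(nn.Module):
    # Siglip-only low-resolution encoder (real layers elided in this sketch).
    def __init__(self, config):
        super().__init__()
        self.config = config

class DeepSeekVisionEncoderHighRes(nn.Module):
    # Siglip + SAM high-resolution encoder (real layers elided in this sketch).
    def __init__(self, config):
        super().__init__()
        self.config = config

def build_vision_encoder(config):
    # The flag name mirrors the PR's use_high_res_vision; treat it as an assumption.
    cls = DeepSeekVisionEncoderHighRes if config.use_high_res_vision else DeepSeekVisionEncoder
    return cls(config)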

Development

Successfully merging this pull request may close these issues.

Add Deepseek-VL
3 participants