
Add support for DeepseekAI's DeepseekVL #36248

Open · wants to merge 83 commits into main

Conversation

@geetu040 (Contributor) commented Feb 18, 2025

What does this PR do?

Fixes #36110

This PR adds DeepseekAI's DeepseekVL model to Hugging Face Transformers.

DeepseekVL is an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. DeepSeek-VL has general multimodal understanding capabilities and can process logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.

Relevant Links

CC: @Benjamin-eecs, @RERV (GitHub contributors of deepseek-ai/DeepSeek-VL)
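
The kind of usage this PR is meant to enable might look roughly like the sketch below; the checkpoint id, prompt format, and the choice of AutoModelForImageTextToText are assumptions based on this PR's auto mapping, not the final documented API.

# Sketch only: illustrative usage, not the final documented example.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "deepseek-ai/deepseek-vl-1.3b-chat"  # assumed checkpoint id, may differ from the final one
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/diagram.png", stream=True).raw)  # placeholder URL
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this diagram."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])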

Before submitting

Who can review?

@ArthurZucker, @Rocketknight1, @Cyrilvallez, @zucchini-nlp

TODOs

  • Documentation
  • Import Statements and Auto Modeling
  • Modeling
    • VisionModel
    • TextModel
    • AlignerModel
  • Tests
  • Integration Tests
  • Weight Conversion Script
  • Model Cards
  • Tokenizer/Processor
  • Fix CI/CD tests


@geetu040 (Contributor, Author)

@zucchini-nlp, @Rocketknight1, @Cyrilvallez

Deepseek-VL uses Sam as the backbone for encoding high-resolution images.
To be more specific, the backbone is SamVisionEncoder rather than SamModel, and SamVisionEncoder is not available as a public class. In other words, you can do the following with SamModel but not with SamVisionEncoder:

from transformers import SamConfig, SamModel
config = SamConfig()
model = SamModel(config)

I think we should rename SamVisionEncoder -> SamVisionModel, inherit it from SamPreTrainedModel, and make it accessible to the user. I don't think this breaks backward compatibility in any way.

Otherwise, we would have to copy all the classes that build SamVisionEncoder for Deepseek. There is nothing wrong with that either, but having a SamVisionModel alongside SamModel makes sense, since it might benefit someone else as well.

If you think having a SamVisionModel makes sense, should that be done in a separate PR?

By the way, the final result would look like this:

from transformers import SamVisionConfig, SamVisionModel
config = SamVisionConfig()
model = SamVisionModel(config)

and SamVisionConfig is already available publicly.
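
For illustration, a rough sketch of what the proposed public wrapper could look like; the class layout and forward signature here are assumptions, not existing transformers code:

# Sketch only: wrap the (currently non-public) SamVisionEncoder in a PreTrainedModel subclass.
from transformers import SamVisionConfig
from transformers.models.sam.modeling_sam import SamPreTrainedModel, SamVisionEncoder

class SamVisionModel(SamPreTrainedModel):
    config_class = SamVisionConfig
    main_input_name = "pixel_values"

    def __init__(self, config: SamVisionConfig):
        super().__init__(config)
        self.vision_encoder = SamVisionEncoder(config)
        self.post_init()

    def forward(self, pixel_values, output_attentions=None, output_hidden_states=None, return_dict=None):
        # Delegate to the existing encoder so behavior matches SamModel's vision tower.
        return self.vision_encoder(
            pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )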

@zucchini-nlp (Member)

@geetu040 we had a similar situation with IdeficsVision, as far as I recall. Yes, in that case we can just make it public and add it to the docs. Renaming, though, would be breaking; IMO we can leave the name as is.

@geetu040 (Contributor, Author)

@zucchini-nlp is it okay to do this in the same PR, or should I create a new one?

@zucchini-nlp (Member)

@geetu040 IMO a new PR will make it easier for us to iterate and review.

@geetu040 (Contributor, Author)

Hi @zucchini-nlp, I am working on the SamVisionEncoder (going to create the PR soon) and I have a quick question.
I realized that SamVisionAttention and SamVisionSdpaAttention produce attn_weights of different shapes when output_attentions=True.

Can you please answer these two questions:

  1. Is it allowed in transformers for the two attention implementations to produce outputs of different shapes?
  2. And suppose we do something that changes the shape of output_attentions; does that break backward compatibility?

@zucchini-nlp (Member)

@geetu040 no, they are not expected to have different shapes. Usually, using SDPA attention means that no attn_weights are returned, so they should be available only through the 'eager' attention modules.

I see that the weights are calculated on top of SDPA by a manual matmul of key and query, which IMO defeats the purpose of using SDPA in the first place. Can you remove the returned attention and raise a warning, similar to what is done in ViT?
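
For context, a minimal, self-contained sketch of the ViT-style pattern being suggested (a toy single-head attention module, not the actual SamVision attention classes):

# Sketch only: use the fused SDPA kernel when attention weights are not requested,
# and fall back to an eager computation (which can return weights) with a warning otherwise.
import warnings
import torch
import torch.nn as nn
import torch.nn.functional as F

class SdpaAttentionWithFallback(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False):
        q, k, v = self.q(hidden_states), self.k(hidden_states), self.v(hidden_states)
        if output_attentions:
            warnings.warn(
                "scaled_dot_product_attention cannot return attention weights; "
                "falling back to the eager implementation."
            )
            attn_weights = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
            return attn_weights @ v, attn_weights
        # Fast path: no attention weights are materialized.
        return F.scaled_dot_product_attention(q, k, v), None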

@geetu040 (Contributor, Author)

@zucchini-nlp sure, I'll do that.

Shakib-IO and others added 7 commits March 25, 2025 21:31
@geetu040 geetu040 marked this pull request as ready for review March 26, 2025 09:08
@geetu040 (Contributor, Author)

@zucchini-nlp, @Rocketknight1, @Cyrilvallez

Hi everyone, I've moved this PR out of draft. Everything is complete except for the model cards. It's now ready for review. Thanks!

@zucchini-nlp (Member)

Thanks @geetu040, I will take a look tomorrow.

@zucchini-nlp (Member) left a comment

Very clean work, thanks for adding the model! 💛

Overall this looks very much ready for the core maintainer's review. The only thing I want to update is the modeling part with Auto backbones; I left a comment below 👇🏻

Also, we could use modular to copy parts of the processing and image processing. Not required, though.

geetu040 and others added 12 commits March 27, 2025 19:14
- remove model_id comments in examples
- remove from pre-trained auto mapping
- move to image-text-to-text from vision-to-seq in auto mapping
- add image_token_index to __init__ for config
- remove outdated temporary config in conversion script
- update example to use chat_template in docstring example
- update license 2021->2025
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
@geetu040 (Contributor, Author)

Hi @zucchini-nlp, thanks for the review; I have updated accordingly.

Also, we could use modular to copy parts of the processing and image processing. Not required, though.

I wanted to apply modular but can't really find much reusable code. We can reuse the two ModelOutput classes, but apart from that I cannot think of anything else that would benefit from modular. I was using the preprocess method from ViT, but that has also changed now since #36248 (comment). Do you still suggest using modular to keep everything under one file and reuse the few lines we can?

PS: The failing tests are unrelated.

@geetu040 geetu040 requested a review from zucchini-nlp March 28, 2025 10:22
@zucchini-nlp (Member) left a comment

Great! I think we can add SamNeck with "Copied from", which is more readable than copy.deepcopy(mylayer).

Overall LGTM, approving. I left two open-for-discussion comments for the core maintainer's decision 😉

Comment on lines +394 to +400
pixel_values = F.interpolate(
    pixel_values.float(),
    size=self.config.low_res_vision_config.image_size,
    mode="bilinear",
    antialias=True,
)
pixel_values = (pixel_values - self.low_res_vision_mean) / self.low_res_vision_std
@zucchini-nlp (Member)

Not really a fan of applying image processing in model code, but I see why this is the case for DeepSeek.

We could technically have a workaround by returning pixel_values and pixel_values_high_res, which is also not the best option. I'll leave this to Arthur to decide.
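
For reference, a minimal sketch of the processor-side workaround mentioned above; the output key names and the dual-resolution layout follow this discussion and are assumptions:

# Sketch only: do the low-res resize and normalization in the image processor and return
# both tensors, so the model itself no longer interpolates pixel values.
import torch
import torch.nn.functional as F

def preprocess_dual_resolution(images: torch.Tensor, low_res_size: int, mean: torch.Tensor, std: torch.Tensor):
    # images: float tensor of shape (batch, channels, height, width) at the high resolution
    pixel_values = F.interpolate(images.float(), size=low_res_size, mode="bilinear", antialias=True)
    pixel_values = (pixel_values - mean) / std
    return {"pixel_values": pixel_values, "pixel_values_high_res": images}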

# TODO: update this when https://github.com/huggingface/transformers/pull/36493 is merged
# self.high_res_vision_model = AutoModel.from_config(config.high_res_vision_config)
self.high_res_vision_model = SamVisionEncoder(config.high_res_vision_config)
self.high_res_vision_neck = deepcopy(self.high_res_vision_model.neck)
@zucchini-nlp (Member)

Thanks for explaining. I think copying is still not a good option if the aim is to have the same module with possibly different weights. We can add a DeepSeekVisionNeck module, copied from Sam, and use it here.
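
A minimal sketch of that suggestion; the class name follows the comment above, and the simplified neck body (the real SAM neck details may differ) is an assumption:

# Sketch only: a standalone neck duplicated at source level via the "Copied from" convention,
# instead of copy.deepcopy-ing the SAM encoder's neck at runtime.
import torch.nn as nn

# Copied from transformers.models.sam.modeling_sam.SamVisionNeck with Sam->DeepSeek
class DeepSeekVisionNeck(nn.Module):
    def __init__(self, hidden_size: int, output_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(hidden_size, output_channels, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(output_channels, output_channels, kernel_size=3, padding=1, bias=False)

    def forward(self, hidden_states):
        # (batch, height, width, channels) -> (batch, channels, height, width), as in SAM
        hidden_states = hidden_states.permute(0, 3, 1, 2)
        return self.conv2(self.conv1(hidden_states))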

Comment on lines +340 to +342
if self.use_high_res_vision:
    self.output_size = config.low_res_vision_config.image_size // config.low_res_vision_config.patch_size
    self.global_attn_index = config.high_res_vision_config.global_attn_indexes[0]
@zucchini-nlp (Member)

Another concern is about adding a whole backbone depending on a flag; it's not the first time I'm seeing this trend in VLMs.

An option more aligned with transformers, though with code duplication, would be to have two separate encoders: DeepSeekVisionEncoder for Siglip and DeepSeekVisionEncoderHighRes for Siglip+Sam. Then here we add one of the two depending on the config. Leaving this open for discussion as well.
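
A brief sketch of that two-encoder layout; all class and attribute names here are illustrative assumptions:

# Sketch only: two concrete encoder variants chosen once from the config, instead of one
# encoder that conditionally builds a SAM backbone behind a flag.
import torch.nn as nn

class DeepSeekVisionEncoder(nn.Module):
    # Siglip-only low-resolution encoder (real layers elided in this sketch).
    def __init__(self, config):
        super().__init__()
        self.config = config

class DeepSeekVisionEncoderHighRes(nn.Module):
    # Siglip + SAM high-resolution encoder (real layers elided in this sketch).
    def __init__(self, config):
        super().__init__()
        self.config = config

def build_vision_encoder(config):
    # The flag name mirrors the PR's use_high_res_vision; treat it as an assumption.
    cls = DeepSeekVisionEncoderHighRes if config.use_high_res_vision else DeepSeekVisionEncoder
    return cls(config)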

Development

Successfully merging this pull request may close these issues.

Add Deepseek-VL
3 participants