Add support for DeepseekAI's DeepseekVL #36248
base: main
Conversation
@zucchini-nlp, @Rocketknight1, @Cyrilvallez DeepseekVL uses SAM's vision encoder, which today can only be reached through the full model:

```python
from transformers import SamConfig, SamModel

config = SamConfig()
model = SamModel(config)
```

I think that we should rename `SamVisionEncoder` to `SamVisionModel` and make it public. Otherwise, we would have to copy all the classes that build the vision encoder. If you think having a standalone vision model makes sense, I can work on it. Btw, the final result would look like this:

```python
from transformers import SamVisionConfig, SamVisionModel

config = SamVisionConfig()
model = SamVisionModel(config)
```
@geetu040 we had a similar situation with IdeficsVision afair. Yes, in that case we can just make it public and add it in the docs. Renaming though would be breaking; imo we can leave the name as is.
@zucchini-nlp is it okay to do it in the same PR, or should I create a new one?
@geetu040 imo a new PR will make it easier for us to iterate and review
Hi @zucchini-nlp, I am working on the new `SamVisionModel` PR. Can you please answer these 2 questions:
@geetu040 no, that is not expected to have different shapes. Usually using SDPA attention means that no attention weights are returned. I see that the weights are calculated on top of SDPA by a manual matmul of key and query, which imo defeats the purpose of using SDPA in the first place. Can you remove the returned attentions and raise a warning, similar to what is done in ViT?
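For reference, the ViT-style pattern referred to here looks roughly like the sketch below; all class and variable names are illustrative, not the PR's actual code:

```python
import logging

import torch
import torch.nn.functional as F
from torch import nn

logger = logging.getLogger(__name__)


class EagerSelfAttention(nn.Module):
    """Eager attention: materializes softmax(QK^T / sqrt(d)) so it can be returned."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, _ = x.shape
        return x.view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False):
        q, k, v = (self._split_heads(t) for t in self.qkv(hidden_states).chunk(3, dim=-1))
        weights = (q @ k.transpose(-2, -1)) * self.head_dim**-0.5
        weights = weights.softmax(dim=-1)
        out = (weights @ v).transpose(1, 2).flatten(2)
        return out, (weights if output_attentions else None)


class SdpaSelfAttention(EagerSelfAttention):
    """SDPA attention: the fused kernel never exposes the attention weights."""

    def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False):
        if output_attentions:
            # Warn and fall back to eager instead of recomputing QK^T on top of
            # SDPA, which would defeat the purpose of the fused kernel.
            logger.warning(
                "`output_attentions=True` is not supported by SDPA; falling back "
                "to the eager attention implementation."
            )
            return super().forward(hidden_states, output_attentions=True)
        q, k, v = (self._split_heads(t) for t in self.qkv(hidden_states).chunk(3, dim=-1))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).flatten(2), None
```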
@zucchini-nlp Sure, I'll do that.
@zucchini-nlp, @Rocketknight1, @Cyrilvallez Hi everyone, I've moved this PR out of draft. Everything is complete except for the model cards. It's now ready for review. Thanks!
Thanks @geetu040, I will take a look tomorrow
Very clean work, thanks for adding the model! 💛
Overall this looks very much ready for the core maintainers' review. The only thing I want to update is the modeling part with Auto backbones; I left a comment below 👇🏻
Also, we could use modular to copy parts of the processing and image processing. Not required though.
- remove model_id comments in examples
- remove from pre-trained auto mapping
- move to image-text-to-text from vision-to-seq in auto mapping
- add image_token_index to __init__ for config
- remove outdated temporary config in conversion script
- update example to use chat_template in docstring example
- update license 2021->2025
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
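For illustration, the `image_token_index` change from the commit notes reads roughly like the sketch below; the default value and any other constructor arguments are assumptions, not the PR's actual code:

```python
from transformers import PretrainedConfig


class DeepseekVLConfig(PretrainedConfig):
    model_type = "deepseek_vl"

    def __init__(self, image_token_index=100015, **kwargs):
        # Id of the placeholder image token in the text vocabulary; the
        # default here is a made-up value, not the real checkpoint's id.
        self.image_token_index = image_token_index
        super().__init__(**kwargs)
```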
Hi @zucchini-nlp, Thanks for the review, I have updated accordingly.
I wanted to apply modular but can't really find much code that is reusable. We can reuse the 2 ModelOutput classes, but other than that I cannot really think of anything else that would benefit from modular. I was using `copy.deepcopy` to reuse SAM's neck layer instead.
PS: Failing tests are unrelated
Great! I think we can add a `SamNeck` with `# Copied from`, which is more readable than `copy.deepcopy(mylayer)`.
Overall LGTM, approving. I left two open-to-discussion comments for the core maintainers' decision 😉
```python
pixel_values = F.interpolate(
    pixel_values.float(),
    size=self.config.low_res_vision_config.image_size,
    mode="bilinear",
    antialias=True,
)
pixel_values = (pixel_values - self.low_res_vision_mean) / self.low_res_vision_std
```
Not really a fan of applying image processing in model code, but I see why this is the case for DeepSeek. We could technically have a workaround by returning `pixel_values` and `pixel_values_high_res`, which is also not the best option. I'll leave this to Arthur to decide.
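A minimal sketch of that workaround, assuming the image processor is made to emit both resolutions so the model never resizes or normalizes internally (function and key names are hypothetical):

```python
import torch
import torch.nn.functional as F


def preprocess(images: torch.Tensor, low_res_size: int, mean: torch.Tensor, std: torch.Tensor) -> dict:
    """Produce low-res pixel values alongside the original high-res ones."""
    pixel_values = F.interpolate(images.float(), size=low_res_size, mode="bilinear", antialias=True)
    # mean/std are expected with shape (1, channels, 1, 1) so they broadcast.
    pixel_values = (pixel_values - mean) / std
    return {"pixel_values": pixel_values, "pixel_values_high_res": images}
```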
```python
# TODO: update this when https://github.com/huggingface/transformers/pull/36493 is merged
# self.high_res_vision_model = AutoModel.from_config(config.high_res_vision_config)
self.high_res_vision_model = SamVisionEncoder(config.high_res_vision_config)
self.high_res_vision_neck = deepcopy(self.high_res_vision_model.neck)
```
Thanks for explaining. I think copying is still not a good option if the aim is to have the same module with possibly different weights. We can add a `DeepSeekVisionNeck` module copied from Sam and use it here.
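Roughly, the suggestion is the pattern below: duplicate the neck as its own class and let `make fix-copies` keep it in sync via the `# Copied from` marker. The class name and internals shown are simplified assumptions, not SAM's exact neck:

```python
import torch
from torch import nn


# Copied from transformers.models.sam.modeling_sam.SamVisionNeck with Sam->DeepseekVL
class DeepseekVLVisionNeck(nn.Module):
    """Standalone copy of SAM's neck, so no runtime deepcopy of a sibling module
    is needed and the two necks can hold different weights."""

    def __init__(self, hidden_size: int, output_channels: int):
        super().__init__()
        # Simplified: the real neck also applies channels-first layer norms.
        self.conv1 = nn.Conv2d(hidden_size, output_channels, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(output_channels, output_channels, kernel_size=3, padding=1, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # SAM's encoder emits (batch, height, width, channels); convs are channels-first.
        hidden_states = hidden_states.permute(0, 3, 1, 2)
        return self.conv2(self.conv1(hidden_states))
```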
```python
if self.use_high_res_vision:
    self.output_size = config.low_res_vision_config.image_size // config.low_res_vision_config.patch_size
    self.global_attn_index = config.high_res_vision_config.global_attn_indexes[0]
```
Another concern is adding a whole backbone depending on a flag; it's not the first time I'm seeing this trend in VLMs. An option more aligned with transformers, but with code duplication, would be two separate classes: `DeepSeekVisionEncoder` for Siglip and `DeepSeekVisionEncoderHighRes` for Siglip+Sam. Then here we add one of the two depending on the config. Leaving this open for discussion as well.
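A sketch of that alternative with dummy internals; only the dispatch-on-config shape is the point, and every name here is hypothetical:

```python
from torch import nn


class DeepSeekVisionEncoder(nn.Module):
    """Low-res (Siglip-only) path."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, embeds):
        return self.proj(embeds)


class DeepSeekVisionEncoderHighRes(DeepSeekVisionEncoder):
    """Siglip + Sam path, adding a high-res branch on top of the low-res one."""

    def __init__(self, hidden_size: int):
        super().__init__(hidden_size)
        self.high_res_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, embeds, high_res_embeds=None):
        out = super().forward(embeds)
        if high_res_embeds is not None:
            out = out + self.high_res_proj(high_res_embeds)
        return out


def build_vision_encoder(config):
    # One branch at construction time, instead of `use_high_res_vision`
    # checks scattered through the modeling code.
    cls = DeepSeekVisionEncoderHighRes if config.use_high_res_vision else DeepSeekVisionEncoder
    return cls(config.hidden_size)
```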
What does this PR do?
Fixes #36110
This PR adds DeepseekAI's DeepseekVL model to Hugging Face Transformers.
DeepseekVL is an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. It has general multimodal understanding capabilities and can process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied-intelligence scenarios.
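For context, end-user usage after this PR should look roughly like the snippet below; the checkpoint id and generation arguments are assumptions, not taken from the PR:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Hypothetical checkpoint id, for illustration only
model_id = "deepseek-ai/deepseek-vl-1.3b-chat"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/diagram.png"},
            {"type": "text", "text": "Describe this diagram."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
outputs = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```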
Relevant Links
CC: @Benjamin-eecs, @RERV (github contributors of deepseek-ai/DeepSeek-VL)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker, @Rocketknight1, @Cyrilvallez, @zucchini-nlp
TODOs