
[WIP] Add DINO DETR Model to HuggingFace Transformers #36711

Draft
wants to merge 17 commits into main
Conversation

konstantinos-p

What does this PR do?

This PR introduces the DINO DETR model (DETR with Improved deNoising anchOr boxes, https://arxiv.org/abs/2203.03605) to the Hugging Face Transformers library. DINO DETR is a state-of-the-art object detection model that builds on the original DETR architecture, incorporating improvements such as:

  • Contrastive denoising training to enhance object queries.
  • Mixed query selection for more robust matching between predictions and ground truths.
  • Look forward twice (LFT) mechanism to refine object boxes.
  • Training techniques that stabilize the bipartite matching loss.

The model achieves strong performance on COCO test-dev (https://paperswithcode.com/sota/object-detection-on-coco).

Fixes #36205

What's included

  • Implementation of DinoDetrModel and DinoDetrForObjectDetection
  • Implementation of DinoDetrImageProcessor
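
For context, usage is expected to mirror the other DETR-family models in the library. A rough sketch using the classes added in this PR (the checkpoint id is a placeholder, and post_process_object_detection is assumed to follow the other DETR image processors):

    import torch
    from PIL import Image
    from transformers import DinoDetrForObjectDetection, DinoDetrImageProcessor

    # Placeholder checkpoint id; no hub checkpoint is published yet.
    processor = DinoDetrImageProcessor.from_pretrained("IDEA-Research/dino-detr")
    model = DinoDetrForObjectDetection.from_pretrained("IDEA-Research/dino-detr")

    image = Image.open("example.jpg")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Assumed to mirror the post-processing API of the other DETR processors.
    results = processor.post_process_object_detection(
        outputs, threshold=0.5, target_sizes=[image.size[::-1]]
    )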

Resources I've used

Who can review?

@amyeroberts, @qubvel

@konstantinos-p
Author

Hello @qubvel! The state of the PR is that I've gotten the forward pass to match the original implementation up to the required precision. I've marked this as a draft because I wanted a first opinion on whether I'm modifying the correct files in the codebase. Let me know if I'm missing anything big. Regarding tests, I've copied some from Deformable DETR but haven't tried getting them to work yet. Let me know if I need to add any beyond what's already there.

@qubvel
Member

qubvel commented Mar 14, 2025

Hi @konstantinos-p! Thanks a lot for working on the model, super excited to see it merged! 🚀

Before diving into the implementation details, here are some general comments we need to address in the PR:

  • Integration Tests
    Let's set up integration tests to ensure output consistency is maintained while refactoring the modeling code. Please use torch.testing.assert_close when comparing tensors (not self.assertTrue(torch.allclose(...))).
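
    A minimal sketch of the pattern (the tensors below are placeholders, not real model outputs):

      import torch

      # In the real test, `logits` would come from a forward pass on a fixed
      # image; here it is a placeholder so the snippet stands alone.
      logits = torch.tensor([[-9.67, -4.32, -6.11]])
      expected_logits = torch.tensor([[-9.67, -4.32, -6.11]])

      # Preferred over self.assertTrue(torch.allclose(...)); on failure this
      # reports which elements differ and by how much.
      torch.testing.assert_close(logits, expected_logits, rtol=1e-4, atol=1e-4)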

  • Modeling Code
    It would be super nice to use the modular approach! It's the way to use inheritance in transformers while keeping the one-model-one-file format; see more information here: https://huggingface.co/docs/transformers/main/en/modular_transformers and in the examples/modular-transformers folder of the repo. Also, the Siglip2 and RT-DETRv2 models were added using modular.
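
    A hypothetical sketch of what a modular file could look like, assuming Deformable DETR is the closest existing implementation:

      # modular_dino_detr.py (hypothetical): inherit shared blocks from an
      # existing model; the modular converter then generates a standalone
      # modeling file from this.
      from transformers.models.deformable_detr.modeling_deformable_detr import (
          DeformableDetrEncoderLayer,
      )

      class DinoDetrEncoderLayer(DeformableDetrEncoderLayer):
          # Unchanged relative to Deformable DETR; only the components that
          # DINO DETR actually modifies would be overridden with new code.
          pass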

  • Consistent Naming for Classes & Modules

    • Model names and types: DinoDetr and dino_detr
    • All module names should be prefixed with the model name. For example, instead of ResidualBlock, use DinoDetrResidualBlock.
  • Code Paths & Cleanup

    • No asserts in the code and a minimum of raise statements; config params should be validated in the config.

    • Remove unused conditional statements if no existing configuration requires them, along with the config parameters that become unnecessary.

      # Instead of:
      if config.use_last_layer_norm:
          self.norm = nn.LayerNorm()
      else:
          self.norm = nn.Identity()
      # If `use_last_layer_norm = True` in all configs, simplify it to:  
      self.norm = nn.LayerNorm()
  • Code Style

    • Use clear, descriptive variable names: avoid single-letter (x) or placeholder (tmp) variables.
    • Add comments for non-obvious code.
    • Use type hints for modules (e.g. in __init__ and forward).
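
    A small illustration of the expected style (the module and argument names are hypothetical):

      import torch
      from torch import nn

      class DinoDetrMLPBlock(nn.Module):  # hypothetical module name
          def __init__(self, hidden_dim: int, intermediate_dim: int) -> None:
              super().__init__()
              self.fc1 = nn.Linear(hidden_dim, intermediate_dim)
              self.fc2 = nn.Linear(intermediate_dim, hidden_dim)

          def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
              return self.fc2(nn.functional.relu(self.fc1(hidden_states)))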
  • Conversion Script Standard
    Please follow the mllama model format for the conversion script: define a single key mapping using regex (see the rt_detr_v2, ijepa, mllama, and superglue models for reference). This is our new standard for all newly added models.
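
    A hypothetical sketch of this pattern (the key names are illustrative, not the real DINO DETR checkpoint keys):

      import re

      # Single regex-based mapping from original checkpoint keys to
      # transformers keys.
      ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
          r"transformer\.encoder\.layers\.(\d+)\.": r"model.encoder.layers.\1.",
          r"backbone\.0\.body\.": r"model.backbone.conv_encoder.model.",
      }

      def convert_old_keys_to_new_keys(state_dict: dict) -> dict:
          new_state_dict = {}
          for old_key, value in state_dict.items():
              new_key = old_key
              for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
                  new_key = re.sub(pattern, replacement, new_key)
              new_state_dict[new_key] = value
          return new_state_dict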

Thanks again! Looking forward to the updates 🚀


@konstantinos-p
Author

Thanks for the comments! I'll start addressing them!

Konstantinos Pitas added 2 commits March 22, 2025 12:39
The integration tests and most unit tests are now passing. The remaining unit-test failures are mainly due to gradient checkpointing not being supported yet, and to the use of shared tensors in the implementation (which causes the saving and loading tests to fail).
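
If the shared tensors are intentional weight tying, one possible direction, offered here only as an assumption about this PR, is to declare them via transformers' _tied_weights_keys so the save/load machinery treats the aliasing as expected. A simplified sketch:

    from transformers import PreTrainedModel

    class DinoDetrForObjectDetection(PreTrainedModel):  # simplified sketch
        # Hypothetical key names: listing intentionally shared parameters here
        # tells saving/loading (e.g. safetensors) that they are tied on purpose.
        _tied_weights_keys = ["bbox_embed.weight", "class_embed.weight"]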