This repository contains the implementation of a Visual Grounding model that can localize objects in an image based on a natural language text prompt. The model identifies the region of interest described by the text and predicts a bounding box around it. The architecture leverages powerful pre-trained models: DINOv3 as the vision backbone and BGE-Small-EN-v1.5 as the text encoder. These two modalities are fused using a Transformer Decoder, which then predicts the bounding box coordinates.
- Text-to-Object Localization: Takes an image and a text prompt to find a specific object.
- Powerful Encoders: Utilizes frozen, state-of-the-art vision (DINOv3) and text (BGE) encoders for robust feature extraction.
- Transformer-based Fusion: Employs a Transformer Decoder to effectively combine visual and textual information.
- End-to-End Training: Trained on the RefCOCO dataset for referring expression grounding.
The model is composed of three main parts:
- Vision Encoder: A frozen `facebook/dinov3-vitb16-pretrain-lvd1689m` model processes the input image to extract rich visual features.
- Text Encoder: A frozen `BAAI/bge-small-en-v1.5` model processes the text prompt to generate contextual text embeddings.
- Fusion and Prediction:
  - Linear projection layers align the dimensions of the image and text embeddings.
  - A 6-layer Transformer Decoder takes the text embeddings as the query (`tgt`) and the image embeddings as the memory (`memory`) to fuse the information.
  - The output from the decoder is passed through a final MLP head to predict the four bounding box coordinates (`xmin`, `ymin`, `xmax`, `ymax`).
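The fusion and prediction stage described above can be sketched roughly as follows. This is not the notebook's `GroundingModel`; the class name `FusionHeadSketch`, the mean-pooling of the decoder output, the two-layer head, and the sigmoid that normalizes coordinates to `[0, 1]` are all assumptions made for illustration. The input widths (768 for DINOv3 ViT-B/16, 384 for BGE-Small) are the hidden sizes of the two encoders, and random tensors stand in for their outputs so the sketch runs without downloading weights.

```python
import torch
import torch.nn as nn


class FusionHeadSketch(nn.Module):
    """Hypothetical fusion + box head operating on precomputed encoder features."""

    def __init__(self, img_dim=768, txt_dim=384, dim=512, num_layers=6, nhead=8):
        super().__init__()
        # Linear projections bring both modalities to a shared width.
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Assumption: a small two-layer MLP regresses the four coordinates.
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, img_tokens, txt_tokens):
        memory = self.img_proj(img_tokens)   # image tokens act as the memory
        tgt = self.txt_proj(txt_tokens)      # text tokens act as the query
        fused = self.decoder(tgt=tgt, memory=memory)
        pooled = fused.mean(dim=1)           # assumption: mean-pool over text tokens
        return self.box_head(pooled).sigmoid()  # assumption: coordinates in [0, 1]


head = FusionHeadSketch().eval()
img_tokens = torch.randn(2, 196, 768)  # stand-in for image encoder output
txt_tokens = torch.randn(2, 12, 384)   # stand-in for text encoder output
with torch.no_grad():
    boxes = head(img_tokens, txt_tokens)
print(boxes.shape)  # torch.Size([2, 4])
```

Note that a plain sigmoid does not guarantee `xmin < xmax` or `ymin < ymax`; a real implementation has to handle that, for example by sorting the corners after prediction.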
- Total Parameters: 144,968,836
- Trainable Parameters: 25,948,420
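The two counts above imply 144,968,836 - 25,948,420 = 119,020,416 frozen parameters. As a minimal sketch of how such counts are usually obtained in PyTorch, the helper below (a hypothetical name, `count_parameters`) is applied to a toy model with one frozen layer rather than to the real encoders.

```python
import torch.nn as nn


def count_parameters(model):
    """Return (total, trainable) parameter counts for a module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


# Toy stand-in: a frozen "encoder" followed by a trainable "head".
encoder = nn.Linear(4, 3)      # 4*3 weights + 3 biases = 15 parameters
head = nn.Linear(3, 2)         # 3*2 weights + 2 biases = 8 parameters
encoder.requires_grad_(False)  # freezing excludes these from the trainable count
model = nn.Sequential(encoder, head)

total, trainable = count_parameters(model)
print(total, trainable)  # 23 8
```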
Download the Model Weights From Here
The model was trained on the RefCOCO dataset, a standard benchmark for referring expression grounding. The dataset was loaded using the Hugging Face `datasets` library.
Dataset Splits:
```
DatasetDict({
    train: Dataset({
        features: ['file_name', 'raw_sentences', 'bbox'],
        num_rows: 42404
    }),
    validation: Dataset({
        features: ['file_name', 'raw_sentences', 'bbox'],
        num_rows: 3811
    }),
    test: Dataset({
        features: ['file_name', 'raw_sentences', 'bbox'],
        num_rows: 1975
    }),
    testB: Dataset({
        features: ['file_name', 'raw_sentences', 'bbox'],
        num_rows: 1810
    })
})
```

After processing sentences, the final sample counts were:
- Train Samples: 120,624
- Validation Samples: 10,834
- Test Samples: 10,752
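The jump from 42,404 training rows to 120,624 training samples (roughly 2.8 per row) is consistent with each row carrying several referring sentences for one box. A minimal sketch of that expansion is shown below; it assumes `raw_sentences` is already a list of strings, which may not match the dataset's stored format, and the function name `expand_rows` and the output field `sentence` are hypothetical.

```python
def expand_rows(rows):
    """Emit one (image, sentence, box) sample per referring sentence."""
    samples = []
    for row in rows:
        for sentence in row["raw_sentences"]:
            samples.append(
                {
                    "file_name": row["file_name"],
                    "sentence": sentence,
                    "bbox": row["bbox"],  # every sentence of a row shares its box
                }
            )
    return samples


# Made-up rows for illustration only.
rows = [
    {"file_name": "a.jpg", "raw_sentences": ["left dog", "the dog on the left"], "bbox": [10, 20, 110, 220]},
    {"file_name": "b.jpg", "raw_sentences": ["red cup"], "bbox": [5, 5, 50, 60]},
]
samples = expand_rows(rows)
print(len(samples))  # 3
```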
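The model was trained for 10 epochs using the AdamW optimizer and a cosine learning rate scheduler. The loss was `distance_box_iou_loss` (Distance-IoU), which penalizes both low overlap and the distance between the predicted and ground-truth box centers. The training code is in the accompanying `.ipynb` notebook.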
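To make the loss concrete, here is a minimal pure-Python sketch of the Distance-IoU formula for a single pair of boxes in `(xmin, ymin, xmax, ymax)` format: `1 - IoU + d^2 / c^2`, where `d` is the distance between box centers and `c` is the diagonal of the smallest box enclosing both. The function name `diou_loss` is hypothetical; training would presumably use the batched tensor version, `torchvision.ops.distance_box_iou_loss`.

```python
def diou_loss(box_a, box_b, eps=1e-7):
    """Distance-IoU loss for two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2

    return 1.0 - iou + d2 / (c2 + eps)


# IoU = 1/7, d^2 = 2, c^2 = 18, so the loss is 1 - 1/7 + 2/18.
loss = diou_loss((0, 0, 2, 2), (1, 1, 3, 3))
print(round(loss, 4))  # 0.9683
```

Unlike a plain `1 - IoU` loss, the center-distance term still provides a useful signal when the boxes do not overlap at all.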
Training Results:
| Epoch | Train Loss | Train IoU | Val Loss | Val IoU |
|-------|------------|-----------|----------|---------|
| 1 | 0.6769 | 0.3658 | 0.6040 | 0.4308 |
| 2 | 0.5558 | 0.4791 | 0.5361 | 0.4991 |
| 3 | 0.5027 | 0.5302 | 0.5003 | 0.5329 |
| 4 | 0.4678 | 0.5628 | 0.4795 | 0.5525 |
| 5 | 0.4392 | 0.5892 | 0.4518 | 0.5778 |
| 6 | 0.4104 | 0.6155 | 0.4352 | 0.5940 |
| 7 | 0.3856 | 0.6380 | 0.4212 | 0.6065 |
| 8 | 0.3655 | 0.6562 | 0.4093 | 0.6174 |
| 9 | 0.3509 | 0.6696 | 0.4065 | 0.6197 |
| 10 | 0.3440 | 0.6759 | 0.4056 | 0.6205 |
- Mean IoU (mIoU): 0.6196
- Mean Average Precision (mAP):
  - mAP@50: 0.5746
  - mAP@75: 0.2173
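For intuition about these metrics, the sketch below computes mean IoU and the fraction of predictions whose IoU clears a threshold. It is a simplified proxy, not the mAP computation behind the numbers above (which presumably came from `torchmetrics` and also ranks predictions by confidence), so it should not be expected to reproduce them. The function names and example boxes are made up.

```python
def box_iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou_and_hit_rates(preds, targets, thresholds=(0.5, 0.75)):
    """Mean IoU plus, per threshold, the share of predictions at or above it."""
    ious = [box_iou(p, t) for p, t in zip(preds, targets)]
    mean_iou = sum(ious) / len(ious)
    hit_rates = {t: sum(i >= t for i in ious) / len(ious) for t in thresholds}
    return mean_iou, hit_rates


preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (0, 0, 10, 6), (20, 20, 30, 30)]  # IoUs: 1.0, 0.6, 0.0
mean_iou, hit_rates = mean_iou_and_hit_rates(preds, targets)
print(round(mean_iou, 4), hit_rates)
```

The gap between the reported mAP@50 and mAP@75 suggests many predictions overlap the target but are not tightly aligned with it, which is the kind of pattern a threshold sweep like this makes visible.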
To use the trained model, load the state dictionary and call the `predict_and_crop` function provided in the notebook.
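The notebook's `predict_and_crop` is not reproduced here. As a rough illustration of the last step such a function needs, the sketch below maps a predicted box to pixel coordinates under the assumption that the model outputs coordinates normalized to `[0, 1]`; check the notebook for the actual convention. The helper name `to_pixel_box` is hypothetical.

```python
def to_pixel_box(norm_box, width, height):
    """Convert a normalized (xmin, ymin, xmax, ymax) box to integer pixel coordinates."""
    xmin, ymin, xmax, ymax = norm_box
    # Order the corners in case the regressor predicted them swapped.
    x0, x1 = sorted((xmin, xmax))
    y0, y1 = sorted((ymin, ymax))

    def clamp(value):
        # Keep the box inside the image.
        return min(max(value, 0.0), 1.0)

    return (
        round(clamp(x0) * width),
        round(clamp(y0) * height),
        round(clamp(x1) * width),
        round(clamp(y1) * height),
    )


pixel_box = to_pixel_box((0.25, 0.5, 0.75, 1.2), width=1000, height=600)
print(pixel_box)  # (250, 300, 750, 600)
```

The resulting `(left, upper, right, lower)` tuple is in the order that Pillow's `Image.crop` expects.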
Requirements:
```bash
pip install torch torchvision transformers datasets albumentations torchmetrics Pillow requests
```
Inference Code:
```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# NOTE: GroundingModel, predict_and_crop, and transforms are defined in the
# notebook; run or import those definitions before this snippet.

# --- 1. Load Model Components ---
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
IMG_SIZE = 224
DIM = 512

# Load the pre-trained encoders and the tokenizer
img_model = AutoModel.from_pretrained("facebook/dinov3-vitb16-pretrain-lvd1689m")
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
text_model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")

# Build the 6-layer Transformer Decoder used for fusion
decoder_layer = nn.TransformerDecoderLayer(d_model=DIM, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# --- 2. Initialize and Load Trained Model ---
inf_model = GroundingModel(img_model, text_model, decoder, DIM).to(device)
inf_model_path = 'path/to/your/best_model_iou.pt'
inf_model.load_state_dict(torch.load(inf_model_path, map_location=device))
inf_model.eval()

# --- 3. Run Inference ---
image_url = "https://bouldervet.com/wp-content/uploads/2023/09/dog-cat-coexistence-1024x683.jpg"
prompt = "the cat on the sofa"

# The 'transforms' object should also be defined as in the notebook
cropped_object, box, image_with_box = predict_and_crop(
    image_url, prompt, inf_model, tokenizer, transforms, device
)

# Display the results
if cropped_object is not None:
    image_with_box.show()
    cropped_object.show()
```

While the current model reaches a mean IoU of about 0.62, there are several avenues for future improvement:
- Contrastive Loss: Implement a contrastive loss function (e.g., CLIP-style loss) in addition to the bounding box regression loss. This would encourage the model to learn a more aligned joint embedding space, pushing corresponding image regions and text descriptions closer together and improving overall grounding accuracy.
- Expanded Datasets: Augment the training data with more diverse and challenging datasets beyond RefCOCO. Incorporating images from different domains (e.g., medical, satellite, artistic) could significantly enhance the model's generalization capabilities and make it more robust to out-of-distribution examples.
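As a rough illustration of the contrastive idea in the first item, the sketch below implements a symmetric CLIP-style (InfoNCE) loss over a batch of paired embeddings. How pooled region and text embeddings would be taken from this model is left open; the function name `clip_style_contrastive_loss` is hypothetical, and the fixed temperature of 0.07 is only a common starting value.

```python
import torch
import torch.nn.functional as F


def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matching pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)


emb = torch.eye(4)  # four mutually orthogonal toy embeddings
matched = clip_style_contrastive_loss(emb, emb)                      # pairs aligned
shuffled = clip_style_contrastive_loss(emb, emb.roll(1, dims=0))     # pairs misaligned
print(matched.item() < shuffled.item())  # True
```

In training, a term like this would typically be added to the box regression loss with a weighting factor, which would need to be tuned.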
This project was made possible by leveraging several incredible open-source models and datasets.
```bibtex
@misc{simeoni2025dinov3,
  title={DINOv3},
  author={Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and others},
  year={2025},
  eprint={2508.10104},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{bge_embedding,
  title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
  author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
  year={2023},
  eprint={2309.07597},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{liu2023grounding,
  title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
  author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Hao Zhang and Jie Yang and Chunyuan Li and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang},
  year={2023},
  eprint={2303.05499},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@inproceedings{yu2016modeling,
  title={Modeling context in referring expressions},
  author={Yu, Licheng and Poirson, Patrick and Yang, Shan and Berg, Alexander C and Berg, Tamara L},
  booktitle={European conference on computer vision},
  pages={69--85},
  year={2016},
  organization={Springer}
}
```



