- This fork adds a Google Colab link to try the model
- Evaluation notes on overall scene summary generation using GRiT + Detic with ChatGPT
- Original repo README below
Object Detection - GRiT sometimes maps objects incorrectly to other known classes. This is due to the absence of those classes in the training set, not an inherent deficiency of the model.
This aspect is perhaps a distinguishing factor of GRiT's approach compared to using a model like Detic, which only outputs object classes. Detic's object detection may appear better than GRiT's on the images tested only because of the difference in training sets. If we combine the outputs of both models and use them as input to ChatGPT, we get a rich overall scene summary, as sketched below. Test images were selected from Pexels, a royalty-free site.
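A minimal sketch of the summarization step, assuming the GRiT dense captions (DenseCap mode) and Detic class labels have already been collected into Python lists; the function name, variable names, and the gpt-3.5-turbo model choice are illustrative, not part of either repo:

```python
# Sketch: build a scene-summary prompt from GRiT + Detic outputs (assumed inputs).
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY in the environment

def summarize_scene(grit_captions, detic_labels):
    """grit_captions: list of region descriptions from GRiT (DenseCap mode).
    detic_labels: list of class names detected by Detic."""
    prompt = (
        "Region descriptions from a dense captioning model:\n- "
        + "\n- ".join(grit_captions)
        + "\n\nObject classes from a detector:\n- "
        + "\n- ".join(sorted(set(detic_labels)))
        + "\n\nWrite a short overall summary of the scene."
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage with made-up detections:
# print(summarize_scene(["a man riding a bicycle", "a red car parked on the street"],
#                       ["person", "bicycle", "car"]))
```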
GRiT is a general and open-set object understanding framework that localizes objects and describes them with free-form text in whatever style it was trained on, e.g., class names or descriptive sentences (including object attributes, actions, counts, and more).
GRiT: A Generative Region-to-text Transformer for Object Understanding
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang
State University of New York at Buffalo; Microsoft
arXiv technical report (PDF)
Please follow Installation instructions.
Download the GRiT model or use the following command to download it:
mkdir models && cd models
wget https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth && cd ..
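Equivalently, for example from the Colab notebook, the same checkpoint can be fetched from Python; a small sketch using the URL above:

```python
# Sketch: download the joint DenseCap + ObjectDet checkpoint from Python.
import os
import urllib.request

os.makedirs("models", exist_ok=True)
url = "https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth"
dest = "models/grit_b_densecap_objectdet.pth"
if not os.path.exists(dest):  # skip if already downloaded
    urllib.request.urlretrieve(url, dest)
```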
The downloaded GRiT model was jointly trained on the dense captioning and object detection tasks. With the same trained model, it can output both rich descriptive sentences and short class names by varying the flag --test-task. Play it as follows! 🤩🤩🤩
python demo.py --test-task DenseCap --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth
python demo.py --test-task ObjectDet --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth
Output images will be saved under the visualization folder.
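To browse the saved visualizations inside a notebook such as the linked Colab, a minimal sketch (the visualization folder name comes from the --output flag above; matplotlib and Pillow are assumed to be installed, and the output file extension may differ):

```python
# Sketch: display the visualizations saved by the demo run above.
import glob
from PIL import Image
import matplotlib.pyplot as plt

for path in sorted(glob.glob("visualization/*.jpg")):  # extension may differ
    plt.figure(figsize=(10, 8))
    plt.imshow(Image.open(path))
    plt.title(path)
    plt.axis("off")
    plt.show()
```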
Please follow dataset preparation instructions to download datasets.
Download our trained models and put them in models/ for evaluation.
Model | COCO 2017 val AP | test-dev AP | Download |
---|---|---|---|
GRiT (ViT-B) | 53.7 | 53.8 | model |
GRiT (ViT-L) | 56.4 | 56.6 | model |
GRiT (ViT-H) | 60.4 | 60.4 | model |
To evaluate the trained GRiT on COCO 2017 val, run:
# GRiT (ViT-B)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet --eval-only MODEL.WEIGHTS models/grit_b_objectdet.pth
# GRiT (ViT-L)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_L_ObjectDet.yaml --output-dir-name ./output/grit_l_objectdet --eval-only MODEL.WEIGHTS models/grit_l_objectdet.pth
# GRiT (ViT-H)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_H_ObjectDet.yaml --output-dir-name ./output/grit_h_objectdet --eval-only MODEL.WEIGHTS models/grit_h_objectdet.pth
Model | VG test mAP | Download |
---|---|---|
GRiT (ViT-B) | 15.5 | model |
To test on the VG test set, run:
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_DenseCap.yaml --output-dir-name ./output/grit_b_densecap --eval-only MODEL.WEIGHTS models/grit_b_densecap.pth
It will save the inference results to output/grit_b_densecap/vg_instances_results.json.
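Before handing the results to the VG evaluator, the file can be inspected from Python; a quick sketch that assumes only that it is a JSON list of per-region prediction records (the exact field names are not guaranteed here):

```python
# Sketch: peek at the dense captioning predictions written by the eval run above.
import json

with open("output/grit_b_densecap/vg_instances_results.json") as f:
    results = json.load(f)

print(f"{len(results)} predicted regions")
print(results[0])  # inspect one entry to see the actual field names
```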
We use the official VG dense captioning evaluation codebase to report the results. We didn't integrate the evaluation code into our project as it is written in Lua. To evaluate on VG, please follow the original codebase's instructions and run the evaluation there. We're happy to discuss in our issues section any problems you may encounter when using their code.
To save training memory, we use DeepSpeed, which works well with activation checkpointing in distributed training.
To train on a single machine node, run:
python train_deepspeed.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet
To train on multiple machine nodes, run:
python train_deepspeed.py --num-machines 4 --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet
Our code is in part based on Detic, CenterNet2, detectron2, GIT, and transformers. We thank the authors and appreciate their great work!
If you find our work interesting and would like to cite it, please use the following BibTeX entry.
@article{wu2022grit,
title={GRiT: A Generative Region-to-text Transformer for Object Understanding},
author={Wu, Jialian and Wang, Jianfeng and Yang, Zhengyuan and Gan, Zhe and Liu, Zicheng and Yuan, Junsong and Wang, Lijuan},
journal={arXiv preprint arXiv:2212.00280},
year={2022}
}