seanzhuh/Awesome-Open-Vocabulary-Detection-and-Segmentation

Chaoyang Zhu, Long Chen*


✨ PRs are welcome!

Todo

  • Add implementation details for each method, such as template prompts vs. learnable prompts, CLIP text encoder vs. BERT, and initialization of the image encoder.

General Overview

In this survey, we cover two settings (zero-shot and open-vocabulary) and six tasks (object detection, semantic/instance/panoptic segmentation, 3D scene understanding, and video understanding). We build our taxonomy around two questions: whether weak supervision signals are permitted, and how they are used. This makes the taxonomy universal across these diverse settings and tasks. The weak supervision signals can be image-text pairs or large vision-language models. Below is a general overview of each methodology.

In the current literature, zero-shot and open-vocabulary are used interchangeably. However, we note their subtle differences by tracing the evolution from the traditional zero-shot setting to the newly formulated open-vocabulary setting.

Zero-Shot Object Detection

Visual-Semantic Space Mapping

| Venue | Abbr | Project |
| --- | --- | --- |
| ECCV'18 | ZSDv1 | N/A |
| ACCV'18 & IJCV'20 | ZSDv2 | N/A |
| AAAI'20 | CA-ZSR | Code |
| AAAI'19 | ZSD-TD | N/A |
| ACCV'20 | BLC | Code |
| ICCV'19 | TL-ZSD | N/A |
| arXiv'23 | SSB | N/A |
| WACV'20 | MS-Zero | N/A |
| TCSVT'19 | ZS-YOLO | N/A |
| AAAI'21 | DPIF | Code |
| TPAMI'21 | ContrastZSD | N/A |
| IJCAI'20 | ZSD-CNN | N/A |

Novel Visual Feature Synthesis

| Venue | Abbr | Project |
| --- | --- | --- |
| CVPR'20 | DELO | N/A |
| ACCV'20 | SU | Code |
| AAAI'20 | GTNet | Code |
| CVPR'22 | RRFS | Code |

Zero-Shot Segmentation

Zero-Shot Semantic Segmentation

Visual-Semantic Space Mapping

| Venue | Abbr | Project |
| --- | --- | --- |
| CVPR'20 | SPNet | Code |
| NeurIPS'20 | ULZSS | Code |
| ICCV'21 | JoEm | Code |
| ICCVW'19 | VM | N/A |
| ICCV'21 | PMOSR | N/A |

Novel Visual Feature Synthesis

| Venue | Abbr | Project |
| --- | --- | --- |
| NeurIPS'19 | ZS3Net | Code |
| NeurIPS'20 | CSRL | N/A |
| MM'20 | CaGNet | Code |
| ICCV'21 | SIGN | Code |

Zero-Shot Instance Segmentation

| Venue | Abbr | Project |
| --- | --- | --- |
| CVPR'21 | ZSIS | Code |

Open-Vocabulary Object Detection

Region-Aware Training

| Venue | Abbr | Project | Text Encoder | Prompt | Image Backbone (w/ init. method) | Detector |
| --- | --- | --- | --- | --- | --- | --- |
| CVPR'21 | OVR-CNN | Code | BERT | ❌ | R50 (IN-1K) | Faster R-CNN |
| GCPR'22 | LocOv | Code | BERT | ❌ | R50 (IN-1K) | Faster R-CNN |
| arXiv'23 | MMC-Det | N/A | BERT | ❌ | R50 (N/A) | Faster R-CNN/CenterNetv2 |
| NeurIPS'22 | DetCLIP | N/A | | | | |
| CVPR'23 | DetCLIPv2 | N/A | | | | |
| CVPR'24 | DetCLIPv3 | N/A | | | | |
| AAAI'24 | WSOVOD | Code | CLIP | T (cat) | R50 (IN-1K) | Faster R-CNN |
| CVPR'23 | RO-ViT | N/A | CLIP | T (cat) | ViT (ALIGN) | Mask R-CNN |
| ICCV'23 | CFM-ViT | N/A | CLIP | T (cat) | ViT (ALIGN) | Mask R-CNN |
| ICCV'23 | DITO | Code | CLIP | T (cat) | ViT (CLIP, ALIGN, DataComp-1B) | Faster R-CNN |
| ICLR'23 | VLDet | Code | CLIP | T (cat) | R50 (IN-1K) | Faster R-CNN/CenterNetv2 |
| ICCV'23 | GOAT | N/A | CLIP | T (cat) | R50 (IN-1K/RegionCLIP) | Faster R-CNN/CenterNetv2 |
| ECCV'22 | OV-DETR | Code | CLIP | T (cat) | R50 (N/A) | Def-DETR |
| arXiv'23 | Prompt-OVD | N/A | CLIP | T (cat) | ViTDet (IN-1K) | Def-DETR |
| CVPR'23 | CORA | N/A | CLIP | T (cat) | R50 (N/A) | SAM-DETR/CenterNetv2 |
| ICCV'23 | EdaDet | Code | CLIP | T (cat) | | |
| ICCV'21 | MDETR | Code | | | | |
| ECCV'22 | MAVL | Code | | | | |
| NeurIPS'24 | MQ-Det | Code | | | | |
| CVPR'24 | YOLO-World | Code | | | | |
| MM'23 | SGDN | N/A | RoBERTa | ❌ | | |

Pseudo-Labeling

| Venue | Abbr | Project | Text Encoder | Prompt |
| --- | --- | --- | --- | --- |
| CVPR'22 | RegionCLIP | Code | CLIP | T (cat) |
| ECCV'22 | VL-PLM | Code | | |
| CVPR'22 | GLIP | Code | | |
| NeurIPS'22 | GLIPv2 | Code | | |
| arXiv'23 | Grounding-DINO | Code | | |
| ECCV'22 | PromptDet | Code | CLIP | L (cat+desc) |
| arXiv'23 | SAS-Det | Code | CLIP | T (cat) |
| ECCV'22 | PB-OVD | Code | CLIP | T (cat) |
| AAAI'24 | CLIM | Code | CLIP | T (cat) |
| arXiv'22 | VTP-OVD | N/A | CLIP | T (cat) |
| AAAI'24 | ProxyDet | Code | CLIP | T (cat) |
| NeurIPS'23 | CoDet | Code | CLIP | T (cat) |
| ECCV'22 | Detic | Code | CLIP | T (cat) |
| ICML'23 | MMC | Code | CLIP | GPT-3 |
| arXiv'23 | 3Ways | N/A | CLIP | T (cat) |
| arXiv'23 | PLAC | N/A | CLIP | T (cat) |
| arXiv'23 | PCL | N/A | | |
| NeurIPS'24 | OWLv2 | Code | | |

Knowledge Distillation

| Venue | Abbr | Project | Text Encoder | Prompt |
| --- | --- | --- | --- | --- |
| ICLR'22 | ViLD | Code | CLIP | T (cat) |
| ICDMW'22 | ZSD-YOLO | Code | CLIP | T (cat+desc) |
| WACV'24 | LP-OVOD | Code | CLIP | T (cat) |
| arXiv'23 | EZSD | Code | CLIP | T (cat) |
| AAAI'24 | SIC-CADS | Code | CLIP | T (cat) |
| CVPR'23 | BARON | Code | CLIP | T (cat) |
| CVPR'23 | OADP | Code | CLIP | T (cat) |
| arXiv'23 | GridCLIP | N/A | | |
| NeurIPS'22 | RKDWTF | Code | CLIP | T (cat) |
| ICCV'23 | DK-DETR | Code | CLIP | T (cat) |
| CVPR'22 | HierKD | Code | CLIP | T (cat/desc) |
| CVPR'22 | DetPro | Code | CLIP | L (cat) |
| arXiv'23 | CLIPSelf | Code | CLIP | T (cat) |

Transfer Learning

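| Venue | Abbr | Project | Text Encoder | Prompt |
| --- | --- | --- | --- | --- |
| ECCV'22 | OWL-ViT | Code | CLIP | T (cat) |
| CVPR'23 | UniDetector | Code | | |
| ICLR'23 | F-VLM | Code | CLIP | T (cat) |
| CVPR'23 | ScaleDet | N/A | | |
| ICCV'23 | OpenSeed | Code | | |
| arXiv'23 | DRR | N/A | CLIP | T (cat) |
| arXiv'23 | Sambor | Code | | |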

Open-Vocabulary Segmentation

Open-Vocabulary Semantic Segmentation

Region-Aware Training

| Venue | Abbr | Project |
| --- | --- | --- |
| ECCV'22 | OpenSeg | N/A |
| arXiv'23 | SLIC | N/A |
| CVPR'22 | GroupViT | Code |
| ECCV'22 | ViL-Seg | N/A |
| ICML'23 | SegCLIP | Code |
| CVPR'23 | OVSegmentor | Code |
| CVPR'23 | PACL | N/A |
| CVPR'23 | TCL | Code |
| ECCV'22 | SimSeg | Code |

Pseudo-Labeling

| Venue | Abbr | Project |
| --- | --- | --- |
| ECCV'22 | TTD | N/A |

Knowledge Distillation

| Venue | Abbr | Project |
| --- | --- | --- |
| arXiv'23 | GKC | N/A |
| arXiv'23 | SAM-CLIP | N/A |
| ICCV'23 | ZeroSeg | Code |

Transfer Learning

| Venue | Abbr | Project |
| --- | --- | --- |
| ICLR'22 | LSeg | Code |
| CVPR'23 | SAZS | Code |
| MM'23 | CEL | N/A |
| CVPR'22 | ZegFormer | Code |
| NeurIPS'22 | ReCo | Project |
| arXiv'23 | SCAN | N/A |
| ECCV'22 | ZSSeg | Code |
| ECCV'22 | MaskCLIP | Code |
| arXiv'23 | CLIP-DINOiser | Code |
| PRCV'23 | MVP-SEG | N/A |
| arXiv'23 | OVDiff | Project |
| WACV'24 | FOSSIL | N/A |
| NeurIPS'24 | POMP | Code |
| NeurIPS'24 | AttrSeg | N/A |
| arXiv'23 | PnP-OVSS | Code |
| arXiv'23 | TagAlign | Project |
| arXiv'23 | SelfSeg | N/A |
| CVPR'22 | DenseCLIP | Code |
| CVPR'23 | OVSeg | Code |
| arXiv'23 | CAT-Seg | Code |
| arXiv'23 | SED | Code |
| NeurIPS'23 | MAFT | Code |
| arXiv'23 | TagCLIP | N/A |
| CVPR'23 | ZegCLIP | Code |
| CVPR'22 | CLIPSeg | Code |
| CVPR'23 | SAN | Code |
| arXiv'23 | CLIP Surgery | Code |
| arXiv'23 | CaR | Project |

Open-Vocabulary Instance Segmentation

Region-Aware Training

| Venue | Abbr | Project |
| --- | --- | --- |
| ICCV'23 | CGG | Code |
| CVPR'23 | D2Zero | Code |

Pseudo-Labeling

| Venue | Abbr | Project |
| --- | --- | --- |
| CVPR'23 | XPM | Code |
| CVPR'23 | Mask-free OVIS | Code |
| arXiv'23 | MosaicFusion | Code |

Knowledge Distillation

| Venue | Abbr | Project |
| --- | --- | --- |
| arXiv'24 | OV-SAM | Code |

Open-Vocabulary Panoptic Segmentation

Region-Aware Training

| Venue | Abbr | Project |
| --- | --- | --- |
| arXiv'24 | Uni-OVSeg | Code |
| CVPR'23 | X-Decoder | Code |
| CVPR'24 | APE | Code |

Knowledge Distillation

| Venue | Abbr | Project |
| --- | --- | --- |
| CVPR'23 | PADing | Code |

Transfer Learning

| Venue | Abbr | Project |
| --- | --- | --- |
| NeurIPS'23 | FC-CLIP | Code |
| CVPR'23 | FreeSeg | Project |
| arXiv'24 | PosSAM | Project |
| ICCV'23 | MasQCLIP | Project |
| CVPR'23 | OMG-Seg | Code |
| arXiv'23 | Semantic-SAM | Code |
| CVPR'23 | ODISE | Code |
| NeurIPS'23 | HIPIE | Code |
| ICML'23 | MaskCLIP | Project |
| ICCV'23 | OPSNet | N/A |

Open-Vocabulary 3D Scene Understanding

Open-Vocabulary 3D Detection

| Venue | Abbr | Project |
| --- | --- | --- |
| CVPR'23 | OV-3DET | Code |
| AAAI'24 | FM-OV3D | Code |
| arXiv'23 | OpenSight | N/A |
| NeurIPS'23 | CoDA | Code |
| arXiv'23 | L3Det | N/A |

Open-Vocabulary 3D Segmentation

Open-Vocabulary 3D Semantic Segmentation

| Venue | Abbr | Project |
| --- | --- | --- |
| arXiv'21 | SeCondPoint | N/A |
| 3DV'21 | 3DGenZ | Code |
| CVPR'23 | OpenScene | Project |
| CVPR'23 | PLA | Code |
| arXiv'23 | RegionPLC | Project |

Open-Vocabulary 3D Instance Segmentation

| Venue | Abbr | Project |
| --- | --- | --- |
| NeurIPS'23 | OpenMask3D | Project |
| CVPR'24 | MaskClustering | Project |
| arXiv'23 | OpenIns3D | Project |
| arXiv'23 | Open3DIS | Project |

Open-Vocabulary Video Understanding

Open-Vocabulary Video Instance Segmentation

| Venue | Abbr | Project |
| --- | --- | --- |
| ICCV'23 | OV2Seg | Code |
| arXiv'23 | OpenVIS | Code |
| arXiv'24 | BriVIS | Code |

Acknowledgement

If you find our survey helpful, please consider citing our paper:

@article{survey-ovd-ovs,
    title={A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future},
    author={Chaoyang Zhu and Long Chen},
    journal={arXiv preprint arXiv:2307.09220},
    year={2023}
}