Cluster2Former: Semisupervised Clustering Transformers for Video Instance Segmentation (Sensors 2024)

Áron Fóthi, Adrián Szlatincsán, Ellák Somfai

[mdpi][BibTeX]

Abstract

A novel approach for video instance segmentation is presented using semisupervised learning. Our Cluster2Former model leverages scribble-based annotations for training, significantly reducing the need for comprehensive pixel-level masks. We augment a video instance segmenter, for example, the Mask2Former architecture, with similarity-based constraint loss to handle partial annotations efficiently. We demonstrate that despite using lightweight annotations (using only 0.5% of the annotated pixels), Cluster2Former achieves competitive performance on standard benchmarks. The approach offers a cost-effective and computationally efficient solution for video instance segmentation, especially in scenarios with limited annotation resources.

Keywords: transformers; video processing; instance segmentation; semisupervised learning

Features

A single architecture for panoptic, instance and semantic segmentation.
Based on Mask2Former, with no change to the model architecture.
With scribble-like annotations and the similarity-based constraint loss, you can achieve competitive performance with much less annotation effort than full mask annotation requires.
Tensorboard visualization support during training and evaluation
Supports the major VIS datasets and their scribble versions: YouTubeVIS 2019/2021 and OVIS.
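
This README does not spell out the similarity-based constraint loss, so the following is only a minimal sketch of one common pairwise formulation, not necessarily the one used in the paper or in this repository. It assumes that each scribble pixel carries a probability distribution over candidate instance slots and that two pixels should agree exactly when they come from the same scribble. The function name `similarity_constraint_loss` and its arguments are hypothetical, and NumPy is used instead of PyTorch to keep the example self-contained.

```python
import numpy as np


def similarity_constraint_loss(probs, labels, eps=1e-7):
    """Pairwise similarity loss over annotated (scribble) pixels.

    probs:  (N, K) array; each row is a probability distribution over
            K candidate instance slots (assumed, e.g. a softmax over queries).
    labels: (N,) integer instance ids taken from the scribbles.
    """
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    if len(labels) < 2:
        return 0.0  # no pairs, so nothing to constrain

    # Predicted probability that pixels i and j land in the same slot.
    sim = np.clip(probs @ probs.T, eps, 1.0 - eps)

    # Target: 1 for pairs drawn from the same scribble, 0 otherwise.
    same = (labels[:, None] == labels[None, :]).astype(np.float64)

    # Binary cross-entropy, averaged over every unordered pair i < j.
    bce = -(same * np.log(sim) + (1.0 - same) * np.log(1.0 - sim))
    i, j = np.triu_indices(len(labels), k=1)
    return float(bce[i, j].mean())


# Four scribble pixels: two on instance 0, two on instance 1.
labels = np.array([0, 0, 1, 1])
consistent = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
inconsistent = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]])
print(similarity_constraint_loss(consistent, labels)
      < similarity_constraint_loss(inconsistent, labels))
```

Because a loss of this form only compares pairs of pixels, it never needs to know which slot corresponds to which annotated instance, and it can be computed from the sparse scribble pixels alone.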

Installation

See installation instructions.

Getting Started

See Preparing Datasets for Mask2Former and Cluster2Former.

See Getting Started with Mask2Former and Cluster2Former.

For more details, see Mask2Former.

Advanced Usage

See Advanced Usage of Mask2Former.

Model Zoo and Baselines

In addition to the Mask2Former Model Zoo, we provide a set of baseline results and trained models for download in the Mask2Former and Cluster2Former Model Zoo.

Citing Cluster2Former

If you use Cluster2Former in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.

@Article{s24030997,
AUTHOR = {Fóthi, Áron and Szlatincsán, Adrián and Somfai, Ellák},
TITLE = {Cluster2Former: Semisupervised Clustering Transformers for Video Instance Segmentation},
JOURNAL = {Sensors},
YEAR = {2024},
}

Acknowledgement

Code is based on Mask2Former (https://github.com/facebookresearch/Mask2Former).
