Stitched ViTs are Flexible Vision Backbones

This is the official PyTorch implementation for Stitched ViTs are Flexible Vision Backbones.

By Zizheng Pan, Jing Liu, Haoyu He, Jianfei Cai, and Bohan Zhuang.

We adapt the framework of stitchable neural networks (SN-Net) to downstream dense prediction tasks. Compared to SN-Netv1, the new framework consistently improves performance at low FLOPs while maintaining competitive performance at high FLOPs across different datasets, thus obtaining a better Pareto frontier (highlighted in lines).

📰 News

23/01/2024. Release code for depth estimation on NYUv2, see the subfolder depth_estimation.
18/01/2024. Hugging Face online demo for image classification is live! Check it out here.
18/01/2024. Release code for semantic segmentation on ADE20K and COCO-Stuff-10K, see the subfolder segmentation.
13/01/2024. Release code for ImageNet-1K classification 🔥. The classification code is an easy way to start understanding how SN-Netv2 works and how it differs from SN-Netv1.

💪 Getting Started

For image classification on ImageNet-1K, please refer to classification.

For semantic segmentation on ADE20K and COCO-Stuff-10K, please refer to segmentation.

For depth estimation on NYUv2, please refer to depth_estimation.

🪄 Gradio Demo for Segmentation

First, install gradio by

pip install gradio

Next, install the required packages at segmentation, then run the gradio demo by

cd segmentation
python demo/video_demo_gradio.py --config [path/to/config] --checkpoint [path/to/checkpoint]

✨ Results

Understand the figures:

Each point represents a stitch in SN-Net, which can be instantly selected at runtime without additional training cost.
SN-Netv2 can produce 10x more stitches than SN-Netv1. For easier comparison, we highlight the Pareto frontier of SN-Netv2.
The yellow star represents adopting an individual ViT as the backbone for downstream task adaptation.
All models are trained under the same training iterations/epochs.
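
To make the notes above concrete, here is a minimal sketch of how a Pareto frontier over stitches could be computed, and how a stitch could be picked for a given compute budget. It is not taken from this repository: the `Stitch` tuple, the function names `pareto_frontier` and `select_stitch`, and all numbers in `example` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# (name, GFLOPs, accuracy). Hypothetical layout, not the repository's data format.
Stitch = Tuple[str, float, float]


def pareto_frontier(stitches: List[Stitch]) -> List[Stitch]:
    """Keep the stitches that no cheaper-or-equal stitch matches or beats in accuracy."""
    # Sort by ascending FLOPs; on ties, put the more accurate stitch first.
    ordered = sorted(stitches, key=lambda s: (s[1], -s[2]))
    frontier: List[Stitch] = []
    best_acc = float("-inf")
    for stitch in ordered:
        # A stitch joins the frontier only if it improves on every cheaper one.
        if stitch[2] > best_acc:
            frontier.append(stitch)
            best_acc = stitch[2]
    return frontier


def select_stitch(stitches: List[Stitch], budget_gflops: float) -> Optional[Stitch]:
    """Return the most accurate stitch within the budget, or None if none fits."""
    feasible = [s for s in stitches if s[1] <= budget_gflops]
    return max(feasible, key=lambda s: s[2]) if feasible else None


# Made-up values for illustration only; these are not results from the paper.
example = [
    ("stitch_a", 5.0, 80.0),
    ("stitch_b", 10.0, 79.0),  # dominated: costs more than stitch_a, less accurate
    ("stitch_c", 20.0, 83.0),
    ("stitch_d", 60.0, 84.5),
]

frontier_names = [name for name, _, _ in pareto_frontier(example)]
chosen = select_stitch(example, budget_gflops=25.0)
```

Under these assumed values, `stitch_b` is left off the frontier because `stitch_a` is both cheaper and more accurate, and a 25 GFLOPs budget selects `stitch_c`. Because selection is only a lookup over already-trained stitches, it matches the point that a stitch can be chosen at runtime without further training.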

Image Classification on ImageNet-1K

Semantic Segmentation on ADE20K and COCO-Stuff-10K

ADE20K	COCO-Stuff-10K

Depth Estimation on NYUv2

Stitching DeiT3-S and DeiT3-L based on DPT.

Object Detection and Instance Segmentation on COCO-2017

Stitching DeiT3-S and DeiT3-L based on Mask R-CNN/ViTDet.

Training Efficiency Comparison

🚧 TODO List

✍ Citation

If you use SN-Netv2 in your research, please consider citing the following BibTeX entry and giving us a star 🌟.

@article{pan2023snnetv2,
  title={Stitched ViTs are Flexible Vision Backbones},
  author={Pan, Zizheng and Liu, Jing and He, Haoyu and Cai, Jianfei and Zhuang, Bohan},
  journal={arXiv},
  year={2023}
}

If you find the code useful, please also consider citing the following BibTeX entry:

@inproceedings{pan2023snnetv1,
  title     = {Stitchable Neural Networks},
  author    = {Pan, Zizheng and Cai, Jianfei and Zhuang, Bohan},
  booktitle = {CVPR},
  year      = {2023},
}

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.
