Holistic scene understanding includes semantic segmentation, surface normal estimation, object boundary detection, depth estimation, etc. The key aspect of this problem is to learn representation effectively, as each subtask builds upon not only correlated but also distinct attributes. Inspired by visual-prompt tuning, we propose a Task-Specific Prompts Transformer, dubbed TSP-Transformer, for holistic scene understanding. It features a vanilla transformer in the early stage and tasks-specific prompts transformer encoder in the lateral stage, where tasks-specific prompts are augmented. By doing so, the transformer layer learns the generic information from the shared parts and is endowed with task-specific capacity. First, the tasks-specific prompts serve as induced priors for each task effectively. Moreover, the task-specific prompts can be seen as switches to favor task-specific representation learning for different tasks. Extensive experiments on NYUD-v2 and PASCAL-Context show that our method achieves state-of-the-art performance, validating the effectiveness of our method for holistic scene understanding.
Tested with PyTorch 1.11 and CUDA 11.3:
git clone https://github.com/tb2-sy/TSP-Transformer.git
conda create -n tsp python=3.7
conda activate tsp
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install tqdm Pillow easydict pyyaml imageio scikit-image tensorboard
pip install opencv-python==4.5.4.60 setuptools==59.5.0
pip install timm==0.5.4 einops==0.4.1
We use the same dataset (PASCAL-Context and NYUD-v2) as InvPT. You can download the data from here. And then extract the datasets by:
tar xfvz NYUDv2.tar.gz
tar xfvz PASCALContext.tar.gz
You need to specify the dataset directory as db_root
variable in configs/mypath.py
.
Set the config files in ./conifigs
, with PASCAL-Context and NYUD-v2 dataset.
# Train the NYUD-v2 dataset
./run_nyud.sh
# Train the PASCAL-Context dataset
./run_pascal.sh
# Evaluation the NYUD-v2 dataset
./infer_nyud.sh
# Evaluation the PASCAL-Context dataset
./infer_pascal.sh
Please download the weights of our SOTA results.
Version | Dataset | Download | Segmentation | Human parsing | Saliency | Normals | Boundary |
---|---|---|---|---|---|---|---|
TSP-Transformer (our paper) | PASCAL-Context | google drive | 81.48 | 70.64 | 84.86 | 13.69 | 74.80 |
TaskPrompter (ICLR 2023) | PASCAL-Context | - | 80.89 | 68.89 | 84.83 | 13.72 | 73.50 |
InvPT (ECCV 2022) | PASCAL-Context | - | 79.03 | 67.61 | 84.81 | 14.15 | 73.00 |
Version | Dataset | Download | Segmentation | Depth | Normals | Boundary |
---|---|---|---|---|---|---|
TSP-Transformer (our paper) | NYUD-v2 | google drive | 55.39 | 0.4961 | 18.44 | 77.50 |
TaskPrompter (ICLR 2023) | NYUD-v2 | - | 55.30 | 0.5152 | 18.47 | 78.20 |
InvPT (ECCV 2022) | NYUD-v2 | - | 53.56 | 0.5183 | 19.04 | 78.10 |
- Upload paper and init project
- Training and Inference code
- Reproducible checkpoints
- Speed training and inference
- Reduce cuda memory usage
For any questions related to our paper and implementation, please email wangshuo2022@shanghaitech.edu.cn.
If you find our code or paper helps, please consider citing:
@article{wang2023tsp,
title={TSP-Transformer: Task-Specific Prompts Boosted Transformer for Holistic Scene Understanding},
author={Wang, Shuo and Li, Jing and Zhao, Zibo and Lian, Dongze and Huang, Binbin and Wang, Xiaomei and Li, Zhengxin and Gao, Shenghua},
journal={arXiv preprint arXiv:2311.03427},
year={2023}
}
The code is available under the MIT license and draws from InvPT, ATRC, and MIT-Net, which are also licensed under the MIT license.