This repository provides the codebase of SSP-SAM, a referring expression segmentation framework built on top of SAM with semantic-spatial prompts.
Current repo status:
- Training/testing/data processing scripts are available.
- Multiple dataset configs are provided under `configs/`.
- 17 Mar, 2026: Open-source codebase has been organized and released.
- 4 Dec, 2025: SSP-SAM paper accepted by IEEE TCSVT.
- Release final model checkpoints on Hugging Face
- Release processed training/evaluation metadata
- Release arXiv version
- Paper:
- SSP-SAM Hugging Face checkpoints/datasets: https://huggingface.co/wayneicloud/SSP-SAM
├── configs/ # training/evaluation configs
├── data_seg/ # data preprocessing scripts and generated anns/masks
├── datasets/ # dataloader and transforms
├── models/ # SSP_SAM model definitions
├── segment-anything/ # modified SAM dependency (editable install)
├── train.py # training entry
├── test.py # evaluation entry
├── submit_train.sh # train launcher (with examples)
└── submit_test.sh # test launcher (with examples)
Recommended: conda environment on macOS/Linux.
conda create -n ssp_sam python=3.10 -y
conda activate ssp_sam
pip install --upgrade pip
# 1) install PyTorch (CUDA example: cu121)
pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 torchaudio==2.1.0+cu121 --index-url https://download.pytorch.org/whl/cu121
# 2) install modified segment-anything first
cd segment-anything
pip install -e .
cd ..
# 3) install remaining dependencies
pip install -r requirements.txt

Note: the `segment-anything` code in this repository has been modified from the original SAM implementation. Please install the local `segment-anything` in editable mode (`pip install -e .`) as shown above.
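After installing, a quick sanity check can confirm the environment is wired up. This is a minimal sketch; it only reports whether each required package is discoverable, without importing anything heavy:

```python
# Environment sanity check: report whether each required package is
# discoverable on the current Python path (no heavy imports performed).
import importlib.util

packages = ["torch", "torchvision", "segment_anything"]
status = {name: importlib.util.find_spec(name) is not None for name in packages}
for name, found in status.items():
    print(f"{name}: {'OK' if found else 'MISSING'}")
```

All three entries should report `OK` after a successful install; `segment_anything: MISSING` usually means the editable install step was skipped.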
Please check:
- `data_seg/README.md`
- `data_seg/run.sh`
You have two options:
- Use our provided annotations and generate masks locally (recommended)
- Regenerate annotations/masks yourself (see the collapsible section below)
Generate Annotations/Masks by Yourself (click to expand)
References:
- `data_seg/README.md`
- `data_seg/run.sh`
- `legacy_data_prep_simrec.md` (legacy reference for raw data preparation and sources)
Required raw annotation folders/files for generation include (examples):
- `data_seg/refcoco/`
- `data_seg/refcoco+/`
- `data_seg/refcocog/`
- `data_seg/refclef/`

Each folder should contain raw files such as `instances.json` and `refs(...).p`.
Minimal expected layout (example):
data_seg/
├── refcoco/
│ ├── instances.json
│ ├── refs(unc).p
│ └── refs(google).p
├── refcoco+/
│ ├── instances.json
│ └── refs(unc).p
├── refcocog/
│ ├── instances.json
│ ├── refs(google).p
│ └── refs(umd).p
└── refclef/
├── instances.json
├── refs(unc).p
└── refs(berkeley).p
Example preprocessing command:
python ./data_seg/data_process.py \
--data_root ./data_seg \
--output_dir ./data_seg \
--dataset refcoco \
--split unc \
--generate_mask

Detailed dataset path/config settings are defined in the corresponding preprocessing scripts/config files in `data_seg/`. Please modify them according to your local environment before running.
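The single-dataset command above can be looped over all datasets. A dry-run sketch (the dataset/split pairs are examples taken from the layout above; each command is echoed for inspection, and removing the `echo` executes it):

```shell
# Dry run: print the preprocessing command for each dataset/split pair.
# Remove the "echo" inside the function to actually execute the commands.
preprocess() {
  echo python ./data_seg/data_process.py \
    --data_root ./data_seg --output_dir ./data_seg \
    --dataset "$1" --split "$2" --generate_mask
}

preprocess refcoco unc
preprocess refcoco+ unc
preprocess refcocog umd
preprocess refclef unc
```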
Also check dataset/image path settings in `datasets/dataset.py`.

Important: in `datasets/dataset.py`, class `VGDataset`, you should update the local paths for images/annotations/masks according to your machine.
Example local data organization:
your_project_root/
├── data/ # set --data_root to this folder
│ ├── coco/
│ │ └── train2014/ # COCO images (unc/unc+/gref/gref_umd/grefcoco)
│ ├── referit/
│ │ └── images/ # ReferIt images
│ ├── VG/ # Visual Genome images (merge pretrain path)
│ └── vg/ # Visual Genome images (phrase_cut path, if used)
└── data_seg/ # same level as data/
├── anns/
│ ├── refcoco.json
│ ├── refcoco+.json
│ ├── refcocog_umd.json
│ ├── refclef.json
│ └── grefcoco.json
└── masks/
├── refcoco/
├── refcoco+/
├── refcocog_umd/
├── refclef/
└── grefcoco/
For training/testing, use:
- `data_seg/anns/*.json` (provided)
- `data_seg/masks/*` (generated locally via `bash data_seg/run.sh`)
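Before launching training, it can help to verify that the expected files are in place. A minimal pre-flight sketch (the paths mirror the example layout above and are assumptions; adjust them to your machine):

```python
# Hypothetical pre-flight check: verify a few paths from the example layout
# exist before training. Adjust the entries to match your local setup.
from pathlib import Path

required = [
    Path("data/coco/train2014"),          # COCO images
    Path("data_seg/anns/refcoco.json"),   # provided annotations
    Path("data_seg/masks/refcoco"),       # masks generated via data_seg/run.sh
]
missing = [str(p) for p in required if not p.exists()]
print("missing:", ", ".join(missing) if missing else "none")
```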
For training/evaluation, you need the corresponding image files locally (COCO/Flickr/ReferIt/VG depending on dataset split and config).
Common sources:
- RefCOCO / RefCOCO+ / RefCOCOg / RefClef annotations: http://bvisionweb1.cs.unc.edu/licheng/referit/data/
- MS COCO 2014 images: https://cocodataset.org/
- Flickr30k images: http://shannon.cs.illinois.edu/DenotationGraph/
- ReferItGame images: due to original dataset restrictions, please download by yourself from the official/authorized source.
- Visual Genome images: https://visualgenome.org/
Default training launcher:

bash submit_train.sh

`submit_train.sh` already includes commented examples for multiple datasets, e.g. `refcoco`, `refcoco+`, `refcocog_umd`, `referit`, `grefcoco`.
You can also run directly:
torchrun --nproc_per_node=8 train.py \
--config configs/SSP_SAM_CLIP_B_FT_unc.py \
--clip_pretrained pretrained_checkpoints/CS/CS-ViT-B-16.pt

`train.py` supports two resume modes:
- `--resume <ckpt>`: continue interrupted training from the previous checkpoint.
- `--resume_from_pretrain <ckpt>`: load pretrained weights before fine-tuning/training.
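The two modes differ only in the flag passed to `train.py`. A dry-run sketch (checkpoint file names are placeholders; drop the `echo` inside the function to actually launch):

```shell
# Dry run: print the two launch variants (drop the "echo" to execute).
# The checkpoint paths below are placeholders.
launch() {
  echo torchrun --nproc_per_node=8 train.py \
    --config configs/SSP_SAM_CLIP_B_FT_unc.py "$@"
}

# Continue an interrupted run from its last saved checkpoint:
launch --resume outputs/your_save_folder/checkpoint_latest.pth
# Initialize from pretrained weights before fine-tuning:
launch --resume_from_pretrain pretrained_checkpoints/your_pretrained.pth
```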
Default testing launcher:
bash submit_test.sh

Example direct command:
torchrun --nproc_per_node=1 --master_port=29590 test.py \
--config configs/SSP_SAM_CLIP_L_FT_unc.py \
--test_split testB \
--clip_pretrained pretrained_checkpoints/CS/CS-ViT-L-14-336px.pt \
--checkpoint output/your_save_folder/checkpoint_best_miou.pth

Notes:
- COCO image path in visualization prioritizes `data/coco/train2014`.
- The current mask prediction/evaluation path uses a 512x512 mask space.
- Config files in `configs/` are set with `output_dir='outputs/your_save_folder'`, `batch_size=8`, `freeze_epochs=20`.
This repository benefits from ideas and/or codebases of the following projects:
- SimREC: https://github.com/luogen1996/SimREC
- gRefCOCO: https://github.com/henghuiding/gRefCOCO
- TransVG: https://github.com/djiajunustc/TransVG
- Segment Anything (SAM): https://github.com/facebookresearch/segment-anything
Thanks to the authors for their valuable open-source contributions.
If you find this repository useful, please cite our SSP-SAM paper.
@article{ssp_sam_tcsvt,
title={SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation},
author={Tang, Wei and Liu, Xuejing and Sun, Yanpeng and Li, Zechao},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
year={2025}
}