This repository is the official implementation of "TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias".
[07/02/2024] Our TTD has been accepted to ECCV 2024. 🔥🔥🔥
[04/02/2024] Released initial commits.
Please cite our paper if the code is helpful to your research.
@inproceedings{jo2024ttd,
title={TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias},
author={Sanghyun Jo and Soohyun Ryu and Sungyub Kim and Eunho Yang and Kyungsu Kim},
booktitle={European Conference on Computer Vision (ECCV)},
year={2024}
}
We identify a critical bias in contemporary CLIP-based models, which we denote as "single-tag bias". This bias manifests as a disproportionate focus on a single tag (word) while neglecting other pertinent tags, stemming from CLIP text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to imbalanced tag relevancy and thus uneven alignment among the multiple tags present in the text. To tackle this challenge, we introduce a novel two-step fine-tuning approach. First, our method leverages the similarity between tags and their nearest pixels for scoring, enabling the extraction of image-relevant tags from the text. Second, we present a self-distillation strategy aimed at aligning the combined masks from the extracted tags with the text-derived mask. This approach mitigates single-tag bias, thereby significantly improving CLIP's image-text alignment without requiring additional data or supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources.
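For intuition, the minimal sketch below (not the official code) illustrates the two steps with hypothetical tensors: `pixel_emb` (per-pixel image embeddings), `tag_emb` (per-tag text embeddings), and `text_emb` (caption embedding) from a CLIP-like encoder. The actual scoring, mask construction, and loss in the paper may differ.

```python
import torch
import torch.nn.functional as F

def tag_scores(pixel_emb, tag_emb):
    # Step 1: score each tag by its similarity to its nearest (most similar) pixel,
    # so tags grounded somewhere in the image receive high scores.
    sim = F.normalize(tag_emb, dim=-1) @ F.normalize(pixel_emb, dim=-1).T  # [T, H*W]
    return sim.max(dim=-1).values                                          # [T]

def self_distillation_loss(pixel_emb, tag_emb, text_emb, threshold=0.5):
    # Step 2: combine the masks of the selected (image-relevant) tags and use the
    # result as a teacher for the mask derived from the full text.
    sim = F.normalize(tag_emb, dim=-1) @ F.normalize(pixel_emb, dim=-1).T  # [T, H*W]
    keep = tag_scores(pixel_emb, tag_emb) >= threshold                     # assumes >= 1 tag survives
    tag_mask = sim[keep].max(dim=0).values                                 # combined tag mask [H*W]
    text_mask = F.normalize(text_emb, dim=-1) @ F.normalize(pixel_emb, dim=-1).T  # [H*W]
    return F.mse_loss(text_mask, tag_mask.detach())

# toy shapes: 14x14 pixel embeddings, 5 tags, embedding dim 512
loss = self_distillation_loss(torch.randn(196, 512), torch.randn(5, 512), torch.randn(512))
```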
Setting up this project involves installing dependencies and preparing datasets. The code is tested on Ubuntu 20.04 with NVIDIA GPUs and CUDA installed.
To install all dependencies, please run the following:
pip install -U "ray[default]"
pip install git+https://github.com/lucasb-eyer/pydensecrf.git
python3 -m pip install -r requirements.txt
Alternatively, reproduce our results using Docker:
docker build -t ttd_pytorch:v1.13.1 .
docker run --gpus all -it --rm \
--shm-size 32G --volume="$(pwd):$(pwd)" --workdir="$(pwd)" \
ttd_pytorch:v1.13.1
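After either route, an optional quick check that PyTorch can see the GPU; this assumes PyTorch is provided by `requirements.txt` or the Docker image.

```python
# Optional sanity check that PyTorch and the GPU are visible from the environment.
import torch

print("PyTorch:", torch.__version__)          # the Docker image targets 1.13.1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```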
Create directories for the GCC3M and GCC12M datasets following the directory structure below.
We provide GCC3M for evaluation with our annotations [Google Drive].
../                                  # parent directory
├── ./                               # current (project) directory
│   ├── core/                        # (dir.) implementation of our TTD
│   ├── tools/                       # (dir.) helper functions
│   ├── README.md                    # instruction for a reproduction
│   └── ... some python files ...
│
├── cc3m_for_evaluation/             # GCC3M for evaluation (our benchmark, CC3M-TagMask)
│   ├── image/
│   ├── json_refined/                # ground-truth tags for text inputs
│   └── mask_using_HQ-SAM/           # ground-truth masks for text inputs
│
├── cc3m/                            # GCC3M for training
│   ├── train/
│   │   ├── image/
│   │   └── json/
│   └── validation/
│       ├── image/
│       └── json/
│
└── cc12m/                           # GCC12M for training
    └── train/
        ├── image/
        └── json/
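After downloading, a short (unofficial) check that the evaluation folders are in place; the folder names are taken from the tree above.

```python
# A small helper (not part of the repository) to sanity-check the CC3M-TagMask layout.
from pathlib import Path

root = Path("../cc3m_for_evaluation")
for sub in ("image", "json_refined", "mask_using_HQ-SAM"):
    folder = root / sub
    status = "ok" if folder.is_dir() else "missing"
    count = len(list(folder.iterdir())) if folder.is_dir() else 0
    print(f"{sub:>20}: {status} ({count} files)")
```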
Please download the following VOC, COCO, Context, ADE, COCO-Stuff, and Cityscapes datasets. Each dataset originally has a different directory structure, so we unify the directory structures of all datasets for convenience.
Download PASCAL VOC 2012 dataset from our [Google Drive].
Download MS COCO 2014 dataset from our [Google Drive].
Download Pascal Context dataset from our [Google Drive].
Download ADE 2016 dataset from our [Google Drive].
Download COCO-Stuff dataset from our [Google Drive].
Download Cityscapes dataset from our [Google Drive].
Create a directory for each dataset (e.g., "../VOC2012/") and place the datasets as shown below.
../                     # parent directory
├── ./                  # current (project) directory
│   ├── core/           # (dir.) implementation of our TTD
│   ├── tools/          # (dir.) helper functions
│   ├── README.md       # instruction for a reproduction
│   └── ... some python files ...
│
├── VOC2012/            # PASCAL VOC 2012
│   └── validation/
│       ├── image/
│       ├── mask/
│       └── xml/
│
├── COCO2014/           # MS COCO 2014
│   └── validation/
│       ├── image/
│       ├── mask/
│       └── xml/
│
├── PascalContext/      # Pascal Context
│   └── validation/
│       ├── image/
│       ├── mask/
│       └── xml/
│
├── ADE2016/            # ADE2016
│   └── validation/
│       ├── image/
│       ├── mask/
│       └── xml/
│
├── Cityscapes/         # Cityscapes
│   └── validation/
│       ├── image/
│       ├── mask/
│       └── xml/
│
└── COCO-Stuff/         # COCO-Stuff
    └── validation/
        ├── image/
        ├── mask/
        └── xml/
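Once the datasets are in place, a brief (unofficial) sketch to confirm that validation images and masks pair up; it assumes masks reuse the image file stem, which may not hold for every dataset.

```python
# Unofficial sketch: pair validation images with masks for one of the unified datasets.
from pathlib import Path

root = Path("../VOC2012/validation")
images = sorted((root / "image").iterdir())
print(f"{len(images)} validation images")
for img in images[:3]:
    matches = list((root / "mask").glob(img.stem + ".*"))
    print(img.name, "->", matches[0].name if matches else "no mask found")
```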
Our code is coming soon.
Visualize heatmaps for an in-the-wild image.
python demo.py --arch TCL --tags flame smoke --scales 1.0 0.5 1.5 2.0 --image 448 --pamr --lora "./weights/TCL+TTD.pt"
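The `--scales` and `--hflip` flags suggest test-time augmentation. The sketch below shows one common way such aggregation is implemented (averaging resized multi-scale and flipped heatmaps) with a hypothetical `model` callable; it is not necessarily how `demo.py` does it.

```python
import torch
import torch.nn.functional as F

def aggregate_heatmaps(model, image, scales=(1.0, 0.5, 1.5, 2.0), hflip=True):
    # image: [1, 3, H, W]; model(x) is assumed to return a [1, C, h, w] heatmap.
    H, W = image.shape[-2:]
    out, n = 0.0, 0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        views = [x, torch.flip(x, dims=[-1])] if hflip else [x]
        for i, v in enumerate(views):
            y = model(v)
            if i == 1:                      # un-flip the flipped prediction
                y = torch.flip(y, dims=[-1])
            out = out + F.interpolate(y, size=(H, W), mode="bilinear", align_corners=False)
            n += 1
    return out / n                          # averaged heatmap at the original resolution
```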
We release our checkpoint.

| Method  | Checkpoint   |
| ------- | ------------ |
| TCL+TTD | Google Drive |
Evaluate performance for multi-tag selection (input: texts)
CUDA_VISIBLE_DEVICES=3 python3 produce_tags_from_text.py --root ../cc3m_for_evaluation/ --arch TCL --lora ./weights/TCL+TTD.pt --scales 1.0 --hflip --fixed --stopwords
# [ NLTK] Precision: 59.8, Recall: 83.7, F1: 69.8, Acc: 79.6
# [ Vicuna-7B] Precision: 44.1, Recall: 71.0, F1: 54.4, Acc: 70.9
# [ Vicuna-33B] Precision: 52.7, Recall: 70.7, F1: 60.4, Acc: 75.9
# [ Qwen-72B] Precision: 69.3, Recall: 56.2, F1: 62.1, Acc: 80.9
# ----------------------------------------------------------------------------
# [ TCL_224_[1.0]@image] Precision: 92.5, Recall: 28.6, F1: 43.7, Acc: 79.5
# [ TCL_224_[1.0]@text] Precision: 85.6, Recall: 29.7, F1: 44.1, Acc: 79.0
# [ TCL_224_[1.0]@pixel] Precision: 82.9, Recall: 74.5, F1: 78.5, Acc: 88.6
# [TCL_224_[1.0]@pixel+TTD] Precision: 88.3, Recall: 78.0, F1: 82.8, Acc: 91.0
python3 evaluate_classification.py --root ../cc3m_for_evaluation/ --arch "TCL_224_[1.0]@pixel+TTD"
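For reference, one plausible way to compute the reported precision / recall / F1 / accuracy for tag selection, treating each caption's candidate tags as the evaluation universe; `evaluate_classification.py` may define accuracy differently.

```python
# Unofficial sketch: per-caption tag-selection metrics over a candidate tag set.
def tag_metrics(pred_tags, gt_tags, candidates):
    tp = len(pred_tags & gt_tags)                       # correctly selected tags
    precision = tp / max(len(pred_tags), 1)
    recall = tp / max(len(gt_tags), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    tn = len(candidates - pred_tags - gt_tags)          # correctly rejected tags
    acc = (tp + tn) / max(len(candidates), 1)
    return precision, recall, f1, acc

# toy example with hypothetical tags
print(tag_metrics({"dog", "grass"}, {"dog", "ball"}, {"dog", "grass", "ball", "sky"}))
```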
Evaluate performance for text-level semantic segmentation (input: texts)
python3 produce_masks_from_text.py --gpus 0 --root ../ --data CC3M --arch TCL --bg 0.40 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
# TCL | Caption IoU: 60.4%, mFPR: 0.199, mFNR: 0.198
# TCL+TTD | Caption IoU: 65.5%, mFPR: 0.163, mFNR: 0.182
python3 evaluate_segmentation_for_text.py --pred "./results_caption/TCL@CC3M@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.40/" --tag TCL+TTD
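The caption-level numbers above are IoU plus false-positive and false-negative rates; below is a hedged sketch of the standard definitions on binary masks (the repo's `evaluate_segmentation_for_text.py` may differ in details).

```python
# Unofficial sketch: IoU, FPR, and FNR for a single binary mask pair.
import numpy as np

def mask_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / max(union, 1)
    fpr = np.logical_and(pred, ~gt).sum() / max((~gt).sum(), 1)  # false positives / negatives
    fnr = np.logical_and(~pred, gt).sum() / max(gt.sum(), 1)     # missed pixels / positives
    return iou, fpr, fnr
```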
Evaluate performance for open-vocabulary semantic segmentation (input: tags)
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data VOC2012 --arch TCL --bg 0.40 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data COCO2014 --arch TCL --bg 0.45 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data ADE2016 --arch TCL --bg 0.45 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data Cityscapes --arch TCL --bg 0.40 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data COCO-Stuff --arch TCL --bg 0.40 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data PascalContext --arch TCL --bg 0.45 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
# TCL@VOC2012@Ours | mIoU: 61.1%, mFPR: 0.185, mFNR: 0.204
# TCL@COCO2014@Ours | mIoU: 37.4%, mFPR: 0.350, mFNR: 0.276
# TCL@ADE2016@Ours | mIoU: 17.0%, mFPR: 0.361, mFNR: 0.468
# TCL@Cityscapes@Ours | mIoU: 27.0%, mFPR: 0.248, mFNR: 0.483
# TCL@COCO-Stuff@Ours | mIoU: 23.7%, mFPR: 0.430, mFNR: 0.333
# TCL@PascalContext@Ours | mIoU: 37.4%, mFPR: 0.389, mFNR: 0.237
python3 evaluate_segmentation.py --data VOC2012 --pred "./results/TCL@VOC2012@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.40/" --tag "TCL+TTD@VOC2012"
python3 evaluate_segmentation.py --data COCO2014 --pred "./results/TCL@COCO2014@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.45/" --tag "TCL+TTD@COCO2014"
python3 evaluate_segmentation.py --data ADE2016 --pred "./results/TCL@ADE2016@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.45/" --tag "TCL+TTD@ADE2016"
python3 evaluate_segmentation.py --data Cityscapes --pred "./results/TCL@Cityscapes@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.40/" --tag "TCL+TTD@Cityscapes"
python3 evaluate_segmentation.py --data COCO-Stuff --pred "./results/TCL@COCO-Stuff@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.40/" --tag "TCL+TTD@COCO-Stuff"
python3 evaluate_segmentation.py --data PascalContext --pred "./results/TCL@PascalContext@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.45/" --tag "TCL+TTD@PascalContext"
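For completeness, a compact sketch of class-wise mIoU via a confusion matrix over integer label maps; this mirrors the standard definition and is not guaranteed to match `evaluate_segmentation.py` exactly (e.g., in how ignored labels and absent classes are handled).

```python
# Unofficial sketch: mean IoU from integer label maps with values in [0, num_classes).
import numpy as np

def mean_iou(pred, gt, num_classes):
    hist = np.bincount(num_classes * gt.reshape(-1) + pred.reshape(-1),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - inter
    valid = union > 0                                   # skip classes absent from both maps
    return (inter[valid] / union[valid]).mean()
```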