Query on Inference Setting #8
Hi, thanks for your interest in our work. In this work, we study how to apply large-scale pre-trained models to various dense prediction tasks. Our method has many applications, such as replacing conventional ImageNet pre-trained or unsupervised pre-trained backbones with CLIP models. Due to the large gap between instance-level image-text pre-training and dense prediction tasks, we found that CLIP models obtain relatively low performance on zero-shot segmentation or detection tasks (e.g., 15.3 mIoU on ADE20K according to [2]), which is not strong enough for many application scenarios. Therefore, we focus on the fully supervised dense prediction setting, where we can use more supervision to fully exploit the power of CLIP pre-training. We also show that our method can be applied to any visual backbone (see Section 4.3).

I think both the zero-shot transfer ability and the rich knowledge learned from large-scale image-text pre-training are key advantages of CLIP. Some recent papers like [1] and [2] explore the former, while we study the latter.

[1] DenseCLIP: Extract Free Dense Labels from CLIP, https://arxiv.org/abs/2112.01071
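To make the pixel-text matching idea concrete, here is a minimal pure-Python sketch (not the authors' implementation; the function names, shapes, and toy vectors are illustrative): each spatial feature in the visual feature map is scored against every class text embedding by cosine similarity, producing a per-pixel score map. In the fully supervised setting, such score maps would be trained against pixel-wise labels rather than used zero-shot.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pixel_text_score_map(feat_map, text_embs):
    """For each spatial position (H x W grid of feature vectors),
    compute the cosine similarity to each class text embedding.
    Returns an H x W x num_classes score map."""
    return [[[cosine(px, t) for t in text_embs] for px in row]
            for row in feat_map]

# Toy 1x2 feature map and two class text embeddings (2-d for clarity).
feat = [[[1.0, 0.0], [0.0, 1.0]]]
texts = [[1.0, 0.0], [0.0, 1.0]]
scores = pixel_text_score_map(feat, texts)
```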
Thanks for your clarification! But I am still curious: the paper [2] you cited, and some follow-ups after it, argue that a 2-stage approach (first stage proposes/generates masks, second stage aligns pixels to the corresponding masks for the zero-shot semantic setup) is the way forward for zero-shot transfer on dense tasks. In your implementation, it looks like you also use a 2-stage approach in the decoding part (the text-pixel embedding is fused with the visual embedding and passed into the decoder). Then why do you think the performance in your case still drops in the zero-shot scenario?
I think one key reason is that the feature maps extracted using the pre-trained CLIP visual encoder lack locality (i.e., the feature may not precisely represent the semantic information of the corresponding patch/region). Therefore, we need to fine-tune the encoder with pixel-wise supervision to recover the locality. |
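A toy numeric illustration of the locality point (the vectors below are made up, not real CLIP outputs): if the encoder mixes global image context into every patch token, a patch feature is only weakly tilted toward its own class direction, so per-pixel cosine scores barely separate the classes; pixel-wise fine-tuning can sharpen the feature back toward its class.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical class text embeddings (illustrative only).
cat_text = [1.0, 0.0]
dog_text = [0.0, 1.0]

# A patch that actually shows a cat, but whose feature is dominated by
# shared global context: it is only weakly tilted toward "cat".
cat_patch_pretrained = [0.75, 0.66]

# The same patch after (hypothetical) fine-tuning with pixel-wise
# supervision: the feature now points clearly at its own class.
cat_patch_finetuned = [0.95, 0.10]

margin_before = (cosine(cat_patch_pretrained, cat_text)
                 - cosine(cat_patch_pretrained, dog_text))
margin_after = (cosine(cat_patch_finetuned, cat_text)
                - cosine(cat_patch_finetuned, dog_text))
```

With these toy numbers the pre-trained patch gives an almost negligible class margin, while the fine-tuned one separates the classes by a wide margin.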
Hi,
Thanks for making the code public !
I had a general query about the inference setting chosen for this paper: why does this paper not target the zero-shot setting and instead focus on the fully supervised setting? Is there a reason? Since the power of CLIP lies in zero-shot task transfer, I was wondering why no experiments were done for this, and why the problem was instead posed as a multi-modal fully supervised dense prediction task.
Thanks in advance