
Query on Inference Setting #8

Closed
sauradip opened this issue Jan 2, 2022 · 3 comments
Comments

sauradip commented Jan 2, 2022

Hi,

Thanks for making the code public!

I have a general query about the inference setting chosen for this paper: why does it not target the zero-shot setting and instead focus on the fully supervised setting? Is there a particular reason? Since the power of CLIP lies in zero-shot task transfer, I was wondering why no experiments were done in that setting, and why the problem is instead posed as a multi-modal, fully supervised dense detection task.

Thanks in advance

raoyongming (Owner) commented Jan 2, 2022

Hi,

Thanks for your interest in our work. In this work, we want to study how to apply large-scale pre-trained models to various dense prediction tasks. Our method has many applications, such as replacing conventional ImageNet pre-trained or unsupervised pre-trained backbones with CLIP models. Due to the large gap between instance-level image-text pre-training and dense prediction tasks, we found that CLIP models obtain relatively low performance on zero-shot segmentation or detection tasks (e.g., 15.3 mIoU on ADE20K according to [2]), which is not strong enough for many application scenarios. Therefore, we focus on the fully supervised dense prediction setting, where we can use stronger supervision to fully exploit the power of CLIP pre-training. We also show that our method can be applied to any visual backbone (see Section 4.3).
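
To make the supervised setting concrete, here is a minimal sketch of what using a CLIP-style image encoder as the backbone of a fully supervised segmenter, together with a pixel-text score map, could look like. `DummyCLIPImageEncoder`, `PixelTextSegmenter`, and the random `text_embeddings` are hypothetical placeholders, not the repository's actual code.

```python
# Minimal sketch (hypothetical names): a CLIP-style visual backbone used for
# fully supervised segmentation, with a pixel-text score map fused into the decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyCLIPImageEncoder(nn.Module):
    """Stand-in for a pre-trained CLIP visual backbone returning a dense feature map."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # patchify

    def forward(self, images):                        # (B, 3, H, W)
        return self.conv(images)                      # (B, C, H/16, W/16)

class PixelTextSegmenter(nn.Module):
    def __init__(self, num_classes, embed_dim=512):
        super().__init__()
        self.backbone = DummyCLIPImageEncoder(embed_dim)
        # Text embeddings of the class prompts (random here as a placeholder for
        # real CLIP text features of prompts like "a photo of a {class}").
        self.register_buffer("text_embeddings", torch.randn(num_classes, embed_dim))
        self.decoder = nn.Conv2d(embed_dim + num_classes, num_classes, kernel_size=1)

    def forward(self, images):
        feats = self.backbone(images)                                   # (B, C, h, w)
        feats_n = F.normalize(feats, dim=1)
        text_n = F.normalize(self.text_embeddings, dim=1)
        # Pixel-text score map: cosine similarity between every pixel and every class.
        score_map = torch.einsum("bchw,kc->bkhw", feats_n, text_n)      # (B, K, h, w)
        fused = torch.cat([feats, score_map], dim=1)
        logits = self.decoder(fused)                                    # (B, K, h, w)
        return F.interpolate(logits, scale_factor=16, mode="bilinear",
                             align_corners=False)                      # (B, K, H, W)

model = PixelTextSegmenter(num_classes=150)           # e.g. ADE20K has 150 classes
out = model(torch.randn(2, 3, 224, 224))
print(out.shape)                                      # torch.Size([2, 150, 224, 224])
```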

I think both the zero-shot transfer ability and the rich knowledge learned from large-scale text-image pre-training are key advantages of CLIP. Recent papers such as [1] and [2] explore the former, while we study the latter.

[1] DenseCLIP: Extract Free Dense Labels from CLIP, https://arxiv.org/abs/2112.01071
[2] A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, https://arxiv.org/abs/2112.14757

sauradip (Author) commented Jan 8, 2022

Thanks for the clarification! But I am still curious: the paper you cited as [2], along with some follow-ups, argues that a two-stage approach (a first stage that proposes/generates masks, and a second stage that aligns pixels to those masks for the zero-shot semantic setup) is the way forward for zero-shot transfer on dense tasks. In your implementation, it looks like you also use a two-stage design for the decoding part (text-pixel embeddings fused with the visual embeddings and passed into the decoder). Why, then, do you think the performance still drops in the zero-shot scenario in your case?

raoyongming (Owner) commented:

I think one key reason is that the feature maps extracted by the pre-trained CLIP visual encoder lack locality (i.e., a feature may not precisely represent the semantic information of its corresponding patch/region). Therefore, we need to fine-tune the encoder with pixel-wise supervision to recover the locality.
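
To illustrate the point about recovering locality, below is a minimal sketch of a pixel-wise fine-tuning step, assuming a dense predictor built on a CLIP-style encoder (e.g., the hypothetical `PixelTextSegmenter` sketched earlier in this thread). `pixelwise_finetune_step` is an invented helper name, not the repository's training code.

```python
import torch
import torch.nn.functional as F

# Hypothetical helper: one fine-tuning step with per-pixel supervision.
# `model` maps images (B, 3, H, W) to class logits (B, K, H, W); `labels` holds
# per-pixel class indices (B, H, W). The encoder is NOT frozen, so the per-pixel
# cross-entropy ties each spatial feature to the semantics of its own location,
# i.e., the "locality" that the zero-shot CLIP feature map lacks.
def pixelwise_finetune_step(model, optimizer, images, labels, ignore_index=255):
    logits = model(images)                                  # (B, K, H, W)
    loss = F.cross_entropy(logits, labels, ignore_index=ignore_index)
    optimizer.zero_grad()
    loss.backward()                                         # gradients also reach the encoder
    optimizer.step()
    return loss.item()

# Example usage with the earlier sketch (random tensors, shapes only):
# model = PixelTextSegmenter(num_classes=150)
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# images = torch.randn(2, 3, 224, 224)
# labels = torch.randint(0, 150, (2, 224, 224))
# print(pixelwise_finetune_step(model, optimizer, images, labels))
```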
