Query on Inference Setting #8
Hi, thanks for your interest in our work. In this work, we study how to apply large-scale pre-trained models to various dense prediction tasks. Our method has many applications, such as replacing conventional ImageNet pre-trained or unsupervised pre-trained backbones with CLIP models. Due to the large gap between instance-level image-text pre-training and dense prediction tasks, we found that CLIP models obtain relatively low performance on zero-shot segmentation or detection tasks (e.g., 15.3 mIoU on ADE20K according to [2]), which is not strong enough for many application scenarios. Therefore, we focus on the fully supervised dense prediction setting, where we can use more supervision to fully exploit the power of CLIP pre-training. We also show that our method can be applied to any visual backbone (see Section 4.3).

I think both the zero-shot transfer ability and the rich knowledge learned from large-scale image-text pre-training are key advantages of CLIP. Some recent papers like [1] and [2] explore the former, while we study the latter.

[1] DenseCLIP: Extract Free Dense Labels from CLIP, https://arxiv.org/abs/2112.01071
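To make the pixel-text matching idea concrete, here is a minimal pure-Python sketch (not the authors' implementation; the function names, shapes, and toy vectors are illustrative): each spatial feature in the visual feature map is scored against every class text embedding by cosine similarity, producing a per-pixel score map. In the fully supervised setting, such score maps would be trained against pixel-wise labels rather than used zero-shot.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pixel_text_score_map(feat_map, text_embs):
    """For each spatial position (H x W grid of feature vectors),
    compute the cosine similarity to each class text embedding.
    Returns an H x W x num_classes score map."""
    return [[[cosine(px, t) for t in text_embs] for px in row]
            for row in feat_map]

# Toy 1x2 feature map and two class text embeddings (2-d for clarity).
feat = [[[1.0, 0.0], [0.0, 1.0]]]
texts = [[1.0, 0.0], [0.0, 1.0]]
scores = pixel_text_score_map(feat, texts)
```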
Thanks for your clarification! But I am still curious: the paper [2] you cited, and some follow-ups after it, argue that a 2-stage approach (first stage proposes/generates masks, second stage aligns pixels to the corresponding masks for the zero-shot semantic setup) is the way forward for zero-shot transfer on dense tasks. In your implementation, it looks like you also use a 2-stage approach in the decoding part (the text-pixel embedding is fused with the visual embedding and passed into the decoder). Then why do you think the performance in your case still drops in the zero-shot scenario?
I think one key reason is that the feature maps extracted using the pre-trained CLIP visual encoder lack locality (i.e., the feature may not precisely represent the semantic information of the corresponding patch/region). Therefore, we need to fine-tune the encoder with pixel-wise supervision to recover the locality. |
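A toy numeric illustration of the locality point (the vectors below are made up, not real CLIP outputs): if the encoder mixes global image context into every patch token, a patch feature is only weakly tilted toward its own class direction, so per-pixel cosine scores barely separate the classes; pixel-wise fine-tuning can sharpen the feature back toward its class.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical class text embeddings (illustrative only).
cat_text = [1.0, 0.0]
dog_text = [0.0, 1.0]

# A patch that actually shows a cat, but whose feature is dominated by
# shared global context: it is only weakly tilted toward "cat".
cat_patch_pretrained = [0.75, 0.66]

# The same patch after (hypothetical) fine-tuning with pixel-wise
# supervision: the feature now points clearly at its own class.
cat_patch_finetuned = [0.95, 0.10]

margin_before = (cosine(cat_patch_pretrained, cat_text)
                 - cosine(cat_patch_pretrained, dog_text))
margin_after = (cosine(cat_patch_finetuned, cat_text)
                - cosine(cat_patch_finetuned, dog_text))
```

With these toy numbers the pre-trained patch gives an almost negligible class margin, while the fine-tuned one separates the classes by a wide margin.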
Hi,
Thanks for making the code public !
I had a general query about the inference setting chosen for this paper: why does this paper not target the zero-shot setting and instead focus on the fully supervised setting? Is there a reason? Since the power of CLIP lies in zero-shot task transfer, I was wondering why no experiments were done for this, and why the problem was instead posed as a multi-modal fully supervised dense prediction task.
Thanks in advance