Very nice work! One question: according to the paper, ClipCap is pre-trained on COCO captions using a frozen RegionCLIP. However, there may be a domain gap between COCO images and the datasets used in the paper, especially the stylized images. Does this gap affect the pre-trained ClipCap? Besides, it would be best to provide the complete configuration and the related code for training ClipCap with RegionCLIP.
> There may be a domain gap between COCO images and the datasets used in the paper, especially the stylized images.
This is true. There is a domain gap between COCO and the stylized images, so, as shown in the paper, the model initially does not produce meaningful captions on the stylized images. Our goal is to resolve this by making the vision encoder robust: we further train RegionCLIP with the proposed approach so that it produces robust embeddings for an image and its stylized version, such that an arbitrary image-captioning model (in our case ClipCap) can produce meaningful captions.
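To make the intuition concrete, here is a minimal sketch of the robustness idea, not the exact loss or pipeline from the paper. It assumes a `vision_encoder` callable (e.g. the RegionCLIP image encoder) that returns one pooled embedding per image and batches of paired original/stylized images; the actual training objective in the repo may differ.

```python
import torch
import torch.nn.functional as F

def embedding_consistency_loss(vision_encoder, images, stylized_images):
    """Pull the embedding of an image and of its stylized version together,
    so a frozen downstream captioner sees a domain-robust representation."""
    emb_orig = F.normalize(vision_encoder(images), dim=-1)             # (B, D)
    emb_style = F.normalize(vision_encoder(stylized_images), dim=-1)   # (B, D)
    # Maximize cosine similarity between the two views of the same image.
    return (1.0 - (emb_orig * emb_style).sum(dim=-1)).mean()
```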
> It would be best to provide the complete configuration and the related code for training ClipCap with RegionCLIP.
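As a rough illustration of what such a setup could look like (this is not the configuration or code used in the paper), here is a ClipCap-style pre-training sketch with PyTorch and HuggingFace Transformers: a small mapping network turns a frozen, precomputed RegionCLIP image embedding into a GPT-2 prefix, and a captioning loss is applied over the caption tokens. The two-layer MLP mapper, `embed_dim=512`, `prefix_len=10`, and the learning rate are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PrefixMapper(nn.Module):
    """ClipCap-style MLP mapping one image embedding to a GPT-2 prefix."""
    def __init__(self, embed_dim, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, gpt_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len, gpt_dim * prefix_len),
        )

    def forward(self, image_embedding):               # (B, embed_dim)
        return self.mlp(image_embedding).view(-1, self.prefix_len, self.gpt_dim)

def caption_loss(mapper, gpt2, tokenizer, image_embeddings, captions):
    """Cross-entropy over caption tokens, conditioned on the mapped prefix.
    `image_embeddings` come from the frozen RegionCLIP encoder (precomputed)."""
    tokens = tokenizer(captions, return_tensors="pt", padding=True).input_ids
    prefix = mapper(image_embeddings)                 # (B, P, gpt_dim)
    token_emb = gpt2.transformer.wte(tokens)          # (B, T, gpt_dim)
    out = gpt2(inputs_embeds=torch.cat([prefix, token_emb], dim=1))
    # Positions P-1 .. P+T-2 predict caption tokens 0 .. T-1.
    logits = out.logits[:, prefix.shape[1] - 1 : -1]
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        tokens.reshape(-1),
        ignore_index=tokenizer.pad_token_id,
    )

# Usage sketch: only the mapper is optimized; GPT-2 and RegionCLIP stay frozen
# (ClipCap also has a variant that fine-tunes GPT-2).
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token             # GPT-2 has no pad token by default
mapper = PrefixMapper(embed_dim=512)                  # placeholder; use RegionCLIP's embedding dim
optimizer = torch.optim.AdamW(mapper.parameters(), lr=2e-5)
```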