
Conceptual Captions Training #23

Open · goel-shashank opened this issue Feb 1, 2022 · 9 comments
@goel-shashank commented Feb 1, 2022

I have trained the model (both the MLP and GPT-2 variants) on the CC3M dataset, but the loss doesn't seem to decrease much (it stays around 3.0). What loss can I expect from a good model? How many epochs should I run it for? Also, is any specific hyperparameter tuning required for Conceptual Captions? I have a model trained for 5 epochs, but it generates a similar caption for every image. I tried fitting a batch of 512 image-caption pairs and everything works out, so I don't think there is a logical issue in the pipeline. Please let me know.
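
For readers hitting the same plateau, the single-batch check described above looks roughly like this in PyTorch (a minimal sketch; the `model(tokens, prefix, mask)` call and the logit slicing only loosely mirror this repo's train.py, so treat the exact names as assumptions):

```python
import torch
import torch.nn.functional as F

def overfit_one_batch(model, tokens, mask, prefix, prefix_length,
                      steps=300, lr=2e-5, device="cuda"):
    # Sanity check: repeatedly fit one fixed batch. If the pipeline
    # (data, model, loss) is wired correctly, the loss should fall
    # well below the ~3.0 plateau, ideally close to zero.
    model.train().to(device)
    tokens, mask, prefix = tokens.to(device), mask.to(device), prefix.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        optimizer.zero_grad()
        outputs = model(tokens, prefix, mask)
        # Keep only caption positions: drop the prefix logits and shift
        # by one so the logit at position i predicts caption token i.
        logits = outputs.logits[:, prefix_length - 1:-1]
        loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                               tokens.flatten(), ignore_index=0)
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step}: loss = {loss.item():.3f}")
```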

@rmokady (Owner) commented Feb 1, 2022

Hi @goel-shashank,
Are you using our default parameters?

Did you try both the fine-tuned GPT-2 and the frozen GPT-2 variants?

@goel-shashank (Author) commented

Hi @rmokady,
I tried the default parameters. Do you have the training logs from your run? One thing I'm certainly doing differently: the prefixes are generated from a separate CLIP model (an RN50 with 20% ImageNet zero-shot accuracy) that I trained on CC3M myself, rather than OpenAI's pretrained weights. I don't think this should be causing these issues, though.
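
For context, swapping in a custom CLIP checkpoint for prefix extraction would look roughly like this (a sketch assuming an OpenAI-CLIP-style `encode_image` interface; the L2 normalization is a common choice, not something the repo prescribes):

```python
import torch

@torch.no_grad()
def extract_prefixes(clip_model, preprocess, images, device="cuda"):
    # Encode a list of PIL images into CLIP embeddings that serve as
    # the prefixes for the captioning model.
    clip_model.eval().to(device)
    batch = torch.stack([preprocess(img) for img in images]).to(device)
    prefix = clip_model.encode_image(batch).float()
    # L2-normalize the embeddings; with a custom CLIP whose embedding
    # scale differs from OpenAI's, this keeps prefixes comparable.
    prefix = prefix / prefix.norm(dim=-1, keepdim=True)
    return prefix.cpu()
```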

@rmokady (Owner) commented Feb 3, 2022

For COCO, where we train both the prefix and GPT-2, the loss got down to 1.47.
Unfortunately, the logs for the Conceptual Captions run were left on an old server, and I cannot access them anymore.
Note that 5 epochs over 3M images is a lot of training when using the standard CLIP.

Anyway, outputting the same sentence for every prefix usually means there is a bug somewhere.
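
One quick way to test for the failure mode described here is to compare generations for real prefixes against a random prefix (an illustrative sketch; `generate_caption` stands for a mapping-network-plus-GPT-2 decoding helper and is a hypothetical name):

```python
import torch

def prefix_sensitivity_check(model, real_prefixes, generate_caption):
    # If a random prefix yields the same caption as real image prefixes,
    # the language model is likely ignoring the prefix entirely.
    captions = [generate_caption(model, p.unsqueeze(0)) for p in real_prefixes]
    random_prefix = torch.randn_like(real_prefixes[0]).unsqueeze(0)
    print(f"{len(set(captions))} unique captions from {len(captions)} real prefixes")
    # A healthy model produces many unique captions, and the random
    # prefix yields something clearly different (often degenerate text).
    print("random-prefix caption:", generate_caption(model, random_prefix))
```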

@goel-shashank (Author) commented

As I mentioned, I was able to fit a batch of 512 image-caption pairs and everything worked out, so I don't think there is a logical issue in the pipeline. Still, I will double-check everything. Closing this issue! Please let me know if you find anything useful!

@rmokady (Owner) commented Feb 19, 2022

Hi @goel-shashank,
I found some logs for Conceptual Captions. This is with the ResNet CLIP:
[Screenshot: training logs for the ResNet CLIP run]

rmokady reopened this issue Feb 19, 2022
@rmokady (Owner) commented Feb 19, 2022

This is with the ViT CLIP:
[Screenshot: training logs for the ViT CLIP run]

@ycchanau commented

I have the same problem with my own dataset. It keeps generating similar captions...

@surisdi commented Jul 12, 2022

Hi, I have the same problem with Conceptual Captions + the frozen model. Do you have loss values for that scenario? All the inputs end up converging to the same prefix. Thanks!

I followed the README and ran:

```
python parse_conceptual.py --clip_model_type ViT-B/32 --data_root /path/to/conceptual_captions --num_threads 100
```

and then:

```
python train.py --only_prefix --data /path/to/conceptual_captions/conceptual_clip_ViT-B_32_train.pkl --out_dir /path/to/output_dir --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40
```
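
To confirm the collapse described above, one can measure how similar the mapping network's outputs are across different inputs (an illustrative sketch; `mapper` and `clip_embeds` are assumed names for the transformer mapping network and a batch of CLIP image embeddings):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prefix_collapse_score(mapper, clip_embeds):
    # Map CLIP embeddings to prefixes and flatten each prefix sequence
    # into a single vector per input image.
    prefixes = mapper(clip_embeds).flatten(1)
    prefixes = F.normalize(prefixes, dim=-1)
    # Mean off-diagonal pairwise cosine similarity: a value near 1.0
    # means all inputs are mapped to (nearly) the same prefix.
    sim = prefixes @ prefixes.t()
    mask = ~torch.eye(len(sim), dtype=torch.bool, device=sim.device)
    return sim[mask].mean().item()
```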

@mmderakhshani commented

@surisdi Did you manage to reproduce the results?
