
A question about the x dimension. #26

Closed

tianjunyu0871 opened this issue Feb 14, 2022 · 7 comments

@tianjunyu0871

When I train only the transformer mapping network, I find that the dimension of x is (40, 512), but prefix_dim = 640. I don't know why this is happening. Is it caused by the extraction of the CLIP features? Hope to get your help, thank you.
[screenshot]

@rmokady
Owner

rmokady commented Feb 14, 2022

Hi @tianjunyu0871,
There are two versions of CLIP (ResNet and ViT).
Their encoding sizes are different: 512 and 640.
I assume this is your issue.

It should be solvable with different command-line arguments.

Does that help?
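
For reference, a minimal sketch (not from the repo) of checking the two embedding widths with the official clip package; the blank dummy input is only there to probe the output shape:

```python
import clip
import torch

# Compare the image-embedding width of the two CLIP backbones:
# ViT-B/32 produces 512-dim features, RN50x4 produces 640-dim features.
for name in ["ViT-B/32", "RN50x4"]:
    model, _preprocess = clip.load(name, device="cpu")
    res = model.visual.input_resolution
    dummy = torch.zeros(1, 3, res, res)  # blank probe image
    with torch.no_grad():
        feat = model.encode_image(dummy)
    print(name, tuple(feat.shape))  # (1, 512) then (1, 640)
```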

@tianjunyu0871
Author

Thanks for your reply.
Does the parameter is_rn stand for ResNet? If so, why does the following command also include is_rn? Is it a clerical error?
[screenshot]
In addition, could you share the pre-trained MLP weights and the evaluation code? Thank you so much!

@rmokady
Owner

rmokady commented Feb 15, 2022

Yes, this is an error.
Thank you very much for pointing it out.
I will fix it ASAP.

We use the evaluation code from the OSCAR repository, just replacing the JSON files with our JSONs.

We already shared the MLP weights; see the "Inference Notebooks" section in the readme.

@tianjunyu0871
Author

I tried to modify the prediction code, and the following error occurred while loading the pre-trained Transformer weights.
[screenshot]
I don't know if there is a problem with my code. Can you share your code for prediction with the Transformer? Thank you very much!
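
A generic way to pinpoint this kind of state_dict mismatch, assuming the checkpoint is a plain state dict and the model was built with the same mapping type and prefix sizes as in training (file and variable names here are illustrative):

```python
import torch
import torch.nn as nn

def diff_state_dict(model: nn.Module, ckpt_path: str) -> None:
    """Print which parameter names differ between a model and a checkpoint."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model_keys = set(model.state_dict().keys())
    ckpt_keys = set(ckpt.keys())
    print("missing from checkpoint:", sorted(model_keys - ckpt_keys)[:10])
    print("unexpected in checkpoint:", sorted(ckpt_keys - model_keys)[:10])

# Usage (illustrative): build the model with mapping_type=transformer and the
# training-time prefix sizes, then compare it against the downloaded weights.
# diff_state_dict(model, "transformer_weights.pt")
```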

@rmokady
Owner

rmokady commented Feb 19, 2022

Prediction with the transformer is available in this notebook.

@tianjunyu0871
Author

I have gained a lot from your work, but I still have a few questions, and I hope you can answer them.
First question: I tried to remove the stop token, but the results were not good. Is there a good way to generate more than one sentence?
Second question: Have you tried using different GPT models, such as GPT2-medium or GPT2-large? Is the difference significant?
Third question: What does the prefix_length_clip parameter mean in training?
Looking forward to your reply, thank you very much!

@rmokady
Owner

rmokady commented Mar 6, 2022

To generate more than one sentence, you should replace the inference algorithm (e.g. with beam search).
Using a variant of beam search, you can produce different captions.
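
A minimal sketch of this idea using the Hugging Face generate API rather than the repo's own generation loop; the plain-text prompt stands in for the CLIP prefix, which in the actual model is an embedding:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Plain-text prompt as a stand-in for the projected CLIP prefix.
inputs = tokenizer("A photo of", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    num_beams=5,
    num_return_sequences=5,   # one caption per surviving beam
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```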

We haven't tried different GPT models.

prefix_length_clip controls the transformer mapping network: it sets the size (in tokens) of the CLIP embedding part of the prefix, since the rest of the prefix is a learned constant.
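
A hypothetical simplification of how that prefix is assembled (layer sizes and names are assumptions, not the repo's exact code): the CLIP embedding is projected into prefix_length_clip tokens, prefix_length learned constant tokens attend to them through the transformer, and the refined constant tokens become the GPT-2 prefix:

```python
import torch
import torch.nn as nn

class MapperSketch(nn.Module):
    """Hypothetical simplification of the transformer mapping network."""
    def __init__(self, clip_dim=640, gpt_dim=768,
                 prefix_length=40, prefix_length_clip=40):
        super().__init__()
        self.prefix_length_clip = prefix_length_clip
        # The CLIP embedding is projected into `prefix_length_clip` tokens ...
        self.to_clip_tokens = nn.Linear(clip_dim, prefix_length_clip * gpt_dim)
        # ... and `prefix_length` learned constant tokens attend to them.
        self.learned_const = nn.Parameter(torch.randn(prefix_length, gpt_dim))
        layer = nn.TransformerEncoderLayer(d_model=gpt_dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=8)

    def forward(self, clip_embed):  # clip_embed: (batch, clip_dim)
        b = clip_embed.shape[0]
        clip_tokens = self.to_clip_tokens(clip_embed)
        clip_tokens = clip_tokens.view(b, self.prefix_length_clip, -1)
        const = self.learned_const.unsqueeze(0).expand(b, -1, -1)
        x = self.transformer(torch.cat((clip_tokens, const), dim=1))
        # Only the refined constant tokens are handed to GPT-2 as the prefix.
        return x[:, self.prefix_length_clip:]

mapper = MapperSketch()
print(mapper(torch.randn(2, 640)).shape)  # torch.Size([2, 40, 768])
```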

rmokady closed this as completed May 11, 2022