Hi,
Could you kindly provide more details on the setup for fine-tuning the model with CLIP and on the zero-shot text-guided expression editing procedure?
For model fine-tuning with CLIP, my understanding is that:

- the same losses as in emotion adaptation are used, in addition to the CLIP loss;
- the fine-tuning is performed on MEAD, where each training video is paired with a fixed text prompt for its emotion category (attached in the screenshot).
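To make sure I'm describing my understanding precisely, here is a minimal sketch of the training setup I have in mind. All names, prompts, and the loss weighting are my own assumptions (toy numpy stand-ins for the CLIP encoders), not taken from your code:

```python
import numpy as np

# Hypothetical fixed text prompts per MEAD emotion category (my guesses,
# not the actual prompts from the screenshot).
EMOTION_PROMPTS = {
    "happy": "a person talking with a happy expression",
    "angry": "a person talking with an angry expression",
    "sad":   "a person talking with a sad expression",
}

def cosine_clip_loss(image_feat: np.ndarray, text_feat: np.ndarray) -> float:
    """CLIP-style loss: 1 - cosine similarity of L2-normalized features."""
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feat = text_feat / np.linalg.norm(text_feat)
    return float(1.0 - image_feat @ text_feat)

# Toy 512-d vectors standing in for CLIP image/text encoder outputs.
rng = np.random.default_rng(0)
frame_feat = rng.standard_normal(512)   # rendered frame -> CLIP image encoder
text_feat = rng.standard_normal(512)    # emotion prompt -> CLIP text encoder

loss_clip = cosine_clip_loss(frame_feat, text_feat)

# My assumption: total loss = emotion-adaptation losses + weighted CLIP loss.
lambda_clip = 0.1                       # hypothetical weight
other_losses = 1.0                      # placeholder for the adaptation losses
total_loss = other_losses + lambda_clip * loss_clip
```

Is this roughly the structure, or does the CLIP loss enter the objective differently?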
For the zero-shot text-guided expression editing, I was wondering how the CLIP text feature is incorporated into the existing model structure (e.g., via a mapping from the CLIP feature to the latent code z, or to the emotion prompt?).
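To clarify what I mean by the first option, here is a sketch of a learned linear map from the CLIP text feature to the latent code z. The dimensions and names are entirely my own guesses for illustration, not your implementation:

```python
import numpy as np

CLIP_DIM = 512   # CLIP ViT-B/32 text feature size
Z_DIM = 128      # assumed (hypothetical) size of the latent code z

rng = np.random.default_rng(0)
W = rng.standard_normal((Z_DIM, CLIP_DIM)) * 0.01  # learnable projection
b = np.zeros(Z_DIM)                                 # learnable bias

def map_text_feature_to_z(text_feat: np.ndarray) -> np.ndarray:
    """Project an L2-normalized CLIP text feature into the latent space of z."""
    text_feat = text_feat / np.linalg.norm(text_feat)
    return W @ text_feat + b

# At edit time, an arbitrary prompt's CLIP feature would be mapped to z
# and fed to the (frozen) generator in place of the learned emotion code.
text_feat = rng.standard_normal(CLIP_DIM)
z = map_text_feature_to_z(text_feat)
```

Is it something like this, or is the CLIP feature instead used to condition the emotion prompt directly?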
Thank you in advance for your time and help!

Originally posted by @JamesLong199 in #23 (comment)