
IConZIC

IConZIC: Image-Conditioned Zero-shot Image Captioning by Vision-Language Pre-Training Model (pdf)

Natural Language Processing (COSE461) Final Project

Abstract

The field of image captioning, which combines computer vision and natural language processing, has witnessed extensive research efforts. However, zero-shot learning for image captioning remains relatively underexplored. Zero-shot image captioning research began with ZeroCap and was followed by ConZIC, which is currently considered state-of-the-art (SOTA). However, ConZIC still has certain limitations. To address these limitations and advance the field of zero-shot image captioning, we propose IConZIC (Image-Conditioned Zero-shot Image Captioning). IConZIC overcomes the initialization issue of ConZIC by leveraging a Vision-Language Pre-training (VLP) encoder, resulting in faster and more accurate caption generation.

Approach

Algorithm

Results

Qualitative Results

Comparison with ConZIC

Google Colaboratory

Open In Colab

ConZIC

[CVPR 2023] ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing
Zequn Zeng, Hao Zhang, Zhengjue Wang, Ruiying Lu, Dongsheng Wang, Bo Chen

arXiv Hugging Face Spaces Open In Colab

News

  • [2023/4] Adding demo on Huggingface Space and Colab!
  • [2023/3] ConZIC is publicly released!

Framework

Gibbs-BERT

Example of sentiment control

DEMO

Preparation

Please download CLIP and BERT from Hugging Face.
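If you prefer to cache the checkpoints ahead of time, the following is a minimal sketch using the transformers library, assuming the default bert-base-uncased and openai/clip-vit-base-patch32 checkpoints that appear in the demo commands below:

# Sketch: pre-download and cache the default checkpoints used by the demo commands.
# Swap the names if you pass different --lm_model / --match_model arguments.
from transformers import BertTokenizer, BertForMaskedLM, CLIPModel, CLIPProcessor

lm_name = "bert-base-uncased"
clip_name = "openai/clip-vit-base-patch32"

# from_pretrained downloads once and caches locally (by default under ~/.cache/huggingface).
bert_tokenizer = BertTokenizer.from_pretrained(lm_name)
bert_model = BertForMaskedLM.from_pretrained(lm_name)
clip_processor = CLIPProcessor.from_pretrained(clip_name)
clip_model = CLIPModel.from_pretrained(clip_name)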

SketchyCOCOcaption benchmark in our work is available here.

Environment setup:

pip install -r requirements.txt

To run zero-shot captioning on images:

ConZIC supports arbitrary generation orders via the --order argument. You can increase alpha for more fluency and beta for more image content; notably, there is a trade-off between fluency and image-matching degree (see the sketch after the examples below).
Sequential: update tokens in the classical left-to-right order. At each iteration, the whole sentence is updated.

python demo.py --run_type "caption" --order "sequential" --sentence_len 10 \
    --caption_img_path "./examples/girl.jpg" --samples_num 1 \
    --lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32" \
    --alpha 0.02 --beta 2.0

Shuffled: update tokens in a randomly shuffled generation order; different orders result in different captions.

python demo.py --run_type "caption" --order "shuffle" --sentence_len 10 \
    --caption_img_path "./examples/girl.jpg" --samples_num 3 \
    --lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32" \
    --alpha 0.02 --beta 2.0

Random: at each iteration, randomly select a single position and update only that token; this yields high diversity due to the added randomness.

python demo.py --run_type "caption" --order "random" --sentence_len 10 \
    --caption_img_path "./examples/girl.jpg" --samples_num 3 \
    --lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32" \
    --alpha 0.02 --beta 2.0
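The alpha/beta trade-off can be pictured as a weighted combination of two scores for each candidate token: a fluency score from the masked language model and an image-matching score from CLIP. The following is an illustrative sketch only, not the repository's actual scoring code; the candidate score arrays and the greedy selection are assumptions made for clarity:

import numpy as np

def combine_scores(lm_logprobs, clip_similarities, alpha=0.02, beta=2.0):
    # lm_logprobs: log-probabilities of candidate tokens under the masked LM (fluency).
    # clip_similarities: CLIP image-text similarities for the sentence with each
    # candidate filled in (image content).
    # Higher alpha favours fluent text; higher beta favours image-matching content.
    combined = alpha * np.asarray(lm_logprobs) + beta * np.asarray(clip_similarities)
    # Greedy choice for illustration; the actual method samples a candidate
    # rather than always taking the argmax.
    return int(np.argmax(combined))

# Toy example: three candidate tokens for one masked position.
best = combine_scores(lm_logprobs=[-1.2, -0.8, -2.5], clip_similarities=[0.31, 0.22, 0.35])
print(best)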

To run controllable zero-shot captioning on images:

ConZIC supports many text-related controllable signals. For example:
Sentiment (positive/negative): you can increase gamma for a higher degree of control; here too there is a trade-off (see the sketch after these examples).

python demo.py \
    --run_type "controllable" --control_type "sentiment" --sentiment_type "positive" \
    --order "sequential" --sentence_len 10 --caption_img_path "./examples/girl.jpg" --samples_num 1 \
    --lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32" \
    --alpha 0.02 --beta 2.0 --gamma 5.0

Part-of-speech (POS): the generated caption will match the predefined POS template as closely as possible.

python demo.py \
    --run_type "controllable" --control_type "pos" --order "sequential" \
    --pos_type "your predefined POS template" \
    --sentence_len 10 --caption_img_path "./examples/girl.jpg" --samples_num 1 \
    --lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32" \
    --alpha 0.02 --beta 2.0 --gamma 5.0

Length: change --sentence_len.
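For controllable generation, gamma weights a third score from the control signal (e.g. a sentiment classifier) alongside the fluency and image-matching terms. A rough, illustrative sketch under the same assumptions as the previous one (the control scores are hypothetical classifier outputs, not the repository's implementation):

import numpy as np

def combine_controllable_scores(lm_logprobs, clip_similarities, control_scores,
                                alpha=0.02, beta=2.0, gamma=5.0):
    # control_scores: e.g. positive-sentiment probabilities for the sentence with
    # each candidate token filled in. Larger gamma pushes captions toward the
    # control signal, at the cost of fluency and image matching (the trade-off above).
    combined = (alpha * np.asarray(lm_logprobs)
                + beta * np.asarray(clip_similarities)
                + gamma * np.asarray(control_scores))
    return int(np.argmax(combined))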

Gradio Demo

We highly recommend using the following WebUI demo in your browser at the local URL http://127.0.0.1:7860.

pip install gradio
python app.py --lm_model "bert-base-uncased" --match_model "openai/clip-vit-base-patch32" 

You can also create a public link that anyone can use to access the demo from their browser by passing share=True to demo.launch().
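For reference, the following is a minimal sketch of how the public link can be created in a Gradio script; caption_fn here is a hypothetical stand-in for the captioning function that app.py actually runs:

import gradio as gr

def caption_fn(image):
    # Hypothetical placeholder: app.py runs the actual ConZIC captioning here.
    return "a generated caption"

demo = gr.Interface(fn=caption_fn, inputs=gr.Image(type="pil"), outputs="text")
# share=True asks Gradio for a temporary public URL in addition to the
# local one at http://127.0.0.1:7860.
demo.launch(share=True)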


Citation

Please cite our work if you use it in your research:

@article{zeng2023conzic,
  title={ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing},
  author={Zeng, Zequn and Zhang, Hao and Wang, Zhengjue and Lu, Ruiying and Wang, Dongsheng and Chen, Bo},
  journal={arXiv preprint arXiv:2303.02437},
  year={2023}
}

Contact

If you have any questions, please contact zzequn99@163.com or zhanghao_xidian@163.com.

Acknowledgment

This code is based on bert-gen and MAGIC.

Thanks to Jiaqing Jiang for providing the Hugging Face and Colab demos.
