# AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

How do Multimodal LLMs perform on Image Aesthetics Perception?

¹Xidian University, ²Nanyang Technological University, ³OPPO Research Institute
*Equal contribution. #Corresponding author.
If you like this work, please give us a star ⭐ on GitHub.

We construct a high-quality Expert-labeled Aesthetic Perception Database (EAPD), on which we build a golden benchmark that evaluates four abilities of MLLMs on image aesthetics perception: Aesthetic Perception (AesP), Aesthetic Empathy (AesE), Aesthetic Assessment (AesA), and Aesthetic Interpretation (AesI).
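
For the three question-answering abilities (AesP, AesE, and AesA), the leaderboards below report percentages (answer accuracy), broken down by image source and question type. As a rough illustration of how such a breakdown can be aggregated, here is a minimal Python sketch; the record fields (`category`, `pred`, `answer`) are hypothetical placeholders, not the schema of the released evaluation code.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Aggregate per-question correctness into per-category and overall accuracy.

    `records` is a list of dicts with hypothetical fields:
      - "category": question group (e.g., "NIs", "AIs", "AGIs")
      - "pred":     the option chosen by the MLLM
      - "answer":   the ground-truth option
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        correct[r["category"]] += int(r["pred"] == r["answer"])
    per_cat = {c: correct[c] / total[c] for c in total}
    # Overall accuracy is computed over all questions, i.e., weighted by
    # how many questions each category contains.
    overall = sum(correct.values()) / sum(total.values())
    return per_cat, overall

# Toy usage with made-up records:
recs = [
    {"category": "NIs", "pred": "A", "answer": "A"},
    {"category": "AIs", "pred": "B", "answer": "C"},
]
print(accuracy_by_category(recs))  # ({'NIs': 1.0, 'AIs': 0.0}, 0.5)
```

Because the overall figure is computed over all questions rather than averaged over categories, the Overall columns in the tables below need not equal the unweighted mean of the per-category columns.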

## News

- [2024/01/20] 🎉 Congrats to SPHINX-MoE for achieving new SOTA results on AesP and AesE!
- [2024/01/18] 🤗 The AesBench database is now available on Hugging Face! (A loading sketch follows this list.)
- [2024/01/17] 🚩 We have released the evaluation database and code of AesBench! Check here for more details.
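
If you fetch the database from Hugging Face, loading it with the `datasets` library might look like the sketch below. The dataset ID `AesBench/EAPD` and the split name are assumptions for illustration, not the repository's documented interface; check the actual dataset card before use.

```python
# A minimal sketch, assuming the database is published as a Hugging Face dataset.
# The dataset ID "AesBench/EAPD" and the split name are hypothetical.
from datasets import load_dataset

eapd = load_dataset("AesBench/EAPD", split="test")  # hypothetical ID and split
sample = eapd[0]
print(sample.keys())  # inspect the actual fields before relying on any names
```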

## Benchmark Results: GPT-4V and Gemini Pro Vision vs. other MLLMs

Here is the comparison of GPT-4V, Gemini Pro Vision, and other OA MLLMs on AesP. In the tables below, NIs, AIs, and AGIs denote natural images, artistic images, and AI-generated images, respectively.

| Rank | MLLM | Tec.&Qua. | Col.&Lig. | Comp. | Content | NIs | AIs | AGIs | Yes-No | What | How | Why | Overall |
|------|------|-----------|-----------|-------|---------|-----|-----|------|--------|------|-----|-----|---------|
| 🥇 | SPHINX-MoE | 66.67% | 76.31% | 72.68% | 66.31% | 75.84% | 72.19% | 68.88% | 69.12% | 62.18% | 80.38% | 88.05% | 72.93% |
| 🥈 | Q-Instruct | 66.03% | 74.48% | 73.68% | 68.09% | 76.48% | 69.70% | 69.28% | 64.68% | 63.31% | 85.28% | 86.34% | 72.61% |
| 🥉 | GPT-4V | 69.02% | 74.66% | 71.72% | 65.57% | 75.67% | 72.58% | 65.82% | 68.93% | 64.67% | 76.70% | 84.46% | 72.08% |
| 4 | Gemini Pro Vision | 65.08% | 74.57% | 72.24% | 67.97% | 74.63% | 69.62% | 70.03% | 64.70% | 64.95% | 78.71% | 90.24% | 71.99% |
| 5 | ShareGPT4V | 62.18% | 71.90% | 69.29% | 64.89% | 70.79% | 71.57% | 63.96% | 69.32% | 61.33% | 72.01% | 77.56% | 69.18% |
| 6 | mPLUG-Owl2 | 60.90% | 70.57% | 68.30% | 62.77% | 72.23% | 64.71% | 64.10% | 65.59% | 58.64% | 73.02% | 80.73% | 67.89% |
| 7 | LLaVA-1.5 | 53.85% | 70.16% | 67.40% | 59.93% | 69.10% | 65.71% | 62.37% | 62.36% | 58.92% | 70.71% | 81.22% | 66.32% |
| 8 | Qwen-VL | 54.81% | 66.25% | 62.91% | 60.64% | 68.30% | 58.85% | 59.44% | 61.25% | 55.38% | 67.53% | 74.15% | 63.21% |
| 9 | LLaVA | 46.79% | 63.59% | 65.30% | 64.54% | 64.29% | 61.10% | 60.77% | 65.39% | 52.27% | 61.18% | 74.88% | 62.43% |
| 10 | InstructBLIP | 37.82% | 55.36% | 55.43% | 57.09% | 57.06% | 55.86% | 47.21% | 59.84% | 45.01% | 54.98% | 56.34% | 54.29% |
| 11 | MiniGPT-v2 | 56.73% | 56.44% | 51.74% | 50.00% | 56.74% | 53.24% | 50.93% | 53.99% | 43.06% | 58.73% | 66.10% | 54.18% |
| 12 | GLM | 55.77% | 54.61% | 51.25% | 48.94% | 54.90% | 55.24% | 47.34% | 60.95% | 44.62% | 48.48% | 55.61% | 52.96% |
| 13 | Otter | 35.90% | 54.28% | 51.65% | 51.06% | 51.04% | 50.62% | 51.20% | 56.10% | 44.48% | 51.37% | 49.02% | 50.96% |
| 14 | IDEFICS-Instruct | 37.50% | 52.87% | 52.84% | 51.06% | 52.97% | 50.12% | 48.40% | 50.96% | 44.62% | 51.09% | 60.73% | 50.82% |
| 15 | MiniGPT-4 | 39.42% | 41.31% | 42.67% | 44.33% | 41.57% | 42.89% | 41.36% | 47.23% | 32.01% | 41.99% | 46.10% | 41.93% |
| 16 | TinyGPT-V | 21.79% | 24.52% | 22.13% | 28.01% | 22.71% | 24.69% | 24.34% | 32.39% | 17.99% | 19.77% | 19.27% | 23.71% |

Here is the comparison of GPT-4V, Gemini Pro Vision, and other OA MLLMs on AesE.

| Rank | MLLM | Emotion | Interest | Uniqueness | Vibe | NIs | AIs | AGIs | Yes-No | What | How | Why | Overall |
|------|------|---------|----------|------------|------|-----|-----|------|--------|------|-----|-----|---------|
| 🥇 | SPHINX-MoE | 68.59% | 80.65% | 75.86% | 82.14% | 74.72% | 75.19% | 69.02% | 74.95% | 62.89% | 72.71% | 88.48% | 73.32% |
| 🥈 | Q-Instruct | 68.64% | 83.86% | 75.86% | 80.00% | 76.65% | 72.19% | 66.62% | 64.30% | 67.42% | 81.57% | 86.76% | 72.68% |
| 🥉 | Gemini Pro Vision | 66.87% | 87.50% | 70.00% | 79.09% | 70.60% | 72.35% | 71.53% | 67.50% | 64.52% | 72.25% | 90.37% | 71.37% |
| 4 | ShareGPT4V | 66.48% | 80.65% | 68.97% | 78.72% | 70.95% | 73.69% | 67.29% | 67.75% | 65.58% | 72.71% | 83.58% | 70.75% |
| 5 | GPT-4V | 65.06% | 72.41% | 62.07% | 80.15% | 73.87% | 72.08% | 62.27% | 68.67% | 64.02% | 70.07% | 84.20% | 70.16% |
| 6 | mPLUG-Owl2 | 65.60% | 77.42% | 65.52% | 78.07% | 71.03% | 71.57% | 66.22% | 68.05% | 64.16% | 70.14% | 83.82% | 69.89% |
| 7 | LLaVA-1.5 | 62.49% | 80.65% | 75.85% | 78.93% | 69.26% | 69.58% | 65.43% | 62.37% | 64.16% | 71.71% | 84.07% | 68.32% |
| 8 | LLaVA | 58.61% | 80.63% | 65.52% | 75.83% | 67.01% | 66.96% | 58.38% | 67.95% | 55.95% | 60.14% | 79.66% | 64.68% |
| 9 | Qwen-VL | 58.67% | 83.87% | 72.41% | 73.90% | 63.88% | 67.08% | 61.57% | 60.65% | 58.07% | 66.14% | 79.90% | 64.18% |
| 10 | MiniGPT-v2 | 52.52% | 58.06% | 44.83% | 58.07% | 55.86% | 55.85% | 50.27% | 57.81% | 43.48% | 53.43% | 66.42% | 54.36% |
| 11 | GLM | 53.13% | 70.97% | 44.83% | 55.29% | 56.58% | 54.86% | 48.67% | 60.65% | 41.78% | 50.43% | 64.95% | 53.96% |
| 12 | InstructBLIP | 49.64% | 58.06% | 51.72% | 61.50% | 55.06% | 55.24% | 48.94% | 55.88% | 50.99% | 51.43% | 58.33% | 53.89% |
| 13 | Otter | 48.42% | 70.97% | 51.72% | 63.21% | 53.05% | 55.74% | 52.39% | 54.77% | 51.84% | 53.43% | 54.41% | 53.64% |
| 14 | IDEFICS-Instruct | 43.93% | 64.52% | 62.07% | 64.06% | 50.72% | 53.12% | 49.07% | 50.20% | 41.08% | 52.43% | 66.42% | 50.82% |
| 15 | MiniGPT-4 | 39.78% | 38.71% | 24.14% | 39.04% | 42.70% | 37.78% | 35.51% | 50.61% | 31.59% | 31.86% | 38.48% | 39.35% |
| 16 | TinyGPT-V | 30.36% | 29.03% | 31.03% | 35.40% | 32.50% | 36.03% | 26.99% | 36.00% | 29.89% | 28.86% | 31.62% | 32.04% |

Here is the comparison of GPT-4V, Gemini Pro Vision, and other OA MLLMs on AesA.

| Rank | MLLM | NIs | AIs | AGIs | Overall |
|------|------|-----|-----|------|---------|
| 🥇 | Q-Instruct | 62.20% | 49.75% | 40.69% | 52.86% |
| 🥈 | GPT-4V | 59.98% | 46.92% | 40.59% | 50.86% |
| 🥉 | mPLUG-Owl2 | 57.78% | 49.50% | 40.83% | 50.57% |
| 4 | SPHINX-MoE | 57.62% | 48.50% | 38.70% | 49.93% |
| 5 | Gemini Pro Vision | 54.17% | 48.39% | 42.20% | 49.38% |
| 6 | ShareGPT4V | 54.65% | 48.38% | 35.90% | 47.82% |
| 7 | InstructBLIP | 52.73% | 47.88% | 34.84% | 46.54% |
| 8 | Qwen-VL | 54.25% | 39.28% | 40.43% | 46.25% |
| 9 | LLaVA | 51.69% | 48.00% | 34.31% | 45.96% |
| 10 | LLaVA-1.5 | 50.08% | 48.13% | 34.97% | 45.46% |
| 11 | IDEFICS-Instruct | 50.00% | 47.76% | 33.78% | 45.00% |
| 12 | Otter | 49.20% | 48.25% | 34.04% | 44.86% |
| 13 | TinyGPT-V | 44.06% | 41.65% | 44.81% | 43.57% |
| 14 | MiniGPT-4 | 41.65% | 36.28% | 35.90% | 38.57% |
| 15 | GLM | 38.92% | 37.78% | 35.90% | 37.79% |
| 16 | MiniGPT-v2 | 27.05% | 31.92% | 36.97% | 31.11% |

Here is the comparison of GPT-4V, Gemini Pro Vision, and other OA MLLMs on AesI.

| Rank | MLLM | Relevance | Precision | Completeness | Overall |
|------|------|-----------|-----------|--------------|---------|
| 🥇 | GPT-4V | 1.385 | 1.151 | 1.366 | 1.301 |
| 🥈 | ShareGPT4V | 1.440 | 1.117 | 1.331 | 1.296 |
| 🥉 | SPHINX-MoE | 1.501 | 1.171 | 1.130 | 1.267 |
| 4 | Gemini Pro Vision | 1.416 | 1.087 | 1.164 | 1.222 |
| 5 | Qwen-VL | 1.393 | 1.006 | 1.175 | 1.192 |
| 6 | mPLUG-Owl2 | 1.402 | 1.016 | 1.130 | 1.182 |
| 7 | IDEFICS-Instruct | 1.406 | 1.007 | 1.126 | 1.180 |
| 8 | LLaVA-1.5 | 1.397 | 0.953 | 1.120 | 1.157 |
| 9 | InstructBLIP | 1.372 | 0.863 | 1.144 | 1.126 |
| 10 | LLaVA | 1.374 | 0.918 | 1.084 | 1.125 |
| 11 | Otter | 1.242 | 0.848 | 0.989 | 1.027 |
| 12 | Q-Instruct | 1.222 | 0.939 | 0.898 | 1.020 |
| 13 | MiniGPT-v2 | 1.191 | 0.868 | 0.948 | 1.003 |
| 14 | MiniGPT-4 | 1.158 | 0.823 | 1.016 | 0.999 |
| 15 | GLM | 1.122 | 0.729 | 0.944 | 0.932 |
| 16 | TinyGPT-V | 0.871 | 0.511 | 0.720 | 0.701 |
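
One consistency check that can be read directly off this table: each Overall score matches the arithmetic mean of its Relevance, Precision, and Completeness scores (for GPT-4V, (1.385 + 1.151 + 1.366) / 3 ≈ 1.301). A few rows verified in Python:

```python
# Check that each AesI Overall score is the mean of its three dimension scores,
# using values copied from the table above.
rows = {
    "GPT-4V":     (1.385, 1.151, 1.366, 1.301),
    "SPHINX-MoE": (1.501, 1.171, 1.130, 1.267),
    "TinyGPT-V":  (0.871, 0.511, 0.720, 0.701),
}
for name, (rel, pre, com, overall) in rows.items():
    assert round((rel + pre + com) / 3, 3) == overall, name
print("AesI Overall = mean(Relevance, Precision, Completeness) for all checked rows")
```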

## Submission Guideline

- Via GitHub Release: please see our release for details.

## Acknowledgement

Special thanks go to the 32 aesthetic experts who participated in our experiments; their rich aesthetic experience and conscientious work were crucial to the construction of the dataset. In particular, we thank:

Wei Liu, Xin Liu, Luxia Chen, Tianjiao Gu, Dahai Tian, Ziyan Ou, et al.

We also thank our collaborators for their kind assistance with data collection and MLLM deployment:

Zhichao Duan and Pangu Xie.

## Citation

If you find our work interesting, please feel free to cite our paper:

```bibtex
@article{AesBench,
    title={AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception},
    author={Huang, Yipo and Yuan, Quan and Sheng, Xiangfei and Yang, Zhichao and Wu, Haoning and Chen, Pengfei and Yang, Yuzhe and Li, Leida and Lin, Weisi},
    journal={arXiv preprint arXiv:2401.08276},
    year={2024}
}
```
