This repository contains the implementation and evaluation framework for assessing the visualization literacy of Visual Language Models (VLMs) using two standardized tests, VLAT and CALVI. The study compares four state-of-the-art VLMs on their ability to interpret, reason about, and critically analyze data visualizations.
The project evaluates VLMs through:
- Visualization Literacy Assessment Test (VLAT) - 53 multiple-choice items across 12 visualization types
- Critical thinking Assessment for Literacy in Visualization (CALVI) - 45 items focused on misleading visualization elements
- 10 randomized evaluation runs per model to ensure robust results
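
One way to read "randomized evaluation runs" is that the item order is reshuffled on each run and accuracy is then averaged across runs. The sketch below assumes that reading. The function names (`run_orders`, `summarize`) and the per-run seeding scheme are illustrative and are not taken from the repository's notebooks.

```python
import random
import statistics

def run_orders(n_items, n_runs=10, base_seed=0):
    """Return one shuffled item order per run (seeded, so reproducible)."""
    orders = []
    for run in range(n_runs):
        rng = random.Random(base_seed + run)  # hypothetical seeding scheme
        order = list(range(n_items))
        rng.shuffle(order)
        orders.append(order)
    return orders

def summarize(run_accuracies):
    """Mean and sample standard deviation of per-run accuracy."""
    mean = statistics.mean(run_accuracies)
    std = statistics.stdev(run_accuracies) if len(run_accuracies) > 1 else 0.0
    return mean, std

vlat_orders = run_orders(n_items=53)   # 53 VLAT items
calvi_orders = run_orders(n_items=45)  # 45 CALVI items
```

Fixing the seed per run keeps the shuffles reproducible, so a given run can be replayed against a different model with the same item order.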
| Model | Version | Provider |
|---|---|---|
| GPT-4 Vision | GPT-4o | OpenAI |
| Claude | 3.5 Sonnet | Anthropic |
| Gemini | 1.5 Pro | Google |
| Llama | 3.2-vision | Meta |
All models are configured with:
- Temperature: 0
- Max tokens: 300
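
These shared settings could be kept in one place so that every notebook passes the same values. The following is a minimal sketch. The dictionary name, the helper, and the renaming mechanism are assumptions, since each provider's SDK may spell these parameters differently (for example, Gemini's generation config uses `max_output_tokens`).

```python
# Shared decoding settings from the list above.
GENERATION_SETTINGS = {"temperature": 0, "max_tokens": 300}

def settings_for(rename=None):
    """Copy the shared settings, optionally renaming keys for a given SDK.

    `rename` maps a shared key to the name a particular SDK expects.
    The mapping is supplied by the caller and is not verified here.
    """
    rename = rename or {}
    return {rename.get(key, key): value for key, value in GENERATION_SETTINGS.items()}

default_kwargs = settings_for()
gemini_style = settings_for({"max_tokens": "max_output_tokens"})
```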
├── README.md
├── data/
│ ├── VLAT/ # VLAT test images and questions
│ └── CALVI/ # CALVI test images and questions
├── scripts/
│ ├── gpt4_evaluation.ipynb # GPT-4 Vision evaluation notebook
│ ├── claude_evaluation.ipynb # Claude evaluation notebook
│ ├── gemini_evaluation.ipynb # Gemini evaluation notebook
│ └── llama_evaluation.ipynb # Llama evaluation notebook
├── prompts/
│ ├── VLAT_prompt.txt # Standardized VLAT assessment prompt
│ └── CALVI_prompt.txt # Standardized CALVI assessment prompt
└── Output/
    ├── CALVI/ # Model responses to CALVI questions
    └── VLAT/ # Model responses to VLAT questions
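
The layout above suggests that each test has a data folder, a prompt file, and an output folder. As a rough sketch, the paths for one test could be assembled as follows. The helper `paths_for` is hypothetical, and the file names inside each folder are not specified in this README.

```python
from pathlib import Path

TESTS = ("VLAT", "CALVI")

def paths_for(test, root="."):
    """Build the data, prompt, and output paths for one test."""
    if test not in TESTS:
        raise ValueError(f"unknown test: {test!r}")
    root = Path(root)
    return {
        "data": root / "data" / test,
        "prompt": root / "prompts" / f"{test}_prompt.txt",
        "output": root / "Output" / test,
    }
```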
- Clone the repository:
  `git clone https://github.com/washuvis/VisLit-VLM-Eval.git`
- Install required dependencies:
  `pip install -r requirements.txt`
- Configure API keys:
  - Add your API keys for each VLM provider
- Run evaluations:
  - Navigate to the `scripts` directory
  - Execute evaluation notebooks for each model
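
This README does not say how the notebooks expect API keys to be supplied. Environment variables are one common approach. The sketch below assumes that approach, and the variable names are conventional guesses rather than names confirmed by the repository. Llama is omitted because how it is hosted is not stated here.

```python
import os

# Hypothetical variable names; adjust to whatever the notebooks actually read.
EXPECTED_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GOOGLE_API_KEY",
}

def missing_keys(env=None):
    """Return the providers whose key variable is unset or empty."""
    env = os.environ if env is None else env
    return [provider for provider, var in EXPECTED_KEYS.items() if not env.get(var)]
```

Calling `missing_keys()` before opening a notebook gives an early, readable failure instead of an authentication error partway through a run.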