This repository contains two related projects developed in Spring 2025 focused on evaluating and generating image captions using large vision-language models.
captionResearchProject is a research-driven pipeline for evaluating AI-generated image captions.
It includes tools to:
- Generate image captions using local models (e.g., Gemma, Kosmos-2)
- Convert outputs to standardized JSON/CSV formats
- Evaluate caption quality using:
  - BLEU, ROUGE, METEOR
  - BERTScore
  - CLIPScore
- Visualize and compare model performance across datasets (e.g., urban vs. rural)
This project was developed for internal analysis and academic poster presentation.
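For illustration, here is a minimal sketch of how a single candidate caption might be scored against a reference, assuming the `nltk`, `rouge-score`, and `bert-score` packages are installed; the caption strings are placeholders, not data from the project:

```python
# Minimal sketch of the metric step (hypothetical captions).
# pip install nltk rouge-score bert-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "a red bus parked on a city street"          # ground-truth caption
candidate = "a red bus is parked beside the road"        # model-generated caption

# BLEU (smoothed, since single captions are short)
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BERTScore (semantic similarity)
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU: {bleu:.3f}  ROUGE-L: {rouge_l:.3f}  BERTScore-F1: {f1.item():.3f}")
```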
A lightweight Dockerized Python tool for generating image captions using vision-language models served via Ollama.
Supports models like:
- `llava:latest`
- `llama3.2-vision:90b`
Captions are generated via the Ollama REST API and exported to CSV.
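A minimal sketch of that flow, assuming Ollama is running locally on its default port (11434); the image path, prompt, and CSV layout here are illustrative, not the tool's exact interface:

```python
# Minimal sketch: caption one image via the Ollama REST API and append to a CSV.
import base64, csv, requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def caption_image(image_path: str, model: str = "llava:latest") -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": model,
        "prompt": "Describe this image in one concise sentence.",
        "images": [image_b64],   # multimodal models accept base64-encoded images
        "stream": False,
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"].strip()

if __name__ == "__main__":
    with open("captions.csv", "a", newline="") as out:
        writer = csv.writer(out)
        caption = caption_image("images/example.jpg")   # illustrative path
        writer.writerow(["example.jpg", caption])
```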
The output CSVs from this program are already in the format required by captionResearchProject, making them directly compatible for evaluation.
The referenceCaptions/ directory contains multiple sets of manually written ground-truth captions for each evaluation batch (urban and rural). Each batch has two independently written reference sets to reduce bias in metric evaluation. A sub-directory holds the model-generated captions used in the research report.
All files are in COCO-style JSON format, with fields including:
- `images`: metadata about each image (filename, dimensions, timestamp, etc.)
- `annotations`: human-written captions, each linked to an `image_id`
- Optional `labels`: keywords or concepts associated with the caption (used for exploratory purposes)
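As a quick orientation, a sketch of reading one of these files and grouping captions by image, assuming standard COCO key names (`id`, `file_name`, `image_id`, `caption`); the file path is illustrative:

```python
# Minimal sketch: load a COCO-style reference file and map image_id -> captions.
import json
from collections import defaultdict

with open("referenceCaptions/urban_set1.json") as f:   # illustrative path
    data = json.load(f)

captions_by_image = defaultdict(list)
for ann in data["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

for img in data["images"]:
    print(img["file_name"], captions_by_image[img["id"]])
```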
The captionCorrection/ directory contains caption pairs: a model's output caption and a human-evaluated, corrected version of that output. It also includes metric evaluations of the improvements.
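A hedged sketch of how such an improvement could be quantified with BERTScore (the caption strings below are invented placeholders, not entries from captionCorrection/):

```python
# Minimal sketch: measure how much a human-corrected caption improves over the
# model's original output, scored against a reference caption.
from bert_score import score as bert_score

reference = ["a farmer driving a tractor through a wheat field"]
original  = ["a man on a vehicle in a field"]            # model output (placeholder)
corrected = ["a farmer driving a tractor across a wheat field"]  # human correction

_, _, f1_orig = bert_score(original, reference, lang="en")
_, _, f1_corr = bert_score(corrected, reference, lang="en")
print(f"BERTScore-F1 improvement: {f1_corr.item() - f1_orig.item():+.3f}")
```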
---
Both tools are functional and were used in real evaluation pipelines.
You may need to adjust image paths, model names, or mounts depending on your system setup.