This project implements an end-to-end image captioning pipeline by integrating a Vision Transformer (ViT) for visual feature extraction with GPT-2 for sequence generation. The model is trained on the Flickr8k dataset and achieves a BLEU-2 score of 87% on the validation set. The pipeline is containerized with Docker, deployed on AWS ECS, and monitored in real time with Prometheus and Grafana.
- Generate meaningful and grammatically correct captions for images using a hybrid ViT-GPT2 architecture.
- Achieve high performance on standard image captioning benchmarks (e.g., BLEU scores).
- Deploy the pipeline in a scalable and observable production environment.
- Encoder: Pretrained Vision Transformer (ViT) extracts visual tokens from input images.
- Decoder: GPT-2 (fine-tuned) generates captions based on visual tokens from ViT.
- Token Fusion: ViT embeddings are linearly projected and prepended to GPT-2's text input.
- Loss Function: Cross-Entropy Loss
- Optimizer: AdamW
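The token-fusion step above can be sketched in PyTorch. This is a minimal illustration, not the project's actual module: the class name `VisualPrefix` and the dimensions (768 for both ViT-Base and GPT-2 small) are assumptions, and random tensors stand in for real ViT and GPT-2 embeddings.

```python
import torch
import torch.nn as nn

# Assumed dimensions: ViT-Base patch embeddings and GPT-2 (small)
# hidden states are both 768-d; a linear layer maps one to the other.
VIT_DIM, GPT2_DIM = 768, 768

class VisualPrefix(nn.Module):
    """Project ViT patch embeddings and prepend them to the
    GPT-2 token embeddings (the token-fusion step)."""
    def __init__(self, vit_dim: int = VIT_DIM, gpt2_dim: int = GPT2_DIM):
        super().__init__()
        self.proj = nn.Linear(vit_dim, gpt2_dim)

    def forward(self, vit_tokens: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # vit_tokens:  (batch, n_patches, vit_dim)
        # text_embeds: (batch, seq_len, gpt2_dim)
        visual = self.proj(vit_tokens)
        # Visual tokens come first, so GPT-2 attends to them
        # when generating every caption token.
        return torch.cat([visual, text_embeds], dim=1)

fusion = VisualPrefix()
vit_out = torch.randn(2, 197, VIT_DIM)   # 196 patches + [CLS]
txt_emb = torch.randn(2, 20, GPT2_DIM)   # 20 caption tokens
fused = fusion(vit_out, txt_emb)         # (2, 217, 768)
```

The fused sequence is then fed to the GPT-2 decoder, with the cross-entropy loss computed only over the caption positions.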
- Flickr8k: A dataset containing 8,000 images, each annotated with 5 different captions.
- Source: Flickr8k on Kaggle
- Preprocessing:
  - Images resized and normalized for ViT
  - Captions tokenized with the GPT-2 tokenizer
  - Captions longer than 20 tokens filtered out
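The caption-filtering step is simple enough to sketch. A whitespace split stands in for the GPT-2 BPE tokenizer here so the example runs offline; the real pipeline would count tokens from `transformers`' `GPT2Tokenizer` instead.

```python
# Sketch of the length filter applied during preprocessing.
# MAX_TOKENS matches the 20-token cutoff described above.
MAX_TOKENS = 20

def keep_caption(caption: str, tokenize=str.split) -> bool:
    """Keep captions of at most MAX_TOKENS tokens.

    `tokenize` defaults to whitespace splitting as a stand-in;
    pass the GPT-2 tokenizer's encode for the real pipeline.
    """
    return len(tokenize(caption)) <= MAX_TOKENS

captions = [
    "a girl climbing stairs",
    " ".join(["word"] * 25),   # 25 tokens: filtered out
]
kept = [c for c in captions if keep_caption(c)]
```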
| Metric | Score |
|---|---|
| BLEU-1 | 94% |
| BLEU-2 | 87% |
| BLEU-3 | 75% |
| BLEU-4 | 63% |
BLEU scores were computed on the validation set using `nltk.translate.bleu_score`.
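For reference, a BLEU-2 score for a single caption can be computed with NLTK as follows. The reference and hypothesis below are illustrative, not drawn from the actual validation set; the `weights=(0.5, 0.5)` argument restricts scoring to unigrams and bigrams.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One reference caption (a list of references is expected) and
# one model hypothesis, both pre-tokenized.
reference = [["a", "girl", "climbing", "stairs"]]
hypothesis = ["a", "girl", "climbing", "the", "stairs"]

# Smoothing guards against zero n-gram counts on short captions.
smooth = SmoothingFunction().method1
bleu2 = sentence_bleu(reference, hypothesis,
                      weights=(0.5, 0.5),   # BLEU-2: unigrams + bigrams
                      smoothing_function=smooth)
```

Corpus-level scores like those in the table would aggregate over all validation captions with `corpus_bleu` rather than averaging per-sentence scores.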
| Component | Tech Used |
|---|---|
| Containerization | Docker |
| Orchestration | AWS ECS + Fargate |
| Monitoring | Prometheus + Grafana |
| Logging | AWS CloudWatch |
| Model Serving | FastAPI |
| Input Image | Generated Caption |
|---|---|
| (image not shown) | "a girl climbing stairs" |
- Python 3.9
- PyTorch
- Hugging Face Transformers
- OpenCV
- FastAPI
- Docker, ECS, CloudWatch
- Prometheus + Grafana
- Hybrid encoder-decoder architecture using state-of-the-art models
- Achieved 87% BLEU-2, reflecting meaningful short-phrase caption generation
- Fully containerized and deployed with auto-scaling support
- Monitoring integrated for live inference metrics and uptime stats
- Integrate CLIP for improved vision-text alignment
- Extend to larger datasets like Flickr30k or MS-COCO
- Add user feedback collection via a frontend interface