This project implements an image-to-text system that generates descriptive captions for images by combining computer vision and natural language processing. It uses Xception (a pre-trained CNN) to extract image features and an LSTM network to generate captions.
- Extracts high-level features from images using a pre-trained Xception model.
- Generates contextually relevant captions with an LSTM decoder.
- Supports BLEU score evaluation for caption quality.
- Modular implementation for easy extensibility.
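The feature-extraction step listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the project script: the function names `image_id` and `extract_features` are hypothetical, and it assumes features are stored in a dictionary keyed by image ID (the filename without its extension). It uses the Xception model from Keras Applications with the classification head removed.

```python
import os


def image_id(filename):
    """Strip directory and extension: 'Images/abc.jpg' -> 'abc'."""
    return os.path.splitext(os.path.basename(filename))[0]


def extract_features(image_dir):
    """Return {image_id: 2048-d feature vector} for every file in image_dir."""
    # Imported lazily so the helper above works without TensorFlow installed.
    import numpy as np
    from tensorflow.keras.applications.xception import Xception, preprocess_input
    from tensorflow.keras.utils import img_to_array, load_img

    # include_top=False drops the ImageNet classifier; pooling="avg" reduces
    # the final feature map to a single 2048-dimensional vector per image.
    model = Xception(weights="imagenet", include_top=False, pooling="avg")

    features = {}
    for name in sorted(os.listdir(image_dir)):
        # Xception expects 299x299 RGB inputs scaled to [-1, 1].
        img = load_img(os.path.join(image_dir, name), target_size=(299, 299))
        batch = np.expand_dims(img_to_array(img), axis=0)
        batch = preprocess_input(batch)
        features[image_id(name)] = model.predict(batch, verbose=0)[0]
    return features
```

In older TensorFlow releases `load_img` and `img_to_array` live under `tensorflow.keras.preprocessing.image` instead of `tensorflow.keras.utils`.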
This project uses the Flickr8k dataset, which includes:
- 8,000 images with 5 captions per image.
Dataset/
├── Images/ # Folder containing image files
└── captions.txt # Text file with image-caption mappings
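As a rough sketch of how `captions.txt` can be turned into an image-to-captions mapping: the example below assumes each line has the form `filename,caption` with an `image,caption` header row, which is how one common distribution of Flickr8k is laid out. If your copy uses a different separator, adjust the split accordingly. The function name `load_captions` and the sample captions are made up for illustration.

```python
def load_captions(lines):
    """Group captions by image ID from 'filename,caption' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or "," not in line:
            continue
        # Split on the first comma only, since captions may contain commas.
        filename, caption = line.split(",", 1)
        if filename == "image":  # skip the assumed header row
            continue
        image_id = filename.rsplit(".", 1)[0]
        mapping.setdefault(image_id, []).append(caption.strip())
    return mapping


# Invented sample lines in the assumed format.
sample = [
    "image,caption",
    "1096165011_cc5eb16aa6.jpg,A boy smiles underwater .",
    "1096165011_cc5eb16aa6.jpg,A boy, wearing goggles, swims .",
]
mapping = load_captions(sample)
print(mapping)

# With the real file:
# with open("Dataset/captions.txt", encoding="utf-8") as f:
#     mapping = load_captions(f)
```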
- Python 3.7+
- Pip package manager
Install the required Python packages:
pip install -r requirements.txt
- Place the dataset under the Dataset/ folder as described in the dataset structure.
- Ensure captions.txt contains the mappings of image filenames to their captions.
Run the script to preprocess the data, extract features, and train the model:
python Untitled-1.py
- Default parameters:
  - Epochs: 13
  - Batch size: 32
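The training script itself is not reproduced here. One common way to train an LSTM caption decoder, shown below purely as an assumption about how such a model is usually fed, is to expand each encoded caption into (prefix, next word) pairs so the network learns to predict the next word from the words so far. The function name, the token IDs, and the `startseq`/`endseq` markers are hypothetical.

```python
def make_training_pairs(token_ids, max_length):
    """Expand one encoded caption into (padded prefix, next token) pairs."""
    pairs = []
    for i in range(1, len(token_ids)):
        # Keep at most max_length trailing tokens of the prefix.
        prefix = token_ids[:i][-max_length:]
        # Left-pad with 0 so every prefix has the same length.
        padded = [0] * (max_length - len(prefix)) + prefix
        pairs.append((padded, token_ids[i]))
    return pairs


# Hypothetical encoding of "startseq boy smiling underwater endseq".
encoded = [1, 4, 7, 9, 2]
pairs = make_training_pairs(encoded, max_length=6)
for prefix, target in pairs:
    print(prefix, "->", target)
```

The left padding mirrors the default behaviour of Keras `pad_sequences`, which pads and truncates at the start of each sequence.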
Use the generate_caption function in the script to predict captions for images:
lst, pred = generate_caption(model, "1096165011_cc5eb16aa6", image_directory, mapping, featuresx, tokenizer, max_length)
Evaluate the performance using BLEU scores:
from nltk.translate.bleu_score import corpus_bleu
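# actual: one list of tokenized reference captions per image
# predicted: one tokenized generated caption per image, in the same order
actual, predicted = list(), list()
# Fill both lists over the test images before scoring.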
bleu_score = corpus_bleu(actual, predicted)
print(f"BLEU Score: {bleu_score}")
- Generated Caption: "A boys is smiling underwater."
- Achieved BLEU score: 0.066
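The BLEU snippet above scores two lists that must be filled first. The sketch below shows one way to build them, assuming `mapping` is a dictionary from image ID to its reference captions and each prediction is a plain string. Both are assumptions about the script's data structures, and the helper names and sample captions are made up for illustration.

```python
def build_bleu_inputs(mapping, predictions):
    """Return (actual, predicted) in the shape corpus_bleu expects."""
    actual, predicted = [], []
    for image_id, caption in predictions.items():
        # All reference captions for one image, each split into tokens.
        actual.append([ref.split() for ref in mapping[image_id]])
        # The single generated caption for the same image, as tokens.
        predicted.append(caption.split())
    return actual, predicted


def score(actual, predicted):
    """Corpus-level BLEU; nltk is imported lazily so it is only needed here."""
    from nltk.translate.bleu_score import corpus_bleu

    # Default weights average 1-gram to 4-gram precision equally.
    return corpus_bleu(actual, predicted)


# Invented example data.
mapping = {"img1": ["a boy smiles underwater", "a child swims in a pool"]}
predictions = {"img1": "a boy is smiling underwater"}
actual, predicted = build_bleu_inputs(mapping, predictions)
print(actual)
print(predicted)
```

Passing `weights=(1.0, 0, 0, 0)` to `corpus_bleu` gives a unigram-only score, which is often reported alongside the default for short captions.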
.
├── Untitled-1.py # Main script for training and testing
├── Dataset/ # Contains images and captions.txt
├── requirements.txt # Python dependencies
└── README.md # Project documentation
- Subjective Captions: BLEU measures n-gram overlap with the reference captions, so a valid caption that is worded differently from the references can still score low.
- Complex Scenes: The model struggles with images containing multiple objects or intricate details.
- Advanced Architectures:
  - Experiment with Vision Transformers (ViT) or GPT-based models for improved caption generation.
- Larger Datasets:
  - Incorporate datasets like COCO or Visual Genome for better generalization.
- Multilingual Captioning:
  - Extend functionality to support captions in multiple languages.
- Dataset: Flickr8k Dataset
- Xception Model: Keras Applications
- References: