Skip to content

shraddhaborah/ImageCaptionGenerator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

ImageCaptionGenerator

Image Caption Generator

A deep learning model that automatically generates descriptive captions for images using computer vision and natural language processing techniques.

Overview

This project implements an image captioning system that combines Convolutional Neural Networks (CNN) for image feature extraction and Recurrent Neural Networks (RNN/LSTM) for text generation. The model analyzes visual content and produces coherent, contextually relevant captions describing what it sees in the image.

Features

  • Automatic Caption Generation: Generate descriptive captions for any input image
  • Deep Learning Architecture: Combines CNN and LSTM/RNN models for optimal performance
  • Pre-trained Models: Utilizes pre-trained CNN models (VGG16/ResNet) for feature extraction
  • Flexible Input: Supports various image formats (JPEG, PNG, etc.)
  • Customizable: Easy to fine-tune and adapt for specific use cases

Architecture

The model follows an encoder-decoder architecture:

  1. Encoder (CNN): Extracts visual features from input images using pre-trained convolutional networks
  2. Decoder (LSTM): Generates captions word by word using the extracted image features
  3. Attention Mechanism (if implemented): Focuses on relevant parts of the image while generating each word

Requirements

tensorflow>=2.0.0
keras>=2.0.0
numpy>=1.19.0
pandas>=1.1.0
matplotlib>=3.3.0
pillow>=8.0.0
opencv-python>=4.5.0
jupyter>=1.0.0

Installation

  1. Clone the repository:
git clone https://github.com/shraddhaborah/ImageCaptionGenerator.git
cd ImageCaptionGenerator
  1. Install required dependencies:
pip install -r requirements.txt
  1. Download the dataset (if training from scratch):
    • Flickr8k dataset
    • MS COCO dataset
    • Or any custom image-caption dataset

Usage

Running the Notebook

  1. Open the Jupyter notebook:
jupyter notebook Image_Caption_Generator.ipynb
  1. Follow the notebook cells sequentially:
    • Data preprocessing
    • Model architecture setup
    • Training (if applicable)
    • Caption generation for test images

Generating Captions

# Load the trained model
model = load_model('path/to/your/model.h5')

# Generate caption for an image
image_path = 'path/to/your/image.jpg'
caption = generate_caption(model, image_path)
print(f"Generated Caption: {caption}")

Dataset

The model can be trained on various datasets:

  • Flickr8k: 8,000 images with 5 captions each
  • Flickr30k: 30,000 images with 5 captions each
  • MS COCO: Large-scale dataset with detailed captions
  • Custom Dataset: Your own image-caption pairs

Model Performance

The model's performance is evaluated using standard metrics:

  • BLEU Score: Measures n-gram overlap between generated and reference captions
  • METEOR: Considers synonyms and word order
  • CIDEr: Consensus-based evaluation
  • ROUGE-L: Longest common subsequence based metric

Examples

Input Image

[Sample image of psyduck standing on a rock]

Generated Caption

"A brown dog is running through the grass in a park"

Project Structure

ImageCaptionGenerator/
├── Image_Caption_Generator.ipynb    # Main notebook with implementation
├── models/                          # Saved model files
├── data/                           # Dataset directory
├── utils/                          # Utility functions
├── requirements.txt                # Dependencies
└── README.md                      # Project documentation

Training

To train the model from scratch:

  1. Prepare your dataset in the required format
  2. Run the preprocessing steps in the notebook
  3. Configure model hyperparameters
  4. Execute the training cells
  5. Monitor training progress and validation metrics

Hyperparameters

  • Embedding Dimension: 300
  • LSTM Units: 512
  • Batch Size: 32
  • Learning Rate: 0.001
  • Epochs: 50

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Future Improvements

  • Implement attention mechanism for better focus
  • Add transformer-based architectures
  • Support for video captioning
  • Multi-language caption generation
  • Real-time caption generation
  • Web interface for easy testing

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Pre-trained CNN models from TensorFlow/Keras
  • Dataset providers (Flickr, MS COCO)
  • Research papers in image captioning field
  • Open source community contributions

References

  • Show and Tell: A Neural Image Caption Generator
  • Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
  • Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Contact

Author: Shraddha Borah
GitHub: @shraddhaborah

For questions or suggestions, please open an issue or contact the author directly.


This project demonstrates the power of combining computer vision and natural language processing to create meaningful descriptions of visual content.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors