Image Caption Generation is a project that automatically generates textual descriptions of images using deep learning techniques. Given an image as input, the model produces a natural language description of the image, allowing for easier interpretation and understanding of the visual content.
This project combines a Convolutional Neural Network (CNN), which extracts visual features from the input image, with a Recurrent Neural Network (RNN), which generates the textual description. The model follows the widely used encoder-decoder architecture with an attention mechanism, which has been shown to perform well on image caption generation tasks.
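As a minimal sketch of this idea in TensorFlow 2 (the layer sizes, the GRU cell, and Bahdanau-style attention are illustrative assumptions, not necessarily the exact configuration used by this repository's training code):

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """Projects pretrained CNN feature maps into the decoder's space."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embed_dim, activation="relu")

    def call(self, cnn_features):
        # cnn_features: (batch, num_regions, cnn_dim), e.g. InceptionV3
        # maps reshaped from (batch, 8, 8, 2048) to (batch, 64, 2048).
        return self.fc(cnn_features)

class BahdanauAttention(tf.keras.layers.Layer):
    """Scores each spatial image feature against the decoder state."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, num_regions, embed_dim); hidden: (batch, units)
        hidden = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)))
        weights = tf.nn.softmax(scores, axis=1)
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights

class Decoder(tf.keras.Model):
    """Generates the caption one word at a time, attending to the image."""
    def __init__(self, vocab_size, embed_dim=256, units=512):
        super().__init__()
        self.attention = BahdanauAttention(units)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, features, hidden):
        # word_ids: (batch, 1) previous word; hidden: (batch, units)
        context, _ = self.attention(features, hidden)
        x = self.embedding(word_ids)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state
```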
The project requires:
- Python 3
- TensorFlow 2
- NumPy
- Matplotlib
- Pillow
To set up and use the project:
- Clone the repository:
git clone https://github.com/psychopass-me/CodeClause_Image_Caption_Generator.git
- Install the required dependencies:
pip install -r requirements.txt
- Download the dataset: You can download a pre-processed dataset such as COCO (Common Objects in Context) or Flickr30k.
- Train the model: Run
python train.py --data_path path/to/dataset --model_path path/to/save/model
- Generate captions: Run
python generate.py --image_path path/to/image --model_path path/to/model
This project can be used with any image caption dataset. The model has been tested on the COCO and Flickr30k datasets, which are commonly used for image caption generation tasks. The dataset should include images and corresponding textual descriptions.
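As a minimal sketch of loading such a dataset (the file layout, a captions.txt with one "image_name<TAB>caption" pair per line, and the load_captions helper are assumptions for illustration; adapt them to the annotation format of COCO or Flickr30k):

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

def load_captions(captions_file: str) -> Dict[str, List[str]]:
    """Map each image filename to its list of reference captions."""
    captions = defaultdict(list)
    for line in Path(captions_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        image_name, caption = line.split("\t", maxsplit=1)
        # Wrap every caption with start/end tokens for the decoder.
        captions[image_name].append(f"<start> {caption.strip().lower()} <end>")
    return dict(captions)
```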
To train the model, run python train.py with the following arguments:
- --data_path: Path to the directory containing the pre-processed dataset.
- --model_path: Path to save the trained model.
- --epochs: Number of epochs to train for.
- --batch_size: Batch size for training.
- --learning_rate: Learning rate for training.
- --save_every: Number of epochs between saving the model.
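For example (the numeric values here are illustrative, not project defaults):
python train.py --data_path path/to/dataset --model_path path/to/save/model --epochs 20 --batch_size 64 --learning_rate 0.001 --save_every 5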
Inference

To generate captions for an image, run python generate.py with the following arguments:
- --image_path: Path to the image for which to generate a caption.
- --model_path: Path to the trained model.
- --max_length: Maximum length of the generated caption.
- --beam_size: Beam size for beam search decoding.
- --temperature: Temperature for sampling words during decoding.
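The --temperature flag controls how the next word is drawn from the decoder's predicted distribution. As a minimal sketch of temperature sampling (the vocabulary, probabilities, and sample_with_temperature helper below are illustrative, not part of generate.py):

```python
import numpy as np

def sample_with_temperature(probs: np.ndarray, temperature: float) -> int:
    """Sample a token index after rescaling the distribution by temperature.

    temperature < 1.0 sharpens the distribution (closer to greedy),
    temperature > 1.0 flattens it (more diverse captions).
    """
    logits = np.log(probs + 1e-9) / temperature
    scaled = np.exp(logits - np.max(logits))  # numerically stable softmax
    scaled /= scaled.sum()
    return int(np.random.choice(len(probs), p=scaled))

# Illustrative next-word distribution over a 4-word vocabulary.
vocab = ["a", "dog", "runs", "<end>"]
probs = np.array([0.1, 0.6, 0.2, 0.1])

np.random.seed(0)
print(vocab[sample_with_temperature(probs, temperature=0.5)])  # usually "dog"
print(vocab[sample_with_temperature(probs, temperature=2.0)])  # more varied
```

Beam search (--beam_size) is the deterministic alternative: at each step it keeps the k highest-scoring partial captions rather than sampling a single word.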
To evaluate the performance of the model, metrics such as BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and METEOR (Metric for Evaluation of Translation with Explicit ORdering) can be used. The nltk library in Python provides implementations of BLEU and METEOR; ROUGE is available from separate packages such as rouge-score.
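As a minimal sketch of scoring a generated caption against its references with NLTK's BLEU implementation (the captions below are illustrative only):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the grass".split(),
    "a brown dog is running on a lawn".split(),
]
candidate = "a dog is running on the grass".split()

# Smoothing avoids zero scores when a higher-order n-gram has no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```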
This project is licensed under the MIT License. See the LICENSE file for more information.