SmartImageCaptioning 🖼️→📝

Automatic image captioning using CNN-RNN architecture

Introduction

FrameToPhrase is a deep learning project that automatically generates natural language descriptions for images. Using a combination of Convolutional Neural Networks (CNN) for image feature extraction and Recurrent Neural Networks (RNN) for sequence generation, the model learns to "describe what it sees" by training on image-caption pairs from the COCO dataset. This encoder-decoder architecture bridges computer vision and natural language processing to create meaningful captions that describe the content of images.

Demo

Prediction from Model: a woman is playing tennis on a tennis court.

Prediction from Model: a baseball player holding a bat on a field.

Prediction from Model: a cat sitting on a window sill looking out a window.

Technologies

  • Python 3.x - Core programming language
  • PyTorch - Deep learning framework
  • torchvision - Pre-trained models and image transformations
  • ResNet-50 - Pre-trained CNN for image feature extraction
  • LSTM - Recurrent neural network for caption generation
  • COCO API (pycocotools) - MS COCO dataset interface
  • NLTK - Natural Language Toolkit for text tokenization
  • NumPy - Numerical computing
  • Matplotlib - Visualization

Prerequisites

System Requirements

  • opencv-python>=3.2.0.6
  • matplotlib>=2.1.1
  • pandas>=0.22.0
  • numpy>=1.12.1
  • pillow>=5.0.0
  • scipy>=1.0.0
  • nltk>=3.2.2
  • tqdm>=4.19.4
  • scikit-learn>=0.19.1
  • scikit-image>=0.13.1
  • seaborn>=0.8.1
  • torch>=0.4.0
  • torchvision>=0.2.0

Dataset

Download the MS COCO 2014 dataset and place it under the Datasets/ directory (see Project Structure below).

Installation

  1. Install the required packages:

```shell
pip install torch torchvision
pip install pycocotools
pip install nltk
pip install numpy matplotlib
```

  2. Download the NLTK tokenizer data:

```python
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
```

Project Structure

```
FrameToPhrase/
├── Datasets/                   # COCO dataset location
├── models/                     # Saved model checkpoints
├── logs/                       # Training logs
├── model.py                    # CNN-RNN architecture
├── Vocabulary.py               # Vocabulary builder
├── data_loader.py              # Custom data loader
├── Preliminaries.py            # Data preparation
├── Training.py                 # Training script
└── Prediction.py               # Inference script
```

Features

  • Transfer Learning: Leverages pre-trained ResNet-50 on ImageNet for robust image feature extraction
  • Encoder-Decoder Architecture: CNN encoder extracts visual features, LSTM decoder generates sequential text
  • Vocabulary Building: Automatic vocabulary construction with customizable word frequency thresholds
  • Batch Training: Efficient training with dynamic batch sampling based on caption lengths
  • Inference Pipeline: Complete prediction workflow for generating captions on new images
  • Model Checkpointing: Saves model weights at configurable intervals during training
  • Training Monitoring: Tracks loss and perplexity metrics with logging to file

The Process

1. Data Preparation

  • Vocabulary Building: The Vocabulary class processes all training captions, tokenizing text with NLTK and building word-to-index mappings. Only words appearing at least 5 times are included to reduce vocabulary size
  • Special tokens (<start>, <end>, <unk>) are added to handle sentence boundaries and unknown words
  • Vocabulary is saved to vocab.pkl for reuse across training and inference
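
The thresholded vocabulary described above can be sketched as follows. This is a minimal, dependency-free illustration: it uses plain whitespace tokenization in place of NLTK's `word_tokenize`, and the method names are illustrative rather than copied from Vocabulary.py:

```python
from collections import Counter

class Vocabulary:
    """Minimal word-to-index mapping with a frequency threshold."""

    def __init__(self, captions, threshold=5):
        counter = Counter()
        for caption in captions:
            # The project tokenizes with NLTK; split() keeps this sketch dependency-free.
            counter.update(caption.lower().split())
        # Special tokens occupy the first indices.
        self.word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2}
        for word, count in counter.items():
            if count >= threshold:
                self.word2idx.setdefault(word, len(self.word2idx))
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def __call__(self, word):
        # Out-of-vocabulary words map to <unk>.
        return self.word2idx.get(word, self.word2idx["<unk>"])

    def __len__(self):
        return len(self.word2idx)
```

A word seen fewer than `threshold` times is simply never assigned an index, so at lookup time it falls through to `<unk>`.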

2. Model Architecture

  • EncoderCNN: Uses a pre-trained ResNet-50 with frozen weights, removing the final classification layer and replacing it with a custom embedding layer that outputs fixed-size feature vectors
  • DecoderRNN: LSTM-based decoder that takes image features and generates captions word-by-word using teacher forcing during training
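
A condensed sketch of the two modules. A tiny pooling stand-in replaces the pretrained ResNet-50 backbone (whose real feature dimension is 2048), and the class and argument names follow the description above rather than the exact contents of model.py:

```python
import torch
import torch.nn as nn

class EncoderCNN(nn.Module):
    """Frozen CNN backbone followed by a trainable linear embedding layer."""

    def __init__(self, embed_size):
        super().__init__()
        # Stand-in for torchvision's pretrained ResNet-50 minus its fc layer.
        self.backbone = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
        for p in self.backbone.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.embed = nn.Linear(3, embed_size)  # 2048 for real ResNet-50 features

    def forward(self, images):
        return self.embed(self.backbone(images))

class DecoderRNN(nn.Module):
    """LSTM decoder: image features are the first time step, words follow."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: prepend image features, feed ground-truth words.
        inputs = torch.cat(
            [features.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)              # (batch, seq_len, vocab_size)
```

Because the image features occupy the first LSTM step, a caption of length T produces exactly T output distributions, one per target word.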

3. Training Pipeline

  • Images are preprocessed with random cropping, horizontal flipping, and ImageNet normalization
  • The model trains for 3 epochs with batch size of 128
  • Uses Adam optimizer with learning rate of 0.001
  • Cross-entropy loss measures the difference between predicted and actual captions
  • Only the decoder and encoder's embedding layer are trained; ResNet backbone remains frozen
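
One training step as described above can be sketched like this. The stand-in layers and random tensors are illustrative, not the project's actual model.py classes; what matters is that only the decoder and the encoder's embedding layer appear in the optimizer:

```python
import torch
import torch.nn as nn

embed_size, hidden_size, vocab_size = 256, 512, 100
encoder_embed = nn.Linear(2048, embed_size)   # trainable part of the encoder
embedding = nn.Embedding(vocab_size, embed_size)
decoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
classifier = nn.Linear(hidden_size, vocab_size)

# Only decoder-side parameters and the encoder's embedding layer are optimized;
# the frozen ResNet backbone contributes no parameters here.
params = (list(decoder.parameters()) + list(classifier.parameters())
          + list(embedding.parameters()) + list(encoder_embed.parameters()))
optimizer = torch.optim.Adam(params, lr=0.001)
criterion = nn.CrossEntropyLoss()

# Random tensors standing in for one COCO batch.
resnet_features = torch.randn(8, 2048)            # frozen backbone output
captions = torch.randint(0, vocab_size, (8, 15))  # token indices

# Teacher forcing: image features first, then ground-truth words.
inputs = torch.cat(
    [encoder_embed(resnet_features).unsqueeze(1),
     embedding(captions[:, :-1])], dim=1)
hidden, _ = decoder(inputs)
loss = criterion(classifier(hidden).reshape(-1, vocab_size),
                 captions.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
perplexity = torch.exp(loss)   # the metric logged alongside the loss
```

Perplexity is just the exponential of the cross-entropy loss, which is why the two can be tracked from the same scalar.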

4. Inference

  • Test images are center-cropped to 224x224 and normalized
  • The encoder extracts features, and the decoder generates captions autoregressively
  • Sampling continues until an end token is generated or maximum length is reached
  • Post-processing removes special tokens and formats the output as readable sentences
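
The greedy autoregressive sampling loop can be sketched as follows, with a tiny untrained decoder standing in for the checkpointed model (`end_token` and `max_len` are assumed names, and the `<end>` index matches the vocabulary sketch's convention):

```python
import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size, max_len = 20, 16, 32, 10
end_token = 1                                  # index of <end> in the vocabulary
embedding = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
classifier = nn.Linear(hidden_size, vocab_size)

def sample(features):
    """Greedy decoding: feed each predicted word back in as the next input."""
    tokens, states = [], None
    inputs = features.unsqueeze(1)             # (1, 1, embed_size)
    for _ in range(max_len):
        hidden, states = lstm(inputs, states)
        word = classifier(hidden.squeeze(1)).argmax(dim=1)  # most likely word
        tokens.append(word.item())
        if word.item() == end_token:           # stop once <end> is produced
            break
        inputs = embedding(word).unsqueeze(1)  # prediction becomes next input
    return tokens

caption_ids = sample(torch.randn(1, embed_size))
```

Mapping `caption_ids` back through `idx2word` and stripping the special tokens yields the readable sentence shown in the demo section.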

Usage

Training the Model

```shell
python Training.py
```

This will:

  • Build vocabulary from COCO captions (or load existing vocab.pkl)
  • Train the CNN-RNN model for 3 epochs
  • Save model checkpoints in ./models/ directory
  • Log training metrics to ./logs/training_log.txt

Generating Captions

```shell
python Prediction.py
```

This will:

  • Load trained encoder and decoder models
  • Process test images
  • Generate and display captions for sample images

Configuration

Modify hyperparameters in Training.py:

```python
batch_size = 128          # Batch size
vocab_threshold = 5       # Minimum word frequency
embed_size = 256          # Embedding dimensions
hidden_size = 512         # LSTM hidden units
num_epochs = 3            # Training epochs
```

What I Learned

  • Sequence-to-Sequence Models: Gained hands-on experience with encoder-decoder architectures for mapping images to text sequences
  • Transfer Learning: Learned how to effectively leverage pre-trained models and fine-tune only specific layers for a new task
  • LSTM Mechanics: Deepened understanding of recurrent networks, hidden states, and how they maintain temporal dependencies
  • PyTorch Best Practices: Mastered model initialization, device management (CPU/GPU), gradient handling, and state dictionary operations
  • Data Pipeline Design: Implemented custom data loading strategies with variable-length sequences and dynamic batch sampling
  • Training Monitoring: Developed skills in tracking metrics like perplexity and implementing checkpointing for long training runs
  • Caption Generation: Understood the difference between teacher forcing during training and greedy decoding during inference
