Automatic image captioning using a CNN-RNN architecture
FrameToPhrase is a deep learning project that automatically generates natural language descriptions for images. It combines a Convolutional Neural Network (CNN) for image feature extraction with a Recurrent Neural Network (RNN) for sequence generation, learning to "describe what it sees" from image-caption pairs in the COCO dataset. This encoder-decoder architecture bridges computer vision and natural language processing to produce captions that describe the content of an image.
Prediction from model: a woman is playing tennis on a tennis court.
Prediction from model: a baseball player holding a bat on a field.
Prediction from model: a cat sitting on a window sill looking out a window.
- Python 3.x - Core programming language
- PyTorch - Deep learning framework
- torchvision - Pre-trained models and image transformations
- ResNet-50 - Pre-trained CNN for image feature extraction
- LSTM - Recurrent neural network for caption generation
- COCO API (pycocotools) - MS COCO dataset interface
- NLTK - Natural Language Toolkit for text tokenization
- NumPy - Numerical computing
- Matplotlib - Visualization
- python>=3.6
- matplotlib>=2.1.1
- pandas>=0.22.0
- numpy>=1.12.1
- pillow>=5.0.0
- scipy>=1.0.0
- nltk>=3.2.2
- tqdm>=4.19.4
- scikit-learn>=0.19.1
- scikit-image>=0.13.1
- seaborn>=0.8.1
- torch>=0.4.0
- torchvision>=0.2.0
Download the MS COCO 2014 Dataset:
- Training images: train2014.zip
- Training annotations: annotations_trainval2014.zip
- Install required packages:

```bash
pip install torch torchvision
pip install pycocotools
pip install nltk
pip install numpy matplotlib
```

- Download NLTK data:

```python
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
```

Project structure:

```
FrameToPhrase/
├── Datasets/        # COCO dataset location
├── models/          # Saved model checkpoints
├── logs/            # Training logs
├── model.py         # CNN-RNN architecture
├── Vocabulary.py    # Vocabulary builder
├── data_loader.py   # Custom data loader
├── Preliminaries.py # Data preparation
├── Training.py      # Training script
└── Prediction.py    # Inference script
```
- Transfer Learning: Leverages pre-trained ResNet-50 on ImageNet for robust image feature extraction
- Encoder-Decoder Architecture: CNN encoder extracts visual features, LSTM decoder generates sequential text
- Vocabulary Building: Automatic vocabulary construction with customizable word frequency thresholds
- Batch Training: Efficient training with dynamic batch sampling based on caption lengths
- Inference Pipeline: Complete prediction workflow for generating captions on new images
- Model Checkpointing: Saves model weights at configurable intervals during training
- Training Monitoring: Tracks loss and perplexity metrics with logging to file
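The "Batch Training" feature above mentions dynamic batch sampling based on caption lengths. A minimal sketch of that idea follows; the function name and exact sampling strategy are assumptions for illustration, not the project's actual `data_loader.py` API. The key point is that each batch contains only captions of a single tokenized length, so no padding is needed:

```python
import random
from collections import defaultdict

def length_based_batches(caption_lengths, batch_size):
    """Yield batches of dataset indices whose captions share one length.

    caption_lengths: list where entry i is the token count of caption i.
    (Hypothetical helper, sketched from the project's description.)
    """
    # Group caption indices by their tokenized length.
    by_len = defaultdict(list)
    for idx, n in enumerate(caption_lengths):
        by_len[n].append(idx)
    lengths = list(by_len)
    while True:
        # Pick a caption length at random, then sample indices (with
        # replacement) from the captions of that length.
        n = random.choice(lengths)
        yield random.choices(by_len[n], k=batch_size)
```

Because every caption in a batch has the same length, the captions stack directly into a single tensor without padding or masking.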
- Vocabulary Building: The `Vocabulary` class processes all training captions, tokenizing text with NLTK and building word-to-index mappings. Only words appearing at least 5 times are included, keeping the vocabulary compact.
- Special Tokens: `<start>`, `<end>`, and `<unk>` are added to handle sentence boundaries and unknown words.
- Persistence: The vocabulary is saved to `vocab.pkl` for reuse across training and inference.
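The vocabulary logic above can be sketched as follows. The class name matches `Vocabulary.py`, but the method names are assumptions, and plain `.split()` stands in for NLTK's tokenizer so the example is self-contained:

```python
from collections import Counter

class Vocabulary:
    """Sketch of a caption vocabulary (the real class tokenizes with NLTK)."""

    def __init__(self, captions, threshold=5):
        # Count word frequencies across all training captions.
        counter = Counter()
        for caption in captions:
            counter.update(caption.lower().split())
        # Special tokens first, then words meeting the frequency threshold.
        words = ["<start>", "<end>", "<unk>"]
        words += [w for w, c in counter.items() if c >= threshold]
        self.word2idx = {w: i for i, w in enumerate(words)}
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def __call__(self, word):
        # Unknown words map to the <unk> token.
        return self.word2idx.get(word, self.word2idx["<unk>"])

    def __len__(self):
        return len(self.word2idx)
```

In the project, an instance like this is pickled to `vocab.pkl` so training and inference share identical word-to-index mappings.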
- EncoderCNN: Uses a pre-trained ResNet-50 with frozen weights, removing the final classification layer and replacing it with a custom embedding layer that outputs fixed-size feature vectors
- DecoderRNN: LSTM-based decoder that takes image features and generates captions word-by-word using teacher forcing during training
- Images are preprocessed with random cropping, horizontal flipping, and ImageNet normalization
- The model trains for 3 epochs with a batch size of 128
- Uses the Adam optimizer with a learning rate of 0.001
- Cross-entropy loss measures the difference between predicted and actual captions
- Only the decoder and encoder's embedding layer are trained; ResNet backbone remains frozen
- Test images are center-cropped to 224x224 and normalized
- The encoder extracts features, and the decoder generates captions autoregressively
- Sampling continues until an end token is generated or maximum length is reached
- Post-processing removes special tokens and formats the output as readable sentences
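The autoregressive sampling loop described above can be sketched like this, assuming a decoder with `embed`, `lstm`, and `linear` submodules and a vocabulary exposing an `idx2word` mapping (these names are assumptions, not the project's exact API):

```python
import torch

def sample(decoder, features, vocab, max_len=20):
    """Greedy (argmax) decoding sketch: feed each predicted word back in."""
    inputs = features.unsqueeze(1)      # (1, 1, embed_size): image feature
    states = None
    words = []
    for _ in range(max_len):
        hiddens, states = decoder.lstm(inputs, states)
        scores = decoder.linear(hiddens.squeeze(1))
        predicted = scores.argmax(dim=1)             # greedy word choice
        word = vocab.idx2word[predicted.item()]
        if word == "<end>":                          # stop at the end token
            break
        if word != "<start>":                        # drop special tokens
            words.append(word)
        # The predicted word becomes the next LSTM input.
        inputs = decoder.embed(predicted).unsqueeze(1)
    return " ".join(words)
```

Greedy decoding picks the single highest-scoring word at each step; the same loop could be extended to beam search by tracking several candidate sequences.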
```bash
python Training.py
```

This will:
- Build the vocabulary from COCO captions (or load an existing `vocab.pkl`)
- Train the CNN-RNN model for 3 epochs
- Save model checkpoints in the `./models/` directory
- Log training metrics to `./logs/training_log.txt`
```bash
python Prediction.py
```

This will:
- Load the trained encoder and decoder models
- Process test images
- Generate and display captions for sample images
Modify hyperparameters in `Training.py`:

```python
batch_size = 128      # Batch size
vocab_threshold = 5   # Minimum word frequency
embed_size = 256      # Embedding dimensions
hidden_size = 512     # LSTM hidden units
num_epochs = 3        # Training epochs
```

- Sequence-to-Sequence Models: Gained hands-on experience with encoder-decoder architectures for mapping images to text sequences
- Transfer Learning: Learned how to effectively leverage pre-trained models and fine-tune only specific layers for a new task
- LSTM Mechanics: Deepened understanding of recurrent networks, hidden states, and how they maintain temporal dependencies
- PyTorch Best Practices: Mastered model initialization, device management (CPU/GPU), gradient handling, and state dictionary operations
- Data Pipeline Design: Implemented custom data loading strategies with variable-length sequences and dynamic batch sampling
- Training Monitoring: Developed skills in tracking metrics like perplexity and implementing checkpointing for long training runs
- Caption Generation: Understood the difference between teacher forcing during training and greedy decoding during inference