Automatic image captioning using a CNN-RNN architecture
FrameToPhrase is a deep learning project that automatically generates natural language descriptions for images. It combines a Convolutional Neural Network (CNN) for image feature extraction with a Recurrent Neural Network (RNN) for sequence generation, learning to "describe what it sees" from image-caption pairs in the COCO dataset. This encoder-decoder architecture bridges computer vision and natural language processing to produce captions that describe the content of an image.
Prediction from model: a woman is playing tennis on a tennis court.
Prediction from model: a baseball player holding a bat on a field.
Prediction from model: a cat sitting on a window sill looking out a window.
- Python 3.x - Core programming language
- PyTorch - Deep learning framework
- torchvision - Pre-trained models and image transformations
- ResNet-50 - Pre-trained CNN for image feature extraction
- LSTM - Recurrent neural network for caption generation
- COCO API (pycocotools) - MS COCO dataset interface
- NLTK - Natural Language Toolkit for text tokenization
- NumPy - Numerical computing
- Matplotlib - Visualization
- python>=3.6
- matplotlib>=2.1.1
- pandas>=0.22.0
- numpy>=1.12.1
- pillow>=5.0.0
- scipy>=1.0.0
- nltk>=3.2.2
- tqdm>=4.19.4
- scikit-learn>=0.19.1
- scikit-image>=0.13.1
- seaborn>=0.8.1
- torch>=0.4.0
- torchvision>=0.2.0
Download the MS COCO 2014 Dataset:
- Training images: train2014.zip
- Training annotations: annotations_trainval2014.zip
- Install required packages:

```bash
pip install torch torchvision
pip install pycocotools
pip install nltk
pip install numpy matplotlib
```

- Download NLTK data:

```python
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
```

Project structure:

```
FrameToPhrase/
├── Datasets/        # COCO dataset location
├── models/          # Saved model checkpoints
├── logs/            # Training logs
├── model.py         # CNN-RNN architecture
├── Vocabulary.py    # Vocabulary builder
├── data_loader.py   # Custom data loader
├── Preliminaries.py # Data preparation
├── Training.py      # Training script
└── Prediction.py    # Inference script
```
- Transfer Learning: Leverages pre-trained ResNet-50 on ImageNet for robust image feature extraction
- Encoder-Decoder Architecture: CNN encoder extracts visual features, LSTM decoder generates sequential text
- Vocabulary Building: Automatic vocabulary construction with customizable word frequency thresholds
- Batch Training: Efficient training with dynamic batch sampling based on caption lengths
- Inference Pipeline: Complete prediction workflow for generating captions on new images
- Model Checkpointing: Saves model weights at configurable intervals during training
- Training Monitoring: Tracks loss and perplexity metrics with logging to file
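The "Batch Training" feature above mentions dynamic batch sampling based on caption lengths. A minimal sketch of that idea follows; the function name and exact sampling strategy are assumptions for illustration, not the project's actual `data_loader.py` API. The key point is that each batch contains only captions of a single tokenized length, so no padding is needed:

```python
import random
from collections import defaultdict

def length_based_batches(caption_lengths, batch_size):
    """Yield batches of dataset indices whose captions share one length.

    caption_lengths: list where entry i is the token count of caption i.
    (Hypothetical helper, sketched from the project's description.)
    """
    # Group caption indices by their tokenized length.
    by_len = defaultdict(list)
    for idx, n in enumerate(caption_lengths):
        by_len[n].append(idx)
    lengths = list(by_len)
    while True:
        # Pick a caption length at random, then sample indices (with
        # replacement) from the captions of that length.
        n = random.choice(lengths)
        yield random.choices(by_len[n], k=batch_size)
```

Because every caption in a batch has the same length, the captions stack directly into a single tensor without padding or masking.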
- Vocabulary Building: The `Vocabulary` class processes all training captions, tokenizing text with NLTK and building word-to-index mappings. Only words appearing at least 5 times are included, keeping the vocabulary compact.
- Special Tokens: `<start>`, `<end>`, and `<unk>` are added to handle sentence boundaries and unknown words.
- Persistence: The vocabulary is saved to `vocab.pkl` for reuse across training and inference.
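The vocabulary logic above can be sketched as follows. The class name matches `Vocabulary.py`, but the method names are assumptions, and plain `.split()` stands in for NLTK's tokenizer so the example is self-contained:

```python
from collections import Counter

class Vocabulary:
    """Sketch of a caption vocabulary (the real class tokenizes with NLTK)."""

    def __init__(self, captions, threshold=5):
        # Count word frequencies across all training captions.
        counter = Counter()
        for caption in captions:
            counter.update(caption.lower().split())
        # Special tokens first, then words meeting the frequency threshold.
        words = ["<start>", "<end>", "<unk>"]
        words += [w for w, c in counter.items() if c >= threshold]
        self.word2idx = {w: i for i, w in enumerate(words)}
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def __call__(self, word):
        # Unknown words map to the <unk> token.
        return self.word2idx.get(word, self.word2idx["<unk>"])

    def __len__(self):
        return len(self.word2idx)
```

In the project, an instance like this is pickled to `vocab.pkl` so training and inference share identical word-to-index mappings.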
- EncoderCNN: Uses a pre-trained ResNet-50 with frozen weights, removing the final classification layer and replacing it with a custom embedding layer that outputs fixed-size feature vectors
- DecoderRNN: LSTM-based decoder that takes image features and generates captions word-by-word using teacher forcing during training
- Images are preprocessed with random cropping, horizontal flipping, and ImageNet normalization
- The model trains for 3 epochs with a batch size of 128
- Uses the Adam optimizer with a learning rate of 0.001
- Cross-entropy loss measures the difference between predicted and actual captions
- Only the decoder and encoder's embedding layer are trained; ResNet backbone remains frozen
- Test images are center-cropped to 224x224 and normalized
- The encoder extracts features, and the decoder generates captions autoregressively
- Sampling continues until an end token is generated or maximum length is reached
- Post-processing removes special tokens and formats the output as readable sentences
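The autoregressive sampling loop described above can be sketched like this, assuming a decoder with `embed`, `lstm`, and `linear` submodules and a vocabulary exposing an `idx2word` mapping (these names are assumptions, not the project's exact API):

```python
import torch

def sample(decoder, features, vocab, max_len=20):
    """Greedy (argmax) decoding sketch: feed each predicted word back in."""
    inputs = features.unsqueeze(1)      # (1, 1, embed_size): image feature
    states = None
    words = []
    for _ in range(max_len):
        hiddens, states = decoder.lstm(inputs, states)
        scores = decoder.linear(hiddens.squeeze(1))
        predicted = scores.argmax(dim=1)             # greedy word choice
        word = vocab.idx2word[predicted.item()]
        if word == "<end>":                          # stop at the end token
            break
        if word != "<start>":                        # drop special tokens
            words.append(word)
        # The predicted word becomes the next LSTM input.
        inputs = decoder.embed(predicted).unsqueeze(1)
    return " ".join(words)
```

Greedy decoding picks the single highest-scoring word at each step; the same loop could be extended to beam search by tracking several candidate sequences.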
```bash
python Training.py
```

This will:
- Build the vocabulary from COCO captions (or load an existing `vocab.pkl`)
- Train the CNN-RNN model for 3 epochs
- Save model checkpoints in the `./models/` directory
- Log training metrics to `./logs/training_log.txt`
```bash
python Prediction.py
```

This will:
- Load the trained encoder and decoder models
- Process test images
- Generate and display captions for sample images
Modify hyperparameters in `Training.py`:

```python
batch_size = 128      # Batch size
vocab_threshold = 5   # Minimum word frequency
embed_size = 256      # Embedding dimensions
hidden_size = 512     # LSTM hidden units
num_epochs = 3        # Training epochs
```

- Sequence-to-Sequence Models: Gained hands-on experience with encoder-decoder architectures for mapping images to text sequences
- Transfer Learning: Learned how to effectively leverage pre-trained models and fine-tune only specific layers for a new task
- LSTM Mechanics: Deepened understanding of recurrent networks, hidden states, and how they maintain temporal dependencies
- PyTorch Best Practices: Mastered model initialization, device management (CPU/GPU), gradient handling, and state dictionary operations
- Data Pipeline Design: Implemented custom data loading strategies with variable-length sequences and dynamic batch sampling
- Training Monitoring: Developed skills in tracking metrics like perplexity and implementing checkpointing for long training runs
- Caption Generation: Understood the difference between teacher forcing during training and greedy decoding during inference