# Introduction
This assignment builds a multimodal deep learning model that generates natural-language descriptions of images using a Sequence-to-Sequence (Seq2Seq) architecture.

# Environment Setup
We've created a notebook on **Kaggle** and set the **Accelerator** to `GPU T4 x2 (Dual GPU)`.

For the dataset, we're using **Flickr30k**, uploaded by *adityajn105*.

Link: https://www.kaggle.com/datasets/adityajn105/flickr30k


# Part 1: Feature Extraction Pipeline

In [1]:
import os, pickle, torch, torch.nn as nn, re, pandas as pd, nltk, numpy as np
from torchvision import models, transforms
from torch.utils.data import DataLoader, Dataset
from PIL import Image
from tqdm import tqdm
from collections import Counter
from nltk.tokenize import word_tokenize

In [2]:
def find_image_dir():
    # Common Kaggle root
    base_input = '/kaggle/input'
    # Walk through the input directory to find where the images actually are
    for root, dirs, files in os.walk(base_input):
        # Look for the folder containing a high volume of jpg files
        if len([f for f in files if f.endswith('.jpg')]) > 1000:
            return root
    return None
    
IMAGE_DIR = find_image_dir()
OUTPUT_FILE = 'flickr30k_features.pkl'

if IMAGE_DIR:
    print(f" Found images at: {IMAGE_DIR}")
else:
    raise FileNotFoundError("Could not find the Flickr30k image directory. Please ensure the dataset is added to the notebook.")


# --- THE DATASET CLASS ---
class FlickrDataset(Dataset):
    def __init__(self, img_dir, transform):
        self.img_names = [f for f in os.listdir(img_dir) if f.endswith(('.jpg', '.jpeg'))]
        self.transform = transform
        self.img_dir = img_dir
    
    def __len__(self):
        return len(self.img_names)
    def __getitem__(self, idx):
        name = self.img_names[idx]
        img_path = os.path.join(self.img_dir, name)
        img = Image.open(img_path).convert('RGB')
        return self.transform(img), name


# --- MODEL, TRANSFORM, AND FEATURE EXTRACTION ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model = nn.Sequential(*list(model.children())[:-1]) # Feature vector only
model = nn.DataParallel(model).to(device)
model.eval()

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])


dataset = FlickrDataset(IMAGE_DIR, transform)
loader = DataLoader(dataset, batch_size=128, num_workers=4)
features_dict = {}


with torch.no_grad():
    for imgs, names in tqdm(loader, desc="Extracting Features"):
        feats = model(imgs.to(device)).view(imgs.size(0), -1)
        for i, name in enumerate(names):
            features_dict[name] = feats[i].cpu().numpy()


with open(OUTPUT_FILE, 'wb') as f:
    pickle.dump(features_dict, f)

print(f"Success! {len(features_dict)} images processed and saved to {OUTPUT_FILE}")

 Found images at: /kaggle/input/flickr30k/Images
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth


100%|██████████| 97.8M/97.8M [00:00<00:00, 176MB/s]
Extracting Features: 100%|██████████| 249/249 [01:53<00:00,  2.20it/s]


Success! 31783 images processed and saved to flickr30k_features.pkl
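Before moving on, it can help to confirm that the saved features have the expected shape. ResNet-50 with its final classification layer removed produces a 2048-dimensional vector per image, so every entry in the pickle should have that length. The snippet below is a minimal sketch of such a check; the helper name `check_features` is hypothetical and not part of the notebook, and the file is only read if it exists.

```python
import os
import pickle

def check_features(features, expected_dim=2048):
    """Return the names of entries whose vector length differs from expected_dim.

    `check_features` is an illustrative helper, not part of the original pipeline.
    """
    return [name for name, vec in features.items() if len(vec) != expected_dim]

# Only runs when the file written by the extraction cell is present.
if os.path.exists('flickr30k_features.pkl'):
    with open('flickr30k_features.pkl', 'rb') as f:
        features = pickle.load(f)
    bad = check_features(features)
    print(f"{len(features)} vectors loaded, {len(bad)} with unexpected size")
```

An empty list from `check_features` means every image maps to a vector of the expected size.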


# Part 2: Vocabulary & Text Pre-Processing
This part converts the text captions into fixed-length sequences of integers that the neural network can consume.

1. **Load Data** → Read captions from file into a dataframe (table)

2. **Clean Text** → Remove special characters, numbers, extra spaces, convert to lowercase

3. **Tokenize** → Split captions into individual words (tokens) on whitespace

4. **Add Special Tokens** → Add `<start>` and `<end>` markers to each caption

5. **Build Vocabulary** → 
   - Collect all words from all captions
   - Keep only words appearing ≥5 times (removes rare/noisy words)
   - Add special tokens: `<pad>`, `<start>`, `<end>`, `<unk>`

6. **Create Mappings** → 
   - word2idx: word → number (e.g., "dog" → 42)
   - idx2word: number → word (e.g., 42 → "dog")

7. **Convert to Numbers** → Transform all captions from words to indices

8. **Standardize Length** →
   - Find MAX_LENGTH (95th percentile of caption lengths)
   - Pad short captions with `<pad>` tokens
   - Truncate long captions (keep `<end>` token)

9. **Organize Data** → Map each image to its 5 captions (as number sequences)

10. **Save** → Store vocabulary and image-caption mapping in pickle files

**Output:** 
- `vocab.pkl` - Contains vocabulary and mappings
- `image_to_captions.pkl` - Contains image → captions dictionary

**Result:** Raw text → Standardized numerical sequences ready for neural network training! 🎯
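To make the steps above concrete, here is a minimal sketch that runs the clean, tokenize, build-vocabulary, encode, and pad/truncate steps on three made-up captions. The captions, the cutoff `MIN_FREQ_TOY = 2`, the length `6`, and the names `clean`, `encode`, and `toy_word2idx` are all assumptions chosen so the result is easy to follow by hand; the real cells below use the full dataset, `MIN_FREQ = 5`, and the 95th-percentile length.

```python
import re
from collections import Counter

def clean(caption):
    """Lowercase, drop everything except letters and whitespace, collapse spaces."""
    caption = caption.lower()
    caption = re.sub(r'[^a-z\s]', '', caption)
    return ' '.join(caption.split())

# Made-up captions for illustration only.
toy_captions = ["A dog runs!", "Two dogs run", "A dog runs and a dog sleeps."]
tokenized = [['<start>'] + clean(c).split() + ['<end>'] for c in toy_captions]

# Build the vocabulary: special tokens first, then words seen often enough.
counts = Counter(tok for toks in tokenized for tok in toks)
MIN_FREQ_TOY = 2  # toy cutoff, lower than the notebook's value
specials = ['<pad>', '<start>', '<end>', '<unk>']
kept = [t for t, c in counts.items() if c >= MIN_FREQ_TOY and t not in specials]
toy_vocab = specials + kept
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}

def encode(tokens, max_len):
    """Map tokens to indices, then truncate (keeping <end>) or pad to max_len."""
    ids = [toy_word2idx.get(t, toy_word2idx['<unk>']) for t in tokens]
    if len(ids) > max_len:
        ids = ids[:max_len - 1] + [ids[-1]]
    return ids + [toy_word2idx['<pad>']] * (max_len - len(ids))

encoded = [encode(t, 6) for t in tokenized]
for caption, ids in zip(toy_captions, encoded):
    print(f"{caption!r:32} -> {ids}")
```

In this toy run only `a`, `dog`, and `runs` pass the cutoff, so every other word maps to the `<unk>` index, the short captions end with a `<pad>` index, and the long caption is cut to six positions with `<end>` preserved as its last token.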

In [3]:
nltk.download('punkt')

# --- LOAD DATA ---
captions_file = '/kaggle/input/flickr30k/captions.txt'
dataframe = pd.read_csv(captions_file, sep=',')

print(f"Loaded {len(dataframe)} captions")

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Loaded 158915 captions


In [4]:
# --- CLEAN CAPTIONS ---
def clean_caption(caption):
    caption = str(caption).lower()
    caption = re.sub(r'[^a-z\s]', '', caption)
    caption = ' '.join(caption.split())
    return caption

dataframe['caption'] = dataframe['caption'].apply(clean_caption)
dataframe['tokens'] = dataframe['caption'].apply(lambda x: x.split())
dataframe['tokens'] = dataframe['tokens'].apply(lambda x: ['<start>'] + x + ['<end>'])
print("Example tokenized caption:")
print(dataframe['tokens'].iloc[0])

Example tokenized caption:
['<start>', 'two', 'young', 'guys', 'with', 'shaggy', 'hair', 'look', 'at', 'their', 'hands', 'while', 'hanging', 'out', 'in', 'the', 'yard', '<end>']


In [5]:
# --- BUILD VOCAB ---
# Collect all tokens from all captions
all_tokens = []
for token_list in dataframe['tokens']:
    all_tokens.extend(token_list)

token_counts = Counter(all_tokens)

# Filter by minimum frequency
MIN_FREQ = 5
vocab = [token for token, count in token_counts.items() if count >= MIN_FREQ]
special_tokens = ['<pad>', '<start>', '<end>', '<unk>']
vocab = special_tokens + [v for v in vocab if v not in special_tokens]
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for idx, word in enumerate(vocab)}

print(f"\nVocabulary size: {len(vocab)}")
print(f"Most common tokens: {token_counts.most_common(20)}")


Vocabulary size: 7689
Most common tokens: [('a', 271705), ('<start>', 158915), ('<end>', 158915), ('in', 83466), ('the', 62978), ('on', 45669), ('and', 44263), ('man', 42598), ('is', 41117), ('of', 38776), ('with', 36207), ('woman', 22211), ('two', 21642), ('are', 20196), ('to', 17607), ('people', 17337), ('at', 16259), ('an', 15883), ('wearing', 15709), ('young', 13218)]
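The choice of `MIN_FREQ` trades vocabulary size against how many tokens end up as `<unk>`. One way to judge a cutoff is to measure what fraction of all token occurrences it still covers. The snippet below is a rough sketch of that idea using made-up counts; the helper `coverage` and the toy `Counter` are assumptions for illustration, and in the notebook one would pass `token_counts` instead.

```python
from collections import Counter

def coverage(token_counts, min_freq):
    """Fraction of token occurrences whose word meets the frequency cutoff."""
    total = sum(token_counts.values())
    kept = sum(c for c in token_counts.values() if c >= min_freq)
    return kept / total if total else 0.0

# Made-up counts, not taken from Flickr30k.
toy_counts = Counter({'a': 10, 'dog': 5, 'zebra': 1})
for cutoff in (1, 5, 10):
    print(cutoff, round(coverage(toy_counts, cutoff), 4))
```

A coverage close to 1.0 at the chosen cutoff suggests that few caption tokens will be replaced by `<unk>`.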


In [6]:
# --- CONVERT TOKENS TO INDICES ---
def tokens_to_indices(tokens, word2idx):
    indices = []
    for token in tokens:
        if token in word2idx:
            indices.append(word2idx[token])
        else:
            indices.append(word2idx['<unk>'])
    return indices

dataframe['indices'] = dataframe['tokens'].apply(lambda x: tokens_to_indices(x, word2idx))

In [7]:
# --- CHOOSE MAX LENGTH ---
# Get length of each caption (in tokens)
caption_lengths = dataframe['tokens'].apply(len)

MAX_LENGTH = int(np.percentile(caption_lengths, 95))
print(f"\nUsing MAX_LENGTH: {MAX_LENGTH}")

# Check how many captions will be truncated
num_truncated = (caption_lengths > MAX_LENGTH).sum()
print(f"Captions to be truncated: {num_truncated} ({num_truncated/len(dataframe)*100:.2f}%)")


Using MAX_LENGTH: 24
Captions to be truncated: 6779 (4.27%)


In [8]:
# --- PADDING AND TRUNCATION FUNCTION ---
def pad_or_truncate(indices, max_length, pad_idx):
    if len(indices) > max_length:
        # Truncate (but keep <end> token)
        return indices[:max_length-1] + [indices[-1]]  # Keep <end>
    else:
        # Pad with <pad> tokens
        return indices + [pad_idx] * (max_length - len(indices))

# Apply padding/truncation
PAD_IDX = word2idx['<pad>']
dataframe['padded_indices'] = dataframe['indices'].apply(
    lambda x: pad_or_truncate(x, MAX_LENGTH, PAD_IDX)
)

# Verify all have same length
assert all(len(x) == MAX_LENGTH for x in dataframe['padded_indices']), "Not all sequences have MAX_LENGTH!"
print(f"\nAll captions now have length {MAX_LENGTH}")


All captions now have length 24


In [9]:
# --- UPDATE IMAGE-TO-CAPTIONS MAPPING ---
image_to_captions = {}
for idx, row in dataframe.iterrows():
    img_name = row['image']
    caption_indices = row['padded_indices']
    
    if img_name not in image_to_captions:
        image_to_captions[img_name] = []
    
    image_to_captions[img_name].append(caption_indices)

# --- SAVE WITH MAX_LENGTH ---
with open('vocab.pkl', 'wb') as frame:
    pickle.dump({
        'word2idx': word2idx,
        'idx2word': idx2word,
        'vocab': vocab,
        'max_length': MAX_LENGTH,  # Save this too!
        'pad_idx': PAD_IDX
    }, frame)

with open('image_to_captions.pkl', 'wb') as frame:
    pickle.dump(image_to_captions, frame)

print("\nAll preprocessing complete!")
print(f"Vocabulary size: {len(vocab)}")
print(f"Max caption length: {MAX_LENGTH}")
print(f"Total images: {len(image_to_captions)}")


All preprocessing complete!
Vocabulary size: 7689
Max caption length: 24
Total images: 31783
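A quick way to sanity-check the saved mappings is to turn a padded index sequence back into text. The sketch below assumes a small hand-written `toy_idx2word` in place of the real mapping stored in `vocab.pkl`; the `decode` helper is hypothetical and not part of the notebook.

```python
def decode(indices, idx2word, skip=('<pad>', '<start>', '<end>')):
    """Map indices back to words and drop the structural special tokens."""
    words = [idx2word.get(i, '<unk>') for i in indices]
    return ' '.join(w for w in words if w not in skip)

# Hand-written mapping for illustration; the real one comes from vocab.pkl.
toy_idx2word = {0: '<pad>', 1: '<start>', 2: '<end>', 3: '<unk>', 4: 'a', 5: 'dog'}
decoded = decode([1, 4, 5, 3, 2, 0], toy_idx2word)
print(decoded)
```

Leaving `<unk>` in the decoded text makes it easy to see which words were dropped by the frequency filter.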
