# Gender Neutral Image Captioning

## Part I. Preparing Dataset for Training

In [1]:
from data_utils import get_activity_list, get_gender_nouns, get_qualified_dataset

In [2]:
annotations_path = './data/annotations/'
get_activity_list(save_file = True)
get_gender_nouns(save_file = True)
get_qualified_dataset(annotations_path, save_file = True)

Dictionary activity_image_ids is saved as pickle in ~/obj/
Dictionary gender_nouns_lookup is saved as pickle in ~/obj/

Evaluating ground truth labels in train set

Caption 0 processed, out of 414113 captions
No. of qualified images processed: 0

Caption 100000 processed, out of 414113 captions
No. of qualified images processed: 6452

Caption 200000 processed, out of 414113 captions
No. of qualified images processed: 13359

Caption 300000 processed, out of 414113 captions
No. of qualified images processed: 24080

Caption 400000 processed, out of 414113 captions
No. of qualified images processed: 34712

Evaluating ground truth labels in val set

Caption 0 processed, out of 202654 captions
No. of qualified images processed: 35500

Caption 100000 processed, out of 202654 captions
No. of qualified images processed: 42016

Caption 200000 processed, out of 202654 captions
No. of qualified images processed: 52292
List qualified_image_ids is saved as csv in ~/data/list/
Dictionary captions_dic

## Part II. Select Training Method

One of our motivations for this project is to counter the bias in the dataset. Because ground-truth gender labels are not available in the original COCO dataset, we are experimenting with different methods of balancing the dataset. The **get_training_data** function in data_utils.py supports 8 different modes of generating data.

    - random: randomized selection of qualified images
    - balanced_mode: balanced ratio between male, female and neutral
    - balanced_clean: balanced ratio between male, female and neutral, using only images whose captions all agree on the same gender
    - balanced_gender_only: same as balanced_mode, but without neutral captions
    - balanced_clean_noun: balanced ratio between male, female and neutral, using only images whose captions all agree on the same noun
    - clean_noun: use only images whose captions all agree on the same noun
    - activity_balanced: from activity-tagged image sets, choose the same ratio of male, female and neutral images
    - activity_balanced_clean: similar to activity_balanced, but all captions must agree on the same gender
    
Note that the output size may be smaller than training_size, especially for activity_balanced and activity_balanced_clean. For certain activities, the amount of clean data is limited for some classes, e.g. women wearing a tie.
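
To make the "clean" and "balanced" ideas concrete, here is a minimal sketch of what a balanced_clean style selection could look like. It is not the code in data_utils.py: the noun sets, `caption_gender`, and `balanced_clean_sample` are hypothetical names introduced only for illustration, and the real gender_nouns_lookup is likely much larger.

```python
import random
from collections import defaultdict

# Hypothetical, deliberately tiny noun sets (assumption, not the project's lookup).
MALE_NOUNS = {"man", "men", "boy", "boys"}
FEMALE_NOUNS = {"woman", "women", "girl", "girls"}
NEUTRAL_NOUNS = {"person", "people"}


def caption_gender(caption):
    """Return 'male', 'female', 'neutral', or None for one caption."""
    words = set(caption.lower().replace(".", " ").split())
    has_male = bool(words & MALE_NOUNS)
    has_female = bool(words & FEMALE_NOUNS)
    if has_male and has_female:
        return None  # caption mentions both genders: not usable here
    if has_male:
        return "male"
    if has_female:
        return "female"
    if words & NEUTRAL_NOUNS:
        return "neutral"
    return None


def balanced_clean_sample(captions_by_image, sample_size, seed=0):
    """Select up to sample_size image ids, split evenly over the three classes,
    keeping only images whose captions all agree on one class."""
    by_class = defaultdict(list)
    for image_id, captions in captions_by_image.items():
        labels = {caption_gender(c) for c in captions}
        if len(labels) == 1 and None not in labels:
            by_class[labels.pop()].append(image_id)

    rng = random.Random(seed)
    per_class = sample_size // 3
    selected = []
    for label in ("male", "female", "neutral"):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        selected.extend(ids[:per_class])  # a class may have fewer than per_class ids
    return selected


captions_by_image = {
    1: ["A man rides a bike.", "A man on a bicycle."],
    2: ["A woman holds an umbrella.", "A person with an umbrella."],  # captions disagree
    3: ["A woman is cooking.", "A woman in a kitchen."],
    4: ["A person skiing.", "People on a ski slope."],
}
print(balanced_clean_sample(captions_by_image, sample_size=3))
print(len(balanced_clean_sample(captions_by_image, sample_size=30)))
```

In this toy example, image 2 is dropped because its captions disagree, and asking for 30 images still returns only 3, which is the same effect as the smaller-than-requested output described above.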

In [1]:
from data_utils import get_training_indices, train_test_split

sample_size = 1000
test_size = 0.3
training_image_ids, training_captions_dict = get_training_indices(sample_size = sample_size, mode = "balanced_clean")
train_image_ids, val_image_ids, gender_train, gender_val = train_test_split(training_image_ids, test_size = test_size)

Loading im_gender_summary from ~/obj/im_gender_summary.pkl
Loading captions_dict from ~/obj/captions_dict.pkl
Loading activity_image_ids from ~/obj/activity_image_ids.pkl
captions of 1000 images are added
Loading im_gender_summary from ~/obj/im_gender_summary.pkl
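
The implementation of `train_test_split` is not shown in this notebook. As a rough illustration of why it returns gender labels alongside the ids, the sketch below performs a stratified split that keeps the gender ratio similar in both parts. `stratified_split` and `gender_of` are hypothetical names, and the project's function may work differently.

```python
import random
from collections import defaultdict


def stratified_split(image_ids, gender_of, test_size=0.3, seed=0):
    """Split image ids into train/val so each gender class is split by test_size."""
    groups = defaultdict(list)
    for image_id in image_ids:
        groups[gender_of[image_id]].append(image_id)

    rng = random.Random(seed)
    train_ids, val_ids = [], []
    for label in sorted(groups):
        ids = groups[label][:]
        rng.shuffle(ids)
        n_val = round(len(ids) * test_size)  # validation share for this class
        val_ids.extend(ids[:n_val])
        train_ids.extend(ids[n_val:])
    return train_ids, val_ids


# Toy data: 30 ids, 10 per class (hypothetical labels).
ids = list(range(30))
gender_of = {i: ("male", "female", "neutral")[i % 3] for i in ids}
train_ids, val_ids = stratified_split(ids, gender_of, test_size=0.3)
print(len(train_ids), len(val_ids))
```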


In [2]:
from model_utils import load_data

image_folder_path = './data/images/'

train_loader = load_data(train_image_ids, image_folder_path, mode = 'train')
val_loader = load_data(val_image_ids, image_folder_path, mode = 'val')

Loading captions_dict from ~/obj/captions_dict.pkl
Tokenize captions: (0, 2665)
Tokenize captions: (100, 2665)
Tokenize captions: (200, 2665)
Tokenize captions: (300, 2665)
Tokenize captions: (400, 2665)
Tokenize captions: (500, 2665)
Tokenize captions: (600, 2665)
Tokenize captions: (700, 2665)
Tokenize captions: (800, 2665)
Tokenize captions: (900, 2665)
Tokenize captions: (1000, 2665)
Tokenize captions: (1100, 2665)
Tokenize captions: (1200, 2665)
Tokenize captions: (1300, 2665)
Tokenize captions: (1400, 2665)
Tokenize captions: (1500, 2665)
Tokenize captions: (1600, 2665)
Tokenize captions: (1700, 2665)
Tokenize captions: (1800, 2665)
Tokenize captions: (1900, 2665)
Tokenize captions: (2000, 2665)
Tokenize captions: (2100, 2665)
Tokenize captions: (2200, 2665)
Tokenize captions: (2300, 2665)
Tokenize captions: (2400, 2665)
Tokenize captions: (2500, 2665)
Tokenize captions: (2600, 2665)
vocab saved  as ~/obj/vocab.pkl
Vocabulary successfully created
Loading captions_dict from ~/obj/

In [3]:
import torch.utils.data as data
# Sample a subset of captions with a randomized length
indices = train_loader.dataset.get_indices()

# Create and assign batch sampler to retrieve a batch with the sampled indices
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
train_loader.batch_sampler.sampler = new_sampler
    
# Obtain a single batch from the loader
batch = next(iter(train_loader))
images, captions = batch[0], batch[1]
    
print('images.shape:', images.shape)
print('captions.shape:', captions.shape)

images.shape: torch.Size([10, 3, 224, 224])
captions.shape: torch.Size([10, 22])
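
All captions in the batch above share one length (22 tokens), which is what sampling "with a randomized length" achieves: no padding is needed inside a batch. The internals of `get_indices` are not shown here, so the following is only a sketch of one common way to do it, written in plain Python with a hypothetical `caption_lengths` list.

```python
import random


def get_indices(caption_lengths, batch_size, rng=random):
    """Pick one caption length that occurs in the data, then draw batch_size
    caption indices (with replacement) whose captions all have that length."""
    chosen_length = rng.choice(caption_lengths)
    candidates = [i for i, n in enumerate(caption_lengths) if n == chosen_length]
    return [rng.choice(candidates) for _ in range(batch_size)]


# Hypothetical token counts for six captions.
caption_lengths = [12, 12, 15, 12, 15, 9]
indices = get_indices(caption_lengths, batch_size=4, rng=random.Random(0))
print(indices, {caption_lengths[i] for i in indices})
```

Because the length is drawn from the list of all caption lengths, frequent lengths are chosen more often, so batches roughly follow the length distribution of the data.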


In [4]:
import torch
import torch.nn as nn
from model import EncoderCNN, DecoderRNN
import math

batch_size = 32
embed_size = 256
hidden_size = 512
num_epochs = 10
vocab_size = len(train_loader.dataset.vocab)

# Initialize CNN and RNN
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Use GPU if available
if torch.cuda.is_available():
    encoder.cuda()
    decoder.cuda()

# Define the loss function
criterion = nn.CrossEntropyLoss()
if torch.cuda.is_available():
    criterion = criterion.cuda()

# Specify the learnable parameters of the model
params = list(decoder.parameters()) + list(encoder.embed.parameters()) + list(encoder.bn.parameters())

# Define the optimizer
optimizer = torch.optim.Adam(params=params, lr=0.001)

# Calculate total number of training steps per epoch
total_train_step = math.ceil(len(train_loader.dataset.captions_len) / train_loader.batch_sampler.batch_size)
print ("Number of training steps:", total_train_step)
total_val_step = math.ceil(len(val_loader.dataset.captions_len) / val_loader.batch_sampler.batch_size)
print ("Number of validation steps:", total_val_step)

Number of training steps: 267
Number of validation steps: 114
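
As a sanity check on these numbers, the step count is just the caption count divided by the loader's batch size, rounded up. The sketch below assumes that 2665 (from the tokenization log) is the number of training captions and that the loader's batch size is 10 (from the batch shapes printed earlier), not the `batch_size = 32` defined in this cell. `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_captions, batch_size):
    """Number of batches needed to cover num_captions, counting a final partial batch."""
    return math.ceil(num_captions / batch_size)


# Assumed values: 2665 training captions, batch size 10.
print(steps_per_epoch(2665, 10))  # 267
```

Under the same batch-size assumption, 114 validation steps would correspond to between 1131 and 1140 validation captions.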


In [6]:
import time
import os
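# save_epoch and early_stopping are used below; importing them from model_utils is an assumption
from model_utils import train, validate, save_epoch, early_stopping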
train_losses = []
val_losses = []
val_bleus = []
best_val_bleu = float("-INF")

start_time = time.time()
for epoch in range(1, num_epochs + 1):
    train_loss = train(train_loader, encoder, decoder, criterion, optimizer, 
                       vocab_size, epoch, total_train_step)
    train_losses.append(train_loss)
    val_loss, val_bleu = validate(val_loader, encoder, decoder, criterion,
                                  train_loader.dataset.vocab, epoch, total_val_step)
    val_losses.append(val_loss)
    val_bleus.append(val_bleu)
    if val_bleu > best_val_bleu:
        print ("Validation Bleu-4 improved from {:0.4f} to {:0.4f}, saving model to best-model.pkl".
               format(best_val_bleu, val_bleu))
        best_val_bleu = val_bleu
        filename = os.path.join("./models", "best-model.pkl")
        save_epoch(filename, encoder, decoder, optimizer, train_losses, val_losses, 
                   val_bleu, val_bleus, epoch)
    else:
        print ("Validation Bleu-4 did not improve, saving model to model-{}.pkl".format(epoch))
    # Save the entire model anyway, regardless of being the best model so far or not
    filename = os.path.join("./models", "model-{}.pkl".format(epoch))
    save_epoch(filename, encoder, decoder, optimizer, train_losses, val_losses, 
               val_bleu, val_bleus, epoch)
    print ("Epoch [%d/%d] took %ds" % (epoch, num_epochs, time.time() - start_time))
    if epoch > 5:
        # Stop if the validation Bleu doesn't improve for 3 epochs
        if early_stopping(val_bleus, 3):
            break
    start_time = time.time()

Epoch 1, Train step [100/267], 232s, Loss: 3.4151, Perplexity: 30.4186
Epoch 1, Train step [128/267], 65s, Loss: 3.8966, Perplexity: 49.2345

KeyboardInterrupt: 
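
Two details of the run above can be illustrated in a few lines. First, the logged perplexity is consistent with being the exponential of the cross-entropy loss. Second, the loop relies on an `early_stopping` helper that is not shown in this notebook; the version below is only one possible reading of "stop if the validation Bleu doesn't improve for 3 epochs", not the project's implementation.

```python
import math


def perplexity(loss):
    """Perplexity as the exponential of the average cross-entropy loss."""
    return math.exp(loss)


def early_stopping(scores, patience=3):
    """Return True if none of the last `patience` scores beats the best earlier score."""
    if len(scores) <= patience:
        return False  # not enough history yet
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before


# Loss values taken from the training log above.
print(round(perplexity(3.4151), 2), round(perplexity(3.8966), 2))

# Hypothetical validation BLEU histories.
print(early_stopping([0.10, 0.12, 0.15, 0.14, 0.15, 0.13], patience=3))
print(early_stopping([0.10, 0.12, 0.15, 0.14, 0.16, 0.13], patience=3))
```

In the first history the best score (0.15) is only matched, never exceeded, in the last three epochs, so training would stop; in the second, 0.16 is a new best and training continues.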

## Part III. Train Model / Load Model Weights

### To train model

In [5]:
train_image_ids, test_image_ids, gender_train, gender_test = train_test_split(training_image_ids, im_gender_summary, test_size = 0.3)

### To load pretrained weights

Download model weights from XXX to ./model/ of this repo.

## Part IV. Predict on selected images