# Gender Neutral Image Captioning

## Part I. Preparing Dataset for Training

In [1]:
from data_utils import get_activity_list, get_gender_nouns, get_qualified_dataset

In [2]:
annotations_path = './data/annotations/'
get_activity_list(save_file = True)
get_gender_nouns(save_file = True)
get_qualified_dataset(annotations_path, save_file = True)

activity_image_ids saved  as ~/obj/activity_image_ids.pkl
gender_nouns_lookup saved  as ~/obj/gender_nouns_lookup.pkl
Loading gender_nouns_lookup from ~/obj/gender_nouns_lookup.pkl

Evaluating ground truth labels in train set
Caption 100000 processed, out of 414113 captions
No. of qualified images processed: 4504
Caption 200000 processed, out of 414113 captions
No. of qualified images processed: 9333
Caption 400000 processed, out of 414113 captions
No. of qualified images processed: 24380

Evaluating ground truth labels in val set
Caption 0 processed, out of 202654 captions
No. of qualified images processed: 24941
Caption 100000 processed, out of 202654 captions
No. of qualified images processed: 31457
Caption 200000 processed, out of 202654 captions
No. of qualified images processed: 41733
saved in ./data/list/qualified_image_ids.csv
captions_dict saved  as ~/obj/captions_dict.pkl
im_gender_summary saved  as ~/obj/im_gender_summary.pkl


## Part II. Training Model

### Select method to generate training set 

One motivation for this project is to counter gender bias in the dataset. As ground truth gender labels are not available in the original COCO dataset, we experiment with different methods of balancing the dataset. The **get_training_indices** function in data_utils.py offers 8 different modes of generating training data.

    - random: randomized selection of qualified images
    - balanced_mode: balanced ratio between male, female and neutral
    - balanced_clean: balanced ratio between male, female and neutral, using only images whose captions all agree on the same gender
    - balanced_gender_only: same as balanced_mode, but without neutral captions
    - balanced_clean_noun: balanced ratio between male, female and neutral, using only images whose captions all agree on the same noun
    - clean_noun: use only images whose captions all agree on the same noun
    - activity_balanced: from activity-tagged image sets, choose the same ratio of male, female and neutral images
    - activity_balanced_clean: similar to activity_balanced, but all captions must agree on the same gender
    
Note that the output may be smaller than the requested training size, especially for activity_balanced and activity_balanced_clean, because for certain activities the amount of clean data is limited for some classes, e.g. women wearing a tie.
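
To make the idea behind `balanced_clean` concrete, here is a minimal sketch. It is not the actual `get_training_indices` implementation. It assumes each image comes with a list of per-caption gender labels (`'male'`, `'female'`, `'neutral'`); the helper names `clean_label` and `sample_balanced_clean` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def clean_label(caption_genders):
    """Return the shared gender label if every caption agrees, else None."""
    labels = set(caption_genders)
    return labels.pop() if len(labels) == 1 else None


def sample_balanced_clean(image_genders, sample_size, seed=0):
    """Pick up to sample_size image ids, split evenly across gender classes.

    image_genders maps image_id -> list of per-caption gender labels.
    Only images whose captions all agree are eligible ("clean").
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, caption_genders in image_genders.items():
        label = clean_label(caption_genders)
        if label is not None:
            by_class[label].append(image_id)

    classes = ("male", "female", "neutral")
    per_class = sample_size // len(classes)
    selected = {}
    for label in classes:
        pool = by_class.get(label, [])
        # A class with too few clean images contributes fewer samples,
        # which is why the output can be smaller than requested.
        k = min(per_class, len(pool))
        for image_id in rng.sample(pool, k):
            selected[image_id] = label
    return selected


# Hypothetical toy data: image_id -> gender label of each caption
toy = {
    1: ["male", "male", "male"],
    2: ["female", "female"],
    3: ["male", "neutral"],      # captions disagree, so not clean
    4: ["neutral", "neutral"],
    5: ["male", "male"],
    6: ["female", "neutral"],    # captions disagree, so not clean
}
picked = sample_balanced_clean(toy, sample_size=6)
print(picked)
```

In this toy example six images are requested but only four are returned, because just one clean female image and one clean neutral image exist.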

In [12]:
from data_utils import get_training_indices, train_test_split

sample_size = 10
test_size = 0.3
training_image_ids, training_captions_dict = get_training_indices(sample_size = sample_size, mode = "balanced_clean")
train_image_ids, val_image_ids, gender_train, gender_val = train_test_split(training_image_ids, test_size = test_size)

Loading im_gender_summary from ~/obj/im_gender_summary.pkl
Loading captions_dict from ~/obj/captions_dict.pkl
Loading activity_image_ids from ~/obj/activity_image_ids.pkl
Loading im_gender_summary from ~/obj/im_gender_summary.pkl


### Train

In [13]:
from model_utils import load_data

image_folder_path = './data/images/'

train_loader = load_data(train_image_ids, image_folder_path, mode = 'train')
val_loader = load_data(val_image_ids, image_folder_path, mode = 'val')

Loading captions_dict from ~/obj/captions_dict.pkl
Tokenize captions: (0, 23)
vocab saved  as ~/obj/vocab.pkl
Vocabulary successfully created
Loading captions_dict from ~/obj/captions_dict.pkl
Loading vocab from ~/obj/vocab.pkl
Vocabulary successfully loaded


In [14]:
import torch.utils.data as data
# Sample a subset of captions with a randomized length
indices = train_loader.dataset.get_indices()

# Create and assign batch sampler to retrieve a batch with the sampled indices
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
train_loader.batch_sampler.sampler = new_sampler
    
# Obtain one batch
for batch in train_loader:
    images, captions = batch[0], batch[1]
    break
    
print('images.shape:', images.shape)
print('captions.shape:', captions.shape)

images.shape: torch.Size([10, 3, 224, 224])
captions.shape: torch.Size([10, 15])
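
Every caption in the batch above has the same length (15 tokens), which is what lets the captions form a single tensor without padding. A minimal sketch of how such indices could be drawn is shown below, assuming that `get_indices` picks one caption length at random and then samples captions that share it. The function name `sample_same_length_indices` and its arguments are hypothetical, and the real method in this repo may differ.

```python
import random


def sample_same_length_indices(caption_lengths, batch_size, rng=None):
    """Sample batch_size caption indices that all share one random length."""
    rng = rng or random.Random()
    # Choosing from the raw list weights each length by how often it occurs.
    target_length = rng.choice(caption_lengths)
    candidates = [i for i, n in enumerate(caption_lengths) if n == target_length]
    # Sample with replacement so rare lengths can still fill a whole batch.
    return [rng.choice(candidates) for _ in range(batch_size)]


# Hypothetical caption lengths, one entry per caption
caption_lengths = [12, 15, 15, 9, 15, 12, 15]
indices = sample_same_length_indices(caption_lengths, batch_size=4, rng=random.Random(0))
print(indices, [caption_lengths[i] for i in indices])
```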


In [15]:
import torch
import torch.nn as nn
from model import EncoderCNN, DecoderRNN
import math

batch_size = 32
embed_size = 256
hidden_size = 512
num_epochs = 10
vocab_size = len(train_loader.dataset.vocab)

# Initialize CNN and RNN
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Use GPU if available
if torch.cuda.is_available():
    encoder.cuda()
    decoder.cuda()  
    
# Define the loss function
criterion = nn.CrossEntropyLoss()
if torch.cuda.is_available():
    criterion = criterion.cuda()

# Specify the learnable parameters of the model
params = list(decoder.parameters()) + list(encoder.embed.parameters()) + list(encoder.bn.parameters())

# Define the optimizer
optimizer = torch.optim.Adam(params=params, lr=0.001)

# Calculate the total number of training and validation steps per epoch
total_train_step = math.ceil(len(train_loader.dataset.captions_len) / train_loader.batch_sampler.batch_size)
print("Number of training steps:", total_train_step)
total_val_step = math.ceil(len(val_loader.dataset.captions_len) / val_loader.batch_sampler.batch_size)
print("Number of validation steps:", total_val_step)

Number of training steps: 3
Number of validation steps: 2


In [None]:
import time
import os
from model_utils import train, validate, save_epoch, early_stopping
train_losses = []
val_losses = []
val_bleus = []
best_val_bleu = float("-INF")

start_time = time.time()
for epoch in range(1, num_epochs + 1):
    train_loss = train(train_loader, encoder, decoder, criterion, optimizer, 
                       vocab_size, epoch, total_train_step)
    train_losses.append(train_loss)
    val_loss, val_bleu = validate(val_loader, encoder, decoder, criterion,
                                  train_loader.dataset.vocab, epoch, total_val_step)
    val_losses.append(val_loss)
    val_bleus.append(val_bleu)
    if val_bleu > best_val_bleu:
        print ("Validation Bleu-4 improved from {:0.4f} to {:0.4f}, saving model to best-model.pkl".
               format(best_val_bleu, val_bleu))
        best_val_bleu = val_bleu
        filename = os.path.join("./models", "best-model.pkl")
        save_epoch(filename, encoder, decoder, optimizer, train_losses, val_losses, 
                   val_bleu, val_bleus, epoch)
    else:
        print ("Validation Bleu-4 did not improve, saving model to model-{}.pkl".format(epoch))
    # Save the entire model anyway, regardless of being the best model so far or not
    filename = os.path.join("./models", "model-{}.pkl".format(epoch))
    save_epoch(filename, encoder, decoder, optimizer, train_losses, val_losses, 
               val_bleu, val_bleus, epoch)
    print ("Epoch [%d/%d] took %ds" % (epoch, num_epochs, time.time() - start_time))
    if epoch > 5:
        # Stop if the validation Bleu doesn't improve for 3 epochs
        if early_stopping(val_bleus, 3):
            break
    start_time = time.time()

Epoch 1, Val step [2/2], 0s, Loss: 1.8816, Perplexity: 6.5639, Bleu-4: 0.3375
Validation Bleu-4 improved from -inf to 0.3501, saving model to best-model.pkl
Epoch [1/10] took 3s
Epoch 2, Val step [2/2], 0s, Loss: 1.6607, Perplexity: 5.2630, Bleu-4: 0.4903
Validation Bleu-4 improved from 0.3501 to 0.3937, saving model to best-model.pkl
Epoch [2/10] took 2s
Epoch 3, Val step [2/2], 0s, Loss: 1.5851, Perplexity: 4.8800, Bleu-4: 0.1242
Validation Bleu-4 did not improve, saving model to model-3.pkl
Epoch [3/10] took 2s
Epoch 4, Val step [2/2], 0s, Loss: 1.4941, Perplexity: 4.4551, Bleu-4: 0.4523
Validation Bleu-4 did not improve, saving model to model-4.pkl
Epoch [4/10] took 1s
Epoch 5, Val step [2/2], 0s, Loss: 1.6074, Perplexity: 4.9896, Bleu-4: 0.2006
Validation Bleu-4 did not improve, saving model to model-5.pkl
Epoch [5/10] took 1s
Epoch 6, Val step [2/2], 0s, Loss: 1.6262, Perplexity: 5.0846, Bleu-4: 0.5100
Validation Bleu-4 improved from 0.3937 to 0.4038, saving model to best-model.pkl
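
The `early_stopping` helper called in the training loop is imported from `model_utils` and is not shown in this notebook. The sketch below illustrates one plausible reading of "stop if the validation Bleu doesn't improve for 3 epochs". The function name `early_stopping_sketch` and the toy scores are hypothetical, and the real helper may use a different rule.

```python
def early_stopping_sketch(val_bleus, patience=3):
    """Return True when the last `patience` scores never beat the earlier best."""
    if len(val_bleus) <= patience:
        return False  # not enough history yet
    best_before = max(val_bleus[:-patience])
    return max(val_bleus[-patience:]) <= best_before


# Hypothetical validation scores, one per epoch
stalled = early_stopping_sketch([0.35, 0.39, 0.31, 0.33, 0.30], patience=3)
improving = early_stopping_sketch([0.35, 0.39, 0.31, 0.33, 0.40], patience=3)
print(stalled, improving)
```

In the first list the best score (0.39) is followed by three epochs that do not beat it, so training would stop. In the second list the last epoch sets a new best, so training would continue.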


## Part III. Predict on test images

### Load pretrained weights

Download the model weights from XXX into the ./models/ directory of this repo.

In [2]:
import os
import torch
# Load the best model; map_location='cpu' lets the checkpoint load on machines without a GPU
checkpoint = torch.load(os.path.join('./models', 'best-model.pkl'), map_location='cpu')

In [7]:
from data_utils import get_test_indices
sample_size = 5
test_image_ids = get_test_indices(training_image_ids, sample_size, mode = 'balanced_clean')

Loading im_gender_summary from ~/obj/im_gender_summary.pkl
Loading captions_dict from ~/obj/captions_dict.pkl


In [8]:
test_image_ids

{174028: ['two girls sitting on a bench eating lunch',
  'A girl demonstrating the act of shaving sitting on a bench.',
  'two females sitting on a bench one is being the arms for the other.'],
 385626: ['Woman riding a bicycle down an empty street.',
  'A woman in green is riding a bike.',
  'a woman wearing a bright green sweater riding a bicycle',
  'A woman on a bicycle is going down the small town street.',
  'A woman bikes down a one way street.'],
 9469: ['A teenage boy is in a field looking at a wire.',
  'The boy is inspecting the wire for a project.',
  'A boy building something with wires and poles.'],
 348595: ['A man in a parade has red and white face and umbrella.',
  'Soccer fan is showing his support by dressing up. ',
  'A man with a painted red and white face is standing in the street.'],
 25174: ["A closeup of someone's fingers as they use a keyboard.",
  'A kid is typing on a laptop keyboard.',
  "A child's hands hitting keys on an old computer."]}

In [11]:
import numpy as np
import matplotlib.pyplot as plt

test_loader = load_data(test_image_ids.keys(), image_folder_path, mode = 'test')
original_image, image = next(iter(test_loader))
transformed_image = image.numpy()
transformed_image = np.squeeze(transformed_image)\
                    .transpose((1, 2, 0))

# Print sample image, before and after pre-processing
plt.imshow(np.squeeze(original_image))
plt.title('example image')
plt.show()
plt.imshow(transformed_image)
plt.title('transformed image')
plt.show()

Loading captions_dict from ~/obj/captions_dict.pkl
Loading vocab from ~/obj/vocab.pkl
Vocabulary successfully loaded


RuntimeError: stack expects each tensor to be equal size, but got [410, 500, 3] at entry 0 and [640, 546, 3] at entry 2
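
The error above appears when the loader's default collation tries to stack the original, un-resized images of a batch into one tensor: the images have different sizes, so stacking fails. Two common workarounds are to load test images one at a time (batch size 1) or to supply a custom collate function that does not stack. The sketch below shows the second option under the assumption that each dataset item is an `(original_image, transformed_image)` pair, as the unpacking in the cell suggests. The name `collate_as_lists` and the toy samples are hypothetical.

```python
def collate_as_lists(batch):
    """Group each field of the samples into a list instead of stacking it.

    batch is a list of (original_image, transformed_image) pairs. Returning
    lists avoids the equal-size requirement that stacking imposes.
    """
    originals, transformed = zip(*batch)
    return list(originals), list(transformed)


# Toy stand-ins for images of different sizes (nested lists, not tensors)
sample_a = ([[0] * 500] * 410, "tensor_a")
sample_b = ([[0] * 546] * 640, "tensor_b")
originals, transformed = collate_as_lists([sample_a, sample_b])
print(len(originals), len(originals[0]), len(originals[1]))
```

Such a function could be passed as `collate_fn=collate_as_lists` when constructing a `torch.utils.data.DataLoader`. Because `load_data` builds the loader internally, it would need to set or accept that argument.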

In [None]:
from model import EncoderCNN, DecoderRNN

# Specify values for embed_size and hidden_size - we use the same values as in the training step
embed_size = 256
hidden_size = 512

# Get the vocabulary and its size
vocab = test_loader.dataset.vocab
vocab_size = len(vocab)

# Initialize the encoder and decoder, and set each to inference mode
encoder = EncoderCNN(embed_size)
encoder.eval()
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
decoder.eval()

# Load the pre-trained weights
encoder.load_state_dict(checkpoint['encoder'])
decoder.load_state_dict(checkpoint['decoder'])

# Move the models and the input image to the GPU if CUDA is available
if torch.cuda.is_available():
    encoder.cuda()
    decoder.cuda()
    image = image.cuda()

features = encoder(image).unsqueeze(1)
output = decoder.sample_beam_search(features)
print('example output:', output)
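
`sample_beam_search` is defined in model.py and is not shown in this notebook. As a rough illustration of the general technique, the sketch below runs beam search over a toy next-token table instead of a trained decoder. All names here (`beam_search`, `toy_next_log_probs`, `TABLE`) are hypothetical, and the decoder's actual method may differ in its details.

```python
import math


def beam_search(next_log_probs, start_token, end_token, beam_width=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(sequence) must return a dict {token: log_probability}
    for the token that follows `sequence`.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, log_p in next_log_probs(seq).items():
                candidates.append((seq + [token], score + log_p))
        # Keep only the beam_width best partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished beams if needed
    return max(finished, key=lambda item: item[1])[0]


# Toy model: a fixed table of next-token probabilities keyed by the last token
TABLE = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"person": 0.5, "man": 0.3, "woman": 0.2},
    "the": {"person": 0.9, "man": 0.1},
    "person": {"<end>": 1.0},
    "man": {"<end>": 1.0},
    "woman": {"<end>": 1.0},
}


def toy_next_log_probs(seq):
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}


best = beam_search(toy_next_log_probs, "<start>", "<end>", beam_width=2)
print(best)
```

With a beam width of 2, the toy example returns the sequence starting with "the person" (probability 0.36), whereas a greedy decoder would commit to "a" first and end up with "a person" (probability 0.30).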

## Part IV. Evaluate performance of model

In [8]:
from data_utils import get_test_indices
sample_size = 100
test_indices = get_test_indices(training_image_ids, sample_size, mode = 'balanced_clean')

Loading im_gender_summary from ~/obj/im_gender_summary.pkl
Loading captions_dict from ~/obj/captions_dict.pkl
captions of 0 images are added