## Download Dataset

In [1]:
from datasets import load_dataset

dset = load_dataset("SEACrowd/uit_viic", "uit_viic_seacrowd_imtext", trust_remote_code=True)


Generating train split: 13481 examples [00:00, 29354.25 examples/s]
Generating validation split: 4620 examples [00:00, 30156.49 examples/s]
Generating test split: 1155 examples [00:00, 29055.14 examples/s]


In [2]:
print(dset)
print(dset['train'][0])

DatasetDict({
    train: Dataset({
        features: ['id', 'image_paths', 'texts', 'metadata'],
        num_rows: 13481
    })
    validation: Dataset({
        features: ['id', 'image_paths', 'texts', 'metadata'],
        num_rows: 4620
    })
    test: Dataset({
        features: ['id', 'image_paths', 'texts', 'metadata'],
        num_rows: 1155
    })
})
{'id': '4990', 'image_paths': ['http://images.cocodataset.org/train2017/000000157656.jpg', 'http://farm3.staticflickr.com/2566/3796096212_bd02c4c56c_z.jpg'], 'texts': 'Người đàn ông đang đánh tennis ngoài sân.', 'metadata': {'context': '', 'labels': [0]}}


## Data Preparation

This tutorial shows how to prepare the data in advance. If you only want to use our trained model, download the annotations and checkpoints from [Release](https://github.com/FeiElysia/ViECap/releases/tag/checkpoints) and skip the data preparation section.

To run the code on a custom dataset, first preprocess it into a list of captions. Taking Flickr30k as an example, the training annotations are stored as a dictionary whose keys are image names and whose values are the corresponding captions. Flatten this dictionary into a list: [caption 1, caption 2, ..., caption n].

In [None]:
import json
import string
from typing import List

def load_flickr30k_captions(path: str) -> List[str]:
    """Flatten {image_path: [caption1, caption2, ...]} into a list of normalized captions."""
    with open(path, 'r', encoding='utf-8') as infile:
        annotations = json.load(infile) # dictionary -> {image_path: List[caption1, caption2, ...]}
    captions = []
    for image_path in annotations:
        for caption in annotations[image_path]:
            caption = caption.strip()
            if not caption: # skip empty captions, which would fail on caption[0] below
                continue
            if caption.isupper(): # all-caps captions are lowercased before capitalizing
                caption = caption.lower()
            caption = caption[0].upper() + caption[1:]
            if caption[-1] not in string.punctuation: # end every caption with punctuation
                caption += '.'
            captions.append(caption)
    return captions

captions = load_flickr30k_captions('../annotations/flickr30k/train_captions.json')
print(f'The type of the processed caption is: {type(captions)}, the total training samples are: {len(captions)}')
captions[:10]

FileNotFoundError: [Errno 2] No such file or directory: '../annotations/flickr30k/train_captions.json'
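
The same normalization can be applied to captions that are already in memory, for example the `texts` field of the records printed in the first section. A minimal sketch, assuming each record is a dict with a `texts` string as in the sample record shown there; `captions_from_records` is a hypothetical helper and not part of ViECap:

```python
import string
from typing import Dict, Iterable, List


def captions_from_records(records: Iterable[Dict]) -> List[str]:
    """Collect normalized captions from records that carry a 'texts' string."""
    captions = []
    for record in records:
        caption = record['texts'].strip()
        if not caption:  # skip empty captions instead of failing on caption[0]
            continue
        if caption.isupper():
            caption = caption.lower()
        caption = caption[0].upper() + caption[1:]
        if caption[-1] not in string.punctuation:
            caption += '.'
        captions.append(caption)
    return captions


# The first record mirrors the sample printed earlier; the others are made-up test inputs.
sample = [
    {'id': '4990', 'texts': 'Người đàn ông đang đánh tennis ngoài sân.'},
    {'id': '0', 'texts': '  a man plays tennis  '},
    {'id': '1', 'texts': ''},
]
print(captions_from_records(sample))
```

With the dataset loaded earlier, `captions_from_records(dset['train'])` would be the corresponding call, assuming the rows iterate as dicts.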

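Once the training captions are in this format, run the following function from ```entities_extraction.py``` to pre-extract the entities of each caption. Note that `nltk.word_tokenize`, `nltk.pos_tag`, and `WordNetLemmatizer` rely on NLTK data packages that must be downloaded once with `nltk.download` before the first run.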

In [2]:
import nltk
import pickle
from nltk.stem import WordNetLemmatizer


def entities_extraction(captions: List[str], path: str) -> None:
    """
    Args:
        captions: [caption 1, caption 2, ..., caption n]
        path: the output path of training data with the format of List[List[List, str]] i.e., [[[entity1, entity2,...], caption], ...]  
    """

    lemmatizer = WordNetLemmatizer()
    new_captions = []
    for caption in captions:
        detected_entities = []
        pos_tags = nltk.pos_tag(nltk.word_tokenize(caption))
        for word, tag in pos_tags:
            if tag in ('NN', 'NNS'): # keep singular and plural common nouns
                entity = lemmatizer.lemmatize(word.lower().strip())
                detected_entities.append(entity)
        detected_entities = list(set(detected_entities))
        new_captions.append([detected_entities, caption])
    
    with open(path, 'wb') as outfile:
        pickle.dump(new_captions, outfile)

captions = captions[:20] # take the first 20 captions as an example
outpath = './training_set_with_entities.pickle'
entities_extraction(captions, outpath)

Now you can check the generated file, which contains entities for each training sample.

In [3]:
with open(outpath, 'rb') as infile:
    captions_with_entities = pickle.load(infile)
print(f'The type of the processed file is: {type(captions_with_entities)}, the total training samples are: {len(captions_with_entities)}')
captions_with_entities[:10]

The type of the processed file is: <class 'list'>, the total training samples are: 20


[[['guy', 'look', 'yard', 'hand', 'hair'],
  'Two young guys with shaggy hair look at their hands while hanging out in the yard.'],
 [['bush', 'male'], 'Two young, White males are outside near many bushes.'],
 [['shirt', 'yard', 'men'], 'Two men in green shirts are standing in a yard.'],
 [['man', 'shirt', 'garden'], 'A man in a blue shirt standing in a garden.'],
 [['friend', 'time'], 'Two friends enjoy time spent together.'],
 [['system', 'pulley', 'hat', 'men'],
  'Several men in hard hats are operating a giant pulley system.'],
 [['piece', 'worker', 'equipment'],
  'Workers look down from up above on a piece of equipment.'],
 [['machine', 'hat', 'men'],
  'Two men working on a machine wearing hard hats.'],
 [['structure', 'top', 'men'], 'Four men on top of a tall structure.'],
 [['rig', 'men'], 'Three men on a large rig.']]
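
A quick way to sanity-check the extracted entities is to count how often each one occurs. The sketch below does this on three samples copied from the output above; `entity_counts` is a hypothetical helper introduced only for illustration:

```python
from collections import Counter
from typing import List, Tuple

# A few samples copied from the output above: [[entities, caption], ...]
samples = [
    [['shirt', 'yard', 'men'], 'Two men in green shirts are standing in a yard.'],
    [['man', 'shirt', 'garden'], 'A man in a blue shirt standing in a garden.'],
    [['rig', 'men'], 'Three men on a large rig.'],
]


def entity_counts(captions_with_entities: List[list]) -> List[Tuple[str, int]]:
    """Count how many captions mention each entity, most frequent first."""
    counter = Counter()
    for entities, _caption in captions_with_entities:
        counter.update(entities)
    # Sort by descending count, then alphabetically so the order is deterministic.
    return sorted(counter.items(), key=lambda item: (-item[1], item[0]))


print(entity_counts(samples)[:3])
```

In the output above, 'men' and 'man' are kept as separate entities, so their counts are not merged here either.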

To speed up the training process, run the following function from ```texts_features_extraction.py``` to pre-extract text features.

In [4]:
import torch
import clip

@torch.no_grad()
def texts_features_extraction(device: str, clip_type: str, inpath: str, outpath: str):

    encoder, _ = clip.load(clip_type, device) # the image preprocessing transform is not needed here

    with open(inpath, 'rb') as infile:
        captions_with_entities = pickle.load(infile) # [[[entity1, entity2, ...], caption], ...]

    for idx in range(len(captions_with_entities)):
        caption = captions_with_entities[idx][1]
        tokens = clip.tokenize(caption, truncate = True).to(device)
        embeddings = encoder.encode_text(tokens).squeeze(dim = 0).to('cpu')
        captions_with_entities[idx].append(embeddings)
    
    with open(outpath, 'wb') as outfile:
        pickle.dump(captions_with_entities, outfile)
    
    return captions_with_entities

device = 'cuda:0'
clip_type = 'ViT-B/32'
inpath = './training_set_with_entities.pickle'
outpath = './training_set_texts_features_ViT-B32.pickle'
text_features = texts_features_extraction(device, clip_type, inpath, outpath)

Check the extracted text features!

In [5]:
print(f'The type of the extracted text features are: {type(text_features)}, the total training samples are: {len(text_features)}')
text_features[0]

The type of the extracted text features are: <class 'list'>, the total training samples are: 20


[['guy', 'look', 'yard', 'hand', 'hair'],
 'Two young guys with shaggy hair look at their hands while hanging out in the yard.',
 tensor([ 1.7798e-01,  1.9336e-01, -3.8989e-01, -3.6011e-01,  1.0211e-01,
          1.4661e-01,  4.4312e-01, -2.3450e-01, -1.2268e-01, -4.8364e-01,
          1.5894e-01,  8.6548e-02, -3.8623e-01, -3.8989e-01, -3.1055e-01,
          2.3108e-01, -3.9337e-02, -2.2290e-01,  2.0496e-01,  2.6147e-01,
          3.0444e-01, -6.5063e-02,  1.4771e-01, -5.9424e-01,  2.3877e-01,
         -1.6724e-01,  6.1890e-02,  4.0088e-01,  1.4014e-01, -1.8225e-01,
          9.3323e-02, -2.4365e-01,  2.4414e-01, -4.4586e-02, -6.9775e-01,
          1.3428e-01, -2.5195e-01, -6.6162e-02,  2.7466e-02,  6.5430e-02,
         -2.9395e-01,  9.3933e-02,  2.2168e-01, -8.6670e-03,  2.4390e-01,
          2.0483e-01,  3.3813e-01,  4.4800e-02,  7.6172e-02,  1.5601e-01,
          1.2817e-01,  3.6401e-01,  8.7524e-02, -5.1904e-01, -2.6025e-01,
          7.3471e-03, -4.1895e-01,  6.6338e-03, -2.9517e-
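
The function above encodes one caption per forward pass. If that is too slow for a large training set, one option is to encode captions in batches, since `clip.tokenize` also accepts a list of strings. A minimal sketch of the chunking step only; `batched` and `batch_size` are hypothetical names, and the CLIP calls are left as comments because they depend on your environment:

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


example_captions = [f'caption {i}' for i in range(5)]
for chunk in batched(example_captions, batch_size=2):
    # Inside texts_features_extraction, a chunk could replace the single caption:
    #   tokens = clip.tokenize(chunk, truncate=True).to(device)   # (len(chunk), 77)
    #   embeddings = encoder.encode_text(tokens).to('cpu')        # (len(chunk), clip_hidden_size)
    # and each row of embeddings would then be appended to its own sample.
    print(len(chunk), chunk)
```

The stored format stays the same as long as every sample still receives exactly one embedding row.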

By following this notebook, you can adapt ```entities_extraction.py```, ```texts_features_extraction.py```, and ```load_annotations.py``` to train ViECap on your own dataset.

To evaluate the trained model on a custom validation set, run the following function from ```images_features_extraction.py``` to pre-extract the image features, which speeds up evaluation. We recommend storing the validation set as a dictionary in which the image name is the key and the list of corresponding captions is the value. Taking the Flickr30k validation set as a reference again, the following code shows the expected format.

In [6]:
path_val_flickr30k = '../annotations/flickr30k/val_captions.json'

# format = {image_path: [caption1, caption2, ...]} -> [[image_path, image_features, [caption 1, caption 2, ...]], ...]
with open(path_val_flickr30k, 'r') as infile:
    val_flickr30k = json.load(infile)

print(f'The type of the validation set is: {type(val_flickr30k)}, the total validation samples are: {len(val_flickr30k)}')
key, value = next(iter(val_flickr30k.items())) # peek at the first (image name, captions) pair

key, value

The type of the validation set is: <class 'dict'>, the total validation samples are: 1014


('1018148011.jpg',
 ['A group of people stand in the back of a truck filled with cotton.',
  'Men are standing on and about a truck carrying a white substance.',
  'A group of people are standing on a pile of wool in a truck.',
  'A group of men are loading cotton onto a truck',
  'Workers load sheared wool onto a truck.'])
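
If your annotations come as one record per caption, as in the `image_paths`/`texts` records printed in the first section, they can be grouped into this dictionary format. A minimal sketch, assuming the first entry of `image_paths` identifies the image and that its file name is a usable key; `group_captions_by_image` is a hypothetical helper:

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def group_captions_by_image(records: Iterable[Dict]) -> Dict[str, List[str]]:
    """Build {image_name: [caption1, caption2, ...]} from per-caption records."""
    grouped = defaultdict(list)
    for record in records:
        # Assumption: the first path identifies the image; keep only its file name.
        image_name = posixpath.basename(urlparse(record['image_paths'][0]).path)
        grouped[image_name].append(record['texts'])
    return dict(grouped)


# The first record mirrors the sample printed earlier; 'caption 2' is a placeholder.
records = [
    {'image_paths': ['http://images.cocodataset.org/train2017/000000157656.jpg'],
     'texts': 'Người đàn ông đang đánh tennis ngoài sân.'},
    {'image_paths': ['http://images.cocodataset.org/train2017/000000157656.jpg'],
     'texts': 'caption 2'},
]
print(group_captions_by_image(records))
```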

Now you can extract image features using the function from ```images_features_extraction.py``` on datasets in the format shown above. If you are using a custom dataset, adjust the conditional statements in Lines 12-17 of ```images_features_extraction.py``` so that your dataset reaches the appropriate branch (i.e., ```if datasets == 'coco' or datasets == 'flickr30k' or datasets == 'your dataset name'``` and ```elif datasets == 'your dataset name': rootpath = 'the path of image source'```).

Note that if you do not use the image features we provide, you need to download the image files of the COCO and Flickr30k datasets from their official websites, then place them in the 'ViECap/annotations/coco/val2014' directory for COCO images and the 'ViECap/annotations/flickr30k/flickr30k-images' directory for Flickr30k images.
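
The branch selection described above amounts to mapping a dataset name to an image root directory. One way to keep that mapping in a single place is a lookup table. This is a minimal sketch, not the code in ```images_features_extraction.py```; `DATASET_ROOTS`, `resolve_image_path`, and the `your_dataset` entry are hypothetical:

```python
import os

# Image root directory per dataset; the first two paths are the ones used in this notebook.
DATASET_ROOTS = {
    'coco': '../annotations/coco/val2014/',
    'flickr30k': '../annotations/flickr30k/flickr30k-images/',
    'your_dataset': '../annotations/your_dataset/images/',  # hypothetical placeholder
}


def resolve_image_path(dataset: str, image_id: str) -> str:
    """Return the path of image_id under the root registered for dataset."""
    try:
        rootpath = DATASET_ROOTS[dataset]
    except KeyError:
        raise ValueError(f'unknown dataset: {dataset!r}') from None
    return os.path.join(rootpath, image_id)


print(resolve_image_path('flickr30k', '1018148011.jpg'))
```

An unknown dataset name fails early with a clear error instead of silently falling through to another branch.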

In [7]:
from PIL import Image

@torch.no_grad() # features are only stored, so no gradients are needed
def images_features_extraction(datasets, encoder, preprocess, annotations, outpath):
    # note: `device` is read from the notebook's global scope (set in an earlier cell)
    results = []
    if datasets == 'coco' or datasets == 'flickr30k': # coco, flickr30k
        # format = {image_path: [caption1, caption2, ...]} -> [[image_path, image_features, [caption1, caption2, ...]], ...]
        if datasets == 'coco':
            rootpath = '../annotations/coco/val2014/'
        elif datasets == 'flickr30k':
            rootpath = '../annotations/flickr30k/flickr30k-images/'

        flag = 0 # only the first 20 images are processed while testing the code
        for image_id in annotations:
            flag += 1
            if flag > 20:
                break
            caption = annotations[image_id]
            image_path = rootpath + image_id
            image = preprocess(Image.open(image_path)).unsqueeze(dim = 0).to(device)
            image_features = encoder.encode_image(image).squeeze(dim = 0).to('cpu') # clip_hidden_size
            results.append([image_id, image_features, caption])

    else: # nocaps
        # format = [{'split': 'near_domain', 'image_id': '4499.jpg', 'caption': [caption1, caption2, ...]}, ...]
        # format = [[image_path, image_split, image_features, [caption1, captions2, ...]], ...]
        rootpath = './annotations/nocaps/'
        for annotation in annotations:
            split = annotation['split']
            image_id = annotation['image_id']
            caption = annotation['caption']
            image_path = rootpath + split + '/' + image_id
            image = preprocess(Image.open(image_path)).unsqueeze(dim = 0).to(device)
            image_features = encoder.encode_image(image).squeeze(dim = 0).to('cpu') # clip_hidden_size
            results.append([image_id, split, image_features, caption])

    with open(outpath, 'wb') as outfile:
        pickle.dump(results, outfile)

encoder, preprocess = clip.load(clip_type, device)
outpath = './validation_set_images_features_ViT-B32.pickle'
images_features_extraction('flickr30k', encoder, preprocess, val_flickr30k, outpath)

Check the generated file!

In [8]:
with open(outpath, 'rb') as infile:
    image_features = pickle.load(infile)
print(f'The type of the extracted image features are: {type(image_features)}, the total validation samples are: {len(image_features)}')
image_features[0]

The type of the extracted image features are: <class 'list'>, the total validation samples are: 20


['1018148011.jpg',
 tensor([-8.8318e-02,  4.5288e-01,  1.1218e-01,  5.0342e-01, -6.2195e-02,
          6.6504e-01,  3.8037e-01, -8.9905e-02,  3.9258e-01,  6.2256e-02,
         -3.2056e-01, -1.3525e-01,  3.6133e-02, -6.1554e-02,  6.9763e-02,
          2.7908e-02, -3.6255e-01, -5.8887e-01, -1.9885e-01,  1.0925e-01,
         -8.0957e-01,  1.7700e-01,  8.0109e-03, -5.7666e-01, -3.5376e-01,
          4.4580e-01,  9.7900e-02, -6.3965e-01, -4.2542e-02,  3.5303e-01,
         -3.3691e-01,  4.5801e-01,  1.0632e-01,  4.4922e-01,  2.7271e-01,
          1.6321e-01, -1.4233e-01,  7.3340e-01, -7.0862e-02,  4.5288e-01,
          5.7831e-02, -5.8643e-01, -2.0959e-01,  1.1841e-01,  2.6904e-01,
          7.7148e-01,  1.0028e-01,  3.1250e-02, -2.8174e-01, -1.5173e-01,
          2.6343e-01, -9.6130e-02,  1.3513e-01, -8.7463e-02, -6.1035e-01,
          4.5825e-01, -5.6343e-03,  2.2766e-01,  1.9385e-01,  4.0625e-01,
         -6.9434e-01,  1.0522e-01,  1.6382e-01,  3.7671e-01, -2.0398e-01,
          4.0796e-0

As indicated in the [paper](https://arxiv.org/pdf/2307.16525.pdf), if you want to change the vocabulary to suit your needs, first represent it as a list. For example:

In [9]:
vocabulary = ['mouse', 'cow', 'tiger', 'rabbit', 'dragon', 'snake', 'horse', 'sheep', 'monkey', 'chicken', 'dog', 'pig']

Run the following function to obtain the corresponding feature for each prompted category in the vocabulary.

In [10]:
import os
from tqdm import tqdm

@torch.no_grad()
def generate_ensemble_prompt_embeddings(
    device: str,
    clip_type: str,
    entities: List[str],
    prompt_templates: List[str],
    outpath: str,
):
    if os.path.exists(outpath):
        with open(outpath, 'rb') as infile:
            embeddings = pickle.load(infile)
            return embeddings

    model, _ = clip.load(clip_type, device)
    model.eval()
    embeddings = []
    for entity in tqdm(entities):
        texts = [template.format(entity) for template in prompt_templates] # ['a picture of dog', 'photo of a dog', ...]
        tokens = clip.tokenize(texts).to(device)               # (len_of_template, 77)
        class_embeddings = model.encode_text(tokens).to('cpu') # (len_of_templates, clip_hidden_size)
        class_embeddings /= class_embeddings.norm(dim = -1, keepdim = True) # (len_of_templates, clip_hidden_size)
        class_embedding = class_embeddings.mean(dim = 0)       # (clip_hidden_size, ) 
        class_embedding /= class_embedding.norm()              # (clip_hidden_size, ) 
        embeddings.append(class_embedding)                     # [(clip_hidden_size, ), (clip_hidden_size, ), ...]
    embeddings = torch.stack(embeddings, dim = 0).to('cpu')
   
    with open(outpath, 'wb') as outfile:
        pickle.dump(embeddings, outfile)
    return embeddings

if __name__ == '__main__':

    # prompts from CLIP
    prompt_templates = [
        'itap of a {}.',
        'a bad photo of the {}.',
        'a origami {}.',
        'a photo of the large {}.',
        'a {} in a video game.',
        'art of the {}.',
        'a photo of the small {}.'
    ]
  
    outpath = './vocabulary_embedding_with_ensemble.pickle'
    vocabulary_embeddings = generate_ensemble_prompt_embeddings(device, clip_type, vocabulary, prompt_templates, outpath)

100%|██████████| 12/12 [00:00<00:00, 53.45it/s]


Check the generated prompted features!

In [11]:
assert vocabulary_embeddings.size()[0] == len(vocabulary)
vocabulary_embeddings.size()

torch.Size([12, 512])
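
The ensemble step inside `generate_ensemble_prompt_embeddings` normalizes each template embedding, averages them, and normalizes the mean again. The sketch below reproduces that arithmetic on toy 2-D vectors in plain Python so it can be checked by hand; `l2_normalize` and `ensemble` are hypothetical names, and the vectors are made up for illustration:

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def ensemble(template_embeddings: List[List[float]]) -> List[float]:
    """Normalize each embedding, average them, then normalize the mean."""
    normalized = [l2_normalize(v) for v in template_embeddings]
    size = len(normalized)
    mean = [sum(column) / size for column in zip(*normalized)]
    return l2_normalize(mean)


# Two toy "template" embeddings with very different lengths.
toy = [[3.0, 0.0], [0.0, 0.5]]
print(ensemble(toy))
```

Because each embedding is normalized before averaging, the longer first vector does not dominate: both directions contribute equally to the result.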

If you would like to include your dataset and vocabulary, you need to incorporate the functions ```load_your_dataset()``` and ```load_your_vocabulary()``` into ```load_annotations.py```, following the pattern of other functions within this script.
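
The signatures used by the existing loaders in ```load_annotations.py``` are not shown in this notebook, so the following is only a sketch of what the two functions might look like. It assumes the vocabulary is stored as a plain-text file with one entity per line and the dataset as a pickle produced by the steps above; adapt the names, arguments, and return types to match the other functions in the script:

```python
import pickle
from typing import List


def load_your_vocabulary(path: str) -> List[str]:
    """Read one entity per line from a text file, skipping blank lines (assumed format)."""
    with open(path, 'r', encoding='utf-8') as infile:
        return [line.strip() for line in infile if line.strip()]


def load_your_dataset(path: str) -> List[list]:
    """Load the pickled training list: [[entities, caption, ...], ...]."""
    with open(path, 'rb') as infile:
        return pickle.load(infile)
```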

Congratulations, you have now completed the data preparation. You can proceed to train the model with your own settings. Enjoy!