## COCO Dataset Generation and Pre-processing for Image Captioning

This notebook prepares the COCO 2014 dataset for training, validating and testing an image captioning model. It generates the JSON files the later steps need: the word dictionary, the image path files and the caption files.

__REQUIREMENTS:__  
1. train2014.zip, val2014.zip and test2014.zip in the data/coco folder, from http://cocodataset.org/#download (2014 Train, Val and Test images)  
2. image_info_test2014.json in the data/coco folder, from http://cocodataset.org/#download (2014 Testing Image Info)  
3. dataset.json in the data/coco folder, from http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip (Andrej Karpathy's training, validation and testing splits, as used in previous work)
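
Before running the cells below, it can help to confirm that the required files are in place. The following is a minimal sketch, assuming the data/coco folder is relative to the working directory. The helper name `find_missing_files` is hypothetical and is not part of generate_data.

```python
from pathlib import Path

# File names taken from the requirements list above.
REQUIRED_FILES = [
    "train2014.zip",
    "val2014.zip",
    "test2014.zip",
    "image_info_test2014.json",
    "dataset.json",
]


def find_missing_files(data_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in required if not (data_dir / name).is_file()]


# Assumption: the data folder is data/coco relative to the working directory.
missing = find_missing_files("data/coco")
if missing:
    print("Missing required files:", ", ".join(missing))
else:
    print("All required files found.")
```

An empty result means every file named in the requirements was found. It does not verify that the archives are complete or uncorrupted.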

In [None]:
# Import the necessary functions from the custom generate_data file
from generate_data import generate_trainval_json_data, generate_test_json_data

In [None]:
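# Paths are relative to the working directory, matching the data/coco
# folder named in the requirements and used for extraction further down.
split_path = 'data/coco/dataset.json'             # Karpathy split file
test_path = 'data/coco/image_info_test2014.json'  # COCO 2014 test image info
data_path = 'data/coco'                           # folder for inputs and generated files
max_captions_per_image = 5                        # caption limit per image
min_word_count = 5                                # word-count threshold for the word dictionary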

In [None]:
# Generate the JSON files in data_path: training and validation image paths, captions and the complete word dictionary.
generate_trainval_json_data(split_path, data_path, max_captions_per_image, min_word_count)
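
The internals of generate_trainval_json_data are not shown in this notebook. As a rough illustration of what a min_word_count threshold typically does, the sketch below builds a word dictionary from tokenized captions and keeps only the words that occur at least min_word_count times. The function name `build_word_dict`, the special tokens and the choice of `>=` rather than `>` are all assumptions made for illustration, not the actual implementation.

```python
from collections import Counter


def build_word_dict(tokenized_captions, min_word_count):
    """Map each sufficiently frequent word to an integer index.

    Hypothetical helper: the special tokens and index layout are assumptions,
    not taken from generate_data.
    """
    counts = Counter(word for caption in tokenized_captions for word in caption)
    # Reserve the first indices for special tokens (assumed names).
    word_dict = {"<pad>": 0, "<start>": 1, "<eos>": 2, "<unk>": 3}
    # Iterate in sorted order so index assignment is deterministic.
    for word in sorted(counts):
        if counts[word] >= min_word_count:
            word_dict[word] = len(word_dict)
    return word_dict


captions = [["a", "dog", "runs"], ["a", "cat", "sits"], ["a", "dog", "sits"]]
word_dict = build_word_dict(captions, min_word_count=2)
print(word_dict)
```

In this toy example "cat" and "runs" each occur once, so they fall below the threshold and are left out. A captioning pipeline would usually map such words to the unknown token when encoding captions.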

In [None]:
# Generate the JSON file of testing image paths in data_path.
generate_test_json_data(test_path, data_path)

__GENERATED FILES:__  
1. Word dictionary: data/coco/word_dict.json
2. Training image paths: data/coco/train_img_paths.json
3. Validation image paths: data/coco/val_img_paths.json
4. Training captions: data/coco/train_captions.json
5. Validation captions: data/coco/val_captions.json
6. Testing image paths: data/coco/test_img_paths.json
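
Once both generation cells have run, a quick sanity check is to load each file and report its top-level type and size. This sketch makes no assumption about the internal structure of the files. The helper name `summarize_json_files` is hypothetical, and the path assumes data/coco is relative to the working directory.

```python
import json
from pathlib import Path

# File names taken from the list of generated files above.
GENERATED_FILES = [
    "word_dict.json",
    "train_img_paths.json",
    "val_img_paths.json",
    "train_captions.json",
    "val_captions.json",
    "test_img_paths.json",
]


def summarize_json_files(data_dir, names=GENERATED_FILES):
    """Return {file name: (top-level type name, number of entries)}.

    Files that do not exist are reported as None.
    """
    summary = {}
    for name in names:
        path = Path(data_dir) / name
        if not path.is_file():
            summary[name] = None
            continue
        with path.open() as f:
            content = json.load(f)
        size = len(content) if isinstance(content, (dict, list)) else 1
        summary[name] = (type(content).__name__, size)
    return summary


for name, info in summarize_json_files("data/coco").items():
    print(name, info)
```

A `None` entry points to a generation step that did not run or that wrote its output somewhere else.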

Before training, validation or testing can begin, the images must be extracted from the zip archives. The cell below does this.

In [None]:
import os
import zipfile

# Extract the training, validation and testing images into data/coco/imgs.
# os.path.join supplies the path separator between data_path and the archive name.
for archive in ('train2014.zip', 'val2014.zip', 'test2014.zip'):
    with zipfile.ZipFile(os.path.join(data_path, archive), 'r') as zip_ref:
        zip_ref.extractall('data/coco/imgs')
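
After extraction, it is worth confirming that each split actually produced images. The sketch below counts the .jpg files under the extraction folder. It assumes each archive unpacks into a subfolder named after its split (train2014, val2014, test2014), which should be checked against the actual archive contents. The helper name `count_images` is hypothetical.

```python
from pathlib import Path


def count_images(img_root, splits=("train2014", "val2014", "test2014")):
    """Count .jpg files in each split subfolder of img_root (0 if absent)."""
    counts = {}
    for split in splits:
        split_dir = Path(img_root) / split
        if split_dir.is_dir():
            counts[split] = sum(1 for _ in split_dir.glob("*.jpg"))
        else:
            counts[split] = 0
    return counts


print(count_images("data/coco/imgs"))
```

A count of zero for a split suggests that its archive was missing, was extracted elsewhere, or uses a different folder layout than assumed here.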

#### Processing of the MS COCO data and generation of the necessary JSON files are complete. The data can now be used to train, validate and test the image captioning network.