# Convert dataset from MS COCO to RecordIO format

I am using a collection of existing resources from GluonCV and MXNet to perform this task. There is no need to install MXNet if this notebook is executed on a SageMaker Notebook Instance (just use a `conda_mxnet_***` kernel), though you may still need to install GluonCV.

This notebook comes with a tiny dataset in MS COCO format, located under the `data` folder. Follow the structure of the sample dataset if you want to create your own.

In [None]:
!pip install gluoncv

In [None]:
import os
import random
from matplotlib import pyplot as plt

import cv2
import mxnet as mx
import gluoncv as gcv
from gluoncv.data import COCODetection, RecordFileDetection
from gluoncv.utils import viz

import importlib
import utils

print(f'Using MXNet: {mx.__version__}')
print(f'Using GluonCV: {gcv.__version__}')

In [None]:
# Specify your own classes here
CLASSES = ['monitor', 'vase', 'camera']

DATA_ROOT = 'data/coco_like_sample'  # This would be the path to the 2017 folder if you are using the full COCO 2017 dataset
DATA_SPLIT_NAME = 'test'
IMAGE_EXT = '.png'

lst_dir_path = DATA_ROOT
images_dir_path = os.path.join(lst_dir_path, DATA_SPLIT_NAME)
lst_file_path = os.path.join(lst_dir_path, DATA_SPLIT_NAME) + '.lst'
rec_file_path = lst_file_path.replace('.lst', '.rec')
colors = [[random.randint(0, 255) for _ in range(3)] for _ in range(len(CLASSES))]
sample_idx = [0, 1]

##### Use a wrapper class to specify your own classes and image format

In [None]:
class COCOLike(COCODetection):
    CLASSES = CLASSES
    def __init__(self, root, splits):
        super(COCOLike, self).__init__(root, splits)

##### Load the included dataset and create an `.lst` file

Also display the dataset images with bounding boxes. You may want to skip plotting those images for large datasets.
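
The `utils.build_lst_record` helper ships with this notebook and its source is not shown here. As a rough illustration of what one line of a detection `.lst` file can look like, here is a minimal sketch that follows the layout used in the GluonCV custom-dataset tutorial: the record index, a header (`4`, `5`, width, height), then `class_id, xmin, ymin, xmax, ymax` for each object with coordinates normalized to the 0..1 range, and finally the image path. The name `build_lst_record_sketch` is hypothetical, it takes plain Python lists rather than arrays, and the real helper may differ.

```python
def build_lst_record_sketch(image_name, width, height, bboxes, class_ids, idx):
    """Build one tab-separated LST line for a detection sample.

    Hypothetical stand-in for utils.build_lst_record (an assumption, not the
    notebook's actual implementation).
    bboxes: list of [xmin, ymin, xmax, ymax] in pixels.
    class_ids: list of integer class indices, one per box.
    """
    # Header: 4 = header length, 5 = values per object, then image width and height
    fields = [str(idx), '4', '5', str(width), str(height)]
    for (xmin, ymin, xmax, ymax), cid in zip(bboxes, class_ids):
        # Class id first, then box corners normalized by the image size
        fields += [
            str(float(cid)),
            str(xmin / width),
            str(ymin / height),
            str(xmax / width),
            str(ymax / height),
        ]
    fields.append(image_name)
    return '\t'.join(fields)


# One 100x50 image with a single box of class 2
print(build_lst_record_sketch('a.png', 100, 50, [[10, 5, 50, 25]], [2], 0))
```

Normalizing the coordinates keeps the labels valid if the images are resized when they are packed.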

In [None]:
print(f'Creating LST file {lst_file_path}')
coco_dataset = COCOLike(root=DATA_ROOT, splits=DATA_SPLIT_NAME)
print('Dataset length:', len(coco_dataset))
sample_image_path = None
sample_image = None
count = 0
with open(lst_file_path, 'w') as lst_out:
    for idx in range(len(coco_dataset)):
        image, labels = coco_dataset[idx]
        h, w = image.shape[:2]
        image_name = os.path.split(coco_dataset._items[idx])[-1]
        bboxes, cids = labels[:, :4], labels[:, 4:5]
        lst_record = utils.build_lst_record(image_name, w, h,  bboxes, cids, idx)
        lst_out.write(lst_record + '\n')
        if idx in sample_idx:
            viz.plot_bbox(image, bboxes=bboxes, labels=cids, class_names=CLASSES)
        
print(f'- finished, {len(coco_dataset)} records written')  # len() also works for an empty dataset
plt.show()

### Build RecordIO file

Use an existing script from Apache MXNet to complete this task. If the download fails, you can use the copy of the script included alongside this notebook, named `im2rec_local.py`.

In [None]:
!wget https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py

##### Build

Run the downloaded or included script. Don't worry if you see a `count:0` message after the cell finishes: the script only reports progress once every 1000 records, so the message is a bit misleading.

In [None]:
!python im2rec.py $lst_dir_path $images_dir_path --encoding $IMAGE_EXT --num-thread 4 --pack-label

# or uncomment and run the following if you want to use the included script
#!python im2rec_local.py $lst_dir_path $images_dir_path --encoding $IMAGE_EXT --num-thread 4 --pack-label

##### Test

Extract the same samples from the created RecordIO file and display them with their bounding boxes.
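
Because the `count` message is not a reliable indicator, one way to sanity-check the output before loading it is to compare the number of entries in the `.lst` file with the number of entries in the `.idx` file that `im2rec.py` writes next to the `.rec` file. The sketch below assumes the index file is plain text with one line per packed record. The helper names are hypothetical.

```python
import os


def count_nonempty_lines(path):
    """Count the lines in a text file that are not blank."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())


def check_im2rec_outputs(lst_path):
    """Return (ok, message) after comparing the .lst file with the .rec/.idx outputs.

    Assumes the outputs share the .lst file's prefix and that the .idx file
    holds one text line per record (an assumption about the index format).
    """
    prefix = os.path.splitext(lst_path)[0]
    rec_path, idx_path = prefix + '.rec', prefix + '.idx'
    missing = [p for p in (rec_path, idx_path) if not os.path.exists(p)]
    if missing:
        return False, f'missing output files: {missing}'
    n_lst = count_nonempty_lines(lst_path)
    n_idx = count_nonempty_lines(idx_path)
    if n_lst != n_idx:
        return False, f'{n_lst} entries listed but {n_idx} records indexed'
    return True, f'{n_idx} records packed'
```

In this notebook it could be called as `check_im2rec_outputs(lst_file_path)` once the build cell has finished. A mismatch would suggest that some images could not be read and were skipped.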

In [None]:
print(f'Loading records from {rec_file_path}')
rec_dataset = RecordFileDetection(rec_file_path)
print('Dataset length:', len(rec_dataset))
for idx in sample_idx:
    img, labels = rec_dataset[idx]
    viz.plot_bbox(img, bboxes=labels[:, :4], labels=labels[:, 4:5], class_names=CLASSES)
plt.show()
