<h1><center>Loading input data in the COCO format</center></h1>
<center><img src = "https://github.com/slawekslex/random/blob/main/segmentation.png?raw=true"/></center>

## **<span style="color:blue;">Introduction</span>**

COCO (https://cocodataset.org/) is a large, popular dataset for object detection, segmentation, and captioning. Its annotations are stored in JSON and describe object classes, bounding boxes, and bitmasks.
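
To make the format concrete, here is a minimal sketch of a COCO-style annotation file written as a Python dict. All values (the image id, file name, category name, and numbers) are made up for illustration; the key names follow the layout generated later in this notebook. The helper `bbox_to_corners` is a hypothetical name, added only to show that `bbox` is stored as `[x, y, width, height]`.

```python
import json

# A hypothetical, hand-written example; every value here is made up.
coco_example = {
    "images": [
        {"id": "img_0", "width": 4, "height": 3, "file_name": "train/img_0.png"},
    ],
    "categories": [
        {"id": 1, "name": "cell"},
    ],
    "annotations": [
        {
            "id": 0,
            "image_id": "img_0",
            "category_id": 1,
            "bbox": [1, 0, 2, 2],  # [x, y, width, height]
            "area": 4,             # number of mask pixels
            "iscrowd": 0,
            # Uncompressed RLE: 'size' is [height, width]; 'counts' alternates
            # runs of zeros and ones over the column-major flattened mask.
            "segmentation": {"size": [3, 4], "counts": [3, 2, 1, 2, 4]},
        },
    ],
}


def bbox_to_corners(bbox):
    """Convert [x, y, width, height] to (x1, y1, x2, y2), with x2/y2 exclusive."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


print(json.dumps(coco_example["annotations"][0], indent=2))
print(bbox_to_corners(coco_example["annotations"][0]["bbox"]))
```

The run lengths in `counts` always add up to `height * width`, which is a handy sanity check when writing such files by hand.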

I've created a dataset, https://www.kaggle.com/slawekbiel/sartorius-cell-instance-segmentation-coco, that converts the competition's input data into the COCO format. This makes it easy to explore the data with [pycocotools](https://github.com/cocodataset/cocoapi) and to load it directly into [detectron2](https://github.com/facebookresearch/detectron2).

In this notebook I'll show how to use it to load images and annotations in just a few lines of code.

In [None]:
!pip install pycocotools

In [None]:
from pycocotools.coco import COCO
import skimage.io as io
import matplotlib.pyplot as plt
from pathlib import Path
from PIL import Image

### Load the annotations file into a COCO dataset

In [None]:
# Competition images and the COCO-formatted annotations file
dataDir = Path('../input/sartorius-cell-instance-segmentation')
annFile = Path('../input/sartorius-cell-instance-segmentation-coco/annotations_all.json')
coco = COCO(annFile)
# Ids of all images listed in the annotations file
imgIds = coco.getImgIds()

### Load the last three images and display their object bitmasks and bounding boxes. The drawing is done by the `COCO.showAnns` function

In [None]:
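imgs = coco.loadImgs(imgIds[-3:])
# One row per image: raw image on the left, annotated image on the right
_,axs = plt.subplots(len(imgs),2,figsize=(40,15 * len(imgs)))
for img, ax in zip(imgs, axs):
    I = io.imread(dataDir/img['file_name'])
    # Fetch all annotations that belong to this image
    annIds = coco.getAnnIds(imgIds=[img['id']])
    anns = coco.loadAnns(annIds)
    ax[0].imshow(I)
    ax[1].imshow(I)
    # showAnns draws on the current axes, so select the right-hand one first
    plt.sca(ax[1])
    coco.showAnns(anns, draw_bbox=True)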

## **<span style="color:blue;">How was it generated</span>**

### **Update**: See the improved version of the generation code by Adriano Passos here: https://www.kaggle.com/coldfir3/coco-dataset-generator. It's faster and generates smaller files.


Below are the functions I used to translate the original CSV dataset into a COCO-formatted JSON file.
Note that the RLE representations are translated naively: each one is decoded into a bitmask and then encoded back. This makes the whole dataset take around 20 minutes to process, but since I only needed to do it once, I didn't spend time optimizing it.
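
To see why a translation step is needed at all, here is a small sketch comparing the two encodings on a made-up 3x4 mask. It assumes, as the `rle_decode` and `binary_mask_to_rle` functions in this notebook do, that the CSV stores 1-indexed `start length` pairs over the row-major flattened mask, while the COCO uncompressed RLE stores alternating run lengths over the column-major flattened mask. The helper names `runs_start_length` and `runs_counts` are hypothetical.

```python
import itertools
import numpy as np

# Tiny mask (height=3, width=4), made up for illustration.
mask = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0]], dtype=np.uint8)


def runs_start_length(flat):
    """1-indexed (start, length) pairs for each run of ones in a flat array."""
    pairs, pos = [], 0
    for value, group in itertools.groupby(flat.tolist()):
        n = len(list(group))
        if value == 1:
            pairs += [pos + 1, n]
        pos += n
    return pairs


def runs_counts(flat):
    """Alternating run lengths, always starting with a run of zeros."""
    counts = []
    for i, (value, group) in enumerate(itertools.groupby(flat.tolist())):
        if i == 0 and value == 1:
            counts.append(0)  # no leading zeros, so the first run has length 0
        counts.append(len(list(group)))
    return counts


csv_style = runs_start_length(mask.ravel(order='C'))   # row-major
coco_style = runs_counts(mask.ravel(order='F'))        # column-major

print('CSV style :', ' '.join(map(str, csv_style)))    # 2 2 6 2
print('COCO style:', coco_style)                       # [3, 2, 1, 2, 4]
```

Because the two encodings traverse the pixels in different orders, the runs do not map onto each other one to one, which is why the simplest route goes through the full bitmask.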

In [None]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import json,itertools

In [None]:
# From https://www.kaggle.com/stainsby/fast-tested-rle
def rle_decode(mask_rle, shape):
    '''
    mask_rle: run-length encoding as a string of "start length" pairs
    shape: (height, width) of the array to return
    Returns a numpy array, 1 - mask, 0 - background
    '''
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape)  # row-major (C-order) reshape to (height, width)

# From https://newbedev.com/encode-numpy-array-using-uncompressed-rle-for-coco-dataset
def binary_mask_to_rle(binary_mask):
    rle = {'counts': [], 'size': list(binary_mask.shape)}
    counts = rle.get('counts')
    for i, (value, elements) in enumerate(itertools.groupby(binary_mask.ravel(order='F'))):
        if i == 0 and value == 1:
            counts.append(0)
        counts.append(len(list(elements)))
    return rle
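
The `itertools.groupby` loop above visits every pixel in Python. One possible speedup, sketched here under the assumption that the input is a 2D binary NumPy array, is to locate the run boundaries with vectorized NumPy operations. `binary_mask_to_rle_np` is a hypothetical name; this is neither the code that built the dataset nor the improved generator linked above.

```python
import numpy as np


def binary_mask_to_rle_np(binary_mask):
    """Vectorized sketch of binary_mask_to_rle (hypothetical helper)."""
    binary_mask = np.asarray(binary_mask)
    # Flatten column by column, as the COCO uncompressed RLE expects.
    flat = binary_mask.ravel(order='F').astype(bool)
    size = list(binary_mask.shape)
    if flat.size == 0:
        return {'counts': [], 'size': size}
    # Positions where a pixel differs from the previous one start a new run.
    change = np.flatnonzero(flat[1:] != flat[:-1]) + 1
    boundaries = np.concatenate(([0], change, [flat.size]))
    counts = np.diff(boundaries).tolist()
    # The counts must start with a run of zeros, possibly of length 0.
    if flat[0]:
        counts = [0] + counts
    return {'counts': counts, 'size': size}


demo = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0]], dtype=np.uint8)
print(binary_mask_to_rle_np(demo))
```

Before relying on a variant like this, it is worth comparing its output with `binary_mask_to_rle` on a handful of real masks.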

In [None]:
def coco_structure(train_df):
    cat_ids = {name:id+1 for id, name in enumerate(train_df.cell_type.unique())}    
    cats =[{'name':name, 'id':id} for name,id in cat_ids.items()]
    images = [{'id':id, 'width':row.width, 'height':row.height, 'file_name':f'train/{id}.png'} for id,row in train_df.groupby('id').agg('first').iterrows()]
    annotations=[]
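    for idx, row in tqdm(train_df.iterrows(), total=len(train_df)):
        # Decode the competition RLE into a (height, width) bitmask
        mk = rle_decode(row.annotation, (row.height, row.width))
        # Tight bounding box around the mask pixels
        ys, xs = np.where(mk)
        x1, x2 = xs.min(), xs.max()
        y1, y2 = ys.min(), ys.max()
        # Re-encode the bitmask as COCO uncompressed RLE
        enc = binary_mask_to_rle(mk)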
        seg = {
            'segmentation':enc, 
            'bbox': [int(x1), int(y1), int(x2-x1+1), int(y2-y1+1)],
            'area': int(np.sum(mk)),
            'image_id':row.id, 
            'category_id':cat_ids[row.cell_type], 
            'iscrowd':0, 
            'id':idx
        }
        annotations.append(seg)
    return {'categories':cats, 'images':images,'annotations':annotations}

In [None]:
# Run it on the first three images for demonstration:
train_df = pd.read_csv('../input/sartorius-cell-instance-segmentation/train.csv')
all_ids = train_df.id.unique()
train_sample = train_df[train_df.id.isin(all_ids[:3])]
root = coco_structure(train_sample)

with open('annotations_sample.json', 'w', encoding='utf-8') as f:
    json.dump(root, f, ensure_ascii=True, indent=4)
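
Before feeding a generated file to other tools, a few consistency checks can catch mistakes early. The function below is a minimal sketch, not part of the original conversion: `check_coco_structure` is a hypothetical helper, and it assumes the structure produced by `coco_structure` above (uncompressed RLE in `segmentation`, `bbox` as `[x, y, width, height]`). The tiny `sample` dict is made up so the block runs on its own.

```python
def check_coco_structure(root):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []
    images = {img['id']: img for img in root['images']}
    cat_ids = {cat['id'] for cat in root['categories']}
    seen = set()
    for ann in root['annotations']:
        aid = ann['id']
        if aid in seen:
            problems.append(f'annotation {aid}: duplicate id')
        seen.add(aid)
        img = images.get(ann['image_id'])
        if img is None:
            problems.append(f'annotation {aid}: unknown image_id')
            continue
        if ann['category_id'] not in cat_ids:
            problems.append(f'annotation {aid}: unknown category_id')
        x, y, w, h = ann['bbox']
        if x < 0 or y < 0 or x + w > img['width'] or y + h > img['height']:
            problems.append(f'annotation {aid}: bbox outside image')
        counts = ann['segmentation']['counts']
        # Run lengths must cover every pixel of the image exactly once.
        if sum(counts) != img['width'] * img['height']:
            problems.append(f'annotation {aid}: RLE does not cover the image')
        # Odd positions hold the runs of ones, so they sum to the mask area.
        if sum(counts[1::2]) != ann['area']:
            problems.append(f'annotation {aid}: area does not match RLE')
    return problems


# Made-up example data, for illustration only.
sample = {
    'images': [{'id': 'img_0', 'width': 4, 'height': 3,
                'file_name': 'train/img_0.png'}],
    'categories': [{'id': 1, 'name': 'cell'}],
    'annotations': [{'id': 0, 'image_id': 'img_0', 'category_id': 1,
                     'bbox': [1, 0, 2, 2], 'area': 4, 'iscrowd': 0,
                     'segmentation': {'size': [3, 4],
                                      'counts': [3, 2, 1, 2, 4]}}],
}
print(check_coco_structure(sample))
```

In this notebook the same function could be called as `check_coco_structure(root)` on the structure built above.
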

In [None]:
!head -n 100 annotations_sample.json