# Create mask dataset
The training dataset in this competition is small enough that all data can be kept in memory. This notebook builds a dictionary of masks for use in training (convenient on TPU, for example) and saves it to a pickle file.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import json
from tqdm import tqdm
from PIL import Image
import pickle

## Load data

In [None]:
df = pd.read_csv('../input/sartorius-cell-instance-segmentation/train.csv')

First, check the number of classes in the dataset:

In [None]:
df.cell_type.unique()

And the distribution (the table has one row per annotated cell, so this counts annotations rather than images):

In [None]:
hist = df.cell_type.hist()
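
Since the mask code further down reads several annotation rows per `id`, the histogram above counts cell instances, not images. A minimal sketch of how the two counts could be compared, using a hypothetical toy frame (`toy_df` is not part of the competition data) with the same two columns:

```python
import pandas as pd

# Hypothetical toy frame: one row per annotated cell, several rows per image id.
toy_df = pd.DataFrame({
    'id': ['img1', 'img1', 'img1', 'img2', 'img3', 'img3'],
    'cell_type': ['cort', 'cort', 'cort', 'astro', 'shsy5y', 'shsy5y'],
})

# Rows (annotations) per class: this is what a histogram of cell_type shows.
annotations_per_class = toy_df.cell_type.value_counts()

# Distinct images per class: the quantity that matters for an image-level split.
images_per_class = toy_df.groupby('cell_type')['id'].nunique()

print(annotations_per_class)
print(images_per_class)
```

On the real data the same two lines can be run on `df` instead of `toy_df`.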

There are three classes. Save the class-to-label mapping to a .json file here for later use.

In [None]:
CLASS_LABELS = {'shsy5y': 1, 'astro': 2, 'cort': 3}  # used for class labels

with open('classes.json', 'w') as fp:
    json.dump(CLASS_LABELS, fp, indent=4)
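
For reference, a later notebook could read the mapping back and invert it for label-to-name lookups. This is only a sketch: it writes to a temporary directory (an assumption made so the example is self-contained), and `label_to_name` is a hypothetical helper name.

```python
import json
import os
import tempfile

CLASS_LABELS = {'shsy5y': 1, 'astro': 2, 'cort': 3}

# Write to a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'classes.json')
with open(path, 'w') as fp:
    json.dump(CLASS_LABELS, fp, indent=4)

# Read the mapping back, as a downstream notebook would.
with open(path) as fp:
    class_labels = json.load(fp)

# Invert the mapping: numeric label -> cell type name.
label_to_name = {label: name for name, label in class_labels.items()}
print(label_to_name[2])  # astro
```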

## Create mask set
Mask creation below is slow (~30 min), but that is acceptable because the masks are saved to file for reuse in other notebooks. Each image contains only one class, so the masks are binary. The cell type is stored alongside each mask for train/test stratification later on.

In [None]:
# ref: https://www.kaggle.com/inversion/run-length-decoding-quick-start
def rle_decode(mask_rle, mask, color=1):
    '''
    mask_rle: run-length string formatted as "start length" pairs (1-indexed starts)
    mask: 2D numpy array (height, width) to draw the runs onto
    color: value written into the mask for decoded pixels
    Returns numpy array (mask) with the same shape as the input mask
    '''
    s = mask_rle.split()
    
    starts = list(map(lambda x: int(x) - 1, s[0::2]))
    lengths = list(map(int, s[1::2]))
    ends = [x + y for x, y in zip(starts, lengths)]
    
    img = mask.reshape((mask.shape[0] * mask.shape[1]))
            
    for start, end in zip(starts, ends):
        img[start : end] = color
    
    return img.reshape(mask.shape)
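
To make the encoding concrete, here is a small worked example on a 3x4 mask. `decode_runs` is a hypothetical stand-alone helper that mirrors the logic of `rle_decode` above (1-indexed starts, row-major flattening) so that the block runs on its own.

```python
import numpy as np

def decode_runs(mask_rle, shape):
    """Hypothetical helper: decode "start length" pairs onto a fresh mask."""
    values = [int(v) for v in mask_rle.split()]
    starts = [v - 1 for v in values[0::2]]  # RLE starts are 1-indexed
    lengths = values[1::2]
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in zip(starts, lengths):
        flat[start:start + length] = 1
    return flat.reshape(shape)

# "2 3 9 2" means: 3 pixels starting at pixel 2, then 2 pixels starting at pixel 9.
tiny = decode_runs('2 3 9 2', (3, 4))
print(tiny)
# [[0 1 1 1]
#  [0 0 0 0]
#  [1 1 0 0]]
```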

In [None]:
IMAGE_SIZE = [520, 704]

# create mask for a single id
def create_mask(iid, img_size):
    mask = np.zeros(img_size, dtype=np.uint8)
    for i in range(len(df[df.id == iid])):
        mask = rle_decode(df[df.id == iid].annotation.iloc[i], mask) 
    return mask

# create masks for the whole training set
def create_mask_dict(df):
    mdict = {}
    for iid in tqdm(df.id.unique()):
        mdict[iid] = {
            'mask': create_mask(iid, IMAGE_SIZE),
            'class': CLASS_LABELS[df[df.id == iid].cell_type.iloc[0]],
        }
    return mdict

In [None]:
mask_dict = create_mask_dict(df)
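
The functions above filter the full table once per annotation. One possible alternative, shown here only as a sketch, is to let `groupby` hand over each image's rows once. The names `decode_annotations` and `create_mask_dict_grouped` and the toy frame are hypothetical, and no timing has been measured for this variant.

```python
import numpy as np
import pandas as pd

def decode_annotations(annotations, img_size):
    """Paint every run-length string in `annotations` onto one binary mask."""
    flat = np.zeros(img_size[0] * img_size[1], dtype=np.uint8)
    for rle in annotations:
        values = np.array(rle.split(), dtype=np.int64)
        starts, lengths = values[0::2] - 1, values[1::2]  # 1-indexed starts
        for start, length in zip(starts, lengths):
            flat[start:start + length] = 1
    return flat.reshape(img_size)

def create_mask_dict_grouped(frame, class_labels, img_size):
    """Build the same {'mask', 'class'} dictionary, grouping rows by image id."""
    mdict = {}
    for iid, group in frame.groupby('id', sort=False):
        mdict[iid] = {
            'mask': decode_annotations(group['annotation'], img_size),
            'class': class_labels[group['cell_type'].iloc[0]],
        }
    return mdict

# Hypothetical toy data on a 2x3 image to show the result shape.
toy = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'annotation': ['1 2', '5 1', '3 2'],
    'cell_type': ['cort', 'cort', 'astro'],
})
toy_masks = create_mask_dict_grouped(
    toy, {'shsy5y': 1, 'astro': 2, 'cort': 3}, img_size=(2, 3))
print(toy_masks['a']['mask'])
```

On the real data this would be called as `create_mask_dict_grouped(df, CLASS_LABELS, IMAGE_SIZE)`.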

Finally write the dictionary to file:

In [None]:
# use a context manager so the file is flushed and closed
with open('mask_dict.pkl', 'wb') as fp:
    pickle.dump(mask_dict, fp)
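
A downstream notebook could read the dictionary back and use the stored class for a stratified split. The sketch below is an assumption about how that might look, not the method used in the follow-up notebook. `toy_dict` stands in for `mask_dict`, the file goes to a temporary directory, and `stratified_split` is a hypothetical NumPy-only helper.

```python
import os
import pickle
import tempfile
import numpy as np

# Hypothetical stand-in for mask_dict: six images, two per class.
toy_dict = {
    'img{}'.format(i): {'mask': np.zeros((2, 2), dtype=np.uint8), 'class': 1 + i % 3}
    for i in range(6)
}

path = os.path.join(tempfile.mkdtemp(), 'mask_dict.pkl')
with open(path, 'wb') as fp:
    pickle.dump(toy_dict, fp)

# In another notebook: read the dictionary back.
with open(path, 'rb') as fp:
    loaded = pickle.load(fp)

def stratified_split(mdict, val_fraction=0.5, seed=0):
    """Split ids so each class sends about val_fraction of its images to validation."""
    rng = np.random.default_rng(seed)
    by_class = {}
    for iid, entry in mdict.items():
        by_class.setdefault(entry['class'], []).append(iid)
    train_ids, val_ids = [], []
    for ids in by_class.values():
        order = rng.permutation(len(ids))      # shuffle within the class
        shuffled = [ids[j] for j in order]
        n_val = max(1, int(round(len(ids) * val_fraction)))
        val_ids.extend(shuffled[:n_val])
        train_ids.extend(shuffled[n_val:])
    return train_ids, val_ids

train_ids, val_ids = stratified_split(loaded)
print(len(train_ids), len(val_ids))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.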

## Check mask data
Let's plot a few images and masks to check that everything is OK.

In [None]:
ids = df.id.unique()[:4]
# reverse lookup: numeric label -> cell type name
label_names = {label: name for name, label in CLASS_LABELS.items()}
fig = plt.figure(figsize=(16, 24))
for i, iid in enumerate(ids):
    # left column: original image
    axes = fig.add_subplot(4, 2, 2*i+1)
    plt.setp(axes, xticks=[], yticks=[])
    img = Image.open('../input/sartorius-cell-instance-segmentation/train/{}.png'.format(iid))
    axes.imshow(img, cmap='gray')
    # right column: mask, titled with the cell type
    axes = fig.add_subplot(4, 2, 2*i+2)
    plt.setp(axes, xticks=[], yticks=[])
    axes.set_title('Cell type: {}'.format(label_names[mask_dict[iid]['class']]))
    axes.imshow(mask_dict[iid]['mask'])
plt.tight_layout()
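
A visual check covers only a few images. A numeric check over the whole dictionary could complement it. The following is a sketch under the assumption that every mask should be binary, non-empty, and of the expected size. `check_mask_dict` and the toy entries are hypothetical.

```python
import numpy as np

def check_mask_dict(mdict, img_size, valid_classes):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for iid, entry in mdict.items():
        mask = entry['mask']
        if mask.shape != tuple(img_size):
            problems.append('{}: unexpected shape {}'.format(iid, mask.shape))
        if not np.isin(mask, (0, 1)).all():
            problems.append('{}: mask is not binary'.format(iid))
        if mask.sum() == 0:
            problems.append('{}: mask is empty'.format(iid))
        if entry['class'] not in valid_classes:
            problems.append('{}: unknown class {}'.format(iid, entry['class']))
    return problems

# Hypothetical toy entries: one valid, one with a stray value and an unknown class.
toy_entries = {
    'good': {'mask': np.array([[0, 1], [1, 0]], dtype=np.uint8), 'class': 1},
    'bad': {'mask': np.array([[0, 2], [0, 0]], dtype=np.uint8), 'class': 7},
}
problems = check_mask_dict(toy_entries, img_size=(2, 2), valid_classes={1, 2, 3})
for line in problems:
    print(line)
```

On the real data the call would be `check_mask_dict(mask_dict, IMAGE_SIZE, set(CLASS_LABELS.values()))`.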

Seems OK! The next step is to create a [TPU-compatible dataset](https://www.kaggle.com/mistag/sartorius-create-tpu-compatible-tf-dataset) for TensorFlow.