# Using Sartorius and LIVECell data

We've been given new data (the Sartorius competition set) and old data (the LIVECell dataset). I come at these problems from an image processing background, so I like to have the binary masks available in an image format.

The purpose of this notebook is to show how to generate a binary mask for both the Sartorius training data and the LIVECell training data.

Import the libraries first.

In [None]:
#from pycocotools.coco import COCO
import skimage.io as io
import matplotlib.pyplot as plt
from pathlib import Path
from PIL import Image

## Sartorius data
First, let's handle the new competition data.

### Load the training data
The first step is loading the training data and checking out the first few lines.

It is important to recognize that none of the "metadata" will be provided in the test set. So, without good reason, we really shouldn't use it, except perhaps to organize our thoughts about the data.

We'll import pandas to read the CSV, and numpy because we will need it later for manipulating numerical values.

In [None]:
import pandas as pd
import numpy as np
dataDir='../input/sartorius-cell-instance-segmentation'

traindf = pd.read_csv(dataDir+'/train.csv')
traindf.head()

## Annotations
Annotations are defined in the CSV as run-length encoded (RLE) masks. The following function converts an RLE annotation into a binary mask image.

In [None]:
# ht to https://www.kaggle.com/shivansh002/getting-started for this particular version of the function
def rle_decode(mask_rle, shape=(520, 704)):
    '''
    mask_rle: run-length encoding as a string of "start length" pairs (starts are 1-indexed)
    shape: (height, width) of the array to return
    Returns a numpy array: 1 - mask, 0 - background
    '''
    s = mask_rle.split()
    # Even positions hold the run starts, odd positions hold the run lengths
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0::2], s[1::2])]
    starts -= 1  # convert 1-indexed starts to 0-indexed
    ends = starts + lengths
    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape)  # reshape the flat array to (height, width)
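
To make the RLE format concrete, here is a minimal sketch on a toy 3 x 4 image. The decoder is a compact copy of the function above so the block runs on its own. `rle_encode` is a hypothetical helper that is not part of the competition code; I only add it to check that decoding and re-encoding round-trips. The string `'2 3 8 2'` is made-up example data.

```python
import numpy as np

def rle_decode(mask_rle, shape):
    """Decode 'start length start length ...' (1-indexed starts) into a 0/1 array."""
    s = mask_rle.split()
    starts = np.asarray(s[0::2], dtype=int) - 1
    lengths = np.asarray(s[1::2], dtype=int)
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, n in zip(starts, lengths):
        img[lo:lo + n] = 1
    return img.reshape(shape)

def rle_encode(mask):
    """Hypothetical inverse of rle_decode, used here only as a round-trip check."""
    # Pad with zeros so runs touching the first or last pixel are detected
    pixels = np.concatenate([[0], mask.flatten(), [0]])
    # Positions (1-indexed) where the value changes: run starts and run ends
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    # Turn each (start, end) pair into (start, length)
    runs[1::2] -= runs[::2]
    return ' '.join(str(x) for x in runs)

# Two runs: 3 pixels starting at pixel 2, and 2 pixels starting at pixel 8
toy = rle_decode('2 3 8 2', shape=(3, 4))
print(toy)
roundtrip = rle_encode(toy)
print(roundtrip)
```

Note how the second run wraps from the end of one row onto the start of the next, because the runs are counted along the flattened image.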

## Images and masks

Now that we've got our function to generate masks from the CSV data, let's see if we can visualize an image and its corresponding mask.

We can find an individual image name from the 'id' field. There are multiple entries per image (one per annotated cell), so we will just grab the first unique entry.

In [None]:
imagename = traindf['id'].unique()[0]
print(imagename)

Let's load and plot that image.

We'll use skimage to read in the image and matplotlib to plot it.

In [None]:
import skimage
from skimage.io import imread
import matplotlib.pyplot as plt

image = imread(dataDir + '/train/' + imagename + '.png')

## Now plot that image (selecting for grayscale since it is not a color image)
plt.imshow(image,cmap='gray', vmin=0, vmax=255)

Now let's plot the mask (using matplotlib again).

First, we need to make a sub-dataframe that has only the masks pertaining to the image we are interested in.

Next, we will make one giant mask that combines all the individual masks for that image.

In [None]:

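# Keep only the rows (one per cell instance) that belong to this image
dfimage = traindf[traindf['id'] == imagename]

print("Number of masks for " + imagename + ": " + str(len(dfimage)))

allMask = np.zeros(image.shape)
# Iterate over the values directly: dfimage keeps the row labels of traindf,
# so dfimage['annotation'][i] only works when those labels happen to start at 0
for annotation in dfimage['annotation']:
    allMask = allMask + rle_decode(annotation)

# > 0 keeps pixels covered by more than one mask, which == 1 would drop
plt.imshow(allMask > 0)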
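
One thing to watch when adding masks together: a pixel covered by two instances sums to 2, so a test of `== 1` misses it while `> 0` does not. Here is a minimal sketch on toy data (the two small masks are made up for illustration). It also shows one possible alternative, an instance-label mask where each cell gets its own integer, which may be handy for instance segmentation. This is a sketch, not the competition's required format.

```python
import numpy as np

# Two toy 4 x 5 instance masks that overlap in exactly one pixel (hypothetical data)
mask_a = np.zeros((4, 5), dtype=np.uint8)
mask_a[0:2, 0:3] = 1
mask_b = np.zeros((4, 5), dtype=np.uint8)
mask_b[1:3, 2:5] = 1

# Summing counts how many instances cover each pixel
summed = mask_a.astype(int) + mask_b
binary_eq1 = summed == 1   # drops the overlapping pixel
binary_any = summed > 0    # keeps every pixel covered by at least one instance

# Instance-label mask: 0 is background, 1..N identify the instances.
# Where instances overlap, the later instance overwrites the earlier one.
labels = np.zeros(mask_a.shape, dtype=np.uint16)
for idx, m in enumerate([mask_a, mask_b], start=1):
    labels[m > 0] = idx

print(binary_eq1.sum(), binary_any.sum())
print(labels)
```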


In [None]:
# And we can look at the mask overlaid on top like this:
plt.imshow(image, cmap='gray', vmin=0, vmax=255)
plt.imshow(allMask > 0, cmap='jet', alpha=0.5)  # > 0 also keeps pixels where masks overlap

## LIVECell data

Let's also load some data from the LIVECell dataset. It is in COCO format. I tried loading it with pycocotools, but after much wrestling it turns out that the LIVECell annotation file won't load that way. Also, even though the file is in COCO format, the annotations are stored as polygons, not as RLE.

So let me walk you through how to load a LIVECell image and its corresponding segmentations.

First, we really do need pycocotools installed, because its mask functions handle the polygon conversion.

In [None]:
!pip install pycocotools

## Loading data via json

Because of the pycocotools errors I'm getting, I can't load the file directly with the `COCO` class, so I load it with `json` first.

In [None]:
import json
from pycocotools.coco import COCO
from pycocotools import _mask
annFile = '../input/sartorius-cell-instance-segmentation/LIVECell_dataset_2021/annotations/LIVECell_single_cells/shsy5y/livecell_shsy5y_train.json' 

with open(annFile) as f:
    data = json.load(f)
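
The cells that follow treat `data['annotations']` as a dictionary, whereas the COCO format stores annotations as a list. I suspect, but have not verified, that this mismatch is related to the loading errors. Here is a minimal sketch of a helper that returns the annotations as a list either way. `annotation_list` and the toy dictionaries are hypothetical names and data for illustration only.

```python
def annotation_list(data):
    """Return the annotations as a list, whether they are stored as a dict or a list."""
    anns = data['annotations']
    if isinstance(anns, dict):
        # Dict layout: keep the annotation records, drop the keys
        return list(anns.values())
    return list(anns)

# Toy stand-ins for a loaded JSON file (made-up values)
toy_dict_style = {'annotations': {'1': {'id': 1, 'image_id': 7},
                                  '2': {'id': 2, 'image_id': 7}}}
toy_list_style = {'annotations': [{'id': 1, 'image_id': 7}]}

print(len(annotation_list(toy_dict_style)))
print(len(annotation_list(toy_list_style)))
```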


You can then turn an annotation into a mask by combining two pycocotools functions:
1. `frPoly`, which converts the polygon into RLE
2. `decode`, which converts the RLE into a binary mask

If someone finds a better way to do this, I'm happy to hear it. I wasn't able to find a way to convert straight from the polygon format into a mask, but this seems to work.

In [None]:
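# Convert only the first annotation, as a quick check
for key in data['annotations'].keys():
    # frPoly converts the polygon(s) to RLE for a 520 x 704 (height x width) image
    rle = _mask.frPoly(data['annotations'][key]['segmentation'], 520, 704)
    print(rle)
    # decode returns an array of shape (height, width, number of polygons)
    mask = _mask.decode(rle)
    print(mask.shape)

    break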

In [None]:
plt.imshow(mask)
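
As a possible alternative to `frPoly` plus `decode`, a polygon can be rasterized straight into a mask with Pillow's `ImageDraw`. This is a minimal sketch, assuming each polygon is a flat `[x1, y1, x2, y2, ...]` list as in COCO. `polygons_to_mask` is a hypothetical helper name, the toy rectangle is made-up data, and pixels on the polygon edge may not match the pycocotools result exactly.

```python
import numpy as np
from PIL import Image, ImageDraw

def polygons_to_mask(segmentation, height, width):
    """Rasterize a list of flat [x1, y1, x2, y2, ...] polygons into a 0/1 mask."""
    canvas = Image.new('L', (width, height), 0)   # PIL sizes are (width, height)
    draw = ImageDraw.Draw(canvas)
    for poly in segmentation:
        if len(poly) < 6:
            continue  # fewer than three points is not a polygon
        points = list(zip(poly[0::2], poly[1::2]))  # pair up the x and y values
        draw.polygon(points, outline=1, fill=1)
    return np.array(canvas, dtype=np.uint8)

# Toy rectangle with corners (2, 2), (7, 2), (7, 6), (2, 6) on a 10 x 12 image
toy_segmentation = [[2, 2, 7, 2, 7, 6, 2, 6]]
toy_mask = polygons_to_mask(toy_segmentation, height=10, width=12)
print(toy_mask.shape)
print(toy_mask)
```

Because all polygons of one annotation are drawn onto the same canvas, this returns a single 2D mask per annotation rather than one channel per polygon.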

In [None]:
from pycocotools import _mask

# image_info avoids overwriting the Sartorius `image` array loaded earlier
for image_info in data['images']:
    print(image_info['file_name'] + ' ' + str(image_info['id']))
    tempimg = imread('../input/sartorius-cell-instance-segmentation/LIVECell_dataset_2021/images/livecell_train_val_images/SHSY5Y/' + image_info['file_name'])

    allMask = np.zeros(tempimg.shape)

    # Add up the masks of every annotation that belongs to this image
    for key in data['annotations'].keys():
        if image_info['id'] == data['annotations'][key]['image_id']:
            rle = _mask.frPoly(data['annotations'][key]['segmentation'], 520, 704)
            mask = _mask.decode(rle)
            # Only the first polygon of each annotation is used here
            allMask = allMask + mask[:, :, 0]

    plt.imshow(tempimg, cmap='gray', vmin=0, vmax=255)
    # > 0 keeps pixels covered by more than one mask, which == 1 would drop
    plt.imshow(allMask > 0, cmap='jet', alpha=0.5)
    break  # only show the first image
   

In [None]:
plt.imshow(tempimg,cmap='gray', vmin=0, vmax=255)
# plt.imshow(allMask==1, cmap='jet', alpha=0.5) # interpolation='none'
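
The overlay loop above scans every annotation for each image, which is fine for a single image but gets slow when looping over all of them. One option is to group the annotations by `image_id` once, up front. This is a minimal sketch on toy data. `index_by_image` is a hypothetical helper and the toy annotations are made up, with only the `image_id` field taken from the notebook.

```python
from collections import defaultdict

def index_by_image(annotations):
    """Group annotation records by their image_id in a single pass."""
    by_image = defaultdict(list)
    for ann in annotations:
        by_image[ann['image_id']].append(ann)
    return by_image

# Toy stand-in for data['annotations'] (made-up values)
toy_annotations = {
    '1': {'id': 1, 'image_id': 7},
    '2': {'id': 2, 'image_id': 9},
    '3': {'id': 3, 'image_id': 7},
}

by_image = index_by_image(toy_annotations.values())
for image_id, anns in by_image.items():
    print(image_id, [a['id'] for a in anns])
```

With the real file, `index_by_image(data['annotations'].values())` would give a lookup from image id to that image's annotations.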

Next steps will include exporting these masks for use in building training datasets.
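
As a starting point for that export, here is a minimal sketch that writes a combined mask to a PNG with Pillow and reads it back. `save_mask_png`, the file name, and the temporary directory are assumptions for illustration. The 0/255 encoding is just one choice that makes the file easy to view.

```python
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image

def save_mask_png(mask, path):
    """Save a mask as an 8-bit PNG: 255 where the mask is set, 0 elsewhere."""
    binary = (np.asarray(mask) > 0).astype(np.uint8) * 255
    Image.fromarray(binary).save(path)

# Toy 4 x 6 mask with a 2 x 3 block of foreground pixels (made-up data)
toy_mask = np.zeros((4, 6))
toy_mask[1:3, 2:5] = 1

out_path = Path(tempfile.mkdtemp()) / 'toy_mask.png'
save_mask_png(toy_mask, out_path)

# Read the file back to confirm the mask survived the round trip
reloaded = np.array(Image.open(out_path))
print(reloaded.shape, reloaded.max())
```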