# <center> Sartorius - Cell Instance Segmentation - EDA </center>

<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/30201/logos/header.png" style="float: left; margin-right: 5px;" />


The overview of the problem and a summary of the data can be found [here](https://www.kaggle.com/c/sartorius-cell-instance-segmentation/overview) and [here](https://www.kaggle.com/c/sartorius-cell-instance-segmentation/data), respectively.

# Task

The task of this challenge is to detect and delineate distinct objects of interest in biological images depicting neuronal cell types commonly used in the study of neurological disorders. More specifically, phase contrast microscopy images are used to train and test a computer vision model for instance segmentation of neuronal cells.








In [None]:
import os
import cv2

import numpy as np 
import pandas as pd

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ['svg']

df_train = pd.read_csv('../input/sartorius-cell-instance-segmentation/train.csv')

# 1. Meta Data

The metadata is provided only for the train set, so it cannot be used as a model input at test time, but it should be used to construct a solid cross-validation strategy. Let's start by understanding its statistical properties.

The metadata, stored in the `train.csv` file, contains 7 categorical and 2 numerical features (see the table below). Each row describes a single annotated instance: the `id` column points to the image and the `annotation` column holds the mask of that instance in run-length-encoded (RLE) format. Each row also contains the `cell_type` of the image.

In [None]:
df_train.info()

In [None]:
df_train.head()

In [None]:
df_train['sample_id'].nunique()
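
To make the run-length encoding more concrete, here is a minimal sketch of how a single `annotation` string could be decoded. It assumes that the string is a sequence of `start length` pairs, that starts are 1-indexed, and that pixels are numbered row by row. The helper name `rle_decode` and the toy 3 x 4 image are illustrative and not part of the competition code.

```python
import numpy as np

def rle_decode(rle, shape):
    """Decode a 'start length start length ...' string into a binary mask.

    Assumptions: starts are 1-indexed and pixels are numbered row by row.
    """
    values = np.array(rle.split(), dtype=int).reshape(-1, 2)
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in values:
        flat[start - 1:start - 1 + length] = 1
    return flat.reshape(shape)

# Toy example on a 3 x 4 image: pixels 2-3 and 7-9 (1-indexed) are set
toy_mask = rle_decode("2 2 7 3", shape=(3, 4))
print(toy_mask)
```

A run may wrap from the end of one row to the start of the next, as the second run does in this toy example.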

## 1.1 Images

There are only 606 images in the train set, which is small for training neural network models and can easily lead to overfitting. However, this problem can be mitigated with appropriate augmentation techniques (e.g. [Ronneberger et al. 2015](https://arxiv.org/pdf/1505.04597.pdf)).

All the images have the same shape: 704 x 520 px (width x height). This is nice to have in a dataset because there won't be any complications due to a variable image resolution.

In [None]:
print(f'Number of unique images: {df_train.id.nunique()}')
print(f'Do all the images have a width of 704: {(df_train["width"]==704).all()}')
print(f'Do all the images have a height of 520: {(df_train["height"]==520).all()}')
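
As a rough illustration of the augmentation idea, the sketch below applies random flips to an image and its mask together so that they stay aligned. The function name `random_flip` and the toy arrays are assumptions made for this example. A real pipeline would likely use a dedicated augmentation library and more transformations.

```python
import numpy as np

def random_flip(image, mask, rng):
    """Apply the same random horizontal/vertical flips to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        image, mask = image[::-1, :], mask[::-1, :]   # vertical flip
    return image, mask

rng = np.random.default_rng(0)
image = np.arange(520 * 704, dtype=np.float32).reshape(520, 704)  # stand-in for a real image
mask = (image % 7 == 0).astype(np.uint8)                          # stand-in for a real mask

aug_image, aug_mask = random_flip(image, mask, rng)
```

Because all images share one resolution, flips never change the array shape, which keeps batching simple.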

The number of instances in each image is remarkably variable (see figure below). Some statistical measures are as follows:
- Most of the images have more than 47 instances annotated.
- The minimum number of instances is 4.
- The maximum number of instances is 790.

These numbers matter when configuring an instance segmentation model. For instance, Mask R-CNN implementations typically need a "maximum number of detections" per image, which has to be large enough for the most crowded images.

In [None]:
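fig, ax = plt.subplots()

# Count the annotated instances (rows) per image, sorted in ascending order
ninstances_per_image = df_train['id'].value_counts().sort_values()
print(f'Median number of instances per image: {ninstances_per_image.median()}')

# Replace the image ids on the x-axis with a simple running index
ninstances_per_image = ninstances_per_image.reset_index(drop=True)
ninstances_per_image.plot.bar(ax=ax)

ax.set_xticklabels([])
ax.set_xlabel('Images')
ax.set_ylabel('Number of Instances')
plt.show()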
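
The summary statistics listed above can be obtained directly from the per-image counts. The sketch below uses a tiny made-up frame in place of `df_train`, so the ids and counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_train: one row per annotated instance
toy = pd.DataFrame({'id': ['a'] * 4 + ['b'] * 10 + ['c'] * 7})

counts = toy['id'].value_counts()
summary = {
    'min': int(counts.min()),
    'median': float(counts.median()),
    'max': int(counts.max()),
}
print(summary)
```

The maximum is the value that a detection cap would have to accommodate.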


The number of rows in the `train.csv` file is larger than 606 because each row describes a single instance and every image contains several of them.

## 1.2 Cell Types

Each image is annotated with one of the three cell types *shsy5y*, *astro*, and *cort*. The distribution of the cell types is shown in the figure below. While the most represented cell type is *shsy5y* with ~52k instances in the train set, cell types *cort* and *astro* are annotated only ~10.5k times each. This means that, at the instance level, the data is biased towards cell type *shsy5y*.

In [None]:
fig, ax = plt.subplots(1, 1)
df_train[['cell_type']].value_counts().plot.bar(ax=ax)
ax.set_ylabel('Number of Instances')
fig.tight_layout()
plt.show()
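
To put the imbalance into numbers, the sketch below computes the share of each cell type and a set of inverse-frequency weights. It uses the approximate counts quoted above rather than exact values, and inverse-frequency weighting is only one common option, not something this notebook applies.

```python
# Approximate instance counts quoted in the text above (not exact values)
instance_counts = {'shsy5y': 52_000, 'cort': 10_500, 'astro': 10_500}

total = sum(instance_counts.values())
n_classes = len(instance_counts)

# Share of all instances that belongs to each cell type
shares = {name: count / total for name, count in instance_counts.items()}

# Inverse-frequency weights: a balanced dataset would give every class a weight of 1
weights = {name: total / (n_classes * count) for name, count in instance_counts.items()}

for name in instance_counts:
    print(f'{name}: share={shares[name]:.2f}, weight={weights[name]:.2f}')
```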

The number of unique `id` and `cell_type` combinations is equal to the number of unique images, which means that each image is associated with exactly one cell type (see the result below). This also explains the distribution of the instance counts per image in the `train.csv` file: the images with the most instances are associated with the most frequently annotated cell type, shsy5y.

In [None]:
df_train[['id', 'cell_type']].value_counts()

Since each image is associated with only one cell type, we can count the number of images associated with each cell type. The figure below shows that most of the images are associated with the cell type `cort`, even though `shsy5y` has by far the most instances. In other words, `shsy5y` images contain many more instances per image than `cort` images.

In [None]:
fig, ax = plt.subplots(1, 1)
df_train.groupby(['id','cell_type'])['cell_type'].first().value_counts().plot.bar(ax=ax)
ax.set_ylabel('Number of Images')
ax.set_xlabel('Cell Types')
fig.tight_layout()
plt.show()
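
Because every image carries exactly one cell type, a cross-validation split can be stratified at the image level. The following is a minimal sketch of that idea with pandas only. The function name `assign_folds`, the number of folds, and the toy table are assumptions for illustration, and the sketch ignores the other metadata columns, which might also be worth taking into account.

```python
import pandas as pd

def assign_folds(image_table, n_folds=5, seed=0):
    """Assign every image to a fold so that each fold has a similar cell type mix."""
    shuffled = image_table.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # Number the images within each cell type and deal them out round-robin
    shuffled['fold'] = shuffled.groupby('cell_type').cumcount() % n_folds
    return shuffled

# Toy image-level table (one row per image), standing in for
# df_train[['id', 'cell_type']].drop_duplicates()
toy_images = pd.DataFrame({
    'id': [f'img{i:02d}' for i in range(30)],
    'cell_type': ['astro'] * 10 + ['cort'] * 20,
})

folds = assign_folds(toy_images)
fold_table = pd.crosstab(folds['fold'], folds['cell_type'])
print(fold_table)
```

Splitting on images rather than on rows keeps all instances of one image in the same fold.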

# 2. Train Images

It's time to look at the images and the masks now. The figure below shows randomly selected images corresponding to each of the three distinct cell types. Each cell type has its own unique morphological properties. 

- `astro` instances are the largest. They cover a lot of space in the masks.
- `cort` instances are generally smaller than those of the other cell types and roughly circular. They don't cover much space in the masks.
- `shsy5y` instances are slightly bigger, more elongated, and more abundant than the `cort` instances. They cover more space than the `cort` cells.

In [None]:
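def make_mask(mask_files, image_shape=(520, 704), color=False):
    """Decode the RLE annotations of one image into a single 2D mask."""
    mask = np.zeros(image_shape).ravel()
    for i, mask_file in enumerate(mask_files):
        # Each annotation is a sequence of "start length" pairs
        couples = np.array(mask_file.split()).reshape(-1, 2).astype(int)
        # Starts are treated as 1-indexed here, as in rle2mask further below
        couples[:, 0] -= 1
        couples[:, 1] = couples[:, 0] + couples[:, 1]
        for start, end in couples:
            # With color=True each instance gets its own label (background stays 0)
            mask[start:end] = i + 1 if color else 1
    return mask.reshape(image_shape)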

def plot_image(image_id='0030fd0e6378'):
    fig, ax = plt.subplots(1, 2, figsize=(14,5))
    # Every image has a single cell type, so the first row is enough
    cell_type = df_train.loc[df_train['id'] == image_id, 'cell_type'].iloc[0]
    
    file_name = os.path.join(
        '../input/sartorius-cell-instance-segmentation',
        'train', image_id + '.png')
    image = plt.imread(file_name)
    mask_files = df_train.loc[df_train['id'] == image_id, 'annotation']
    mask = make_mask(mask_files)

    # Clip the display range to suppress extreme pixel values
    display_kwargs = dict(
        cmap = plt.get_cmap('winter'), 
        origin = 'upper',
        vmax = np.quantile(image, 0.99),
        vmin = np.quantile(image, 0.05)
    )

    ax[0].imshow(image, **display_kwargs)
    ax[0].set_title(f'Source [{image_id}]')
    ax[0].axis('off')
    
    # Same display range as on the left; plt.imread returns floats for PNG files
    ax[1].imshow(image, **display_kwargs)
    # Hide the background pixels so that the image stays visible under the mask
    ax[1].imshow(np.ma.masked_equal(mask, 0), alpha=0.6,
                 cmap=plt.get_cmap('seismic'), vmin=0, vmax=1)
    ax[1].set_title(f'Source [{image_id}] + Mask [{cell_type}]')
    ax[1].axis('off')
    plt.show()

select_image_ids = []
select_image_ids.append(df_train.loc[df_train['cell_type'] == 'astro', 'id'].sample(1).to_list()[0])
select_image_ids.append(df_train.loc[df_train['cell_type'] == 'cort', 'id'].sample(1).to_list()[0])
select_image_ids.append(df_train.loc[df_train['cell_type'] == 'shsy5y', 'id'].sample(1).to_list()[0])

for image_id in select_image_ids:
    plot_image(image_id)
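
The size comparison above can also be checked numerically without decoding any mask, because the area of an instance equals the sum of its run lengths. The sketch below shows the idea on a made-up frame. The helper name `rle_area` and the toy annotations are illustrative assumptions.

```python
import pandas as pd

def rle_area(rle):
    """Area of one instance in pixels: the sum of the run lengths (every second value)."""
    return sum(int(length) for length in rle.split()[1::2])

# Toy stand-in for df_train with made-up annotations
toy = pd.DataFrame({
    'cell_type': ['astro', 'astro', 'cort', 'cort'],
    'annotation': ['1 50 705 60', '10 80', '1 5 705 6', '20 9'],
})
toy['area'] = toy['annotation'].apply(rle_area)
median_area = toy.groupby('cell_type')['area'].median()
print(median_area)
```

On the real data, the same `apply` on `df_train['annotation']` followed by a `groupby('cell_type')` would give the per-type area distribution.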

# 3. Overlapping Instances

The note below, taken from the [data page](https://www.kaggle.com/c/sartorius-cell-instance-segmentation/data), indicates that the training labels contain overlapping instances. Let's quantify these overlaps.

> Note: while predictions are not allowed to overlap, the training labels are provided in full (with overlapping portions included). This is to ensure that models are provided the full data for each object. Removing overlap in predictions is a task for the competitor.

Method: Decode the run-length-encoded masks into pixel indices and count how many times each pixel occurs. Any pixel counted more than once indicates an overlap (see the code below).

In [None]:
def overlap_percentage(image_id, return_pixel_counts=False):
    """Return the fraction of annotated pixels that belong to more than one instance."""
    mask_files = df_train.loc[df_train['id'] == image_id, 'annotation']
    runs = []
    for mask_file in mask_files:
        couples = np.array(mask_file.split()).reshape(-1, 2).astype(int)
        couples[:, 1] = couples[:, 0] + couples[:, 1]
        # Expand every (start, end) run into explicit pixel indices
        runs.extend(np.arange(start, end) for start, end in couples)
    # Concatenate once instead of appending inside the loop
    result = np.concatenate(runs) if runs else np.array([], dtype=int)

    # How many instances cover each annotated pixel
    pixel_counts = pd.Series(result, name='Pixels').value_counts()
    overlap_fraction = pixel_counts[pixel_counts > 1].sum() / pixel_counts.sum()

    if return_pixel_counts:
        return overlap_fraction, pixel_counts
    else:
        return overlap_fraction

df_id_cell = df_train[['id', 'cell_type']].drop_duplicates().reset_index(drop=True)
df_id_cell['Overlap'] = 0.

for image_id in df_id_cell['id'].unique():
    df_id_cell.loc[df_id_cell['id']==image_id, 'Overlap'] = overlap_percentage(image_id)
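
Expanding every run into pixel indices can become slow for images with many instances. An alternative sketch, under the assumption that run starts are 1-indexed, accumulates the per-pixel coverage with a difference array. The names `coverage_counts` and `overlap_fraction` are introduced here for illustration, and the fraction is defined in the same way as in `overlap_percentage` above.

```python
import numpy as np

def coverage_counts(rles, n_pixels):
    """Number of instances covering each pixel, computed with a difference array."""
    diff = np.zeros(n_pixels + 1, dtype=np.int64)
    for rle in rles:
        values = np.array(rle.split(), dtype=np.int64).reshape(-1, 2)
        starts = values[:, 0] - 1          # assumed 1-indexed starts
        ends = starts + values[:, 1]
        np.add.at(diff, starts, 1)         # coverage goes up where a run begins
        np.add.at(diff, ends, -1)          # and down right after it ends
    return np.cumsum(diff[:-1])

def overlap_fraction(rles, n_pixels):
    """Share of the summed coverage that falls on pixels covered more than once."""
    counts = coverage_counts(rles, n_pixels)
    covered = counts[counts > 0]
    return covered[covered > 1].sum() / covered.sum()

# Two toy instances on a 10-pixel strip that share two pixels
toy_counts = coverage_counts(['1 4', '3 4'], n_pixels=10)
toy_fraction = overlap_fraction(['1 4', '3 4'], n_pixels=10)
print(toy_counts, toy_fraction)
```

For a full image, `n_pixels` would be `520 * 704`.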

The figure below shows the overlap fraction of each image, grouped by cell type. There are 3 images with very large overlap fractions: two of them belong to the shsy5y cell type and one to the cort cell type.

In [None]:
ax = sns.stripplot(x='cell_type', y='Overlap', data=df_id_cell, size=4, color=".3", linewidth=0)
ax.set(ylabel='Overlap Fraction')
plt.show()

In [None]:
print(f'Median overlap fraction of the entire train set: {df_id_cell["Overlap"].median() * 100:.0f} %')
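
The note quoted above also says that predictions must not overlap. One simple way to enforce this, sketched below, is to process the predicted masks in order and let each mask keep only the pixels that no earlier mask has claimed. The function name `remove_overlaps` and the ordering rule (for example, by descending confidence) are assumptions for illustration, not a rule prescribed by the competition.

```python
import numpy as np

def remove_overlaps(masks):
    """Make a list of binary masks disjoint: earlier masks keep contested pixels."""
    taken = np.zeros_like(masks[0], dtype=bool)
    cleaned = []
    for mask in masks:
        mask = mask.astype(bool) & ~taken   # drop pixels that are already claimed
        taken |= mask
        cleaned.append(mask)
    return cleaned

# Two toy masks that share one column of a 4 x 4 image
a = np.zeros((4, 4), dtype=bool)
a[:, :3] = True
b = np.zeros((4, 4), dtype=bool)
b[:, 2:] = True

cleaned = remove_overlaps([a, b])
```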

Let's identify these three images with the highest overlap fractions and visualize them and their respective masks to see what's unusual.

Method: We decode the run-length-encoded masks into an array of shape `(520, 704, number_of_instances)` and sum it over the last axis. Pixels shared by several instances then have a value greater than 1, whereas all other pixels have a value of 0 or 1.

In [None]:
outliers = df_id_cell.sort_values(by='Overlap', ascending=False).head(3)
outliers

In [None]:
def make_instance_masks(mask_files, image_shape=(520, 704), color=False):
    """Decode each RLE annotation into its own channel: (height, width, n_instances)."""
    # A separate name keeps the single-mask make_mask defined earlier usable
    masks = np.zeros(image_shape + (len(mask_files),))

    for i, mask_file in enumerate(mask_files):
        mask = np.zeros(image_shape).ravel()
        couples = np.array(mask_file.split()).reshape(-1, 2).astype(int)
        # Starts are treated as 1-indexed here, as in rle2mask further below
        couples[:, 0] -= 1
        couples[:, 1] = couples[:, 0] + couples[:, 1]
        for start, end in couples:
            # With color=True each instance gets its own label (background stays 0)
            mask[start:end] = i + 1 if color else 1
        masks[:, :, i] = mask.reshape(image_shape)
    return masks

def plot_outlier(idx):
    image_id = outliers.iloc[idx]['id']
    cell_type = outliers.iloc[idx]['cell_type']
    mask_files = df_train.loc[df_train['id'] == image_id, 'annotation']
    masks = make_instance_masks(mask_files)

    # Summing over the instance axis gives the number of instances per pixel
    fig = px.imshow(masks.sum(-1))
    fig.update_layout(title_text=f'id: {image_id}, cell_type: {cell_type}', title_x=0.5)
    fig.show()

## Outlier 1

Looks like we have an interesting case here. Outlier 1 mainly consists of duplicate instances! Note that purple pixels are seen only once (no overlap), whereas the other colors indicate pixels that are covered by more than one instance.



In [None]:
plot_outlier(0)

## Outlier 2

The overlaps in Outlier 2 are different from the ones in Outlier 1. Here, small instances overlap with larger instances.

In [None]:
plot_outlier(1)

## Outlier 3

The overlaps in Outlier 3 are not as severe as the ones in Outlier 1 and Outlier 2. Only a small number of small instances overlap with bigger instances.

In [None]:
plot_outlier(2)

These outliers have to be taken care of to avoid use of duplicated instances!
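
One possible way to find such duplicates is to compare the annotations of an image pairwise and flag pairs with a very high intersection over union (IoU). The sketch below illustrates this on toy RLE strings. The helper names and the threshold of 0.9 are assumptions chosen for the example, and the pairwise loop is quadratic in the number of instances.

```python
from itertools import combinations

def rle_pixels(rle):
    """Set of pixel positions covered by an RLE string (positions as given in the string)."""
    values = [int(v) for v in rle.split()]
    pixels = set()
    for start, length in zip(values[0::2], values[1::2]):
        pixels.update(range(start, start + length))
    return pixels

def find_duplicates(rles, iou_threshold=0.9):
    """Return index pairs of annotations whose IoU reaches the threshold."""
    pixel_sets = [rle_pixels(rle) for rle in rles]
    pairs = []
    for i, j in combinations(range(len(pixel_sets)), 2):
        union = len(pixel_sets[i] | pixel_sets[j])
        iou = len(pixel_sets[i] & pixel_sets[j]) / union if union else 0.0
        if iou >= iou_threshold:
            pairs.append((i, j))
    return pairs

# The first two toy annotations are identical, the third only partly overlaps them
toy_rles = ['1 10', '1 10', '3 10', '100 5']
duplicate_pairs = find_duplicates(toy_rles)
print(duplicate_pairs)
```

Keeping only one annotation from each flagged pair would remove the duplicated instances while leaving ordinary partial overlaps untouched.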

# 4. Bounding Box Properties

Many instance segmentation models (Mask R-CNN, for example) use bounding boxes to locate the instances. Therefore, the sizes of the bounding boxes are important for these models. Let's examine the bounding boxes of the instances.

In [None]:
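def rle2mask(rle, img_w, img_h):
    """Decode one RLE string into a binary mask of shape (img_h, img_w)."""
    array = np.fromiter(rle.split(), dtype = np.uint32)
    array = array.reshape(-1, 2)
    # Convert the 1-indexed run starts to 0-indexed positions
    array[:,0] = array[:, 0] - 1
    
    # Expand every (start, length) run into explicit pixel indices
    mask_decompressed = np.concatenate([np.arange(i[0], i[0] + i[1], dtype=np.uint32) for i in array])

    msk_img = np.zeros(img_w * img_h, dtype = np.uint8)
    msk_img[mask_decompressed] = 1
    msk_img = msk_img.reshape((img_h, img_w))
    msk_img = np.asfortranarray(msk_img)
    
    return msk_img

def instance_hg_wd(rle):
    # The images are 704 px wide and 520 px high (width first, then height)
    mask = rle2mask(rle, 704, 520)
    nonzerocoords = np.transpose(np.nonzero(mask))
    # Bounding-box extent in pixels, counting both boundary pixels
    height = nonzerocoords[:,0].max() - nonzerocoords[:,0].min() + 1
    width = nonzerocoords[:,1].max() - nonzerocoords[:,1].min() + 1
    return height, width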

def get_ins_hw():
    df_train[['ins_height', 'ins_width']] = 0
    hw = pd.DataFrame(df_train['annotation'].apply(instance_hg_wd).to_list())
    df_train['ins_height'] = hw[0]
    df_train['ins_width'] = hw[1]
    return df_train

df_train = get_ins_hw()
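
Decoding a full mask for every instance is not strictly necessary. As a sketch under the assumptions that run starts are 1-indexed and pixels are numbered row by row, the bounding-box size can be derived from the flat pixel indices alone. The helper name `rle_bbox_size` is illustrative, and the extent counts both boundary pixels.

```python
import numpy as np

def rle_bbox_size(rle, img_w=704):
    """Bounding-box (height, width) of one instance, computed from pixel indices."""
    values = np.array(rle.split(), dtype=np.int64).reshape(-1, 2)
    # Expand the runs into 0-indexed flat pixel positions
    pixels = np.concatenate([np.arange(s - 1, s - 1 + n) for s, n in values])
    # Row index = position // width, column index = position % width
    rows, cols = np.divmod(pixels, img_w)
    return int(rows.max() - rows.min() + 1), int(cols.max() - cols.min() + 1)

# Toy instance: three pixels at the start of row 0 and three at the start of row 1
size = rle_bbox_size('1 3 705 3')
print(size)
```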

In [None]:
fig = px.scatter(
    df_train, x="ins_width", y="ins_height", color="cell_type",
    labels={
         "ins_width": "Instance Width (px)",
         "ins_height": "Instance Height (px)",
         "cell_type": "Cell Type"
     },)
fig.show()

The figure above shows the instance heights and widths for each cell type. The heights and widths of the astro cells (red) are strikingly different from those of the cort cells (green) and the shsy5y cells (blue). Astro cell instances are in general wider and taller, consistent with our visual inspection.


In [None]:
print(f"Minimum instance height: {df_train['ins_height'].min()} px")
print(f"Maximum instance height: {df_train['ins_height'].max()} px")
print(f"Minimum instance width: {df_train['ins_width'].min()} px")
print(f"Maximum instance width: {df_train['ins_width'].max()} px")
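
The overall extremes printed above hide the differences between the cell types. A per-type summary could be sketched as follows, where the toy frame and its numbers are made up and only stand in for `df_train` with its `ins_height` and `ins_width` columns.

```python
import pandas as pd

# Toy stand-in for df_train after the ins_height / ins_width columns were added
toy = pd.DataFrame({
    'cell_type': ['astro', 'astro', 'cort', 'cort', 'shsy5y', 'shsy5y'],
    'ins_height': [60, 80, 10, 14, 20, 24],
    'ins_width': [70, 90, 12, 12, 30, 26],
})

# Median and maximum bounding-box size per cell type
size_summary = toy.groupby('cell_type')[['ins_height', 'ins_width']].agg(['median', 'max'])
print(size_summary)
```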