## Introduction to the Sartorius Data Challenge ##

This notebook is intended to help a new user get acquainted with the Kaggle challenge
titled "[Sartorius - Cell Instance Segmentation][k_challenge]". It makes no predictions
and does not do the work necessary to submit a prediction file, but it should help you
get started.

The challenge involves segmenting individual neuronal cells in microscopic images.
Current solutions using computer vision have limited accuracy, depending on the
cell lines and shapes in the images. The goal of this challenge will be to find
an efficient solution that will save time and effort.

The scoring is based on assigning pixels correctly, as signal or background. The challenge is not just to count the cell bodies in an image, but to determine the borders of each body, even when its shape is irregular.
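
One common way to put a number on how well pixels were assigned is the intersection over union (IoU) of a predicted mask and the annotated mask. The sketch below is only an illustration on a toy 4 x 4 grid. The function name `iou` is made up for this example, and the competition's own metric is defined on its evaluation page, so treat this as a way to build intuition rather than as the scoring code.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over union of two boolean masks with the same shape."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        # Two empty masks: define the overlap as 0 to avoid dividing by zero
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / union)

# Toy "annotated" cell covering 4 pixels
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True

# Toy prediction that spills one column to the right (6 pixels)
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True

print(iou(truth, pred))
```

Here the two masks share 4 pixels and together cover 6, so the result is about 0.67. A prediction with the right cell count but sloppy borders would score low on a measure like this.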

(The issue of assigning pixels when two cells overlap needs to be dealt with, because the submission requires assigning non-overlapping cells. In the discussion forum, the [competition host posted a comment][comment_challenge] that mentioned "the amount of overlap in the evaluation dataset is very low (lower than the training dataset)".)
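
A minimal sketch of one way to make a set of masks non-overlapping is shown below. It assumes each cell is available as a boolean array of the same shape, and it simply lets the earlier mask in the list keep any contested pixel. That tie-breaking rule and the name `remove_overlaps` are choices made for this example, not something prescribed by the competition.

```python
import numpy as np

def remove_overlaps(masks):
    """Return boolean masks in which every pixel belongs to at most one mask.

    Masks earlier in the list keep the pixels they share with later masks.
    """
    taken = np.zeros(np.asarray(masks[0]).shape, dtype=bool)
    result = []
    for mask in masks:
        mask = np.asarray(mask, dtype=bool)
        clean = mask & ~taken      # drop pixels already claimed by an earlier mask
        result.append(clean)
        taken |= clean
    return result

# Two toy 2 x 2 cells that share the centre pixel of a 3 x 3 image
a = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 0]], dtype=bool)
b = np.array([[0, 0, 0],
              [0, 1, 1],
              [0, 1, 1]], dtype=bool)

clean_a, clean_b = remove_overlaps([a, b])
print(clean_a.astype(int))
print(clean_b.astype(int))
```

In this toy case the first cell keeps all 4 of its pixels and the second cell loses the shared centre pixel. Other rules, such as giving the pixel to the larger or the more confident prediction, would fit the same structure.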

The [challenge data directory][d_challenge] consists of:
- train.csv, which describes the images in the training set
- train directory, which has 606 PNG images of roughly 250 kB each
- test directory, which has 3 PNG images. The hidden test set of roughly 240 images can only be accessed by your notebook when you submit.
- train_semi_supervised and LIVECell_dataset_2021 directories, for additional information and testing. Those will not be examined here.

[k_challenge]: https://www.kaggle.com/c/sartorius-cell-instance-segmentation/overview
[d_challenge]: https://www.kaggle.com/c/sartorius-cell-instance-segmentation/data
[comment_challenge]: https://www.kaggle.com/c/sartorius-cell-instance-segmentation/discussion/280250#1571081

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as img

%matplotlib inline

# Set to False if using local copies
USE_KAGGLE_INPUT = True

if USE_KAGGLE_INPUT:
    dataDir = '/kaggle/input/sartorius-cell-instance-segmentation/'
else:
    dataDir = ''

train_file = dataDir + 'train.csv'
org_df = pd.read_csv(train_file)

### General stats for the CSV file ###

The training CSV file has 606 file identifiers, which matches the number of image files
in the train folder.
The average number of cells annotated for each image is 120, with as few as 4 and as many
as 790.

The image files are all the same size (704 x 520 pixels), but it will be useful to keep
the image size flexible, in case we want to use other images while developing the code.

The 3 cell types are called 'shsy5y', 'astro', and 'cort'.

(I have not explored the other information in the file yet, and will not use it in this notebook.)

In [None]:
# info() prints its summary itself and returns None, so it is not wrapped in print()
org_df.info(show_counts=True)
print(org_df.nunique())

# Number of annotation rows (one row per cell) for each image id
annot_list = org_df.groupby('id').size().to_numpy()

print("Unique image files: ", annot_list.size)
print('Average number of annotations per file: ', np.average(annot_list))
print('Min number of annotations per file: ', np.min(annot_list))
print('Max number of annotations per file: ', np.max(annot_list))
print("Image width : ",org_df['width'].unique())
print("Image height: ",org_df['height'].unique())
print("Cell types: ",org_df['cell_type'].unique())
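
The same per-image counts can also be broken down by cell type. The sketch below uses a tiny made-up table in place of train.csv, keeping only the `id` and `cell_type` columns, so the numbers it prints are not real statistics from the challenge data. The names `toy_df`, `per_image`, and `summary` are hypothetical.

```python
import pandas as pd

# Toy stand-in for train.csv: one row per annotated cell
toy_df = pd.DataFrame({
    "id": ["img_a", "img_a", "img_a", "img_b", "img_b", "img_c"],
    "cell_type": ["cort", "cort", "cort", "astro", "astro", "shsy5y"],
})

# Count the annotation rows for each (cell type, image) pair
per_image = (
    toy_df.groupby(["cell_type", "id"])
    .size()
    .rename("n_cells")
    .reset_index()
)

# Summarise the per-image counts within each cell type
summary = per_image.groupby("cell_type")["n_cells"].agg(["count", "mean", "min", "max"])
print(summary)
```

With the real data, replacing `toy_df` by `org_df` should give the number of images and the spread of annotations per image for each of the three cell types.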

### Load an image ###

To start, I manually downloaded the first 5 PNG files that I saw in the Kaggle Data Explorer. In this case, I want to use the 5th one, named "029e5b3b89c7.png", which sits at index 2 of the list in the code below.

At first, I used the pypng library to load the image, where I learned that the image was
in greyscale. I decided to drop that library, and just use matplotlib (and assume that
all images are greyscale). The dimensions of the image are what we expected, and
the pixels range in value from 0 to 1 (float).

I found it difficult to see the details in the plot without making the image larger.
To help with that, I made a second image in which all of the pixels with values > 0.5
are set to 1. This makes the background bright, and helps me to see the
structures better.

In [None]:
image_list = [
    "01ae5a43a2ab",
    "026b3c2c4b32",
    "029e5b3b89c7",
    "0030fd0e6378",
    "0140b3c8f445"
]
id_image = 2

# reading png image file
pngFile = "{}train/{}.png".format(dataDir, image_list[id_image])
im = img.imread(pngFile)

im2 = im.copy()
im2[im2 > 0.5] = 1.0

# show image(s)
fig, axs = plt.subplots(1, 2)
fig.set_size_inches(14.0, 8.0)
axs[0].set_title('normal')
axs[0].imshow(im, cmap='gray')
axs[1].set_title('altered')
axs[1].imshow(im2, cmap='gray')
plt.show()

print("image shape (rows, columns): ",im.shape)
print("min pixel value = ",np.min(im))
print("max pixel value = ",np.max(im))
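
To see exactly which pixels the thresholding step changes, here is the same operation on a tiny made-up array instead of a real image. The array `toy` and its values are invented for this illustration.

```python
import numpy as np

# Made-up greyscale values in the 0 to 1 range
toy = np.array([[0.20, 0.45, 0.55],
                [0.50, 0.51, 0.90]], dtype=np.float32)

brightened = toy.copy()              # copy first, so the original stays untouched
brightened[brightened > 0.5] = 1.0   # strictly greater than 0.5 becomes white

print(brightened)
```

Values of exactly 0.5 or less are left alone, so the darker structures keep their detail while everything brighter is flattened to white.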

### Annotated cells ###

Next I wanted to see where the identified cells were in the image.

The annotation format is described on the [Overview/Evaluation][a_challenge] page, under
the subheading "Submission File".

Each line in the train.csv file describes one cell in an image. The "annotation" is a
string of number pairs. In each pair, the first number is the starting pixel index, and
the second number is the run length.
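
A small worked example may make the number pairs easier to picture. The sketch below decodes a made-up annotation string onto a 3 x 4 grid. It assumes that start indices are one-indexed and that pixels are numbered left to right, then top to bottom. Both assumptions should be checked against the evaluation page, and if the starts turn out to be zero-indexed the `- 1` should be dropped. The function name `rle_decode` is hypothetical.

```python
import numpy as np

def rle_decode(annotation, shape):
    """Decode 'start length start length ...' into a boolean mask of shape (rows, columns)."""
    numbers = np.array(annotation.split(), dtype=int)
    starts = numbers[0::2] - 1       # assumption: start indices are one-indexed
    lengths = numbers[1::2]
    flat = np.zeros(shape[0] * shape[1], dtype=bool)
    for start, length in zip(starts, lengths):
        flat[start:start + length] = True
    # Assumption: pixels run left to right, then top to bottom (row-major order)
    return flat.reshape(shape)

# Two runs: 2 pixels starting at pixel 2, and 3 pixels starting at pixel 6
mask = rle_decode("2 2 6 3", shape=(3, 4))
print(mask.astype(int))
```

Under those assumptions the first run marks the 2nd and 3rd pixels of the top row, and the second run marks the last three pixels of the middle row.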

For this initial scan of the data, I didn't want to worry about coloring the plots
or dealing with alpha transparency values. I made a black & white mask showing the
cell locations, and put that next to the image for comparison.

[a_challenge]: https://www.kaggle.com/c/sartorius-cell-instance-segmentation/overview/evaluation

In [None]:
temp_df = org_df[org_df['id'] == image_list[id_image]]

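def annot_mask(masker, annot_string, val= 0.0):
    """Set the pixels covered by one annotation string to val on a flat mask (in place)."""
    an_arr = np.array(annot_string.split(), dtype=int)
    for ind in range(0, an_arr.size, 2):
        # Start indices are assumed to be one-indexed (see the evaluation page),
        # so subtract 1 to get a zero-based numpy index
        ix = an_arr[ind] - 1
        iy = ix + an_arr[ind+1]
        masker[ix:iy] = val
    return masker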

masker = np.ones(im.shape[0]*im.shape[1])
for an_string in temp_df['annotation']:
    masker = annot_mask(masker, an_string)

masker = masker.reshape(im.shape)

# show image(s)
fig, axs = plt.subplots(1, 2)
fig.set_size_inches(14.0, 8.0)
axs[0].imshow(im2, cmap='gray')
axs[0].set_title('altered')
axs[1].imshow(masker, cmap='gray')
axs[1].set_title('annotated cells')
plt.show()

print("Number of annotations: ",temp_df.shape[0])
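
The black and white mask shows where cells are, but not which pixels belong to which cell. As a possible next step, the sketch below builds a labelled mask in which each annotation gets its own integer, using two made-up annotation strings on a 3 x 4 grid. It assumes one-indexed start values and row-major pixel order, and the name `label_mask` is hypothetical.

```python
import numpy as np

def label_mask(annotations, shape):
    """Build an integer mask: 0 is background, k marks the pixels of the k-th annotation."""
    labels = np.zeros(shape[0] * shape[1], dtype=np.int32)
    for cell_number, annotation in enumerate(annotations, start=1):
        numbers = np.array(annotation.split(), dtype=int)
        starts = numbers[0::2] - 1   # assumption: start indices are one-indexed
        lengths = numbers[1::2]
        for start, length in zip(starts, lengths):
            # A later cell overwrites an earlier one where they overlap
            labels[start:start + length] = cell_number
    return labels.reshape(shape)

# Two made-up cells, each a 2 x 2 block
labels = label_mask(["1 2 5 2", "7 2 11 2"], shape=(3, 4))
print(labels)
```

Passing something like `temp_df['annotation']` and `im.shape` to this function, and then showing the result with `imshow` and a multi-colour colormap, would be one way to give each cell its own colour.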

### Conclusion ###

There is much to do before a submission can be made, but sometimes you need to tread
water in the shallow part of the pool before swimming in the deep end. Good luck
with your algorithms!