<center>
    <h1>HuBMAP - Quick Exploratory Data Analysis</h1>
</center>

<center>
<img src="https://hubmapconsortium.org/wp-content/uploads/2019/01/HuBMAP-Retina-Logo-Color.png">
</center>

## Introduction

This is a quick exploratory analysis to get familiar with the dataset and to identify possible hurdles that might pop up down the line.

> Credit to the original [notebook](https://www.kaggle.com/code/yuriikochurovskyi/hubmap-image-eda-step-by-step-beginner-friendly) which I based this one off.

In [None]:
import os
import cv2
import tifffile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<a id='section1'></a>
    
## Importing and processing image data
### The Dataset
The dataset is comprised of TIFF files. 
- The training set is a collection of ".tiff" files
- The public test set has some additional ".tiff" files. 

The training set includes annotations in both RLE-encoded and unencoded (JSON) forms. The annotations denote segmentations of glomeruli. 

File **train.csv** contains the unique IDs for each image, as well as an RLE-encoded representation of the mask for the objects in the image. 

RLE (run-length encoding) flattens the mask matrix into a vector and records each run of object pixels (1s) as a pair: the 1-indexed position of the first pixel in the run and the number of pixels in the run. For example, the flattened mask [1 1 1 0 0 1 1] is encoded as `1 3 6 2`: a run of 3 pixels starting at position 1 (the zeroth pixel in 0-indexed terms) and a run of 2 pixels starting at position 6 (the 5th pixel in 0-indexed terms).
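
The worked example above can be reproduced with a few lines of plain Python. This is a minimal sketch for a vector that is already flattened; `rle_encode_flat` is a hypothetical helper, not part of the competition tooling.

```python
def rle_encode_flat(pixels):
    """Encode a flat 0/1 sequence as 1-indexed (start, length) pairs."""
    runs = []
    start = None  # 1-indexed start of the run currently being scanned
    for position, value in enumerate(pixels, start=1):
        if value == 1 and start is None:
            start = position                   # a new run of 1s begins here
        elif value != 1 and start is not None:
            runs += [start, position - start]  # the run ended just before `position`
            start = None
    if start is not None:                      # close a run that reaches the end
        runs += [start, len(pixels) - start + 1]
    return " ".join(str(n) for n in runs)


print(rle_encode_flat([1, 1, 1, 0, 0, 1, 1]))  # 1 3 6 2
```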

To begin, let's open and review **train.csv**, which contains the RLE mask for each image ID:

In [None]:
df = pd.read_csv('../input/hubmap-organ-segmentation/train.csv')

image_list = ['10044', '10274', '10392']
input_dir = '../input/hubmap-organ-segmentation/train_images'
output_dir = '.'
df['id'] = df['id'].astype(str)
df

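To make the images easier to plot, the function below reads an image, shrinks it to a given percentage of its original size and saves the result to `output_dir`: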

In [None]:
def resize_im(im_name, scale_percent):
    """Downscale a training image to scale_percent of its size and save it as r_<name>.tiff."""
    image_path = os.path.join(input_dir, im_name + '.tiff')
    im_read = tifffile.imread(image_path)
    height, width = im_read.shape[:2]
    new_width = int(width * scale_percent / 100)
    new_height = int(height * scale_percent / 100)
    print('File name: {}, original size (h, w): {}, resized to (h, w): {}'.format(
        im_name, (height, width), (new_height, new_width)))
    # cv2.resize expects the target size as (width, height)
    resized = cv2.resize(im_read, (new_width, new_height), interpolation=cv2.INTER_AREA)
    tifffile.imwrite(os.path.join(output_dir, 'r_' + im_name + '.tiff'), resized)

Resizing results:

In [None]:
for im in image_list:
    resize_im(im, 5)

Let's do the same with the masks: decode the relevant masks from the train.csv file, then resize each one and save it to a separate file.

The function for RLE decoding:

In [None]:
def rle2mask(mask_rle, shape):
    '''
    mask_rle: run-length encoding as a string of "start length" pairs (starts are 1-indexed)
    shape: (width, height) of the mask to decode
    Returns a numpy array of shape (height, width): 1 - mask, 0 - background
    '''
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0::2], s[1::2])]
    starts -= 1  # convert 1-indexed starts to 0-indexed positions
    ends = starts + lengths
    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    # runs are counted down the columns first, hence the reshape followed by a transpose
    return img.reshape(shape).T
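
Because the decoder reshapes to (width, height) and then transposes, run positions are counted down the columns first (column-major order). The round trip below is a small sanity check of that convention. It restates the decoder so the snippet runs on its own, and `mask2rle` is a hypothetical inverse that does not appear in the original notebook.

```python
import numpy as np


def rle2mask(mask_rle, shape):
    """Restated decoder: shape is (width, height); returns a (height, width) array."""
    s = mask_rle.split()
    starts = np.asarray(s[0::2], dtype=int) - 1   # RLE starts are 1-indexed
    lengths = np.asarray(s[1::2], dtype=int)
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, starts + lengths):
        img[lo:hi] = 1
    return img.reshape(shape).T


def mask2rle(mask):
    """Hypothetical inverse: encode a 2D 0/1 mask in column-major order."""
    pixels = np.concatenate([[0], mask.T.flatten(), [0]])  # pad so every run has two edges
    edges = np.where(pixels[1:] != pixels[:-1])[0] + 1     # 1-indexed run starts and ends
    edges[1::2] -= edges[0::2]                             # turn each end into a length
    return " ".join(str(n) for n in edges)


mask = np.array([[0, 1, 0],
                 [1, 1, 0]], dtype=np.uint8)   # height 2, width 3
rle = mask2rle(mask)
restored = rle2mask(rle, (mask.shape[1], mask.shape[0]))
print(rle)                                     # 2 3
print(np.array_equal(mask, restored))          # True
```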

Here is the function that reads the RLE mask from the DataFrame, resizes it by a given scale (in percent) and stores it in `output_dir`:

In [None]:
def resize_mask(im_name, scale_percent):
    """Decode the mask of an image, downscale it and save it as <name>.tiff."""
    im_read = tifffile.imread(os.path.join(input_dir, im_name + '.tiff'))
    height, width = im_read.shape[:2]
    mask_rle = df[df["id"] == im_name]["rle"].values[0]
    # rle2mask expects (width, height); multiply by 255 so the mask is visible as an image
    mask = rle2mask(mask_rle, (width, height)) * 255
    new_width = int(width * scale_percent / 100)
    new_height = int(height * scale_percent / 100)
    print('File name: {}, original size (h, w): {}, resized to (h, w): {}'.format(
        im_name, (height, width), (new_height, new_width)))
    # cv2.resize expects the target size as (width, height)
    resized = cv2.resize(mask, (new_width, new_height), interpolation=cv2.INTER_AREA)
    tifffile.imwrite(os.path.join(output_dir, im_name + '.tiff'), resized)
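
One side effect worth knowing: area interpolation averages the source pixels behind each output pixel, so a mask that was strictly 0/255 picks up intermediate gray values along object borders once it is downscaled. The sketch below mimics this with plain NumPy block averaging (an approximation for an integer scale factor, not a call into OpenCV) and shows how a threshold would restore a binary mask. `block_mean_downscale` and the threshold of 127 are illustrative assumptions.

```python
import numpy as np


def block_mean_downscale(mask, factor):
    """Average non-overlapping factor x factor blocks (height and width must divide evenly)."""
    h, w = mask.shape
    blocks = mask.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


mask = np.zeros((4, 4), dtype=np.uint8)
mask[0:2, 0:3] = 255                      # object covering 2 rows and 3 columns

small = block_mean_downscale(mask, 2)     # the top-right block is only half covered
binary = (small > 127).astype(np.uint8)   # re-binarize: 1 = mask, 0 = background
print(small)
print(binary)
```

For plotting, the gray border values are harmless; for training labels, re-binarizing or using nearest-neighbour interpolation may be preferable.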

Resizing results:

In [None]:
for im in image_list:
    print(im)
    resize_mask(im, 5)

All resized files:

In [None]:
os.listdir(output_dir)

<a id='section2'></a>
## Plotting samples of training images

The next step is to plot these files with their masks. Now that the images are resized, this is no longer a problem.
The function for plotting:

In [None]:
def show_image(image_id):
    fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(16, 32))
    # resize_im saves images as r_<id>.tiff; resize_mask saves masks as <id>.tiff
    image_path = os.path.join(output_dir, 'r_{}.tiff'.format(image_id))
    mask_path = os.path.join(output_dir, '{}.tiff'.format(image_id))
    image = tifffile.imread(image_path)
    mask = tifffile.imread(mask_path)
    if mask.ndim == 3:
        mask = mask[:, :, 0]
    # overlay half of the mask intensity on the first image channel
    hybr = image[:, :, 0] + mask / 2
    ax[0].imshow(image)
    ax[0].axis('off')
    ax[0].set_title('Real Image')
    ax[1].imshow(hybr)
    ax[1].axis('off')
    ax[1].set_title('Masks')
    plt.show()
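
The overlay is built by adding half of the mask intensity to the image, so masked pixels come out brighter than their surroundings. A minimal sketch of that arithmetic on made-up arrays (the values are illustrative, not taken from the dataset):

```python
import numpy as np

image = np.full((2, 2, 3), 200, dtype=np.uint8)   # stand-in for a small RGB tile
mask = np.array([[0, 255],
                 [0, 0]], dtype=np.uint8)          # one masked pixel

# The division promotes the result to float, so the sum cannot wrap around at 255.
overlay = image[:, :, 0] + mask / 2
print(overlay.dtype)                               # float64
print(overlay)
```

The result exceeds the 0-255 range, which is fine for a single-channel plot because `imshow` rescales scalar data to the colormap.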

In [None]:
%matplotlib inline
show_image(image_list[0])

In [None]:
%matplotlib inline
show_image(image_list[1])

In [None]:
%matplotlib inline
show_image(image_list[2])

<a id='section3'></a>
## Image tiling

To begin, I will split **'10044.tiff'** at its original size into tiles of 1024x1024 pixels and store all files in the folder **split**:
-	Images will be stored in the folder **split/images/**
-	Mask-files will be stored in the folder **split/masks/**
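
To get a feel for how many tiles such a split produces, here is a small sketch of the tile grid. The 3000x3000 image size is an illustrative assumption, and `tile_grid` is a hypothetical helper.

```python
def tile_grid(height, width, tile):
    """Return (row, col, tile_height, tile_width) for every tile, top-left to bottom-right."""
    grid = []
    for r in range(0, height, tile):
        for c in range(0, width, tile):
            # Slicing past the image edge is clipped, so edge tiles can be smaller.
            grid.append((r, c, min(tile, height - r), min(tile, width - c)))
    return grid


grid = tile_grid(height=3000, width=3000, tile=1024)
print(len(grid))   # 9
print(grid[-1])    # (2048, 2048, 952, 952)
```

Note that tiles on the bottom and right edges are smaller than 1024x1024 unless the image dimensions are exact multiples of the tile size.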


In [None]:
os.makedirs('split/images', exist_ok=True)
os.makedirs('split/masks', exist_ok=True)
im_name = '10044.tiff'
image_id = os.path.splitext(im_name)[0]
df = pd.read_csv('../input/hubmap-organ-segmentation/train.csv')
df['id'] = df['id'].astype(str)
split_size = 1024
im = tifffile.imread(os.path.join(input_dir, im_name))
mask_rle = df[df["id"] == image_id]["rle"].values[0]
# rle2mask expects (width, height); multiply by 255 so the mask is visible as an image
mask = rle2mask(mask_rle, (im.shape[1], im.shape[0])) * 255
for r in range(0, im.shape[0], split_size):
    for c in range(0, im.shape[1], split_size):
        im_tile = im[r: r + split_size, c: c + split_size]
        mask_tile = mask[r: r + split_size, c: c + split_size]
        # Keep every tile that contains mask pixels. Keep an empty tile only if it
        # starts at least two tiles away from the image borders (skips white margins).
        has_mask = np.sum(mask_tile) > 0
        is_inner = ((2 * split_size <= r <= im.shape[0] - 2 * split_size) and
                    (2 * split_size <= c <= im.shape[1] - 2 * split_size))
        if has_mask or is_inner:
            # tifffile writes TIFF data, so the files get a matching extension
            tifffile.imwrite(f"split/images/img{r}_{c}.tiff", im_tile)
            tifffile.imwrite(f"split/masks/img{r}_{c}.tiff", mask_tile)
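
The filtering rule in the loop above can be isolated into a small predicate, which makes it easier to check on a few coordinates. This is a sketch only; `keep_tile` and the 6144x6144 image size are hypothetical.

```python
def keep_tile(r, c, height, width, tile, mask_sum):
    """Return True if the tile starting at (r, c) would be saved by the loop above."""
    if mask_sum > 0:
        return True          # tiles containing any mask pixels are always kept
    margin = 2 * tile        # empty tiles must start at least two tiles from each border
    return margin <= r <= height - margin and margin <= c <= width - margin


size, tile = 6144, 1024
print(keep_tile(2048, 2048, size, size, tile, mask_sum=0))   # True: empty but in the interior
print(keep_tile(0, 2048, size, size, tile, mask_sum=0))      # False: empty and on the top border
print(keep_tile(0, 0, size, size, tile, mask_sum=255))       # True: contains mask pixels
```

For an image smaller than four tiles per side, the interior range is empty, so only tiles with mask pixels are kept.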
            

As a result, I have a set of images and masks (labels) for model training. Let's count them, just for information:

In [None]:
len(os.listdir('split/images'))

Now some statistics. Let's calculate the mask area for each tile, which also tells us which tiles have an empty (all-zero) mask. For convenience, I will use a pandas DataFrame indexed by file name:

In [None]:
mask_list = os.listdir('split/masks')
df=pd.DataFrame(index=mask_list)

Area calculation function:

In [None]:
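def area_calc(image_id):
    # relies on the global `mask_dir`, which is defined in the next cell
    mask_path = os.path.join(mask_dir, image_id)
    # cv2.imread returns 3 channels by default, so each mask pixel is counted three times
    mask = cv2.imread(mask_path)
    return int(np.count_nonzero(mask) / 3)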
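
The division by 3 compensates for `cv2.imread` returning a 3-channel array by default, even for a single-channel mask file. A minimal sketch with a synthetic mask (no file I/O; the 3-channel read is mimicked by stacking the mask) illustrates the arithmetic:

```python
import numpy as np

mask_2d = np.zeros((4, 4), dtype=np.uint8)
mask_2d[1:3, 1:3] = 255                       # a 2 x 2 object, i.e. an area of 4 pixels

# Mimic a 3-channel read: the same mask repeated in each channel.
mask_3ch = np.stack([mask_2d] * 3, axis=-1)

print(np.count_nonzero(mask_2d))              # 4
print(np.count_nonzero(mask_3ch))             # 12
print(int(np.count_nonzero(mask_3ch) / 3))    # 4
```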


Calculate and write mask areas (sum) per image to the DataFrame:

In [None]:
mask_dir = 'split/masks'
mask_areas=[]
for msk in mask_list:
    mask_areas.append(area_calc(msk))
df['area'] = mask_areas


In [None]:
print('Total images:', len(df))
print('Non-zero images:', len(df[df['area']!=0]))

<a id='section4'></a>
## Mask-area per image distribution and some statistics

Distribution of mask areas for tiles with a non-zero mask:

In [None]:
%matplotlib inline
fig, ax = plt.subplots(1,1,figsize=(18,8))
ax.hist(df.loc[df['area'] != 0, 'area'], bins=50, color='deeppink', edgecolor='black')  # bar heights are counts (density=False is the default)
ax.set_title('Non-zero image distribution. Image file: {}     Total images: {}'.format(
    im_name, len(df[df['area'] != 0])), fontsize=20)
ax.set_ylabel('Quantity', fontsize=16)
ax.set_xlabel('Area(pixels)', fontsize=16);
ax.grid()


In [None]:
df_sorted = df[df['area']!=0].sort_values(by=['area'])
smallest_list = df_sorted.head(5)['area'].index
largest_list = df_sorted.tail(5)['area'].index
zero_list = df[df['area']==0].head(5)['area'].index
print('Smallest:', list(smallest_list))
print('Largest:', list(largest_list))
print('Zero_list:', list(zero_list))


The code above sorts the tiles with non-zero masks by area. Now let's take a look at the tiles with the smallest and the largest mask areas.

<a id='section5'></a>
## Plotting images with the smallest and the largest mask areas

A slightly modified function for plotting the small tiles:

In [None]:
def show_image(image_name):
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(32, 16))
    # image and mask tiles share the same file name in their respective folders
    image_path = os.path.join('split/images', image_name)
    mask_path = os.path.join('split/masks', image_name)
    image = tifffile.imread(image_path)
    mask = tifffile.imread(mask_path)
    if mask.ndim == 3:
        mask = mask[:, :, 0]
    # overlay half of the mask intensity on the first image channel
    hybr = image[:, :, 0] + mask / 2
    ax[0].imshow(image)
    ax[0].axis('off')
    ax[0].set_title('Real Image')
    ax[1].imshow(hybr)
    ax[1].axis('off')
    ax[1].set_title('Masks')
    plt.show()
    

Plot 5 images with the smallest mask areas

In [None]:
%matplotlib inline
for file in smallest_list:
    show_image(file)

Plot 5 images with the largest mask areas

In [None]:
%matplotlib inline
for file in largest_list:
    show_image(file)

And some Zero-mask images:

In [None]:
%matplotlib inline
for file in zero_list:
    show_image(file)

<a id='section6'></a>
## Conclusion

This was a brief overview of the images provided by **Kaggle** for this semantic segmentation competition.
Let's see what UNet Neural Network monsters you guys are going to come up with!

[Jump to top](#section0)

> writing code under pressure is so much adrenaline.. I won't sleep for a week!