## HuBMAP: Let's Visualize and Understand the Dataset

### Release note

- Version1: First release

- Version2: Reorganized and added information to make it easier to understand.

- Version3: Added how to visualize anatomical-structure.json.

- Version4: Added how to visualize glomerulus_segmentation_file.

- Version5: Added an explanation of the background.

- Version6: Updated for the additional data.

## Contents

1. [Introduction](#1)
1. [Loading and overviewing dataset](#2)
1. [Read and show Kidney image data](#3)
1. [Visualization of HuBMAP-20-dataset_information](#4)

<a id="1"></a> <br>
# <div class="alert alert-block alert-info">Introduction</div>

## Goal

The goal of this competition is to develop a segmentation algorithm that identifies glomeruli in the kidney.

We are given histological images of the kidney and annotations representing the glomerular segmentation. We can also use anatomical structure segmentations and additional information (including anonymized patient data) about each image.

## About Glomerulus

The glomerulus is one component of the nephron. Each kidney is said to contain about one million nephrons. You can get a sense of how numerous they are [later](#3), when we visualize the dataset. The figure below is taken from [reference [1]](#101). A nephron has roughly three components: the glomerulus, Bowman's capsule, and the tubule.

A glomerulus is a tuft of capillaries surrounded by Bowman's capsule. The name comes from the fact that it looks like a ball of thread under the microscope. It acts like filter paper.
Plasma (the non-cellular component of blood) sent to the kidney is filtered as it passes through the capillaries of the glomerulus. Part of it passes into Bowman's capsule as primary urine.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Bowman%27s_capsule_and_glomerulus.svg/1280px-Bowman%27s_capsule_and_glomerulus.svg.png" width="250">

## Notes for new participants

Keep in mind that notebooks published some time ago may no longer match the current state of the competition. Check when a notebook was last run and which version of the data it used.

The competition [updated its dataset](https://www.kaggle.com/c/hubmap-kidney-segmentation/discussion/224826), and [this discussion](https://www.kaggle.com/c/hubmap-kidney-segmentation/discussion/207884) explains how that came about and what precautions to take after the update.

<a id="2"></a> <br>
# <div class="alert alert-block alert-success">Loading and overviewing dataset</div>

## Load Library

In [None]:
import collections.abc
import json
import os
import uuid

import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw, ImageFilter
import tifffile as tiff 
import seaborn as sns

## Load data

There are three CSV files and two directories.

In [None]:
!ls ../input/hubmap-kidney-segmentation/

### train.csv

train.csv has one row per training image; `train.info()` below shows how many. The `id` column matches the files in the train directory, and the `encoding` column holds each mask in run-length encoding (RLE).

In [None]:
train = pd.read_csv("../input/hubmap-kidney-segmentation/train.csv")
train.info()

In [None]:
train.head()

### HuBMAP-20-dataset_information.csv

This file contains additional information (including anonymized patient data) about each image. I'll visualize it in a [later section](#4).

In [None]:
ds_info = pd.read_csv("../input/hubmap-kidney-segmentation/HuBMAP-20-dataset_information.csv")
ds_info.info()

In [None]:
ds_info.head()

### train directory

The tiff files are the kidney images. The json files contain the unencoded annotations.

In [None]:
!ls ../input/hubmap-kidney-segmentation/train

There are two kinds of json files. The glomerulus segmentation files are explained [here](#99), and the anatomical structure files [here](#100).

### test directory

In [None]:
!ls ../input/hubmap-kidney-segmentation/test

<a id="3"></a> <br>
# <div class="alert alert-block alert-warning">Read and show Kidney image data</div>

We are given histological images of the kidney in TIFF format. We can load them with the tifffile module and display them with matplotlib.
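The slides in this dataset can be tens of thousands of pixels on a side, so drawing every pixel with `plt.imshow` can be slow. As a rough sketch (the helper name `subsample` is hypothetical, not part of tifffile or the competition code), a strided slice gives a quick nearest-neighbour thumbnail that works for both images and masks:

```python
import numpy as np


def subsample(image, step):
    """Keep every `step`-th pixel along height and width.

    Works for (H, W, C) images and (H, W) masks alike, because only the
    first two axes are sliced.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return image[::step, ::step]


# Toy array standing in for a loaded slide (assumption: height x width x RGB)
toy_image = np.zeros((100, 80, 3), dtype=np.uint8)
thumb = subsample(toy_image, 10)
print(thumb.shape)
```

Something like `plt.imshow(subsample(image_1, 8))` should then give an overview at a fraction of the cost; the crops shown later still use the full-resolution array.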

In [None]:
img_id_1 = "aaa6a05cc"
image_1 = tiff.imread('../input/hubmap-kidney-segmentation/train/' + img_id_1 + ".tiff")
print("This image's id:", img_id_1)
image_1.shape

In [None]:
plt.figure(figsize=(15, 15))
plt.imshow(image_1)

A glomerulus looks like this:

In [None]:
plt.figure(figsize=(8,8))
plt.imshow(image_1[5200:5600, 5600:6000, :])

Some tiff files are stored with a different axis layout. After reading them, we can transpose the array so that they display in the same way.
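A slightly more general version of that reshaping step could look like the sketch below. The function name `to_hwc` is hypothetical, and it assumes RGB data stored either as (H, W, 3) or as (3, H, W), possibly with extra leading axes of length 1:

```python
import numpy as np


def to_hwc(image):
    """Return an RGB array in (height, width, 3) layout.

    Assumes exactly three axes remain after dropping length-1 axes, and that
    the channel axis (length 3) is either first or last.
    """
    arr = np.squeeze(image)  # e.g. (1, 1, 3, H, W) -> (3, H, W)
    if arr.ndim != 3:
        raise ValueError(f"expected 3 non-singleton axes, got shape {arr.shape}")
    if arr.shape[-1] == 3:
        return arr  # already (H, W, 3)
    if arr.shape[0] == 3:
        return arr.transpose(1, 2, 0)  # (3, H, W) -> (H, W, 3)
    raise ValueError(f"no channel axis of length 3 in shape {arr.shape}")


# Toy array with the same kind of layout as the second example image
stored = np.zeros((1, 1, 3, 4, 5), dtype=np.uint8)
print(to_hwc(stored).shape)
```

The explicit indexing and transposition in the next cell does the same thing for one known layout.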

In [None]:
img_id_4 = "e79de561c"
image_4 = tiff.imread('../input/hubmap-kidney-segmentation/train/' + img_id_4 + ".tiff")
print("This image's id:", img_id_4)
print("Shape as stored:", image_4.shape)
image_4 = image_4[0][0].transpose(1, 2, 0)
plt.figure(figsize=(10, 10))
plt.imshow(image_4)

## Mask

We can decode the mask from the `encoding` column of train.csv.
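Before the full-size decode, a toy example may help show the format. The encoding is a string of `start length` pairs; judging from the decoder below (`starts -= 1` and the final transpose), starts are 1-indexed and pixels are counted column by column. The pure-Python function `rle_decode_toy` here is only an illustration, not the function used in the rest of the notebook:

```python
def rle_decode_toy(rle, height, width):
    """Decode 'start length start length ...' into a list of rows.

    Starts are 1-indexed and count pixels in column-major order
    (top to bottom, then left to right).
    """
    tokens = [int(t) for t in rle.split()]
    flat = [0] * (height * width)
    for start, length in zip(tokens[0::2], tokens[1::2]):
        for k in range(start - 1, start - 1 + length):
            flat[k] = 1
    # Flat index k maps to column k // height and row k % height
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]


# A 3x2 mask: two pixels starting at position 2, one pixel at position 6
toy = rle_decode_toy("2 2 6 1", height=3, width=2)
for row in toy:
    print(row)
```

Positions 2 and 3 fall in the first column (rows 1 and 2) and position 6 is the last pixel of the second column, so only the bottom-left, middle-left, and bottom-right pixels are set.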

In [None]:
# https://www.kaggle.com/paulorzp/rle-functions-run-lenght-encode-decode
def mask2rle(img):
    '''
    img: numpy array, 1 - mask, 0 - background
    Returns the run-length encoding as a formatted string
    '''
    pixels = img.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return ' '.join(str(x) for x in runs)
 
def rle2mask(mask_rle, shape=(1600,256)):
    '''
    mask_rle: run-length encoding as a formatted string (start length)
    shape: (width, height); the returned array has shape (height, width)
    Returns numpy array, 1 - mask, 0 - background
    '''
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape).T

In [None]:
mask_1 = rle2mask(train[train["id"]==img_id_1]["encoding"].iloc[-1], (image_1.shape[1], image_1.shape[0]))
mask_1.shape

Show the mask of the kidney image.

In [None]:
plt.figure(figsize=(10,10))
plt.imshow(mask_1, cmap='coolwarm', alpha=0.5)

To overlay the mask on the image:

In [None]:
plt.figure(figsize=(10,10))
plt.imshow(image_1)
plt.imshow(mask_1, cmap='coolwarm', alpha=0.5)

Let's see one more example.

In [None]:
mask_4 = rle2mask(train[train["id"]==img_id_4]["encoding"].iloc[-1], (image_4.shape[1], image_4.shape[0]))
print(mask_4.shape)
plt.figure(figsize=(10,10))
plt.imshow(image_4)
plt.imshow(mask_4, cmap='coolwarm', alpha=0.5)

According to the [report](https://www.kaggle.com/c/hubmap-kidney-segmentation/discussion/198116) in the discussion, some annotations are still missing. For more on how these missing annotations affect prediction performance and how to handle the data, [this discussion](https://www.kaggle.com/c/hubmap-kidney-segmentation/discussion/227616#1250442) is helpful.

## Processing

To train a model, we have to generate a dataset from these images. This section shows how to process and save them.
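One common way to turn a large slide into training samples is to cut it into fixed-size tiles. The sketch below only computes the crop boxes; the function names, the tile size, and the stride are illustrative assumptions, not values taken from the competition:

```python
def tile_starts(length, tile, stride):
    """Start offsets of windows of size `tile` along one axis.

    The windows cover [0, length) as long as stride <= tile.
    """
    if tile <= 0 or stride <= 0:
        raise ValueError("tile and stride must be positive")
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] + tile < length:
        # Add one last window flush with the border so no pixels are skipped
        starts.append(length - tile)
    return starts


def tile_boxes(height, width, tile, stride):
    """List of (y0, y1, x0, x1) boxes, usable as image[y0:y1, x0:x1]."""
    return [
        (y, min(y + tile, height), x, min(x + tile, width))
        for y in tile_starts(height, tile, stride)
        for x in tile_starts(width, tile, stride)
    ]


boxes = tile_boxes(height=2500, width=1000, tile=1000, stride=1000)
print(boxes)
```

Applying the same box to the image and to the mask keeps the two aligned, which is also what the crop, rotate, and flip cells below do by hand.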

### Crop

In [None]:
plt.figure(figsize=(8,8))
plt.imshow(image_1[5200:6200, 5000:6000, :])
plt.imshow(mask_1[5200:6200, 5000:6000], cmap='coolwarm', alpha=0.5)

### Rotate

In [None]:
plt.figure(figsize=(8,8))
plt.imshow(np.rot90(image_1[5200:6200, 5000:6000, :]))
plt.imshow(np.rot90(mask_1[5200:6200, 5000:6000]), cmap='coolwarm', alpha=0.5)

### Flip

In [None]:
plt.figure(figsize=(8,8))
plt.imshow(np.fliplr(image_1[5200:6200, 5000:6000, :]))
plt.imshow(np.fliplr(mask_1[5200:6200, 5000:6000]), cmap='coolwarm', alpha=0.5)

### With filters

We can use filters with [ImageFilter Module](https://pillow.readthedocs.io/en/stable/reference/ImageFilter.html). The current version of Pillow provides the following set of predefined image enhancement filters:

- BLUR
- CONTOUR
- DETAIL
- EDGE_ENHANCE
- EDGE_ENHANCE_MORE
- EMBOSS
- FIND_EDGES
- SHARPEN
- SMOOTH
- SMOOTH_MORE

After converting the array to a PIL image with `Image.fromarray`, we can apply a filter and then convert the result back with `np.array`.

In [None]:
im_filtered = Image.fromarray(image_1)
# Apply the filter and convert back to a NumPy array.
# Note: this overwrites image_1, so later cells use the filtered image.
image_1 = np.array(im_filtered.filter(ImageFilter.EDGE_ENHANCE_MORE))

plt.figure(figsize=(8,8))
plt.imshow(image_1[5200:6200, 5000:6000, :])
plt.imshow(mask_1[5200:6200, 5000:6000], cmap='coolwarm', alpha=0.5)

### Save as image

With Pillow, we can save the processed kidney images as image files; the masks are stored as NumPy arrays. This is one way to build a training dataset.
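If the masks should also be stored as image files, a lossless format is the safer choice, because JPEG compression can change pixel values and blur the 0/1 boundary. The following is a minimal sketch of a PNG round trip on a toy mask; the helper names and the file name are hypothetical:

```python
import os
import tempfile

import numpy as np
from PIL import Image


def save_mask_png(mask, path):
    """Write a 0/1 mask as an 8-bit grayscale PNG (values 0 or 255)."""
    Image.fromarray((mask > 0).astype(np.uint8) * 255).save(path)


def load_mask_png(path):
    """Read the PNG back into a 0/1 uint8 array."""
    with Image.open(path) as im:
        return (np.array(im) > 127).astype(np.uint8)


# Toy mask standing in for a crop of the full-size mask
toy_mask = np.zeros((8, 8), dtype=np.uint8)
toy_mask[2:5, 3:7] = 1

with tempfile.TemporaryDirectory() as tmp_dir:
    png_path = os.path.join(tmp_dir, "toy_mask.png")
    save_mask_png(toy_mask, png_path)
    restored = load_mask_png(png_path)

print(np.array_equal(toy_mask, restored))
```

Saving with `np.save`, as the cells below do, avoids the question entirely at the cost of larger files.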

In [None]:
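# exist_ok=True lets the cell be re-run without raising FileExistsError
os.makedirs(f"./image/{img_id_1}/", exist_ok=True)
os.makedirs(f"./mask/{img_id_1}/", exist_ok=True)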

In [None]:
pil_img = Image.fromarray(image_1[5200:6200, 5000:6000, :])
print(pil_img.mode)

img_uuid = str(uuid.uuid4())

pil_img.save(f'./image/{img_id_1}/{img_id_1}_{img_uuid}.jpg')
np.save(f'./mask/{img_id_1}/{img_id_1}_{img_uuid}', mask_1[5200:6200, 5000:6000])

## With Annotation json file

We also have two kinds of annotation files. I'll explain what information they contain and how to visualize them.

<a id="99"></a>
### Glomerulus segmentation file

According to the dataset description, this file stores the same information as the RLE-encoded mask.
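Because the annotations are stored as polygons, simple geometry can be computed from them directly. The sketch below assumes each entry is a GeoJSON-style feature whose outer boundary is `feature["geometry"]["coordinates"][0]` (the same indexing the drawing helper further down uses) and that the boundary is a simple polygon. The function names and the toy feature are hypothetical:

```python
def outer_ring(feature):
    """Outer boundary of a polygon feature as a list of (x, y) tuples."""
    return [(float(x), float(y)) for x, y in feature["geometry"]["coordinates"][0]]


def polygon_area(ring):
    """Area of a simple polygon via the shoelace formula, in square pixels.

    Works whether or not the ring repeats its first point at the end.
    """
    twice_area = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Toy feature: a 10 x 10 square, closed by repeating the first vertex
toy_feature = {
    "geometry": {
        "coordinates": [[[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]]],
    }
}
area = polygon_area(outer_ring(toy_feature))
print(area)
```

Run over every entry of the loaded file, this would give the number of annotated glomeruli (the length of the list) and a rough size distribution.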

In [None]:
with open("../input/hubmap-kidney-segmentation/train/e79de561c.json") as f:
    e79de561c_json = json.load(f)
    
print("length of json:", len(e79de561c_json))
print(e79de561c_json[0])

I'll define a utility function that draws the annotation outlines on the image and returns a `PIL.Image.Image` instance.

In [None]:
def flatten(l):
    for el in l:
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

def draw_structure(structures, im):
    """
    structures: list of annotation features, each with a polygon in
        structure["geometry"]["coordinates"].
    im: numpy array of the image read from the tiff file.
    Returns a PIL.Image.Image with each polygon outline drawn in red.
    """

    im = Image.fromarray(im)
    draw = ImageDraw.Draw(im)
    for structure in structures:
        coords = list(flatten(structure["geometry"]["coordinates"][0]))
        # Pair up the flat [x0, y0, x1, y1, ...] list into (x, y) points
        points = [tuple(coords[i:i+2]) for i in range(0, len(coords), 2)]
        draw.line(points, width=100, fill='Red')
    return im

In [None]:
plt.figure(figsize=(8,8))
image_4_with_line = draw_structure(e79de561c_json, image_4)
plt.imshow(image_4_with_line)

<a id="100"></a>
### Anatomical structure file

The anatomical structure file can be displayed in the same way as the glomerulus segmentation file. It contains anatomical structure segmentations, which are intended to help us identify the various parts of the tissue.
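One possible use of these regions, which this notebook does not pursue further, is to check which region a given pixel (for example the centre of a glomerulus) falls in. A minimal ray-casting sketch, assuming the region boundary is available as a list of (x, y) vertices; the function name and the toy square are hypothetical:

```python
def point_in_polygon(x, y, ring):
    """Return True if (x, y) lies inside the polygon given by `ring`.

    Casts a horizontal ray to the right of the point and counts how many
    polygon edges it crosses; an odd count means the point is inside.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Toy region: a 10 x 10 square
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square))
print(point_in_polygon(15, 5, square))
```

Points exactly on an edge are not handled specially here, so results on the boundary should not be relied on.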

In [None]:
with open(f"../input/hubmap-kidney-segmentation/train/{img_id_1}-anatomical-structure.json") as f:
    anatomical_structure_json = json.load(f)
    
anatomical_structure_json

In [None]:
plt.figure(figsize=(8,8))
image_1_with_line = draw_structure(anatomical_structure_json, image_1)
plt.imshow(image_1_with_line)

<a id="4"></a> <br>
# <div class="alert alert-block alert-success">Visualization of HuBMAP-20-dataset_information</div>

Let's visualize HuBMAP-20-dataset_information.csv to make it easier to understand.

In [None]:
ds_info.head()

In [None]:
ds_info.shape

There are 20 rows, one per image, each with 16 columns.

Of these, 15 images are for training and the rest are for testing. The file includes anonymized patient data.

In [None]:
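# Build the lookup once instead of converting the column on every call
train_ids = set(train["id"])

def train_or_test(image_file):
    # "aaa6a05cc.tiff" -> "aaa6a05cc"
    image_id = image_file.split(".")[0]
    return "train" if image_id in train_ids else "test"

ds_info["category"] = ds_info["image_file"].map(train_or_test)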

In [None]:
plt.style.use("Solarize_Light2")

In [None]:
plt.figure(figsize=(15, 5))
g = sns.countplot(data=ds_info, x="patient_number", hue="category", palette=sns.color_palette("Set2", 8))
g.set_title("Number of images per patient")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10,5), gridspec_kw=dict(wspace=0.1, hspace=0.6))
fig.suptitle("race and ethnicity", fontsize=15)
g = sns.countplot(data=ds_info, x="race", hue="category", palette=sns.color_palette("Set2", 8),ax=axes[0])
g.set_title("distribution of race", fontsize=12)
g = sns.countplot(data=ds_info, x="ethnicity", palette=sns.color_palette("Set2", 8), hue="category",ax=axes[1])
g.set_title("distribution of ethnicity", fontsize=12)

In [None]:
# Create the figure and a 2x2 grid of Axes, then set the title.
fig, axes = plt.subplots(2, 2, figsize=(10,6), gridspec_kw=dict(wspace=0.1, hspace=0.6))
fig.suptitle("Sex and age", fontsize=15)

# Merge the two left-hand cells into one tall Axes for the sex plot.
gs = axes[0, 1].get_gridspec()
axes[0, 0].remove()
axes[1, 0].remove()
axbig = fig.add_subplot(gs[:, 0])

g = sns.countplot(data=ds_info, x="sex", hue="category", palette=sns.color_palette("Set2", 8),ax=axbig)
g.set_title("distribution of sex", fontsize=12)

# Add the two age histograms (train and test) in the right-hand column.
g = sns.distplot(ds_info[ds_info["category"]=="train"]["age"], color="tomato", kde=False, rug=False,ax=axes[0,1])
g.set(xlim=(30,80))
g.set(ylim=(0,3))
g.set_title("distribution of age for train", fontsize=12)

g = sns.distplot(ds_info[ds_info["category"]=="test"]["age"], color="teal", kde=False, rug=False, ax=axes[1,1])
g.set(xlim=(30,80))
g.set(ylim=(0,3))
g.set_title("distribution of age for test", fontsize=12)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(10,10), gridspec_kw=dict(wspace=0.1, hspace=0.4))
fig.suptitle("Physical information for train", fontsize=15)


g = sns.distplot(ds_info[ds_info["category"]=="train"]["weight_kilograms"], color="tomato", kde=False, rug=False, ax=axes[0,0])
g.set(xlim=(55,135))
g.set(ylim=(0,5))
g.set_title("weight_kilograms", fontsize=12)

g = sns.distplot(ds_info[ds_info["category"]=="train"]["height_centimeters"], color="tomato", kde=False, rug=False, ax=axes[0,1])
g.set(xlim=(155,195))
g.set(ylim=(0,5))
g.set_title("height_centimeters", fontsize=12)

g = sns.distplot(ds_info[ds_info["category"]=="train"]["bmi_kg/m^2"], color="tomato", kde=False, rug=False, ax=axes[1,0])
g.set(xlim=(22,37.5))
g.set(ylim=(0,5))
g.set_title("bmi_kg/m^2", fontsize=12)

g = sns.countplot(x=ds_info[ds_info["category"]=="train"]["laterality"], ax=axes[1,1])
g.set_title("laterality", fontsize=12)


fig, axes = plt.subplots(2, 2, figsize=(10,10), gridspec_kw=dict(wspace=0.1, hspace=0.4))
fig.suptitle("Physical information for test", fontsize=15)


g = sns.distplot(ds_info[ds_info["category"]=="test"]["weight_kilograms"], color="teal", kde=False, rug=False, ax=axes[0,0])
g.set(xlim=(55,135))
g.set(ylim=(0,5))
g.set_title("weight_kilograms", fontsize=12)

g = sns.distplot(ds_info[ds_info["category"]=="test"]["height_centimeters"], color="teal", kde=False, rug=False, ax=axes[0,1])
g.set(xlim=(155,195))
g.set(ylim=(0,5))
g.set_title("height_centimeters", fontsize=12)

g = sns.distplot(ds_info[ds_info["category"]=="test"]["bmi_kg/m^2"], color="teal", kde=False, rug=False, ax=axes[1,0])
g.set(xlim=(22,37.5))
g.set(ylim=(0,5))
g.set_title("bmi_kg/m^2", fontsize=12)

g = sns.countplot(x=ds_info[ds_info["category"]=="test"]["laterality"], ax=axes[1,1])
g.set_title("laterality", fontsize=12)

In [None]:
ds_info["Ratio_of_medulla_to_cortex"] = ds_info["percent_medulla"] / ds_info["percent_cortex"] 

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10,5), gridspec_kw=dict(wspace=0.1, hspace=0.6))
fig.suptitle("distribution of ratio of medulla to cortex", fontsize=15)
g = sns.distplot(ds_info[ds_info["category"]=="train"]["Ratio_of_medulla_to_cortex"], color="tomato",kde=False, rug=False, ax=axes[0])
g.set(ylim=(0,5))
g.set_title("train", fontsize=12)
g = sns.distplot(ds_info[ds_info["category"]=="test"]["Ratio_of_medulla_to_cortex"], color="teal", kde=False, rug=False, ax=axes[1])
g.set(ylim=(0,5))
g.set_title("test", fontsize=12)

------------
<a id="101"></a> <br>
## Reference

[1] https://en.wikipedia.org/wiki/Glomerulus_(kidney) 

[2] https://www.kaggle.com/c/hubmap-kidney-segmentation/discussion/197552