# Exploratory data analysis
This notebook performs basic exploratory analysis and visualization of the HuBMAP competition dataset. The first two sections cover the training and test datasets, respectively. The third section analyzes how patients are distributed across the training and test datasets and what this implies for data partitioning. Patient-specific information such as age or ethnicity is not analyzed.
1. [Training dataset](#section-one)<br>
    1.1 [Image resolutions](#section-one-one)<br>
    1.2 [RGB and HSV color spaces](#section-one-two)<br>
    1.3 [Glomeruli count](#section-one-three)<br>
    1.4 [Glomeruli size distribution](#section-one-four)<br>
    1.5 [Anatomical structures](#section-one-five)<br>
    1.6 [Mask visualization](#section-one-six)<br>
2. [Test dataset](#section-two)<br>
    2.1 [Image resolutions](#section-two-one)<br>
    2.2 [Anatomical structures](#section-two-two)<br>
    2.3 [Mask visualization](#section-two-three)<br>
3. [Patient images](#section-three)<br>

In [None]:
import cv2
import glob
import json
import numpy as np
import os
import pandas as pd
import tifffile as tiff
from matplotlib import colors
from matplotlib import pyplot as plt
from matplotlib.lines import Line2D
from matplotlib_venn import venn2_unweighted

Get file names of training and test images.

In [None]:
train_images = glob.glob('/kaggle/input/hubmap-kidney-segmentation/train/*.tiff')
test_images = glob.glob('/kaggle/input/hubmap-kidney-segmentation/test/*.tiff')

# keep only the file names, without the directory part
train_images = [os.path.basename(path) for path in train_images]
test_images = [os.path.basename(path) for path in test_images]

Read the dataset information file, which contains general information about images and patients. Then calculate the total number of pixels for each image and store it in a new column `pixels_total`.

In [None]:
pd.set_option('display.max_colwidth', None)
dataset_information = pd.read_csv('/kaggle/input/hubmap-kidney-segmentation/HuBMAP-20-dataset_information.csv')
dataset_information['pixels_total'] = dataset_information.width_pixels * dataset_information.height_pixels

Split the dataset information file into training and test partitions.

In [None]:
train_dataset_information = dataset_information[dataset_information['image_file'].isin(train_images)].reset_index(drop=True)
test_dataset_information = dataset_information[dataset_information['image_file'].isin(test_images)].reset_index(drop=True)

# 1. Training dataset<a id="section-one"></a>
The following section deals exclusively with data from the __training__ dataset. The training dataset is comprised of the following data:
- 15 images in TIFF format
- 15 glomeruli masks in JSON format (alternatively `train.csv` file with RLE-encoded masks)
- 15 anatomical masks in JSON format
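
The RLE strings in `train.csv` can be turned back into binary masks without reading the JSON files. The sketch below is a dependency-free illustration. It assumes the common Kaggle convention of space-separated `start length` pairs with 1-based starts and pixels numbered down each column first; both points are assumptions that should be checked against the competition's data description. The function name `rle_decode` and the toy input are hypothetical.

```python
def rle_decode(rle, height, width):
    """Decode a 'start length start length ...' string into a 0/1 mask.

    Assumptions (not confirmed by this notebook): starts are 1-based and
    pixels are numbered column by column (top to bottom, then left to right).
    """
    tokens = [int(token) for token in rle.split()]
    flat = [0] * (height * width)
    for start, length in zip(tokens[0::2], tokens[1::2]):
        begin = start - 1  # convert the assumed 1-based start to 0-based
        flat[begin:begin + length] = [1] * length
    # flat index = column * height + row under the column-major assumption
    return [[flat[col * height + row] for col in range(width)]
            for row in range(height)]


# toy 3x2 mask: a run of 2 pixels starting at pixel 1 and a single pixel at 5
toy_mask = rle_decode("1 2 5 1", height=3, width=2)
for row in toy_mask:
    print(row)
```

For real images a NumPy array would be far more memory-efficient than nested lists; the list version is only meant to make the indexing explicit.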

## 1.1 Image resolutions<a id="section-one-one"></a>
This table shows image resolutions in descending order. As can be seen in the table, the images are quite large. 

In [None]:
train_dataset_information.sort_values('pixels_total', ascending=False)[['image_file', 'width_pixels','height_pixels']].reset_index(drop=True)

Image resolutions as a bar chart.

In [None]:
train_dataset_information.plot.bar(x='image_file', y='pixels_total', rot=90)
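
To put "quite large" into numbers, the pixel count can be converted into an approximate in-memory size. This is a rough sketch that assumes uncompressed 8-bit RGB data (3 channels, 1 byte per sample); the helper `image_bytes` and the example dimensions are hypothetical and not taken from the table above.

```python
def image_bytes(width, height, channels=3, bytes_per_sample=1):
    """Uncompressed size in bytes of an image held fully in memory."""
    return width * height * channels * bytes_per_sample


# hypothetical dimensions, chosen only to illustrate the arithmetic
example_width, example_height = 30_000, 40_000
size = image_bytes(example_width, example_height)
print(f"{size / 2**30:.2f} GiB")
```

Applied to the `width_pixels` and `height_pixels` columns, the same arithmetic gives a quick estimate of how many images fit into memory at once.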

## 1.2 RGB and HSV color spaces<a id="section-one-two"></a>
The following section shows the pixels of `aaa6a05cc.tiff` in the RGB and HSV color spaces. The HSV color space is often used in segmentation tasks and can be useful to separate tissue pixels from background pixels.

In [None]:
# open and resize image
image = cv2.imread('/kaggle/input/hubmap-kidney-segmentation/train/aaa6a05cc.tiff')
image_resize = cv2.resize(image,(image.shape[1]//10,image.shape[0]//10), interpolation = cv2.INTER_CUBIC)

RGB scatter plot of `aaa6a05cc.tiff`.

In [None]:
# calculate per-pixel face colors; cv2.imread returns BGR, matplotlib expects RGB
pixel_colors = cv2.cvtColor(image_resize, cv2.COLOR_BGR2RGB).reshape(-1, 3)
norm = colors.Normalize(vmin=-1.,vmax=1.)
norm.autoscale(pixel_colors)
pixel_colors = norm(pixel_colors).tolist()

# split channels
b, g, r = cv2.split(image_resize)

# scatter plot
fig = plt.figure()
axis = fig.add_subplot(1, 1, 1, projection='3d')
axis.scatter(r.flatten(), g.flatten(), b.flatten(), facecolors=pixel_colors, marker='.')
axis.set_xlabel('Red')
axis.set_ylabel('Green')
axis.set_zlabel('Blue')
plt.show()

HSV scatter plot of `aaa6a05cc.tiff`.

In [None]:
# convert to hsv
hsv_image = cv2.cvtColor(image_resize, cv2.COLOR_BGR2HSV)
h, s, v = cv2.split(hsv_image)

# scatter plot
fig = plt.figure()
axis = fig.add_subplot(1, 1, 1, projection='3d')
axis.scatter(s.flatten(), h.flatten(), v.flatten(), facecolors=pixel_colors, marker='.')
axis.set_xlabel('Saturation')
axis.set_ylabel('Hue')
axis.set_zlabel('Value')
plt.show()
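
One way to use HSV for separating tissue from background is a simple saturation threshold, since a bright, nearly colorless background has low saturation. The following is a minimal, dependency-free sketch of that idea using the standard-library `colorsys` module. The helper `is_tissue`, the threshold of `0.1`, and the sample pixel values are illustrative assumptions, not values derived from the dataset.

```python
import colorsys


def is_tissue(rgb, min_saturation=0.1):
    """Classify one 8-bit RGB pixel as tissue if it is saturated enough."""
    r, g, b = (channel / 255.0 for channel in rgb)
    _, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
    return saturation >= min_saturation


# hypothetical pixels: pure white, near-white background, pinkish tissue
pixels = [(255, 255, 255), (240, 240, 238), (200, 120, 160)]
labels = [is_tissue(pixel) for pixel in pixels]
print(labels)
```

With OpenCV, the same threshold would be applied to the `s` channel computed above, keeping in mind that for 8-bit images OpenCV stores saturation in the range 0 to 255 rather than 0 to 1.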

## 1.3 Glomeruli count<a id="section-one-three"></a>
This section provides information about the number of glomeruli masks per image in descending order.

In [None]:
train_glom_seg_files = train_dataset_information['glomerulus_segmentation_file'].to_list()
train_glomeruli_dict = {}

for file_name in train_glom_seg_files:
    file_id = file_name[:9]
    with open(f'/kaggle/input/hubmap-kidney-segmentation/train/{file_name}') as json_file:
        data = json.load(json_file)
        train_glomeruli_dict[file_id] = 0
        for entry in data:
            if entry['type'] == 'Feature' and entry['properties']['classification']['name'] == 'glomerulus':
                train_glomeruli_dict[file_id] += 1
            else:
                raise Exception(f"Unexpected json format: {entry['type']}, {entry['properties']['classification']['name']}")
                
train_nr_glom = pd.DataFrame(list(train_glomeruli_dict.items()), columns=['file_id', 'nr_glomeruli'])
train_nr_glom.sort_values('nr_glomeruli', ascending=False).reset_index(drop=True)

Total number of glomeruli in the training dataset.

In [None]:
train_nr_glom['nr_glomeruli'].sum()
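
For readers unfamiliar with the JSON layout, the counting loop above only relies on three keys per entry: `type`, `properties.classification.name` and `geometry.coordinates`. The sketch below builds a tiny, hypothetical file content with exactly those keys (the coordinates are invented) and counts the entries per class name with `collections.Counter`. It illustrates the structure the loop expects and is not a copy of a real dataset file.

```python
import json
from collections import Counter

# hypothetical file content that mirrors only the keys accessed in this notebook
sample_json = """
[
  {"type": "Feature",
   "properties": {"classification": {"name": "glomerulus"}},
   "geometry": {"coordinates": [[[10, 10], [20, 10], [20, 20], [10, 20], [10, 10]]]}},
  {"type": "Feature",
   "properties": {"classification": {"name": "glomerulus"}},
   "geometry": {"coordinates": [[[40, 40], [55, 40], [55, 52], [40, 40]]]}}
]
"""

entries = json.loads(sample_json)

# count entries per classification name, considering 'Feature' entries only
counts = Counter(
    entry["properties"]["classification"]["name"]
    for entry in entries
    if entry["type"] == "Feature"
)
print(counts["glomerulus"])
```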

## 1.4 Glomeruli size distribution<a id="section-one-four"></a>
The sizes of the glomeruli masks can also be estimated. The following section measures the bounding box (height and width in pixels) of each glomerulus polygon and summarizes the distribution of these sizes.

In [None]:
train_glom_seg_files = train_dataset_information['glomerulus_segmentation_file'].to_list()
train_glomeruli_polys_dict = {}

for file_name in train_glom_seg_files:
    with open(f'/kaggle/input/hubmap-kidney-segmentation/train/{file_name}') as json_file:
        data = json.load(json_file)
        train_glomeruli_polys_dict[file_name] = []
        for entry in data:
            if entry['type'] == 'Feature' and entry['properties']['classification']['name'] == 'glomerulus':
                geom = np.array(entry['geometry']['coordinates']).astype(np.float32)
                x,y,w,h = cv2.boundingRect(geom.squeeze(axis=0))
                train_glomeruli_polys_dict[file_name].append((h,w)) # height, width!
            else:
                raise Exception(f"Unexpected json format: {entry['type']}, {entry['properties']['classification']['name']}")
                
train_res_glom = pd.DataFrame(list(train_glomeruli_polys_dict.items()), columns=['glomerulus_segmentation_file', 'glomeruli_height_width'])
train_res_glom = train_res_glom.explode('glomeruli_height_width')
train_res_glom['height'], train_res_glom['width'] = zip(*train_res_glom['glomeruli_height_width'])
train_res_glom = train_res_glom.drop(columns=['glomeruli_height_width'])
train_res_glom.describe()
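
Bounding boxes overestimate the size of round or irregular shapes. If the actual enclosed area is of interest, it can be computed directly from the polygon vertices with the shoelace formula. The sketch below is a plain-Python illustration; the function `polygon_area` and the toy polygons are hypothetical and not part of the original analysis.

```python
def polygon_area(points):
    """Area of a simple polygon given as a list of [x, y] vertices.

    Uses the shoelace formula. A closed ring that repeats its first point
    at the end gives the same result, because that edge contributes zero.
    """
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# toy polygons that span the same 4x4 extent
square = [[0, 0], [4, 0], [4, 4], [0, 4]]
triangle = [[0, 0], [4, 0], [0, 4]]
print(polygon_area(square), polygon_area(triangle))
```

Both shapes span the same 4x4 extent, yet the triangle covers only half of it, which is the kind of difference a bounding-box statistic hides.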

## 1.5 Anatomical structures<a id="section-one-five"></a>
In addition to the glomeruli masks and the images themselves, the challenge dataset contains segmentation masks of some anatomical structures that can be used for training. Each image contains a different set of anatomical masks. This table shows all available anatomical masks for each image from the training dataset.

__Note:__ To our knowledge, the private test dataset doesn't provide segmentation masks for anatomical structures. Therefore the use of anatomical structures is limited.

In [None]:
def list_structures(seg_files, folder):
    train_anatomical_dict = {}
    folder_path = os.path.join('/kaggle/input/hubmap-kidney-segmentation/', folder)

    for file_name in seg_files:
        file_id = file_name[:9]
        with open(os.path.join(folder_path, file_name)) as json_file:
            data = json.load(json_file)
            train_anatomical_dict[file_id] = []
            for entry in data:
                if entry['type'] == 'Feature':
                    train_anatomical_dict[file_id].append(entry['properties']['classification']['name'])
                else:
                    raise Exception(f"Unexpected json format: {entry['type']}, {entry['properties']['classification']['name']}")

    return pd.DataFrame(list(train_anatomical_dict.items()), columns=['file_id', 'anatomical_structure'])

In [None]:
train_anatomical_seg_files = train_dataset_information['anatomical_structures_segmention_file'].to_list()
train_anatomical = list_structures(train_anatomical_seg_files, 'train')
train_anatomical

List of all unique anatomical structures that can be found in the training dataset. 

In [None]:
pd.DataFrame(train_anatomical.explode('anatomical_structure')['anatomical_structure'].unique(), columns=['unique_structures'])

## 1.6 Mask visualization<a id="section-one-six"></a>
This section visualizes the glomeruli and anatomical masks. The function `make_grid` (borrowed from https://www.kaggle.com/leighplt/pytorch-fcn-resnet50) creates a grid for the sliding window operation.

In [None]:
def make_grid(shape, window=256, min_overlap=32):
    """
    source: https://www.kaggle.com/leighplt/pytorch-fcn-resnet50
    
    function to generate a grid layout for sliding window
    :param shape: height and width of the image
    :param window: size of the window
    :param min_overlap: minimal window overlap
    :return: array of window coordinates (x1,x2,y1,y2)
    """
    x, y = shape
    nx = x // (window - min_overlap) + 1
    x1 = np.linspace(0, x, num=nx, endpoint=False, dtype=np.int64)
    x1[-1] = x - window
    x2 = (x1 + window).clip(0, x)
    ny = y // (window - min_overlap) + 1
    y1 = np.linspace(0, y, num=ny, endpoint=False, dtype=np.int64)
    y1[-1] = y - window
    y2 = (y1 + window).clip(0, y)
    slices = np.zeros((nx,ny, 4), dtype=np.int64)
    
    for i in range(nx):
        for j in range(ny):
            slices[i,j] = x1[i], x2[i], y1[j], y2[j]    
    return slices.reshape(nx*ny,4)

The following visualization is inspired by https://www.kaggle.com/mpware/masks-quick-eda-updated-data. It shows the anatomical structures as colored outlines and the glomeruli, which appear as small circles at this scale. Since the provided images are large, they must be split into smaller frames with a sliding window before they can be used as input for a machine learning algorithm. The white grid lines show how the images will be split into frames. In this example, the frame size is 1024x1024 pixels with a minimum overlap of 256 pixels. As can be seen in the images, glomeruli are mostly located in the cortex. However, there are also some glomeruli masks outside the cortex.
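
To make the frame layout concrete, the per-axis logic of `make_grid` can be restated in plain Python. The sketch below is intended to mirror how `make_grid` places window starts along one axis; the helper name `window_starts` and the axis length of 3000 pixels are hypothetical and chosen only for illustration.

```python
def window_starts(size, window=1024, min_overlap=256):
    """Start offsets of sliding windows along one image axis.

    Intended as a plain-Python restatement of the per-axis logic in
    make_grid: evenly spaced starts, with the last window pulled back
    so that it ends exactly at the image border.
    """
    count = size // (window - min_overlap) + 1
    starts = [int(i * (size / count)) for i in range(count)]
    starts[-1] = size - window
    return starts


# hypothetical axis length of 3000 pixels
starts = window_starts(3000)
# overlap between each window and its successor
overlaps = [a + 1024 - b for a, b in zip(starts, starts[1:])]
print(starts, overlaps)
```

The last window usually overlaps its neighbor more than the others, because it is shifted back to stay inside the image.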

In [None]:
def plot_masks(dataset_information, folder, frame_size, frame_overlap, plot_glom):
    folder_path = os.path.join("/kaggle/input/hubmap-kidney-segmentation", folder)
    
    for i in range(len(dataset_information)):

        # create new figure
        plt.figure(figsize=(32, 30))
        obj_line_thickness = 60

        # find metadata row for json file
        image_metadata = dataset_information.iloc[i]

        # open image file
        image = tiff.imread(os.path.join(folder_path, image_metadata['image_file']))
        print(image_metadata['image_file'])

        # reshape image if necessary
        if len(image.shape) == 5:
            image = image.squeeze()
        if image.shape[0] == 3:
            image = image.transpose(1, 2, 0)

        # create a copy of the image
        image = image.copy()

        # draw sliding window boxes
        frame_grid = make_grid((image.shape[1], image.shape[0]), frame_size, frame_overlap)

        for frame in frame_grid:
            # cast NumPy integers to plain ints before passing them to cv2
            x1, y1 = int(frame[0]), int(frame[2])
            x2, y2 = int(frame[1]), int(frame[3])
            image = cv2.rectangle(image, (x1, y1), (x2, y2), color=(255,255,255), thickness=16)

        # draw glomeruli polygons
        if plot_glom:
            # open glomeruli json file (the context manager closes it again)
            with open(os.path.join(folder_path, image_metadata['glomerulus_segmentation_file']), 'r') as glom_seg_file:
                glom_seg_data = json.load(glom_seg_file)

            for k in range(len(glom_seg_data)):
                glom_poly = np.array(glom_seg_data[k]['geometry']['coordinates']).astype(np.int32) # get coordinates of glomeruli
                cv2.polylines(image, glom_poly, True,(255,0,0), thickness=obj_line_thickness)

        # open anatomical json file (the context manager closes it again)
        with open(os.path.join(folder_path, image_metadata['anatomical_structures_segmention_file']), 'r') as anatomical_seg_file:
            anatomical_seg_data = json.load(anatomical_seg_file)

        # scan anatomical json file and draw lines
        for n in range(len(anatomical_seg_data)):
            obj_name = anatomical_seg_data[n]['properties']['classification']['name']
            obj_coords = anatomical_seg_data[n]['geometry']['coordinates']

            if (obj_name == 'Cortex'): # draw line around cortex
                cv2.polylines(image, np.expand_dims(np.array(obj_coords[0]).astype(np.int32), axis=0), True, (0,0,255), thickness=obj_line_thickness)
            elif (obj_name == 'Medulla'): # draw line around medulla
                cv2.polylines(image, np.array(obj_coords).astype(np.int32), True, (0,255,0), thickness=obj_line_thickness)
            elif (obj_name == 'Inner medulla'): # draw line around inner medulla
                cv2.polylines(image, np.array(obj_coords).astype(np.int32), True, (255,255,0), thickness=obj_line_thickness)
            elif (obj_name == 'Outer Medulla'): # draw line around outer medulla
                cv2.polylines(image, np.array(obj_coords).astype(np.int32), True, (0,255,255), thickness=obj_line_thickness)
            elif (obj_name == 'Outer Stripe'): # draw line around outer stripe
                cv2.polylines(image, np.array(obj_coords).astype(np.int32), True, (255,0,255), thickness=obj_line_thickness)
            else:
                raise Exception(f'Unknown anatomical object: {obj_name}')

        # down-scale the image
        image_resize = cv2.resize(image,(image.shape[1]//10,image.shape[0]//10), interpolation = cv2.INTER_CUBIC)

        # add legend and view the image
        custom_lines = [Line2D([0], [0], color=(0.,0.,1.), lw=4),
                    Line2D([0], [0], color=(0.,1.,0.), lw=4),
                    Line2D([0], [0], color=(1.,1.,0.), lw=4),
                    Line2D([0], [0], color=(0.,1.,1.), lw=4),
                    Line2D([0], [0], color=(1.,0.,1.), lw=4),]

        plt.legend(custom_lines, ['Cortex', 'Medulla', 'Inner medulla', 'Outer Medulla', 'Outer Stripe'])
        plt.axis('off')
        plt.title(image_metadata['image_file'])
        plt.imshow(image_resize)
        plt.show()

In [None]:
plot_masks(train_dataset_information, 'train', 1024, 256, True)

# 2. Test dataset<a id="section-two"></a>
The following section deals exclusively with data from the __test__ dataset. The test dataset is comprised of the following data:
- 5 images in TIFF format
- 5 anatomical masks in JSON format

## 2.1 Image resolutions<a id="section-two-one"></a>
This table shows image resolutions in descending order.

In [None]:
test_dataset_information.sort_values('pixels_total', ascending=False)[['image_file', 'width_pixels','height_pixels']]

Image resolutions as a bar chart.

In [None]:
test_dataset_information.plot.bar(x='image_file', y='pixels_total')

## 2.2 Anatomical structures<a id="section-two-two"></a>
This table shows all available anatomical masks for each image from the test dataset.

In [None]:
test_anatomical_seg_files = test_dataset_information['anatomical_structures_segmention_file'].to_list()
test_anatomical = list_structures(test_anatomical_seg_files, 'test')
test_anatomical

List of all unique anatomical structures in the test dataset. 

In [None]:
pd.DataFrame(test_anatomical.explode('anatomical_structure')['anatomical_structure'].unique(), columns=['unique_structures'])

## 2.3 Mask visualization<a id="section-two-three"></a>
As already mentioned, this visualization is inspired by https://www.kaggle.com/mpware/masks-quick-eda-updated-data. It shows only the anatomical structures, since the test dataset contains no glomeruli masks. The white grid lines show how the images will be split into frames. In this example, the frame size is 1024x1024 pixels with a minimum overlap of 256 pixels.

In [None]:
plot_masks(test_dataset_information, 'test', 1024, 256, False)

# 3. Patient images<a id="section-three"></a>
Some patients from the training dataset have more than one image, as shown in the table below. This must be considered when partitioning training data into training, validation and test sets in order to avoid data leakage.

In [None]:
train_dataset_information.groupby('patient_number').agg(list)['image_file'].to_frame()

Patients from the test dataset have only one image each.

In [None]:
test_dataset_information.groupby('patient_number').agg(list)['image_file'].to_frame()

Furthermore, some patients appear in both the training __and__ test datasets, as shown below.

In [None]:
train_patients = set(train_dataset_information['patient_number'].unique().tolist())
test_patients = set(test_dataset_information['patient_number'].unique().tolist())

train_patients.intersection(test_patients)

Patients that exclusively belong to the __training dataset__.

In [None]:
train_excl_pat = train_patients - test_patients
train_excl_pat

Patients that exclusively belong to the __test dataset__.

In [None]:
test_excl_pat = test_patients - train_patients
test_excl_pat

Venn diagram representation of patient distribution over the training and test datasets.

In [None]:
venn = venn2_unweighted([train_patients, test_patients], ('Training', 'Test'))
venn.get_label_by_id('10').set_text('\n'.join(sorted(map(str, train_patients - test_patients))))
venn.get_label_by_id('11').set_text('\n'.join(sorted(map(str, train_patients.intersection(test_patients)))))
venn.get_label_by_id('01').set_text('\n'.join(sorted(map(str, test_patients - train_patients))))

As mentioned before, data leakage must be prevented when splitting the training data into training, validation and test partitions. The test partition (a hold-out split of the training data, not the competition test dataset) therefore must not contain images of patients whose other images were used to train the machine learning model. This means that only patients from the left side of the Venn diagram, who belong exclusively to the training dataset, can be selected for the test partition. Moreover, because the training dataset is small, it can be beneficial to select images of different patients for the test partition rather than multiple images from the same patient. Based on these considerations, the requirements for the test partition can be defined as follows:
- Patient belongs exclusively to the training dataset
- Patient has one image at most

The following section shows possible candidates for the test partition.

In [None]:
patient_images = train_dataset_information.groupby('patient_number').agg(list)['image_file'].to_frame().reset_index()
patient_images = patient_images[patient_images['patient_number'].isin(train_excl_pat)]
patient_images = patient_images[patient_images['image_file'].map(len) < 2]
patient_images = patient_images['image_file'].apply(lambda x: x[0]).to_frame().reset_index(drop=True)
patient_images
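
Once hold-out patients have been chosen, the split itself should be made per patient rather than per image, so that all images of one patient end up in the same partition. The sketch below shows one possible way to do this together with a simple leakage check. The function `split_by_patient`, the file names and the patient numbers are hypothetical and do not come from the dataset.

```python
def split_by_patient(image_to_patient, holdout_patients):
    """Split images so that every patient lands in exactly one partition."""
    development, holdout = [], []
    for image, patient in sorted(image_to_patient.items()):
        target = holdout if patient in holdout_patients else development
        target.append(image)
    return development, holdout


# hypothetical file names and patient numbers, for illustration only
image_to_patient = {
    "img_a.tiff": 1, "img_b.tiff": 1,
    "img_c.tiff": 2,
    "img_d.tiff": 3,
}
development, holdout = split_by_patient(image_to_patient, holdout_patients={2})

# leakage check: no patient may appear on both sides of the split
development_patients = {image_to_patient[image] for image in development}
holdout_patients_seen = {image_to_patient[image] for image in holdout}
assert development_patients.isdisjoint(holdout_patients_seen)
print(development, holdout)
```

The same grouping idea applies when the remaining development images are divided further into training and validation folds.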