# Overview

I have tried to keep this EDA concise and to the point. Rather than making the notebook lengthy and wasting the reader's time, I have included only the important points.

For more exploration and understanding of the data, I recommend that readers copy and edit this notebook, as there is no better exploration than the one you do yourself ;)

# Please upvote the notebook if you find the content useful. This will motivate me to create more content like this :)

# Before proceeding with the EDA, I advise readers to pick up a little domain knowledge.

Follow the link below for a concise, consolidated summary of the domain knowledge for this competition:

https://www.kaggle.com/prvnkmr/better-understanding-of-the-problem-statement

# Libraries import

In [None]:
import PIL
import gc
import os
import random
import tifffile
import cv2
import json
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image

import warnings
warnings.filterwarnings("ignore")

PATH = "../input/hubmap-kidney-segmentation/"
CFG = {
        'PATH' : "../input/hubmap-kidney-segmentation/",
        'PATH_TRAIN' : PATH + "train/",
        'PATH_TEST' : PATH + "test/",
}

# Utility Functions

In [None]:
output_dir = '../output'
# exist_ok=True keeps the cell from failing when it is re-run
os.makedirs(output_dir, exist_ok=True)

def resize_im(im_name, scale_percent):
    
    image_path = os.path.join(CFG['PATH_TRAIN'], im_name+'.tiff')
    im_read = tifffile.imread(image_path)
    width = int(im_read.shape[1] * scale_percent / 100)
    height = int(im_read.shape[0] * scale_percent / 100)
    dim = (width, height)
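    # Note: the original size is reported as (height, width), the new size as (width, height)
    print('File name: {}, original size (h, w): {}, resized to (w, h): {}'.format(im_name, 
                                                                    (im_read.shape[0], im_read.shape[1]), 
                                                                    (width, height)))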
    resized = cv2.resize(im_read, dim, interpolation=cv2.INTER_AREA)
    image_path = os.path.join(output_dir, ('r_' + im_name + '.tiff'))
    tifffile.imwrite(image_path, resized)

def rle2mask(mask_rle, shape):
    
    '''
    mask_rle: run-length as string formatted (start length)
    shape: (width,height) of array to return
    Returns numpy array, 1 - mask, 0 - background
    '''
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape).T

def resize_mask(im_name, scale_percent):
    
    # The full-resolution image is read only to obtain its height and width
    im_read = tifffile.imread(os.path.join(CFG['PATH_TRAIN'], im_name+'.tiff'))
    mask_rle = df_train[df_train["id"] == im_name]["encoding"].values[0]
    # rle2mask expects (width, height); multiply by 255 so the mask is visible when saved
    mask = rle2mask(mask_rle, (im_read.shape[1], im_read.shape[0]))*255
    width = int(im_read.shape[1] * scale_percent / 100)
    height = int(im_read.shape[0] * scale_percent / 100)
    dim = (width, height)
    # Note: the original size is reported as (height, width), the new size as (width, height)
    print('File name: {}, original size (h, w): {}, resized to (w, h): {}'.format(im_name, 
                                                                (im_read.shape[0], im_read.shape[1]), 
                                                                (width, height)))
    resized = cv2.resize(mask, dim, interpolation=cv2.INTER_AREA)
    image_path = os.path.join(output_dir, ('r_' + im_name+'_m.tiff'))
    tifffile.imwrite(image_path, resized)
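
To make the RLE convention used by `rle2mask` concrete, here is a minimal worked example on a toy mask. It assumes, as the function above implies, that run starts are 1-indexed and that pixels are counted down each column first. The toy string and the 4 x 3 size are made up for illustration.

```python
import numpy as np

def rle2mask(mask_rle, shape):
    """Decode 'start length' pairs (1-indexed, column-major) into a 0/1 mask.

    shape is (width, height), the same convention as the function above.
    """
    s = mask_rle.split()
    starts = np.asarray(s[0::2], dtype=int) - 1   # convert to 0-indexed
    lengths = np.asarray(s[1::2], dtype=int)
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, n in zip(starts, lengths):
        img[lo:lo + n] = 1
    return img.reshape(shape).T                   # transpose to (height, width)

# Toy image: 4 pixels wide, 3 pixels tall.
# "2 2" marks pixels 2-3 and "8 1" marks pixel 8, counting down the columns.
toy_mask = rle2mask("2 2 8 1", shape=(4, 3))
print(toy_mask)
# [[0 0 0 0]
#  [1 0 1 0]
#  [1 0 0 0]]
```

Pixels 2 and 3 fall in the first column (rows 1 and 2), and pixel 8 falls in the third column (row 1), which is why the result has to be transposed after reshaping.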

# Train.csv

As we know, it contains the training data for the segmentation models.
Column info below:

1. **id**       - id of each image
2. **encoding** - RLE encoded segmentation masks

In [None]:
df_train = pd.read_csv(CFG['PATH'] + 'train.csv')

In [None]:
df_train.head()

In [None]:
df_train.info()

In [None]:
for i in range(len(df_train)):

    print(len(df_train['encoding'][i]))

Important points about train.csv:

1. There are 8 images in the training data.
2. The RLE encoding of each image is long, but still very small compared to the total number of pixels in the image. This is why only the RLE encoding of the segmentation masks is to be returned, rather than the segmented image itself.
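
The size difference can be illustrated with a small sketch. `mask2rle` below is a hypothetical helper (it is not defined in this notebook) that encodes a mask in the same 1-indexed, column-major convention; the 1000 x 1000 mask and the rectangle in it are synthetic.

```python
import numpy as np

def mask2rle(mask):
    """Encode a 0/1 mask of shape (height, width) as 'start length' pairs.

    Hypothetical helper for illustration: 1-indexed, column-major runs.
    """
    pixels = mask.T.flatten()                      # walk down each column
    padded = np.concatenate([[0], pixels, [0]])    # so runs at the edges close
    changes = np.where(padded[1:] != padded[:-1])[0] + 1
    changes[1::2] -= changes[::2]                  # turn run ends into lengths
    return " ".join(str(x) for x in changes)

# Synthetic mask: one 100 x 100 square inside a 1000 x 1000 image
mask = np.zeros((1000, 1000), dtype=np.uint8)
mask[100:200, 300:400] = 1

rle = mask2rle(mask)
n_runs = len(rle.split()) // 2
print("runs:", n_runs)
print("RLE characters:", len(rle), "vs pixels:", mask.size)
```

Each of the 100 masked columns contributes one run, so the encoding stores 100 pairs of numbers instead of a million pixel values.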

# HuBMAP-20-dataset_information.csv

Columns of csv are as follows:

**image_file** - name of image file in .tiff format

**width_pixels** - image pixel width

**height_pixels** - image pixel height

**anatomical_structures_segmention_file** - name of the .json file storing the segments (polygons) of the kidney parts (cortex/medulla)

**glomerulus_segmentation_file** - name of the .json file storing the segments (polygons) of the glomeruli

**patient_number** - patient number

**race** - race of patient

**sex** - patient gender

**ethnicity** - ethnicity of patient

**age** - patient age

**weight_kilograms** - weight of patient in kg

**height_centimeters** - height of patient in cm

**bmi_kg/m^2** - body mass index (weight_kilograms / (height_centimeters / 100)^2)

**laterality** - laterality of kidney(left / right)

**percent_cortex** - percent of cortex (outer part of the kidney)

**percent_medulla** - percent of medulla (inner part of the kidney)
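
As a quick check on the units of the BMI column: BMI is weight in kilograms divided by the square of height in meters, so a height in centimeters has to be divided by 100 first. The function name and the sample values below are made up for illustration.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

# Illustrative values only, not taken from the dataset
print(round(bmi(70.0, 175.0), 1))  # 22.9
```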

In [None]:
df_info = pd.read_csv(PATH + 'HuBMAP-20-dataset_information.csv')

In [None]:
df_info.info()

**We will only need columns 0 to 4 for our computations, although that does not mean the remaining columns are useless.**

In [None]:
df_info.head()

In [None]:
len(df_info)

# Important points about HuBMAP-20-dataset_information.csv

* 13 records in the file: 8 for training and 5 for testing
* Contains the names of the images and their pixel sizes
* Contains the names of the .json files that store the polygon coordinates of the glomeruli and of the anatomical structures (cortex/medulla)
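
The polygon files can be explored with the standard `json` module. The sketch below is only an assumption about the layout: it supposes a GeoJSON-style list of features, each with a `geometry` of type `Polygon` and a `coordinates` list, and it uses a made-up inline sample instead of a real file. Check the keys of an actual .json file before relying on it.

```python
import json

# Made-up sample mimicking an ASSUMED GeoJSON-style layout; real files may differ
sample = """[
  {"type": "Feature",
   "geometry": {"type": "Polygon",
                "coordinates": [[[10, 20], [30, 20], [30, 50], [10, 50], [10, 20]]]}}
]"""

def polygon_bounding_boxes(features):
    """Return (x_min, y_min, x_max, y_max) for every Polygon feature."""
    boxes = []
    for feat in features:
        geom = feat["geometry"]
        if geom["type"] != "Polygon":
            continue
        ring = geom["coordinates"][0]   # outer ring of the polygon
        xs = [p[0] for p in ring]
        ys = [p[1] for p in ring]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

features = json.loads(sample)
boxes = polygon_bounding_boxes(features)
print(len(boxes), boxes)
```

Bounding boxes like these give a rough idea of where each annotated structure sits and how large it is without rasterizing the polygons.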

# Visualizing the images

In [None]:
df_info[['image_file', 'width_pixels', 'height_pixels']]

**Selecting the images to be visualized**

In [None]:
im_list = [
    df_train['id'][0],
    df_train['id'][1],
    df_train['id'][2],
    df_train['id'][3],
]

In [None]:
im_list

**Resizing the masks and the images to make the operations a bit faster**

In [None]:
for im in im_list:
    resize_im(im, 5)

In [None]:
for im in im_list:
    resize_mask(im, 5)
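
To see why a 5 percent scale helps so much, the arithmetic can be sketched as follows. The full-resolution size here is hypothetical and chosen only for illustration; the helper returns `(width, height)` because that is the order `cv2.resize` expects for its target size.

```python
def scaled_dim(height, width, scale_percent):
    """Return the scaled size as (width, height), the order cv2.resize expects."""
    new_width = int(width * scale_percent / 100)
    new_height = int(height * scale_percent / 100)
    return new_width, new_height

# Hypothetical full-resolution size, for illustration only
height, width = 30000, 25000
new_w, new_h = scaled_dim(height, width, 5)

# Each side shrinks 20x, so the pixel count shrinks 20 * 20 = 400x
reduction = (height * width) / (new_h * new_w)
print((new_w, new_h), reduction)
```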

**Resized Images:**

In [None]:
os.listdir(output_dir)

**Function used for visualization of the images**

In [None]:
def show_image(image_id):
    
    fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(16, 32))
    image_path = os.path.join(output_dir, 'r_{}.tiff'.format(image_id))
    mask_path = os.path.join(output_dir, 'r_{}_m.tiff'.format(image_id))
    
    image = tifffile.imread(image_path)
    mask = tifffile.imread(mask_path)
    
    if len(mask.shape)==2:    
        hybr = image[:, :, 0] + mask[:, :]/2
    else:
        hybr = image[:, :, 0] + mask[:,: , 0]/2
    
    ax[0].imshow(image)
    ax[0].axis('off')
    ax[0].set_title('Real Image')
    
    ax[1].imshow(hybr)
    ax[1].axis('off')
    ax[1].set_title('Masks')
    
    plt.show()
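
`show_image` adds half of the mask to the first colour channel. A possible alternative, sketched here on synthetic arrays, is to alpha-blend a colour into the RGB image wherever the mask is set, which keeps the tissue colours visible. `overlay_mask`, its default colour and its `alpha` value are assumptions for illustration, not part of this notebook.

```python
import numpy as np

def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.4):
    """Blend `color` into `image` wherever `mask` is non-zero.

    image: (H, W, 3) uint8 array, mask: (H, W) array. Returns a new uint8 image.
    """
    out = image.astype(np.float32)
    selected = mask > 0
    tint = np.array(color, dtype=np.float32)
    out[selected] = (1 - alpha) * out[selected] + alpha * tint
    return out.round().astype(np.uint8)

# Synthetic grey image with a 2 x 2 masked square in the middle
image = np.full((4, 4, 3), 100, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 255

blended = overlay_mask(image, mask)
print(blended[1, 1], blended[0, 0])
```

The result can be passed to `ax[1].imshow(...)` in place of `hybr`, assuming the resized image has shape `(H, W, 3)`.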

**Visualizing the images from im_list**

In [None]:
show_image(im_list[0])

In [None]:
show_image(im_list[2])

In [None]:
show_image(im_list[1])

# 'Real Image' is the image without the mask. Overlaying the mask reveals the glomeruli FTUs inside the kidney tissue.

# After this, I advise readers to play around with the masks, for example by exploring the 'size' and 'number' of the masked regions in each image.
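
One way to start is to count the connected regions in a binary mask and measure their areas. The sketch below uses a simple breadth-first search on a toy mask so that it has no dependencies beyond NumPy; `component_sizes` is a hypothetical helper, and on a real resized mask `cv2.connectedComponentsWithStats` would likely be much faster.

```python
from collections import deque

import numpy as np

def component_sizes(mask):
    """Return the pixel count of every 4-connected foreground region."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    sizes = []
    for r in range(h):
        for c in range(w):
            if mask[r, c] == 0 or seen[r, c]:
                continue
            # Flood-fill one region starting from this unseen foreground pixel
            queue = deque([(r, c)])
            seen[r, c] = True
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes

# Toy mask with two separate blobs
toy = np.zeros((6, 6), dtype=np.uint8)
toy[0:2, 0:2] = 1   # 4-pixel blob
toy[4, 3:6] = 1     # 3-pixel blob

sizes = component_sizes(toy)
print("number of regions:", len(sizes), "sizes:", sizes)
```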

I have taken some help from these two notebooks. Please upvote them as well !!

https://www.kaggle.com/yuriikochurovskyi/hubmap-image-eda-step-by-step-beginner-friendly

https://www.kaggle.com/kiruganko/hubmap-eda#Images

# Please upvote if you liked the content :)