# HuBMAP - an analysis and methodology

---

Hello there, and welcome to my notebook. This is a simple analysis and a set of potential guidelines for the HuBMAP: Hacking the Kidney competition currently running on Kaggle.

This is my first time doing image segmentation on Kaggle, so I would really appreciate feedback on what I'm doing. Let's get started with a quick run-through of the images and data.

## Primary setup - load 256x256 images

In [None]:
!unzip ../input/256x256-images/train.zip 

As you can see, GM iafoss has already processed the images into 256x256 tiles, which makes them far easier to analyse. Unzipping them is merely a preliminary step before carrying on.

# Introduction (from hosts)

This competition, “Hacking the Kidney,” starts by mapping the human kidney at single cell resolution.

Your challenge is to detect functional tissue units (FTUs) across different tissue preparation pipelines. An FTU is defined as a “three-dimensional block of cells centered around a capillary, such that each cell in this block is within diffusion distance from any other cell in the same block” (de Bono, 2013). The goal of this competition is the implementation of a successful and robust glomeruli FTU detector.

You will also have the opportunity to present your findings to a panel of judges for additional consideration. Successful submissions will construct the tools, resources, and cell atlases needed to determine how the relationships between cells can affect the health of an individual.

Advancements in HuBMAP will accelerate the world’s understanding of the relationships between cell and tissue organization and function and human health. These datasets and insights can be used by researchers in cell and tissue anatomy, pharmaceutical companies to develop therapies, or even parents to show their children the magnitude of the human body.

# Getting started

First of all, we have to import the required libraries. 

In [None]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
import cv2, tifffile
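import os

# Build explicit paths to the unzipped tiles so they load from any working directory.
tile_dir = '../working'
list_ims = sorted(os.path.join(tile_dir, f) for f in os.listdir(tile_dir))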
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

"""From https://www.kaggle.com/kool777/hubmap-extensive-eda"""

def mask2rle(img):
    '''
    img: numpy array, 1 - mask, 0 - background
    Returns the run-length encoding as a formatted string
    (1-indexed "start length" pairs, pixels read column by column)
    '''
    pixels = img.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return ' '.join(str(x) for x in runs)
 
def rle2mask(mask_rle, shape=(1600,256)):
    '''
    mask_rle: run-length encoding as a formatted string (start length)
    shape: (width,height) of array to return 
    Returns numpy array of shape (height, width), 1 - mask, 0 - background
    '''
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape).T
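
To make the encoding concrete, here is a small round-trip check on a toy 3x4 mask. It is only a sketch: the two helpers are repeated from the cell above so that the snippet runs on its own, and the toy mask is made up for illustration.

```python
import numpy as np

def mask2rle(img):
    # Read pixels column by column, then record 1-indexed (start, length) runs.
    pixels = img.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return ' '.join(str(x) for x in runs)

def rle2mask(mask_rle, shape):
    # shape is (width, height); the result has shape (height, width).
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0::2], s[1::2])]
    starts -= 1
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, starts + lengths):
        img[lo:hi] = 1
    return img.reshape(shape).T

# Toy mask: 3 rows (height) by 4 columns (width).
toy_mask = np.array([[0, 1, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 1, 1]], dtype=np.uint8)

rle = mask2rle(toy_mask)
print(rle)  # 4 2 9 1 12 1

# Decoding needs (width, height), i.e. the reverse of toy_mask.shape.
decoded = rle2mask(rle, (toy_mask.shape[1], toy_mask.shape[0]))
print(np.array_equal(decoded, toy_mask))  # True
```

Reading the string in pairs: a run of 2 pixels starts at position 4, a run of 1 at position 9, and a run of 1 at position 12, where positions count down each column in turn, starting from 1.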

Now let's view an assortment of the tiles prepared for us.

In [None]:
def view_images(images, title='', aug=None):
    # `aug` is not used yet; it is kept so existing calls keep working.
    width = 6
    height = 4
    fig, axs = plt.subplots(height, width, figsize=(15, 5))
    for im in range(height * width):
        # OpenCV loads images as BGR, so convert to RGB before plotting.
        data = cv2.cvtColor(cv2.imread(images[im]), cv2.COLOR_BGR2RGB)
        i = im // width
        j = im % width
        axs[i, j].imshow(data)
        axs[i, j].axis('off')

    plt.suptitle(title)
    plt.show()

In [None]:
view_images(list_ims, title="First 24 256x256 tiles")

Next, let's take a look at one of the giant images, a full scan from which these small tiles were cut. Be prepared to deal with memory issues if you want to use the full-size image: it is huge and takes quite some time to read :-)

In [None]:
im = tifffile.imread('../input/hubmap-kidney-segmentation/train/e79de561c.tiff')
plt.figure(figsize=(16, 16))
plt.imshow(im[0, 0, :, :, :].transpose(1, 2, 0))
plt.axis("off");
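
One way to keep previews of such a scan light is to plot only every n-th pixel. The sketch below runs on a synthetic array rather than the real scan; `to_hwc` and `thumbnail` are hypothetical helper names, and the `(1, 1, 3, height, width)` layout is assumed from the indexing used in the cell above. Note that this only makes plotting cheaper: the full array still has to be read first.

```python
import numpy as np

def to_hwc(arr):
    """Drop singleton axes and return a (height, width, channels) view."""
    arr = np.squeeze(arr)
    # Assumption: a leading axis of length 3 is the colour channel axis.
    if arr.ndim == 3 and arr.shape[0] == 3:
        arr = np.moveaxis(arr, 0, -1)
    return arr

def thumbnail(arr, step=8):
    """Cheap preview: keep every `step`-th pixel along both spatial axes."""
    return to_hwc(arr)[::step, ::step]

# Synthetic stand-in for a scan with layout (1, 1, channels, height, width).
fake_scan = np.zeros((1, 1, 3, 160, 240), dtype=np.uint8)
preview = thumbnail(fake_scan, step=8)
print(preview.shape)  # (20, 30, 3)
```

The result can be passed straight to `plt.imshow`, since it is already in height, width, channels order.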

Let's zoom in so that we can see a portion of the scan more clearly, and check whether we can observe any of the shifts in the image that have been pointed out in the public discussion forum.

In [None]:
plt.figure(figsize=(10, 10))
plt.imshow(im[0, 0, :, :, :].transpose(1, 2, 0)[7000:8500, 7000:8500])
plt.axis("off");

You might be as confused as I am about these little structures. Apparently this is a zoomed-in view of several glomeruli. A glomerulus is, in essence, a tuft of blood capillaries that acts as a sieve, filtering your blood as it passes through the nephrons in your kidney.

In [None]:
train = pd.read_csv('../input/hubmap-kidney-segmentation/train.csv')
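# Row 4 of train.csv is assumed to hold the annotation for the scan loaded above (e79de561c).
# rle2mask expects shape as (width, height).
example_mask = rle2mask(train['encoding'][4], (im.shape[4], im.shape[3]))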

With the run-length-encoded annotation decoded into a mask, we can now view the image with the annotations overlaid.

In [None]:
plt.figure(figsize=(15, 15))
plt.imshow(im[0, 0, :, :, :].transpose(1, 2, 0))
plt.imshow(example_mask, alpha=0.5)
plt.title("Image + Mask", fontsize=16);
plt.axis("off");

Let's zoom in and see if we can find any potential irregularities in the image.

In [None]:
plt.figure(figsize=(15, 15))
plt.imshow(im[0, 0, :, :, :].transpose(1, 2, 0)[7000:8500, 7000:8500])
plt.imshow(example_mask[7000:8500, 7000:8500], alpha=0.5)
plt.title("Image + Mask", fontsize=16);
plt.axis("off");

Again we can see a shift between the image and the mask, so this is another inaccuracy to deal with while modelling. On the forums the offset is estimated at about 50 px, which seems like a reasonable estimate.
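
If you want to experiment with correcting such an offset, a mask can be translated by a fixed number of pixels. This is only a sketch: `shift_mask` is a hypothetical helper, and the direction of the shift is not established here, so `dy` and `dx` are values you would have to find by eye on the zoomed-in overlay.

```python
import numpy as np

def shift_mask(mask, dy=0, dx=0):
    """Translate a 2-D mask by (dy, dx) pixels, filling vacated pixels with 0.

    Positive dy moves the mask down, positive dx moves it right.
    """
    h, w = mask.shape
    shifted = np.zeros_like(mask)
    # A shift larger than the mask pushes everything out of frame.
    if abs(dy) >= h or abs(dx) >= w:
        return shifted
    # Source and destination windows that stay inside the array bounds.
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = mask[src_y, src_x]
    return shifted

# Toy example: a single annotated pixel moved one row down and two columns right.
toy = np.zeros((4, 4), dtype=np.uint8)
toy[1, 1] = 1
moved = shift_mask(toy, dy=1, dx=2)
print(np.argwhere(moved == 1))  # [[2 3]]
```

Unlike `np.roll`, this does not wrap pixels around to the opposite edge, which matters for annotations near the border of a scan.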

# Metadata

This is a brief analysis of the metadata file that comes with the dataset. There is very little of it, but we can use it to get a better sense of what we are dealing with.

## Analysis by Race

In [None]:
m = pd.read_csv('../input/hubmap-kidney-segmentation/HuBMAP-20-dataset_information.csv')
# The bars show the share of samples per category, so race goes on the x-axis.
m['race'].value_counts(normalize=True).iplot(kind='bar',
                                             xTitle='Race',
                                             yTitle='Proportion',
                                             linecolor='black',
                                             opacity=0.7,
                                             color='red',
                                             theme='pearl',
                                             bargap=0.8,
                                             gridcolor='white',
                                             title='Distribution of the Race column')

## Analysis by Gender

In [None]:
# The bars show the share of samples per category, so sex goes on the x-axis.
m['sex'].value_counts(normalize=True).iplot(kind='bar',
                                            xTitle='Sex',
                                            yTitle='Proportion',
                                            linecolor='black',
                                            opacity=0.7,
                                            color='steelblue',
                                            theme='pearl',
                                            bargap=0.8,
                                            gridcolor='white',
                                            title='Distribution of the Sex column')
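
Both plots above rely on `value_counts(normalize=True)`, which returns fractions rather than raw counts. A tiny illustration on toy data (the labels below are invented and have nothing to do with the competition metadata):

```python
import pandas as pd

# Toy data, invented for illustration only; not the competition metadata.
toy = pd.DataFrame({'group': ['A', 'A', 'B', 'A']})

counts = toy['group'].value_counts()                # absolute counts
shares = toy['group'].value_counts(normalize=True)  # fractions that sum to 1

print(counts.to_dict())  # {'A': 3, 'B': 1}
print(shares.to_dict())  # {'A': 0.75, 'B': 0.25}
```

With only a handful of scans in the dataset, it may be worth looking at the raw counts next to the fractions, since a large share can correspond to very few samples.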