# Introduction

This is an exploratory notebook on the HuBMAP challenge. It introduces rasterio, a tool to read and manipulate tiff files.

![img](https://i.imgur.com/F4naKqh.png)

## Part-1 : Crops and masks with rasterio

### a. Load an image with rasterio

Loading an image, checking the coordinate system and the affine transformation

*rasterio*

### b. Visualise Anatomical Structures

We use the annotated json to visualise the different structures inside an image

*json, rasterio, rasterio.mask*

### c. Visualise glomeruli

We use the annotated json to generate masks and visualise the glomeruli

*json, rasterio, rasterio.mask*

### d. Generate overlay mask of glomeruli on anatomical structure with rasterio

In this last part, we create a full downsized image with the masks of the glomeruli overlaid

*json, rasterio, rasterio.mask*

## Part-2 : EDA

### a. Useful surface, Glomeruli surface

Some global stats, id per id:

1. We want to check the total surface in pixels of the images
2. We want to check the total space covered by glomeruli
3. Rather than plotting the ratio of surface_glomeruli/total_surface, we check only the "useful" surface, i.e. the annotated surface that actually contains glomeruli

*cv2, rasterio, rasterio.mask, json*

### b. Height, width and surface of the glomeruli in an image

It is important to know how much space a full glomerulus takes in order to calibrate the crops correctly

### c. Size and orientation of glomeruli

As the glomeruli can have an orientation, we would also like to know their dimensions independently of that orientation.

### d. Biggest and smallest glomeruli in the set

Visualisation of the extrema. Helpful to identify possible mistakes in labels

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import glob

from tqdm import tqdm

import plotly.graph_objects as go

In [None]:
df1 = pd.read_csv('/kaggle/input/hubmap-kidney-segmentation/train.csv')
df1.head()

# Part-1 : Crops and masks with rasterio

In [None]:
id_ = 'cb2d976f4'
path_tiff = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.tiff'
path_json1 = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}-anatomical-structure.json'
path_json2 = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.json'

## Load an image using rasterio

Rasterio is a powerful tool to read and filter files in .tiff format.
A .tiff file contains arrays of pixels as well as metadata.

The metadata include:

* crs: the coordinate reference system of the tiff, when one is defined
* transform: the affine transformation, which encodes the coordinates of the top-left pixel as well as the pixel size
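
To make the role of `transform` concrete, here is a minimal pure-Python sketch of how a six-coefficient affine transform maps a pixel position (column, row) to coordinates. The helpers `apply_affine` and `scale` and the coefficient values are illustrative assumptions, not rasterio functions. For an image without georeferencing the transform is typically the identity, so pixel indices and coordinates coincide.

```python
# Affine coefficients in the order (a, b, c, d, e, f):
#   x = a * col + b * row + c
#   y = d * col + e * row + f
def apply_affine(t, col, row):
    a, b, c, d, e, f = t
    return a * col + b * row + c, d * col + e * row + f


def scale(t, sx, sy):
    """Compose t with a scaling of the pixel grid (pixels become sx by sy times larger)."""
    a, b, c, d, e, f = t
    return (a * sx, b * sy, c, d * sx, e * sy, f)


identity = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0)
downsized = scale(identity, 4, 4)  # pixel grid of an image reduced by a factor of 4

print(apply_affine(identity, 10, 20))   # (10.0, 20.0)
print(apply_affine(downsized, 10, 20))  # (40.0, 80.0)
```

The second call shows why the transform has to be rescaled when an image is downsized: pixel (10, 20) of the small image corresponds to position (40, 80) in the original coordinates.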

In [None]:
import rasterio

raster = rasterio.open(path_tiff)
img = raster.read()
crs = raster.crs
transform = raster.transform
print(f'shape : {img.shape}')
print(f'crs : {crs}')
print(f'transform : {transform}')

## Crop an image with rasterio.mask and polygons

### Anatomical Structure

The image is too big. We can use the polygons given in the dataset to crop directly around the areas of interest (aoi).
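
As a rough illustration of what such an annotation looks like, the sketch below parses a small hand-written example and derives the bounding box of its polygon. The JSON string is hypothetical (including the `"type": "Feature"` field and the coordinates); only the `geometry`, `type` and `coordinates` keys are taken from the code used in this notebook.

```python
import json

# Hypothetical annotation content: a list of features, each with a polygon geometry.
raw = """
[{"type": "Feature",
  "geometry": {"type": "Polygon",
               "coordinates": [[[100, 200], [100, 260], [180, 260],
                                [180, 200], [100, 200]]]}}]
"""
features = json.loads(raw)

# The first ring of a polygon is its exterior: a list of [x, y] vertices.
ring = features[0]["geometry"]["coordinates"][0]

xs = [x for x, _ in ring]
ys = [y for _, y in ring]
bbox = (min(xs), min(ys), max(xs), max(ys))
print(bbox)  # (100, 200, 180, 260)
```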

In [None]:
import json
with open(path_json1, 'r') as f:
    aois = json.load(f)
    
aoi = aois[0]['geometry']
aoi

rasterio.mask.mask takes as input the raster file as well as a list of aois.

Several parameters are also available. The most important are:
* crop: if True, crop the image to the extent of the aoi; otherwise, keep the full extent and only set the pixels outside the aoi to 0
* filled: if True (the default), the pixels outside the aoi are set to the nodata value (0 here); if False, the output is a masked array in which those pixels are masked instead
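
The effect of these two parameters can be emulated on a toy array with NumPy alone. This is only a sketch of the semantics, not how rasterio implements them; the boolean array `inside` is a hypothetical stand-in for a rasterised polygon.

```python
import numpy as np

# Toy single-band "image" and a boolean footprint standing in for a polygon.
img = np.arange(1, 26, dtype=np.uint8).reshape(5, 5)
inside = np.zeros((5, 5), dtype=bool)
inside[1:3, 2:4] = True  # hypothetical area of interest

# crop=False, filled=True: same shape, pixels outside the footprint set to 0
filled = np.where(inside, img, 0)

# crop=False, filled=False: same shape, outside pixels hidden behind a mask
masked = np.ma.masked_array(img, mask=~inside)

# crop=True: keep only the bounding box of the footprint
rows, cols = np.where(inside)
cropped = filled[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

print(filled.shape, cropped.shape)  # (5, 5) (2, 2)
print(int(masked.sum()))            # 44, the sum of the four pixels inside
```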

In [None]:
from rasterio.mask import mask

cropped_img, transform_c = mask(raster, [aoi], crop = True, filled = False)
print(cropped_img.shape)

The image is now much smaller. To visualise it, we first need to move the band axis to the last position, from (bands, height, width) to (height, width, bands)

In [None]:
cropped_img_t = cropped_img.transpose(1,2,0)
print(cropped_img_t.shape)

In [None]:
plt.figure(figsize = (20,20))
plt.imshow(cropped_img_t)
plt.show()

### Glomeruli

Glomeruli are the areas to identify in the main tiffs. Let's plot a sample of them.

In [None]:
with open(path_json2, 'r') as f:
    aois_g = json.load(f)

plt.figure(figsize = (30,30))
for i in range(25):
    aoi_g = aois_g[i]['geometry']
    cropped_img_g, _ = mask(raster, [aoi_g], crop = True)
    cropped_img_g_t = cropped_img_g.transpose(1,2,0)
    plt.subplot(5,5,i+1)
    fig = plt.imshow(cropped_img_g_t)
    fig.axes.get_xaxis().set_visible(False)
    fig.axes.get_yaxis().set_visible(False)
plt.show()

In [None]:
import cv2
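def resize(img,s=1, shape=None):
    """Downsize a (bands, height, width) image by a factor s, or to an explicit (height, width) shape."""
    imgt = img.transpose(1,2,0)
    if shape:
        new_img = np.zeros((shape[0],shape[1],3), dtype=np.uint8)
    else:
        new_img = np.zeros((int(imgt.shape[0]/s),int(imgt.shape[1]/s),3), dtype=np.uint8)
    # Resize band by band; cv2.resize expects the target size as (width, height)
    for i in range(3):
        temp = imgt[:,:,i]
        new_img[:,:,i] = cv2.resize(temp,(new_img.shape[1],new_img.shape[0]))
    # Back to (bands, height, width), the layout expected by rasterio
    new_img = new_img.transpose(2,0,1)
    return new_img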

## Generate overlay mask of Glomerulus on Anatomical Structure

In this part, we generate a sub-tiff of the anatomical structure, and overlay the masks of the glomeruli on it

### Save the subraster generated earlier

In [None]:
cropped_img.shape

We need to save the raster together with the transform returned by the crop, so that the polygon coordinates still map to the right pixels
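
The reason is that the polygons are expressed in the coordinates of the full image, while the crop starts somewhere in the middle of it. A minimal sketch of the arithmetic, assuming a pixel size of 1 and a hypothetical crop offset (`x_off`, `y_off` and `to_crop_pixel` are illustrative names):

```python
# Hypothetical crop whose top-left corner sits at (x_off, y_off) in the full image.
x_off, y_off = 3000, 1200


def to_crop_pixel(x, y):
    """Convert full-image coordinates to (col, row) inside the crop."""
    return x - x_off, y - y_off


# A glomerulus vertex annotated at (3250, 1300) in the full image
# lands at column 250, row 100 of the crop.
print(to_crop_pixel(3250, 1300))  # (250, 100)
```

Without the offset stored in the transform, the same vertex would be looked up at (3250, 1300), which is most likely outside the crop.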

In [None]:
with rasterio.open('temp.tif','w',driver='GTiff',
                   height=cropped_img.shape[1],
                   width=cropped_img.shape[2],
                   count=cropped_img.shape[0],
                   dtype=cropped_img.dtype,
                   crs = crs,
                   transform=transform_c,) as f:
    f.write(cropped_img)

subraster = rasterio.open('temp.tif')

### Generate masks of glomerulus

In [None]:
all_aois_g = [elmt['geometry'] for elmt in aois_g]
cropped_img_g, _ = mask(subraster, all_aois_g, crop = False)
cropped_img_g = cropped_img_g.transpose(1,2,0)
print(cropped_img_g.shape)

In [None]:
plt.figure(figsize = (20,20))
plt.imshow(cropped_img_g)
plt.show()

### Overlay the mask

In [None]:
glom_mask = cropped_img_g.copy()
glom_mask[glom_mask>0]=255

In [None]:
plt.figure(figsize = (20,20))
plt.imshow(cropped_img_t)
plt.imshow(glom_mask, alpha = 0.5)
plt.show()

# Part 2 - EDA

There are 8 images in total in the train set.

In [None]:
ids = glob.glob('/kaggle/input/hubmap-kidney-segmentation/train/*.tiff')
ids = [elmt.split('/')[-1].split('.')[0] for elmt in ids]
ids

## Useful Surface, Glomeruli Surface

Let's generate for each id the following metrics:
- tot_surface: the total number of pixels in an image
- tot_annotated: the total number of pixels covered by the main (anatomical-structure) annotations
- tot_glomeruli: the total number of pixels covered by glomeruli
- tot_usefull_surface: the total number of pixels in the annotations that contain at least one glomerulus
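
Before running this on the real images, the bookkeeping can be checked on a toy example. The sketch below uses three small hand-made binary masks (two overlapping annotated regions and one glomerulus, all hypothetical) and a `visited` array so that overlapping pixels are counted only once.

```python
import numpy as np

# Toy 4x6 image with two overlapping annotated regions (hypothetical).
shape = (4, 6)
region_a = np.zeros(shape, dtype=int)
region_a[:, 0:3] = 1            # columns 0-2
region_b = np.zeros(shape, dtype=int)
region_b[:, 2:5] = 1            # columns 2-4, overlaps region_a on column 2
glomeruli = np.zeros(shape, dtype=int)
glomeruli[1:3, 0:2] = 1         # 4 pixels, inside region_a only

visited = np.ones(shape, dtype=int)  # 1 = pixel not counted yet
tot_annotated = 0
tot_useful_surface = 0
for region in (region_a, region_b):
    fresh = region * visited                  # pixels of this region not counted before
    tot_annotated += int(fresh.sum())
    if (fresh * glomeruli).sum() > 0:         # the region holds at least one glomerulus pixel
        tot_useful_surface += int(fresh.sum())
    visited = np.clip(visited - region, 0, 1)

tot_surface = shape[0] * shape[1]
tot_glomeruli = int(glomeruli.sum())
print(tot_surface, tot_annotated, tot_glomeruli, tot_useful_surface)  # 24 20 4 12
```

Here the density over the useful surface is 4 / 12, against 4 / 24 over the whole image.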

In [None]:
#To save memory, the images are downsized on the fly by a factor of 4.
r = 4

surface_summary = {}

for id_ in tqdm(ids):
    
    tot_annotated = 0
    tot_usefull_surface = 0
    
    path_tiff = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.tiff'
    path_json1 = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}-anatomical-structure.json'
    path_json2 = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.json'

    ########LOAD ANNOTATIONS#####
    with open(path_json1, 'r') as f:
        aois1 = json.load(f)

    with open(path_json2, 'r') as f:
        aois2 = json.load(f)

    polys1 = [elmt['geometry'] for elmt in aois1]
    polys2 = [elmt['geometry'] for elmt in aois2]
    
    #######OPEN RASTER AND RESIZE##########
    raster = rasterio.open(path_tiff)
    crs = raster.crs
    transform = raster.transform
    img = raster.read()

    # Resize
    img = resize(img, r)
    shape_ = img[0].shape
    
    #This array is used to remove pixels already visited, in case of overlapping of annotations
    visited = np.ones(shape_)
    
    #Readjust affine transformation
    transform = transform * transform.scale(r,r)

    # Save resized
    with rasterio.open(f'{id_}.tif','w',driver='GTiff',
                       height=img.shape[1],
                       width=img.shape[2],
                       count=img.shape[0],
                       dtype=img.dtype,
                       crs = crs,
                       transform=transform,) as f:
        f.write(img)

    raster = rasterio.open(f'{id_}.tif')
    
    #Generate masks
    #The glomeruli mask does not depend on the annotation, so it is computed only once
    mask2 = mask(raster, polys2)[0][0]
    mask2[mask2>0]= 1
    
    for poly in polys1:
        mask1 = mask(raster, [poly])[0][0]
        mask1[mask1>0] = 1
        
        cg = (mask1*mask2*visited).sum()
        tot_annotated += (mask1*visited).sum()
        
        if cg>0: #if there are glomeruli in the annotation
            tot_usefull_surface += (mask1*visited).sum()
            
        visited = visited - mask1
        visited[visited<0] = 0
            
    tot_surface = shape_[0]*shape_[1]
    tot_glomeruli = mask2.sum()
    
    surface_summary[id_] = {}
    surface_summary[id_]['tot_surface'] = tot_surface
    surface_summary[id_]['tot_glomeruli'] = tot_glomeruli
    surface_summary[id_]['tot_annotated'] = tot_annotated
    surface_summary[id_]['tot_usefull_surface'] = tot_usefull_surface
    

In [None]:
df_surface = pd.DataFrame(surface_summary).T

df_surface['glomeruli_density_tot'] = df_surface.tot_glomeruli / df_surface.tot_surface
df_surface['glomeruli_density_annotated'] = df_surface.tot_glomeruli / df_surface.tot_annotated
df_surface['glomeruli_density_usefull'] = df_surface.tot_glomeruli / df_surface.tot_usefull_surface
df_surface['tot_annotated_ratio'] = df_surface.tot_annotated / df_surface.tot_surface
df_surface['tot_usefull_ratio'] = df_surface.tot_usefull_surface / df_surface.tot_surface

df_surface = np.round(df_surface, 3)

df_surface = df_surface.sort_values('glomeruli_density_usefull',ascending = False)
df_surface

In [None]:
fig = go.Figure(
    go.Bar(
        x = df_surface.index,
        y = df_surface.glomeruli_density_usefull
    )
)

fig.update_layout(template = 'presentation', title = 'Density of Glomeruli per id')
fig.show()

## Average height, width and surface of the glomeruli

We check here the height, width, and surface occupied by each glomerulus

In [None]:
glomeruli_metadata = {}
for id_ in ids:
    glomeruli_metadata[id_] = {'width':[], 'height':[], 'surface':[]}
    path_tiff = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.tiff'
    path_json2 = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.json'

    with open(path_json2, 'r') as f:
        aois2 = json.load(f)

    polys2 = [elmt['geometry'] for elmt in aois2]
    raster = rasterio.open(path_tiff)
    
    for poly in tqdm(polys2):
        m, _ = mask(raster, [poly], crop = True)
        m = m[0]
        glomeruli_metadata[id_]['width'].append(m.shape[1])
        glomeruli_metadata[id_]['height'].append(m.shape[0])
    
        m[m>0] = 1
        glomeruli_metadata[id_]['surface'].append(np.sum(m))

### Distributions

In [None]:
import seaborn as sns
plt.figure(figsize = (20,20))
plt.subplot(3,1,1)
plt.title('width')
for k,v in glomeruli_metadata.items():
    sns.distplot(v['width'], label = k)
plt.legend()
plt.subplot(3,1,2)
plt.title('height')
for k,v in glomeruli_metadata.items():
    sns.distplot(v['height'], label = k)
plt.legend()
plt.subplot(3,1,3)
plt.title('surface')
for k,v in glomeruli_metadata.items():
    sns.distplot(v['surface'], label = k)
plt.legend()
plt.show()

In [None]:
widths = []
heights = []
areas = []
for ids_, dic in glomeruli_metadata.items():
    widths+=dic['width']
    heights+=dic['height']
    areas += dic['surface']

In [None]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x = widths,
        y = heights,
        opacity = 0.5,
        mode = 'markers'
    )
)
fig.update_layout(template = 'presentation', title = 'Space taken by the glomeruli')
fig.update_xaxes(title = 'width')
fig.update_yaxes(title = 'height')
fig.show()

The figure above shows the bounding-box width and height of every glomerulus. The average and maximum extents can be used to choose the size of the crops later
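
One possible way to turn these measurements into a crop size is sketched below. The sizes are made-up numbers, not values from the dataset, and rounding up to a multiple of 256 is only an assumed rule of thumb.

```python
import statistics

# Hypothetical bounding-box sizes in pixels (not taken from the dataset).
widths = [310, 280, 450, 390, 520]
heights = [300, 340, 410, 360, 480]

# A square crop must fit the larger of the two sides of each glomerulus.
largest_side = [max(w, h) for w, h in zip(widths, heights)]
mean_side = statistics.mean(largest_side)
max_side = max(largest_side)

# Assumed rule of thumb: round the largest side up to the next multiple of 256.
tile = 256
crop_size = -(-max_side // tile) * tile
print(mean_side, max_side, crop_size)
```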

### Length, width, and orientation of glomeruli

The insights above indicate the space taken by a glomerulus in an image. We would also like to know the length and width of the glomeruli, regardless of their orientation in the image.

For that purpose, I use the function regionprops_table from skimage
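
The axis lengths reported by skimage are documented as those of the ellipse that has the same normalized second central moments as the region. The NumPy sketch below illustrates that idea on a toy rectangular mask; it is an illustration under that assumption, not a replacement for `regionprops_table`, and it leaves the orientation aside.

```python
import numpy as np

# Toy binary mask: a 10 x 40 pixel rectangle standing in for an elongated glomerulus.
m = np.zeros((60, 60), dtype=np.uint8)
m[25:35, 10:50] = 1

# Coordinates of the pixels that belong to the region, one row per axis.
rows, cols = np.nonzero(m)
coords = np.stack([rows, cols]).astype(float)

# Second central moments of the pixel coordinates (population covariance).
cov = np.cov(coords, bias=True)
eigvals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order

# Axis lengths of the ellipse with the same second moments as the region.
minor_axis_length, major_axis_length = 4 * np.sqrt(eigvals)
print(round(major_axis_length, 2), round(minor_axis_length, 2))
```

Because the lengths come from the eigenvalues of the covariance matrix, rotating the region does not change them (up to pixel discretisation), unlike the bounding-box width and height used earlier.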

In [None]:
from skimage.measure import regionprops_table

glomeruli_metadata = {}
for id_ in tqdm(ids):
    glomeruli_metadata[id_] = {'orientation':[], 'major_axis_length':[], 'minor_axis_length':[]}
    path_tiff = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.tiff'
    path_json2 = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.json'

    with open(path_json2, 'r') as f:
        aois2 = json.load(f)

    polys2 = [elmt['geometry'] for elmt in aois2]
    raster = rasterio.open(path_tiff)
    
    for poly in polys2:
        m, _ = mask(raster, [poly], crop = True)
        m = m[0]
        m[m>0] = 1
        if m.sum()>0:
            props = regionprops_table(m, properties=('orientation',
                                                     'major_axis_length',
                                                     'minor_axis_length'))

            glomeruli_metadata[id_]['orientation'].append(props['orientation'])
            glomeruli_metadata[id_]['major_axis_length'].append(props['major_axis_length'])
            glomeruli_metadata[id_]['minor_axis_length'].append(props['minor_axis_length'])

In [None]:
import seaborn as sns
plt.figure(figsize = (20,20))
plt.subplot(3,1,1)
plt.title('orientation')
for k,v in glomeruli_metadata.items():
    sns.distplot(np.abs(v['orientation']), label = k)
plt.legend()
plt.subplot(3,1,2)
plt.title('major_axis_length')
for k,v in glomeruli_metadata.items():
    sns.distplot(v['major_axis_length'], label = k)
plt.legend()
plt.subplot(3,1,3)
plt.title('minor_axis_length')
for k,v in glomeruli_metadata.items():
    sns.distplot(v['minor_axis_length'], label = k)
plt.legend()
plt.show()

# Representation of the average glomeruli

In [None]:
widths = []
heights = []
for ids_, dic in glomeruli_metadata.items():
    widths+=[elmt[0] for elmt in dic['major_axis_length']]
    heights+=[elmt[0] for elmt in dic['minor_axis_length']]
    
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x = widths,
        y = heights,
        opacity = 0.5,
        mode = 'markers'
    )
)

fig.update_layout(template = 'presentation', title = 'Size of the glomeruli')
fig.update_xaxes(title = 'major_axis_length')
fig.update_yaxes(title = 'minor_axis_length')
fig.show()

# Visualising the outliers

The figure above shows some very small glomeruli, and some glomeruli with strange proportions.
In the next plots, I isolate those glomeruli and display them with their masks
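
To display an outlier with some context, a fixed-size square window is centred on it. A minimal sketch of that step, using a hypothetical outline and an illustrative helper `square_window`:

```python
def square_window(ring, size):
    """Return a closed square ring of side `size` centred on the mean of the vertices."""
    cx = sum(x for x, _ in ring) / len(ring)
    cy = sum(y for _, y in ring) / len(ring)
    p = size / 2
    x1, x2, y1, y2 = cx - p, cx + p, cy - p, cy + p
    # The first vertex is repeated at the end so that the ring is closed.
    return [[x1, y1], [x1, y2], [x2, y2], [x2, y1], [x1, y1]]


# Hypothetical small glomerulus outline (vertices in pixel coordinates).
ring = [[1000, 2000], [1000, 2040], [1060, 2040], [1060, 2000]]
window = square_window(ring, 500)
print(window[0], window[2])  # [780.0, 1770.0] [1280.0, 2270.0]
```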

## Very small glomeruli

In [None]:
size = 500
p = int(size/2)
image_set = {}
for id_, dic in glomeruli_metadata.items():
    glomeruli = np.where(np.hstack(dic['major_axis_length'])<100)[0]
    if len(glomeruli):
        print(f'{id_} -> {glomeruli}')
        path_tiff = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.tiff'
        path_json2 = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.json'
        with open(path_json2, 'r') as f:
            aois_g = json.load(f)
            
        polys2 = [elmt['geometry'] for elmt in aois_g]
        #Open the associated tiff with rasterio
        raster = rasterio.open(path_tiff)
        #Get polygons
        for arg in glomeruli:
            x,y = np.array(polys2[arg]['coordinates']).mean(axis=1)[0]
            x1,x2,y1,y2 = x-p,x+p,y-p,y+p
            # Generate a new square polygon (closed ring) with extended coordinates
            poly = {'type':'Polygon', 'coordinates':[[[x1,y1],[x1,y2],[x2,y2],[x2,y1],[x1,y1]]]}
            # Crop the figure
            cimg, ctrans = mask(raster, [poly], crop = True)
            #Temporarily save the cropped raster, using the crs of the raster being cropped
            with rasterio.open('temp.tif','w',driver='GTiff',
                   height=cimg.shape[1],
                   width=cimg.shape[2],
                   count=cimg.shape[0],
                   dtype=cimg.dtype,
                   crs = raster.crs,
                   transform=ctrans,) as f:
                
                f.write(cimg)
            
            #Reload the image as tiff
            craster = rasterio.open('temp.tif')
            #Generate the mask based on crop
            cmask, _ = mask(craster, [polys2[arg]], crop = False)
            cmask[cmask>0] = 1
            cmask = cmask.astype(float)
            image_set[id_ + '-' + str(arg)] = {}
            image_set[id_ + '-' + str(arg)]['image'] = cimg.transpose(1,2,0)
            image_set[id_ + '-' + str(arg)]['mask'] = cmask[0]

In [None]:
plt.figure(figsize = (30,10))
i = 0
for k,v in image_set.items():
    plt.subplot(1,5,i+1)
    cimg = v['image']
    cmask = v['mask']
    plt.title(k)
    plt.imshow(cimg)
    plt.imshow(cmask, alpha = 0.5)
    i+=1
plt.show()

## Very big glomeruli

In [None]:
size = 1000
p = int(size/2)
image_set = {}
for id_, dic in glomeruli_metadata.items():
    glomeruli = np.where(np.hstack(dic['major_axis_length'])>600)[0]
    if len(glomeruli):
        print(f'{id_} -> {glomeruli}')
        path_tiff = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.tiff'
        path_json2 = f'/kaggle/input/hubmap-kidney-segmentation/train/{id_}.json'
        with open(path_json2, 'r') as f:
            aois_g = json.load(f)
            
        polys2 = [elmt['geometry'] for elmt in aois_g]
        #Open the associated tiff with rasterio
        raster = rasterio.open(path_tiff)
        #Get polygons
        for arg in glomeruli:
            x,y = np.array(polys2[arg]['coordinates']).mean(axis=1)[0]
            x1,x2,y1,y2 = x-p,x+p,y-p,y+p
            # Generate a new square polygon (closed ring) with extended coordinates
            poly = {'type':'Polygon', 'coordinates':[[[x1,y1],[x1,y2],[x2,y2],[x2,y1],[x1,y1]]]}
            # Crop the figure
            cimg, ctrans = mask(raster, [poly], crop = True)
            #Temporarily save the cropped raster, using the crs of the raster being cropped
            with rasterio.open('temp.tif','w',driver='GTiff',
                   height=cimg.shape[1],
                   width=cimg.shape[2],
                   count=cimg.shape[0],
                   dtype=cimg.dtype,
                   crs = raster.crs,
                   transform=ctrans,) as f:
                
                f.write(cimg)
            
            #Reload the image as tiff
            craster = rasterio.open('temp.tif')
            #Generate the mask based on crop
            cmask, _ = mask(craster, [polys2[arg]], crop = False)
            cmask[cmask>0] = 1
            cmask = cmask.astype(float)
            image_set[id_ + '-' + str(arg)] = {}
            image_set[id_ + '-' + str(arg)]['image'] = cimg.transpose(1,2,0)
            image_set[id_ + '-' + str(arg)]['mask'] = cmask[0]

In [None]:
plt.figure(figsize = (20,10))
i = 0
for k,v in image_set.items():
    plt.subplot(2,3,i+1)
    cimg = v['image']
    cmask = v['mask']
    plt.title(k)
    plt.imshow(cimg)
    plt.imshow(cmask, alpha = 0.3)
    i+=1
plt.show()

Below is also a list of suspicious glomeruli identified by scrolling through the dataset