# Competition Overview

- 37 trillion cells in body
- determine function and relationship
- understand cellular activity
<br>
- Human BioMolecular Atlas Program (HuBMAP)
- development of framework for mapping human body at cellular level

**mapping kidney at single cell resolution**

- detect functional tissue units **FTUs** in different tissue preparation pipelines
- FTU: 3D block of cells around a capillary, in which every cell is within diffusion distance of any other cell in the same block
- **goal: glomeruli FTU detector**

uses:
- understand relationships between cell and tissue organization and function
- cell and tissue anatomy
- develop pharmaceutical therapies

# Expected Submission File

A CSV file with the header `img,pixels`<br>
Each row holds an image id and the RLE values that describe its mask

**RLE**:
* Run-length encoding
* reduces file size
* continuous number string with only numbers and spaces
* **PIXEL_ID_A** *followed by* **RUN_LENGTH_A** *followed by* **PIXEL_ID_B** *followed by* **RUN_LENGTH_B**<br>

**PIXEL_IDs**:<br>
(1,1)(2,1)(3,1)<br>
(1,2)(2,2)(3,2)<br>
(1,3)(2,3)(3,3)<br>

turns to:<br>
(1)(2)(3)<br>
(4)(5)(6)<br>
(7)(8)(9)<br>

**Result**:<br>
img,pixels<br>
1,1 1 5 1<br>
2,1 1<br>
3,1 1<br>
etc.
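
As an illustration of the encoding, the sketch below turns a small binary mask into a run-length string. `mask_to_rle` is a hypothetical helper, not part of the competition code. It assumes pixels are one-indexed and numbered column by column (top to bottom, then left to right), which matches the transpose used in `rleToMask` further down in this notebook.

```python
import numpy as np

def mask_to_rle(mask):
    """Encode a binary mask as 'start length start length ...' (hypothetical helper)."""
    # Flatten column by column: top to bottom, then left to right (assumed order)
    pixels = mask.T.flatten()
    # Pad with a zero on both ends so runs touching the border are detected
    pixels = np.concatenate([[0], pixels, [0]])
    # One-indexed positions where the value changes (run starts and run ends)
    changes = np.where(pixels[1:] != pixels[:-1])[0] + 1
    # Turn every run end into a run length
    changes[1::2] -= changes[::2]
    return ' '.join(str(x) for x in changes)

# Pixels 1 and 5 are set, as in the first row of the result example above
mask = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 0]], dtype=np.uint8)
print(mask_to_rle(mask))  # 1 1 5 1
```

Each pair in the output reads as "start at this pixel id, continue for this many pixels".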
        

# input data

**Test set**: 5 tiff images<br>
**Train set**: 8 tiff images<br>

**train.csv**:<br>
* mask of glomeruli in image
* RLE encoded
* id column: image id
* encoding column: RLE encoded mask data

**img.json**: <br>
* one for each image
* geometry: pixel coords of a glomerulus Polygon as [[x1,y1],[x2,y2],...]
* HAS SAME INFO AS TRAIN.CSV
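
A minimal sketch of how such a file can be read with the standard library. The sample string is invented for illustration; its layout (a list of features, each with `geometry.coordinates` holding one outer ring of `[x, y]` points) is an assumption based on how `json_to_coords` accesses the data later in this notebook.

```python
import json

# Invented sample in the assumed layout, not taken from the dataset
sample = """
[{"type": "Feature",
  "geometry": {"type": "Polygon",
               "coordinates": [[[10, 20], [30, 20], [30, 40], [10, 20]]]}}]
"""

features = json.loads(sample)
# First feature, outer ring of its polygon
ring = features[0]["geometry"]["coordinates"][0]
xs = [x for x, y in ring]
ys = [y for x, y in ring]
print(min(xs), max(xs), min(ys), max(ys))  # 10 30 20 40
```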

**img-anatomical-structure.json**: <br>
* one for each image
* geometry: pixel coords of Medulla and Cortex as [[x1,y1],[x2,y2],...]

**HuBMAP-20-dataset_information.csv**:<br>
* additional info on image sources

# Initial Overview of Files:

In [None]:
ls ../input/hubmap-kidney-segmentation

In [None]:
ls ../input/hubmap-kidney-segmentation/train

In [None]:
ls ../input/hubmap-kidney-segmentation/test

In [None]:
import os
traindir="../input/hubmap-kidney-segmentation/train/"
testdir="../input/hubmap-kidney-segmentation/test/"
train = os.listdir(traindir)
test = os.listdir(testdir)
print(f"Train files: {len(train)}. ---> {train[:3]}")
print(f"Test files :  {len(test)}. ---> {test[:3]}")

# Image dimensions

some seem to be (h, w, channels)<br>
others seem to be (1, 1, channels, h, w) <br>      
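
Both layouts can be brought to `(h, w, channels)` without copying the pixel data. The following is a sketch on a synthetic array; `to_hwc` is a hypothetical helper and assumes exactly three channels and an image larger than one pixel in each direction.

```python
import numpy as np

def to_hwc(image, channels=3):
    """Return the image as (h, w, channels); hypothetical helper."""
    # Drop all length-1 axes, e.g. (1, 1, 3, h, w) -> (3, h, w)
    image = np.squeeze(image)
    # If the channel axis is first, move it to the end
    if image.ndim == 3 and image.shape[0] == channels and image.shape[-1] != channels:
        image = np.moveaxis(image, 0, -1)
    return image

five_dim = np.arange(60).reshape(1, 1, 3, 4, 5)  # stand-in for a (1, 1, channels, h, w) file
print(to_hwc(five_dim).shape)                    # (4, 5, 3)
print(to_hwc(np.zeros((4, 5, 3))).shape)         # (4, 5, 3)
```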

In [None]:
import tifffile as tff
shapes=[]
for file in train:
    if '.tiff' in file:
        image = tff.imread(traindir+file)
        print(image.shape)
        shapes.append(image.shape)
#for file in test:
#    if '.tiff' in file:
#        image = tff.imread(testdir+file)
#        shapes.append(image.shape)

In [None]:
import tifffile as tff
import matplotlib.pyplot as plt
import numpy as np
import os
def get_imgs(folder_path):
    imgs=[]
    for file in os.listdir(folder_path):
        if '.tiff' in file:
            imgs.append(file)
    return imgs
def read_image(folder_path,img_name):
    image = tff.imread(os.path.join(folder_path,img_name))
    if image.shape[0]==1:
        #some files are stored as (1, 1, channels, h, w):
        #drop the singleton axes and move the channel axis to the end -> (h, w, channels)
        print('reshape from '+str(image.shape))
        h=image.shape[3]
        w=image.shape[4]
        image=np.reshape(image, (3, h, w))
        image=np.moveaxis(image, 0, -1)
        print('to '+str(image.shape))
    else:
        print('shape '+str(image.shape))
        print('no reshape')
    return image
def downscale(image,factor):
    #keep only every factor-th row and column (no interpolation);
    #copy so the full-size array can be freed once nothing else references it
    return image[::factor, ::factor].copy()
def show_image(image):
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    plt.show()

Load the list of .tiff image files

In [None]:
traindir='../input/hubmap-kidney-segmentation/train/'
imgs=get_imgs(traindir)
imgs

# Choose an image

In [None]:
IMG_NAME=imgs[3]

## choose downscaling factor (save RAM)

In [None]:
IMG_FACTOR=20
#enter a downscaling factor here if the image is too big for RAM or just to speed up loading

In [None]:
image=read_image(traindir,IMG_NAME)

In [None]:
ORIGINAL_SHAPE=image.shape
print(ORIGINAL_SHAPE)

## downscale image so it does not overload RAM

In [None]:
image=downscale(image,IMG_FACTOR)
#only necessary if downscaling image is wanted! -> otherwise adds unnecessary time
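
To give a feeling for the RAM savings: keeping every n-th row and column shrinks the array by roughly n² (about 400 times for a factor of 20). The sketch below checks this on a synthetic array; the array size is an arbitrary assumption and much smaller than the real images.

```python
import numpy as np

factor = 20
full = np.zeros((2000, 3000, 3), dtype=np.uint8)  # synthetic stand-in image
small = full[::factor, ::factor].copy()           # keep every 20th row and column

print(small.shape)                  # (100, 150, 3)
print(full.nbytes // small.nbytes)  # 400
```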

In [None]:
image.shape

In [None]:
show_image(image)

In [None]:
import pandas as pd
kidney_data=pd.read_csv("../input/hubmap-kidney-segmentation/train.csv")
print(kidney_data.shape)
kidney_data

# Utility Functions

In [None]:
import json
import numpy as np
import pandas as pd

def json_to_df(json_path):
   with open(json_path) as json_file:
       json_data = json.load(json_file)
   json_data_df = pd.json_normalize(json_data)
   return json_data_df

def PolyArea(x_list,y_list):
    #uses the shoelace formula to calculate polygon area from a set of cartesian coords
    #https://stackoverflow.com/questions/24467972/calculate-area-of-polygon-given-x-y-coordinates
    return 0.5*np.abs(np.dot(x_list,np.roll(y_list,1))-np.dot(y_list,np.roll(x_list,1)))

def json_to_coords(json_data_df):
    geom=json_data_df['geometry.coordinates']
    polygons=[]
    for x in geom:
        polygons.append(x[0])
    return polygons
def dimensions(polygon):
    x_list=[]
    y_list=[]
    for x,y in polygon:
        x_list.append(x)
        y_list.append(y)
    h=max(y_list)-min(y_list)
    w=max(x_list)-min(x_list)
    area=PolyArea(x_list,y_list)
    return h,w,area
def bbox(polygon,padding):
    """
    polygon: 2D array of 1D arrays of x and y coords
    padding: pixel buffer around polygon
    """
    x_list=[]
    y_list=[]
    for x,y in polygon:
        x_list.append(x)
        y_list.append(y)
    x1=int((min(x_list))-padding)
    x2=int((max(x_list))+padding)
    y1=int((min(y_list))-padding)
    y2=int((max(y_list))+padding)
    #image=image[y1:y2,x1:x2]
    #return image
    return x1,x2,y1,y2 #only return coords instead of img to save memory
def dim_list(polygons):
    h_list=[]
    w_list=[]
    area_list=[]
    for polygon in polygons:
        h,w,area=dimensions(polygon)
        h_list.append(h)
        w_list.append(w)
        area_list.append(area)
    combined_list=list(zip(h_list,w_list,area_list))
    result=pd.DataFrame(combined_list, columns=['height','width','area'])
    return result
def x_and_y(polygon,factor):
    x_list=[]
    y_list=[]
    for x,y in polygon:
        x_list.append(x/factor)
        y_list.append(y/factor)
    return x_list,y_list
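
A quick sanity check of the shoelace formula used in `PolyArea`, written as a self-contained sketch on a 4 x 3 rectangle. `shoelace_area` is a stand-in name for this example. The second call repeats the first vertex at the end, as closed GeoJSON-style rings usually do, to show that this does not change the result.

```python
import numpy as np

def shoelace_area(xs, ys):
    # Same formula as PolyArea: half the absolute difference of the two cross sums
    return 0.5 * np.abs(np.dot(xs, np.roll(ys, 1)) - np.dot(ys, np.roll(xs, 1)))

# Open ring: four corners of a 4 x 3 rectangle
print(shoelace_area([0, 4, 4, 0], [0, 0, 3, 3]))        # 12.0
# Closed ring: first vertex repeated at the end
print(shoelace_area([0, 4, 4, 0, 0], [0, 0, 3, 3, 0]))  # 12.0
```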


# analyzing data

In [None]:
json_path='../input/hubmap-kidney-segmentation/train/'+IMG_NAME.split('.')[0]+'.json'
json_data_df=json_to_df(json_path)
polygons=json_to_coords(json_data_df)
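#note: this rebinds the name `dimensions` from the function defined above to a DataFrame,
#so the utility-function cell must be re-run before calling dim_list() again
dimensions=dim_list(polygons)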

In [None]:
dimensions

# visualizing data

In [None]:
import seaborn as sns
sns.jointplot(x=dimensions['width'], y=dimensions['height'], kind="kde")

In [None]:
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (0.2, 1)})
mean=dimensions['area'].mean()
median=dimensions['area'].median()

sns.boxplot(x=dimensions["area"], ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')

#histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(x=dimensions["area"], kde=True, stat="density", ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--', label='Mean')
ax_hist.axvline(median, color='g', linestyle='-', label='Median')

#the labels set on the two vertical lines are picked up by the legend
ax_hist.legend()

ax_box.set(xlabel='')
plt.show()

In [None]:
from matplotlib import pyplot as plt
plt.figure(figsize=(10, 10))
plt.imshow(image)
for polygon in polygons:
    x_list,y_list=x_and_y(polygon,IMG_FACTOR)
    plt.fill(x_list,y_list)
plt.show()

## show a bbox image

In [None]:
image=read_image(traindir,IMG_NAME)
x1,x2,y1,y2=bbox(polygons[0],0)
imagepatch=image[y1:y2,x1:x2]
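
With a padding greater than 0, the box can extend past the image border. A negative start index would then be interpreted by NumPy as counting from the end of the axis, which gives an empty or wrong patch. A possible guard is sketched below; `clip_bbox` is a hypothetical helper and not part of the notebook.

```python
def clip_bbox(x1, x2, y1, y2, height, width):
    """Clamp bbox coords to the image bounds (hypothetical helper)."""
    return max(x1, 0), min(x2, width), max(y1, 0), min(y2, height)

# A padded box that sticks out on the left and at the bottom of an 80 x 100 image
print(clip_bbox(-5, 60, 10, 90, height=80, width=100))  # (0, 60, 10, 80)
```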

In [None]:
imagepatch.shape

In [None]:
show_image(imagepatch)

# New approach: use RLE data

In [None]:
import pandas as pd
train_path='../input/hubmap-kidney-segmentation/train.csv'
train_data=pd.read_csv(train_path)

In [None]:
train_data.head()

In [None]:
import pandas as pd
import numpy as np

def rle_decode(img_name,csv_path):
    #note: despite its name, this only looks up the RLE string of one image;
    #the actual decoding into a mask happens in rleToMask
    train_data=pd.read_csv(csv_path)
    img_id=img_name.split('.')[0]
    rle_data=train_data.loc[train_data['id'] == img_id, 'encoding']
    #an earlier attempt to scale the (pixel, run) pairs by 0.1 directly did not work and was removed
    return str(rle_data.iloc[0])

In [None]:
csv_path='../input/hubmap-kidney-segmentation/train.csv'
results=rle_decode(IMG_NAME,csv_path)

In [None]:
results

In [None]:
import numpy as np

def rleToMask(rleString,h,w):
    #https://www.kaggle.com/robertkag/rle-to-mask-converter
    numbers = [int(numstring) for numstring in rleString.split()]
    rledata = np.array(numbers).reshape(-1,2) #one (start pixel, run length) pair per row
    mask = np.zeros(h*w,dtype=np.uint8)
    for pixel,length in rledata:
        pixel -= 1 #RLE pixel ids are one-indexed
        mask[pixel:pixel+length] = 255
    #pixels are numbered column by column, so fill as (w,h) and transpose to (h,w)
    mask = (mask.reshape(w,h)).T
    return mask
def combine(mask,image):
    #zero the green and blue channels wherever the mask is set, so masked areas appear red;
    #modifies image in place; mask and image must have the same height and width
    selected = mask!=0
    image[selected,1]=0
    image[selected,2]=0
    return image
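
To make the column-wise pixel numbering concrete, here is a small self-contained sketch that decodes a short RLE string into a 2 x 3 mask. `rle_to_mask` is a stand-in name that uses the same reshape-and-transpose approach as `rleToMask`; the RLE string is invented for the example.

```python
import numpy as np

def rle_to_mask(rle_string, h, w):
    """Decode 'start length ...' pairs into an (h, w) mask; sketch for illustration."""
    pairs = np.array([int(n) for n in rle_string.split()]).reshape(-1, 2)
    flat = np.zeros(h * w, dtype=np.uint8)
    for start, length in pairs:
        flat[start - 1:start - 1 + length] = 255  # pixel ids are one-indexed
    # Pixels run down each column first, so reshape as (w, h) and transpose
    return flat.reshape(w, h).T

# Pixels 1 and 2 fill the first column, pixel 5 is the top of the third column
print(rle_to_mask("1 2 5 1", h=2, w=3))
# [[255   0 255]
#  [255   0   0]]
```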

In [None]:
mask=rleToMask(results,ORIGINAL_SHAPE[0],ORIGINAL_SHAPE[1])
maskpatch=combine(mask[y1:y2,x1:x2],image[y1:y2,x1:x2])

In [None]:
show_image(maskpatch)

In [None]:
mask=downscale(mask,IMG_FACTOR)
image=downscale(image,IMG_FACTOR)

In [None]:
image.shape

In [None]:
mask.shape

In [None]:
mask=combine(mask,image)

In [None]:
show_image(mask)

# Overview of all variables and functions and their assignments in this notebook 
-->Do not execute this! only meant as notes!

In [None]:
%%script echo skipping
#this is only a summary of variables and functions and should not be executed!


traindir="../input/hubmap-kidney-segmentation/train/"
testdir="../input/hubmap-kidney-segmentation/test/"
train = os.listdir(traindir)
test = os.listdir(testdir)
shapes=[] #original imageshapes
get_imgs(folder_path) #list of tiff files in a directory
read_image(folder_path,img_name) #reads image and reshapes the img array to (h,w,ch) if needed
downscale(image,factor) #downsamples image by keeping only every n-th row and column (n=factor)
show_image(image) #shows an image plot with fixed size (10,10)
imgs=get_imgs(traindir)
IMG_NAME=imgs[3]
IMG_FACTOR=20
image=read_image(traindir,IMG_NAME) #ORIGINAL
ORIGINAL_SHAPE=image.shape
image=downscale(image,IMG_FACTOR) #DOWNSCALE
kidney_data=pd.read_csv("../input/hubmap-kidney-segmentation/train.csv")
json_to_df(json_path) #returns json data in dataframe
PolyArea(x_list,y_list) #gives area of a polygon
json_to_coords(json_data_df) #gets only polygons from dataframe
dimensions(polygon) #gets h,w,and area of one polygon
bbox(polygon,padding) #gets xmin,xmax,ymin,ymax of one polygon
dim_list(polygons) #makes dataframe with h,w,area of each polygon
x_and_y(polygon,factor) #gets x and y coords of one polygon as lists
json_path= filepath of json of one img
json_data_df= json data of one img of json_path
polygons= coords of json_data_df
dimensions= dataframe with h,w,area of each polygon
mean=dimensions['area'].mean() #polygons only of current img
median=dimensions['area'].median() #polygons only of current img
image=read_image(traindir,IMG_NAME) #ORIGINAL
x1,x2,y1,y2=bbox(polygons[0],0) #edge coords of first polygon in list
imagepatch=image[y1:y2,x1:x2] #patch of bbox
train_path='../input/hubmap-kidney-segmentation/train.csv'
train_data=pd.read_csv(train_path)
rle_decode(img_name,csv_path) #returns rle_data of one img as string
csv_path='../input/hubmap-kidney-segmentation/train.csv'
results=rle_decode(IMG_NAME,csv_path)
rleToMask(rleString,h,w) #returns mask as numpy array with background=0 and object=255, array is size of image
combine(mask,image) #colors mask areas onto image, output is numpy array
mask= original size img mask as numpy array
maskpatch= only bbox section of combined mask+img
mask=downscale(mask,IMG_FACTOR)
image=downscale(image,IMG_FACTOR)
mask=combine(mask,image)

# NEXT STEP: train network
https://www.kaggle.com/philipjamessullivan/p-sullivan-2-train-network