# ***Disclaimer:*** 
Hello Kagglers! I am a Solution Architect with the Google Cloud Platform. I am a coach for this competition, and the focus of my contributions is on helping users leverage GCP components (GCS, TPUs, BigQuery, etc.) to solve large problems. My ideas and contributions represent my own opinion and are not an official recommendation by Google. Also, I try to develop notebooks quickly in order to help users early in competitions. There may be better ways to solve particular problems, and I welcome comments and suggestions. Use my contributions at your own risk: I don't guarantee that they will help win any competition, but I am hoping to learn by collaborating with everyone.


# Objective:

The objective of this notebook is to provide an example of how to transform the HubMAP Hacking the Kidney competition dataset into a form that can readily be used to train models leveraging accelerators. The images in this competition have very high resolution, averaging 30,000 x 30,000 pixels, and this presents a difficult challenge in memory management. It is just not possible to read them all into memory in the Kaggle environment, and it is also not possible to build a model using the whole image as input. This notebook provides some tips for reading the competition's images and masks, and proposes a strategy to deal with the large sizes. 
The strategy adopted in this Notebook is to cut the images into 512x512 tiles, and then transform the tiles into TFRecords so that we can later use them as input to train models using GPU or TPU accelerators. 

This Notebook takes a long time to run because it processes all the competition files, and the resulting compressed dataset is 18.1G, almost exceeding the Kaggle VM limit. I was able to process all files and then uploaded the results to a Kaggle dataset that I have made public:
--> [Link to the TFRecord Dataset Produced by this Notebook.](https://www.kaggle.com/marcosnovaes/hubmap-tfrecord-512)

I have also developed a Notebook that explains how to use the TFRecord Dataset: [https://www.kaggle.com/marcosnovaes/hubmap-looking-at-tfrecords/](https://www.kaggle.com/marcosnovaes/hubmap-looking-at-tfrecords/)

If you want to use the dataset without change you don't need to run the Notebook -- but do read through it because it provides a lot of insight on how to read the images and masks and convert them to TFRecords. I will be using this dataset in my subsequent notebooks. You can also easily customize this Notebook if you want to produce tiles of different sizes (I used 512x512) or if you want to include more metadata for each tile. 

# Reading the Images
Some of the images are in TIFF format, some are in BigTIFF. I used the tifffile library and it seems to read the images with no problem.

In [None]:
pip install tifffile

Libs used in this Notebook

In [None]:
%matplotlib inline

import cv2
import json
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

#from imread import imread, imsave

import shutil

import tensorflow as tf

import glob
import tifffile
import gc

Let's look at the input data. Find and read the competition train.csv file.

In [None]:
!ls /kaggle/input/

In [None]:
!ls -l /kaggle/input/hubmap-kidney-segmentation/

In [None]:
basepath = '/kaggle/input/hubmap-kidney-segmentation/'
train_df = pd.read_csv(basepath + "train.csv")
train_df.head()

The id corresponds to the images provided. For each image, you are provided a .tiff image file and the Run Length Encoding Mask. 

But notice that the masks are also provided as a .json file with polygon definitions. I used this option instead in this Notebook. 
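
For readers who would rather work from the RLE column, here is a minimal sketch of a decoder. It assumes the usual Kaggle convention of 1-indexed `start length` pairs with pixels numbered top to bottom, then left to right (column-major); please check that against the competition's data description. `rle_decode` is a hypothetical helper name and is not used elsewhere in this Notebook.

```python
import numpy as np

def rle_decode(rle, shape):
    """Decode 'start length start length ...' into a boolean mask of shape (height, width)."""
    height, width = shape
    values = np.array(rle.split(), dtype=np.int64)
    starts = values[0::2] - 1          # assumption: starts are 1-indexed in the file
    lengths = values[1::2]
    flat = np.zeros(height * width, dtype=bool)
    for start, length in zip(starts, lengths):
        flat[start:start + length] = True
    # assumption: column-major numbering (down each column first, then to the right)
    return flat.reshape((height, width), order="F")

# toy example: a 3x4 mask with one run of 3 pixels starting at pixel 2
demo_mask = rle_decode("2 3", (3, 4))
print(demo_mask.astype(int))
```

A full-size mask decoded this way is a bool array with one byte per pixel, the same as the mask built from the polygons later in this Notebook.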

In [None]:
train_df.shape

In [None]:
!ls -l /kaggle/input/hubmap-kidney-segmentation/train

The next cell reads all tiffs and prints their shapes. It turns out that some TIFF images are channel first (number "3" first) and others channel last (number "3" last). When reading them, we must check if the "3" is first and swap the axes as needed. This loop will take a long time, as each image is read in full just so we can tell its shape. 

In [None]:
# verify that we can read all images
def verify_read(file_list):
    for file_name in file_list:
        baseimage = tifffile.imread(file_name)
        #baseimage = tif.series[0].asarray()
        print('img id = {}, shape = {}'.format(file_name,baseimage.shape))
        gc.collect()
        
file_list = glob.glob('/kaggle/input/hubmap-kidney-segmentation/train/*.tiff')
verify_read(file_list)


# Reading and Showing a sample image and mask
The next cells show the code that can read the first image in the csv file. 

IMPORTANT: Note that in the case of a "channel first" TIFF (number "3" first) we need to swap the axis of the numpy array as noted below. You CANNOT use "reshape" instead, that will scramble the channels.
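
To see why "reshape" is not a substitute for swapping axes, here is a small sketch on a toy channel-first array. The array is made up for illustration; `np.moveaxis` is used as a one-call equivalent of the two `swapaxes` calls.

```python
import numpy as np

# A tiny channel-first "image": 3 channels, 2 rows, 2 columns.
# Channel 0 is all 0s, channel 1 all 1s, channel 2 all 2s.
chw = np.stack([np.full((2, 2), c, dtype=np.uint8) for c in range(3)])

moved = np.moveaxis(chw, 0, -1)    # (3, 2, 2) -> (2, 2, 3): axes are permuted
reshaped = chw.reshape(2, 2, 3)    # same target shape, but the flat memory is just re-cut

print(moved[0, 0])      # one value from each channel for pixel (0, 0)
print(reshaped[0, 0])   # three values that all come from channel 0
```

Both results have shape (2, 2, 3), which is why the mistake is easy to miss: only the permuted version keeps each pixel's three channel values together.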

In [None]:
#select an image to investigate
working_image_index = 0
working_image_id = train_df['id'][working_image_index]
working_image_id
working_image_path = '/kaggle/input/hubmap-kidney-segmentation/train/'+working_image_id+'.tiff'

Here is the code that takes care of the difference in shapes. 
IMPORTANT: Notice that you need to use the numpy.swapaxes function to change the shape, using "reshape" will scramble the channels.

In [None]:

baseimage = tifffile.imread(working_image_path)
print ('original image shape',baseimage.shape)
baseimage = np.squeeze(baseimage)
if( baseimage.shape[0] == 3):
    baseimage = baseimage.swapaxes(0,1)
    baseimage = baseimage.swapaxes(1,2)
    print ('swapped shape',baseimage.shape)

plt.figure()

plt.imshow(baseimage)

The masks are provided in the csv files in RLE format, but we are also provided json files that describe the mask as polygons. I will be using the json files:

In [None]:
# read json mask
working_image_json_mask = '/kaggle/input/hubmap-kidney-segmentation/train/'+working_image_id+'.json'

# the context manager closes the file once the JSON is loaded
with open(working_image_json_mask, "r") as read_file:
    mask_data = json.load(read_file)
mask_data[0]


The following function converts the polygons into a numpy boolean mask with the same shape as the image.

In [None]:
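def read_mask(mask_file, mask_shape):
    # load the list of polygon annotations
    with open(mask_file, "r") as read_file:
        mask_data = json.load(read_file)
    polys = []
    for annotation in mask_data:
        # cv2.fillPoly expects int32 point arrays
        geom = np.array(annotation['geometry']['coordinates'], dtype=np.int32)
        polys.append(geom)

    # uint8 keeps the full-size canvas at 1 byte per pixel (np.zeros defaults to float64)
    mask = np.zeros(mask_shape, dtype=np.uint8)
    cv2.fillPoly(mask, polys, 1)
    mask = mask.astype(bool)
    return mask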
   

In [None]:
mask_shape = (baseimage.shape[0], baseimage.shape[1])
mask = read_mask(working_image_json_mask, mask_shape)
plt.imshow(mask)

In [None]:
baseimage.dtype

In [None]:
mask.dtype

So, now we know how to read each image and mask, and that their types are uint8 and bool respectively. But they have very large dimensions, and we would not be able to train an ML model at these dimensions. So, in the next section I take the approach of tiling up the image and working with tiles.
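
A back-of-the-envelope sketch of the memory involved, assuming a 30,000 x 30,000 image (the average quoted in the Objective; actual images vary). `array_gib` is a hypothetical helper used only for this estimate.

```python
def array_gib(height, width, channels, bytes_per_value):
    """Size of a dense array in GiB."""
    return height * width * channels * bytes_per_value / 2**30

side = 30_000  # assumed side length, taken from the average quoted in the Objective
image_gib = array_gib(side, side, 3, 1)        # uint8 RGB image
bool_mask_gib = array_gib(side, side, 1, 1)    # numpy bool mask, 1 byte per pixel
float_mask_gib = array_gib(side, side, 1, 8)   # float64 array of the same shape

print(f"image: {image_gib:.2f} GiB, bool mask: {bool_mask_gib:.2f} GiB, "
      f"float64 mask: {float_mask_gib:.2f} GiB")
```

The float64 figure matters because `np.zeros` creates float64 arrays unless a dtype is given, so an intermediate canvas can cost several times more than the final bool mask.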

# Tiling the Large Images into 512x512 tiles
Here are some useful functions that use the numpy slicing capability to select specific tiles of the image. 

NOTE: The numpy arrays have dimensions [height, width, channels]. This Notebook will tile the image using offsets for the height index and width index. In the helper functions below, the first (height) index is passed as `tile_col_pos` and the second (width) index as `tile_row_pos`. So:
- a Tile with coordinate [0,0] represents the first tile on the top left corner.  
- a Tile with coordinate [1,0] represents a tile with height offset = 1*Tile Size, in this case it starts at numpy coordinates [512,0], which means it is the tile below [0,0]
- a Tile with coordinate [0,1] represents a tile with width offset = 1*Tile Size, in this case it starts at numpy coordinates [0,512], which means it is the tile to the right of [0,0]
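
The tile coordinates above can be sketched as plain index arithmetic. `tile_slices` and `grid_shape` are hypothetical helper names introduced only for this illustration; they use the [height, width] ordering described in the list.

```python
def tile_slices(tile_row, tile_col, tile_size=512):
    """Return (height_slice, width_slice) for the tile at [tile_row, tile_col]."""
    row_start = tile_row * tile_size
    col_start = tile_col * tile_size
    return (slice(row_start, row_start + tile_size),
            slice(col_start, col_start + tile_size))

def grid_shape(height, width, tile_size=512):
    """Number of complete tiles along each axis; partial edge tiles are not counted."""
    return height // tile_size, width // tile_size

print(tile_slices(1, 0))            # the tile below [0, 0]
print(grid_shape(30_000, 30_000))   # grid for an assumed 30,000 x 30,000 image
```

With a numpy image `img`, the tile would then be `img[tile_slices(1, 0)]`.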

In [None]:
#explore a few tiles
def show_tile_and_mask(baseimage, mask, tile_size, tile_col_pos, tile_row_pos):
    start_col = tile_col_pos*tile_size
    end_col = start_col + tile_size
    start_row = tile_row_pos * tile_size
    end_row = start_row + tile_size
    tile_image = baseimage[start_col:end_col, start_row:end_row,:]
    tile_mask = mask[start_col:end_col, start_row:end_row]
    fig, ax = plt.subplots(1,2,figsize=(20,3))
    ax[0].imshow(tile_image)
    ax[1].imshow(tile_mask)
    
def get_tile(baseimage, tile_size, tile_col_pos, tile_row_pos):
    start_col = tile_col_pos*tile_size
    end_col = start_col + tile_size
    start_row = tile_row_pos * tile_size
    end_row = start_row + tile_size
    tile_image = baseimage[start_col:end_col, start_row:end_row,:]
    return tile_image

def get_tile_mask(baseimage, tile_size, tile_col_pos, tile_row_pos):
    start_col = tile_col_pos*tile_size
    end_col = start_col + tile_size
    start_row = tile_row_pos * tile_size
    end_row = start_row + tile_size
    tile_image = baseimage[start_col:end_col, start_row:end_row]
    return tile_image

def show_tile_dist(tile):
    fig, ax = plt.subplots(1,2,figsize=(20,3))
    #ax[0].set_title("Tile ID = {} Xpos = {} Ypos = {}".format(img_mtd['tile_id'], img_mtd['tile_col_pos'],img_mtd['tile_row_pos']))
    ax[0].imshow(tile)
    ax[1].set_title("Pixelarray distribution");
    # histplot replaces the deprecated distplot in recent seaborn versions
    sns.histplot(tile.flatten(), ax=ax[1]);

As can be seen in the sample image displayed, there is a black border and then a lot of white surrounding the tissue. If we select [0,0] we expect to see a black tile. If we move a little to the right and down, we are then in the white zone. So let's try the values [0,0] and [5,5]; we should see a black and a white tile respectively.

In [None]:
tile_size = 512
tile = get_tile(baseimage, tile_size, 0, 0)
show_tile_dist(tile)

Black as predicted. As we explore the tiles, I also calculate the tile histogram, which turns out to be a useful way to filter black and white tiles later. The numpy.histogram function divides the range of pixel values into 10 bins and shows how many pixels fall within each bin. Because it is called here without an explicit range, the bins span each tile's own minimum and maximum rather than a fixed 0 to 255. As a result, both black and white tiles end up concentrated in the upper bins: black tiles have 0 pixels in the first four bins, while "white" (actually "dirty gray") tiles have only about 20 pixels there. Tiles with some actual tissue have a more even distribution. Let's call the sum of the first four bins "lowpass energy". It turns out that selecting lowpass energy > 100 reliably gives tiles with actual tissue, so we can later discard anything below 100. 
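
A minimal sketch of this metric on synthetic tiles (made-up arrays, not competition data). `lowband_density` is a hypothetical helper that wraps the same two calls used in the cells of this Notebook.

```python
import numpy as np

def lowband_density(tile):
    """Sum of the first four of the ten default histogram bins."""
    counts, _ = np.histogram(tile)   # bins span the tile's own min..max
    return int(np.sum(counts[0:4]))

black = np.zeros((64, 64, 3), dtype=np.uint8)             # constant tile
ramp = np.tile(np.arange(256, dtype=np.uint8), (64, 1))   # values spread evenly over 0..255

print(lowband_density(black), lowband_density(ramp))
```

The constant tile scores zero because, when all values are equal, numpy centers the bin range on that value, so nothing lands in the first four bins. The evenly spread tile puts a large share of its pixels there.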

In [None]:
img_hist = np.histogram(tile)
print('histogram = {}'.format(img_hist[0]))
print('histogram_lowpass = {}'.format(np.sum(img_hist[0][0:4])))

And here is the white one ([5,5]):

In [None]:
tile = get_tile(baseimage, tile_size, 5, 5)
show_tile_dist(tile)

In [None]:
img_hist = np.histogram(tile)
print('histogram = {}'.format(img_hist[0]))
print('histogram_lowpass = {}'.format(np.sum(img_hist[0][0:4])))

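Now let's try to find a glomerulus. If we look back at the polygon dump above, it shows that the first glom starts at pixel [10503, 4384]. The polygon coordinates are given as [x, y], while the numpy array is indexed as [height, width], so after dividing both by 512 (10503 // 512 = 20 and 4384 // 512 = 8) we swap the order and expect to find a glom in tile [8,20].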
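
The mapping from a polygon vertex to a tile index can be written as a one-liner. This is a sketch that assumes the JSON coordinates are in (x, y) order, which is what `cv2.fillPoly` expects; `tile_index_for_vertex` is a hypothetical helper name.

```python
def tile_index_for_vertex(x, y, tile_size=512):
    """Map a polygon vertex (x, y) to the [height, width] tile index used in this Notebook."""
    return y // tile_size, x // tile_size

print(tile_index_for_vertex(10503, 4384))
```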

In [None]:
tile = get_tile(baseimage, tile_size, 8, 20)
show_tile_dist(tile)

In [None]:
img_hist = np.histogram(tile)
print('histogram = {}'.format(img_hist[0]))
print('histogram_lowpass = {}'.format(np.sum(img_hist[0][0:4])))

In [None]:
show_tile_and_mask(baseimage, mask, tile_size, 8, 20)

Bingo!!! We found our first glom. Let's now derive a metric for masks, so that in the future we can easily find tiles with gloms. This metric will be used when we want to filter the training dataset to make sure it includes a certain number of tiles with gloms. Simply counting the number of "TRUE" pixels in the mask is a great metric that indicates whether the tile contains a glom.
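
The same count can also be computed for every complete tile at once. This is a minimal sketch on a toy mask, not the approach used by the rest of this Notebook (which counts tile by tile); `mask_density_grid` is a hypothetical helper name.

```python
import numpy as np

def mask_density_grid(mask, tile_size):
    """Count True pixels per complete tile; result has shape (tile_rows, tile_cols)."""
    rows, cols = mask.shape[0] // tile_size, mask.shape[1] // tile_size
    cropped = mask[:rows * tile_size, :cols * tile_size]      # drop partial edge tiles
    blocks = cropped.reshape(rows, tile_size, cols, tile_size)
    return blocks.sum(axis=(1, 3))

toy_mask = np.zeros((8, 12), dtype=bool)
toy_mask[2:6, 3:5] = True            # a small blob that straddles four tiles
density = mask_density_grid(toy_mask, tile_size=4)
print(density)
```

Entry [i, j] of the result is the mask density of tile [i, j] in the [height, width] ordering used throughout.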

In [None]:
tile_mask = get_tile_mask(mask, tile_size, 8, 20)
mask_density = np.count_nonzero(tile_mask)
mask_density

Now let's move down the image by incrementing the height offset to [9,20], which should be the tile below [8,20]

In [None]:
show_tile_and_mask(baseimage, mask, tile_size, 9, 20)

In [None]:
tile_mask = get_tile_mask(mask, tile_size, 9, 20)
mask_density = np.count_nonzero(tile_mask)
mask_density

So, the glom ends in that tile, and there are fewer TRUE pixels. Going further down we find a cortex tile with no gloms.

In [None]:
show_tile_and_mask(baseimage, mask, tile_size, 10, 20)

In [None]:
tile_mask = get_tile_mask(mask, tile_size, 10, 20)
mask_density = np.count_nonzero(tile_mask)
mask_density

# Transforming the Tiles into a TFRecord Dataset
We are now ready to read all the images (one at a time or we will run out of memory!) and write each tile to a TFRecord file. Kaggle has a limit of 50 upper level directories, so we will create one dir for each image. We will also build a pandas dataframe that has the metadata for each tile, including the lowpass energy and mask density metrics that we derived above. 

Using the TFRecord format for storing data should be easy, but unfortunately it requires data serialization, which complicates it a little bit. This is done using [protocol buffers](https://developers.google.com/protocol-buffers/), which come with a bit of a learning curve. But in ML you only need to understand the [TFExample](https://www.tensorflow.org/api_docs/python/tf/train/Example) format. This template is explained in detail in [this tutorial](https://www.tensorflow.org/tutorials/load_data/tfrecord), but you don't need to read all of it: in this Notebook I provide a little template specific to image data that can be quickly customized for other types of data.

For serialization using TFExample, every value has to fit into one of three types:
* bytes_feature (`tf.train.BytesList`)
* float_feature (`tf.train.FloatList`)
* int_64_feature (`tf.train.Int64List`)

In this Notebook the image and mask are passed as bytes_features and the other metadata as int_64 features. 
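
Passing an array as bytes drops its shape and dtype, which is why the record also needs to carry the height, width and number of channels. A small numpy-only sketch of the round trip on a toy tile (no TensorFlow involved):

```python
import numpy as np

tile = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)   # toy (height, width, channels) tile
payload = tile.tobytes()                                        # raw bytes, no shape or dtype attached

flat = np.frombuffer(payload, dtype=np.uint8)   # the reader has to know the dtype
restored = flat.reshape(tile.shape)             # and the shape

print(len(payload), restored.shape)
```

The reader side of this Notebook does the equivalent with `tf.io.decode_raw` followed by `tf.reshape`.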

In [None]:
# Utilities to serialize data into a TFRecord
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


In [None]:
def image_example(image_index, image, mask, tile_id, tile_col_pos, tile_row_pos):
    image_shape = image.shape
    
    # tobytes() replaces the deprecated ndarray.tostring()
    img_bytes = image.tobytes()

    mask_bytes = mask.tobytes()
    
    feature = {
        'img_index': _int64_feature(image_index),
        'height': _int64_feature(image_shape[0]),
        'width': _int64_feature(image_shape[1]),
        'num_channels': _int64_feature(image_shape[2]),
        'img_bytes': _bytes_feature(img_bytes),
        'mask' : _bytes_feature(mask_bytes),
        'tile_id':  _int64_feature(tile_id),
        'tile_col_pos': _int64_feature(tile_col_pos),
        'tile_row_pos': _int64_feature(tile_row_pos),
        
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

This function writes a tile to storage. Notice the GZIP compression: this makes it possible for all the tiles to be stored locally without exceeding the disk allowance of the Kaggle machine.

In [None]:
def create_tfrecord( image_index, image, mask, tile_id, tile_col_pos, tile_row_pos, output_path):
    opts = tf.io.TFRecordOptions(compression_type="GZIP")
    # the context manager closes the writer on exit, so no explicit close() is needed
    with tf.io.TFRecordWriter(output_path, opts) as writer:
        tf_example = image_example(image_index, image, mask, tile_id, tile_col_pos, tile_row_pos)
        writer.write(tf_example.SerializeToString())
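
For a rough sense of why GZIP helps so much here, consider that many tiles are nearly constant background. The sketch below uses Python's standard `gzip` module as a stand-in (it is not the TFRecord writer) and synthetic bytes rather than real tiles.

```python
import gzip
import os

tile_bytes = 512 * 512 * 3          # one uint8 RGB tile
uniform = bytes(tile_bytes)         # stand-in for a constant background tile
noisy = os.urandom(tile_bytes)      # stand-in for the worst case: incompressible data

uniform_gz = gzip.compress(uniform)
noisy_gz = gzip.compress(noisy)
print(len(uniform), len(uniform_gz), len(noisy_gz))
```

Real tissue tiles presumably fall somewhere between these two extremes, so the overall saving depends on how much of each image is background.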

Here is the function that takes an image, slices into tiles, calculates tile metadata and commits to storage. It also builds a pandas dataframe with the metadata for all tiles. 

In [None]:
def write_tfrecord_tiles( image_index, image_id, image, mask, tile_size, output_path ):
    output_dir = output_path+image_id
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.mkdir(output_dir)
    
    # NOTE: axis 0 of the array (height) is called "col" here and axis 1 (width) "row";
    # the names are kept because the output file names depend on them
    image_cols = image.shape[0]
    image_rows = image.shape[1]
    # integer division: partial tiles at the bottom and right edges are dropped
    tile_cols = image_cols // tile_size
    tile_rows = image_rows // tile_size
    tileID = 0
    
    # collect one metadata dict per tile; the dataframe is built once at the end
    # (DataFrame.append was removed in pandas 2.0)
    tile_records = []
    
    # create one directory for each "col" offset
    for col_number in range(tile_cols):
        print('col_offset{} '.format(col_number),end='')
        dir_path = output_dir+'/col{}'.format(col_number)
        # create directory
        if os.path.exists(dir_path):
            shutil.rmtree(dir_path)
        os.mkdir(dir_path)
        for row_number in range(tile_rows):
            #print("row{}".format(row_number),end='')
            dataset_file_path = dir_path+'/col{}_row{}.tfrecords'.format(col_number,row_number)
            # NOTE: this relative path omits the col{N} directory that dataset_file_path includes
            relative_path = image_id+'/col{}_row{}.tfrecords'.format(col_number,row_number)
            lower_col_range = col_number * tile_size
            higher_col_range = lower_col_range + tile_size
            lower_row_range = row_number * tile_size
            higher_row_range = lower_row_range + tile_size
            image_tile = image[lower_col_range:higher_col_range, lower_row_range:higher_row_range, :]
            tile_mask = mask[lower_col_range:higher_col_range, lower_row_range:higher_row_range]
            create_tfrecord( image_index, image_tile, tile_mask, tileID, col_number, row_number, dataset_file_path)
            # populate the metadata for this tile
            img_hist = np.histogram(image_tile)
            lowband_density = np.sum(img_hist[0][0:4])
            mask_density = np.count_nonzero(tile_mask)
            tile_records.append({'img_index':image_index, 'img_id':image_id, 'tile_id': tileID, 'tile_rel_path':relative_path, 
                           'tile_col_num':col_number, 'tile_row_num':row_number,'lowband_density':lowband_density, 'mask_density':mask_density})
            tileID += 1
    tile_df = pd.DataFrame(tile_records, columns = ['img_index', 'img_id','tile_id', 'tile_rel_path','tile_col_num', 'tile_row_num', 'lowband_density', 'mask_density'])
    return tile_df

This inline code will read each image and mask in the train set, swap axes when needed, load the image and mask into numpy arrays, and then invoke the above function for each image/mask pair. This will take a long time...

In [None]:
output_dir = '/kaggle/working/train/'
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
os.mkdir(output_dir)

tile_size = 512
num_images = train_df.shape[0]
for image_index in range(num_images):
    image_id = train_df['id'][image_index]
    image_path = '/kaggle/input/hubmap-kidney-segmentation/train/'+image_id+'.tiff'
    image_json_mask = '/kaggle/input/hubmap-kidney-segmentation/train/'+image_id+'.json'

    baseimage = tifffile.imread(image_path)
    print ('original image shape',baseimage.shape)
    baseimage = np.squeeze(baseimage)
    if( baseimage.shape[0] == 3):
        baseimage = baseimage.swapaxes(0,1)
        baseimage = baseimage.swapaxes(1,2)
        print ('swapped shape',baseimage.shape)
    
    # read json mask
    mask_shape = (baseimage.shape[0], baseimage.shape[1])
    mask = read_mask(image_json_mask, mask_shape)
    
    print('writing tiles for image {}'.format(image_id))
    tile_df = write_tfrecord_tiles( image_index, image_id, baseimage, mask, tile_size, output_dir )
    
    #write the dataframe
    print('writing tile metadata for image {}'.format(image_id))
    df_path = output_dir+image_id+'_tiles.csv'
    tile_df.to_csv(df_path)
    gc.collect()

Verify all train images were written to file.

In [None]:
!ls /kaggle/working/train

In the same fashion, the code below scans the test directory and converts all the images there too. The test images will be needed for evaluating the loss during training. 

WARNING: I ran out of memory trying to write both train and test images in one session. To build the complete dataset I built one at a time.
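
When images and masks are collected with two separate `glob` calls, pairing them by position depends on both lists coming back in the same order, and `glob` does not promise any particular order. Here is a sketch of pairing by file stem instead. The file names are hypothetical, `pair_by_stem` is a hypothetical helper, and it assumes the image and its mask share the same base name.

```python
from pathlib import Path

def pair_by_stem(image_paths, mask_paths):
    """Return {image_id: (image_path, mask_path)} for ids present in both lists."""
    masks = {Path(p).stem: p for p in mask_paths}
    return {Path(p).stem: (p, masks[Path(p).stem])
            for p in sorted(image_paths) if Path(p).stem in masks}

# Hypothetical file names, deliberately listed in mismatched order
images = ["/data/test/bbb.tiff", "/data/test/aaa.tiff"]
masks = ["/data/test/aaa.json", "/data/test/bbb.json"]
pairs = pair_by_stem(images, masks)
print(pairs["aaa"])
```

Images without a matching mask are skipped rather than paired with the wrong file.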

In [None]:
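#convert all the test images
# sort both lists so that image i and mask i are paired by file name, not by glob's arbitrary order
# (this assumes exactly one .json per .tiff in the test directory)
img_file_list = sorted(glob.glob('/kaggle/input/hubmap-kidney-segmentation/test/*.tiff'))
mask_file_list = sorted(glob.glob('/kaggle/input/hubmap-kidney-segmentation/test/*.json'))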

output_dir = '/kaggle/working/test/'
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
os.mkdir(output_dir)

tile_size = 512
num_images = img_file_list.__len__()
for image_index in range(num_images):
    file_name = img_file_list[image_index]
    prefix = file_name.split('.')
    parts = prefix[0].split('/')
    image_id = parts[-1]
    image_path = img_file_list[image_index]
    image_json_mask = mask_file_list[image_index]

    baseimage = tifffile.imread(image_path)
    print ('original image shape',baseimage.shape)
    baseimage = np.squeeze(baseimage)
    if( baseimage.shape[0] == 3):
        baseimage = baseimage.swapaxes(0,1)
        baseimage = baseimage.swapaxes(1,2)
        print ('swapped shape',baseimage.shape)
    
    # read json mask
    mask_shape = (baseimage.shape[0], baseimage.shape[1])
    mask = read_mask(image_json_mask, mask_shape)
    
    print('writing tiles for image {}'.format(image_id))
    tile_df = write_tfrecord_tiles( image_index, image_id, baseimage, mask, tile_size, output_dir )
    
    #write the dataframe
    print('writing tile metadata for image {}'.format(image_id))
    df_path = output_dir+image_id+'_tiles.csv'
    tile_df.to_csv(df_path)
    gc.collect()

Verify all images were written

In [None]:
!ls /kaggle/working/test

# Verifying the TFRecord correctness
Let's now read back a sample tile to verify correctness. We introduce here the read TFRecord functions that follow the model described in [this tutorial](https://www.tensorflow.org/tutorials/load_data/tfrecord)

In [None]:
# read back a record to make sure the decoding works
# Create a dictionary describing the features.
image_feature_description = {
    'img_index': tf.io.FixedLenFeature([], tf.int64),
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'num_channels': tf.io.FixedLenFeature([], tf.int64),
    'img_bytes': tf.io.FixedLenFeature([], tf.string),
    'mask': tf.io.FixedLenFeature([], tf.string),
    'tile_id': tf.io.FixedLenFeature([], tf.int64),
    'tile_col_pos': tf.io.FixedLenFeature([], tf.int64),
    'tile_row_pos': tf.io.FixedLenFeature([], tf.int64),
}

def _parse_image_function(example_proto):
  # Parse the input tf.Example proto using the dictionary above.
    single_example = tf.io.parse_single_example(example_proto, image_feature_description)
    img_index = single_example['img_index']
    img_height = single_example['height']
    img_width = single_example['width']
    num_channels = single_example['num_channels']
    
    img_bytes =  tf.io.decode_raw(single_example['img_bytes'],out_type='uint8')
   
    img_array = tf.reshape( img_bytes, (img_height, img_width, num_channels))
   
    mask_bytes =  tf.io.decode_raw(single_example['mask'],out_type='bool')
    
    mask = tf.reshape(mask_bytes, (img_height,img_width))
    mtd = dict()
    mtd['img_index'] = single_example['img_index']
    mtd['width'] = single_example['width']
    mtd['height'] = single_example['height']
    mtd['tile_id'] = single_example['tile_id']
    mtd['tile_col_pos'] = single_example['tile_col_pos']
    mtd['tile_row_pos'] = single_example['tile_row_pos']
    struct = {
        'img_array': img_array,
        'mask': mask,
        'mtd': mtd
    } 
    return struct

def read_tf_dataset(storage_file_path):
    encoded_image_dataset = tf.data.TFRecordDataset(storage_file_path, compression_type="GZIP")
    parsed_image_dataset = encoded_image_dataset.map(_parse_image_function)
    return parsed_image_dataset


Let's read the tile with col offset 8 and row offset 20 (i.e. [8,20]) which we previously noticed to contain a glom.

In [None]:
working_image_index = 0
working_image_id = train_df['id'][working_image_index]
working_image_id
ds_path = '/kaggle/working/train/'+working_image_id+'/col8/col8_row20.tfrecords'
ds = read_tf_dataset(ds_path)

for struct in ds.as_numpy_iterator():
    #struct = g_dataset.get_next()
    img_mtd = struct["mtd"]
    img_array  = struct["img_array"]
    img_mask = struct["mask"]
 
    fig, ax = plt.subplots(1,2,figsize=(20,3))
    ax[0].set_title("Tile ID = {} Xpos = {} Ypos = {}".format(img_mtd['tile_id'], img_mtd['tile_col_pos'],img_mtd['tile_row_pos']))
    ax[0].imshow(img_array)
    #ax[1].set_title("Pixelarray distribution");
    #sns.distplot(img_array.flatten(), ax=ax[1]);
    ax[1].imshow(img_mask)

Voila!! Let's also verify that the tile metadata file is correct.

In [None]:
!ls -l /kaggle/working/train/

In [None]:
working_image_index = 0
working_image_id = train_df['id'][working_image_index]
data_path = '/kaggle/working/train/'+working_image_id+'_tiles.csv'
tiles_df = pd.read_csv(data_path)
tiles_df.head()
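
With the metadata in hand, the two metrics derived earlier can be used to pick training tiles. A minimal sketch using the standard `csv` module on made-up rows: the numbers are invented for illustration, and only the column names and the threshold of 100 come from this Notebook.

```python
import csv
import io

# Made-up rows using a subset of the columns of the per-image tiles CSV
sample_csv = """tile_id,tile_col_num,tile_row_num,lowband_density,mask_density
0,0,0,0,0
1,5,5,20,0
2,8,20,250000,41000
3,10,20,300000,0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
tissue = [r for r in rows if int(r["lowband_density"]) > 100]   # drop black and white tiles
gloms = [r for r in tissue if int(r["mask_density"]) > 0]       # keep tiles that contain a glom
print([r["tile_id"] for r in tissue], [r["tile_id"] for r in gloms])
```

The same two conditions can be applied to `tiles_df` directly when working with pandas.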