# Lavin versions
* Tile Images, Save in TFRecord v. 14:  Try skipping blank (glom-less) tiles for new 15-image train DataSet
* Tile Images, Save in TFRecord v. 11:  Get mask data from "train.csv", not from "<imageid>.json" files
*  v.21, include "pixel_overlap" in output TFRecords
* v.1, after cloning v29 for regular (non-Normalized) images

In [None]:
! ls -al /kaggle/input/hubmap-kidney-segmentation/*

# ***Pre-Disclaimer:*** 
This notebook was adapted from the excellent one by Marcos Novaes https://www.kaggle.com/marcosnovaes/hubmap-read-data-and-build-tfrecords; changes:
* introduction of overlapping tiles.
* write all tiles for a single original image into a single TFrecord file.
* The notebook contains exploratory material that can be disabled by setting the variable EDA (Exploratory Data Analysis) to False.
* If P['DO_NORMALIZE'] is True, normalize each image by subtracting its 3-channel mean and (sort of) dividing by its 3-channel stddev.
* If P['SKIP_BLANK_TILES'] is True, don't output tiles for which there are NO glom markers in the hopes of saving space and training time

In [None]:
EDA = False

# Run parameters (after Wojtek's convention)

P = {}
P[ 'DO_NORMALIZE'] = False   # Don't normalize input images to standard statistics
P[ 'TILE_SIZE' ] = 512   # Nominal Image diameter in pixels, NOT including overlap
P[ 'TILE_OVERLAP' ] = 64  # Overlap between adjacent tiles in pixels
P[ 'SKIP_BLANK_TILES' ] = True  # Ignore tiles with no glom markers

print( "Tile Images, Save in TFRecord Files" )
print( "parameters:" )
for p in P:
    print( f"{p}: {P[p]}" )
    

# Objective:

The objective of this notebook is to provide an example of how to transform the HuBMAP Hacking the Kidney competition dataset into a form that can readily be used to train models on accelerators. The images in this competition have very high resolution, averaging 30,000 x 30,000 pixels, which presents a difficult memory-management challenge. It is not possible to hold them all in memory in the Kaggle environment, nor to build a model that takes a whole image as input. This notebook provides some tips for reading the competition's images and masks, and proposes a strategy for dealing with their size. 
The strategy adopted in this Notebook is to cut the images into overlapping tiles and then convert the tiles into TFRecords, so that we can later use them as input to train models on GPU or TPU accelerators. 

This Notebook takes a long time to run because it processes all the competition files, and the resulting compressed dataset is 18.1G, almost exceeding the Kaggle VM limit. I was able to process all files and then uploaded the results to a Kaggle dataset that I have made public:
--> [Link to the TFRecord Dataset Produced by this Notebook.](https://www.kaggle.com/marcosnovaes/hubmap-tfrecord-512)

I have also developed a Notebook that explains how to use the TFRecord Dataset: [https://www.kaggle.com/marcosnovaes/hubmap-looking-at-tfrecords/](https://www.kaggle.com/marcosnovaes/hubmap-looking-at-tfrecords/)

If you want to use the dataset without change you don't need to run the Notebook -- but do read through it, because it provides a lot of insight into how to read the images and masks and convert them to TFRecords. I will be using this dataset in my subsequent notebooks. You can also easily customize this Notebook if you want to produce tiles of different sizes (P['TILE_SIZE']) or if you want to include more metadata for each tile. 

# Reading the Images
Some of the images are in TIFF format, some are in BigTIFF. I used the tifffile library, which seems to read both with no problem.

In [None]:
pip install tifffile

Libs used in this Notebook

In [None]:
%matplotlib inline

import cv2
import json
import os
import pathlib  # needed by calculate_images_stats (pathlib.Path(...).stem)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import rasterio
from rasterio.windows import Window

#from imread import imread, imsave

import shutil
import psutil

import tensorflow as tf

import glob
import tifffile
import gc
import sys

Let's look at the input data. Find and read the competition train.csv file.

In [None]:
!ls /kaggle/input/

In [None]:
!ls -l /kaggle/input/hubmap-kidney-segmentation/

In [None]:
basepath = '/kaggle/input/hubmap-kidney-segmentation/'
train_df = pd.read_csv(basepath + "train.csv")
train_df.head()

The id corresponds to the images provided. For each image, you are provided a .tiff image file and the Run Length Encoding Mask. 

But notice that the masks are also provided as a .json file with polygon definitions. I used this option instead in this Notebook. 

In [None]:
train_df.shape

In [None]:
!ls -l /kaggle/input/hubmap-kidney-segmentation/train

In [None]:
!ls -l /kaggle/input/hubmap-kidney-segmentation/test

The next cell reads all tiffs and prints their shapes. It turns out that some TIFF images are channel first (number "3" first) and others channel last (number "3" last). When reading them, we must check if the "3" is first and swap the axes as needed. This loop will take a long time, as each image is read in full just so we can tell its shape. 

In [None]:
# verify that we can read all images
def verify_read(file_list):
    for file_name in file_list:
        baseimage = tifffile.imread(file_name)
        #baseimage = tif.series[0].asarray()
        print('img id = {}, shape = {}, dtype = {}'.format(file_name,baseimage.shape, baseimage.dtype ) )
        baseimage = None
        gc.collect()

if EDA:
    print( "before verify_read:", psutil.virtual_memory(), file = sys.stderr )
    file_list = glob.glob('/kaggle/input/hubmap-kidney-segmentation/train/*.tiff')
    verify_read(file_list)
    print( "after verify_read:", psutil.virtual_memory(), file = sys.stderr )


# Reading and Showing a sample image and mask
The next cells show the code that can read the first image in the csv file. 

IMPORTANT: Note that in the case of a "channel first" TIFF (number "3" first) we need to swap the axes of the numpy array as noted below. You CANNOT use "reshape" instead; that will scramble the channels.

In [None]:
#select an image to investigate
working_image_index = 0
working_image_id = train_df['id'][working_image_index]
working_image_id
working_image_path = '/kaggle/input/hubmap-kidney-segmentation/train/'+working_image_id+'.tiff'

Here is the code that takes care of the difference in shapes. 
IMPORTANT: Notice that you need to use the numpy.swapaxes function to change the shape; using "reshape" will scramble the channels.
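
To make the difference concrete, here is a small sketch on a made-up 3 x 2 x 4 array (not competition data). It compares the two-step swapaxes used in this Notebook with numpy.moveaxis, which should give the same result in one call, and with "reshape", which does not:

```python
import numpy as np

# Hypothetical tiny "channel first" image: 3 channels, 2 rows, 4 columns.
chw = np.arange(3 * 2 * 4, dtype=np.uint8).reshape(3, 2, 4)

# Two swaps, as in this Notebook: (C, H, W) -> (H, C, W) -> (H, W, C).
hwc_swapped = chw.swapaxes(0, 1).swapaxes(1, 2)

# np.moveaxis expresses the same move in a single call.
hwc_moved = np.moveaxis(chw, 0, -1)

# reshape keeps the flat memory order, so each "pixel" mixes values
# that belonged to the same channel.
hwc_reshaped = chw.reshape(2, 4, 3)

print(hwc_swapped.shape)                          # (2, 4, 3)
print(np.array_equal(hwc_swapped, hwc_moved))     # True
print(np.array_equal(hwc_swapped, hwc_reshaped))  # False

# Pixel (0, 0) should hold the first value of each of the 3 channels:
print(hwc_swapped[0, 0])   # [ 0  8 16]
print(hwc_reshaped[0, 0])  # [0 1 2]
```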

In [None]:

if EDA:
    baseimage = tifffile.imread(working_image_path)
    print ('original image shape',baseimage.shape)
    print ('original image dtype', baseimage.dtype )
    print ('original image min/max', ( np.amin( baseimage ), np.amax( baseimage ) ) )
    baseimage = np.squeeze(baseimage)
    if( baseimage.shape[0] == 3):
        baseimage = baseimage.swapaxes(0,1)
        baseimage = baseimage.swapaxes(1,2)
        print ('swapped shape',baseimage.shape)

    plt.figure()

    plt.imshow(baseimage)

The masks are provided in the csv files in RLE format, but we are also provided json files that describe the mask as polygons. I will be using the json files:

In [None]:
# read json mask
working_image_json_mask = '/kaggle/input/hubmap-kidney-segmentation/train/'+working_image_id+'.json'
if EDA:
    # Use a context manager so the file is closed after reading
    with open(working_image_json_mask, "r") as read_file:
        mask_data = json.load(read_file)
    print( mask_data[0] )
    mask_data = None
    gc.collect()
    print( psutil.virtual_memory(), file = sys.stderr )

The following function converts the polygons into a numpy boolean mask with the same shape as the image.

In [None]:
def read_mask(mask_file, mask_shape):
    # Use a context manager so the file is closed after reading
    with open(mask_file, "r") as read_file:
        mask_data = json.load(read_file)
    # Collect one coordinate array per annotated polygon
    polys = []
    for feature in mask_data:
        geom = np.array(feature['geometry']['coordinates'])
        polys.append(geom)

    mask = np.zeros(mask_shape, dtype = np.uint8 )
    cv2.fillPoly(mask, polys, 1)
    mask = mask.astype(bool)
    return mask

def read_mask_from_RLE( image_id, mask_shape ):
    '''
    Gets Run-Length Encoded representation of glom mask image corresponding to train image "image_id"
    and converts to binary image, returned.
    '''
    RLE = train_df[ train_df[ 'id' ] == image_id ][ 'encoding' ].values[ 0 ]
    mask = rle2mask( RLE, mask_shape )
    return mask
   

# Run length encoding (RLE) Functions
Based on https://www.kaggle.com/friedchips/fully-correct-hubmap-rle-encoding-and-decoding:
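
Before the real functions, here is a toy round trip on a made-up 3 x 4 mask to show what "column-first" means. The helpers toy_encode and toy_decode are hypothetical, simplified stand-ins written for this illustration; they assume zero-based start positions, as the functions in the next cell do:

```python
import numpy as np

def toy_encode(mask):
    """Column-major run-length encoding: 'start length start length ...'."""
    flat = np.pad(mask.T.reshape(-1), 1)             # 1-D, column-first, zero-padded
    starts = np.nonzero(~flat[:-1] & flat[1:])[0]    # first pixel of each run
    ends = np.nonzero(flat[:-1] & ~flat[1:])[0]      # one past the last pixel
    pairs = np.stack([starts, ends - starts], axis=1).reshape(-1)
    return " ".join(str(v) for v in pairs)

def toy_decode(rle, shape):
    """Inverse of toy_encode for a mask of the given (rows, cols) shape."""
    values = np.array([int(v) for v in rle.split()], dtype=int)
    flat = np.zeros(shape[0] * shape[1], dtype=bool)
    for start, length in zip(values[0::2], values[1::2]):
        flat[start:start + length] = True
    return flat.reshape(shape[1], shape[0]).T        # undo the column-first order

mask = np.array([[0, 1, 0, 0],
                 [1, 1, 0, 1],
                 [0, 0, 0, 1]], dtype=bool)

# Reading down column 0, then column 1, etc. gives 0 1 0 | 1 1 0 | 0 0 0 | 0 1 1
rle = toy_encode(mask)
print(rle)                                              # 1 1 3 2 10 2
print(np.array_equal(toy_decode(rle, mask.shape), mask))  # True
```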

In [None]:
def encode_RLE( mask, column_pixel_offset = 0 ):
    '''
    Given a predicted binary image tile column "mask" and a starting offset in
    column-major ordered pixels, calculate and return the string RLE, which will be
    concatenated with RLEs from other columns, to construct RLE for the entire image:
    '''
    mask = mask.T.reshape(-1) # make 1D, column-first
    mask = np.pad(mask, 1) # make sure that the 1d mask starts and ends with a 0
    starts = np.nonzero((~mask[:-1]) & mask[1:])[0] # start points
    ends = np.nonzero(mask[:-1] & (~mask[1:]))[0] # end points
    rle = np.empty(2 * starts.size, dtype=int) # interlacing...
    rle[0::2] = starts  + column_pixel_offset # ...starts...
    rle[1::2] = ends - starts # ...and lengths
    rle = ' '.join([ str(elem) for elem in rle ]) # turn into space-separated string
    return rle

def rle2mask(rle, mask_shape):
    ''' takes a space-delimited RLE string in column-first order
    and turns it into a 2d boolean numpy array of shape mask_shape '''
    
    mask = np.zeros(np.prod(mask_shape), dtype=bool) # 1d mask array
    rle = np.array(rle.split()).astype(int) # rle values to ints
    starts = rle[::2]
    lengths = rle[1::2]
    for s, l in zip(starts, lengths):
        mask[s:s+l] = True
    return mask.reshape(np.flip(mask_shape)).T # flip because of column-first order


In [None]:
if EDA:
    mask_shape = (baseimage.shape[0], baseimage.shape[1])
    mask = read_mask(working_image_json_mask, mask_shape)
    plt.imshow(mask)
    gc.collect()
    print( psutil.virtual_memory(), file = sys.stderr )

In [None]:
if EDA:
    baseimage.dtype

In [None]:
if EDA:
    mask.dtype

In [None]:
if EDA:
    mask.shape

So, now we know how to read each image and mask, and that their types are uint8 and bool respectively. But their dimensions are very large; we would not be able to train an ML model at these dimensions. So, in the next section I take the approach of tiling the image and working with tiles.

# Tiling the Large Images into tiles with overlap
Tiles are square arrays of pixels with two regions:   An inner region consisting of a P['TILE_SIZE'] pixel diameter square surrounded by an outer annular region with a radius
of P['TILE_OVERLAP'] pixels.  Tiles are arranged so that inner regions abut to cover the entire image, with outer regions overlapping adjacent tiles.

NOTE: The image numpy arrays have dimensions [height, width, channels]. This Notebook will tile the image using offsets for the height index and width index. So:
- a Tile with coordinate [0,0] represents the first tile on the top left corner.  
- a Tile with coordinate [1,0] represents a tile with height offset = 1*Tile Size, in this case it starts at numpy coordinates [P['TILE_SIZE'],0], which means it is the tile below [0,0]
- a Tile with coordinate [0,1] represents a tile with width offset = 1*Tile Size, in this case it starts at numpy coordinates [0,P['TILE_SIZE']], which means it is the tile to the right of [0,0]
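
The arithmetic behind these coordinates can be sketched as follows. The helper tile_pixel_bounds is hypothetical (it is not part of this Notebook) and assumes the tiling starts at pixel [0,0] of the image; it only shows how a tile coordinate maps to a pixel range once the overlap is added on every side:

```python
TILE_SIZE = 512     # mirrors P['TILE_SIZE']
TILE_OVERLAP = 64   # mirrors P['TILE_OVERLAP']

def tile_pixel_bounds(tile_row, tile_col, tile_size=TILE_SIZE, overlap=TILE_OVERLAP):
    """Hypothetical helper: pixel extent of one tile, overlap included.

    Returns (row_start, row_stop, col_start, col_stop) with exclusive stops.
    A negative start (or a stop beyond the image) means the overlap hangs
    over the image edge and has to be filled in some other way.
    """
    row_start = tile_row * tile_size - overlap
    row_stop = (tile_row + 1) * tile_size + overlap
    col_start = tile_col * tile_size - overlap
    col_stop = (tile_col + 1) * tile_size + overlap
    return row_start, row_stop, col_start, col_stop

print(tile_pixel_bounds(0, 0))  # (-64, 576, -64, 576)
print(tile_pixel_bounds(1, 0))  # (448, 1088, -64, 576)
print(tile_pixel_bounds(0, 1))  # (-64, 576, 448, 1088)
```

With these values every full tile is 512 + 2 * 64 = 640 pixels on a side, and neighbouring tiles share a 128-pixel-wide strip.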

Here are some useful functions that use the numpy slicing capability to select specific tiles of the image. 
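
The class in the next cell fills any overlap that would fall outside the image by reflecting the image through its boundary (it calls cv2.copyMakeBorder with cv2.BORDER_REFLECT). As a NumPy-only illustration of that reflection pattern, and assuming I have the correspondence right, numpy.pad with mode="symmetric" produces the same mirroring, where the edge pixel itself is repeated:

```python
import numpy as np

# A made-up 2 x 4 "image" sitting at the edge of a larger one.
image = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8]])

# Pad 2 pixels on the left and right only. mode="symmetric" mirrors about
# the border and repeats the edge pixel (pattern: b a | a b c d | d c).
padded = np.pad(image, ((0, 0), (2, 2)), mode="symmetric")
print(padded)
# [[2 1 1 2 3 4 4 3]
#  [6 5 5 6 7 8 8 7]]

# Removing the padding again recovers the original pixels.
overlap = 2
restored = padded[:, overlap:padded.shape[1] - overlap]
print(np.array_equal(restored, image))  # True
```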

In [None]:
class OverlappingTiledImage:
    '''
    A class whose objects can be used to extract overlapping tiles from a larger image,
    whose primary method get_tile extends nominal tile size by specified overlap, 
    reflecting through the boundaries of the larger image if the tile + overlap would
    extend past the boundary.  We use the convention that "row" and "col" (lower case)
    refer to location of pixels in one tile, while "ROW" and "COL" (upper case) refer
    to the location of the tile in the tableau of tiles covering the image.
    '''
    # Primary data members, specified by constructor arguments:
    # self.image:   The large "base" image from which tiles are extracted
    # self.tile_pixel_rows:  The number of rows of pixels in a tile
    # self.tile_pixel_cols:  The number of columns of pixels in a tile
    # self.tile_pixel_overlap: The overlap, in pixels, between adjacent tiles, in all directions
    # self.image_pixel_row_start:  The pixel row number of the upper left corner of the (ROW=0, COL=0) tile
    # self.image_pixel_col_start:  The pixel column number " " "
    # self.image_pixel_row_stop: The pixel row number past the lower right corner of the (ROW=tile_ROWS-1,COL=tile_COLS-1) tile
    # self.image_pixel_col_stop: The pixel column number " " "
    # Derived data members:
    # self.tile_ROWS:  The number of ROWS of tiles in the tableau of overlapping tiles
    # self.tile_COLS:  The number of COLUMNS of tiles in the tableau of overlapping tiles
    
    # "public" methods:
    def __init__ ( self, image, pixel_rows, pixel_cols, pixel_overlap, 
                   image_pixel_row_start = 0, image_pixel_col_start = 0,
                   image_pixel_row_stop = None, image_pixel_col_stop = None ):
        # Process defaults for last two args:
        if image_pixel_row_stop is None:
            image_pixel_row_stop = image.shape[ 0 ]
        if image_pixel_col_stop is None:
            image_pixel_col_stop = image.shape[ 1 ]
        # Sanity checks
        assert pixel_rows > 0
        assert pixel_cols > 0
        assert pixel_overlap >= 0
        assert image_pixel_row_start >= 0
        assert image_pixel_row_stop <= image.shape[ 0 ]
        assert image_pixel_row_start < image_pixel_row_stop
        assert image_pixel_col_start >= 0
        assert image_pixel_col_stop <= image.shape[ 1 ]
        assert image_pixel_col_start < image_pixel_col_stop
        # Copy primary data members
        self.image = image
        self.tile_pixel_rows = pixel_rows
        self.tile_pixel_cols = pixel_cols
        self.tile_pixel_overlap = pixel_overlap
        self.image_pixel_row_start = image_pixel_row_start
        self.image_pixel_col_start = image_pixel_col_start
        self.image_pixel_row_stop = image_pixel_row_stop
        self.image_pixel_col_stop = image_pixel_col_stop
        # Derive data members
        self.tile_ROWS = ( image_pixel_row_stop - image_pixel_row_start ) // pixel_rows
        self.tile_COLS = ( image_pixel_col_stop - image_pixel_col_start ) // pixel_cols
        '''
        print( "image_pixel_row_stop", image_pixel_row_stop, "image_pixel_row_start", image_pixel_row_start, file = sys.stderr )
        print( "image_pixel_col_stop", image_pixel_col_stop, "image_pixel_col_start", image_pixel_col_start, file = sys.stderr )
        print( "pixel_rows", pixel_rows, "pixel_cols", pixel_cols, file = sys.stderr )
        '''
        
    def SHAPE( self ):
        '''
        Returns:  Shape of overlapping image in tiles
        '''
        return ( self.tile_ROWS, self.tile_COLS )
    
    def tile_shape( self ):
        '''
        Returns:  Shape + overlap of individual tile, in pixels
        '''
        return ( self.tile_pixel_rows, self.tile_pixel_cols, self.tile_pixel_overlap )
    
    def image_shape( self ):
        '''
        Returns:  shape of underlying image, ignoring image_pixel_row_start, etc.
        '''
        return self.image.shape
    
    def get_tile( self, tile_ROW, tile_COL ):
        assert ( tile_ROW >= 0 ) & ( tile_ROW < self.tile_ROWS )
        assert ( tile_COL >= 0 ) & ( tile_COL < self.tile_COLS )
        pixel_row_start = self.image_pixel_row_start + tile_ROW * self.tile_pixel_rows - self.tile_pixel_overlap
        pixel_col_start = self.image_pixel_col_start + tile_COL * self.tile_pixel_cols - self.tile_pixel_overlap
        pixel_row_stop_no = min( self.image_pixel_row_start + ( 1 + tile_ROW ) * self.tile_pixel_rows, self.image_pixel_row_stop )
        pixel_col_stop_no = min( self.image_pixel_col_start + ( 1 + tile_COL ) * self.tile_pixel_cols, self.image_pixel_col_stop )
        pixel_row_stop = pixel_row_stop_no + self.tile_pixel_overlap
        pixel_col_stop = pixel_col_stop_no + self.tile_pixel_overlap
        '''
        print( "  pixel_row_stop_no", pixel_row_stop_no, file = sys.stderr )
        print( "  pixel_col_stop_no", pixel_col_stop_no, file = sys.stderr )
        print( "  pixel_row_start", pixel_row_start, "pixel_col_start", pixel_col_start, file = sys.stderr )
        print( "  pixel_row_stop", pixel_row_stop, "pixel_col_stop", pixel_col_stop, file = sys.stderr )
        '''
        
        tile = self.image[ max( pixel_row_start, self.image_pixel_row_start ) : min( pixel_row_stop, self.image_pixel_row_stop ),
                           max( pixel_col_start, self.image_pixel_col_start ) : min( pixel_col_stop, self.image_pixel_col_stop ) ]
        if ( pixel_row_start < self.image_pixel_row_start ):
            tile = self.extend_top( tile, pixel_row_start )
            # print( "  extend_top", file = sys.stderr )
        if ( pixel_row_stop > self.image_pixel_row_stop ):
            tile = self.extend_bottom( tile, pixel_row_stop )
            # print( "  extend_bottom", file = sys.stderr )
        if ( pixel_col_start < self.image_pixel_col_start ):
            tile = self.extend_left( tile, pixel_col_start )
            # print( "  extend_left", file = sys.stderr )
        if ( pixel_col_stop > self.image_pixel_col_stop ):
            tile = self.extend_right( tile, pixel_col_stop )
            # print( "  extend_right",file = sys.stderr )
        return tile
    
    def remove_overlap( self, tile ):
        '''
        Returns: "tile" with overlap removed
        '''
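        # Use explicit stop indices so that an overlap of 0 returns the whole
        # tile (a stop of "-0" would produce an empty array).
        overlap = self.tile_pixel_overlap
        return tile[ overlap : tile.shape[ 0 ] - overlap, overlap : tile.shape[ 1 ] - overlap ]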
    
    # "private" methods:
    
    def copyMakeBorder( self, tile, border_top, border_bot, border_left, border_right, treatment = cv2.BORDER_REFLECT ):
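        # The "np.bool" alias was removed in NumPy 1.24; the builtin "bool" works across versions.
        # cv2 cannot pad boolean arrays, so convert to uint8 and back.
        if tile.dtype == bool:
            return cv2.copyMakeBorder( tile.astype( np.uint8 ), border_top, border_bot, border_left, border_right, treatment ).astype( bool )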
        else:
            return cv2.copyMakeBorder( tile, border_top, border_bot, border_left, border_right, treatment )
        
    
    def extend_top( self, tile, pixel_row_start ):
        return self.copyMakeBorder( tile, self.image_pixel_row_start - pixel_row_start, 0, 0, 0 )
            
    def extend_bottom( self, tile, pixel_row_stop ):
        return self.copyMakeBorder( tile, 0, pixel_row_stop - self.image_pixel_row_stop, 0, 0 )
    
    def extend_left( self, tile, pixel_col_start ):
        return self.copyMakeBorder( tile, 0, 0, self.image_pixel_col_start - pixel_col_start, 0 )

    def extend_right( self, tile, pixel_col_stop ):
        return self.copyMakeBorder( tile, 0, 0, 0, pixel_col_stop - self.image_pixel_col_stop )


In [None]:
#explore a few tiles
def show_tile_and_mask(baseimage_oti, mask_oti, tile_row_pos, tile_col_pos):
    # get_tile expects ( tile ROW, tile COL ), in that order
    tile_image = baseimage_oti.get_tile( tile_row_pos, tile_col_pos )
    tile_mask = mask_oti.get_tile( tile_row_pos, tile_col_pos )
    fig, ax = plt.subplots(1,2,figsize=(20,3))
    ax[0].imshow(tile_image)
    ax[1].imshow(tile_mask)
'''    
def get_tile(baseimage, tile_size, tile_col_pos, tile_row_pos):
    start_col = tile_col_pos*tile_size
    end_col = start_col + tile_size
    start_row = tile_row_pos * tile_size
    end_row = start_row + tile_size
    tile_image = baseimage[start_col:end_col, start_row:end_row,:]
    return tile_image

def get_tile_mask(baseimage, tile_size, tile_col_pos, tile_row_pos):
    start_col = tile_col_pos*tile_size
    end_col = start_col + tile_size
    start_row = tile_row_pos * tile_size
    end_row = start_row + tile_size
    tile_image = baseimage[start_col:end_col, start_row:end_row]
    return tile_image

'''  
def show_tile_dist(tile):
    fig, ax = plt.subplots(1,2,figsize=(20,3))
    #ax[0].set_title("Tile ID = {} Xpos = {} Ypos = {}".format(img_mtd['tile_id'], img_mtd['tile_col_pos'],img_mtd['tile_row_pos']))
    ax[0].imshow(tile)
    ax[1].set_title("Pixelarray distribution");
    sns.distplot(tile.flatten(), ax=ax[1]);


As can be noticed in the sample image displayed, there is a black border and then a lot of white surrounding the tissue. If we select [0,0] we expect to see a black tile. If we move a little to the right and down, we are then in the white zone. So let's try the values [0,0] and [5,5]; we should see a black and a white tile respectively.

In [None]:
if EDA:
    tile_size = P[ 'TILE_SIZE' ]
    overlap = P[ 'TILE_OVERLAP' ]
    baseimage_oti = OverlappingTiledImage( baseimage, tile_size, tile_size, overlap )
    mask_oti = OverlappingTiledImage( mask, tile_size, tile_size, overlap )
    tile = baseimage_oti.get_tile( 0, 0)
    print( "tile.shape", tile.shape, file = sys.stderr )
    show_tile_dist(tile)
    print( "baseimage_oti.SHAPE()", baseimage_oti.SHAPE(), file = sys.stderr )
    gc.collect()
    print( psutil.virtual_memory(), file = sys.stderr )
    mask_tile = mask_oti.get_tile( 0, 0 )
    print( "mask_tile.dtype", mask_tile.dtype, file = sys.stderr )

Black as predicted. As we explore the tiles, I also calculate the tile histogram. The histogram turns out to provide a useful way to filter out black and white tiles later. The numpy.histogram function divides the range of pixel values in the tile into 10 bins and shows how many pixels fall within each bin. We can notice that black and white fall into the higher end of the spectrum. Black tiles have 0 pixels in the lower end, while "white" (actually "dirty gray") has only about 20 pixels in that region. Tiles with some actual tissue have a more even distribution. Let's call the pixel count in the lowest four bins "lowpass energy". It turns out that if we later select lowpass energy > 100 we are guaranteed to have actual tissue in the slide, and we can discard anything with < 100. 
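
Here is a minimal sketch of this metric on synthetic tiles (random noise stands in for tissue; these are not competition images). The function name lowpass_energy is my own label for the sum of the first four histogram bins. Note that numpy.histogram, called without a range, spreads its 10 bins between the tile's own minimum and maximum, so the metric is relative to each tile:

```python
import numpy as np

def lowpass_energy(tile):
    """Number of pixel values falling in the lowest 4 of 10 histogram bins."""
    counts, _ = np.histogram(tile)   # 10 equal-width bins from tile min to tile max
    return int(np.sum(counts[0:4]))

rng = np.random.default_rng(0)

# A uniformly black tile: every value lands in a single bin, which is not
# one of the first four.
black_tile = np.zeros((64, 64, 3), dtype=np.uint8)

# A noisy tile with values spread over 0..255, standing in for tissue.
noisy_tile = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

print(lowpass_energy(black_tile))         # 0
print(lowpass_energy(noisy_tile) > 100)   # True
```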

In [None]:
if EDA:
    img_hist = np.histogram(tile)
    print('histogram = {}'.format(img_hist[0]))
    print('histogram_lowpass = {}'.format(np.sum(img_hist[0][0:4])))

And here is the white one ([5,5]):

In [None]:
if EDA:
    tile = baseimage_oti.get_tile( 5, 5)
    show_tile_dist(tile)

In [None]:
if EDA:
    img_hist = np.histogram(tile)
    print('histogram = {}'.format(img_hist[0]))
    print('histogram_lowpass = {}'.format(np.sum(img_hist[0][0:4])))

In [None]:
if EDA:
    tile = baseimage_oti.get_tile( 8, 20 )
    print( "tile.shape", tile.shape, file = sys.stderr )
    show_tile_dist(tile)

#### Now let's try to find a glomerulus. If we look back at the polygon dump above, it shows that the first glom starts at pixel [10503, 4384]. If we divide both indexes by 512, we expect to find a glom in tile [8,20]
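
The division can be checked in a couple of lines. This sketch assumes the polygon vertex is stored as (x, y), so that x selects the tile column and y the tile row, which is consistent with the tile [8,20] quoted above:

```python
TILE_SIZE = 512          # mirrors P['TILE_SIZE']

x, y = 10503, 4384       # first vertex of the first glom polygon: x = column, y = row
tile_col = x // TILE_SIZE
tile_row = y // TILE_SIZE
print(tile_row, tile_col)  # 8 20
```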

In [None]:
if EDA:
    img_hist = np.histogram(tile)
    print('histogram = {}'.format(img_hist[0]))
    print('histogram_lowpass = {}'.format(np.sum(img_hist[0][0:4])))

In [None]:
if EDA:
    show_tile_and_mask(baseimage_oti, mask_oti, 8, 20)

Bingo!!! We found our first glom. Let's now derive a metric for masks, so that in the future we can easily find tiles with gloms. This metric will be used when we want to filter the training dataset to make sure it includes a certain number of tiles with gloms. Simply counting the number of "TRUE" pixels in the mask is a great metric for indicating whether the tile contains a glom.
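
As a quick illustration on a synthetic mask (a rectangle stands in for a glom), the metric and a simple "has glom" test could look like this:

```python
import numpy as np

# Synthetic 640 x 640 boolean mask tile with one 50 x 40 "glom".
tile_mask = np.zeros((640, 640), dtype=bool)
tile_mask[100:150, 200:240] = True

mask_density = np.count_nonzero(tile_mask)   # number of TRUE pixels
print(mask_density)                          # 2000

# A tile counts as blank when no mask pixel is set.
empty_mask = np.zeros((640, 640), dtype=bool)
print(np.count_nonzero(empty_mask) == 0)     # True
```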

In [None]:
if EDA:
    tile_mask = mask_oti.get_tile( 8, 20)
    mask_density = np.count_nonzero(tile_mask)
    print( "mask_density", mask_density, "/", tile_mask.shape[0] * tile_mask.shape[1] )

Now let's move down the image by incrementing the height offset to [9,20], which should be the tile below [8,20]

In [None]:
if EDA:
    show_tile_and_mask(baseimage_oti, mask_oti, 9, 20)

In [None]:
if EDA:
    tile_mask = mask_oti.get_tile( 9, 20)
    mask_density = np.count_nonzero(tile_mask)
    mask_density

So, the glom ends in that tile, and there are fewer TRUE pixels. Going further down we find a cortex tile with no gloms.

In [None]:
if EDA:
    show_tile_and_mask(baseimage_oti, mask_oti, 10, 20)

In [None]:
if EDA:
    tile_mask = mask_oti.get_tile( 10, 20)
    mask_density = np.count_nonzero(tile_mask)

    print( "\nAt end of visualizations, before gc, memory is", psutil.virtual_memory(), file = sys.stderr, flush = True )
    baseimage = None
    mask = None
    baseimage_oti = None
    mask_oti = None
    gc.collect()
    print( "\nAt end of visualizations, memory is", psutil.virtual_memory(), file = sys.stderr, flush = True )

    mask_density

# Functions for normalizing images (Optional)
In effect, performs linear transform independently for each of the three color channels of the input images so that resulting images have standardized mean (128/255) and standard deviation (42/255)
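
A self-contained sketch of that transform on a synthetic image (the per-channel means and standard deviations below are made up, and the statistics are taken from the whole image rather than from a central window) shows the effect. It is only meant to illustrate the arithmetic, not to replace the functions in the next cell:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic H x W x 3 image whose channels have different statistics.
image = rng.normal(loc=[180.0, 120.0, 150.0],
                   scale=[20.0, 10.0, 30.0],
                   size=(256, 256, 3))

mean = image.mean(axis=(0, 1))   # one mean per channel
std = image.std(axis=(0, 1))     # one standard deviation per channel

z = (image - mean) / std                   # per-channel z-score
z = np.clip(z, -3.0, 3.0)                  # keep values within +/- 3 std
normalized = z * (255.0 / 6.0) + 128.0     # 6 std span the 0-255 range
normalized = np.clip(normalized, 0, 255).astype(np.uint8)

# Every channel should now sit near mean 128 and std 42.
print(normalized.mean(axis=(0, 1)).round())
print(normalized.std(axis=(0, 1)).round())
```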

In [None]:
def calculate_images_stats( tiff_image_dirnames ):
    image_stats = {}
    for tiff_image_dirname in tiff_image_dirnames:
        for tiff_image_filename in glob.glob( tiff_image_dirname + "*.tiff" ):
            stats = calculate_image_stats( tiff_image_filename )
            image_id = pathlib.Path( tiff_image_filename ).stem
            print( "for", image_id, "stats are", stats )
            image_stats[ image_id ] = stats
    return image_stats
        
def calculate_image_stats( tiff_image_filename ):
    '''
    Samples "tiff_image_filename" in a square of WINDOW_RADIUS (half-width)
    Returns:
        ( mean, std ), each a 3-array for three channels
    '''
    WINDOW_RADIUS = 1024  # ### SHOULD BE IN "P[...]" ###
    with rasterio.open( tiff_image_filename ) as tiff_image_dataset:
        image_rows, image_cols = tiff_image_dataset.shape
        window = Window.from_slices ( ( image_rows // 2 - WINDOW_RADIUS,
                                        image_rows // 2 + WINDOW_RADIUS ),
                                      ( image_cols // 2 - WINDOW_RADIUS,
                                        image_cols // 2 + WINDOW_RADIUS ) )
        window_image = tiff_image_dataset.read( [1, 2, 3 ], window = window )
        window_image = np.moveaxis( window_image, 0, -1 ) # Channel -> last
        return calculate_window_stats( window_image )
    
def calculate_window_stats( window_image ):    
    mean = np.mean( window_image, axis = ( 0, 1 ) ).astype( int )
    std = np.std( window_image, axis = ( 0, 1 ) ).astype( int )
    stats = ( mean, std )
    return stats
    
def normalize_image( image, stats ):
    '''
    Normalizes the R x C x 3 "image" by subtracting the image mean and
    dividing by the image stdev, return result
    '''
    mean, std = stats
    assert image.shape[ 2 ] == 3
    assert mean.shape[ 0 ] == 3
    assert std.shape[ 0 ] == 3
    # For each channel "c", we will subtract from image[:,:,c] the value
    # of mean[c], then divide by std[c], clip the result to +/- 3 (which
    # keeps about 99.7% of the values if they are roughly normally
    # distributed), and then rescale to 0-255, so we
    # return a "uint8" result consistent with what's read using rasterio.
    # "np.float" was removed in NumPy 1.24; np.float64 is the same type
    image = image.astype( np.float64 )
    image -= mean.astype( np.float64 )
    EPSILON = 1E-6
    image /= ( std.astype( np.float64 ) + EPSILON)
    image = np.clip( image, - ( 3.0 - EPSILON ), ( 3.0 - EPSILON ) )
    image *= 255.0 / 6.0
    image += 128.0
    image = np.clip( image, 0, 255 ).astype( np.uint8 )
    return image

'''
TEST:
calculate_images_stats( ( "/kaggle/input/hubmap-kidney-segmentation/test/",
                          "/kaggle/input/hubmap-kidney-segmentation/train/" ) )
''' 

# Transforming the Tiles into a TFRecord Dataset
We are now ready to read all the images (one at a time, or we will run out of memory!) and write each tile to a TFRecord file. Kaggle has a limit of 50 upper level directories, so we will create one dir for each image. We will also build a pandas dataframe that has the metadata for each tile, including the lowpass energy and mask density metrics that we derived above. 

Using the TFRecord format for storing data should be easy, but unfortunately it requires data serialization which complicates it a little bit. This is done using [protocol buffers](https://developers.google.com/protocol-buffers/) and that is a bit of a learning curve. But in ML you only need to understand the [TFExample](https://www.tensorflow.org/api_docs/python/tf/train/Example) format. In this Notebook I provide a little template code for dealing with TFExamples that can be quickly customized for any type of data. This template is explained in detail in [this tutorial](https://www.tensorflow.org/tutorials/load_data/tfrecord); but you don't need to read all this, in this Notebook I provide an example specific for image data that you can quickly customize.

For serialization using TFExample, we have to make any data fit into either one of 3 types:
* bytes_feature
* float_feature
* int_64_feature

In this Notebook the image and mask are passed as bytes_features and the other metadata as int_64_features. 
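
Passing an array as raw bytes throws away its shape and dtype, which is why the height, width and number of channels are stored as separate int_64 features. A NumPy-only sketch on a made-up tile (no TensorFlow involved) shows the round trip:

```python
import numpy as np

# Tiny made-up tile (H x W x C) and its boolean mask (H x W).
tile = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
mask = np.array([[True, False, True],
                 [False, False, True]])

img_bytes = tile.tobytes()    # flat bytes, no shape or dtype information
mask_bytes = mask.tobytes()
print(len(img_bytes), len(mask_bytes))  # 18 6

# To rebuild the arrays the reader must know the dtype and the shape.
height, width, num_channels = tile.shape
restored_tile = np.frombuffer(img_bytes, dtype=np.uint8).reshape(height, width, num_channels)
restored_mask = np.frombuffer(mask_bytes, dtype=bool).reshape(height, width)

print(np.array_equal(restored_tile, tile))  # True
print(np.array_equal(restored_mask, mask))  # True
```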

In [None]:
# Utilities serialize data into a TFRecord
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


In [None]:
def image_example(image_index, image_tile, mask_tile, pixel_overlap, tile_id, tile_col_pos, tile_row_pos):
    image_tile_shape = image_tile.shape
    
    # ndarray.tostring() is a deprecated alias of tobytes() and is gone in NumPy 2.0
    img_bytes = image_tile.tobytes()

    mask_bytes = np.zeros((0,0)).tobytes() if mask_tile is None else mask_tile.tobytes()
    
    feature = {
        'img_index': _int64_feature(image_index),
        'height': _int64_feature(image_tile_shape[0]),  # NOTE this and "width" >includes< 2 * pixel_overlap
        'width': _int64_feature(image_tile_shape[1]),
        'pixel_overlap': _int64_feature(pixel_overlap),
        'num_channels': _int64_feature(image_tile_shape[2]),
        'image': _bytes_feature(img_bytes),
        'mask' : _bytes_feature(mask_bytes),
        'tile_id':  _int64_feature(tile_id),
        'tile_col_pos': _int64_feature(tile_col_pos),
        'tile_row_pos': _int64_feature(tile_row_pos),
    }

    return tf.train.Example(features=tf.train.Features(feature=feature))

This function writes a tile to storage. Notice the GZIP compression -- this makes it possible for all the tiles to be stored locally without exceeding the HD allowance of the Kaggle machine.

In [None]:
def write_tfrecord( image_index, image_tile, mask_tile, pixel_overlap, tile_id, tile_col_pos, tile_row_pos, tf_writer):
    tf_example = image_example(image_index, image_tile, mask_tile, pixel_overlap, tile_id, tile_col_pos, tile_row_pos)
    tf_writer.write(tf_example.SerializeToString())

Here is the function that takes an image, slices it into tiles, calculates tile metadata and commits the tiles to storage. It also builds a pandas dataframe with the metadata for all tiles. 

In [None]:
def write_tfrecord_tiles( image_index, image_id, 
                          image_oti, mask_oti, output_dir,
                          image_stats = None ):
    '''
    Write all the tiles for "image_oti" and "mask_oti" (if non-None) to a single
    GZIP-compressed file, "output_dir" + "image_id" + "-" + <number of tiles written> + ".tfrec"
    Args:
        image_index    0-origin index of original image/mask
        image_id       8-digit hexadecimal identifier of image/mask
        image_oti      Overlapping tile generator for original image
        mask_oti       Overlapping tile generator for mask, may be None if not training
        output_dir     For storing the single .tfrec file that all tiles are written to
        image_stats    Per-image statistics passed to normalize_image, or None to skip normalization
    Returns:
        Dataframe describing all tiles for this image / mask 
    '''
    print( "write_tfrecord_tiles, output_dir", output_dir, "image_id", image_id, file = sys.stderr )
    # Check that "image_oti" and "mask_oti" match:
    if mask_oti is not None:
        assert image_oti.SHAPE() == mask_oti.SHAPE()
        assert image_oti.tile_shape() == mask_oti.tile_shape()
    
    tile_rows, tile_cols = image_oti.SHAPE()
    tile_pixel_rows, tile_pixel_cols, tile_pixel_overlap = image_oti.tile_shape()
    tileID = 0
    
    print( "write_tfrecord_tiles for image", image_id, tile_rows, "x", tile_cols, "tiles", flush = True )

    # create a pandas dataframe to store metadata for each tile
    tile_df = pd.DataFrame(columns = ['img_index', 'img_id','tile_id', 'tile_rel_path',
                                      'tile_col_num', 'tile_row_num', 
                                      'tile_pixel_rows', 'tile_pixel_cols', 'tile_pixel_overlap', 
                                      'lowband_density', 'mask_density'])

    output_path = output_dir + image_id + ".tfrec"
    print( "image_id", image_id, "output_path", output_path, file = sys.stderr, flush = True )
    
    opts = tf.io.TFRecordOptions(compression_type="GZIP")
    with tf.io.TFRecordWriter(output_path, opts) as tf_writer:

        for col_number in range(tile_cols):

            print('tile col_number {} '.format(col_number),end='', flush = True )
    
            for row_number in range(tile_rows):
                
                relative_path = image_id+'/col{}_row{}.tfrec'.format(col_number,row_number)
    
                # First, look at the glom mask for this tile.  If it's 
                # empty and P['SKIP_BLANK_TILES'] is true, skip it:
                if mask_oti is None:
                    tile_mask = None
                else:
                    tile_mask = mask_oti.get_tile( row_number, col_number )
                    if P['SKIP_BLANK_TILES'] and np.sum( tile_mask ) == 0:
                        continue

                # Write this image, mask tile:
                image_tile = image_oti.get_tile( row_number, col_number )
                # If selected, normalize all three color channels of the tile:
                if image_stats is not None:
                    image_tile = normalize_image( image_tile, image_stats)

                # write_tfrecord returns nothing, so there is no result to keep
                write_tfrecord( image_index, image_tile, tile_mask, tile_pixel_overlap, 
                                tileID, col_number, row_number, tf_writer )
                
                # populate the metadata for this tile
                img_hist = np.histogram(image_tile)
                lowband_density = np.sum(img_hist[0][0:4])
                mask_density = 0 if tile_mask is None else  np.count_nonzero(tile_mask)
                # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
                tile_row = {'img_index':image_index, 'img_id':image_id, 'tile_id': tileID, 
                            'tile_rel_path':relative_path, 
                            'tile_col_num':col_number, 'tile_row_num':row_number,
                            'tile_pixel_rows':tile_pixel_rows, 
                            'tile_pixel_cols':tile_pixel_cols, 
                            'tile_pixel_overlap':tile_pixel_overlap,
                            'lowband_density':lowband_density, 'mask_density':mask_density}
                tile_df = pd.concat( [tile_df, pd.DataFrame( [tile_row] )], ignore_index=True )
                tileID += 1
                
    # Follow Wojtek Rosa's convention to include number of tiles in tfrec filename
    os.rename( output_path, output_path.replace( image_id, image_id + "-" + str( tileID ) ) )
                
    return tile_df
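
Because the final file name carries the number of tiles, a consumer can learn the size of the data set without opening any TFRecord file. A minimal sketch of reading that count back, assuming names of the form `<image_id>-<count>.tfrec` as produced by the rename above; `count_tiles` and the example paths are hypothetical:

```python
import os
import re


def count_tiles(tfrec_paths):
    """Sum the tile counts encoded as '<image_id>-<count>.tfrec' in file names."""
    total = 0
    for path in tfrec_paths:
        match = re.search(r"-(\d+)\.tfrec$", os.path.basename(path))
        if match is None:
            raise ValueError(f"no tile count in file name: {path}")
        total += int(match.group(1))
    return total


# Hypothetical file names, for illustration only
paths = ["/kaggle/tmp/train/0a1b2c3d-212.tfrec", "/kaggle/tmp/train/4e5f6a7b-87.tfrec"]
print(count_tiles(paths))
```

Raising on a name without a count is a deliberate choice here, so that a file written without the rename step is noticed rather than silently counted as zero.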

This function reads each image and mask in the train set, swaps axes when needed so that the color channels come last, loads the image and mask into numpy arrays, and then invokes the above function for each image/mask pair. This will take a long time...

In [None]:
def tile_and_build_tfr( input_dir, output_dir, do_mask = True, do_normalize = False,
                        tile_size = P['TILE_SIZE'], tile_overlap = P['TILE_OVERLAP'] ):
    '''
    For all original images/masks in "input_dir", break the images/masks into
    overlapping tiles and place them in TFRecord files in "output_dir", one file
    per original image/mask.  This aligns with input conventions assumed by
    Wojtek Rosa's code that we will borrow from.

    Args:
        input_dir    (e.g., '/kaggle/input/hubmap-kidney-segmentation/train/')
        output_dir   (e.g., "/kaggle/working/train/"); deleted and re-created if it already exists
        do_mask      Iff True, tile mask images as well
        do_normalize Iff True, compute per-image statistics and normalize each tile
        tile_size    Tiles are ( tile_size x tile_size ) PLUS border of tile_overlap on all four sides
        tile_overlap Width in pixels of the border shared with adjacent tiles
    Returns:
        None
    '''
    print( "tile_and_build_tfr, input_dir", input_dir, "output_dir", output_dir, file = sys.stderr, flush = True )
    image_file_list = glob.glob( input_dir + '[0-9a-f]*.tiff')
                              
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir)  # also creates missing parent directories

    for image_index, image_file_name in enumerate( image_file_list ):
        
        # ### mask_file_name = image_file_name.replace( "tiff", "json" )
        image_id = os.path.split( image_file_name )[1].split(".")[0]
        print( "image_id", image_id )
        
        baseimage = tifffile.imread( image_file_name )
        print( 'original image shape {}, ID = {}'.format( baseimage.shape, image_id ), flush = True )
        baseimage = np.squeeze(baseimage)
        if( baseimage.shape[0] == 3):
            baseimage = baseimage.swapaxes(0,1)
            baseimage = baseimage.swapaxes(1,2)
            print ('swapped shape',baseimage.shape)
            
        if do_normalize:
            image_stats = calculate_image_stats( image_file_name )
            print( "for image_index", image_index, "image_stats: ", image_stats )
        else:
            image_stats = None
            
        # build the mask from its run-length encoding (RLE) in train.csv
        if do_mask:
            mask_shape = (baseimage.shape[0], baseimage.shape[1])
            # ### mask = read_mask( mask_file_name, mask_shape)
            mask = read_mask_from_RLE( image_id, mask_shape)
        else:
            mask = None
        
        # Set up overlap tiling for image and mask:
        image_oti = OverlappingTiledImage( baseimage, tile_size, tile_size, tile_overlap )
        mask_oti = None if mask is None else OverlappingTiledImage( mask, tile_size, tile_size, tile_overlap )
        
        print('writing {} x {} tiles for image {}'.format(image_oti.SHAPE()[ 0 ], image_oti.SHAPE()[ 1 ], image_id ) )
        if mask_oti is not None:
            print('writing {} x {} tiles for mask {}'.format(mask_oti.SHAPE()[ 0 ], mask_oti.SHAPE()[ 1 ], image_id ) )
        tile_df = write_tfrecord_tiles( image_index, image_id, 
                                        image_oti, mask_oti, output_dir,
                                        image_stats )
        
        #write the dataframe
        print('writing tile metadata for image {}'.format(image_id))
        df_path = output_dir+image_id+'_tiles.csv'
        tile_df.to_csv(df_path)
        
        del baseimage
        del mask
        del image_oti
        del mask_oti
        gc.collect()
        print( f"\nAt end of writing tiles for {image_id}, memory is {psutil.virtual_memory()}", file = sys.stderr )
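
The docstring above says each tile is ( tile_size x tile_size ) plus a border of tile_overlap on all four sides. One plausible reading of that geometry is sketched below. It assumes the grid has ceil(extent / tile_size) tiles per axis and that windows reaching outside the image are filled in by the tiler; the OverlappingTiledImage class defined earlier may handle edges differently. `tile_grid` and `tile_window` are hypothetical helpers.

```python
import math


def tile_grid(image_rows, image_cols, tile_size):
    """Number of tile rows and columns needed to cover the image (assumed ceil rule)."""
    return math.ceil(image_rows / tile_size), math.ceil(image_cols / tile_size)


def tile_window(index, tile_size, tile_overlap):
    """[start, stop) pixel range of tile `index` along one axis, border included.

    The range may start below 0 or run past the image edge; those pixels
    would have to be padded by whatever class does the actual tiling.
    """
    start = index * tile_size - tile_overlap
    stop = (index + 1) * tile_size + tile_overlap
    return start, stop


rows, cols = tile_grid(30000, 30000, 512)
print(rows, "x", cols, "tiles")
print(tile_window(0, 512, 64), tile_window(1, 512, 64))
```

Under these assumptions, adjacent windows share 2 * tile_overlap pixels, and every window has the same length, tile_size + 2 * tile_overlap.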
        

Write tiles for all train images and verify they were written to file.

In [None]:
# Tile and save train image
input_dir = "/kaggle/input/hubmap-kidney-segmentation/train/"
# ### output_dir = '/kaggle/working/train/'
!mkdir /kaggle/tmp
output_dir = '/kaggle/tmp/train/'
tile_and_build_tfr( input_dir, output_dir, do_mask = True, do_normalize = P['DO_NORMALIZE'] )
!ls -l /kaggle/tmp/

In [None]:
!ls -l /kaggle/tmp/train/*.tfrec
!wc

In [None]:
! kaggle datasets list --mine

In [None]:
! ls -al ~

In [None]:
import os

def kaggle_authenticate( username, key ):
    kaggle_data = {"username":username,"key":key}
    os.environ['KAGGLE_USERNAME']=kaggle_data["username"]
    os.environ['KAGGLE_KEY']=kaggle_data["key"]
    !kaggle

username = "markalavin"
key = "<your-kaggle-api-key>"  # Get this by [Create New Key] under <user>/Accounts; never publish a real key in a notebook
# ### kaggle_authenticate( username, key )
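
An alternative to typing the key into a cell is to read both values from the environment variables that kaggle_authenticate sets. A minimal sketch, assuming KAGGLE_USERNAME and KAGGLE_KEY were exported before the notebook started; `load_kaggle_credentials` is a hypothetical helper and the demo values are placeholders:

```python
import os


def load_kaggle_credentials(env=None):
    """Return (username, key) from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    username = env.get("KAGGLE_USERNAME")
    key = env.get("KAGGLE_KEY")
    if not username or not key:
        raise RuntimeError("Set KAGGLE_USERNAME and KAGGLE_KEY before uploading")
    return username, key


# Demonstration with a stand-in mapping; these are placeholders, not real credentials
demo_env = {"KAGGLE_USERNAME": "some-user", "KAGGLE_KEY": "not-a-real-key"}
print(load_kaggle_credentials(demo_env)[0])
```

Failing early with a clear message keeps a missing key from surfacing later as an upload error.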

In [None]:
!kaggle datasets create -h

In [None]:
!kaggle datasets init -p /kaggle/tmp/train/

In [None]:
import json

def create_dataset_metadata( title, slug, licenses, path ):
    '''
    Write a metadata file for uploading a data set; the file goes into
    the folder "path" as dataset-metadata.json, and the DataSet is then
    created from /kaggle/tmp/train.  "title" is the print
    name of the DataSet.  "slug" is "<username>/" followed by a version of "title" with -'s
    replacing blanks.  "licenses" is a list of license dicts, e.g. [ { "name":"CC0-1.0" } ]
    '''
    metadata = { "title":title,"id":slug,"licenses":licenses }
    with open( path + "dataset-metadata.json", "w") as file:
        file.write( json.dumps( metadata ) )
    !ls /kaggle/tmp/train
    !cat /kaggle/tmp/train/dataset-metadata.json
    !kaggle datasets create -p /kaggle/tmp/train -t

title = "new 512x512x64 no normalize no augment"
slug = "markalavin/new-512x512x64-no-normalize-no-augment"
licenses = [ { "name":"CC0-1.0"}]
path = "/kaggle/tmp/train/"
# ### create_dataset_metadata( title, slug, licenses, path )
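
The slug can also be derived from the title instead of being typed twice. A minimal sketch, assuming the slug rule described in the docstring (blanks replaced by -'s, here also lower-cased); Kaggle may apply further rules that are not modeled. `make_dataset_metadata` is a hypothetical helper, and the file is written to a temporary folder rather than to /kaggle/tmp/train/:

```python
import json
import os
import tempfile


def make_dataset_metadata(title, username, licenses):
    """Build the metadata dict, deriving '<username>/<slug>' from the title."""
    slug = title.lower().replace(" ", "-")
    return {"title": title, "id": f"{username}/{slug}", "licenses": licenses}


metadata = make_dataset_metadata("new 512x512x64 no normalize no augment",
                                 "markalavin", [{"name": "CC0-1.0"}])

# Write the file to a throwaway folder and read it back to check the round trip
with tempfile.TemporaryDirectory() as folder:
    metadata_path = os.path.join(folder, "dataset-metadata.json")
    with open(metadata_path, "w") as file:
        json.dump(metadata, file, indent=2)
    with open(metadata_path) as file:
        written = json.load(file)

print(written["id"])
```

Using os.path.join avoids depending on a trailing slash in the folder name, which the string concatenation in create_dataset_metadata requires.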