# ***Disclaimer:*** 
Hello Kagglers! I am a Solution Architect with the Google Cloud Platform and a coach for this competition. My contributions focus on helping users leverage GCP components (GCS, TPUs, BigQuery, etc.) to solve large problems. My ideas and contributions represent my own opinion and are not an official recommendation by Google. I also try to develop notebooks quickly in order to help users early in competitions, so there may be better ways to solve particular problems; I welcome comments and suggestions. Use my contributions at your own risk. I don't guarantee that they will help you win any competition, but I hope to learn by collaborating with everyone.

# Objective

The main objective of this Notebook is to show how to use a Dataset that we previously created using this Notebook:
["HubMAP: Read data and build TFRecords"](https://www.kaggle.com/marcosnovaes/hubmap-read-data-and-build-tfrecords)

I made this Dataset public; it is called ["hubmap-train-test"](https://www.kaggle.com/marcosnovaes/hubmap-tfrecord-512).

This Dataset was built by tiling each image into 512x512 tiles and then converting them to TFRecord files. 

This Notebook shows how to load the TFRecords directly from Google Cloud Storage (GCS). This is the way in which you can pass the Dataset to a TPU accelerator, as will be shown in the next Notebook.

NOTICE: In order for this Notebook to work, you need to link this Notebook to a GCP project using "Add Ons-->Google Cloud SDK", because I am reading directly from GCS to illustrate how it works. But you can modify the notebook to run from the local dataset as well.


In [None]:
%matplotlib inline

#import cv2
import json
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

#import shutil

import tensorflow as tf

import glob
#import tifffile
import gc

In [None]:
from kaggle_datasets import KaggleDatasets
from kaggle_secrets import UserSecretsClient

After importing the Kaggle Dataset  ["hubmap-train-test"](https://www.kaggle.com/marcosnovaes/hubmap-tfrecord-512), you should see the following in the input folder.


In [None]:
!ls /kaggle/input

This is the local name of the dataset: "hubmap-tfrecord-512". We will use this name later to find out its location in GCS.

Let's inspect the dataset structure by browsing the local file system.

In [None]:
!ls /kaggle/input/hubmap-tfrecord-512/train/

In [None]:
!ls -l /kaggle/input/hubmap-tfrecord-512/train/2f6ecfcdf/col0

In [None]:
import glob 

file_list = glob.glob('/kaggle/input/hubmap-tfrecord-512/train/*.csv')
file_list

The CSV files contain the metadata of each tile, such as its tile row and column numbers, which are used to calculate the height and width offsets.

In [None]:
import pandas as pd

train_df = pd.read_csv(file_list[0])
train_df.head()

In [None]:
list_img_id = train_df.loc[0]['img_id']
list_img_id
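
The offset calculation mentioned above can be sketched in a few lines. This is only an illustration: it assumes that tiles do not overlap, that row and column numbers start at 0, and that offsets are measured in pixels from the top-left corner of the full image. The names `TILE_SIZE` and `tile_pixel_offsets` are hypothetical and are not part of the dataset.

```python
TILE_SIZE = 512  # assumption: tiles are 512x512 pixels and do not overlap


def tile_pixel_offsets(tile_row_num, tile_col_num, tile_size=TILE_SIZE):
    """Return (height_offset, width_offset) of a tile's top-left pixel."""
    if tile_row_num < 0 or tile_col_num < 0:
        raise ValueError("tile numbers must be non-negative")
    # rows advance along the image height, columns along the image width
    return tile_row_num * tile_size, tile_col_num * tile_size


# tile in row 20, column 8 of the tile grid
print(tile_pixel_offsets(20, 8))
```

With `tile_row_num` and `tile_col_num` read from the CSV, the same call gives the position of any tile in the original image, provided the assumptions above hold.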

To read TFRecords from GCS with the TensorFlow library, we must pass our user credentials to TensorFlow, as shown below.

NOTICE: In order for this code to work, you need to link this Notebook to a GCP project using "Add Ons-->Google Cloud SDK"

In [None]:
user_secrets = UserSecretsClient()
user_credential = user_secrets.get_gcloud_credential()
user_secrets.set_tensorflow_credential(user_credential)

Now let's find out the bucket name where the dataset is stored. This information will allow us to access the dataset directly from GCS, which is the way the TPU accesses it. 

In [None]:
#from kaggle_datasets import KaggleDatasets
GCS_PATH = KaggleDatasets().get_gcs_path('hubmap-tfrecord-512')
GCS_PATH

In [None]:
# strip the first 5 chars, that is the "gs://" prefix
bucket_name = GCS_PATH[5:]
bucket_name
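
Slicing off the first five characters works as long as the string really starts with `gs://`. A slightly more defensive variant is sketched here; `strip_gs_prefix` is a hypothetical helper and `kds-example-bucket` is a made-up bucket name used only for illustration.

```python
def strip_gs_prefix(gcs_uri):
    """Return the part of a GCS URI that follows the 'gs://' scheme."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        # fail loudly instead of silently cutting 5 characters off a local path
        raise ValueError(f"not a GCS URI: {gcs_uri!r}")
    return gcs_uri[len(prefix):]


# 'kds-example-bucket' is a placeholder, not the real bucket of the dataset
print(strip_gs_prefix("gs://kds-example-bucket"))
```

The check is useful if the notebook is later modified to run from the local dataset, where a path such as `/kaggle/input/...` would otherwise be truncated without warning.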


The functions below build the complete GCS path and the local path for a given tile, identified by its img_id, tile column and tile row.

In [None]:
# tiles are stored as <subdir>/<img_id>/col<C>/col<C>_row<R>.tfrecords,
# so the tile column comes before the tile row in the argument list
def get_tile_gcs_path( bucket_name, subdir, img_id, tile_col, tile_row):
    return 'gs://{}/{}/{}/col{}/col{}_row{}.tfrecords'.format(bucket_name, subdir, img_id, tile_col, tile_col, tile_row)

def get_tile_local_path( subdir, img_id, tile_col, tile_row):
    return '/kaggle/input/hubmap-tfrecord-512/{}/{}/col{}/col{}_row{}.tfrecords'.format(subdir, img_id, tile_col, tile_col, tile_row)

In [None]:
sample_tile_path = get_tile_gcs_path( bucket_name ,'train','2f6ecfcdf',8,20)
sample_tile_path

In [None]:
local_tile_path = get_tile_local_path( 'train','2f6ecfcdf',8,20)
local_tile_path
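
Going the other way can also be handy, for example to recover the tile position from a file name. The sketch below parses the `col{}/col{}_row{}.tfrecords` layout produced by the two path helpers above. `parse_tile_path` is a hypothetical helper, not part of the dataset or of any library.

```python
import re

# Matches '<root>/<subdir>/<img_id>/col<C>/col<C>_row<R>.tfrecords'
_TILE_PATH_RE = re.compile(
    r"^(?P<root>.*)/(?P<subdir>[^/]+)/(?P<img_id>[^/]+)"
    r"/col(?P<dir_col>\d+)/col(?P<col>\d+)_row(?P<row>\d+)\.tfrecords$"
)


def parse_tile_path(path):
    """Split a tile path into its subdir, image id, tile column and tile row."""
    match = _TILE_PATH_RE.match(path)
    # the column appears twice in the path; both occurrences must agree
    if match is None or match.group("dir_col") != match.group("col"):
        raise ValueError(f"unexpected tile path: {path!r}")
    return {
        "subdir": match.group("subdir"),
        "img_id": match.group("img_id"),
        "tile_col": int(match.group("col")),
        "tile_row": int(match.group("row")),
    }


info = parse_tile_path(
    "/kaggle/input/hubmap-tfrecord-512/train/2f6ecfcdf/col8/col8_row20.tfrecords"
)
print(info)
```

The same pattern accepts `gs://` paths, since everything before the subdirectory is treated as an opaque root.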

In [None]:
# read back a record to make sure that the decoding works
# Create a dictionary describing the features.
image_feature_description = {
    'img_index': tf.io.FixedLenFeature([], tf.int64),
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'num_channels': tf.io.FixedLenFeature([], tf.int64),
    'img_bytes': tf.io.FixedLenFeature([], tf.string),
    'mask': tf.io.FixedLenFeature([], tf.string),
    'tile_id': tf.io.FixedLenFeature([], tf.int64),
    'tile_col_pos': tf.io.FixedLenFeature([], tf.int64),
    'tile_row_pos': tf.io.FixedLenFeature([], tf.int64),
}

def _parse_image_function(example_proto):
    # Parse the input tf.Example proto using the dictionary above.
    single_example = tf.io.parse_single_example(example_proto, image_feature_description)
    img_index = single_example['img_index']
    img_height = single_example['height']
    img_width = single_example['width']
    num_channels = single_example['num_channels']
    
    img_bytes =  tf.io.decode_raw(single_example['img_bytes'],out_type='uint8')
   
    img_array = tf.reshape( img_bytes, (img_height, img_width, num_channels))
   
    mask_bytes =  tf.io.decode_raw(single_example['mask'],out_type='bool')
    
    mask = tf.reshape(mask_bytes, (img_height,img_width))
    mtd = dict()
    mtd['img_index'] = single_example['img_index']
    mtd['width'] = single_example['width']
    mtd['height'] = single_example['height']
    mtd['tile_id'] = single_example['tile_id']
    mtd['tile_col_pos'] = single_example['tile_col_pos']
    mtd['tile_row_pos'] = single_example['tile_row_pos']
    struct = {
        'img_array': img_array,
        'mask': mask,
        'mtd': mtd
    } 
    return struct

def read_tf_dataset(storage_file_path):
    encoded_image_dataset = tf.data.TFRecordDataset(storage_file_path, compression_type="GZIP")
    parsed_image_dataset = encoded_image_dataset.map(_parse_image_function)
    return parsed_image_dataset
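
The `tf.reshape` calls in the parser only work if the number of decoded bytes equals `height * width * num_channels`, and they interpret the flat buffer in row-major order. The plain-Python sketch below illustrates that mapping without TensorFlow. The helper names are hypothetical, and the row-major layout of the stored bytes is an assumption about how the tiles were written.

```python
def expected_num_bytes(height, width, num_channels):
    """Number of uint8 values needed for a (height, width, num_channels) image."""
    return height * width * num_channels


def flat_index(row, col, channel, width, num_channels):
    """Position of pixel (row, col, channel) in a row-major flat buffer."""
    return (row * width + col) * num_channels + channel


# a 512x512 tile with 3 channels
print(expected_num_bytes(512, 512, 3))
# first value of the second pixel in the first row
print(flat_index(0, 1, 0, width=512, num_channels=3))
```

If a record's byte count does not match the expected value, the reshape fails, which usually points to a wrong `height`, `width` or `num_channels` in the record.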


In [None]:
sample_tile_path

In [None]:

ds = read_tf_dataset(sample_tile_path)

for struct in ds.as_numpy_iterator():
    #struct = g_dataset.get_next()
    img_mtd = struct["mtd"]
    img_array  = struct["img_array"]
    img_mask = struct["mask"]
 
    fig, ax = plt.subplots(1,2,figsize=(20,3))
    ax[0].set_title("Tile ID = {} Xpos = {} Ypos = {}".format(img_mtd['tile_id'], img_mtd['tile_col_pos'],img_mtd['tile_row_pos']))
    ax[0].imshow(img_array)
    #ax[1].set_title("Pixelarray distribution");
    #sns.distplot(img_array.flatten(), ax=ax[1]);
    ax[1].imshow(img_mask)

So, the dataset provides all the competition images as 512 x 512 tiles. The CSV files provide useful metadata for each tile. This metadata can be used to build datasets with specific distributions, for example 50% of tiles with gloms and 50% without. As an example, here is how to build a dataset of all tiles with gloms in the first image:

In [None]:
# build a dataset of all image tiles from image 0 that have gloms in them
#for csv_file in file_list:
csv_file = file_list[0]
tiles_df = pd.read_csv(csv_file)
gloms_df = tiles_df.loc[tiles_df["mask_density"]  > 0]

gloms_df.head()

In [None]:
len(gloms_df)
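
A 50% gloms / 50% no gloms split like the one mentioned above could be drawn as follows. This is a minimal sketch on plain dictionaries with synthetic rows. The only column name taken from the CSV files is `mask_density`; `balanced_sample` is a hypothetical helper.

```python
import random


def balanced_sample(rows, n_per_class, seed=0):
    """Pick n_per_class tiles with gloms and n_per_class tiles without."""
    gloms = [r for r in rows if float(r["mask_density"]) > 0]
    no_gloms = [r for r in rows if float(r["mask_density"]) <= 0]
    if min(len(gloms), len(no_gloms)) < n_per_class:
        raise ValueError("not enough tiles in one of the two classes")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    picked = rng.sample(gloms, n_per_class) + rng.sample(no_gloms, n_per_class)
    rng.shuffle(picked)  # avoid having all glom tiles first
    return picked


# synthetic metadata: every third tile contains a glom
rows = [{"tile_id": i, "mask_density": 5 if i % 3 == 0 else 0} for i in range(12)]
sample = balanced_sample(rows, n_per_class=3)
print([r["tile_id"] for r in sample])
```

With the real metadata, the rows could come from `pd.read_csv(csv_file).to_dict("records")`.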

We have kept only the tiles with gloms. Now we build a list of tile paths from the data in the dataframe; that list is basically our dataset. When you build a dataset using tf.data.TFRecordDataset you only need to pass a list of files. You can pass local file paths or paths to files that reside in GCS, which are URIs starting with "gs://". This allows for very efficient data transfer, and it is the way in which a TPU will be able to read your files, as illustrated in a future notebook. We build this list as below:

In [None]:
tile_path_array = []
num_tiles = len(gloms_df)
for index in range(num_tiles) :
    dataset_name = 'train'
    img_id = gloms_df.iloc[index]['img_id']
    col_offset = gloms_df.iloc[index]['tile_col_num']
    row_offset = gloms_df.iloc[index]['tile_row_num']
    tile_path = get_tile_gcs_path( bucket_name ,dataset_name,img_id,col_offset,row_offset)
    tile_path_array.append(tile_path)

Now we will build a dataset using only the first five paths, just for testing. Notice that the TFRecords will be read directly from GCS.

In [None]:
# check the first 5 paths to see if they are correct
tile_path_array[0:5]

In [None]:
# read the dataset passing the first 5 file names
ds = read_tf_dataset(tile_path_array[0:5])
# read 5 images and masks using the numpy iterator
glom_mtd = []
glom_tiles = []
glom_masks = []

for struct in ds.as_numpy_iterator():
    img_mtd = struct["mtd"]
    glom_mtd.append(img_mtd)
    img_array  = struct["img_array"]
    glom_tiles.append(img_array)
    img_mask = struct["mask"]
    glom_masks.append(img_mask)
    
    

Let's plot the 5 tiles and corresponding masks, to verify that indeed we built a dataset with only tiles that contain gloms. 

In [None]:
#plot the 5 glom tiles from the dataset
fig, ax = plt.subplots(5,2,figsize=(10,20))
for i in range(5):
    # left column: image tile, right column: its mask
    ax[i][0].set_title("Tile ID = {} Xpos = {} Ypos = {}".format(glom_mtd[i]['tile_id'], glom_mtd[i]['tile_col_pos'],glom_mtd[i]['tile_row_pos']))
    ax[i][0].imshow(glom_tiles[i])
    ax[i][1].imshow(glom_masks[i])

So, now you can build a list of GCS file paths using the tile metadata in the CSV files. You can build any filter you want; for example, you can avoid pure black or white tiles by excluding those with lowband_density < 100. In my next notebook I will build a balanced set of tiles with and without gloms to feed a UNET model using TPUs.

In order to save all the hard work done here, let's generate dataframes that contain the local and GCS file paths for all tiles and include them with the dataset. That way the paths for each tile will be readily available for either local or remote access. I will then update the dataset with the CSV files generated below.
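
A filter of this kind can be written as a small predicate over the metadata rows. The sketch below uses synthetic rows. The column names `lowband_density` and `mask_density` come from the CSV files, while `keep_tile`, its parameters and the sample values are hypothetical.

```python
def keep_tile(row, min_lowband_density=100, require_glom=False):
    """Decide whether a tile should be part of the dataset."""
    # drop tiles whose lowband_density is below the threshold
    if float(row["lowband_density"]) < min_lowband_density:
        return False
    # optionally keep only tiles whose mask is not empty
    if require_glom and float(row["mask_density"]) <= 0:
        return False
    return True


# made-up values, only to exercise the filter
rows = [
    {"tile_id": 0, "lowband_density": 0, "mask_density": 0},
    {"tile_id": 1, "lowband_density": 5000, "mask_density": 0},
    {"tile_id": 2, "lowband_density": 8000, "mask_density": 1200},
]
kept = [r["tile_id"] for r in rows if keep_tile(r)]
print(kept)
```

Keeping the rule in one function makes it easy to try different thresholds before building the file list.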

In [None]:
# create an uber list of all tiles for all images. Include local and cloud paths for each tile
def create_dataset_file_info( dataset_name):
    file_list = glob.glob('/kaggle/input/hubmap-tfrecord-512/{}/*.csv'.format(dataset_name))
    columns = ['img_id','tile_id','tile_row_index', 'tile_col_index', 'lowband_density', 'mask_density','local_path','gcs_path']
    rows = []
    for file_name in file_list:
        tile_df = pd.read_csv(file_name)
        for index in range(len(tile_df)):
            tile = tile_df.iloc[index]
            img_id = tile['img_id']
            tile_col_num = tile['tile_col_num']
            tile_row_num = tile['tile_row_num']
            # the path helpers take the tile column first, then the tile row
            rows.append({'img_id':img_id, 'tile_id':tile['tile_id'],
                         'tile_row_index':tile_row_num, 'tile_col_index':tile_col_num,
                         'lowband_density':tile['lowband_density'], 'mask_density':tile['mask_density'],
                         'local_path':get_tile_local_path(dataset_name, img_id, tile_col_num, tile_row_num),
                         'gcs_path':get_tile_gcs_path(bucket_name, dataset_name, img_id, tile_col_num, tile_row_num)})
    # build the dataframe once from the collected rows (DataFrame.append is not available in pandas 2.x)
    uber_tile_df = pd.DataFrame(rows, columns=columns)
    #write the dataframe
    print('writing tile metadata for dataset {}'.format(dataset_name))
    output_file_name = '/kaggle/working/'+dataset_name+'_all_tiles.csv'
    uber_tile_df.to_csv(output_file_name)
    return uber_tile_df

In [None]:
uber_tile_df = create_dataset_file_info('test')

In [None]:
uber_tile_df.head()

In [None]:
uber_train_tile_df = create_dataset_file_info('train')

In [None]:
uber_train_tile_df.head()
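
Once both path columns are stored in the CSV, switching between local and remote access is a matter of picking a column. The sketch below reads a small in-memory CSV with the standard library. `select_tile_paths` is a hypothetical helper, and the bucket name in the sample row is a placeholder.

```python
import csv
import io


def select_tile_paths(csv_file, use_gcs):
    """Return the GCS or the local path of every tile listed in the CSV."""
    column = "gcs_path" if use_gcs else "local_path"
    return [row[column] for row in csv.DictReader(csv_file)]


# one made-up row with the two path columns written above
sample_csv = io.StringIO(
    "img_id,local_path,gcs_path\n"
    "2f6ecfcdf,"
    "/kaggle/input/hubmap-tfrecord-512/train/2f6ecfcdf/col8/col8_row20.tfrecords,"
    "gs://kds-example-bucket/train/2f6ecfcdf/col8/col8_row20.tfrecords\n"
)
print(select_tile_paths(sample_csv, use_gcs=True))
```

The resulting list can be passed directly to `read_tf_dataset`, since `tf.data.TFRecordDataset` accepts a list of file names.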