# ***Disclaimer:*** 
Hello Kagglers! I am a Solution Architect with the Google Cloud Platform. I am not an insider to this competition, so I am allowed to contribute and even compete, although I cannot collect prizes. The focus of my contributions is on helping users leverage GCP components (GCS, TPUs, BigQuery, etc.) to solve large problems. My ideas and contributions represent my own opinion and are not an official recommendation by Google. Also, I try to develop notebooks quickly in order to help users early in competitions. There may be better ways to solve particular problems, and I welcome comments and suggestions. Use my contributions at your own risk; I don't guarantee that they will help you win any competition, but I hope to learn by collaborating with everyone.



# Note: 
This is a very early version of this Notebook. I got to the point where it uses distributed training on a very large dataset using TPUs, so I wanted to share it early to help folks dealing with the huge size of the RSNA-STR competition dataset. Check later versions for more comments and explanations; in particular, at this point I am not checkpointing or saving the trained model.

This notebook uses a sample TFRecords dataset which I made public so that everyone can try this example quickly. I am also publishing the notebook I used to create the dataset, which you can quickly modify to suit your needs. Here are the references for the input dataset and the Notebooks that explain how it was built:

Input Dataset: [RSNA_PE_WINDOW_MIXED](https://www.kaggle.com/marcosnovaes/rsna-pe-window-mixed): Contains 10000 images from the competition dataset, 50% PE positive and 50% PE negative, labeled 1 for positive PE and 0 for negative. 

Notebook: [Building a TF Record Dataset](https://www.kaggle.com/marcosnovaes/building-a-tfrecord-dataset) : Explains how the TFRecord dataset was built

Notebook: [Managing Large Datasets with TFRecords](https://www.kaggle.com/marcosnovaes/managing-large-datasets-with-tfrecords-on-gcp) : Explains in detail how to build datasets with TFRecords

**Objective**
This notebook explains how to deal with large datasets and how to leverage distributed training with TPUs. It can run using either GPUs or TPUs. If you are using GPUs you will have to limit the size of the dataset or you will run out of memory. With TPUs, however, the Notebook can process all 10000 images at full resolution to train a deep network. 

**Setup**
1. This Notebook uses both Google Cloud Services and the Google Cloud SDK. Make sure to enable both by linking your GCP project using the Menu "Add-ons-->Google Cloud SDK" and "Add-ons-->Google Cloud Services".
2. This Notebook must be configured with an accelerator. Add either a GPU or a TPU (works much better on TPUs)
3. Add the sample TF Record dataset RSNA_PE_WINDOW_MIXED to the Notebook by using the "Add" button in the input folder of this Notebook

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.ndimage

from os import listdir, mkdir
import os
import time

import tensorflow as tf

from kaggle_datasets import KaggleDatasets
from kaggle_secrets import UserSecretsClient

# Import Keras from tf.keras only, to avoid mixing the standalone keras package with tf.keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Reshape
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Conv2DTranspose
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import Dropout
from tensorflow.keras import layers


In [None]:
print(tf.__version__)

In [None]:
# If you are using TPUs, execute this cell and skip the next cell

# Use the cluster resolver to communicate with the TPU

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

strategy = tf.distribute.experimental.TPUStrategy(tpu)

In [None]:
worker_list=tpu.cluster_spec()
#print(vars(worker_list))
print(worker_list._cluster_spec["worker"][0])


In [None]:
!echo $TPU_NAME

In [None]:
# If you are using a GPU, remove the comment and execute this cell as opposed to the previous cell
#strategy = tf.distribute.MirroredStrategy()

Check how many cores we have. For TPUs we should see 8 cores, but only one for GPUs. The data will be distributed to all the cores, so we will use the number of cores to calculate the batch size for each instance

In [None]:
strategy.num_replicas_in_sync

You must provide the name of a GCP project in this cell. This is just used to set up the storage API, which will be used to list the files that are stored in the dataset. YOU MUST CHANGE THE "YOUR_PROJECT_ID" TO POINT TO THE PROJECT YOU LINKED TO THIS NOTEBOOK USING THE MENU Add Ons-->Google Cloud Services

In [None]:
# Set your own project id here
YOUR_PROJECT_ID = 'your_project_ID_here'

from google.cloud import bigquery
bigquery_client = bigquery.Client(project=YOUR_PROJECT_ID)
from google.cloud import storage
storage_client = storage.Client(project=YOUR_PROJECT_ID)

Utility functions to interact with the Storage API

In [None]:
def create_bucket(dataset_name):
    """Creates a new bucket. https://cloud.google.com/storage/docs/ """
    bucket = storage_client.create_bucket(dataset_name)
    print('Bucket {} created'.format(bucket.name))

def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket. https://cloud.google.com/storage/docs/ """
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print('File {} uploaded to {}.'.format(
        source_file_name,
        destination_blob_name))
    
def list_blobs(bucket_name):
    """Lists all the blobs in the bucket. https://cloud.google.com/storage/docs/"""
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)
        
def get_blob_names(bucket_name):
    """Returns an iterator over all the blobs in the bucket. https://cloud.google.com/storage/docs/"""
    blobs = storage_client.list_blobs(bucket_name)
    return blobs
        
def download_to_kaggle(bucket_name,destination_directory,file_name):
    """Takes the data from your GCS Bucket and puts it into the working directory of your Kaggle notebook"""
    os.makedirs(destination_directory, exist_ok = True)
    full_file_path = os.path.join(destination_directory, file_name)
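    # Download only the requested blob; looping over every blob would
    # overwrite the same local file again and again.
    bucket = storage_client.get_bucket(bucket_name)
    bucket.blob(file_name).download_to_filename(full_file_path)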

Code to read and decode TFRecords. This is the same code explained in the previous notebooks (see the note at the top of this Notebook), with one change: the CT Window function is now applied as part of the decoding function. I had to change the CT Window function to use the TensorFlow math package (clip, min and max) instead of NumPy. NumPy does not operate on Tensors, so every computation that runs on a TPU must use the TensorFlow equivalent. 

TODO on this version: Note that I commented out decoding the string values of the dataset (study ID and image name). TPUs do not support passing string types in the dataset. It is still helpful to pass an identifier in case we need other image attributes during training. This will be shown in a future version.

In [None]:
# Define the TFExample Data type for training models
# Our TFRecord format will include the CT Image and metadata of the image, including the prediction label (is PE present)

# Utilities to serialize data into a TFRecord
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# Create a dictionary describing the features.
image_feature_description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
    'study_id': tf.io.FixedLenFeature([], tf.string),
    'img_name': tf.io.FixedLenFeature([], tf.string),
    'pred_label': tf.io.FixedLenFeature([], tf.int64)
}

PE_WINDOW_LEVEL = 100
PE_WINDOW_WIDTH = 700

def _parse_image_function(example_proto):
  # Parse the input tf.Example proto using the dictionary above.
    single_example = tf.io.parse_single_example(example_proto, image_feature_description)
    img_height = single_example['height']
    img_width = single_example['width']
    img_bytes = tf.io.decode_raw(single_example['image_raw'],out_type='float64')
    resized_image = tf.reshape(img_bytes, (img_height,img_width))
    windowed_image = CT_window(resized_image, PE_WINDOW_LEVEL,PE_WINDOW_WIDTH )
    sample_image = tf.reshape(windowed_image, (img_height,img_width,1))
    mtd = dict()
    mtd['width'] = single_example['width']
    mtd['height'] = single_example['height']
    #mtd['study_id'] = tf.io.decode_base64(single_example['study_id'])
    #mtd['img_name'] = tf.io.decode_base64(single_example['img_name'])
    mtd['pred_label'] = single_example['pred_label']
    struct = {
    'img': sample_image,
    'img_mtd': mtd
    } 
    return struct


def read_tf_dataset(storage_file_path):
    encoded_image_dataset = tf.data.TFRecordDataset(storage_file_path, compression_type="GZIP")
    record_structs = encoded_image_dataset.map(_parse_image_function)
    return record_structs

def CT_window(img, WL=50, WW=350):
    upper, lower = WL+WW//2, WL-WW//2
    X = tf.clip_by_value(img, lower, upper)
    X = X - tf.math.reduce_min(X)
    X = X / tf.math.reduce_max(X)
    return X
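
To make the windowing arithmetic concrete, here is a minimal plain-Python sketch of the same steps (clip to the window, shift to zero, scale by the maximum). It is only an illustration that runs without TensorFlow: the function name `ct_window`, the sample values, and the guard for a constant image are my own assumptions and are not part of the decoding code above.

```python
def ct_window(pixels, level=100, width=700):
    """Clip pixel values to a CT window, then rescale them to the range [0, 1]."""
    upper = level + width // 2   # 100 + 350 = 450 for the PE window
    lower = level - width // 2   # 100 - 350 = -250 for the PE window
    clipped = [min(max(p, lower), upper) for p in pixels]
    low = min(clipped)
    shifted = [p - low for p in clipped]
    high = max(shifted)
    # Guard for a constant image (an addition for this sketch, not in the notebook code).
    if high == 0:
        return [0.0 for _ in shifted]
    return [p / high for p in shifted]


# Hypothetical pixel values, chosen to fall below, inside, and above the window.
sample = [-1000, -250, 100, 450, 2000]
windowed = ct_window(sample)
print(windowed)  # [0.0, 0.0, 0.5, 1.0, 1.0]
```

Note that, like the TensorFlow version, this rescales by the minimum and maximum of the clipped image rather than by the window bounds, so an image that does not span the whole window is still stretched to [0, 1].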

Important Globals:
1. BUFFER_SIZE = Used for shuffling the dataset (see tf.data.Dataset.shuffle())
1. GLOBAL_BATCH_SIZE = how many images are in a training step. We will distribute them among the cores (see tf.data.Dataset.batch())
1. BATCH_SIZE_PER_REPLICA = how many images each core will process in one training step

In [None]:
## GLOBALS AND CONSTANTS
BUFFER_SIZE=120
GLOBAL_BATCH_SIZE=1024
NUM_REPLICAS=strategy.num_replicas_in_sync
BATCH_SIZE_PER_REPLICA = GLOBAL_BATCH_SIZE // NUM_REPLICAS

EPOCHS=3
LEARNING_RATE=0.00005
BETA_1=0.1

# provide a file name where checkpoints will be stored.
experiment_number = '2'
# setup a bucket for saving checkpoints
your_bucket_name = 'your_bucket_name_here'
checkpoint_path = 'gs://'+your_bucket_name+'/training_checkpoints/exp' + experiment_number
checkpoint_prefix  = checkpoint_path + '/_ckpt'


Grant Tensorflow permission to read the dataset. I ended up making my dataset public, but you can also use your own private dataset as long as you grant permissions as below. 

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
user_credential = user_secrets.get_gcloud_credential()
user_secrets.set_tensorflow_credential(user_credential)

Discover the GCS path of the public dataset 'rsna-pe-window-mixed'. To make this easier, import the dataset using the "Add" button in the input folder of the Notebook. Then look for your dataset spelled exactly as it appears in the input folder (note that it uses "-" as opposed to the "_" separator that I used when creating the dataset).

In [None]:
#from kaggle_datasets import KaggleDatasets
GCS_PATH = KaggleDatasets().get_gcs_path('rsna-pe-window-mixed')
GCS_PATH

To get the name of the bucket where the dataset resides, strip the "gs://" prefix

In [None]:
# get a list of directories
# strip the first 5 chars, that is the "gs://" prefix
bucket_name = GCS_PATH[5:]
bucket_name


We now make an array of filenames using the list_blobs function of the storage API. We put the "gs://" prefix back in the file names, as we will need it when we give the filenames to the TPU. I divided the 10,000 images into 200 files in 5 directories, so that each TFRecord file is around 20 MB. Dividing your dataset into smaller files allows the tf.data.Dataset object to do a better job of accessing them and managing its cache.

In [None]:
filenames = []
blobs = storage_client.list_blobs(bucket_name)
for blob in blobs:
    filenames.append('gs://{}/{}'.format(bucket_name,blob.name))
len(filenames)

In [None]:
filenames[0]
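
The path handling in the last few cells can be tried offline with a small sketch like the one below. The bucket name and blob names are hypothetical placeholders standing in for what `KaggleDatasets().get_gcs_path(...)` and `storage_client.list_blobs(...)` return, and the title-to-slug rule is only my assumption based on how this one dataset appears in the input folder.

```python
# Dataset title as I named it, and the slug as it appears in the input folder
# (assumption: lower case, with "_" replaced by "-").
dataset_title = "RSNA_PE_WINDOW_MIXED"
slug = dataset_title.lower().replace("_", "-")

# Hypothetical value; the real one comes from KaggleDatasets().get_gcs_path(slug).
gcs_path = "gs://example-kaggle-bucket"

# Strip the "gs://" prefix only if it is really there, instead of slicing blindly.
prefix = "gs://"
bucket_name = gcs_path[len(prefix):] if gcs_path.startswith(prefix) else gcs_path

# Hypothetical blob names standing in for storage_client.list_blobs(bucket_name).
blob_names = ["dir_0/part_000.tfrecords", "dir_0/part_001.tfrecords"]

# Put the prefix back, since TensorFlow needs full gs:// paths.
filenames = ["gs://{}/{}".format(bucket_name, name) for name in blob_names]
print(slug, bucket_name, filenames[0])
```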

Read the dataset locally for verification

In [None]:
g_dataset = read_tf_dataset(filenames)
subset = g_dataset.take(1)
test_image = []
for struct in subset.as_numpy_iterator():
    #struct = g_dataset.get_next()
    img_mtd = struct["img_mtd"]
    img_bytes = struct["img"]
    test_image = img_bytes.reshape(512,512)
    #print("img_name = {}, pred_label = {}, image_shape = {}".format(img_mtd["img_name"], img_mtd["pred_label"], img_bytes.shape))
    fig, ax = plt.subplots(1,2,figsize=(20,3))
    ax[0].set_title("PE Specific CT-scan")
    ax[0].imshow(test_image, cmap="bone")
    ax[1].set_title("Pixelarray distribution");
    sns.distplot(test_image.flatten(), ax=ax[1]);
    #for img_data in g_dataset["mtd"]:
    #    print("img_name = {}, pred_label = {}".format(img_data["img_name"], img_data["pred_label"]))

Here is the classification model. This example is adapted from the [DC-GAN Tutorial](https://www.tensorflow.org/tutorials/generative/dcgan). Each Conv2D layer downsamples the image by 2 (stride = 2). I am just using this model as an example; the focus of this Notebook is not model architecture. However, I find this model converges very quickly, so it looks like a good choice. 

In [None]:
## Define the Discriminator Network - This example is for a 512 x 512 image. 

def make_discriminator_model():
    model = tf.keras.Sequential()
    #model.add(layers.Conv2D(64, (5, 5), strides=(2), padding='same',input_shape=[512, 512, 1]))
    model.add(layers.Conv2D(64, (5, 5), strides=(2, 2), padding='same', input_shape=[512, 512, 1]))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.3))

    model.add(layers.Conv2D(128, (5, 5), strides=(2, 2), padding='same'))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.3))
    
    model.add(layers.Conv2D(256, (5, 5), strides=(2, 2), padding='same'))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.3))
    
    model.add(layers.Conv2D(512, (5, 5), strides=(2,2 ), padding='same'))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.3))

    model.add(layers.Flatten())
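    # Output a raw logit (no activation): the loss used below is built with
    # from_logits=True, and a ReLU here would clamp every logit to be non-negative.
    model.add(layers.Dense(1))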

    return model

The test image we loaded is of shape 512x512. To feed it into the model we need to add a batch dimension at position [0] and a channel dimension at position [3]

In [None]:
test_image.shape

In [None]:
test_img = test_image.reshape(1,512,512,1)

We can now test if the model works locally

In [None]:
discriminator = make_discriminator_model()
# Pass the test image through the untrained model to check that the shapes work.
# The score is not meaningful yet, because the weights are still random.
decision = discriminator(test_img)
print (decision)

Starting here, we will define some GLOBAL DISTRIBUTED variables and functions. By using the strategy.scope() context we make these functions and variables available to the TPUs. We will implement a [custom training loop as described in this tutorial](https://www.tensorflow.org/tutorials/distribute/custom_training). 

In [None]:
# define loss function
with strategy.scope():
    loss_object = tf.keras.losses.BinaryCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.NONE)

    def compute_loss(labels, predictions):
        per_example_loss = loss_object(labels, predictions)
        return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

In [None]:
# define loss metrics

with strategy.scope():
    test_loss = tf.keras.metrics.Mean(name='test_loss')

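    # The model emits a single logit per image, so binary accuracy with a threshold of 0 applies.
    train_accuracy = tf.keras.metrics.BinaryAccuracy(name='train_accuracy', threshold=0.0)
    test_accuracy = tf.keras.metrics.BinaryAccuracy(name='test_accuracy', threshold=0.0)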

Create a GLOBAL variable for the model. Note that the "discriminator" variable will be common for BOTH the CPU and TPU. This is how we can retrieve and save the trained model. Tensorflow takes care of all the serialization needed to make this happen, it is really an AWESOME feature.

In [None]:
# create global variables in the strategy scope
with strategy.scope():
    discriminator = make_discriminator_model()
    discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, beta_1=BETA_1, amsgrad=False)
    

The "train_step" function is the main function of the training loop. Note the "@tf.function" decorator. It activates the "AutoGraph" feature of TF 2.X and compiles the function for fast execution on an accelerator. Once the function runs on the accelerator it is hard to debug, because you do not have access to the print function. When developing, I first comment out the "@tf.function" to run locally, then I use @tf.function but still run locally on GPUs to test the compilation, and finally I switch to TPUs when everything works fine. 

In [None]:
with strategy.scope():
        
    def restore_models( gan_checkpoint, checkpoint_directory ):
        status = gan_checkpoint.restore(tf.train.latest_checkpoint(checkpoint_directory))
        return status

# Define Train Step
    @tf.function
    def train_step(tf_records):
        with tf.GradientTape() as disc_tape:
            images = tf_records["img"]
            reshaped_images = tf.reshape(images,(BATCH_SIZE_PER_REPLICA,512,512,1))
            labels = tf_records["img_mtd"]["pred_label"]
            labels = tf.dtypes.cast(labels, tf.float32)
            labels = tf.reshape(labels, (BATCH_SIZE_PER_REPLICA,1))
            model_output = discriminator(reshaped_images, training=True)
            loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True,reduction=tf.keras.losses.Reduction.NONE)
            per_example_loss = loss_object( labels, model_output)
            disc_loss = tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
           
        step_loss = 0.1 * disc_loss  
        gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
        discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.trainable_variables))
       
        return step_loss

Note how these variables are used in the train step above. 

In [None]:
GLOBAL_BATCH_SIZE

In [None]:
BATCH_SIZE_PER_REPLICA

Here is the call that executes the train step on all 8 cores of the TPU. This is done using strategy.run() followed by strategy.reduce(), which uses the "All Reduce" capability that was introduced for TPUs with TF version 2.X

In [None]:
with strategy.scope():
  # `run` replicates the provided computation and runs it
  # with the distributed input. Note the use of the ReduceOp.SUM in the strategy.reduce operation
    @tf.function
    def distributed_train_step(dataset_inputs):
        per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
        #test_loss = 0.5
        reduced_loss = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,axis=None)
        return reduced_loss
    
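    # NOTE: test_step is not defined in this version of the notebook,
    # so this helper is a placeholder and must not be called yet.
    #tf.function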
    def distributed_test_step(dataset_inputs):
        return strategy.run(test_step, args=(dataset_inputs,))
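
Why divide by GLOBAL_BATCH_SIZE on each replica and then reduce with SUM? A minimal plain-Python sketch of the arithmetic is below. The loss values and the replica count are made up for illustration; the per-replica formula (sum of per-example losses divided by the global batch size) is what `tf.nn.compute_average_loss` computes when no sample weights are given.

```python
GLOBAL_BATCH_SIZE = 8
NUM_REPLICAS = 4
per_replica_batch = GLOBAL_BATCH_SIZE // NUM_REPLICAS

# Hypothetical per-example losses for one global batch.
per_example_losses = [0.5, 1.5, 1.0, 2.0, 0.25, 0.75, 3.0, 1.0]

# Each replica only sees its own shard of the global batch.
shards = [per_example_losses[i:i + per_replica_batch]
          for i in range(0, GLOBAL_BATCH_SIZE, per_replica_batch)]

# Per replica: sum of the shard divided by the GLOBAL batch size.
replica_losses = [sum(shard) / GLOBAL_BATCH_SIZE for shard in shards]

# ReduceOp.SUM across replicas recovers the mean over the whole global batch.
reduced_loss = sum(replica_losses)
global_mean = sum(per_example_losses) / GLOBAL_BATCH_SIZE

# For contrast: averaging inside each replica and then summing overcounts.
naive_sum = sum(sum(shard) / len(shard) for shard in shards)

print(reduced_loss, global_mean, naive_sum)  # 1.25 1.25 5.0
```

The contrast value shows that taking the mean on each replica before a SUM reduction would inflate the loss by a factor of NUM_REPLICAS.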

    

The train loop is a local function that iterates over each epoch. An epoch is a run through the entire dataset. For testing purposes you can take just 32 records (see the commented-out "subset" lines), but then you have to set GLOBAL_BATCH_SIZE to 32 or less, because with drop_remainder=True a larger batch size would leave no complete batch. By default the Notebook uses GLOBAL_BATCH_SIZE = 1024, which means 128 images per core per step on the 8 TPU cores. If you are using a GPU, you have to go MUCH smaller or you will run out of memory.

In [None]:
def train_loop( num_epochs, input_dataset ):
        for epoch in range(num_epochs):
            total_loss = 0.0
            num_batches = 0.0
            start_time = time.time()
           
            # if you are testing, use a subset
            #subset = input_dataset.take(32)
            #train_dataset = subset.batch(GLOBAL_BATCH_SIZE, drop_remainder=True).cache()
            
            # when you are testing on a subset, comment out the line below; it runs an epoch on the whole 10,000 images.
            train_dataset = input_dataset.batch(GLOBAL_BATCH_SIZE, drop_remainder=True).cache()
            
            # Distribute the dataset among the several cores
            dist_train_dataset = strategy.experimental_distribute_dataset(train_dataset)
            for x in dist_train_dataset:
                total_loss += distributed_train_step(x)
                # print a * for every step of GLOBAL_BATCH_SIZE images
                print("*",end='')
                num_batches += 1
            train_loss = total_loss / num_batches
            end_time = time.time()
            elapsed_time = end_time - start_time 
            img_per_second = num_batches * GLOBAL_BATCH_SIZE /elapsed_time
            # print how many images/sec we processed
            print("training speed = {} images per second".format(img_per_second))
            
        return elapsed_time, train_loss
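
The batch arithmetic behind this loop can be checked with a few lines of plain Python. This is only a worked example: the image count, batch size and core count are the ones used in this Notebook, while the 128-second epoch time is an illustrative figure, not a measurement produced by this snippet.

```python
NUM_IMAGES = 10000
GLOBAL_BATCH_SIZE = 1024
NUM_REPLICAS = 8  # TPU cores

# Images handled by each core in one training step.
batch_size_per_replica = GLOBAL_BATCH_SIZE // NUM_REPLICAS

# With drop_remainder=True only complete global batches are used.
steps_per_epoch = NUM_IMAGES // GLOBAL_BATCH_SIZE
images_used = steps_per_epoch * GLOBAL_BATCH_SIZE
images_dropped = NUM_IMAGES - images_used

# Same formula as in train_loop: batches * GLOBAL_BATCH_SIZE / elapsed time.
elapsed_seconds = 128.0  # illustrative epoch time
images_per_second = images_used / elapsed_seconds

print(batch_size_per_replica, steps_per_epoch, images_used, images_dropped)  # 128 9 9216 784
print(images_per_second)  # 72.0
```

So with these settings each epoch runs 9 steps and skips the last 784 images, because they do not fill a complete global batch.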

The "train" function basically does some housekeeping for every iteration. It prints progress every "status_interval" epochs and, when check_option is 1, saves a checkpoint after each interval (not implemented yet).

In [None]:
# local function to drive training
def train( num_epochs, status_interval, check_option):

    num_processed = 0
    input_dataset = read_tf_dataset(filenames)
    while num_processed < num_epochs:
        print("Training Epoch #{}".format(num_processed))
        epoch_time, train_loss = train_loop(status_interval, input_dataset)
        template = ("Epoch {}, Loss: {}, elapsed time in epoch: {}")
        print (template.format(num_processed, train_loss, epoch_time))
        num_processed += status_interval
        
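        # NOTE: gan_checkpoint and gan_checkpoint_prefix are not defined in this version
        # (checkpointing is not implemented yet), so call train() with check_option=0 for now.
        if check_option == 1: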
            template = ("saving checkpoint: Epoch {}, Loss: {}, elapsed time in last epoch: {}")
            print (template.format(num_processed, train_loss, epoch_time))
            gan_checkpoint.save(gan_checkpoint_prefix)    
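
The control flow of the "train" function can be followed without a TPU using a sketch like the one below. Everything in it is a stand-in of my own: `train_loop_stub` fakes the loss instead of training, and the `save_checkpoint` callback is a hypothetical hook marking the place where a real checkpoint call would go once it is implemented.

```python
def train_loop_stub(num_epochs, start_epoch):
    """Stand-in for train_loop: returns a fake (elapsed_time, loss) for the last epoch run."""
    last_epoch = start_epoch + num_epochs - 1
    return 1.0, 1.0 / (last_epoch + 1)


def train_sketch(num_epochs, status_interval, save_checkpoint=None):
    """Runs epochs in chunks of status_interval and reports after each chunk."""
    history = []
    num_processed = 0
    while num_processed < num_epochs:
        epoch_time, loss = train_loop_stub(status_interval, num_processed)
        num_processed += status_interval
        history.append((num_processed, loss))
        # Hypothetical hook: a real implementation would save the model here.
        if save_checkpoint is not None:
            save_checkpoint(num_processed)
    return history


saved_at = []
history = train_sketch(4, 2, save_checkpoint=saved_at.append)
print(history)   # [(2, 0.5), (4, 0.25)]
print(saved_at)  # [2, 4]
```

One thing the sketch makes visible: if num_epochs is not a multiple of status_interval, the loop still runs whole chunks, so `train_sketch(3, 2)` processes 4 epochs rather than 3.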

The line below trains 3 epochs, printing an update after every epoch (second arg = 1) and not checkpointing (third arg = 0). I am still implementing checkpointing. For the 10,000 images an epoch takes about 128 seconds, not bad for full 512x512 images.

NOTE: the first time you execute the step function it will take some time to compile the needed functions. This is a one-time cost; the following steps will run faster because the code is already compiled. 

In [None]:
# Uncomment to train here; the profiling cell further below also calls train().
# train(EPOCHS, 1, 0)

In [None]:
!pip3 install --upgrade "tensorflow==2.3.0"

In [None]:
!pip3 install --upgrade "cloud-tpu-profiler==2.3.0"

In [None]:
!/opt/conda/bin/capture_tpu_profile --service_addr 10.0.0.2:8470 --logdir '/kaggle/working/' --duration_ms 50 --num_tracing_attempts 1  --verbosity -1

In [None]:
!which capture_tpu_profile

In [None]:
import multiprocessing 
import subprocess

import os 
  
def worker1(): 
    # printing process id 
    print("ID of process running worker1: {}".format(os.getpid())) 
    command = '/opt/conda/bin/capture_tpu_profile'
    flags = '--service_addr 10.0.0.2:8470 --logdir /kaggle/working/ --duration_ms 5000'
    #os.system("/opt/conda/bin/capture_tpu_profile/capture_tpu_profile --service_addr 10.0.0.2:8470 --logdir '/kaggle/working/' --duration_ms 5000 > /kaggle/working/cmd_output")
    with open('/kaggle/working/output.txt', 'w') as f:
        process = subprocess.run([command, '--service_addr', '10.0.0.2:8470', '--logdir', '/kaggle/working/','--duration_ms','500', '--num_tracing_attempts', '10'], stdout=f)
        
    
def worker2(): 
    # printing process id 
    print("ID of process running worker2: {}".format(os.getpid()))
    train(3,1,0)
  
if __name__ == "__main__": 
    # printing main program process id 
    print("ID of main process: {}".format(os.getpid())) 
  
    # creating processes 
    p1 = multiprocessing.Process(target=worker1) 
    #p2 = multiprocessing.Process(target=train, args=(1,1,0,)) 
  
    # starting processes 
    p1.start() 
    #p2.start()
    train(1,1,0)

    # process IDs 
    print("ID of process p1: {}".format(p1.pid)) 
    #print("ID of process p2: {}".format(p2.pid)) 
    
    # wait until processes are finished 
    p1.join() 
    #p2.join() 
    
    # both processes finished 
    print("Both processes finished execution!") 
  
    # check if processes are alive 
    print("Process p1 is alive: {}".format(p1.is_alive())) 
    #print("Process p2 is alive: {}".format(p2.is_alive()))

In [None]:

!cat  /kaggle/working/output.txt
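
The pattern used above (start a capture command in the background, train in the foreground, then wait for the capture to finish) can be tried with a stand-in command. This sketch uses `subprocess.Popen` directly instead of wrapping `subprocess.run` in a `multiprocessing.Process`, which is a simplification on my part; the helper name, the stand-in command and the dummy workload are all hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_with_background_command(command, work, log_path):
    """Starts command in the background, runs work() in the foreground, then waits."""
    with open(log_path, "w") as log_file:
        process = subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)
        try:
            result = work()
        finally:
            # Always wait, so the log file is complete before it is read.
            process.wait()
    return result, process.returncode


# Stand-in for the profiler command: a child Python process that prints one line.
stand_in_command = [sys.executable, "-c", "print('profile captured')"]
log_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Stand-in for train(1, 1, 0): any foreground workload.
result, returncode = run_with_background_command(
    stand_in_command, lambda: sum(range(1000)), log_path)

with open(log_path) as log_file:
    log_text = log_file.read()
print(result, returncode, log_text.strip())
```

To use it for real, the stand-in command would be replaced by the capture_tpu_profile arguments from the cell above and the workload by a call such as `lambda: train(1, 1, 0)`.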

I hope this works for everyone, let me know otherwise! Cheers!