# ***Disclaimer:*** 
Hello Kagglers! I am a Solution Architect with the Google Cloud Platform. I am not an insider to this competition, so I am allowed to contribute and even compete, although I cannot collect prizes. The focus of my contributions is on helping users leverage GCP components (GCS, TPUs, BigQuery, etc.) to solve large problems. My ideas and contributions represent my own opinion and are not an official recommendation by Google. Also, I try to develop notebooks quickly in order to help users early in competitions. There may be better ways to solve particular problems, and I welcome comments and suggestions. Use my contributions at your own risk; I don't guarantee that they will help you win any competition, but I am hoping to learn by collaborating with everyone.

# **Acknowledgements:**
*     I would like to thank [Laura Fink](https://www.kaggle.com/allunia) for her [excellent notebook](https://www.kaggle.com/allunia/pulmonary-dicom-preprocessing) on DICOM processing. I used some snippets to set up the DICOM environment and produce file path names to read the dataset, as well as to plot images.
*     I also thank [Ian Pan](https://www.kaggle.com/vaillant) for the very insightful [discussion post](https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/discussion/182930) explaining the importance of the CT window and how to implement it. I used the snippet in that post to convert the DICOM images using the PE-specific window it describes.
    

# **Objective:** 
The objective of this notebook is to help contributors to the [RSNA competition](https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection) get a jump start in using [TFRecords](https://www.tensorflow.org/tutorials/load_data/tfrecord) to manage the huge size of the input dataset (close to 1 TB!). The strategy I propose is to use Google Cloud Storage (GCS) to store the dataset in TFRecord format and then use the TF Dataset object to read it from different platforms (CPU, GPU, TPU, etc.). This topic was previously explained using a JPEG dataset in [this notebook](https://www.kaggle.com/paultimothymooney/convert-kaggle-dataset-to-gcs-bucket-of-tfrecords), so I thought that an adaptation to a DICOM dataset would be useful.

Notice that you will get a huge benefit from managing your datasets in GCS in this competition, due to the large amount of data. It is also very important that your model be able to manage the memory used while ingesting the dataset. This is true whether you are training on CPUs, GPUs or TPUs. The best way to do this is to use the TFRecord format, which you can then read directly into a [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset). The tf.data.Dataset API streams records from storage instead of loading everything into memory, and it supports ingestion pipelines. This will give you "superpowers" for training with large datasets on any computing platform.
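
As a rough sketch of what such an ingestion pipeline can look like, the snippet below reads TFRecord files lazily, decodes each record inside the pipeline, and prefetches results. It assumes records with the `height`, `width`, `image_raw` and `pred_label` fields written later in this notebook, with `image_raw` holding float64 pixels. The names `RECORD_SPEC`, `decode_record` and `make_dataset` are illustrative, and `tf.data.AUTOTUNE` assumes TensorFlow 2.4 or newer (older releases expose it as `tf.data.experimental.AUTOTUNE`).

```python
import tensorflow as tf

# Field layout assumed to match the records written in this notebook.
RECORD_SPEC = {
    "height": tf.io.FixedLenFeature([], tf.int64),
    "width": tf.io.FixedLenFeature([], tf.int64),
    "image_raw": tf.io.FixedLenFeature([], tf.string),
    "pred_label": tf.io.FixedLenFeature([], tf.int64),
}

def decode_record(serialized):
    """Turn one serialized tf.train.Example into an (image, label) pair."""
    parsed = tf.io.parse_single_example(serialized, RECORD_SPEC)
    # Assumption: the pixels were stored as raw float64 bytes.
    pixels = tf.io.decode_raw(parsed["image_raw"], tf.float64)
    image = tf.reshape(pixels, tf.stack([parsed["height"], parsed["width"]]))
    return tf.cast(image, tf.float32), parsed["pred_label"]

def make_dataset(file_paths):
    """Stream records from a list of paths without loading them all into memory."""
    dataset = tf.data.TFRecordDataset(file_paths)
    dataset = dataset.map(decode_record, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.prefetch(tf.data.AUTOTUNE)
```

Because the decoding happens inside `map`, only the records currently being processed are held in memory, regardless of how many files are listed.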

Using the TFRecord format for storing data should be easy, but it requires data serialization, which complicates things a little. Serialization is done using [protocol buffers](https://developers.google.com/protocol-buffers/), which have a bit of a learning curve. For ML, however, you only need to understand the [TFExample](https://www.tensorflow.org/api_docs/python/tf/train/Example) format. This format is explained in detail in [this tutorial](https://www.tensorflow.org/tutorials/load_data/tfrecord), but you don't need to read all of it: in this Notebook I provide template code, specific to image data, that you can quickly customize for other types of data.

# **Setup:**
This notebook can use either local storage or GCS storage. The path you follow is controlled by the GCP_SDK_ENABLED variable below. I set the default for local storage (GCP_SDK_ENABLED = False) to enable users to quickly run the example in local storage just for learning purposes. But the real value of this Notebook is the capability to load large datasets to GCS storage. This is enabled by setting (GCP_SDK_ENABLED = True) in the cell below. To enable the GCP SDK you must also link this Notebook to a GCP project using "Add-ons->Google Cloud SDK" in this Notebook. 

You also need to create a GCS bucket in the linked project and provide the bucket name in the BUCKET_NAME variable below. If you get a "permission denied" error when reading or writing, it is because either the SDK was not enabled or the bucket name was not correct. If you get a "NOT FOUND" error, it is probably because the GCS bucket does not exist or its name was misspelled.
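
A quick sanity check on `BUCKET_NAME` before writing anything can catch the most common mistakes early. The helper below is a hypothetical sketch, not part of any Google library; it only inspects the string and cannot tell whether the bucket exists or whether the SDK is linked.

```python
# Placeholder value used in the configuration cell of this notebook
PLACEHOLDER = "gs://your-unique-bucket-name-here"

def check_bucket_name(bucket_name):
    """Return a list of problems found in a BUCKET_NAME value (empty if none)."""
    problems = []
    if not bucket_name.startswith("gs://"):
        problems.append("must start with 'gs://'")
    if bucket_name == PLACEHOLDER:
        problems.append("still set to the placeholder value")
    if bucket_name.endswith("/"):
        # Paths in this notebook are built as BUCKET_NAME + "/RSNA_PE/..."
        problems.append("must not end with '/'")
    return problems

print(check_bucket_name("gs://my-rsna-bucket"))  # []
print(check_bucket_name("my-rsna-bucket/"))
```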

In [None]:
# After you have linked your project change this flag to True to use GCS and execute the next cell as well to enable access
# To link your GCP project see Menu "Add-ons->Google Cloud SDK"
GCP_SDK_ENABLED = False

## IMPORTANT
## YOU MUST MODIFY THE LINE BELOW WITH THE NAME OF THE BUCKET YOU ARE USING IN GCS
BUCKET_NAME = "gs://your-unique-bucket-name-here"

After you have enabled the GCP SDK as described above, you can grant Tensorflow permission to read and write to your GCS bucket. This is a great feature because it also allows a TPU to read your datasets. If you get a "permission denied" error at any point, make sure your project is linked and re-run this cell.

In [None]:
# Authorize Tensorflow to write directly to GCS
# This requires this Notebook to be linked to a GCP project, see Menu "Add-ons->Google Cloud SDK"
# After linking the project, import credentials and authorize Tensorflow as follows

from kaggle_secrets import UserSecretsClient

if GCP_SDK_ENABLED:
    user_secrets = UserSecretsClient()
    user_credential = user_secrets.get_gcloud_credential()
    user_secrets.set_tensorflow_credential(user_credential)

# Steps:
1. [Install gdcm package, import packages](#install)
2. [Build Input File List from RSNA PE Challenge Dataset](#par2)
3. [Read DICOM images](#par3) 
4. [Define CT Window Function](#par4)
5. [Define the TFRecord Format](#par5)
6. [Write TFRecord to Storage](#par6)
7. [Reload TFRecord from Storage to Test Correctness](#par7)
8. [Scale up! Upload a range of studies to GCS](#par8)

# 1. Install gdcm Package, import packages <a class="anchor" id="install"></a>

In [None]:
!conda install -c conda-forge gdcm -y

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import pydicom
import scipy.ndimage
import gdcm

from os import listdir, mkdir
import os

# 2. Build Input File List from RSNA PE Challenge Dataset <a class="anchor" id="par2"></a>

In [None]:
listdir("../input/")

In [None]:
basepath = "../input/rsna-str-pulmonary-embolism-detection/"
listdir(basepath)

In [None]:
train_df = pd.read_csv(basepath + "train.csv")
test_df = pd.read_csv(basepath + "test.csv")

In [None]:
train_df.head()

In [None]:
# create a list of unique Study Ids
list_of_studies = train_df.StudyInstanceUID.unique()

In [None]:
list_of_studies.shape

In [None]:
# create a list of file directories for each study 
train_df["dcm_path"] = basepath + "train/" + train_df.StudyInstanceUID + "/" + train_df.SeriesInstanceUID
list_of_directories = train_df.dcm_path.unique()
list_of_directories.shape

# 3. Load DICOM Images for a Given Study <a class="anchor" id="par3"></a>
The loading function was borrowed from [this posting](https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/discussion/182930). 

In [None]:
def load_dicom_array(dcm_path):
    # List the directory once so file names and pixel data stay aligned
    dicom_files = listdir(dcm_path)
    dicoms = [pydicom.dcmread(dcm_path + "/" + file) for file in dicom_files]
    # Rescale slope and intercept, taken from the first slice
    M = float(dicoms[0].RescaleSlope)
    B = float(dicoms[0].RescaleIntercept)
    # Assume all images are axial
    z_pos = [float(d.ImagePositionPatient[-1]) for d in dicoms]
    # Sort slices and their file names by position along the z axis
    order = np.argsort(z_pos)
    dicoms = np.asarray([d.pixel_array for d in dicoms])
    dicoms = dicoms[order]
    dicoms = dicoms * M
    dicoms = dicoms + B
    return dicoms, np.asarray(dicom_files)[order]

In [None]:
dicom_imgs, img_names = load_dicom_array(list_of_directories[0])

In [None]:
img_names

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,3))
ax[0].set_title("Original CT-scan")
ax[0].imshow(dicom_imgs[0], cmap="bone")
ax[1].set_title("Pixelarray distribution");
sns.distplot(dicom_imgs[0].flatten(), ax=ax[1]);

# 4. Define CT Window Function <a class="anchor" id="par4"></a>
The CT Window function below was suggested [in this posting](https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/discussion/182930).
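
To make the window arithmetic concrete, here is a small sketch of how a window level (WL) and window width (WW) translate into clipping bounds and a 0-1 intensity. The helper names `window_bounds` and `window_value` are mine. Note one difference: the `CT_window` function below rescales by the clipped image's own minimum and maximum, while this sketch rescales by the window bounds themselves. The two agree when the image contains values at or beyond both bounds.

```python
def window_bounds(WL, WW):
    """Return the (lower, upper) clipping bounds for a window level and width."""
    return WL - WW // 2, WL + WW // 2

def window_value(hu, WL, WW):
    """Clip one pixel value to the window and rescale it to the range 0-1."""
    lower, upper = window_bounds(WL, WW)
    clipped = min(max(hu, lower), upper)
    return (clipped - lower) / (upper - lower)

# The PE-specific window used in this notebook: WL=100, WW=700
print(window_bounds(100, 700))        # (-250, 450)
print(window_value(-1000, 100, 700))  # 0.0
print(window_value(100, 100, 700))    # 0.5
print(window_value(3000, 100, 700))   # 1.0
```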

In [None]:
def CT_window(img, WL=50, WW=350):
    upper, lower = WL+WW//2, WL-WW//2
    X = np.clip(img.copy(), lower, upper)
    X = X - np.min(X)
    X = X / np.max(X)
    #X = (X*255.0).astype('uint8')
    return X

In [None]:
windowed_ct = CT_window(dicom_imgs[0], 100, 700)

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,3))
ax[0].set_title("PE Specific CT-scan")
ax[0].imshow(windowed_ct, cmap="bone")
ax[1].set_title("Pixelarray distribution");
sns.distplot(windowed_ct.flatten(), ax=ax[1]);

# 5. Define TFRecord Format <a class="anchor" id="par5"></a>
The code that follows is the TFExample template as explained [here](https://www.tensorflow.org/tutorials/load_data/tfrecord). Here is a quick explanation of how this works. For serialization using TFExample, every piece of data has to fit into one of three types:
* bytes_feature
* float_feature
* int64_feature

Image data and strings are encoded as bytes_feature; any other type is either float or int. The image_example function will produce records with the types needed by a model. The image_raw field will contain the image bytes; the other fields are metadata. The pred_label field indicates whether PE is present in the image.

This is a very useful snippet that you can customize if you want to define your own TFRecord type with specific fields. This template code is associated with a "TFRecord Write" procedure and a "TFRecord Read" procedure. So, to deal with TFRecords you need the "definition" snippet below, the [write snippet](#par6) and the [read snippet](#par7), all provided in this Notebook. You can use these 3 snippets as a template and customize them to add fields and so on, but you must keep the data type definitions of the 3 snippets in sync.
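
One reason the snippets must stay in sync is that raw bytes carry no shape or dtype. The sketch below uses a small made-up array in place of a CT slice, and assumes float64 pixels as in this notebook. It shows that the reader needs both the dtype used by the writer and the stored height and width to recover the image.

```python
import numpy as np

# A tiny stand-in for a windowed CT slice (float64 is assumed here)
image = np.arange(6, dtype=np.float64).reshape(2, 3)
height, width = image.shape

# Serializing keeps only the bytes: 6 pixels * 8 bytes each
raw = image.tobytes()
print(len(raw))  # 48

# Reading with the matching dtype and the stored shape restores the image
restored = np.frombuffer(raw, dtype=np.float64).reshape(height, width)
print(np.array_equal(restored, image))  # True

# Reading with a different dtype silently yields the wrong number of values
wrong = np.frombuffer(raw, dtype=np.float32)
print(wrong.size)  # 12
```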

In [None]:
# Define the TFExample Data type for training models
# Our TFRecord format will include the CT Image and metadata of the image, including the prediction label (is PE present)

import tensorflow as tf

# Utilities to serialize data into a TFRecord
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def image_example(image, study_id, image_name, pred_label):
    image_shape = image.shape
    image_bytes = image.tobytes()  # tostring() is a deprecated alias of tobytes()
    feature = {
        'height': _int64_feature(image_shape[0]),
        'width': _int64_feature(image_shape[1]),
        'image_raw': _bytes_feature(image_bytes),
        'study_id': _bytes_feature(study_id.encode()),
        'img_name': _bytes_feature(image_name.encode()),
        'pred_label':  _int64_feature(pred_label)
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# 6. Write TFRecord to Storage <a class="anchor" id="par6"></a>

This snippet defines a function that writes TFRecords to storage. A nice property of tf.io.TFRecordWriter is that it can write data straight to GCS. If GCP_SDK_ENABLED is True, the function writes directly to a GCS path. If you get a permission error, make sure your project is linked to GCP and that you have a bucket in that project with the correct name; check the variables in the first cell of this Notebook. If you get a "NOT FOUND" error, the bucket name is likely wrong or the bucket does not exist. 
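
The write and read snippets in this notebook each build the storage path by hand, so a typo in one of them leads to a "NOT FOUND" error. One possible way to avoid that is a single path helper used by both, sketched below. `tfrecord_path` and its parameter names are hypothetical; the default prefix and local directory are the ones used in this notebook.

```python
def tfrecord_path(study_id, bucket_name, sdk_enabled,
                  gcs_prefix="RSNA_PE/PE_Window_512/train",
                  local_dir="/kaggle/working"):
    """Build the TFRecord path for one study, on GCS or on local storage."""
    filename = study_id + ".tfrecords"
    if sdk_enabled:
        return "/".join([bucket_name.rstrip("/"), gcs_prefix, filename])
    return "/".join([local_dir.rstrip("/"), filename])

print(tfrecord_path("study123", "gs://my-bucket", True))
# gs://my-bucket/RSNA_PE/PE_Window_512/train/study123.tfrecords
print(tfrecord_path("study123", "gs://my-bucket", False))
# /kaggle/working/study123.tfrecords
```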

In [None]:
# Define function to write the TFRecord
# First, convert each image into a `tf.Example` message.
# Then, use TFRecordWriter to write the messages to a `.tfrecords` file.


PE_WINDOW_LEVEL = 100
PE_WINDOW_WIDTH = 700

def create_tfrecord( study_id, study_path, sdk_enabled=False):
    if(sdk_enabled):
        storage_file_path = BUCKET_NAME+"/RSNA_PE/PE_Window_512/train/"+study_id+".tfrecords"
    else:
        storage_file_path = '/kaggle/working/'+study_id+'.tfrecords'

    study_images, study_image_file_names = load_dicom_array(study_path)
    num_records = len(study_images)

    total_records = 0
    with tf.io.TFRecordWriter(storage_file_path) as writer:
        for index in range(num_records):
            img_file_name = study_image_file_names[index]
            img_name = img_file_name.split(".")[0]
            img_data = train_df.loc[train_df["SOPInstanceUID"] == img_name]
            pred_label = img_data["pe_present_on_image"].values[0]
            #print("pred_label ",pred_label)
            windowed_image = CT_window(study_images[index], PE_WINDOW_LEVEL, PE_WINDOW_WIDTH)
            tf_example = image_example(windowed_image, study_id, img_name, pred_label)
            writer.write(tf_example.SerializeToString())
            total_records = total_records + 1
            print("*",end='')
            #print("wrote {}".format(img_name))
        # No explicit close is needed: the with block closes the writer on exit
        
    print("wrote {} records".format(total_records))
    return total_records

In [None]:
# This will write to local storage, just as a test. You can then refresh the output directory and see the record file
num_records = create_tfrecord( list_of_studies[0], list_of_directories[0], False)

In [None]:
# now write to Cloud Storage (if GCP_SDK_ENABLED)
num_records = create_tfrecord( list_of_studies[0], list_of_directories[0], GCP_SDK_ENABLED)

In [None]:
num_records

At this point, if you wrote to GCS, you should see the TFRecord file just created in the GCS storage browser of your project.

# 7. Reload TFRecord from Storage to Test Correctness <a class="anchor" id="par7"></a>

In [None]:
# Create a dictionary describing the features.
image_feature_description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
    'study_id': tf.io.FixedLenFeature([], tf.string),
    'img_name': tf.io.FixedLenFeature([], tf.string),
    'pred_label': tf.io.FixedLenFeature([], tf.int64)
}

def _parse_image_function(example_proto):
  # Parse the input tf.Example proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, image_feature_description)

In [None]:
# and now we can reload the dataset directly from storage (GCS if GCP_SDK_ENABLED, otherwise local)

sample_study_id = list_of_studies[0]
if(GCP_SDK_ENABLED):
    storage_file_path = BUCKET_NAME+"/RSNA_PE/PE_Window_512/train/"+sample_study_id+".tfrecords"
else:
    storage_file_path = '/kaggle/working/'+sample_study_id+'.tfrecords'

encoded_image_dataset = tf.data.TFRecordDataset(storage_file_path)
parsed_image_dataset = encoded_image_dataset.map(_parse_image_function)
parsed_image_dataset

In [None]:
# read every record in the dataset back into NumPy arrays plus metadata
def load_dataset(dataset):
    reloaded_images = []
    img_mtd = []
    for image_features in dataset.as_numpy_iterator():
        # The image bytes were written from a float64 array, so decode them the same way
        sample_image = np.frombuffer(image_features['image_raw'], dtype='float64')
        mtd = dict()
        mtd['width'] = image_features['width']
        mtd['height'] = image_features['height']
        mtd['study_id'] = image_features['study_id'].decode()
        mtd['img_name'] = image_features['img_name'].decode()
        mtd['pred_label'] = image_features['pred_label']
        # NumPy shapes are (rows, columns), i.e. (height, width)
        reloaded_images.append(sample_image.reshape(mtd['height'], mtd['width']))
        img_mtd.append(mtd)
    return reloaded_images, img_mtd

In [None]:
#print(len(sample_array))
reloaded_images, img_mtd = load_dataset(parsed_image_dataset)
print(reloaded_images[0].shape)  
print(img_mtd[0])

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,3))
ax[0].set_title("Reloaded CT-scan {}".format(img_mtd[0]['img_name']))
ax[0].imshow(reloaded_images[0], cmap="bone")
ax[1].set_title("Pixelarray distribution");
sns.distplot(reloaded_images[0].flatten(), ax=ax[1]);

# 8. Scale up! Upload a range of studies to GCS <a class="anchor" id="par8"></a>
We can load a range of studies using the code below, but I found that there are 7279 studies, and uploading all of them would take days! I don't think training over the entire 1 TB dataset is the best strategy here; it may be better to select a subset. The loop below works over a range of indices into the list of studies, so you can trim it to a specific subset of interest. In the example below I write the first 3 studies.
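
If you would rather not take the first few studies, one option is to pick study indices at random with a fixed seed, or to split the full range into chunks that can be uploaded across several sessions. The helpers below are a hypothetical sketch (`pick_indices` and `chunk_ranges` are my names). They return indices rather than study ids because the loop in this section uses the same index for `list_of_studies` and `list_of_directories`.

```python
import random

def pick_indices(num_studies, k, seed=0):
    """Return k distinct study indices, chosen reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_studies), k))

def chunk_ranges(total, chunk_size):
    """Yield (lower, upper) index ranges that cover range(total) in chunks."""
    for lower in range(0, total, chunk_size):
        yield lower, min(lower + chunk_size, total)

print(list(chunk_ranges(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

Each `(lower, upper)` pair can be used as `lower_range` and `upper_range` in the loop below, and the indices from `pick_indices` can replace the `range(...)` it iterates over.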

In [None]:
num_studies = list_of_studies.shape[0]
lower_range = 0
upper_range = 3
# you can upload up to num_studies (the count printed in section 2), but that will take days.
for index in range(lower_range,upper_range):
    print("processing study {} out of {}".format(index,num_studies))
    print("writing tfrecords for {}".format(list_of_studies[index]))
    num_records = create_tfrecord(list_of_studies[index], list_of_directories[index], GCP_SDK_ENABLED)
    print("wrote {} records".format(num_records))