<h1>Preprocessing Medical Images (DICOM files) with Apache Beam</h1>

<strong>This notebook demonstrates how to use Apache Beam to convert medical images (in the DICOM format) into TFRecords. </strong>

An interactive version of the notebook is available on Kaggle at:<br>
https://www.kaggle.com/spiroganas/preprocessing-medical-images-with-apache-beam
<hr>

The dataset for Kaggle's ["RSNA STR Pulmonary Embolism Detection" competition](https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/) includes approximately 1.94 million medical images, which consume about 912 gigabytes of hard drive space.  <strong><em>You may not believe me, but this is "small data"</em></strong> (you can easily store the entire dataset on a hard drive that costs less than $200). When you start dealing with data at a healthcare system like England's NHS, which serves 56 million people, you simply can't run your analysis on a single computer.  If you're going to work with real <strong>BIG DATA</strong>, you need Apache Beam.

[Apache Beam](https://beam.apache.org/get-started/beam-overview/) is "particularly useful for [Embarrassingly Parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) data processing tasks".  Basically, it lets you run your code on hundreds or thousands of computers at once.

This is a fairly basic example:
* The files are stored on a single, local hard drive.
* The Apache Beam pipeline is running on [Direct Runner](https://beam.apache.org/documentation/runners/direct/), a simple, local execution tool designed for testing.
* The output is TFRecord files

"[The TFRecord format is a simple format for storing a sequence of binary records.](https://www.tensorflow.org/tutorials/load_data/tfrecord)"  TensorFlow can quickly read data stored in the TFRecord format.  This is especially important when the model is being trained on fast GPUs or [TPUs](https://cloud.google.com/tpu) (where IO is frequently the training bottleneck).



If you were running this on real-world data, the data would probably be stored in Google Cloud Storage or AWS S3, and the pipeline would run on [Apache Spark](https://beam.apache.org/documentation/runners/spark/) or [Google Cloud Dataflow](https://cloud.google.com/dataflow).

A lot of the image preprocessing code is based on: https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial





In [None]:
# define a few constants to keep track of the folders we will be using
WORKING_FOLDER = '/kaggle/working/'
TEMP_FOLDER = '/kaggle/temp/'

TRAIN_FOLDER = '/kaggle/input/rsna-str-pulmonary-embolism-detection/train'
TEST_FOLDER = '/kaggle/input/rsna-str-pulmonary-embolism-detection/test'

LABELS_FILE = '/kaggle/input/rsna-str-pulmonary-embolism-detection/train.csv'

# Change this to True if you want to print all the arrays (which make the notebook really hard to read...)
VERBOSE = False

# Step 1: Install and import the libraries we will be using

* [Pydicom](https://pydicom.github.io/) is a library that reads DICOM files
* Some DICOM files contain images that are compressed.  I used [GDCM](https://en.wikipedia.org/wiki/GDCM) to decompress images in this Kaggle notebook.  If GDCM won't work on your computer, you can also use the pylibjpeg, pylibjpeg-libjpeg, and pylibjpeg-openjpeg libraries.
* opencv is a computer vision library.  We will use it to resize the image and to change it from grayscale to RGB color.
* apache-beam makes it easy to run your pipeline on a cluster of computers.
* The Kaggle Notebook already has some popular data science libraries installed (numpy, tensorflow 2, and pandas).  If you're running this on your own machine, you will need to install those yourself.




In [None]:
# Install the packages we will be using to read and process the DICOM Files
!apt-get -qq update && apt-get -qq install -y  libgdcm-tools python-gdcm
!conda install -y -q -c conda-forge opencv gdcm pydicom apache-beam

In [None]:
import numpy as np
import tensorflow as tf
import apache_beam
import pydicom
import gdcm
import cv2 as cv
import matplotlib.pyplot as plt
import os
import sys
import csv
import pickle
import random
from datetime import datetime
print('Libraries Imported!')

%matplotlib inline
np.set_printoptions(threshold=sys.maxsize)
print('Settings Set!')

# Step 2:  Get a list of the DICOM files you want to convert to TFRecords

There are several ways you can do this:
1. Hardcode the filenames as a list (only really useful when you're developing/testing).
2. Store the filenames in a text or Excel file and read them into a list.
3. Use os.walk or glob. 
4. If your files are on AWS, use the boto3 Python library.
5. If your files are on Google Cloud Storage, use the google-cloud-storage Python library.

However you get/create your list of files, they will be the input to the pipeline.  Apache Beam will split the list and send a sub-list to each of the worker computers in your cluster.

(In this Kaggle Notebook example, we are using the Direct Runner, so the entire list of files will be processed on a single machine.  But Apache Beam's raison d'être is to farm out tasks to dozens/hundreds/thousands of computers!)
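
As a sketch of option 3, the same kind of file list can be built with pathlib instead of os.walk. The helper name `list_dicoms_with_pathlib` is made up for this illustration, and the case-insensitive suffix check is an assumption about how your files are named.

```python
from pathlib import Path


def list_dicoms_with_pathlib(folder):
    """Return the sorted full paths of every .dcm file under `folder` (recursive)."""
    # rglob("*") walks all sub-folders; the suffix check ignores case so that
    # files named *.DCM are also picked up (an assumption about your data).
    return sorted(
        str(path)
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() == ".dcm"
    )
```

Sorting the result is optional, but it makes the input to the pipeline reproducible from one run to the next.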

In [None]:
# This will get a list of all the Kaggle Data


def list_dicoms_in_folder(folder):
    '''Lists the full path to all the DICOM files in a given folder.'''
    my_list = []
    for dirname, _, filenames in os.walk(folder):
        for filename in filenames:
            if filename[-4:]=='.dcm':
                my_list.append(os.path.join(dirname, filename))
    return my_list


# Change this to True if you want to do something with all the data
# For testing purposes, you can just use the small list of files defined in the next cell
if False:
    Train_Data = list_dicoms_in_folder(TRAIN_FOLDER)
    Test_Data = list_dicoms_in_folder(TEST_FOLDER)




In [None]:
# This creates a very small dataset for testing
SMALL_DATA = ['/kaggle/input/rsna-str-pulmonary-embolism-detection/train/0003b3d648eb/d2b2960c2bbf/0787742383e4.dcm',
                        '/kaggle/input/rsna-str-pulmonary-embolism-detection/train/0003b3d648eb/d2b2960c2bbf/10454bf652e0.dcm',
                        '/kaggle/input/rsna-str-pulmonary-embolism-detection/train/00cf4b2b751b/540a03df1d81/04a5bfe1bf00.dcm',]
print(SMALL_DATA)

# Step 3:  Create a dictionary that maps medical images to their labels.

We need to link each medical image to the label that we want to predict.

Medical images are uniquely identified by the SOP Instance UID field.  So that will be the key in our dictionary.

1. The labels are stored in a csv file.  
2. The labels will be passed to the Apache Beam pipeline as a [Side Input](https://beam.apache.org/documentation/patterns/side-inputs/).



In [None]:
def create_labels_dict(labels_csv):
    """Reads in a csv file and outputs a dictionary of the SOP Instance UID
    and a list of the other columns."""
    labels_dict = {}
    with open(labels_csv, newline="") as csvfile:
        my_reader = csv.reader(csvfile, delimiter=",", quotechar="|")
        next(my_reader)
        for row in my_reader:
            labels_dict[row[2]] = {
                "pe_present_on_image": int(row[3]),
                "negative_exam_for_pe": int(row[4]),
                "qa_motion": int(row[5]),
                "qa_contrast": int(row[6]),
                "flow_artifact": int(row[7]),
                "rv_lv_ratio_gte_1": int(row[8]),
                "rv_lv_ratio_lt_1": int(row[9]),
                "leftsided_pe": int(row[10]),
                "chronic_pe": int(row[11]),
                "true_filling_defect_not_pe": int(row[12]),
                "rightsided_pe": int(row[13]),
                "acute_and_chronic_pe": int(row[14]),
                "central_pe": int(row[15]),
                "indeterminate": int(row[16]),
            }

    return labels_dict
    
    
labels_dict = create_labels_dict('/kaggle/input/rsna-str-pulmonary-embolism-detection/train.csv')
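
The function above picks columns by position, which breaks silently if the column order ever changes. A possible variant, sketched here with csv.DictReader, looks columns up by name instead. It assumes the header row calls the key column `SOPInstanceUID`; the helper name `read_labels_by_name`, the two-column subset, and the sample rows are all invented for illustration.

```python
import csv
import io

# Only two label columns are used here to keep the sketch short.
LABEL_COLUMNS = ["pe_present_on_image", "negative_exam_for_pe"]


def read_labels_by_name(csvfile, label_columns=LABEL_COLUMNS):
    """Map SOPInstanceUID -> {label name: int}, using the CSV header row."""
    reader = csv.DictReader(csvfile)
    return {
        row["SOPInstanceUID"]: {name: int(row[name]) for name in label_columns}
        for row in reader
    }


# Tiny made-up CSV in the same header style (values are not real data).
sample = io.StringIO(
    "StudyInstanceUID,SeriesInstanceUID,SOPInstanceUID,"
    "pe_present_on_image,negative_exam_for_pe\n"
    "study1,series1,image1,0,1\n"
    "study1,series1,image2,1,0\n"
)
labels = read_labels_by_name(sample)
print(labels["image2"]["pe_present_on_image"])  # prints 1
```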

# Step 4: An example of reading a DICOM Medical Image

In this step, we are going to show how to extract the numpy array containing the medical image from the DICOM file.

You can think of a DICOM file as a dictionary of key/value pairs.  The keys are called "Headers" or "Tags", and their values contain information about the image.  These tags will include things like:
* The machine used to generate the image.
* The date the image was taken.
* The part of the body in the image.
* Patient Identifiers

The PixelData tag stores the raw image bytes.  Pydicom's pixel_array property decodes those bytes into a numpy array that represents the actual medical image.

Sometimes the image data is compressed.  There are a lot of valid ways to compress the image (see the TransferSyntaxUID_dict in the code below).  The TransferSyntaxUID tag will tell you how an image was compressed.

Pydicom relies on external libraries to decompress images, which is why we had to install GDCM.
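
A rough way to tell whether a file needs a decompression library at all is to check its TransferSyntaxUID against the handful of syntaxes that store pixel data uncompressed. The sketch below uses plain strings and no pydicom; the function name `needs_pixel_decoder` is hypothetical, and treating every other UID as "needs a decoder" is a simplification.

```python
# Transfer syntaxes whose pixel data is stored natively (not encapsulated).
NATIVE_TRANSFER_SYNTAXES = {
    "1.2.840.10008.1.2",       # Implicit VR Little Endian
    "1.2.840.10008.1.2.1",     # Explicit VR Little Endian
    "1.2.840.10008.1.2.1.99",  # Deflated Explicit VR Little Endian
    "1.2.840.10008.1.2.2",     # Explicit VR Big Endian (Retired)
}


def needs_pixel_decoder(transfer_syntax_uid):
    """Return True when the pixel data is encapsulated (compressed)."""
    return str(transfer_syntax_uid) not in NATIVE_TRANSFER_SYNTAXES


print(needs_pixel_decoder("1.2.840.10008.1.2.1"))     # prints False
print(needs_pixel_decoder("1.2.840.10008.1.2.4.70"))  # prints True (JPEG Lossless)
```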

In [None]:
element = SMALL_DATA[0]
ds = pydicom.dcmread(element)

# The kaggle images are compressed, so pydicom needs to use the GDCM software: https://pydicom.github.io/pydicom/stable/old/image_data_handlers.html
# The TransferSyntaxUID tells you what image compression method was applied to the image
# Dictionary that translates UID to a text description is from: http://dicom.nema.org/medical/dicom/2018a/output/chtml/part06/chapter_A.html

def TransferSyntaxDescription(ds):
    '''Translates the TransferSyntaxUID into a text description of the image compression method used.
    '''

    TransferSyntaxUID_dict = {
        '1.2.840.10008.1.2' : 'Implicit VR Little Endian: Default Transfer Syntax for DICOM',
        '1.2.840.10008.1.2.1' : 'Explicit VR Little Endian',
        '1.2.840.10008.1.2.1.99' : 'Deflated Explicit VR Little Endian',
        '1.2.840.10008.1.2.2' : 'Explicit VR Big Endian (Retired)',
        '1.2.840.10008.1.2.4.50' : 'JPEG Baseline (Process 1): Default Transfer Syntax for Lossy JPEG 8 Bit Image Compression',
        '1.2.840.10008.1.2.4.51' : 'JPEG Extended (Process 2 & 4): Default Transfer Syntax for Lossy JPEG 12 Bit Image Compression (Process 4 only)',
        '1.2.840.10008.1.2.4.52' : 'JPEG Extended (Process 3 & 5) (Retired)',
        '1.2.840.10008.1.2.4.53' : 'JPEG Spectral Selection, Non-Hierarchical (Process 6 & 8) (Retired)',
        '1.2.840.10008.1.2.4.54' : 'JPEG Spectral Selection, Non-Hierarchical (Process 7 & 9) (Retired)',
        '1.2.840.10008.1.2.4.55' : 'JPEG Full Progression, Non-Hierarchical (Process 10 & 12) (Retired)',
        '1.2.840.10008.1.2.4.56' : 'JPEG Full Progression, Non-Hierarchical (Process 11 & 13) (Retired)',
        '1.2.840.10008.1.2.4.57' : 'JPEG Lossless, Non-Hierarchical (Process 14)',
        '1.2.840.10008.1.2.4.58' : 'JPEG Lossless, Non-Hierarchical (Process 15) (Retired)',
        '1.2.840.10008.1.2.4.59' : 'JPEG Extended, Hierarchical (Process 16 & 18) (Retired)',
        '1.2.840.10008.1.2.4.60' : 'JPEG Extended, Hierarchical (Process 17 & 19) (Retired)',
        '1.2.840.10008.1.2.4.61' : 'JPEG Spectral Selection, Hierarchical (Process 20 & 22) (Retired)',
        '1.2.840.10008.1.2.4.62' : 'JPEG Spectral Selection, Hierarchical (Process 21 & 23) (Retired)',
        '1.2.840.10008.1.2.4.63' : 'JPEG Full Progression, Hierarchical (Process 24 & 26) (Retired)',
        '1.2.840.10008.1.2.4.64' : 'JPEG Full Progression, Hierarchical (Process 25 & 27) (Retired)',
        '1.2.840.10008.1.2.4.65' : 'JPEG Lossless, Hierarchical (Process 28) (Retired)',
        '1.2.840.10008.1.2.4.66' : 'JPEG Lossless, Hierarchical (Process 29) (Retired)',
        '1.2.840.10008.1.2.4.70' : 'JPEG Lossless, Non-Hierarchical, First-Order Prediction (Process 14 [Selection Value 1]): Default Transfer Syntax for Lossless JPEG Image Compression',
        '1.2.840.10008.1.2.4.80' : 'JPEG-LS Lossless Image Compression',
        '1.2.840.10008.1.2.4.81' : 'JPEG-LS Lossy (Near-Lossless) Image Compression',
        '1.2.840.10008.1.2.4.90' : 'JPEG 2000 Image Compression (Lossless Only)',
        '1.2.840.10008.1.2.4.91' : 'JPEG 2000 Image Compression',
        '1.2.840.10008.1.2.4.92' : 'JPEG 2000 Part 2 Multi-component Image Compression (Lossless Only)',
        '1.2.840.10008.1.2.4.93' : 'JPEG 2000 Part 2 Multi-component Image Compression',
        '1.2.840.10008.1.2.4.94' : 'JPIP Referenced',
        '1.2.840.10008.1.2.4.95' : 'JPIP Referenced Deflate',
        '1.2.840.10008.1.2.4.100' : 'MPEG2 Main Profile / Main Level',
        '1.2.840.10008.1.2.4.101' : 'MPEG2 Main Profile / High Level',
        '1.2.840.10008.1.2.4.102' : 'MPEG-4 AVC/H.264 High Profile / Level 4.1',
        '1.2.840.10008.1.2.4.103' : 'MPEG-4 AVC/H.264 BD-compatible High Profile / Level 4.1',
        '1.2.840.10008.1.2.4.104' : 'MPEG-4 AVC/H.264 High Profile / Level 4.2 For 2D Video',
        '1.2.840.10008.1.2.4.105' : 'MPEG-4 AVC/H.264 High Profile / Level 4.2 For 3D Video',
        '1.2.840.10008.1.2.4.106' : 'MPEG-4 AVC/H.264 Stereo High Profile / Level 4.2',
        '1.2.840.10008.1.2.4.107' : 'HEVC/H.265 Main Profile / Level 5.1',
        '1.2.840.10008.1.2.4.108' : 'HEVC/H.265 Main 10 Profile / Level 5.1',
        '1.2.840.10008.1.2.5' : 'RLE Lossless',
        '1.2.840.10008.1.2.6.1' : 'RFC 2557 MIME encapsulation',
        '1.2.840.10008.1.2.6.2' : 'XML Encoding',
        '1.2.840.10008.1.20' : 'Papyrus 3 Implicit VR Little Endian (Retired)',
    }
    try:
        return TransferSyntaxUID_dict.get(ds.file_meta.TransferSyntaxUID, "Unknown TransferSyntaxUID")
    except AttributeError:
        # pydicom raises AttributeError when file_meta or the tag is absent
        return "Missing TransferSyntaxUID"
        
    
    
# Let's take a look at the decompressed image
print('Image Compression Mode: ', TransferSyntaxDescription(ds))
plt.imshow(ds.pixel_array , cmap=plt.cm.gray)
plt.show()

if VERBOSE:
    print("pixel_array: ")
    print()
    print(ds.pixel_array )






# Step 5:  Create a Beam function to read the DICOM files

We can add custom code to an Apache Beam pipeline by creating a class that inherits from apache_beam.DoFn.  We override the process() method with our custom code.

The process function requires an element argument.  element is the output of the previous step in the pipeline (i.e. the input to this step).  You can also feed the process function "side inputs", which is data that doesn't come from the previous step in the pipeline.  In this example, the side input is our dictionary that maps images to their labels.
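
To make the calling convention concrete, here is a plain-Python sketch that does not use Beam at all. `LookUpLabel` and `apply_like_pardo` are hypothetical names invented for this illustration; the helper only imitates how a ParDo step calls process() once per element, passes along the side input, and flattens whatever iterable comes back.

```python
class LookUpLabel:
    """Stand-in for a DoFn: one element in, an iterable of elements out."""

    def process(self, element, labels_dict):
        # `labels_dict` plays the role of the side input.
        if element not in labels_dict:
            return None  # returning None means "emit nothing for this element"
        return [{"SOPInstanceUID": element, "Label": labels_dict[element]}]


def apply_like_pardo(fn, elements, *side_inputs):
    """Hypothetical helper: call process() per element and flatten the results."""
    for element in elements:
        outputs = fn.process(element, *side_inputs)
        if outputs is not None:
            yield from outputs


toy_labels = {"image1": 0, "image2": 1}  # invented values
results = list(
    apply_like_pardo(LookUpLabel(), ["image1", "image2", "image3"], toy_labels)
)
print(results)
```

Note that `image3` has no label, so it produces no output, while the other two elements each produce one dictionary. This is why the process functions in this notebook return a list rather than a bare dictionary.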

In [None]:

class Read_Dicom_File(apache_beam.DoFn):
    """Input is the path to a single DICOM file.
    This function reads the DICOM file and extracts certain data tags.
    It returns a one-item list holding a dictionary with the file name,
    the label, and the data extracted from the DICOM file."""

    def process(self, element, labels_dict, include_label=True):

        ds = pydicom.dcmread(element)

        label = (
            int(labels_dict[ds["SOPInstanceUID"].value]["pe_present_on_image"])
            if include_label
            else -9999999  # For prediction/test data sets, you don't have a label
        )

        return [
            {
                "FileName": element,
                "SOPInstanceUID": ds["SOPInstanceUID"].value,
                "Label": label,
                "RescaleSlope": float(ds["RescaleSlope"].value),
                "RescaleIntercept": float(ds["RescaleIntercept"].value),
                # "PatientAge": ds["PatientAge"].value,
                # "PatientSex": ds["PatientSex"].value,
                # "BodyPartExamined": ds["BodyPartExamined"].value,
                "PixelData": ds.pixel_array,
                
            }
        ]

# Step 6: Standardizing CT Images
If the DICOM file is from a CT scan, the PixelData needs to be converted to the [Hounsfield Scale](https://en.wikipedia.org/wiki/Hounsfield_scale).

By converting pixels to the Hounsfield Units,  we can compare CT scans that were taken on different scanners.

This Beam function is an example of applying arbitrary Python code to an element as it flows through the pipeline.  We will be adding the "PixelData_PreProcessed" key to our element and using that to hold our modified image.  
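
The conversion itself is a linear rescale: Hounsfield value = RescaleSlope * stored value + RescaleIntercept. Here is a small worked example on a made-up 2 by 2 array; the slope and intercept are typical-looking CT values picked for illustration and were not read from a real file.

```python
import numpy as np

# Made-up raw stored values. -2000 stands in for "outside the scan" padding.
raw = np.array([[-2000, 24], [1024, 1424]], dtype=np.int16)
slope, intercept = 1.0, -1024.0  # illustrative values, not from a real header

hounsfield = slope * raw.astype(np.float32) + intercept
# Anything below -1000 is treated as air, matching the clipping described above.
hounsfield = np.maximum(hounsfield, -1000.0)
print(hounsfield)
```

With these numbers the padding value and the stored value 24 both end up at -1000 (air), 1024 maps to 0 (water), and 1424 maps to 400.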


In [None]:

class Convert_to_Hounsfield_Units(apache_beam.DoFn):
    """Input is a dictionary containing at least the keys PixelData, RescaleSlope and RescaleIntercept.
    Output is the same as input, plus the key PixelData_PreProcessed."""

    def process(self, element):

        # If the image has already been pre-processed, use that.
        # Otherwise use the original image.
        image = element.get("PixelData_PreProcessed", element["PixelData"])

        image = image.astype(dtype=np.float32)
        slope = np.array(element["RescaleSlope"], dtype=np.float32)
        intercept = np.array(element["RescaleIntercept"], dtype=np.float32)

        # Convert the image to Hounsfield Units
        image = (slope * image) + intercept

        # Very small values represent area that is outside the scanning bound of the
        # CT scanner.  We replace those values with the value for air (which by definition is -1000)
        image[image < -1000.0] = -1000.0

        element["PixelData_PreProcessed"] = image.astype(dtype=np.float32)

        return [element]


# Step 7:  Resize the image

You may want your medical images to be a certain size.

This may be especially helpful if you're planning to use transfer learning (i.e. using one of the pre-trained models available at https://www.tensorflow.org/api_docs/python/tf/keras/applications).  For example, the ResNet50 model has a default input shape of (224, 224, 3).


You can also apply other opencv transformations, like converting the image from grayscale to RGB color.
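
For a single-channel image, converting grayscale to RGB amounts to copying the one channel three times. The NumPy-only sketch below illustrates the resulting shape change on a toy array; it is not the OpenCV call used in this notebook, just a picture of its effect.

```python
import numpy as np

gray = np.arange(6, dtype=np.float32).reshape(2, 3)  # toy 2x3 grayscale image

# Repeat the same channel three times along a new last axis: (H, W) -> (H, W, 3)
rgb = np.repeat(gray[:, :, np.newaxis], 3, axis=2)

print(gray.shape, rgb.shape)  # prints (2, 3) (2, 3, 3)
```

One detail worth remembering when calling cv.resize: the target size is given as (width, height), the reverse of numpy's (rows, columns) order. For a square 224 by 224 target it makes no difference.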

In [None]:
class Resize_for_Machine_Learning(apache_beam.DoFn):
    """Input is a dictionary containing the key PixelData_PreProcessed or PixelData.
    Output is the same as input, plus the key PixelData_PreProcessed,
    which holds the resized (and optionally RGB) image."""

    def process(self, element, new_image_size=(224, 224), grayscale=True):

        # The default input size for the VGG16 CNN model is (224, 224)
        # but you can use this to make your images any size.

        # If the image has already been pre-processed, use that.
        # Otherwise use the original image.
        image = element.get("PixelData_PreProcessed", element["PixelData"]).astype(
            dtype=np.float32
        )

        image = cv.resize(image, new_image_size)

        # Convert from greyscale to RGB (RGB is required by some machine learning models)
        if grayscale:
            image = cv.cvtColor(image, cv.COLOR_GRAY2RGB)

        element["PixelData_PreProcessed"] = image.astype(dtype=np.float32)

        return [element]

        
        

# Step 8:  Rescale the image for Machine Learning

The Kaggle Tutorial https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial has a lot of good ideas about how to prepare your images for use in a machine learning model.  Basically, you should:
* Filter out anything you're not interested in (for example, filter out bones if you're looking at a soft tissue disease).
* Rescale the data so the values fall between 0 and 1.
* Zero-center the data, so the mean value is 0 and all values fall between -1 and 1.
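
The last two bullets can be checked on a handful of Hounsfield values. This sketch uses the -1000 to 400 window and the 0.25 mean quoted from the tutorial; the five input values are invented, and np.clip is used as a shorthand for the two threshold assignments.

```python
import numpy as np

MIN_BOUND, MAX_BOUND = -1000.0, 400.0  # Hounsfield window for lung CT
PIXEL_MEAN = 0.25                      # dataset mean quoted from the tutorial

hu = np.array([-1200.0, -1000.0, -300.0, 400.0, 2000.0], dtype=np.float32)

# Rescale to 0..1, clamping anything outside the window.
scaled = np.clip((hu - MIN_BOUND) / (MAX_BOUND - MIN_BOUND), 0.0, 1.0)
# Zero-center by subtracting the (approximate) dataset mean.
centered = scaled - PIXEL_MEAN

print(scaled)    # values: 0, 0, 0.5, 1, 1
print(centered)  # values: -0.25, -0.25, 0.25, 0.75, 0.75
```

So with a mean of 0.25 the final values actually land between -0.25 and 0.75.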



In [None]:

class Rescale_for_Machine_Learning(apache_beam.DoFn):
    """Input is a dictionary containing at least the key PixelData or PixelData_PreProcessed.
    Output is the same as input, plus the key PixelData_PreProcessed."""

    def process(self, element):
        # This logic comes from the Kaggle Tutorial:  https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial

        # Set the Min and Max range for Hounsfield Units
        # For lung CT scans, we are not interested in bones or other very dense tissue.  So our max cut-off is set to 400.
        MIN_BOUND = -1000.0
        MAX_BOUND = 400.0

        # If the image has already been pre-processed, use that.
        # Otherwise use the original image.
        image = element.get("PixelData_PreProcessed", element["PixelData"]).astype(
            dtype=np.float32
        )

        def normalize(image, MIN_BOUND=-1000.0, MAX_BOUND=400.0):
            """This logic comes from the Kaggle Tutorial:  https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial
            Our values currently range from -1024 to around 2000. Anything above 400 is not interesting
            to us, as these are simply bones with different radiodensity. A commonly used set of
            thresholds in the LUNA16 competition to normalize between are -1000 and 400.
            Here's some code you can use:
            """
            image = (image - MIN_BOUND) / (MAX_BOUND - MIN_BOUND)
            image[image > 1] = 1.0
            image[image < 0] = 0.0
            return image

        image = normalize(image, MIN_BOUND, MAX_BOUND)

        # Zero centering (from https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial)
        # As a final preprocessing step, it is advisory to zero center your data so that
        # your mean value is 0. To do this you simply subtract the mean pixel value from all pixels.
        #
        # To determine this mean you simply average all images in the whole dataset. If that sounds
        # like a lot of work, we found this to be around 0.25 in the LUNA16 competition.
        PIXEL_MEAN = 0.25
        image = image - PIXEL_MEAN

        element["PixelData_PreProcessed"] = image.astype(dtype=np.float32)

        return [element]

# Step 9:  A Beam function that writes data to a TFRecord file

TFRecords are a format that can speed up Data IO.  

More details are here:  https://www.tensorflow.org/tutorials/load_data/tfrecord
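
The image goes into the TFRecord as raw bytes, and raw bytes carry no shape or dtype. The NumPy-only sketch below (no TensorFlow involved, toy array invented for illustration) shows the round trip and why whoever reads the file back has to know both.

```python
import numpy as np

image = np.arange(12, dtype=np.float32).reshape(2, 2, 3)  # toy 2x2 RGB image

raw_bytes = image.tobytes()  # this is what a bytes feature would hold
print(len(raw_bytes))        # 2 * 2 * 3 values * 4 bytes each = 48

# Decoding needs the original dtype and shape, supplied by the reader.
restored = np.frombuffer(raw_bytes, dtype=np.float32).reshape(2, 2, 3)
print(np.array_equal(image, restored))  # prints True
```

An alternative design would store the height, width, and channel count as extra features next to the image, so the reader does not have to hardcode them.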




In [None]:

# This is how we write to a TFRecord file

#  https://www.tensorflow.org/tutorials/load_data/tfrecord
# The following functions can be used to convert a value to a type compatible
# with tf.train.Example.


def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def serialize_example(image, label):
    """
    Creates a tf.train.Example message ready to be written to a file.
    """
    # Create a dictionary mapping the feature name to the tf.train.Example-compatible
    # data type.
    feature = {
        "image": _bytes_feature(image.tobytes()),
        "label": _int64_feature(label),
    }

    # Create a Features message using tf.train.Example.
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()


class Write_to_TFRecord(apache_beam.DoFn):
    """Input is a dictionary containing SOPInstanceUID, Label, and either
    PixelData_PreProcessed or PixelData.
    Writes one TFRecord file per image to OUTPUT_FOLDER and emits no output."""

    def process(self, element, OUTPUT_FOLDER):

        filename = os.path.join(OUTPUT_FOLDER, element["SOPInstanceUID"] + ".TFRecord")

        image = element.get("PixelData_PreProcessed", element["PixelData"]).astype(
            dtype=np.float32
        )
        label = element.get("Label")

        # Write the `tf.train.Example` observations to the file.
        with tf.io.TFRecordWriter(filename) as writer:
            example = serialize_example(image, label)
            writer.write(example)





# Step 10: Balance the data set

This data set has about 20 negative cases for each positive case.  We want our training data set to be closer to 50-50, so we create a function that "throws away" some of the negative cases.
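
A quick simulation shows why keeping roughly 5% of the negatives gets close to 50-50 when the starting ratio is about 20 to 1. The element counts and the random seed below are invented for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

KEEP_FRACTION = 0.05
# Invented toy data: 1,000 positives and 20,000 negatives (a 1:20 ratio).
elements = [{"Label": 1}] * 1000 + [{"Label": 0}] * 20000

# Keep every positive; keep each negative with probability KEEP_FRACTION.
kept = [e for e in elements if e["Label"] == 1 or random.random() < KEEP_FRACTION]

positives = sum(e["Label"] for e in kept)
negatives = len(kept) - positives
print(positives, negatives)
```

All 1,000 positives survive, and the number of negatives kept lands near 20,000 * 0.05 = 1,000. Because the filter is random, the exact count changes from run to run unless the seed is fixed.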

In [None]:
def balanceDataset(element):
    """Balances the dataset by including all the positive cases
    and 5% of the negative cases"""
    Percent_of_Negative_Cases_to_include = 0.05
    if element["Label"] == 1:
        return True
    elif random.random() < Percent_of_Negative_Cases_to_include:
        return True
    else:
        return False

# Step 11:  Run the Apache Beam pipeline

Normally, you would run an Apache Beam pipeline on a cluster of computers.

This example will use the Direct Runner to run the code within this notebook.

In [None]:
list_of_Dicoms_to_process = SMALL_DATA
labels_dict = create_labels_dict(LABELS_FILE)



startTime = datetime.now()
print("Now running your Apache Beam pipeline!!!  Start Time: ", startTime)

with apache_beam.Pipeline() as p:
    rows = (
        p
        | apache_beam.Create(list_of_Dicoms_to_process, reshuffle=True)
        | apache_beam.ParDo(
            Read_Dicom_File(), labels_dict, include_label=True
        )
      #  | "Balance Dataset" >> apache_beam.Filter(balanceDataset)  # Comment this out for testing so it doesn't filter out any elements
        | apache_beam.ParDo(Convert_to_Hounsfield_Units())
        | apache_beam.ParDo(Resize_for_Machine_Learning())
        | apache_beam.ParDo(Rescale_for_Machine_Learning())
        | apache_beam.ParDo(Write_to_TFRecord(), OUTPUT_FOLDER=WORKING_FOLDER)
        # | "print" >> apache_beam.Map(print)
    )

print("Finished running your Apache Beam pipeline!!!")
print("Total Time:", datetime.now() - startTime)

print("Finished running all the code!!!!!!!!!!!!!!!!!!!!!!!!!!")








# Step 12:  List the files created by the Apache Beam pipeline

In [None]:
#!ls /kaggle/working/

# Print the number of TFRecord files
!ls -l /kaggle/working/*.TFRecord | wc -l
! echo --------------------------------------------------------------------------------------------------

# See how big the TFRecord files are
!ls /kaggle/working/ -l #--block-size=M

# Step 13:  Read a TFRecord file and display the medical image

To prove that this worked, we will load the TFRecord files into a TensorFlow Dataset (which could easily be used to train a model).
Then we will parse the TFRecords back into a numpy array and display the image.

Note the axes of the image... They prove that the image has been shrunk from around 500 by 500 to the requested 224 by 224.

In [None]:
# Get a list of the TFRecord files
TFRecordFiles = []
for dirname, _, filenames in os.walk(WORKING_FOLDER):
    for filename in filenames:
        if filename[-9:]=='.TFRecord':
            TFRecordFiles.append(os.path.join(dirname, filename))
          
        
        
        
# Define two parsing functions that will turn the TFRecord back into an array and a label        
def _parse_function(example, image_shape=(224, 224, 1)):
    # Parse the input `tf.train.Example` proto using the feature_dictionary.
    # Create a description of the features.
    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    }

    parsed_example = tf.io.parse_single_example(example, feature_description)
    return parsed_example



def _parse_function2(example):
    label = example["label"]
    image = tf.io.decode_raw(example["image"], tf.float32, little_endian=True)
    image = tf.reshape(image, (224, 224, 3))
    # apache beam is now producing an rgb image
    # image = tf.image.grayscale_to_rgb(image)  # I moved this into the apache beam pipeline
    return image, label        
            
            
# Create a dataset         
TF_dataset = tf.data.TFRecordDataset(TFRecordFiles)

# Use map to apply the parsing functions to the data
TF_dataset = TF_dataset.map(_parse_function)
TF_dataset = TF_dataset.map(_parse_function2)
            
            
for images, labels in TF_dataset.take(1):  # only take first element of dataset
    image = images.numpy()
    label = labels       
      
print("Label: ", label)        
print()        
plt.imshow(image , cmap=plt.cm.gray)
plt.show()
           
    
