# Creates TFRecords for the Pneumothorax Dataset


### Creates TFRecords for the files from the [SIIM-ACR Pneumothorax-Segmentation Dataset](https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation). 

### TFRecords Examples contain:

- **All examples**:
    - X-ray Images extracted from the DICOM files
    - Image IDs
    - Patient metadata from the DICOM files (age and sex)
    - Image metadata (width and height)


- **Train examples also include:**
    - Labels, implied from the dataframe
    - Transformed Run Length Encodings (RLEs)


### About the Transformed Run Length Encodings (RLEs)
RLEs stored in the TFRecs have been transformed from the original RLEs in the provided dataframe. **NOTE**: the `rle2mask()` function provided by the competition hosts will not work with this type of RLE encoding.

The RLEs stored in the TFRecs are pairs of values, each containing a start position and a run length:

- e.g. `1 3` means starting at pixel 1 and running a total of 3 pixels
- RLE pairs are space-delimited
- The pixels are numbered from top to bottom, then left to right: $1$ is pixel $(1,1)$, $2$ is pixel $(2,1)$, etc.
- The function `rle2mask()` provided below decodes the RLE into a mask. `build_mask()` creates an image-like array for the mask
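
As a rough illustration of the format, the sketch below decodes a start-length RLE string using only the Python standard library. The helper name `decode_start_length_rle` is hypothetical and is not part of this notebook; it assumes 1-indexed pixels numbered down each column first, as described in the list.

```python
def decode_start_length_rle(rle, height, width):
    """Decode a start-length RLE string into a height x width mask (nested lists).

    Assumes pixels are 1-indexed and numbered down each column first
    (top to bottom, then left to right).
    """
    values = [int(v) for v in rle.split()]
    starts, lengths = values[0::2], values[1::2]

    mask = [[0] * width for _ in range(height)]
    for start, length in zip(starts, lengths):
        for flat in range(start - 1, start - 1 + length):
            col, row = divmod(flat, height)  # column-major: a column fills before the next
            mask[row][col] = 1
    return mask


# '1 3' marks pixels 1-3 (first column, rows 1-3); '7 2' marks pixels 7-8.
mask = decode_start_length_rle('1 3 7 2', height=4, width=3)
for row in mask:
    print(row)
```

With this convention, a zero-length pair such as `1 0` decodes to an all-zero mask.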



### The TFRec Features are:
```
features = {
    'image': tf.io.FixedLenFeature([], tf.string), 
    'img_id': tf.io.FixedLenFeature([], tf.string), 
    'sex': tf.io.FixedLenFeature([], tf.string),
    'age': tf.io.FixedLenFeature([], tf.int64),
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    }

if labeled:
    features['label'] = tf.io.FixedLenFeature([], tf.int64)
    features['rle'] = tf.io.FixedLenFeature([], tf.string)
```
Look at `read_tfrecord()` for an example of how to extract example features. 



## Import Packages

In [None]:
import os, contextlib2, pydicom
import numpy as np
import pandas as pd
import matplotlib as mpl
from matplotlib import pyplot as plt

import tensorflow as tf
print('Tensorflow version ', tf.__version__)
AUTO = tf.data.experimental.AUTOTUNE 

## Data Processing

### Train Data

In [None]:
file_paths_train = tf.io.gfile.glob('../input/siim-train-test/dicom-images-train/*/*/*.dcm')
file_ids_train = [x.split('/')[-1].split('.dcm')[0] for x in file_paths_train]

df = pd.read_csv('../input/siim-acr-pneumothorax-segmentation/stage_2_train.csv')
ids_train = df['ImageId'].unique()
len(ids_train), len(file_ids_train) # note the discrepancy between the number of ids in the dataframe and the number of files

Check for train files and ImageIds that have no match

In [None]:
df_no_file = [f for f in ids_train if f not in file_ids_train]
file_no_df = [f for f in file_ids_train if f not in ids_train]
len(df_no_file), len(file_no_df)

Highlights from data description and exploration (train data):

1. All ids_train in the dataframe have a corresponding DICOM file, but not the other way around.

2. Non-contiguous mask pixels, even if corresponding to the same image, are specified in different rows. Thus, an image might have more than 1 row of EncodedPixels but all those rows correspond to the same mask. 

3. Images without pneumothorax have an `EncodedPixels` value of `-1`.

4. DICOM images have 1 channel and are 2D, i.e. (H, W) arrays.

Mainly because of point 2, and so that the encodings can go into the TFRecs as a single fixed-length feature, I will change the encoding to start-length RLE (as opposed to the *relative* form of RLE originally provided in the dataframe).
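
To make the difference between the two encodings concrete, here is a minimal sketch that converts relative RLE rows directly into one start-length RLE. This is not the approach used in this notebook, which goes through a mask array; the function name `relative_to_start_length` is hypothetical, and the sketch assumes both encodings traverse the pixels in the same order.

```python
def relative_to_start_length(relative_rles):
    """Merge one or more relative RLE strings into a single start-length RLE.

    In the relative form, each pair is (gap since the end of the previous run,
    run length). The output uses 1-indexed absolute starts.
    """
    runs = []
    for rle in relative_rles:
        values = [int(v) for v in rle.split()]
        position = 0  # 0-indexed position just past the previous run
        for gap, length in zip(values[0::2], values[1::2]):
            position += gap
            runs.append((position, position + length))  # half-open [start, end)
            position += length

    # Merge overlapping or touching runs, as a pixel-wise union would.
    merged = []
    for start, end in sorted(runs):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    return ' '.join(f'{start + 1} {end - start}' for start, end in merged)


# Two rows for the same image, each describing a separate region.
print(relative_to_start_length(['2 3 4 1', '20 2']))  # prints: 3 3 10 1 21 2
```

Because every start is absolute, the runs from several dataframe rows can be combined into one string, which is what allows a single fixed-length string feature per image.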

In [None]:
df = df.assign(Label = np.where(df['EncodedPixels'] == '-1', 0, 1))
df0 = df[df.Label == 0]
df1 = df[df.Label == 1]

ids_0 = df0['ImageId'].unique()
ids_1 = df1['ImageId'].unique()

print('Number of images with no Pneumothorax to be recorded: {}'.format(len(ids_0)))
print('Number of images with Pneumothorax to be recorded:    {}'.format(len(ids_1)))

### Test Data

In [None]:
file_paths_test = tf.io.gfile.glob('../input/siim-train-test/dicom-images-test/*/*/*.dcm')
file_ids_test = [x.split('/')[-1].split('.dcm')[0] for x in file_paths_test]

len(file_ids_test)

Create a single dictionary mapping IDs to DICOM file paths

In [None]:
file_ids = file_ids_train + file_ids_test
file_paths = file_paths_train + file_paths_test
paths_dict = dict(zip(file_ids, file_paths))
assert len(paths_dict.keys()) == len(file_ids_train) + len(file_ids_test)

## Transform RLEs

In [None]:
dicom_data = pydicom.dcmread(paths_dict[ids_1[0]])
image = dicom_data.pixel_array
height, width = image.shape
print(height, width)

IMAGE_SIZE = (height, width)
N_CHANNELS = 1
N_CLASSES = 1

In [None]:
# RLE to Mask and Mask to RLE 

def rle2mask_relative(rle, width, height): 
    # converts the "relative" rle provided to mask
    mask= np.zeros(width*height)
    array = np.asarray([int(x) for x in rle.split()])
    starts = array[0::2]
    lengths = array[1::2]

    current_position = 0
    for index, start in enumerate(starts):
        current_position += start
        mask[current_position:current_position+lengths[index]] = 1
        current_position += lengths[index]

    return mask.reshape(width, height)

def integrate_masks(image_id):
    # puts together non-contiguous mask pixels onto 1 single mask
    temp_df = df1['EncodedPixels'][df1['ImageId'] == image_id]
    mask = np.zeros([width, height])
    for rle in temp_df:
        mask = np.maximum(mask, rle2mask_relative(rle, width, height))
    return mask.T

def mask2rle(mask_array):
    '''
    mask_array: a numpy or tensorflow 2D array: 1 - mask, 0 - background
    Returns: run length as string
    '''
    pixels = tf.transpose(mask_array)
    pixels = tf.reshape(pixels, [-1])
    pixels = tf.cast(pixels, dtype=tf.int64) 
    pixels = tf.concat(([0], pixels, [0]), axis = 0) 
    changes = (pixels[1:] != pixels[:-1])
    runs = tf.where(changes) + 1 
    runs = tf.squeeze(runs)
    lens = runs[1::2] - runs[::2]

    zeros = tf.math.multiply(lens, 0)
    ones = tf.math.add(zeros, 1)
    inter = tf.stack((ones, zeros), axis = 1)
    inter = tf.reshape(inter, [-1])

    starts = tf.math.multiply(runs, inter)
    lens = tf.stack((zeros, lens), axis = 1)
    lens = tf.reshape(lens, [-1])

    rles = tf.math.add(starts, lens)
    rles = tf.strings.as_string(rles)
    rles = tf.strings.reduce_join(rles, separator=' ')
    return rles

def rle2mask(rle, input_shape): 
    size = tf.math.reduce_prod(input_shape) 

    s = tf.strings.split(rle)
    s = tf.strings.to_number(s, tf.int32)

    starts = s[0::2] - 1
    lens = s[1::2]

    total_ones = tf.reduce_sum(lens)
    ones = tf.ones([total_ones], tf.int32)

    r = tf.range(total_ones)
    lens_cum = tf.math.cumsum(lens)
    s = tf.searchsorted(lens_cum, r, 'right')
    idx = r + tf.gather(starts - tf.pad(lens_cum[:-1], [(1, 0)]), s)

    mask_flat = tf.scatter_nd(tf.expand_dims(idx, 1), ones, [size])
    mask = tf.reshape(mask_flat, (input_shape[1], input_shape[0]))
    return tf.transpose(mask)

def build_mask(rle, input_shape = (height, width)):
    mask = rle2mask(rle, input_shape)
    mask = tf.expand_dims(mask, axis=2)
    mask = tf.reshape(mask, (*input_shape, 1))
    return mask


In [None]:
# Plot Utils (DICOM files)

axes_color = '#999999'
mpl.rcParams.update({'text.color' : "#999999", 'axes.labelcolor' : axes_color,
                     'font.size': 10, 'xtick.color':axes_color,'ytick.color':axes_color,
                     'axes.spines.top': False, 'axes.spines.right': False,
                     'axes.edgecolor': axes_color, 'axes.linewidth':1.0, 'figure.figsize':[8, 4]})

def plot_array(array):
    fig = plt.figure(figsize=(5,5))
    try:
        plt.imshow(array, alpha = 0.4, cmap = plt.cm.bone)
    except Exception:  # fall back for arrays that imshow cannot draw directly
        plt.imshow(tf.keras.preprocessing.image.array_to_img(array),
                   alpha = 0.4, cmap = plt.cm.bone)
    plt.axis('off')
    plt.show()
    return None

def plot_xray_and_mask_from_id(image_id):
    print('Image ID:', image_id)
    dcm_file_path = paths_dict[image_id]
    dicom_data = pydicom.dcmread(dcm_file_path)
    
    plt.figure(figsize=(16,8))
    plt.subplot(1,3,1)
    plt.imshow(dicom_data.pixel_array, cmap=plt.cm.bone)
    plt.title('X-Ray')
    plt.axis('off')

    mask = integrate_masks(image_id)
    plt.subplot(1,3,2)
    plt.imshow(mask, alpha = 0.4, cmap = 'bone')
    plt.title('Mask')
    plt.axis('off')
    
    plt.subplot(1,3,3)
    plt.imshow(dicom_data.pixel_array, cmap=plt.cm.bone)
    plt.imshow(mask, alpha = 0.4, cmap = plt.cm.bone)
    plt.title('X-Ray + Mask')
    plt.axis('off')
    plt.show()
    return None

idx = -1

### X-rays + Mask Examples (from DICOM files)

In [None]:
idx += 1
plot_xray_and_mask_from_id(ids_1[idx])

To transform the RLE

1. Build the mask array with the given encodings using the given decoding function:

    `mask = integrate_masks(ids_1[3])`

2. Convert the array to the new RLE encoding:

    `rle = mask2rle(mask)`
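
The second step can be sketched without TensorFlow as follows. This is only an illustration of the encoding rule, not the `mask2rle()` implementation used in this notebook; the name `encode_start_length_rle` is hypothetical, and the mask is a small hand-written example.

```python
def encode_start_length_rle(mask):
    """Encode a 2D 0/1 mask (nested lists) as a start-length RLE string.

    Pixels are read column by column (top to bottom, then left to right)
    and starts are 1-indexed.
    """
    height, width = len(mask), len(mask[0])
    flat = [mask[row][col] for col in range(width) for row in range(height)]

    pairs = []
    run_start = None
    for index, value in enumerate(flat + [0]):  # trailing 0 closes a run at the end
        if value and run_start is None:
            run_start = index
        elif not value and run_start is not None:
            pairs.append(f'{run_start + 1} {index - run_start}')
            run_start = None
    return ' '.join(pairs)


mask = [[1, 0, 0],
        [1, 0, 0],
        [1, 1, 0],
        [0, 1, 0]]
print(encode_start_length_rle(mask))  # prints: 1 3 7 2
```

Note that this sketch returns an empty string for an empty mask, whereas the TFRecs in this notebook store `1 0` for images without pneumothorax.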


To verify, compare the masks decoded from the original and the transformed RLEs

In [None]:
idx += 1
mask_from_orig = integrate_masks(ids_1[idx])
new_rle = mask2rle(mask_from_orig)
mask_from_new = rle2mask(new_rle, (width, height))

plt.figure(figsize=(10, 5))
plt.subplot(1,2,1)
plt.imshow(mask_from_orig, alpha = 0.5, cmap = 'bone')
plt.title('Mask from Original RLE')
plt.axis('off')

plt.subplot(1,2,2)
plt.imshow(mask_from_new, alpha = 0.5, cmap = 'bone')
plt.title('Mask from Transformed RLE')
plt.axis('off')

print('Equal?', np.sum(mask_from_orig == mask_from_new.numpy()) == width*height)

### Create a dictionary of the transformed RLEs for images with pneumothorax

- Train images with no pneumothorax will all be assigned `1 0` as RLE during TFRec creation. This is for ease of mask processing during the training pipeline.

- Test images will not include RLE.

- Train images with IDs not listed in the dataframe will be excluded from the TFRecs.

In [None]:
rle_dict = {}
for image_id in ids_1:  # avoid shadowing the built-in `id`
    mask_array = integrate_masks(image_id)
    rle_dict[image_id] = mask2rle(mask_array).numpy()

## Metadata from DICOM files

Create dictionaries for sex, age, and image sizes from the DICOM files to verify data consistency and completeness before adding to the TFRecords. This is done for train ids in the dataframe and test ids from test files.
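
One way to make these checks explicit is sketched below. The helper `is_complete` and the three sample records are hypothetical and only illustrate the idea; the expected values (`'M'` or `'F'`, and a 1024 x 1024 image) are the ones this notebook checks for.

```python
def is_complete(age, sex, shape, expected_shape=(1024, 1024)):
    """Return True if the metadata for one image passes the basic checks."""
    try:
        int(age)  # raw values may be strings; the conversion itself is the check
    except (TypeError, ValueError):
        return False
    return sex in ('M', 'F') and tuple(shape) == expected_shape


# Made-up records for illustration only.
records = {
    'img-a': ('58', 'M', (1024, 1024)),
    'img-b': ('n/a', 'F', (1024, 1024)),   # age cannot be parsed
    'img-c': ('41', 'O', (1024, 1024)),    # unexpected sex code
}
incomplete = [image_id for image_id, meta in records.items() if not is_complete(*meta)]
print(incomplete)  # prints: ['img-b', 'img-c']
```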

In [None]:
sexes, ages, sizes = {}, {}, {}
for image_id in list(ids_train) + file_ids_test:
    dcm_file_path = paths_dict[image_id]
    dicom_data = pydicom.dcmread(dcm_file_path)
    ages[image_id] = int(dicom_data.PatientAge)
    sexes[image_id] = dicom_data.PatientSex
    sizes[image_id] = dicom_data.pixel_array.shape

incomplete = []
for image_id in ids_train:
    if isinstance(ages[image_id], int) and sexes[image_id] in ['M', 'F'] and sizes[image_id] == (1024, 1024):
        pass
    else: incomplete.append(image_id)

print('Number of incomplete or inconsistent DICOM files: ', len(incomplete))

## Create TFRecords

In [None]:
def _bytestring_feature(list_of_bytestrings):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=list_of_bytestrings))

def _int_feature(list_of_ints): # int64
  return tf.train.Feature(int64_list=tf.train.Int64List(value=list_of_ints))

def image_bits_from_id(image_id):
    image = pydicom.dcmread(paths_dict[image_id]).pixel_array
    image = np.expand_dims(image, axis=2) # necessary since orig array is 2D, need 3D
    image = tf.constant(image)
    image_bits = tf.image.encode_jpeg(image, optimize_size=True, chroma_downsampling=False)
    image_bits = image_bits.numpy()
    return image_bits

def create_tfrec_example(image_id, diagnosis = 0):
    '''
    Creates a TFRecord example
    args: image_id: (str) the image id
          diagnosis: (int or None) one of 0, 1, or None. 1 for disease, 
              0 for no disease. None if unknown (i.e. test record)
    returns: tfrec example
    '''
    image = image_bits_from_id(image_id) 
    age = ages[image_id]
    sex = sexes[image_id]

    feature = {
        'image': _bytestring_feature([image]),
        'img_id': _bytestring_feature([image_id.encode()]),
        'sex': _bytestring_feature([sex.encode()]),
        'age': _int_feature([age]), 
        'width': _int_feature([1024]), 
        'height': _int_feature([1024])
        }
    
    if diagnosis is not None:
        if diagnosis: rle_bits = rle_dict[image_id] 
        else: rle_bits = '1 0'.encode()
        # no trailing commas here: they would wrap each feature in a tuple
        feature['label'] = _int_feature([diagnosis])
        feature['rle'] = _bytestring_feature([rle_bits])

    tfrec_example = tf.train.Example(features=tf.train.Features(feature=feature))
    return tfrec_example

# create_tfrec_example(ids_1[133], 1)

Create 20 TFRecs for each class. Each file then holds roughly 5% of its class, and the files can be grouped evenly into $K$ folds for $K \in \{4, 5, 10\}$.
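
A file-level $K$-fold split over 20 shards could look like the sketch below. The function `kfold_shards` and the placeholder file names are hypothetical; the real file names are generated further down in this notebook.

```python
def kfold_shards(shard_names, k):
    """Yield (train_shards, valid_shards) pairs for K-fold validation over files."""
    if len(shard_names) % k:
        raise ValueError('number of shards must be divisible by k')
    fold_size = len(shard_names) // k
    for fold in range(k):
        lo, hi = fold * fold_size, (fold + 1) * fold_size
        valid = shard_names[lo:hi]
        train = shard_names[:lo] + shard_names[hi:]
        yield train, valid


shards = [f'shard-{i:02d}.tfrec' for i in range(1, 21)]  # placeholder file names
for k in (4, 5, 10):
    folds = list(kfold_shards(shards, k))
    print(k, 'folds ->', len(folds[0][1]), 'validation shards per fold')
```

Splitting by file keeps each fold a plain list of file names, so no example-level filtering is needed to build the folds.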

Separate train TFRecs by label (disease, no-disease) for more flexible oversampling during training. Undersampling can be done using `dataset.filter()` in the dataset pipeline. 

Distribute the test files across 5 TFRecs.
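
The number of examples per file follows from simple arithmetic: every file gets the integer quotient, and the first few files absorb the remainder. The sketch below, with hypothetical helper names and an arbitrary item count, checks that this matches a round-robin assignment where item `i` goes to file `i % num_shards`.

```python
def shard_sizes(num_items, num_shards):
    """Split num_items as evenly as possible; the first shards absorb the remainder."""
    base, extra = divmod(num_items, num_shards)
    return [base + 1 if i < extra else base for i in range(num_shards)]


def round_robin_counts(num_items, num_shards):
    """Count the items each shard receives when item i goes to shard i % num_shards."""
    counts = [0] * num_shards
    for i in range(num_items):
        counts[i % num_shards] += 1
    return counts


# 103 is an arbitrary item count chosen for illustration.
print(shard_sizes(103, 20))
assert shard_sizes(103, 20) == round_robin_counts(103, 20)
```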

In [None]:
# NUM_TFRECS is the number of TFRecs to create for the given list of ids.
def get_tfrec_fnames(id_list, description, NUM_TFRECS):
    shard_object_count = [len(id_list) // NUM_TFRECS] * NUM_TFRECS

    for i in range(len(id_list)%NUM_TFRECS):  
        shard_object_count[i] = shard_object_count[i] + 1
    
    tf_record_output_filenames = ['penumothorax-1024x1024-{}-{:02d}-{}.tfrec'.format(
        description, idx+1, shard_object_count[idx]) for idx in range(NUM_TFRECS)]
    return tf_record_output_filenames

In [None]:
disease_tfrec_names = get_tfrec_fnames(ids_1, 'train-disease', NUM_TFRECS = 20)
no_disease_tfrec_names = get_tfrec_fnames(ids_0, 'train-no-disease', NUM_TFRECS = 20)
test_tfrec_names = get_tfrec_fnames(file_ids_test, 'test', NUM_TFRECS = 5)

disease_dict = {'ids': ids_1, 'tfrec_names': disease_tfrec_names, 'diagnosis': 1, 'NUM_TFRECS':20}
no_disease_dict = {'ids': ids_0, 'tfrec_names': no_disease_tfrec_names, 'diagnosis': 0, 'NUM_TFRECS':20}
test_dict = {'ids': file_ids_test, 'tfrec_names': test_tfrec_names, 'diagnosis': None, 'NUM_TFRECS':5}

In [None]:
def open_sharded_tfrecs(exit_stack, tfrec_names):
    return [exit_stack.enter_context(tf.io.TFRecordWriter(fname)) for fname in tfrec_names]

for d in [disease_dict, no_disease_dict, test_dict]:
    id_list = d['ids']
    tfrec_names = d['tfrec_names']
    diagnosis = d['diagnosis']
    NUM_TFRECS = d['NUM_TFRECS']

    with contextlib2.ExitStack() as tf_record_close_stack:
        output_tfrecords = open_sharded_tfrecs(tf_record_close_stack, tfrec_names)    

        for i, image_id in enumerate(id_list): 
            tf_record=create_tfrec_example(image_id, diagnosis)
            output_tfrecords[i%NUM_TFRECS].write(tf_record.SerializeToString())

## Verify TFRecords

In [None]:
# Parse TFRecs
def read_tfrecord(example, labeled = True):
    features = {
        'image': tf.io.FixedLenFeature([], tf.string), 
        'img_id': tf.io.FixedLenFeature([], tf.string), 
        'sex': tf.io.FixedLenFeature([], tf.string),
        'age': tf.io.FixedLenFeature([], tf.int64),
        'height': tf.io.FixedLenFeature([], tf.int64),
        'width': tf.io.FixedLenFeature([], tf.int64),
        }
    
    if labeled:
        features['label'] = tf.io.FixedLenFeature([], tf.int64)
        features['rle'] = tf.io.FixedLenFeature([], tf.string)

    example = tf.io.parse_single_example(example, features)

    image = example['image']
    image = tf.image.decode_jpeg(image, channels=N_CHANNELS)
    image = tf.cast(image, tf.float32) / 255.0  
    image = tf.reshape(image, [*IMAGE_SIZE, N_CHANNELS]) 
    img_id = example['img_id']
    
    if not labeled: return image, img_id
    else:
        rle = example['rle']
        mask = build_mask(rle)
        mask = tf.cast(mask, tf.float32)
        mask = tf.reshape(mask, [*IMAGE_SIZE, N_CLASSES]) 
        return image, mask

def get_dataset(filenames, labeled = True):
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)
    dataset = dataset.map(lambda x: read_tfrecord(x, labeled), num_parallel_calls=AUTO)
    return dataset

def parse_examples(dataset, n=20):
    dataset_examples = []
    for i, (image, item2) in enumerate(dataset.take(n)):
        dataset_examples.append((image, item2))
    return dataset_examples

In [None]:
# Plot Utils (for TFRecords)
def plot_xray_mask(img_mask_tuple):
    xray = tf.keras.preprocessing.image.array_to_img(img_mask_tuple[0])
    mask = tf.keras.preprocessing.image.array_to_img(img_mask_tuple[1])

    plt.figure(figsize=(16,8))
    plt.subplot(1,3,1)
    plt.imshow(xray, cmap=plt.cm.bone)
    plt.title('X-Ray')
    plt.axis('off')

    plt.subplot(1,3,2)
    plt.imshow(mask, alpha = 0.4, cmap = 'bone')
    plt.title('Mask')
    plt.axis('off')
    
    plt.subplot(1,3,3)
    plt.imshow(xray, cmap=plt.cm.bone)
    plt.imshow(mask, alpha = 0.4, cmap = plt.cm.bone)
    plt.title('X-Ray + Mask')
    plt.axis('off')
    plt.show()
    return None

**Visualize X-Rays with Disease**

In [None]:
ds = get_dataset(disease_tfrec_names[18:])
disease_examples = parse_examples(ds)

In [None]:
idx += 1
plot_xray_mask(disease_examples[idx])

**Visualize X-Rays with No Disease**

In [None]:
ds = get_dataset(no_disease_tfrec_names[15:])
no_disease_examples = parse_examples(ds)  

In [None]:
idx += 1
plot_xray_mask(no_disease_examples[idx])

**Visualize X-Rays from Test Data**

In [None]:
ds = get_dataset(test_tfrec_names[2:], labeled = False)
test_examples = parse_examples(ds)

In [None]:
idx += 1
test_example = test_examples[idx]
xray = tf.keras.preprocessing.image.array_to_img(test_example[0])
plt.figure(figsize=(5, 5))
plt.imshow(xray, cmap=plt.cm.bone)
plt.title('Test ID: {}'.format(test_example[1].numpy().decode()))
plt.axis('off'); plt.show()

In [None]:
!mkdir ./tfrecords
!mv ./*.tfrec tfrecords
!ls ./tfrecords

In [None]:
!rm -r ./tfrecords
!ls