# Pulmonary Embolism - How to Create TFRecords

In this notebook, we create TFRecords that contain the basic information needed to train TensorFlow models.

### Overview
The images are axial CT slices that were pre-processed using three different intensity windows, one per color channel. Each image has a resolution of 512x512.

- RED channel / LUNG window / level=-600, width=1500
- GREEN channel / PE window / level=100, width=700
- BLUE channel / MEDIASTINAL window / level=40, width=400
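
As a rough illustration of what the three windows above mean, the sketch below clips a slice to `[level - width/2, level + width/2]` and rescales it to 0-255, one window per channel. It assumes the input is already in Hounsfield units; the names `apply_window` and `to_three_windows` are hypothetical, and this is not the exact pre-processing used to produce the images.

```python
import numpy as np

def apply_window(hu, level, width):
    """Clip to [level - width/2, level + width/2] and rescale to uint8 (0-255)."""
    lower = level - width / 2
    upper = level + width / 2
    clipped = np.clip(hu.astype(np.float32), lower, upper)
    return ((clipped - lower) / (upper - lower) * 255).astype(np.uint8)

def to_three_windows(hu):
    """Stack the lung, PE, and mediastinal windows as the R, G, and B channels."""
    return np.stack([
        apply_window(hu, level=-600, width=1500),  # RED: lung window
        apply_window(hu, level=100, width=700),    # GREEN: PE window
        apply_window(hu, level=40, width=400),     # BLUE: mediastinal window
    ], axis=-1)

# Example: a dummy 512x512 slice of water (0 HU) becomes a 512x512x3 uint8 image
rgb = to_three_windows(np.zeros((512, 512)))
```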



### Acknowledgements
These images were made using [Ian Pan's](https://www.kaggle.com/vaillant) pre-processing, which can be found [here](https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/discussion/182930).

In [None]:
# LOAD LIBRARIES
import numpy as np, pandas as pd, os
import matplotlib.pyplot as plt, cv2
import tensorflow as tf, re, math
from glob import glob
from tqdm import tqdm

In [None]:
df = pd.read_csv('../input/rsna-str-pulmonary-embolism-detection/train.csv')

In [None]:
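# glob() returns paths in arbitrary order; sort so that fold0[k] / fold1[k]
# selects the same batch directory (and the same file order) in every session.
fold0 = [sorted(glob(x + '/*.jpg')) for x in sorted(glob('../input/pe-train-512x512-fold-0-batch-*'))]
fold1 = [sorted(glob(x + '/*.jpg')) for x in sorted(glob('../input/pe-train-512x512-fold-1-batch-*'))]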

In [None]:
LABEL_COLUMNS = ['negative_exam_for_pe', 'indeterminate', 'chronic_pe', 'acute_and_chronic_pe', 'central_pe', 'leftsided_pe', 'rightsided_pe', 'rv_lv_ratio_gte_1', 'rv_lv_ratio_lt_1']

# Write TFRecords - Train
All the code below is adapted from TensorFlow's docs [here](https://www.tensorflow.org/tutorials/load_data/tfrecord).

In [None]:
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

In [None]:
def serialize_example(feature0, feature1, feature2):
  """Builds a serialized tf.train.Example.

  feature0: JPEG-encoded image bytes
  feature1: image name as bytes
  feature2: serialized label tensor
  """
  feature = {
      'image': _bytes_feature(feature0),
      'image_name': _bytes_feature(feature1),
      'target': _bytes_feature(feature2)
  }
  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
  return example_proto.SerializeToString()

In [None]:
def write_file(fold, tfrec_name='', file_size=100):
    """Write every image in `fold` to `{tfrec_name}.tfrec` (`file_size` is currently unused)."""
    with tf.io.TFRecordWriter(f'{tfrec_name}.tfrec') as writer:
        for k, image_path in enumerate(fold):
            # The file name without its extension is matched against SOPInstanceUID
            image_name = image_path.split('/')[-1].split('.')[0]
            target = df[df['SOPInstanceUID'] == image_name][LABEL_COLUMNS].values[0]
            img = cv2.imread(image_path)
            img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
            # Re-encode as JPEG; tobytes() replaces the deprecated ndarray.tostring()
            img = cv2.imencode('.jpg', img, (cv2.IMWRITE_JPEG_QUALITY, 94))[1].tobytes()

            example = serialize_example(
                img,
                str.encode(image_name),
                tf.io.serialize_tensor(np.array(target, dtype=np.uint8)))
            writer.write(example)

In [None]:
def shard_files(file_paths, k=0, f=0):
    file_size = 1000
    path_array = []
    for num, fp in enumerate(file_paths):
        path_array.append(fp)
        if (num+1)%file_size==0:
            print(f'Sharding files {num+1}')
            write_file(path_array, tfrec_name=f'train-{f}{k:02}-{(num+1)//file_size}', file_size=file_size)
            path_array = [] # reset the array
    # write remaining files
    if len(path_array) > 0:
        print('\nSharding remainder: ', len(path_array))
        write_file(path_array, tfrec_name=f'train-{f}{k:02}-{(num+1)//file_size+1}')

### Notes for Pulmonary Embolism TFRecord creation
1. Set `k` to a value from 0 to 19, where `k` selects the batch.
2. Once you have processed `k = 19` for `fold0`, change `fold0` to `fold1` and `f = 0` to `f = 1`.
3. You must complete 40 notebooks (20 batches x 2 folds) to create TFRecords for the whole Pulmonary Embolism dataset.
4. Once the notebook has finished executing, create a "new dataset" using the button below.
5. Make sure the dataset is public so that the TFRecords can be used from Kaggle's notebooks.
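
To check which output files a given run should leave behind, the naming scheme used by `shard_files` can be reproduced without writing anything. This is a small sketch; `shard_names` is a hypothetical helper that is not part of the notebook, and it assumes the default shard size of 1000 images.

```python
def shard_names(n_files, k=0, f=0, file_size=1000):
    """Return the .tfrec names shard_files would write for n_files input images."""
    n_full = n_files // file_size
    # Full shards are numbered 1..n_full
    names = [f'train-{f}{k:02}-{i}.tfrec' for i in range(1, n_full + 1)]
    # Any leftover images go into one extra, smaller shard
    if n_files % file_size:
        names.append(f'train-{f}{k:02}-{n_full + 1}.tfrec')
    return names

# Example: fold 1, batch 3, 2500 images
print(shard_names(2500, k=3, f=1))
# ['train-103-1.tfrec', 'train-103-2.tfrec', 'train-103-3.tfrec']
```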

In [None]:
%%time

k = 0
f = 0
shard_files(fold0[k], k, f)
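
Once the files are written, they can be read back with `tf.data`. The sketch below mirrors the three features stored by `serialize_example` (`image`, `image_name`, `target`) and assumes the target was written with `tf.io.serialize_tensor` as a `uint8` vector of 9 labels. The names `FEATURE_SPEC`, `NUM_LABELS`, `parse_example`, and `load_dataset`, as well as the default file pattern, are hypothetical.

```python
import tensorflow as tf

# All three features were stored as single byte strings
FEATURE_SPEC = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'image_name': tf.io.FixedLenFeature([], tf.string),
    'target': tf.io.FixedLenFeature([], tf.string),
}
NUM_LABELS = 9  # assumed to equal len(LABEL_COLUMNS)

def parse_example(serialized):
    """Decode one serialized Example into (image, image_name, target)."""
    ex = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    image = tf.io.decode_jpeg(ex['image'], channels=3)
    # The target was written with tf.io.serialize_tensor as uint8
    target = tf.io.parse_tensor(ex['target'], out_type=tf.uint8)
    target = tf.ensure_shape(target, [NUM_LABELS])
    return image, ex['image_name'], target

def load_dataset(pattern='train-*.tfrec'):
    """Build a dataset of parsed examples from all files matching the pattern."""
    files = tf.io.gfile.glob(pattern)
    return tf.data.TFRecordDataset(files).map(parse_example)
```

Iterating over `load_dataset()` then yields one `(image, image_name, target)` tuple per stored slice; batching, shuffling, and augmentation can be added on top as needed.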