# Building a Data Pipeline with TensorFlow

In this notebook I explain, and show by example with TensorFlow, why a data pipeline is important for an efficient neural network training loop.

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MultiLabelBinarizer

In [None]:
TEST_PATH = "../input/hpa-single-cell-image-classification/test"
TRAIN_PATH = "../input/hpa-single-cell-image-classification/train"
TRAIN_CSV = "../input/hpa-single-cell-image-classification/train.csv"
N_CLASSES = 19
SIZE = (512, 512)
BATCH_SIZE = 32

In [None]:
def parse_label(raw_label):
    '''Parse label to indicator vector'''
    
    label = list(map(int, raw_label.split('|')))
    label = mlb.transform([label])
    return np.squeeze(label).astype(np.int8)

def set_full_path(path):
    '''Return a function that prepends `path` to a filename'''
    def f(filename):
        return path + "/" + filename
    return f

In [None]:
# Load data
data = pd.read_csv(TRAIN_CSV)

# Create MultiLabelBinarizer to transform label into multilabel vector
mlb = MultiLabelBinarizer()
mlb.fit([range(N_CLASSES)])

# Convert each 'Label' string to an indicator vector
data['Label'] = data['Label'].apply(parse_label)
# Turn each 'ID' into a path prefix; the channel suffixes are appended in parse_function
data['ID'] = data['ID'].apply(set_full_path(TRAIN_PATH))

# Get filenames and labels
filenames, labels = data['ID'], data['Label']
labels = np.array(labels.values.tolist()).astype(np.int8)
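
To make the label format concrete, here is a minimal sketch of what the encoding produces, written in plain NumPy rather than with `MultiLabelBinarizer`. The helper name `multi_hot` and the label string `"0|5|16"` are assumptions chosen for illustration; they are not taken from `train.csv`.

```python
import numpy as np

N_CLASSES = 19


def multi_hot(raw_label, n_classes=N_CLASSES):
    """Turn a string such as '0|5|16' into a 0/1 vector of length n_classes."""
    indices = [int(part) for part in raw_label.split("|")]
    vector = np.zeros(n_classes, dtype=np.int8)
    vector[indices] = 1  # set a 1 at every class index named in the label
    return vector


example = multi_hot("0|5|16")  # illustrative label string (assumption)
print(example)  # [1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0]
```

Each row of `labels` should have this shape: one entry per class, with a 1 for every class present in the image.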

# Creating a Data Pipeline with tf.data

---

Why do we need a data pipeline?

Training a neural network (NN) on one batch can be divided into two parts:
1. Prepare the batch (read data from disk, apply data augmentation, etc.)
2. Feed the batch to the NN

To train quickly, we need to protect the GPU/TPU from [data starvation](https://en.wikipedia.org/wiki/Starvation_(computer_science)). In other words, the GPU/TPU shouldn't stand idle. Without a data pipeline, the GPU/TPU waits for the next batch every time a backpropagation step completes.
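
The effect of overlapping the two parts can be estimated with a toy timeline model. This is only a sketch under simplifying assumptions: every batch takes the same time to prepare and to train on, exactly one batch is prepared ahead, and the durations below are hypothetical values, not measurements.

```python
def sequential_time(prep, train, n_batches):
    """Prepare a batch, train on it, and only then start preparing the next one."""
    return n_batches * (prep + train)


def pipelined_time(prep, train, n_batches):
    """Prepare batch i+1 in the background while batch i is being trained on."""
    prep_done = prep   # the first batch cannot overlap with anything
    train_done = 0
    for _ in range(n_batches):
        # A step starts once its batch is ready and the accelerator is free
        train_start = max(prep_done, train_done)
        train_done = train_start + train
        # The next batch is prepared while this step is running
        prep_done = train_start + prep
    return train_done


# Hypothetical durations in milliseconds, chosen only for illustration
PREP_MS, TRAIN_MS, N_BATCHES = 30, 50, 10

print(sequential_time(PREP_MS, TRAIN_MS, N_BATCHES))  # 800
print(pipelined_time(PREP_MS, TRAIN_MS, N_BATCHES))   # 530
```

In this model the accelerator waits only for the first batch. After that, the total time is governed by the slower of the two stages instead of their sum.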

---

Without pipelining, the CPU and the GPU/TPU sit idle much of the time:

![Fig 1: Sequential execution frequently leaves the GPU idle](https://supportkb.dell.com/img/ka02R000000hGN0QAM/ka02R000000hGN0QAM_en_US_1.jpeg)

With pipelining, idle time diminishes significantly:

![Fig 2: Pipelining overlaps CPU and GPU utilization, maximizing GPU utilization](https://supportkb.dell.com/img/ka02R000000hGN0QAM/ka02R000000hGN0QAM_en_US_2.jpeg)

Source: [Optimization Techniques...](https://www.dell.com/support/kbdoc/ru-ua/000124384/optimization-techniques-for-training-chexnet-on-dell-c4140-with-nvidia-v100-gpus)

---

For this purpose, TensorFlow provides the <code>tf.data</code> module.

We apply the following steps for training:

1. Create <code>dataset</code> from <code>filenames</code> and <code>labels</code>
```
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
```
2. Shuffle the instances. Gradient descent works best when the training instances are independent and identically distributed. A shuffle buffer as large as the dataset gives a full uniform shuffle.
```
dataset = dataset.shuffle(len(filenames))
```
3. Load the images from their filenames. I pass <code>num_parallel_calls=tf.data.AUTOTUNE</code> so that the level of parallelism is tuned dynamically at runtime
```
dataset = dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE)
```
4. Apply data augmentation to the images.
```
dataset = dataset.map(train_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
```
5. Batch the images
```
dataset = dataset.batch(BATCH_SIZE)
```
6. Prefetch one batch, so that the next batch is prepared while the current one is being trained on. If the preprocessing time varies a lot, you may want to prefetch more than one batch.
```
dataset = dataset.prefetch(1)
```
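
The steps above can be tried out on synthetic data before pointing them at real files. The following is a minimal sketch, not the notebook's actual pipeline: the random in-memory arrays, the small sizes, and the `augment` function are stand-ins chosen for illustration, and the file-parsing step is skipped because the images are already in memory.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the real data: 8 random "images" with multi-hot labels
N_SAMPLES, N_CLASSES, BATCH_SIZE = 8, 19, 4
images = np.random.rand(N_SAMPLES, 16, 16, 3).astype(np.float32)
labels = np.random.randint(0, 2, size=(N_SAMPLES, N_CLASSES)).astype(np.int8)


def augment(image, label):
    """Random horizontal flip, standing in for the augmentation step."""
    return tf.image.random_flip_left_right(image), label


dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.shuffle(N_SAMPLES)  # buffer covers the whole dataset
dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(1)

# Iterate once over the dataset and record the shape of every batch
batch_shapes = [(x.shape.as_list(), y.shape.as_list()) for x, y in dataset]
for image_shape, label_shape in batch_shapes:
    print(image_shape, label_shape)
```

With 8 samples and a batch size of 4, one pass should yield two batches. Checking the shapes this way is a cheap sanity test before starting a long training run.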

In [None]:
# Wrapping this function in tf.function is optional: Dataset.map traces it into a graph anyway
@tf.function
def parse_function(filename, label):
    '''
    Load a 512x512x3 image Tensor and convert the label:
        - read the contents of 3 files (red, blue, green channels)
        - decode each PNG and resize it to SIZE
        - stack the 3 channels in RGB order
        - convert values to float32

        - convert label to Tensor
    '''
    red = tf.io.read_file(filename + "_red.png")
    blue = tf.io.read_file(filename + "_blue.png")
    green = tf.io.read_file(filename + "_green.png")
    
    red = tf.io.decode_png(red, channels=1)
    blue = tf.io.decode_png(blue, channels=1)
    green = tf.io.decode_png(green, channels=1)
    
    red = tf.image.resize(red, [*SIZE])
    blue = tf.image.resize(blue, [*SIZE])
    green = tf.image.resize(green, [*SIZE])
    
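    # Each channel has shape (H, W, 1); concatenating on the last axis gives (H, W, 3)
    image = tf.concat([red, green, blue], axis=-1)

    # tf.image.resize already returned float32, so this call does not rescale:
    # values stay in the decoded range (0-255 for 8-bit PNGs)
    image = tf.image.convert_image_dtype(image, tf.float32)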
    label = tf.convert_to_tensor(label, dtype=tf.int8)
    
    return image, label

@tf.function
def train_preprocess(image, label):
    '''
    Augment data:
        - random flip left/right
        - random flip up/down
    '''
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)

    return image, label

In [None]:
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames))
dataset = dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.map(train_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(1)