
### Overview

In [1]:
!pip install --upgrade tensorflow
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

### Structure of an input pipeline

> A typical TensorFlow training input pipeline can be framed as an ETL process:

1. **Extract:** Read data from memory (NumPy) or persistent storage -- either local (HDD or SSD) or remote (e.g. GCS or HDFS).

2. **Transform:** Use CPU to parse and perform preprocessing operations on the data such as shuffling, batching, and domain specific transformations such as image decompression and augmentation, text vectorization, or video temporal sampling.

3. **Load:** Load the transformed data onto the accelerator device(s) (e.g. GPU(s) or TPU(s)) that execute the machine learning model.

In [0]:
def parse_fn(example):
  """Parse TFExample records and perform simple data augmentation."""
  example_fmt = {
    "image": tf.io.FixedLenFeature((), tf.string, ""),
    "label": tf.io.FixedLenFeature((), tf.int64, -1)
  }
  parsed = tf.io.parse_single_example(example, example_fmt)
  image = tf.io.decode_image(parsed["image"])
  # _augment_helper is a user-defined function that is not shown here;
  # it augments the image using slice, reshape, resize_bilinear.
  image = _augment_helper(image)
  return image, parsed["label"]

def make_dataset():
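  # TFRecordDataset does not expand glob patterns, so resolve the file list first.
  files = tf.data.Dataset.list_files("/path/to/dataset/train-*.tfrecord")
  dataset = tf.data.TFRecordDataset(files)
  # FLAGS is assumed to be defined elsewhere (for example via absl.flags).
  dataset = dataset.shuffle(buffer_size=FLAGS.shuffle_buffer_size)
  dataset = dataset.map(map_func=parse_fn)
  dataset = dataset.batch(batch_size=FLAGS.batch_size)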
  return dataset

In [4]:
tf.config.experimental.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU')]

In [5]:
tf.test.is_gpu_available()

False

### Optimizing Performance

#### Pipelining

In [0]:
### 
# dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
###
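
The commented line above cannot run on its own because `dataset` is not defined in this notebook. As a minimal sketch, assuming a small in-memory dataset in place of the real TFRecord input, prefetch can be appended to the end of a pipeline like this:

```python
import tensorflow as tf

# Toy in-memory dataset standing in for the real input files (an assumption).
dataset = tf.data.Dataset.range(8)
dataset = dataset.map(lambda x: x * x)
dataset = dataset.batch(4)
# Prefetch lets the pipeline prepare upcoming batches while the current one
# is being consumed; AUTOTUNE leaves the buffer size to the tf.data runtime.
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

batches = [batch.numpy().tolist() for batch in dataset]
print(batches)
```

Prefetching changes only when elements are produced, not which elements are produced or in what order.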

#### Parallelize data transformation

In [0]:
###
# dataset = dataset.map(map_func=parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
###
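
A runnable sketch of a parallel map, assuming a toy dataset and a hypothetical `double_fn` in place of `parse_fn` (which needs real TFRecord files):

```python
import tensorflow as tf

def double_fn(x):
  # Hypothetical stand-in for a costly per-element transformation.
  return x * 2

dataset = tf.data.Dataset.range(5)
# num_parallel_calls lets the runtime process several elements concurrently;
# AUTOTUNE delegates the level of parallelism to tf.data.
dataset = dataset.map(
    map_func=double_fn,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

values = [int(v.numpy()) for v in dataset]
print(values)
```

By default the parallel map still yields elements in input order, so the result matches the sequential version.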

#### Parallelize data extraction

In [0]:
###
# files = tf.data.Dataset.list_files("/path/to/dataset/train-*.tfrecord")
# dataset = files.interleave(
#     tf.data.TFRecordDataset, cycle_length=FLAGS.num_parallel_reads,
#     num_parallel_calls=tf.data.experimental.AUTOTUNE)
###
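
The same interleave pattern can be tried without any files on disk. In this sketch the "files" are just integer ids and `read_fake_file` is a hypothetical stand-in for `tf.data.TFRecordDataset`:

```python
import tensorflow as tf

# Three hypothetical "files", each represented only by its integer id.
file_ids = tf.data.Dataset.range(3)

def read_fake_file(file_id):
  # Stand-in for tf.data.TFRecordDataset(filename):
  # each "file" yields two records that carry the file id.
  return tf.data.Dataset.from_tensors(file_id).repeat(2)

dataset = file_ids.interleave(
    read_fake_file,
    cycle_length=3,  # number of "files" read concurrently
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

records = [int(r.numpy()) for r in dataset]
print(records)
```

Each source contributes all of its records; interleave only changes how records from different sources are mixed.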

### Performance considerations

The tf.data API is designed around composable transformations to provide its users with flexibility. Although many of these transformations are commutative, the ordering of certain transformations has performance implications.

#### Map and batch

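If the user-defined function passed into the map transformation does little work, the overhead of scheduling and executing it can dominate the total cost. In that case, vectorize the function (that is, have it operate on a batch of inputs at once) and apply the batch transformation before the map transformation.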
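
A small sketch of the two orderings, assuming a hypothetical cheap function `increment` that works unchanged on a scalar or on a whole batch:

```python
import tensorflow as tf

def increment(x):
  # Cheap element-wise function; broadcasting makes it valid for a batch too.
  return x + 1

# map before batch: the function is invoked once per element (6 calls).
per_element = tf.data.Dataset.range(6).map(increment).batch(3)
# batch before map: the function is invoked once per batch (2 calls).
vectorized = tf.data.Dataset.range(6).batch(3).map(increment)

per_element_out = [t.numpy().tolist() for t in per_element]
vectorized_out = [t.numpy().tolist() for t in vectorized]
print(per_element_out, vectorized_out)
```

Both pipelines produce the same batches; the vectorized one simply pays the per-call overhead fewer times.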

#### Map and cache

If the user-defined function passed into the map transformation is expensive, apply the cache transformation after the map transformation as long as the resulting dataset can still fit into memory or local storage. If the user-defined function increases the space required to store the dataset beyond the cache capacity, consider pre-processing your data before your training job to reduce resource usage.
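
To see the effect of caching after map, the sketch below counts how often a hypothetical Python function `expensive_py` runs. Wrapping it in `tf.py_function` is done here only so the call counter is visible; it is not part of the recommendation itself.

```python
import tensorflow as tf

call_count = {"n": 0}

def expensive_py(x):
  # Hypothetical costly transformation; counts its own invocations.
  call_count["n"] += 1
  return x * 2

def expensive_fn(x):
  return tf.py_function(expensive_py, [x], tf.int64)

# cache() comes after map(), so the mapped results are stored during the
# first epoch and reused in the second one.
dataset = tf.data.Dataset.range(4).map(expensive_fn).cache().repeat(2)

elements = [int(v.numpy()) for v in dataset]
print(elements, call_count["n"])
```

With the cache placed before the map instead, the function would run again for every element in every epoch.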

#### Map and interleave / prefetch / shuffle

A number of transformations, including interleave, prefetch, and shuffle, maintain an internal buffer of elements. If the user-defined function passed into the map transformation changes the size of the elements, then the ordering of the map transformation and the transformations that buffer elements affects the memory usage. In general, we recommend choosing the order that results in lower memory footprint, unless different ordering is desirable for performance (for example, to enable fusing of the map and batch transformations).
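
As an illustration, assume a hypothetical `expand` function that turns each scalar into a 1000-element vector. Placing shuffle before the map means the shuffle buffer holds small scalars rather than the expanded vectors:

```python
import tensorflow as tf

def expand(x):
  # Hypothetical size-increasing transformation: scalar -> 1000-element vector.
  return tf.fill([1000], x)

dataset = tf.data.Dataset.range(100)
dataset = dataset.shuffle(buffer_size=50)  # buffer holds 50 scalars
dataset = dataset.map(expand)              # expansion happens after buffering

rows = [v.numpy() for v in dataset]
print(len(rows), rows[0].shape)
```

Swapping the two transformations would yield the same set of elements, but the shuffle buffer would then hold 50 vectors of 1000 values each.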

#### Shuffle and repeat

If the repeat transformation is applied before the shuffle transformation, then the epoch boundaries are blurred. That is, certain elements can be repeated before other elements appear even once. On the other hand, if the shuffle transformation is applied before the repeat transformation, then performance might degrade at the beginning of each epoch while the shuffle transformation re-initializes its internal state. In other words, the former (repeat before shuffle) provides better performance, while the latter (shuffle before repeat) provides stronger ordering guarantees.
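
The ordering guarantee of shuffle before repeat can be checked on a toy dataset (an assumption; any finite dataset would do): every element appears exactly once in each epoch.

```python
import tensorflow as tf

# shuffle before repeat: each epoch is a full permutation of the data.
dataset = tf.data.Dataset.range(5).shuffle(buffer_size=5).repeat(2)

elements = [int(v.numpy()) for v in dataset]
first_epoch, second_epoch = elements[:5], elements[5:]
print(first_epoch, second_epoch)
```

With `repeat(2)` placed before `shuffle`, the first five elements could contain duplicates, because elements from both epochs share one shuffle buffer.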

### Best Practices using tf.data

> Here is a summary of the best practices for designing performant TensorFlow input pipelines:

1. Use the prefetch transformation to overlap the work of a producer and consumer. In particular, we recommend adding prefetch to the end of your input pipeline to overlap the transformations performed on the CPU with the training done on the accelerator. Either manually tune the buffer size, or use tf.data.experimental.AUTOTUNE to delegate the decision to the tf.data runtime.

2. Parallelize the map transformation by setting the num_parallel_calls argument. Either manually tune the level of parallelism, or use tf.data.experimental.AUTOTUNE to delegate the decision to the tf.data runtime.

3. If your data is stored remotely and/or requires deserialization, we recommend using the interleave transformation to parallelize the reading (and deserialization) of data from different files.

4. Vectorize cheap user-defined functions passed into the map transformation to amortize the overhead associated with scheduling and executing the function.

5. If your data can fit into memory, use the cache transformation to cache it in memory during the first epoch, so that subsequent epochs can avoid the overhead associated with reading, parsing, and transforming it.

6. If your pre-processing increases the size of your data, we recommend applying the interleave, prefetch, and shuffle transformations before it (if possible) to reduce memory usage.

7. We recommend applying the shuffle transformation before the repeat transformation.
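
Several of these recommendations can be combined in one pipeline. The following is only a sketch on a toy in-memory dataset, with a hypothetical `preprocess` function; a real pipeline would read from files and choose buffer and batch sizes to fit its data.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def preprocess(x):
  # Hypothetical stand-in for parsing and augmentation.
  return x * 2

dataset = (
    tf.data.Dataset.range(10)
    .map(preprocess, num_parallel_calls=AUTOTUNE)  # parallel transformation
    .cache()                  # keep the mapped data in memory after epoch 1
    .shuffle(buffer_size=10)  # shuffle before repeat
    .repeat(2)
    .batch(5)
    .prefetch(AUTOTUNE))      # prefetch at the end of the pipeline

batches = [b.numpy().tolist() for b in dataset]
print(batches)
```

Because the data is shuffled, the content of each batch varies between runs, but the first two batches together always cover one full epoch.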