In [1]:
import tensorflow as tf
tf.__version__
#tf.enable_eager_execution()

tf.estimator package not installed.


'1.12.0'

The **`tf.data`** API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. <br/><br/>The pipeline for a text model might involve extracting symbols from raw text data, converting them to __embedding identifiers__ with a lookup table, and batching together sequences of different lengths. The **`tf.data`** API makes it easy to deal with large amounts of data, different data formats, and complicated transformations.

The [**`tf.data`**](https://www.tensorflow.org/api_docs/python/tf/data) API introduces two new abstractions to TensorFlow:

1. <font color=blue>A [**`tf.data.Dataset`**](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) represents a sequence of elements, in which each element contains one or more [**`Tensor`**](https://www.tensorflow.org/api_docs/python/tf/Tensor) objects.</font> For example, in an image pipeline, an element might be a single training example, with a pair of tensors representing the image data and a label. There are two distinct ways to create a dataset:

    * <font color=blue>Creating a <font color=green>source</font> (e.g. **`Dataset.from_tensor_slices()`**) constructs a dataset from one or more **`tf.Tensor`** objects.</font>

    * <font color=blue>Applying a <font color=green>transformation</font> (e.g. **`Dataset.batch()`**) constructs a dataset from one or more **`tf.data.Dataset`** objects.</font>

2. <font color=blue>A [**`tf.data.Iterator`**](https://www.tensorflow.org/api_docs/python/tf/data/Iterator) provides the main way to extract elements from a dataset. The operation returned by **`Iterator.get_next()`** yields the next element of a **`Dataset`** when executed</font>, and <font color=red>typically acts as the interface between input pipeline code and your model.</font> <br/><br/>The simplest iterator is a "<font color=green>one-shot iterator</font>", <font color=blue>which is associated with a particular **`Dataset`** and iterates through it once.</font> For more sophisticated uses, <font color=blue>the **`Iterator.initializer`** operation enables you to reinitialize and parameterize an iterator with different datasets</font>, so that you can, for example, iterate over training and validation data multiple times in the same program.
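As a minimal sketch of how the two abstractions fit together, the snippet below builds a source, applies one transformation, and reads two batches through a one-shot iterator. The `tf1` alias and the eager-execution guard are assumptions added here so the same graph-mode code can also run on TensorFlow 2.x; on 1.12 `tf1` is simply `tf`. The tensor values are made up for illustration.

```python
import tensorflow as tf

# Assumption: on TensorFlow 2.x, use the 1.x-style graph API exposed under
# `tf.compat.v1`; on 1.12 the `tf` module itself is used.
tf1 = getattr(tf.compat, "v1", tf)
if hasattr(tf1, "disable_eager_execution"):
    tf1.disable_eager_execution()

# Source: one dataset element per row of the in-memory tensor.
features = tf.constant([[1, 2], [3, 4], [5, 6], [7, 8]])
dataset = tf1.data.Dataset.from_tensor_slices(features)

# Transformation: group consecutive elements into batches of two.
dataset = dataset.batch(2)

# Iterator: `get_next()` returns a symbolic tensor; each `sess.run` call
# advances the iterator and yields the next batch.
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf1.Session() as sess:
    first = sess.run(next_batch)   # rows 0 and 1
    second = sess.run(next_batch)  # rows 2 and 3
```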

# Basic mechanics

This section of the guide describes the fundamentals of creating different kinds of **`Dataset`** and **`Iterator`** objects, and how to extract data from them.

<font color=blue>To start an input pipeline, you must define a <font color=green>source</font></font>. For example, <font color=blue>to construct a **`Dataset`** from some tensors in memory</font>, you can use **`tf.data.Dataset.from_tensors()`** or **`tf.data.Dataset.from_tensor_slices()`**. Alternatively, <font color=blue>if your input data are on disk in the recommended <font color=green>TFRecord format</font>, you can construct a [**`tf.data.TFRecordDataset`**](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset).</font>
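The two in-memory sources differ in how they treat the first dimension. The sketch below, using an illustrative constant tensor, compares the element shapes they report. The `tf1` alias is an assumption that lets the graph-mode code run on TensorFlow 2.x as well as 1.12.

```python
import tensorflow as tf

# Assumption: on TensorFlow 2.x, use the 1.x-style graph API exposed under
# `tf.compat.v1`; on 1.12 the `tf` module itself is used.
tf1 = getattr(tf.compat, "v1", tf)
if hasattr(tf1, "disable_eager_execution"):
    tf1.disable_eager_execution()

data = tf.constant([[1, 2, 3], [4, 5, 6]])

# `from_tensors` wraps the whole tensor as a single element.
whole = tf1.data.Dataset.from_tensors(data)

# `from_tensor_slices` slices along the first dimension, giving one element
# per row.
sliced = tf1.data.Dataset.from_tensor_slices(data)

whole_shape = whole.output_shapes.as_list()    # [2, 3]
sliced_shape = sliced.output_shapes.as_list()  # [3]
```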

<font color=blue>Once you have a **`Dataset`** object, you can <font color=green>transform</font> it into a new **`Dataset`** by chaining method calls on the **`tf.data.Dataset`** object.</font> For example, <font color=blue>you can apply per-element transformations such as **`Dataset.map()`** (to apply a function to each element), and multi-element transformations such as **`Dataset.batch()`**.</font> See the documentation for [**`tf.data.Dataset`**](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) for a complete list of transformations.

<font color=blue>The most common way to consume values from a **`Dataset`** is to make an iterator object that provides access to one element of the dataset at a time (for example, by calling **`Dataset.make_one_shot_iterator()`**).</font> <br/><br/>A **`tf.data.Iterator`** provides two operations: 
* <font color=blue>**`Iterator.initializer`**, which enables you to (re)initialize the iterator's state; </font>
* <font color=blue>**`Iterator.get_next()`**, which returns **`tf.Tensor`** objects that correspond to the symbolic next element.</font> Depending on your use case, you might choose a different type of iterator, and the options are outlined below.

### <font color=gray>Dataset structure</font>

<font color=green>Dataset</font>: <font color=blue>A dataset comprises elements that each have the same structure.</font> <br/><br/>
<font color=green>Element</font>: <font color=blue>an element contains one or more **`tf.Tensor`** objects, called <font color=green>components</font></font>. <br/><br/>
<font color=green>Component</font>: <font color=blue>each component has a **`tf.DType`** representing the type of elements in the tensor, and a **`tf.TensorShape`** representing the (possibly partially specified) static shape of each element.</font> <br/><br/>The **`Dataset.output_types`** and **`Dataset.output_shapes`** properties <font color=blue>allow you to inspect the inferred types and shapes of each <font color=green>component</font> of a dataset <font color=green>element</font>.</font> <br/><br/>The <font color=green>nested structure</font> of these properties <font color=blue>maps to the structure of an element, which may be a single tensor, a tuple of tensors, or a nested tuple of tensors.</font> <br/><br/>For example:

In [4]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
print(dataset1.output_types)  # ==> "tf.float32"
print(dataset1.output_shapes)  # ==> "(10,)"
print()

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random_uniform([4]),
    tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types)  # ==> "(tf.float32, tf.int32)"
print(dataset2.output_shapes)  # ==> "((), (100,))"
print()

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types)  # ==> "(tf.float32, (tf.float32, tf.int32))"
print(dataset3.output_shapes)  # ==> "((10,), ((), (100,)))"

<dtype: 'float32'>
(10,)

(tf.float32, tf.int32)
(TensorShape([]), TensorShape([Dimension(100)]))

(tf.float32, (tf.float32, tf.int32))
(TensorShape([Dimension(10)]), (TensorShape([]), TensorShape([Dimension(100)])))
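The shapes above are all fully defined. To illustrate a partially specified static shape, the sketch below batches a four-element dataset in groups of three: the final batch may be smaller, so the batch dimension is reported as unknown unless incomplete batches are dropped. The `tf1` alias is an assumption for running on TensorFlow 2.x, and the `drop_remainder` argument is assumed to be available (TensorFlow 1.10 or later).

```python
import tensorflow as tf

# Assumption: on TensorFlow 2.x, use the 1.x-style graph API exposed under
# `tf.compat.v1`; on 1.12 the `tf` module itself is used.
tf1 = getattr(tf.compat, "v1", tf)
if hasattr(tf1, "disable_eager_execution"):
    tf1.disable_eager_execution()

dataset = tf1.data.Dataset.from_tensor_slices(tf.zeros([4, 10]))

# The last batch holds only one element, so the batch size is not static.
partial = dataset.batch(3)

# Dropping the incomplete final batch makes the batch size static.
full = dataset.batch(3, drop_remainder=True)

partial_shape = partial.output_shapes.as_list()  # unknown leading dimension
full_shape = full.output_shapes.as_list()        # fully defined
```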


It is often convenient to give names to each <font color=green>component</font> of an <font color=green>element</font>, for example if they represent different features of a training example. <br/><br/><font color=blue>In addition to tuples, you can use **`collections.namedtuple`** or a **`dictionary`** <font color=red>_mapping strings to tensors_</font> to represent a single <font color=green>element</font> of a **`Dataset`**.</font>

In [3]:
dataset = tf.data.Dataset.from_tensor_slices(
   {"a": tf.random_uniform([4]),
    "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types)  # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes)  # ==> "{'a': (), 'b': (100,)}"

{'a': tf.float32, 'b': tf.int32}
{'a': TensorShape([]), 'b': TensorShape([Dimension(100)])}
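The same idea works with **`collections.namedtuple`**. The following is a minimal sketch; the `Example` type, its field names, and the values are hypothetical, and the `tf1` alias is an assumption for running on TensorFlow 2.x.

```python
import collections

import tensorflow as tf

# Assumption: on TensorFlow 2.x, use the 1.x-style graph API exposed under
# `tf.compat.v1`; on 1.12 the `tf` module itself is used.
tf1 = getattr(tf.compat, "v1", tf)
if hasattr(tf1, "disable_eager_execution"):
    tf1.disable_eager_execution()

# Hypothetical element type with two named components.
Example = collections.namedtuple("Example", ["feature", "label"])

dataset = tf1.data.Dataset.from_tensor_slices(
    Example(feature=tf.constant([0.1, 0.2, 0.3]),
            label=tf.constant([0, 1, 0])))

# The reported types mirror the namedtuple structure.
types = dataset.output_types

next_element = dataset.make_one_shot_iterator().get_next()
with tf1.Session() as sess:
    # Fetching a namedtuple of tensors returns a namedtuple of values.
    first = sess.run(next_element)
```

Components can then be read by name, for example `first.label`, instead of by position.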


<font color=blue>The **`Dataset`** <font color=green>_transformations_</font> support datasets of any structure. <br/><br/>When using the **`Dataset.map()`**, **`Dataset.flat_map()`**, and **`Dataset.filter()`** <font color=green>_transformations_</font>, which apply a function to each <font color=green>_element_</font>, the <font color=green>_element_</font> structure determines the arguments of the function</font>:

In [None]:
dataset1 = dataset1.map(lambda x: ...)

dataset2 = dataset2.flat_map(lambda x, y: ...)

# Note: Tuple-parameter unpacking such as `lambda x, (y, z): ...` is a syntax
# error in Python 3, so accept the nested tuple as one argument and index it.
dataset3 = dataset3.filter(lambda x, yz: ...)
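For a runnable version of this pattern, the sketch below uses small constant datasets (the names `a`, `b`, `c` and the `collect` helper are hypothetical, and the `tf1` alias is an assumption for running on TensorFlow 2.x). It shows how a single-tensor element becomes one argument, a tuple becomes two, and a nested tuple is received as one argument that is indexed inside the function.

```python
import tensorflow as tf

# Assumption: on TensorFlow 2.x, use the 1.x-style graph API exposed under
# `tf.compat.v1`; on 1.12 the `tf` module itself is used.
tf1 = getattr(tf.compat, "v1", tf)
if hasattr(tf1, "disable_eager_execution"):
    tf1.disable_eager_execution()


def collect(dataset):
    """Hypothetical helper: run a dataset to exhaustion and return its elements."""
    next_element = dataset.make_one_shot_iterator().get_next()
    values = []
    with tf1.Session() as sess:
        try:
            while True:
                values.append(sess.run(next_element))
        except tf.errors.OutOfRangeError:
            pass
    return values


a = tf1.data.Dataset.from_tensor_slices(tf.constant([1, 2, 3, 4]))
b = tf1.data.Dataset.from_tensor_slices(
    (tf.constant([10, 20, 30, 40]), tf.constant([0, 1, 0, 1])))
c = tf1.data.Dataset.zip((a, b))

# Single tensor per element: the function takes one argument.
doubled = [int(v) for v in collect(a.map(lambda x: x * 2))]

# Tuple of two tensors: the function takes two arguments. `flat_map` must
# return a dataset, here one that yields `x` and then `y`.
flattened = [int(v) for v in collect(b.flat_map(
    lambda x, y: tf1.data.Dataset.from_tensor_slices(tf.stack([x, y]))))]

# Nested tuple (x, (y, z)): the inner tuple arrives as a single argument.
kept = [(int(x), (int(y), int(z)))
        for x, (y, z) in collect(c.filter(lambda x, yz: tf.equal(yz[1], 1)))]
```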

### <font color=gray>Creating an iterator</font>

Once you have built a **`Dataset`** to represent your input data, the next step is to create an **`Iterator`** to <font color=blue>access elements from that dataset.</font> The **`tf.data`** API currently supports the following iterators, in increasing level of sophistication:
<font color=green>
* **one-shot iterator**
* **initializable iterator**
* **reinitializable iterator**
* **feedable iterator**</font>

### <font color=green>◎ one-shot iterator</font>

A <font color=green>**one-shot iterator**</font> is the simplest form of iterator, which only <font color=blue>supports iterating once through a dataset</font>, <font color=red>with no need for explicit initialization.</font> <br/><br/><font color=blue><font color=green>**One-shot iterators**</font> handle almost all of the cases that the existing queue-based input pipelines support,</font> <font color=red>but they do not support _*parameterization*_.</font> <br/><br/>Using the example of **`Dataset.range()`**:

In [13]:
with tf.Session() as sess:
    dataset = tf.data.Dataset.range(100)
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()
    
    result = []
    for i in range(100):
        value = sess.run(next_element)
        result.append(str(value))
        assert i == value
    print('result:', ','.join(result))
        

result: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
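The loop above relies on knowing the dataset size in advance. As a hedged alternative, the sketch below reads until the iterator signals exhaustion by raising **`tf.errors.OutOfRangeError`**, which is what happens when `get_next()` is run past the last element. The `tf1` alias is an assumption for running on TensorFlow 2.x.

```python
import tensorflow as tf

# Assumption: on TensorFlow 2.x, use the 1.x-style graph API exposed under
# `tf.compat.v1`; on 1.12 the `tf` module itself is used.
tf1 = getattr(tf.compat, "v1", tf)
if hasattr(tf1, "disable_eager_execution"):
    tf1.disable_eager_execution()

dataset = tf1.data.Dataset.range(3)
next_element = dataset.make_one_shot_iterator().get_next()

values = []
with tf1.Session() as sess:
    while True:
        try:
            values.append(int(sess.run(next_element)))
        except tf.errors.OutOfRangeError:
            # The one-shot iterator is exhausted and cannot be reinitialized.
            break
```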


### <font color=green>◎ initializable iterator</font>

An <font color=green>**initializable iterator**</font> <font color=blue>requires you to run an explicit **`iterator.initializer`** operation before using it</font>. <br/><br/>In exchange for this inconvenience, <font color=blue>it enables you to <font color=red>_parameterize_</font> the definition of the dataset, using one or more **`tf.placeholder()`** tensors that can be fed when you initialize the iterator.</font> <br/><br/>Continuing the **`Dataset.range()`** example:

In [4]:
max_value    = tf.placeholder(tf.int64, shape=[])
dataset      = tf.data.Dataset.range(max_value)
iterator     = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    result = []
    # Initialize an iterator over a dataset with 5 elements.
    sess.run(iterator.initializer, feed_dict={max_value: 5})
    for i in range(5):
        value = sess.run(next_element)
        result.append(str(value))
        assert i == value
    print('5 elements: ', ','.join(result))
    result.clear()
    
    # Initialize the same iterator over a dataset with 10 elements.
    sess.run(iterator.initializer, feed_dict={max_value: 10})
    for i in range(10):
        value = sess.run(next_element)
        result.append(str(value))
        assert i == value
    print('10 elements: ', ','.join(result))

5 elements:  0,1,2,3,4
10 elements:  0,1,2,3,4,5,6,7,8,9
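Placeholders can parameterize more than a scalar. As a sketch under the same mechanism, the example below feeds an entire NumPy array into the dataset definition when the iterator is initialized. The array contents and the `features_ph` name are made up for illustration, and the `tf1` alias is an assumption for running on TensorFlow 2.x.

```python
import numpy as np
import tensorflow as tf

# Assumption: on TensorFlow 2.x, use the 1.x-style graph API exposed under
# `tf.compat.v1`; on 1.12 the `tf` module itself is used.
tf1 = getattr(tf.compat, "v1", tf)
if hasattr(tf1, "disable_eager_execution"):
    tf1.disable_eager_execution()

# Illustrative data: three rows with two features each.
features = np.arange(6, dtype=np.float32).reshape(3, 2)

# The number of rows is left unspecified so arrays of any length can be fed.
features_ph = tf1.placeholder(tf.float32, shape=[None, 2])
dataset = tf1.data.Dataset.from_tensor_slices(features_ph)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf1.Session() as sess:
    # The array is supplied only at initialization time.
    sess.run(iterator.initializer, feed_dict={features_ph: features})
    rows = [sess.run(next_element).tolist() for _ in range(3)]
```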


### <font color=green>◎ reinitializable iterator</font>

<font color=blue>A <font color=green>**reinitializable iterator**</font> can be initialized from multiple different **`Dataset`** objects.</font> For example, you might have a <font color=blue>training input pipeline</font> that uses _random perturbations_ to the input images to improve generalization, and a <font color=blue>validation input pipeline</font> that evaluates predictions on unmodified data. <br/><br/>These pipelines will typically use different **`Dataset`** objects that have the same structure (i.e. the same types and compatible shapes for each component).

In [10]:
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(30).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64))
validation_dataset = tf.data.Dataset.range(10)

# A reinitializable iterator is defined by its structure. We could use the
# `output_types` and `output_shapes` properties of either `training_dataset`
# or `validation_dataset` here, because they are compatible.
iterator = tf.data.Iterator.from_structure(
    training_dataset.output_types, 
    training_dataset.output_shapes
)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

with tf.Session() as sess:
    # Run 3 epochs in which the training dataset is traversed, followed by the
    # validation dataset.
    for i in range(3):
        result = []
        # Initialize an iterator over the training dataset.
        sess.run(training_init_op)
        for _ in range(30):
            value = sess.run(next_element)
            result.append(str(value))
        print('[{}] train_set: {}'.format(i, ','.join(result)))
        result.clear()

        # Initialize an iterator over the validation dataset.
        sess.run(validation_init_op)
        for _ in range(10):
            value = sess.run(next_element)
            result.append(str(value))
        print('[{}] val_set: {}'.format(i, ','.join(result)))
        print()

[0] train_set: -5,7,-2,5,-3,2,-3,12,7,16,7,5,5,3,20,16,24,21,13,26,16,23,29,28,25,34,21,19,36,35
[0] val_set: 0,1,2,3,4,5,6,7,8,9

[1] train_set: 9,-3,-6,4,4,-4,8,-3,-1,16,8,11,5,19,4,6,11,15,12,15,27,26,20,26,30,34,18,33,31,35
[1] val_set: 0,1,2,3,4,5,6,7,8,9

[2] train_set: -10,10,6,-1,4,5,14,7,0,9,15,1,13,22,11,10,13,18,9,12,15,24,29,22,18,27,31,26,23,23
[2] val_set: 0,1,2,3,4,5,6,7,8,9
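One consequence worth making explicit is that running an initializer always restarts the iterator from the beginning of the chosen dataset. The sketch below uses deterministic datasets (values chosen for illustration) to show that position in the training data is not retained across a switch. The `tf1` alias is an assumption for running on TensorFlow 2.x.

```python
import tensorflow as tf

# Assumption: on TensorFlow 2.x, use the 1.x-style graph API exposed under
# `tf.compat.v1`; on 1.12 the `tf` module itself is used.
tf1 = getattr(tf.compat, "v1", tf)
if hasattr(tf1, "disable_eager_execution"):
    tf1.disable_eager_execution()

training_dataset = tf1.data.Dataset.range(5)           # 0..4
validation_dataset = tf1.data.Dataset.range(100, 103)  # 100..102

iterator = tf1.data.Iterator.from_structure(
    training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

with tf1.Session() as sess:
    sess.run(training_init_op)
    before = [int(sess.run(next_element)) for _ in range(2)]

    sess.run(validation_init_op)
    val = int(sess.run(next_element))

    # Re-running the training initializer starts over at the first element.
    sess.run(training_init_op)
    after = [int(sess.run(next_element)) for _ in range(2)]
```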



### <font color=green>◎ feedable iterator</font>

<font color=blue>A <font color=green>**feedable iterator**</font> can be used together with **`tf.placeholder`** to select what **`Iterator`** to use in each call to **`tf.Session.run`**, via the familiar **`feed_dict`** mechanism.</font> <br/><br/>It offers the same functionality as a <font color=green>**reinitializable iterator**</font>, <font color=red>but it does not require you to initialize the iterator from the start of a dataset when you switch between iterators.</font> For example, using the same training and validation example from above, you can use [**`tf.data.Iterator.from_string_handle`**](https://www.tensorflow.org/api_docs/python/tf/data/Iterator#from_string_handle) to <font color=red>define a feedable iterator that allows you to switch between the two datasets</font>:

In [11]:
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(10).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64)
).repeat()
validation_dataset = tf.data.Dataset.range(5)

# A feedable iterator is defined by a handle placeholder and its structure. We
# could use the `output_types` and `output_shapes` properties of either
# `training_dataset` or `validation_dataset` here, because they have
# identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    string_handle=handle, 
    output_types=training_dataset.output_types, 
    output_shapes=training_dataset.output_shapes
)
next_element = iterator.get_next()

# You can use feedable iterators with a variety of different kinds of iterator
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

with tf.Session() as sess:
    # The `Iterator.string_handle()` method returns a tensor that
    # can be evaluated and used to feed the `handle` placeholder.
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(validation_iterator.string_handle())

    # Loop 5 times, alternating between training and validation.
    for i in range(5):
        result = []
        # Run 20 steps using the training dataset. Note that the training dataset is
        # infinite, and we resume from where we left off in the previous loop
        # iteration.
        for _ in range(20):
            value = sess.run(next_element, feed_dict={handle: training_handle})
            result.append(str(value))
        print('[{}] train_set: {}'.format(i, ','.join(result)))
        result.clear()
        
        # Run one pass over the validation dataset.
        sess.run(validation_iterator.initializer)
        for _ in range(5):
            value = sess.run(next_element, feed_dict={handle: validation_handle})
            result.append(str(value))
        print('[{}] val_set: {}'.format(i, ','.join(result)))
        print()

[0] train_set: -7,4,-7,7,10,-4,2,8,13,2,1,5,-3,-5,9,14,7,-3,12,5
[0] val_set: 0,1,2,3,4

[1] train_set: 4,8,9,7,9,9,13,16,3,3,5,9,1,-4,-6,5,7,3,-2,13
[1] val_set: 0,1,2,3,4

[2] train_set: 8,0,-5,1,4,4,7,2,12,10,9,7,-7,-1,12,-1,14,14,5,-1
[2] val_set: 0,1,2,3,4

[3] train_set: 5,2,3,8,11,4,5,5,10,12,0,-1,9,3,6,-3,13,-2,0,10
[3] val_set: 0,1,2,3,4

[4] train_set: 7,-6,2,-6,2,-4,14,5,14,9,9,-2,1,-2,12,0,4,-1,6,15
[4] val_set: 0,1,2,3,4
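Because the training values above are randomly perturbed, the resume behaviour is hard to see in the output. The sketch below repeats the idea with deterministic datasets (values chosen for illustration): after a detour through the validation handle, the training iterator continues from where it stopped. The `tf1` alias is an assumption for running on TensorFlow 2.x.

```python
import tensorflow as tf

# Assumption: on TensorFlow 2.x, use the 1.x-style graph API exposed under
# `tf.compat.v1`; on 1.12 the `tf` module itself is used.
tf1 = getattr(tf.compat, "v1", tf)
if hasattr(tf1, "disable_eager_execution"):
    tf1.disable_eager_execution()

training_dataset = tf1.data.Dataset.range(5)           # 0..4
validation_dataset = tf1.data.Dataset.range(100, 103)  # 100..102

handle = tf1.placeholder(tf.string, shape=[])
iterator = tf1.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_one_shot_iterator()

with tf1.Session() as sess:
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(validation_iterator.string_handle())

    before = [int(sess.run(next_element, feed_dict={handle: training_handle}))
              for _ in range(2)]
    val = int(sess.run(next_element, feed_dict={handle: validation_handle}))

    # Feeding the training handle again resumes at the third element.
    after = [int(sess.run(next_element, feed_dict={handle: training_handle}))
             for _ in range(2)]
```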



### <font color=gray>Consuming values from an iterator</font>

### <font color=gray>Saving iterator state</font>

# Reading input data

### <font color=gray>Consuming NumPy arrays</font>

### <font color=gray>Consuming TFRecord data</font>

### <font color=gray>Consuming text data</font>

### <font color=gray>Consuming CSV data</font>

# Preprocessing data with Dataset.map()

### <font color=gray>Parsing tf.Example protocol buffer messages</font>

### <font color=gray>Decoding image data and resizing it</font>

### <font color=gray>Applying arbitrary Python logic with tf.py_func()</font>

# Batching dataset elements

### <font color=gray>Simple batching</font>

### <font color=gray>Batching tensors with padding</font>

# Training workflows

### <font color=gray>Processing multiple epochs</font>

### <font color=gray>Randomly shuffling input data</font>

### <font color=gray>Using high-level APIs</font>
