In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np
import tensorflow as tf

### importing data
A `tf.data.Dataset` represents a sequence of elements, each containing one or more tensor objects. In an imaging pipeline, an element might be a single training example: a pair of tensors holding the image data and its label. A dataset can be created in two ways:
  - creating a source (e.g. `Dataset.from_tensor_slices()`) constructs a dataset from one or more `tf.Tensor` objects.
  - applying a transformation (e.g. `Dataset.batch()`) constructs a dataset from one or more `tf.data.Dataset` objects.
  
`tf.data.Iterator` provides the main way to extract elements from a dataset. The operation returned by `Iterator.get_next()` yields the next element of a `Dataset` when executed, and typically acts as the interface between input pipeline code and your model.

`tf.data.TFRecordDataset` is the natural choice when your input data is already written to disk in TFRecord format. Once you have a `Dataset` object, you can transform it into a new `Dataset` by chaining method calls on it.
    
The most common way to consume values from a `Dataset` is to make an iterator object that provides access to one element of the dataset at a time. A `tf.data.Iterator` provides two operations: `Iterator.initializer`, which lets you (re)initialize the iterator's state, and `Iterator.get_next()`, which returns `tf.Tensor` objects that correspond to the symbolic next element.

##### Dataset Structure 
An element contains one or more `tf.Tensor` objects, called components. Each component has a `tf.DType` representing the type of elements in the tensor, and a `tf.TensorShape` representing the (possibly partially specified) static shape of each element. The `Dataset.output_types` and `Dataset.output_shapes` properties let you inspect the inferred types and shapes of each component of a dataset element. The nested structure of these properties maps to the structure of an element, which may be a single tensor, a tuple of tensors, or a nested tuple of tensors.

In [2]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
print(dataset1.output_types)
print(dataset1.output_shapes)

<dtype: 'float32'>
(10,)


In [3]:
dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([4]),
     tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types)
print(dataset2.output_shapes)

(tf.float32, tf.int32)
(TensorShape([]), TensorShape([Dimension(100)]))


In [4]:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types)
print(dataset3.output_shapes)

(tf.float32, (tf.float32, tf.int32))
(TensorShape([Dimension(10)]), (TensorShape([]), TensorShape([Dimension(100)])))


In [5]:
x = np.arange(5)
y = np.arange(5)
print(x)
print(y)
z = zip(x,y)
list(z)

[0 1 2 3 4]
[0 1 2 3 4]


[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

#### Creating the iterator
`tf.data` supports the following iterators: 
##### one-shot: 
The simplest form of iterator. It supports iterating once through a dataset, with no need for explicit initialization. One-shot iterators handle almost all of the cases that existing queue-based input pipelines support, but they do not support parameterization.

In [9]:
dataset = tf.data.Dataset.range(100)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for i in range(100):
        value = sess.run(next_element)
        assert i == value

##### initializable:

Requires you to run an explicit `iterator.initializer` op before using it. In exchange, you can parameterize the definition of the dataset using one or more `tf.placeholder()` tensors that are fed when you initialize the iterator.

In [10]:
max_value = tf.placeholder(tf.int64, shape=[])
dataset = tf.data.Dataset.range(max_value)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    # initialize an iterator over a dataset with 10 elements.
    sess.run(iterator.initializer, feed_dict={max_value:10})
    for i in range(10):
        value = sess.run(next_element)
        assert i == value

In [11]:
with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={max_value:100})
    for i in range(100):
        value = sess.run(next_element)
        assert i == value

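##### reinitializable:

Can be initialized from multiple different `Dataset` objects. For example, you might have a training input pipeline that applies random perturbations to the input images to improve generalization, and a validation input pipeline that evaluates predictions on unmodified data. These pipelines typically use different `Dataset` objects that share the same structure (the same types and compatible shapes for each component).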


In [3]:
# Define training and validation datasets with the same structure
training_dataset = tf.data.Dataset.range(100).map(lambda x: x + tf.random_uniform([],-10, 10, tf.int64))
validation_dataset = tf.data.Dataset.range(50)

# A reinitializable iterator is defined by its structure. We could use the
# `output_types` and `output_shapes` properties of either `training_dataset`
# or `validation_dataset` here, because they are compatible.
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

with tf.Session() as sess:
    # Run 20 epochs in which the training dataset is traversed, followed by validation dataset
    for _ in range(20):
        # Initialize an iterator over the training dataset
        sess.run(training_init_op)
        for _ in range(100):
            sess.run(next_element)
        
        # Initialize an iterator over the validation dataset
        sess.run(validation_init_op)
        for _ in range(50):
            sess.run(next_element)


##### feedable:

Can be used together with `tf.placeholder` to select which `Iterator` to use in each call to `tf.Session.run`, via the `feed_dict` mechanism. It offers the same functionality as a reinitializable iterator, but does not require you to initialize the iterator from the start of a dataset when you switch between iterators. Using the same training and validation example as above, `tf.data.Iterator.from_string_handle` defines a feedable iterator that lets you switch between the two datasets:

In [None]:
# Define the training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(100).map(lambda x: x + tf.random_uniform([],-10, 10, tf.int64)).repeat()
validation_dataset = tf.data.Dataset.range(50)

In [None]:
# A feedable iterator is defined by a handle placeholder and its structure.
# We could use the output_types and output_shapes properties of either
# training_dataset or validation_dataset, because they have identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

# You can use the feedable iterators with a variety of different kinds of iterator (like 
# one-shot and initializable iterators)
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

with tf.Session() as sess:
    # the 'Iterator.string_handle()' method returns a tensor that can be evaluated and used to feed
    # the 'handle' placeholder
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(validation_iterator.string_handle())
    
    # Loop forever, alternating between training and validation.
    while True:
        # Run 200 steps using the training dataset. Note that the training
        # dataset is infinite, and we resume from where we left off in the
        # previous 'while' loop iteration.
        for _ in range(200):
            sess.run(next_element, feed_dict={handle: training_handle})

        # Run one pass over the validation dataset. The initializable iterator
        # must be (re)initialized before each pass.
        sess.run(validation_iterator.initializer)
        for _ in range(50):
            sess.run(next_element, feed_dict={handle: validation_handle})

##### Consuming values from an iterator

The `Iterator.get_next()` method returns one or more `tf.Tensor` objects that correspond to the symbolic next element of an iterator. Each time these tensors are evaluated, they take the value of the next element in the underlying dataset. Note that calling `Iterator.get_next()` does not immediately advance the iterator. You must use the returned `tf.Tensor` objects in an expression and pass the result of that expression to `tf.Session.run()` to get the next elements and advance the iterator.

Note that once the iterator reaches the end of the dataset, executing the `Iterator.get_next()` operation raises a `tf.errors.OutOfRangeError`. After that point the iterator is unusable, and you must initialize it again to keep using it.

In [14]:
dataset = tf.data.Dataset.range(5)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Typically the 'result' will be the output of a model or optimizer's training op
result = tf.add(next_element, next_element)

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(sess.run(result))
    print(sess.run(result))
    print(sess.run(result))
    print(sess.run(result))
    print(sess.run(result))
    try:
        # The dataset has only 5 elements, so the sixth evaluation raises.
        sess.run(result)
    except tf.errors.OutOfRangeError:
        print("End of dataset")

0
2
4
6
8
End of dataset


In [15]:
# This is a common pattern for wrapping the "training loop"
with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        try:
            sess.run(result)
        except tf.errors.OutOfRangeError:
            break

In [None]:
# If each element of the dataset has a nested structure, the return value of Iterator.get_next()
# will be one or more tf.Tensor objects in the same nested structure.

#### Reading input data

##### numpy arrays


In [None]:
# Consuming NumPy arrays.
# Use np.load() to load the training data into two arrays. Reading the arrays by
# key assumes the file is a NumPy archive (as written by np.savez).
# features and labels will be embedded in your TensorFlow graph as tf.constant() ops.
# This works well for small datasets, but wastes memory and can run into the 2GB limit
# for the tf.GraphDef protocol buffer.
with np.load("/var/data/training_data.npy") as data:
    features = data["features"]
    labels = data["labels"]
    
    assert features.shape[0] == labels.shape[0]
    
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))

In [None]:
# alternatively we can define the Dataset in terms of tf.placeholder() tensors and 
# feed the NumPy arrays when you initialize the Iterator over dataset
with np.load("/var/data/training_data.npy") as data:
    features = data["features"]
    labels = data["labels"]
    
# Assume that each row of features corresponds to the same row of labels.
assert features.shape[0] == labels.shape[0]

features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on 'dataset']
dataset = ...
iterator = dataset.make_initializable_iterator()

sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})

##### TFRecord data format
Used to process large amounts of data that do not fit in memory. The TFRecord file format is a simple record-oriented binary format; the `tf.data.TFRecordDataset` class lets you stream over the contents of one or more TFRecord files as part of an input pipeline.

In [None]:
# Creates a dataset that reads all of the examples from the files named in
# the 'filenames' placeholder.
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)  # parse the record into tensors.
dataset = dataset.repeat()  # repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()

# You can feed the initializer with the appropriate filenames for the current
# phase of execution.

# Initialize 'iterator' with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})

# Initialize 'iterator' with validation data.
# (Placeholder paths; point these at your validation files.)
validation_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})

##### Consuming text data: TODO

##### Consuming CSV data: TODO

#### Preprocessing data with `Dataset.map()` 

`Dataset.map(f)` produces a new dataset by applying a given function `f` to each element of the input dataset. It is modeled on the `map()` function found in Python and other functional languages. `f` takes the `tf.Tensor` objects that represent a single element in the input and returns the `tf.Tensor` objects that represent a single element in the new dataset.
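
As a rough plain-Python analogue of the paragraph above (not TensorFlow code), the sketch below uses `itertools.starmap`, which, like `Dataset.map()` on tuple-structured elements, passes each component of an element to `f` as a separate argument. The function `add_suffix` and the sample elements are hypothetical, chosen only for illustration.

```python
from itertools import starmap

def add_suffix(filename, label):
    """Hypothetical per-element function: one argument per component."""
    return filename + ".jpg", label

# Each element is a (filename, label) pair, like a two-component dataset element.
elements = [("image1", 0), ("image2", 37)]

# starmap is lazy: add_suffix runs only as the result is consumed.
mapped = starmap(add_suffix, elements)
result = list(mapped)
print(result)
```

The structure of what `f` returns determines the structure of each element in the new collection, which mirrors how the return value of `f` defines the element structure of the mapped dataset.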

##### Parsing `tf.Example` protocol buffer messages

Many input pipelines extract `tf.train.Example` protocol buffer messages from a TFRecord-format file (written, for example, using `tf.python_io.TFRecordWriter`). Each `tf.train.Example` record contains one or more "features", and the input pipeline typically converts these features into tensors.

In [None]:
# Transforms a scalar string 'example_proto' into a pair of a scalar string and
# a scalar integer, representing an image and its label.
def _parse_function(example_proto):
    features = {"image": tf.FixedLenFeature((), tf.string, default_value=""),
                "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
    parsed_features = tf.parse_single_example(example_proto, features)
    return parsed_features["image"], parsed_features["label"]

# Creates a dataset that reads all of the examples from two files, and extracts
# the image and label features.
filenames =["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)

##### Decoding image data and resizing it

When training on real-world image data, it is often necessary to resize images of different sizes to a common size, so that they can be batched into a fixed-size tensor.

In [None]:
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string)
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label

# A vector of filenames
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg"])

# labels[i] is the label for the image in 'filenames[i]'
labels = tf.constant([0, 37, ...])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)


In [None]:
# We can apply arbitrary Python logic with tf.py_func().
# Prefer TF operations for preprocessing where possible, but it is sometimes
# helpful to call external Python libraries when parsing input data.

import cv2

# Use a custom OpenCV function to read an image, instead of the standard
# TF 'tf.read_file()' operation.

def _read_py_function(filename, label):
    image_decoded = cv2.imread(filename.decode(), cv2.IMREAD_GRAYSCALE)
    return image_decoded, label

# Use standardized TF operation to resize the image to a fixed shape. 
def _resize_function(image_decoded, label):
    image_decoded.set_shape([None, None, None])
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label

filenames = ["/var/data/image1.jpg", "/var/data/image2.jpg", ...]
labels = [0, 37, 29, 1, ...]

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
    lambda filename, label: tuple(tf.py_func(
        _read_py_function, [filename, label], [tf.uint8, label.dtype])))
dataset = dataset.map(_resize_function)

#### Batching dataset elements

##### Simple batching
The simplest form of batching stacks n consecutive elements of a dataset into a single element. The `Dataset.batch()` transformation does this with the same constraints as the `tf.stack()` operator applied to each component of the elements: for each component i, all elements must have a tensor of the exact same shape.

In [2]:
inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

iterator = batched_dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))
    print(sess.run(next_element))
    print(sess.run(next_element))

(array([0, 1, 2, 3]), array([ 0, -1, -2, -3]))
(array([4, 5, 6, 7]), array([-4, -5, -6, -7]))
(array([ 8,  9, 10, 11]), array([ -8,  -9, -10, -11]))
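
To make the grouping explicit, here is a minimal pure-Python sketch of what simple batching does to tuple-structured elements. It is an illustration of the idea only, not how `Dataset.batch()` is implemented; the helper name `batch` is hypothetical.

```python
from itertools import islice

def batch(elements, batch_size):
    """Group consecutive elements and stack each component separately."""
    it = iter(elements)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        # zip(*chunk) turns [(a0, b0), (a1, b1), ...] into (a0, a1, ...), (b0, b1, ...)
        yield tuple(list(component) for component in zip(*chunk))

inc = range(100)
dec = range(0, -100, -1)
batches = list(batch(zip(inc, dec), 4))

print(batches[0])
print(batches[1])
print(batches[2])
```

Each batch keeps the two-component structure of the original elements, with every component gaining a leading batch dimension of size 4.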


In [3]:
# What if the tensors have different sizes (e.g. in sequence models)?
# The Dataset.padded_batch() transformation batches tensors of different
# shapes by specifying one or more dimensions in which they may be padded.

dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=(None,))

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))
    print(sess.run(next_element)) 

[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]
[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]
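
The padding behaviour seen in the output above can be sketched in plain Python: within each batch, every sequence is right-padded to the length of the longest sequence in that batch. This is an illustrative sketch with hypothetical helper names (`_pad`, `padded_batch`) and an assumed pad value of 0, not TensorFlow's implementation.

```python
def _pad(batch, pad_value):
    """Right-pad every sequence to the longest length in this batch."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (width - len(seq)) for seq in batch]

def padded_batch(sequences, batch_size, pad_value=0):
    """Yield batches of variable-length sequences, padded per batch."""
    batch = []
    for seq in sequences:
        batch.append(list(seq))
        if len(batch) == batch_size:
            yield _pad(batch, pad_value)
            batch = []
    if batch:  # final, possibly smaller, batch
        yield _pad(batch, pad_value)

# Element x is a sequence of x copies of x, mirroring tf.fill([x], x) above.
sequences = ([x] * x for x in range(100))
batches = padded_batch(sequences, 4)
first = next(batches)
second = next(batches)
print(first)
print(second)
```

Note that the padded width differs between batches (3 for the first, 7 for the second), because padding is computed per batch rather than over the whole dataset.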


#### Training workflows

##### Multiple epochs

We can use `tf.data` in two ways to process multiple epochs of the same data.

The first is the `Dataset.repeat()` transformation. To create a dataset that repeats its input for 10 epochs:

In [None]:
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.repeat(10)  # this will repeat the input 10 times
dataset = dataset.batch(32)
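
Because `repeat()` concatenates epochs without marking where one ends, the order of `repeat()` and `batch()` matters when the dataset size is not a multiple of the batch size. The pure-Python sketch below illustrates this with 10 elements and a batch size of 4; the helpers `repeat` and `batch` are hypothetical stand-ins for the transformations, not TensorFlow code.

```python
from itertools import chain, islice

def repeat(elements, count):
    """Concatenate `count` passes over `elements` (a re-iterable sequence)."""
    return chain.from_iterable(elements for _ in range(count))

def batch(elements, batch_size):
    """Group consecutive elements into lists of at most `batch_size`."""
    it = iter(elements)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

data = list(range(10))

# repeat -> batch: the third batch straddles the epoch boundary.
repeat_then_batch = list(batch(repeat(data, 2), 4))

# batch -> repeat: each epoch ends with its own short batch.
batch_then_repeat = list(repeat(list(batch(data, 4)), 2))

print(repeat_then_batch)
print(batch_then_repeat)
```

If batches must never mix examples from two epochs, batching before repeating (or using the explicit per-epoch loop) is the safer ordering.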

Alternatively, we can write a training loop that receives a signal at the end of each epoch:

In [None]:
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Compute for 100 epochs.
for _ in range(100):
    sess.run(iterator.initializer)
    while True:
        try:
            sess.run(next_element)
        except tf.errors.OutOfRangeError:
            break

    # [Perform end-of-epoch calculations here.]

##### Randomly shuffling input data

In [None]:
# The Dataset.shuffle() transformation randomly shuffles the input dataset. It maintains a
# fixed-size buffer and chooses the next element uniformly at random from that buffer.

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat()
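
The buffer-based shuffling described in the comment above can be sketched in plain Python. This is an illustration of the idea under simple assumptions, not TensorFlow's implementation; the function name `buffer_shuffle` and its `seed` parameter are hypothetical.

```python
import random

def buffer_shuffle(elements, buffer_size, seed=None):
    """Yield elements in an order shuffled through a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in elements:
        buffer.append(item)
        if len(buffer) < buffer_size:
            continue  # keep filling until the buffer is full
        # Pick one buffered element uniformly at random and emit it.
        index = rng.randrange(len(buffer))
        buffer[index], buffer[-1] = buffer[-1], buffer[index]
        yield buffer.pop()
    # Input exhausted: drain what is left, still in random order.
    while buffer:
        index = rng.randrange(len(buffer))
        buffer[index], buffer[-1] = buffer[-1], buffer[index]
        yield buffer.pop()

shuffled = list(buffer_shuffle(range(20), buffer_size=5, seed=0))
unshuffled = list(buffer_shuffle(range(20), buffer_size=1, seed=0))
print(shuffled)
```

The sketch shows why the buffer size matters: an element can move earlier in the output by at most `buffer_size - 1` positions, so a buffer of size 1 leaves the order unchanged, while a buffer at least as large as the dataset allows any ordering.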