## What is it for?

The Dataset API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The Dataset API makes it easy to deal with large amounts of data, different data formats, and complicated transformations.

## Data structure

* tf.data.Dataset: A sequence of elements.

* element: each element contains one or more tf.Tensor objects.


## How to create a Dataset

### From NumPy arrays

In [1]:
import numpy as np
import tensorflow as tf

np_array_1 = np.array([1,2,3,4])
np_array_2 = np.array([5,6,7,8])



dataset = tf.data.Dataset.from_tensor_slices((np_array_1, np_array_2))

In [6]:
dataset.output_shapes, dataset.output_types

((TensorShape([]), TensorShape([])), (tf.int64, tf.int64))

For a single tensor there is a simpler form: pass the array directly instead of a tuple.

In [8]:
import numpy as np
import tensorflow as tf

np_array_1 = np.array([1,2,3,4])



dataset = tf.data.Dataset.from_tensor_slices(np_array_1)

In [9]:
dataset.output_shapes, dataset.output_types

(TensorShape([]), tf.int64)

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().

### From TFRecords data

The Dataset API supports a variety of file formats so that you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.

In [10]:
# Creates a dataset that reads all of the examples from two files.
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)

The filenames argument to the TFRecordDataset initializer can be a string, a list of strings, or a tf.Tensor of strings. Therefore, if you have two sets of files for training and validation purposes, you can use a tf.placeholder(tf.string) to represent the filenames and initialize an iterator from the appropriate filenames.
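
The placeholder approach described above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it. It assumes TensorFlow 1.15+ or 2.x, where the 1.x graph-mode API is importable as `tensorflow.compat.v1` (on older 1.x releases a plain `import tensorflow as tf` without the `disable_eager_execution()` call should behave the same). The temporary files, the `write_records` and `read_all` helpers, and the record contents are all made up for the example.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.15+ or 2.x

tf.disable_eager_execution()  # placeholders and Sessions need graph mode


def write_records(path, values):
    """Hypothetical helper: writes each byte string as one TFRecord."""
    with tf.io.TFRecordWriter(path) as writer:
        for value in values:
            writer.write(value)


# Two small stand-in files for the training and validation sets.
tmp_dir = tempfile.mkdtemp()
train_file = os.path.join(tmp_dir, "train.tfrecord")
val_file = os.path.join(tmp_dir, "val.tfrecord")
write_records(train_file, [b"train-0", b"train-1", b"train-2"])
write_records(val_file, [b"val-0", b"val-1"])

# The filenames are supplied when the iterator is initialized,
# not when the graph is built.
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()


def read_all(sess, files):
    """Hypothetical helper: re-initializes the iterator and drains it."""
    sess.run(iterator.initializer, feed_dict={filenames: files})
    records = []
    while True:
        try:
            records.append(sess.run(next_element))
        except tf.errors.OutOfRangeError:
            return records


with tf.Session() as sess:
    train_records = read_all(sess, [train_file])
    val_records = read_all(sess, [val_file])

print(train_records)
print(val_records)
```

The same graph serves both file sets; only the value fed to `filenames` changes between the two initializations.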



### Consuming text data: tf.data.TextLineDataset

Many datasets are distributed as one or more text files. The tf.data.TextLineDataset provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset will produce one string-valued element per line of those files. Like a TFRecordDataset, TextLineDataset accepts filenames as a tf.Tensor, so you can parameterize it by passing a tf.placeholder(tf.string).
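
As a small sketch of the behaviour described above, the cell below writes a three-line text file to a temporary directory and reads it back line by line. The file name and its contents are invented for the illustration, and the `tensorflow.compat.v1` import is an assumption that TensorFlow 1.15+ or 2.x is installed.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.15+ or 2.x

tf.disable_eager_execution()

# Create a small sample file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
text_file = os.path.join(tmp_dir, "sample.txt")
with open(text_file, "w", newline="\n") as f:
    f.write("first line\nsecond line\nthird line\n")

# Each element of the dataset is one line, as a scalar string tensor.
dataset = tf.data.TextLineDataset([text_file])
iterator = dataset.make_one_shot_iterator()
next_line = iterator.get_next()

lines = []
with tf.Session() as sess:
    while True:
        try:
            # String tensors come back as bytes; decode them for display.
            lines.append(sess.run(next_line).decode("utf-8"))
        except tf.errors.OutOfRangeError:
            break

print(lines)
```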

## Pre-processing data with Dataset.map

### Parsing tf.Example protocol buffer messages

Many input pipelines extract tf.train.Example protocol buffer messages from a TFRecord-format file (written, for example, using tf.python_io.TFRecordWriter). Each tf.train.Example record contains one or more "features", and the input pipeline typically converts these features into tensors.

In [None]:
# Transforms a scalar string `example_proto` into a pair of a scalar string and
# a scalar integer, representing an image and its label, respectively.
def _parse_function(example_proto):
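  # tf.parse_single_example only supports tf.float32, tf.int64, and tf.string,
  # so the label is parsed as int64 and cast to int32 afterwards.
  features = {"image": tf.FixedLenFeature((), tf.string, default_value=""),
              "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
  parsed_features = tf.parse_single_example(example_proto, features)
  return parsed_features["image"], tf.cast(parsed_features["label"], tf.int32)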

# Creates a dataset that reads all of the examples from two files, and extracts
# the image and label features.
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)

### Applying arbitrary Python logic with tf.py_func()
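
The original notebook leaves this section empty. One plausible illustration, offered as a sketch rather than the author's intended example, is to wrap a plain Python function with tf.py_func() and call it inside Dataset.map(). The `_reverse_digits` and `_wrap` functions and the input values are hypothetical, and the `tensorflow.compat.v1` import assumes TensorFlow 1.15+ or 2.x.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF 1.15+ or 2.x

tf.disable_eager_execution()


def _reverse_digits(value):
    """Plain Python logic: reverses the decimal digits of an integer."""
    return np.int64(int(str(int(value))[::-1]))


def _wrap(value):
    # tf.py_func runs the Python function when the element is produced.
    result = tf.py_func(_reverse_digits, [value], tf.int64)
    # py_func drops static shape information, so restore the scalar shape.
    result.set_shape([])
    return result


dataset = tf.data.Dataset.from_tensor_slices(
    np.array([12, 345, 6789], dtype=np.int64))
dataset = dataset.map(_wrap)

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

results = []
with tf.Session() as sess:
    sess.run(iterator.initializer)
    for _ in range(3):
        results.append(int(sess.run(next_element)))

print(results)
```

Because the wrapped function runs in the Python interpreter, it is not serialized with the graph; the function has to be available in the process that runs the pipeline.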

## Batching dataset elements

In [17]:
import tensorflow as tf
sess = tf.Session()
dataset = tf.data.Dataset.range(20)
data_batch = dataset.batch(4)

iterator = data_batch.make_one_shot_iterator()
next_element = iterator.get_next()

with sess.as_default():
    for i in range(5):
        value = sess.run(next_element)
        print(value)


[0 1 2 3]
[4 5 6 7]
[ 8  9 10 11]
[12 13 14 15]
[16 17 18 19]


Alternatively, pass the iterator's output to the queue-based tf.train.shuffle_batch to get shuffled batches:

In [19]:
import tensorflow as tf
sess = tf.Session()
dataset = tf.data.Dataset.range(20)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
batch_element = tf.train.shuffle_batch([next_element], batch_size=4, capacity=64, min_after_dequeue=32)
with sess.as_default():
    # shuffle_batch is queue-based, so the queue runners must be started first.
    tf.train.start_queue_runners()
    for i in range(5):
        value = sess.run(batch_element)
        print(value)

[ 5 15 14  1]
[16  9 19 13]
[6 3 4 0]
[12  2 10  8]
[ 7 17 18 11]


## Randomly shuffling input data

In [22]:
import tensorflow as tf
sess = tf.Session()
dataset = tf.data.Dataset.range(80)

dataset = dataset.repeat(2)

dataset = dataset.shuffle(buffer_size=160)

dataset = dataset.batch(16)

iterator = dataset.make_one_shot_iterator()

next_element = iterator.get_next()

with sess.as_default():
    for i in range(10):
        value = sess.run(next_element)
        print(value)

[34 56 72 70  6 65 31 55 14 18 62  2 74 46 67 11]
[76 17 13 77  7 43  5 79 26 42 25 37 48 52 28 36]
[15 33 45 57 16 24 22 15 27 62 12 71 68  3 60 11]
[75 31 51  0 76 14 44 66 78 12 50 19 30 64 68 70]
[47 36  9 55 18 23 47 72 45 21 44 52 79  8 29 59]
[41 35 63 20 25 28 10 19 59 26 46 60 16 48 10 61]
[39 65  8 39 73 53 64 78 57  5 66 54 69 49 40 30]
[ 6 69 20 74 33 22  2  4 13  3 77 35  1  0 38 67]
[50 71 23 58 38 41 24 17 29  7 75 42 56 21 54 51]
[61  1 37 32 43 73 27 49 34 32 53  4  9 63 58 40]
