# TensorFlow tf.data

This document assumes TensorFlow 2.0+ and deals with `tf.data`.

In [1]:
import tensorflow as tf
import itertools as it

In [2]:
print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.0.0-beta1


## Basic ranging with `tf.data.Dataset.range()`

Let's start with a dataset that yields the integers 0, 1, 2, 3, 4, 5, 6. In raw Python we would use `range(7)` for this purpose, but here we want an analogue that also works well in distributed computing: `tf.data.Dataset.range()`.

In [3]:
ds = tf.data.Dataset.range(7)
ds

<RangeDataset shapes: (), types: tf.int64>

With eager mode enabled (the default in TensorFlow 2.0), a `tf.data.Dataset` can be treated as a Python iterable object. Otherwise the dataset must first be converted into an iterator with `.make_one_shot_iterator()` (available as `tf.compat.v1.data.make_one_shot_iterator()` in TensorFlow 2.0), and its elements are then fetched with TensorFlow iterator APIs such as `.get_next()`.

In [4]:
for x in ds:
    print(x)
    print(f"Type: {type(x)}, value = {x.numpy()}")

tf.Tensor(0, shape=(), dtype=int64)
Type: <class 'tensorflow.python.framework.ops.EagerTensor'>, value = 0
tf.Tensor(1, shape=(), dtype=int64)
Type: <class 'tensorflow.python.framework.ops.EagerTensor'>, value = 1
tf.Tensor(2, shape=(), dtype=int64)
Type: <class 'tensorflow.python.framework.ops.EagerTensor'>, value = 2
tf.Tensor(3, shape=(), dtype=int64)
Type: <class 'tensorflow.python.framework.ops.EagerTensor'>, value = 3
tf.Tensor(4, shape=(), dtype=int64)
Type: <class 'tensorflow.python.framework.ops.EagerTensor'>, value = 4
tf.Tensor(5, shape=(), dtype=int64)
Type: <class 'tensorflow.python.framework.ops.EagerTensor'>, value = 5
tf.Tensor(6, shape=(), dtype=int64)
Type: <class 'tensorflow.python.framework.ops.EagerTensor'>, value = 6
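
Because the dataset is an ordinary Python iterable in eager mode, the built-in `iter()` and `next()` also work on it. The following is a minimal sketch, assuming eager execution is enabled; the variable names are chosen only for illustration.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(7)

# Pull elements one at a time with Python's iterator protocol.
iterator = iter(ds)
first = next(iterator)    # scalar int64 tensor holding 0
second = next(iterator)   # scalar int64 tensor holding 1

# Iterating over the dataset again starts from the beginning.
values = [int(x.numpy()) for x in ds]
print(int(first.numpy()), int(second.numpy()), values)
```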


## Grouping with `tf.data.Dataset.batch()`

`tf.data.Dataset.batch()` groups consecutive dataset elements into chunks of a given size. Also note that the return type is `BatchDataset`.

In [5]:
ds = tf.data.Dataset.range(7)
ds = ds.batch(2, drop_remainder=True)
ds

<BatchDataset shapes: (2,), types: tf.int64>

In [6]:
for x in ds:
    print(x)

tf.Tensor([0 1], shape=(2,), dtype=int64)
tf.Tensor([2 3], shape=(2,), dtype=int64)
tf.Tensor([4 5], shape=(2,), dtype=int64)


Note that the last element was dropped because of the option `drop_remainder=True` in `.batch()`.
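
For comparison, here is a small sketch of the same pipeline without `drop_remainder=True` (the option defaults to `False`). The leftover element is kept as a final, shorter batch.

```python
import tensorflow as tf

# Same range as above, but the incomplete last batch is kept.
ds = tf.data.Dataset.range(7).batch(2)

batches = [b.numpy().tolist() for b in ds]
print(batches)  # [[0, 1], [2, 3], [4, 5], [6]]
```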

You can apply `.batch()` multiple times to obtain higher-dimensional elements.

In [7]:
ds = tf.data.Dataset.range(20)
ds = ds.batch(2, drop_remainder=True).batch(3, drop_remainder=True)
ds

<BatchDataset shapes: (3, 2), types: tf.int64>

In [8]:
for x in ds:
    print(x)

tf.Tensor(
[[0 1]
 [2 3]
 [4 5]], shape=(3, 2), dtype=int64)
tf.Tensor(
[[ 6  7]
 [ 8  9]
 [10 11]], shape=(3, 2), dtype=int64)
tf.Tensor(
[[12 13]
 [14 15]
 [16 17]], shape=(3, 2), dtype=int64)


## Tuple element

A tuple of tensors becomes a dataset whose elements are tuples after `tf.data.Dataset.from_tensor_slices()`. This works like `zip()` in Python, but it takes tensors instead of iterables.

**[NOTE]** The tensors must have the same length along their first dimension; otherwise a `ValueError` is raised.

In [9]:
tuple_of_tensors = (tf.range(10), tf.range(2, 12))
ds = tf.data.Dataset.from_tensor_slices(tuple_of_tensors)
ds

<TensorSliceDataset shapes: ((), ()), types: (tf.int32, tf.int32)>

In [10]:
for (x, y) in ds:
    print(f"x={x}, y={y}")

x=0, y=2
x=1, y=3
x=2, y=4
x=3, y=5
x=4, y=6
x=5, y=7
x=6, y=8
x=7, y=9
x=8, y=10
x=9, y=11
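
The length requirement noted above can be checked with a short sketch. The lengths 10 and 5 are arbitrary choices for illustration, and the exact error message may differ between TensorFlow versions.

```python
import tensorflow as tf

# The first dimensions (10 and 5) disagree, so slicing cannot pair elements up.
mismatched = (tf.range(10), tf.range(5))

try:
    tf.data.Dataset.from_tensor_slices(mismatched)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```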


`.batch()` handles tuple elements transparently: each dataset element is still a tuple, but every component of the tuple is batched.

In [11]:
ds = ds.batch(3)
for (x, y) in ds:
    print(f"x={x},  y={y}")

x=[0 1 2],  y=[2 3 4]
x=[3 4 5],  y=[5 6 7]
x=[6 7 8],  y=[ 8  9 10]
x=[9],  y=[11]


Also, unlike in raw Python, a lambda function passed to `tf.data.Dataset.map()` receives the tuple components as separate arguments. [**]

In [12]:
tuple_of_tensors = ([0,2,4,6,8,10,12], [4,5,6,7,8,9,10])
ds = tf.data.Dataset.from_tensor_slices(tuple_of_tensors)
ds = ds.map(lambda x, y: (x + y, x * y))
ds = ds.batch(3)

for (x, y) in ds:
    print(f"x={x},  y={y}")

x=[ 4  7 10],  y=[ 0 10 24]
x=[13 16 19],  y=[42 64 90]
x=[22],  y=[120]


[**] Note that the analogous operation in raw Python fails due to the [lack of tuple unpacking in lambda](https://stackoverflow.com/questions/21892989/what-is-the-good-python3-equivalent-for-auto-tuple-unpacking-in-lambda).

In [13]:
tuples = ([0,2,4,6,8,10,12], [4,5,6,7,8,9,10])
# zip(*tuples) yields (x, y) pairs, but each pair reaches the lambda as ONE argument.
list(map(lambda x, y: (x + y, x * y), zip(*tuples)))

TypeError: <lambda>() missing 1 required positional argument: 'y'
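
In raw Python the same result can be obtained by unpacking explicitly. The sketch below shows two common workarounds, `itertools.starmap()` and indexing into a single tuple argument. The names `pairs`, `via_starmap` and `via_index` are illustrative only.

```python
import itertools as it

tuples = ([0, 2, 4, 6, 8, 10, 12], [4, 5, 6, 7, 8, 9, 10])
pairs = list(zip(*tuples))  # [(0, 4), (2, 5), ...]

# starmap unpacks each pair into the lambda's two parameters.
via_starmap = list(it.starmap(lambda x, y: (x + y, x * y), pairs))

# Alternatively, accept one tuple argument and index into it.
via_index = list(map(lambda p: (p[0] + p[1], p[0] * p[1]), pairs))

print(via_starmap)
```

Both lists contain the same sums and products as the `tf.data` version above, just as plain Python tuples instead of batched tensors.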

## Dict element

The behavior we saw with tuples also works for a dict of tensors.

In [14]:
d = {"x": tf.range(10), "y": tf.range(2, 12)}
ds = tf.data.Dataset.from_tensor_slices(d)
ds = ds.batch(3)
for d in ds:
    print(f"d['x'] = {d['x']},   d['y'] = {d['y']}")

d['x'] = [0 1 2],   d['y'] = [2 3 4]
d['x'] = [3 4 5],   d['y'] = [5 6 7]
d['x'] = [6 7 8],   d['y'] = [ 8  9 10]
d['x'] = [9],   d['y'] = [11]
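
`.map()` can also be applied to dict elements. Unlike the tuple case, the function receives the whole dict as a single argument. This is a minimal sketch, assuming eager execution; the output keys `"sum"` and `"prod"` are made up for illustration.

```python
import tensorflow as tf

d = {"x": tf.range(10), "y": tf.range(2, 12)}
ds = tf.data.Dataset.from_tensor_slices(d)

# The mapped function takes one dict and may return a new dict.
ds = ds.map(lambda e: {"sum": e["x"] + e["y"], "prod": e["x"] * e["y"]})
ds = ds.batch(4)

sums = [int(v) for e in ds for v in e["sum"].numpy()]
prods = [int(v) for e in ds for v in e["prod"].numpy()]
print(sums)
print(prods)
```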


## Sliding window

In [15]:
ds = tf.data.Dataset.range(10)
ds = ds.window(3, shift=1).flat_map(lambda x: x.batch(3))
for x in ds:
    print(f"{x}")

[0 1 2]
[1 2 3]
[2 3 4]
[3 4 5]
[4 5 6]
[5 6 7]
[6 7 8]
[7 8 9]
[8 9]
[9]
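
In the cell above, `.window(3, shift=1)` produces a dataset of small window datasets, and `.flat_map()` with `.batch(3)` turns each window into a single tensor. The last two outputs are shorter because the windows run past the end of the data. A sketch of one way to keep only full windows, assuming that is the desired behavior, is to pass `drop_remainder=True` to `.window()`:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# drop_remainder=True discards windows with fewer than 3 elements.
ds = ds.window(3, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda w: w.batch(3))

windows = [w.numpy().tolist() for w in ds]
print(windows)
```

With 10 input elements this leaves 8 windows, from `[0, 1, 2]` to `[7, 8, 9]`.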
