##### Copyright 2022.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Data pipeline in TensorFlow: Understanding the data pipeline

## Overview

This tutorial introduces beginners to the TensorFlow data pipeline API (`tf.data`), shows how to use it with the high-level `tf.keras` APIs, and covers `tf.data` operations that can be used for pipeline-level data processing.

## Setup and model preparation

Before working with `tf.data`, follow the steps below to set up the environment in Google Colab and build a very simple `tf.keras` model that will be used with `tf.data` later.

First, import TensorFlow into your program:

In [None]:
import tensorflow as tf

Build a `tf.keras.Sequential` model by stacking layers. For demonstration purposes, choose a very simple model:

In [None]:
model = tf.keras.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

Choose an optimizer and loss function for training:

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=['accuracy'])
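
The final `Dense(10)` layer has no activation, so the model outputs raw scores (logits) rather than probabilities, which is why the loss is created with `from_logits=True`. The following minimal sketch uses made-up `toy_logits` and `toy_labels` values that are not part of the tutorial data. It illustrates that this setting should give the same loss as applying a softmax first and using `from_logits=False`:

```python
import numpy as np
import tensorflow as tf

# Hypothetical values for illustration only: one example, three classes.
toy_logits = tf.constant([[2.0, 1.0, 0.1]])
toy_labels = tf.constant([0])

# Loss computed directly from raw logits.
loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True)(toy_labels, toy_logits)

# Loss computed from probabilities obtained with an explicit softmax.
loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False)(toy_labels, tf.nn.softmax(toy_logits))

print(float(loss_from_logits), float(loss_from_probs))
```

The two printed values should agree up to floating-point error.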

## Load the Fashion-MNIST dataset

The dataset used in this tutorial is [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist), which consists of Zalando's article images with a training set of 60,000 examples and a test set of 10,000 examples. For simplicity, use `tf.keras` to load the data into NumPy arrays. The `images` are converted to `float32` and scaled to the range [0, 1], and the `labels` are converted to `int32`:




In [None]:
import numpy as np

train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images.astype(np.float32)/255.0
labels = labels.astype(np.int32)

The next step is to convert the loaded NumPy data into a `tf.data.Dataset`:


In [None]:
d_train = tf.data.Dataset.from_tensor_slices((images, labels))

The `tf.data.Dataset` is ready, but what exactly does it contain?

Inspect the `element_spec` property to find out:

In [None]:
print("d_train: {}".format(d_train.element_spec))

The output of `element_spec` describes the structure, shape, and dtype of each dataset element:
```
(
  TensorSpec(shape=(28, 28), dtype=tf.float32, name=None),
  TensorSpec(shape=(), dtype=tf.int32, name=None),
)
```

The dataset turns out to be a sequence of tuples, where the first element of each tuple is a `28x28` `float32` image and the second element is a scalar `int32` label.
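
To see how `from_tensor_slices` arrives at this structure, here is a minimal sketch on toy arrays (`toy_images` and `toy_labels` are hypothetical stand-ins, not the Fashion-MNIST data). The arrays are sliced along their first dimension, so the leading dimension becomes the number of elements and each element keeps the remaining dimensions:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real data: 4 "images" of 2x3 pixels and 4 labels.
toy_images = np.zeros((4, 2, 3), dtype=np.float32)
toy_labels = np.arange(4, dtype=np.int32)

toy_ds = tf.data.Dataset.from_tensor_slices((toy_images, toy_labels))

# Each element is an (image, label) tuple without the leading dimension.
image_spec, label_spec = toy_ds.element_spec
print(image_spec.shape, image_spec.dtype)
print(label_spec.shape, label_spec.dtype)

# The leading dimension (4) becomes the number of elements.
num_elements = int(toy_ds.cardinality().numpy())
print(num_elements)
```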

## Supported operations of `tf.data.Dataset`

`tf.data.Dataset` supports many useful operations. For example, `take(n)` returns a dataset containing the first `n` elements. A dataset can also be iterated directly with a `for` loop, which implicitly calls `__iter__`:


In [None]:
for image, label in d_train.take(2):
  print("image: {}\nlabel: {}\n".format(image, label))

`map(func)` is a very useful operation that applies `func` to each element of the dataset and returns a new dataset containing the transformed elements:

In [None]:
d_train_size = d_train.map(lambda image, label: (tf.size(image), tf.size(label)))

# expected image_size is 784 = 28x28 and label_size is 1:
for image_size, label_size in d_train_size.take(2):
  print("image_size: {}\nlabel_size: {}\n".format(image_size, label_size))
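
The same two operations can be tried on a smaller scale. The sketch below uses `tf.data.Dataset.range` to build a toy dataset (the names `numbers` and `doubled` are illustrative only) and shows that `map` returns a new dataset and leaves the original unchanged:

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(5)       # elements 0, 1, 2, 3, 4
doubled = numbers.map(lambda x: x * 2)   # a new dataset; `numbers` is unchanged

# as_numpy_iterator() yields NumPy values, which are easy to collect in a list.
first_three = [int(x) for x in doubled.take(3).as_numpy_iterator()]
original = [int(x) for x in numbers.as_numpy_iterator()]

print(first_three)  # [0, 2, 4]
print(original)     # [0, 1, 2, 3, 4]
```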

A complete list of supported operations for `tf.data.Dataset` is available in the [API documentation](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).


## Usage of `tf.data` and `tf.keras`

Before a `tf.data.Dataset` is passed to `tf.keras`, it is normally shuffled and batched:

In [None]:
d_train = d_train.shuffle(5000).batch(32)
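
To get a feel for what these two calls do, here is a small sketch on a toy dataset of ten integers (the sizes are chosen for illustration only). `batch` groups consecutive elements and reports the batch dimension as `None`, because the last batch can be smaller. `shuffle` draws from a buffer of the given size, so a buffer smaller than the dataset, such as `shuffle(5000)` on 60,000 examples, shuffles only approximately:

```python
import tensorflow as tf

toy = tf.data.Dataset.range(10)

# batch(4) groups consecutive elements; the last batch holds the remainder.
batch_sizes = [int(b.shape[0]) for b in toy.batch(4)]
print(batch_sizes)  # [4, 4, 2]

# The batch dimension is None because the final batch may be smaller.
batched_spec = toy.batch(4).element_spec
print(batched_spec)

# Shuffling reorders elements but never drops or duplicates them.
# A buffer at least as large as the dataset allows a full shuffle.
shuffled_sorted = sorted(
    int(x) for x in toy.shuffle(10, seed=0).as_numpy_iterator())
print(shuffled_sorted)
```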

Recall that `model` was already compiled at the beginning of the tutorial. The dataset can now be passed directly to `model.fit`:

In [None]:
model.fit(d_train, epochs=5)

The `Model.evaluate` method accepts `tf.data.Dataset` as well:

In [None]:
model.evaluate(d_train, verbose=2)

Finally, `model.predict` can also take a `tf.data.Dataset` as input for inference. Instead of `(image, label)` pairs, the dataset passed to `model.predict` needs only the `image` (no `label`):

In [None]:
# image only dataset
d_image = tf.data.Dataset.from_tensor_slices(images).batch(32)

# for prediction
model.predict(d_image, verbose=2)
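
If only a dataset of `(image, label)` pairs is at hand, one possible way to obtain an image-only dataset is to drop the labels with `map`. The sketch below assumes toy arrays (`toy_images` and `toy_labels` are hypothetical stand-ins for the tutorial data):

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the (image, label) data used in this tutorial.
toy_images = np.zeros((6, 28, 28), dtype=np.float32)
toy_labels = np.zeros((6,), dtype=np.int32)
toy_pairs = tf.data.Dataset.from_tensor_slices((toy_images, toy_labels))

# Keep only the image from each (image, label) pair, then batch.
toy_image_only = toy_pairs.map(lambda image, label: image).batch(4)

# Each element is now a single batched image tensor rather than a tuple.
print(toy_image_only.element_spec)
```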