# Build TensorFlow Input Pipeline

- The **tf.data** API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.
- The **tf.data** API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

- The **tf.data** API introduces a **tf.data.Dataset** abstraction that represents a sequence of elements, in which each element consists of one or more components. For example, in an image pipeline, an element might be a single training example, with a pair of tensor components representing the image and its label.

- There are two distinct ways to create a dataset:

  - A data **source** constructs a *Dataset* from data stored in memory or in one or more files.

  - A data **transformation** constructs a dataset from one or more **tf.data.Dataset** objects.
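As a minimal sketch of the two routes described above (the names `source_ds` and `doubled_ds` are illustrative and not part of the original notebook), a source builds a dataset from in-memory values, and a transformation derives a new dataset from an existing one:

```python
import tensorflow as tf

# Data source: build a Dataset from values held in memory.
source_ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])

# Data transformation: derive a new Dataset from an existing Dataset.
doubled_ds = source_ds.map(lambda x: x * 2)

source_values = [int(x) for x in source_ds.as_numpy_iterator()]
doubled_values = [int(x) for x in doubled_ds.as_numpy_iterator()]

print(source_values)   # [1, 2, 3]
print(doubled_values)  # [2, 4, 6]
```

The transformation leaves `source_ds` unchanged; it returns a new dataset object.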


In [2]:
import tensorflow as tf
import os
import pandas as pd
import numpy as np

## Basic Mechanics

- To create an input pipeline, you must start with a data **source**.
- For example, to construct a Dataset from data in memory, you can use **tf.data.Dataset.from_tensors()** or **tf.data.Dataset.from_tensor_slices()**. 
- Alternatively, if your input data is stored in a file in the recommended **TFRecord** format, you can use **tf.data.TFRecordDataset()**.
- ***Note: a separate notebook will cover the TFRecord format. [more](https://towardsdatascience.com/a-practical-guide-to-tfrecords-584536bc786c)***


- Once you have a **Dataset** object, you can transform it into a new Dataset by chaining method calls on the **tf.data.Dataset** object. For example, you can apply per-element transformations such as **Dataset.map**, and multi-element transformations such as **Dataset.batch**.
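The chaining described above can be sketched as follows. This is a small illustration on toy integers (the variable names are assumptions, not from the original notebook): `map` runs once per element, and `batch` then groups consecutive elements.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2, 3, 4, 5])

# Per-element transformation (map) followed by a multi-element one (batch).
pipeline = dataset.map(lambda x: x * x).batch(4)

batches = [batch.tolist() for batch in pipeline.as_numpy_iterator()]
print(batches)  # [[0, 1, 4, 9], [16, 25]]
```

Each call returns a new dataset, so the order of the chain matters: batching before mapping would hand the mapped function whole batches instead of single elements.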


### From Tensors
- This method creates a dataset with a single element using the given input tensor.
- It is useful when you have a single large tensor or a small number of tensors, and you want to treat them as a single item in the dataset.
- The resulting dataset has exactly one element, so each pass over it (each epoch) yields the entire input at once.

In [60]:
tensor_1 = tf.constant([[0, 1, 2],
                        [3, 4, 5]])
tensor_2 = tf.constant([[6, 7, 8],
                        [9, 10, 11]])

In [61]:
dataset = tf.data.Dataset.from_tensors(tensor_1)

In [62]:
for item in dataset:
    print(item)
    print("=======")

tf.Tensor(
[[0 1 2]
 [3 4 5]], shape=(2, 3), dtype=int32)
=======




In [63]:
dataset = tf.data.Dataset.from_tensors([tensor_1, tensor_2])

In [64]:
# from_tensors() packed both tensors into one element, so repeat(2) yields only
# two elements and batch(3) returns a single, partially filled batch.
for item in dataset.repeat(2).batch(3):
    print(item)
    print("===")

tf.Tensor(
[[[[ 0  1  2]
   [ 3  4  5]]

  [[ 6  7  8]
   [ 9 10 11]]]


 [[[ 0  1  2]
   [ 3  4  5]]

  [[ 6  7  8]
   [ 9 10 11]]]], shape=(2, 2, 2, 3), dtype=int32)
===




Here, you can see that the list of tensors is converted into a dataset with a single element of shape (2, 2, 3).
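One way to confirm this without iterating is to inspect the dataset itself. The sketch below rebuilds the same two tensors and reads `cardinality()` and `element_spec`; the variable names `num_elements` and `element_shape` are illustrative.

```python
import tensorflow as tf

tensor_1 = tf.constant([[0, 1, 2], [3, 4, 5]])
tensor_2 = tf.constant([[6, 7, 8], [9, 10, 11]])

dataset = tf.data.Dataset.from_tensors([tensor_1, tensor_2])

# The list is stacked into one tensor, which becomes the only element.
num_elements = int(dataset.cardinality().numpy())
element_shape = dataset.element_spec.shape.as_list()

print(num_elements)   # 1
print(element_shape)  # [2, 2, 3]
```

This also explains the shape printed earlier: `repeat(2)` produces two elements of shape (2, 2, 3), and `batch(3)` stacks the two that are available into (2, 2, 2, 3).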

### From Tensor Slices

- A **Dataset** object is a Python iterable, so you can consume its elements with a for loop.
- The simplest way to create a dataset is from a Python list.
- **from_tensor_slices()** slices the input tensor along its first dimension (axis=0) and makes each slice one element.
- It is particularly useful when your data consists of many examples or samples and you want to treat each one separately during training or inference.
- The resulting dataset has as many elements as there are slices along the first dimension of the input tensor.

In [20]:
dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2, 3, 4, 5])
for element in dataset:
    print(element)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)




In [58]:
dataset = tf.data.Dataset.from_tensor_slices(tensor_1)
for item in dataset.repeat(4).batch(3):
    print(item)
    print("========")

tf.Tensor(
[[0 1 2]
 [3 4 5]
 [0 1 2]], shape=(3, 3), dtype=int32)
========
tf.Tensor(
[[3 4 5]
 [0 1 2]
 [3 4 5]], shape=(3, 3), dtype=int32)
========
tf.Tensor(
[[0 1 2]
 [3 4 5]], shape=(2, 3), dtype=int32)
========




Unlike **from_tensors()**, **from_tensor_slices()** creates a dataset with multiple elements by slicing the input tensor along its first dimension (axis=0).

In summary, use **from_tensors()** when you have a single large tensor or a small number of tensors to be treated as one element, and use **from_tensor_slices()** when you have a larger dataset with multiple elements and you want to process them individually.
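The difference can be checked side by side on one tensor. The following is a minimal sketch with illustrative variable names. It also uses `Dataset.unbatch()`, which the notebook does not otherwise cover, to show that splitting the single element along its first axis gives the same rows as slicing.

```python
import tensorflow as tf

tensor = tf.constant([[0, 1, 2], [3, 4, 5]])

whole = tf.data.Dataset.from_tensors(tensor)       # 1 element of shape (2, 3)
rows = tf.data.Dataset.from_tensor_slices(tensor)  # 2 elements of shape (3,)

whole_count = len(list(whole))
rows_count = len(list(rows))
print(whole_count, rows_count)  # 1 2

# unbatch() splits each element along its first axis, so the single-element
# dataset turns into the same rows that from_tensor_slices() yields.
unbatched = [row.tolist() for row in whole.unbatch().as_numpy_iterator()]
sliced = [row.tolist() for row in rows.as_numpy_iterator()]
print(unbatched == sliced)  # True
```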

### Tip: If all of your input data fits in memory
- If all of your input data fits in memory, the simplest way to create a **Dataset** from it is to pass it to **Dataset.from_tensor_slices**, which converts it to **tf.Tensor** objects.
- Note that you don't have to explicitly convert a Python list or NumPy array into a tf.Tensor with tf.constant() first; you can pass it directly to from_tensor_slices.
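As a small self-contained illustration of the point above (the toy `features` and `labels` values are made up for this sketch), a NumPy array and a plain Python list can be passed together as a tuple, and each element comes back as a pair of tensors:

```python
import numpy as np
import tensorflow as tf

features = np.arange(6, dtype=np.float32).reshape(3, 2)  # NumPy array, 3 rows
labels = [0, 1, 0]                                        # plain Python list

# No tf.constant() call is needed; the conversion happens inside tf.data.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

feature_spec, label_spec = dataset.element_spec
print(feature_spec.dtype, label_spec.dtype)

for x, y in dataset:
    print(x.numpy(), int(y))
```

All components must have the same size along the first dimension, since the slices are paired up position by position.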

In [25]:
train, test = tf.keras.datasets.fashion_mnist.load_data()

In [26]:
images, labels = train
images.shape, labels.shape

((60000, 28, 28), (60000,))

In [28]:
type(images), type(labels)

(numpy.ndarray, numpy.ndarray)

In [29]:
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

In [35]:
for item in dataset:
    img, label = item
    print("Image shape: ", img.shape)
    print("label: ", label)
    print("Type of the image: ", type(img))
    print("Type of the label: ", type(label))
    break

Image shape:  (28, 28)
label:  tf.Tensor(9, shape=(), dtype=uint8)
Type of the image:  <class 'tensorflow.python.framework.ops.EagerTensor'>
Type of the label:  <class 'tensorflow.python.framework.ops.EagerTensor'>


***Note: The above code snippet will embed the features and labels arrays in your TensorFlow graph as tf.constant() operations. This works well for a small dataset, but it wastes memory, because the contents of the arrays will be copied multiple times, and it can run into the 2GB limit for the tf.GraphDef protocol buffer.***

A Dataset object also lets you iterate over batches:

In [39]:
for batch in dataset.batch(10, drop_remainder=True):
    imgs, batch_labels = batch  # separate name so the full `labels` array is not overwritten
    print("Images shape: ", imgs.shape)
    print("Labels shape: ", batch_labels.shape)
    break

Images shape:  (10, 28, 28)
Labels shape:  (10,)
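The `drop_remainder=True` argument used above only matters when the dataset size is not a multiple of the batch size. A minimal sketch on a toy range of seven integers (not the Fashion-MNIST data) shows the difference:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(7)

# Default: the last batch is kept even though it is smaller.
keep = [b.tolist() for b in dataset.batch(3).as_numpy_iterator()]

# drop_remainder=True: the final, incomplete batch is discarded.
drop = [
    b.tolist()
    for b in dataset.batch(3, drop_remainder=True).as_numpy_iterator()
]

print(keep)  # [[0, 1, 2], [3, 4, 5], [6]]
print(drop)  # [[0, 1, 2], [3, 4, 5]]
```

Dropping the remainder gives every batch the same static shape, at the cost of skipping a few examples per pass.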




# TextLineDataset
