# TensorFlow Input Pipeline

Link to the YouTube video tutorial: https://www.youtube.com/watch?v=VFEOskzhhbc&list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO&index=44

1) **Motivation for using the TensorFlow input pipeline:**
    1) <img src="hidden\photo1.png" alt="This image is a representation of the simple neural network" style="width: 400px;"/>  <br />
        1) Let's say you are building a typical cats-versus-dogs image classification model. The images are stored on the hard disk, and you need to load them into RAM as some kind of NumPy array or pandas DataFrame. You have to convert the images into numbers because machine learning models understand numbers, not images. So you load them into NumPy arrays, say X_train and y_train, and give them to your model for training. Things look fine when you have a thousand images.

    2) <img src="hidden\photo2.png" alt="This image is a representation of the simple neural network" style="width: 400px;"/>  <br />
        1) What if you have 10 million images? In a deep learning environment you typically have a lot of data. If you run this on a computer that has only 8 GB of RAM and try to load everything at once, you know what your computer is going to tell you? It will be like: "Too much data, buddy, I cannot handle it! Please help me!"

    3) <img src="hidden\photo3.png" alt="This image is a representation of the simple neural network" style="width: 400px;"/>  <br />
        1) One approach to tackling this issue is to load the images in batches; this is called a streaming approach. Batch 1 is a thousand images, which you load into a special data structure (this table is not a NumPy array or pandas DataFrame; we will talk about what it is in a moment). You give batch 1 to your model for training, then you do the same with batch 2, batch 3, batch 4, and so on, and things work perfectly. So what is that special data structure? It is tf.data.Dataset, and it is what helps you build your TensorFlow input pipeline. To build a TensorFlow input pipeline you use the tf.data API, and tf.data.Dataset is the main class in that API.

    4) <img src="hidden\photo4.png" alt="This image is a representation of the simple neural network" style="width: 400px;"/>  <br />
        1) What if I have some blurry images? I don't want to load the images and train my model on them directly, because we usually have to do data cleaning and data transformation (scaling and so on) first. Fortunately, tf.data.Dataset has a good API for such transformations. For example, here the red row is a blurry image, and you can remove it with the filter function: you call .filter() and pass it a custom function that you define, which detects whether an image is blurry or not. We are not going to go into how exactly you detect a blurry image, but the point is that you can supply a custom filter function to tf.data.Dataset and it will filter those samples out. You can see that the red row is now gone from this instance of the data structure, and then you can do model training.

    5) <img src="hidden\photo5.png" alt="This image is a representation of the simple neural network" style="width: 400px;"/>  <br />
        1) You might want to do more transformations. Typically, when you train on an image dataset, you want to scale it. The values you see here are three-dimensional arrays: an image is represented by RGB channels whose values range from 0 to 255, and it is usual practice to scale them by dividing by 255. So you can call .map() with a lambda function (a simple anonymous Python function) that computes x / 255 for each of these values; for example, 34 divided by 255 is about 0.13. Then you can do your model training. Overall, you can use tf.data.Dataset for filtering, mapping, shuffling, and a lot of other transformations.

    6) <img src="hidden\photo6.png" alt="This image is a representation of the simple neural network" style="width: 400px;"/>  <br />
        1) What if I could write all these transformations in a single line of code? This is how it looks: a single line of code that forms your complete data input pipeline (TensorFlow input pipeline). The first step, list_files, collects the image files from your hard disk. Then you call .map(), which is like pandas .apply(): it runs some transformation on each of your images. Having just read an image from the hard disk, I would want to convert it into a NumPy-like array and then transform it. Internally, that array lives inside the tf.data.Dataset as a tensor; the tensor is the underlying data structure, and tf.data.Dataset provides an abstraction over it. So at this point you have converted the images into arrays and extracted the label from the folder name. The next step is filtering out blurry images. After that comes another mapping, which is just your scaling: bringing the values into the range zero to one. The result is your tf.data.Dataset. This first step is called building the data pipeline. In this pipeline you perform ETL (extract, transform, and load) with all kinds of transformations. I have shown only a few; you can also repeat, batch, and much more, and we will look at some of those in the coding part, which is part two of this video. The second step is training the model, where you pass tf_dataset to model.fit(). In my previous videos we used a NumPy array or pandas DataFrame as the input to fit(), but now we will use tf_dataset.

    7) <img src="hidden\photo7.png" alt="This image is a representation of the simple neural network" style="width: 400px;"/>  <br />
        1) It's not just images: you can load text files, spreadsheets, or any other kind of data. The data doesn't have to be on your local hard disk either; you can load images from the cloud. You can use this data input pipeline for batch loading, shuffling, filtering, and mapping, and all of this is called ETL (extract, transform, and load). In the end, what you get is your tf_dataset, which you can feed directly into your TensorFlow model.

    8) <img src="hidden\photo8.png" alt="This image is a representation of the simple neural network" style="width: 400px;"/>  <br />
        1) To summarize, the TensorFlow input pipeline offers two big benefits. First, you can handle huge datasets easily by streaming them from disk or from cloud storage. Second, you can apply the various types of transformations that you typically need before training your deep learning model.
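
The streaming idea described above can be sketched in plain Python, without TensorFlow. This is only an illustration of the concept, not how tf.data is implemented; `stream_batches` and `fake_image_ids` are hypothetical names, and the integers stand in for images read from disk one at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def stream_batches(samples: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield lists of up to batch_size samples without loading the whole input."""
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))  # pull only the next few samples
        if not batch:
            return
        yield batch


def fake_image_ids(count: int) -> Iterator[int]:
    """Hypothetical stand-in for reading images from disk one at a time."""
    for image_id in range(count):
        yield image_id


# Only one batch is held in memory at any moment.
for batch in stream_batches(fake_image_ids(10), batch_size=4):
    print(batch)
```

Because both functions are generators, at most `batch_size` samples exist in memory at once, which is the property that lets a pipeline work through far more data than fits in RAM.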

In [1]:
import tensorflow as tf

# Create a tf_dataset (whose samples are numbers)

In this tutorial, tf_dataset is a tf.data.Dataset object created with the TensorFlow API tf.data.Dataset.from_tensor_slices. Each element of the dataset is a tensor.

In [2]:
# daily_sales_numbers is the dataset in this tutorial. It stores daily sales in thousands of dollars (21 thousand dollars, 22 thousand dollars, and so on). The negative values are erroneous data, because daily sales cannot be negative.
daily_sales_numbers = [21, 22, -108, 31, -1, 32, 34, 31]

# Create a tf.data.Dataset object called tf_dataset from a Python list
tf_dataset = tf.data.Dataset.from_tensor_slices(daily_sales_numbers)
print(tf_dataset)

<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>


## Perform different operations on a dataset separately

### Methods to access the elements in the tf_dataset 

In [3]:
print('Print the elements in tf_dataset as tensor object:')
for sales in tf_dataset:
    print(sales) # Each element in the tf_dataset is a tensor object
 
print('\nPrint the elements in tf_dataset as numpy object (Convert each tensor object into numpy array using numpy()):')
for sales in tf_dataset:
    print(sales.numpy()) # Convert each element in the tf_dataset from a tensor object into numpy object, using numpy()

print('\nPrint the elements in tf_dataset as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):')
for sales in tf_dataset.as_numpy_iterator(): # Convert each element in the tf_dataset from a tensor object into numpy object, using as_numpy_iterator()
    print(sales) 

Print the elements in tf_dataset as tensor object:
tf.Tensor(21, shape=(), dtype=int32)
tf.Tensor(22, shape=(), dtype=int32)
tf.Tensor(-108, shape=(), dtype=int32)
tf.Tensor(31, shape=(), dtype=int32)
tf.Tensor(-1, shape=(), dtype=int32)
tf.Tensor(32, shape=(), dtype=int32)
tf.Tensor(34, shape=(), dtype=int32)
tf.Tensor(31, shape=(), dtype=int32)

Print the elements in tf_dataset as numpy object (Convert each tensor object into numpy array using numpy()):
21
22
-108
31
-1
32
34
31

Print the elements in tf_dataset as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):
21
22
-108
31
-1
32
34
31


In [4]:
print('\nPrint the first 3 elements in tf_dataset as numpy object (Convert each tensor object into numpy array using numpy()):')
for sales in tf_dataset.take(3): #.take(N) means only get the first N samples of the tf_dataset (means the loop only runs for N iterations). At each iteration, a taken sample will be stored in sales. 
    print(sales.numpy()) # Convert each element in the tf_dataset from a tensor object into numpy object, using numpy()

# Filter out the erroneous data (in this tutorial, the samples with negative values) by passing a user-defined predicate to filter() (here, lambda x: x>0)
tf_dataset_filtered = tf_dataset.filter(lambda x: x>0)
print('\nPrint the elements/samples in tf_dataset_filtered [with error data removed] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):')
for sales in tf_dataset_filtered.as_numpy_iterator():
    print(sales)

# Convert the currency of each sample from USD into Rupees, at a rate of 1 USD : 72 Rupees, by passing a user-defined function to map() (here, lambda x: x*72)
tf_dataset_filtered_CurrencyChanged = tf_dataset_filtered.map(lambda x: x*72)
print('\nPrint the elements/samples in tf_dataset_filtered_CurrencyChanged [with error data removed & currency changed] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):')
for sales in tf_dataset_filtered_CurrencyChanged.as_numpy_iterator():
    print(sales)


Print the first 3 elements in tf_dataset as numpy object (Convert each tensor object into numpy array using numpy()):
21
22
-108

Print the elements/samples in tf_dataset_filtered [with error data removed] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):
21
22
31
32
34
31

Print the elements/samples in tf_dataset_filtered_CurrencyChanged [with error data removed & currency changed] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):
1512
1584
2232
2304
2448
2232


In [5]:
# Randomly shuffle the samples in tf_dataset_filtered_CurrencyChanged, using a buffer size of 3. For a detailed explanation of the shuffle buffer size, see: https://stackoverflow.com/questions/53514495/what-does-batch-repeat-and-shuffle-do-with-tensorflow-dataset
tf_dataset_filtered_CurrencyChanged_shuffled = tf_dataset_filtered_CurrencyChanged.shuffle(3)
print('Print the elements/samples in tf_dataset_filtered_CurrencyChanged_shuffled [with error data removed, currency changed, shuffled] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):')
for sales in tf_dataset_filtered_CurrencyChanged_shuffled.as_numpy_iterator():
    print(sales)

Print the elements/samples in tf_dataset_filtered_CurrencyChanged_shuffled [with error data removed, currency changed, shuffled] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):
1584
2304
2448
1512
2232
2232
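
To make the role of the buffer size more concrete, here is a plain-Python sketch of buffer-based shuffling. It is a simplified model of the idea, not TensorFlow's actual implementation; `buffered_shuffle` and its `seed` parameter are hypothetical names used only for illustration.

```python
import random
from typing import Iterable, Iterator, List, Optional


def buffered_shuffle(samples: Iterable[int], buffer_size: int,
                     seed: Optional[int] = None) -> Iterator[int]:
    """Shuffle a stream using only a fixed-size buffer of pending samples."""
    rng = random.Random(seed)
    buffer: List[int] = []
    for sample in samples:
        if len(buffer) < buffer_size:
            buffer.append(sample)          # still filling the buffer
            continue
        index = rng.randrange(len(buffer))
        yield buffer[index]                # emit a random buffered sample
        buffer[index] = sample             # the incoming sample takes its slot
    rng.shuffle(buffer)                    # input exhausted: drain the rest
    yield from buffer


rupees = [1512, 1584, 2232, 2304, 2448, 2232]
print(list(buffered_shuffle(rupees, buffer_size=3)))
```

With a buffer of 3, the first value emitted can only come from the first three inputs, so a small buffer gives a weak shuffle. A buffer at least as large as the dataset gives a full shuffle, and a buffer of 1 leaves the order unchanged.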


In [6]:
# Do batching, by creating batches of size 2 (means each batch will have 2 samples of the dataset)
tf_dataset_filtered_CurrencyChanged_shuffled_batched = tf_dataset_filtered_CurrencyChanged_shuffled.batch(2)

print('Print the elements/samples in tf_dataset_filtered_CurrencyChanged_shuffled_batched [with error data removed, currency changed, shuffled, and batched] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):')
for sales_batch in tf_dataset_filtered_CurrencyChanged_shuffled_batched: # Since tf_dataset_filtered_CurrencyChanged_shuffled has only 6 samples and each batch holds 2 samples, there will be 3 batches to accommodate all the samples of the dataset.
    print(sales_batch.numpy())

Print the elements/samples in tf_dataset_filtered_CurrencyChanged_shuffled_batched [with error data removed, currency changed, shuffled, and batched] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):
[1584 2304]
[2448 2232]
[2232 1512]
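
The six samples above divide evenly into batches of 2. The short sketch below shows what batch() does when they do not, using five of the same values (chosen only for illustration) and the `drop_remainder` argument of `Dataset.batch`.

```python
import tensorflow as tf

# Five samples do not divide evenly into batches of 2.
sales = tf.data.Dataset.from_tensor_slices([1512, 1584, 2232, 2304, 2448])

# Default behaviour: the final, smaller batch is kept.
all_batches = [batch.tolist() for batch in sales.batch(2).as_numpy_iterator()]

# drop_remainder=True discards the incomplete final batch.
full_batches = [batch.tolist()
                for batch in sales.batch(2, drop_remainder=True).as_numpy_iterator()]

print(all_batches)
print(full_batches)
```

Dropping the remainder is useful when a model needs every batch to have exactly the same shape, at the cost of skipping a few samples.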


## Perform different operations on a dataset in one single line (using Tensorflow input pipeline)

In [7]:
# Perform different operations on the same dataset (tf_dataset) in one single line (start with filtering, followed by mapping, shuffling, and ends with batching). In other words, the one single line is called the tensorflow input pipeline.
tf_dataset_OperationsInSingleLine = tf_dataset.filter(lambda x: x>0).map(lambda y: y*72).shuffle(3).batch(2)

print('Print the elements/samples in tf_dataset_OperationsInSingleLine [with error data removed, currency changed, shuffled, and batched] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):')
for sales in tf_dataset_OperationsInSingleLine.as_numpy_iterator():
    print(sales)

Print the elements/samples in tf_dataset_OperationsInSingleLine [with error data removed, currency changed, shuffled, and batched] as numpy object (Convert each tensor object into numpy array using as_numpy_iterator()):
[2232 1512]
[1584 2232]
[2304 2448]
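
A long chain can be easier to read when it is wrapped in parentheses with one operation per line. The sketch below builds the same pipeline inside a function; `build_sales_pipeline` and its parameter names are hypothetical and only serve the illustration.

```python
import tensorflow as tf


def build_sales_pipeline(values, usd_to_inr=72, buffer_size=3, batch_size=2):
    """Filter out non-positive sales, convert the currency, shuffle and batch."""
    return (
        tf.data.Dataset.from_tensor_slices(values)
        .filter(lambda x: x > 0)           # drop erroneous (negative) samples
        .map(lambda x: x * usd_to_inr)     # USD -> Rupees
        .shuffle(buffer_size)              # order changes on every iteration
        .batch(batch_size)
    )


pipeline = build_sales_pipeline([21, 22, -108, 31, -1, 32, 34, 31])
batches = list(pipeline.as_numpy_iterator())

# The batch contents vary between runs because of the shuffle,
# so compare the sorted values instead of the raw batches.
flat = sorted(int(value) for batch in batches for value in batch)
print(flat)
```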


# Create a tf_dataset (whose samples are images)

The purpose of this tutorial is to give you an idea of tensorflow input pipeline. You will be using that while training tensorflow deep learning models. So here, we are not doing any training, we are just building the pipeline.

In [8]:
# List the file paths of all image files in the dataset with tf.data.Dataset.list_files() and store them in the images_dataset variable. shuffle=False keeps the file paths in a deterministic order; shuffle=True would return them in a random order.
images_dataset = tf.data.Dataset.list_files('deep-learning-keras-tf-tutorial/44_tf_data_pipeline/images/*/*', shuffle=False)

# Show the first 3 image file paths stored in the images_dataset variable
print('The first 3 image directories stored in the image_dataset variable:')
for file in images_dataset.take(3):
    print(file.numpy())

The first 3 image directories stored in the image_dataset variable:
b'deep-learning-keras-tf-tutorial\\44_tf_data_pipeline\\images\\cat\\20 Reasons Why Cats Make the Best Pets....jpg'
b'deep-learning-keras-tf-tutorial\\44_tf_data_pipeline\\images\\cat\\7 Foods Your Cat Can_t Eat.jpg'
b'deep-learning-keras-tf-tutorial\\44_tf_data_pipeline\\images\\cat\\A cat appears to have caught the....jpg'


## Data Preprocessing

### Shuffle the data

In [9]:
# Randomly shuffle the file paths in the images_dataset variable, using a buffer size of 200. Now image_dataset_shuffled contains the cat and dog image paths in mixed order (instead of all the cat image paths first, followed by all the dog image paths). Note that shuffle() reshuffles on every iteration by default (reshuffle_each_iteration=True), so each loop over the dataset can see a different order.
image_dataset_shuffled = images_dataset.shuffle(200)
print('\nThe first 3 image directories stored in the image_dataset_shuffled variable:')
for file in image_dataset_shuffled.take(3):
    print(file.numpy())



The first 3 image directories stored in the image_dataset_shuffled variable:
b'deep-learning-keras-tf-tutorial\\44_tf_data_pipeline\\images\\dog\\66 gifts for dogs or dog lovers to get_yythk....jpg'
b'deep-learning-keras-tf-tutorial\\44_tf_data_pipeline\\images\\dog\\How to make your dog feel comfortable....jpg'
b'deep-learning-keras-tf-tutorial\\44_tf_data_pipeline\\images\\dog\\The 25 Cutest Dog Breeds - Most....jpg'


In [10]:
# Define the unique classes (ground truth)
class_names = ["cat","dog"]

### Split dataset into train and test sets

In [11]:
# Count the number of samples/images in the dataset
image_count = len(image_dataset_shuffled)
print('The dataset consists of ' + str(image_count) + ' images.')

# Set the train set having 80% samples of the dataset
train_size = int(image_count*0.8)

# Get the first train_size samples in the image_dataset_shuffled variable, then store them in the train_dataset variable
train_dataset = image_dataset_shuffled.take(train_size)

# Skip the first train_size samples in the image_dataset_shuffled variable. Get the (train_size+1)th to the last sample, then store them in the test_dataset variable
test_dataset = image_dataset_shuffled.skip(train_size)

print('The train set consists of ' + str(len(train_dataset)) + ' images.')
print('The test set consists of ' + str(len(test_dataset)) + ' images.')

The dataset consists of 130 images.
The train set consists of 104 images.
The test set consists of 26 images.
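
One caveat with take()/skip() on a shuffled dataset: by default shuffle() reshuffles every time the dataset is iterated, so the two subsets are not guaranteed to stay disjoint from one pass to the next. The sketch below is one possible way to get a stable split, assuming integers 0..129 as a hypothetical stand-in for the 130 image paths and an arbitrary seed of 42.

```python
import tensorflow as tf

# Hypothetical stand-in for the 130 image paths: the integers 0..129.
samples = tf.data.Dataset.range(130)

# reshuffle_each_iteration=False keeps the same shuffled order on every pass,
# so take() and skip() always cut the same sequence at the same position.
shuffled = samples.shuffle(200, seed=42, reshuffle_each_iteration=False)

train_size = int(130 * 0.8)
train = shuffled.take(train_size)
test = shuffled.skip(train_size)

train_ids = {int(x) for x in train.as_numpy_iterator()}
test_ids = {int(x) for x in test.as_numpy_iterator()}
print(len(train_ids), len(test_ids), train_ids.isdisjoint(test_ids))
```

If the training order should still change every epoch, a second shuffle() can be applied to the train subset after the split.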


#### Extra: How to retrieve the ground truth (label) of an image from its file path

In [12]:
# For example, take one image file path from the train set
for file in train_dataset.take(1):
    s = file

print('The 1st image path in the train set:\n', s.numpy())

import os

# Split the path into its components using the OS path separator (os.path.sep). The component at index 3 is the class folder name (cat or dog)
label_example = tf.strings.split(s, os.path.sep)[3]
print('\nThe label of the 1st image in the train set: ', label_example)

The 1st image path in the train set:
 b'deep-learning-keras-tf-tutorial\\44_tf_data_pipeline\\images\\cat\\Cat Advice _ Collecting a Urine Sample....jpg'

The label of the 1st image in the train set:  tf.Tensor(b'cat', shape=(), dtype=string)


### Retrieve the label (ground truth) of all images/samples of the dataset

In [13]:
# Define a function to retrieve the label (ground truth) from the file path of an image/sample
def get_label(file_path):
    return tf.strings.split(file_path, os.path.sep)[3]
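
The function above returns the label as a string and relies on the class folder sitting at index 3 of the path. As a hedged variant, the sketch below takes the parent folder with index -2, which does not depend on how deep the dataset directory is, and turns the folder name into an integer class index. `get_label_index` and the example path are hypothetical.

```python
import os

import tensorflow as tf

class_names = tf.constant(["cat", "dog"])


def get_label_index(file_path):
    """Return 0 for cat and 1 for dog, based on the parent folder name."""
    parts = tf.strings.split(file_path, os.path.sep)
    matches = tf.equal(parts[-2], class_names)     # e.g. [False, True] for a dog
    return tf.argmax(tf.cast(matches, tf.int32))   # position of the match


# Hypothetical path, built with the separator of the current OS.
example_path = os.path.join("images", "dog", "example.jpg")
label = get_label_index(tf.constant(example_path))
print(int(label))
```

Integer labels are what loss functions such as sparse categorical cross-entropy expect, whereas the byte-string labels would still need a conversion step before training.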

### Retrieve the data (features) of all images/samples of the dataset (the features are obtained by loading each image from disk into RAM and decoding it into an array of pixel values)

In [14]:
# Define a function to retrieve the data (features) and label (ground truth) of the provided image/sample simultaneously
def process_image(file_path):
    label = get_label(file_path)
    img = tf.io.read_file(file_path) # Read the file 
    img = tf.image.decode_jpeg(img) # Decode the JPEG-encoded bytes into a tensor of pixel values using decode_jpeg()
    img = tf.image.resize(img, [128, 128]) # Resize the image to the dimension of 128 x 128 pixels 
    return img, label # Return the feature and label of the images
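
When decode_jpeg() is called without a `channels` argument, it keeps the number of channels stored in the file, so grayscale and color images come out with different shapes and cannot be batched together. The sketch below is one way to force three channels; `load_rgb_image` is a hypothetical helper, and the small grayscale JPEG is generated on the fly only so the example is self-contained.

```python
import os
import tempfile

import tensorflow as tf


def load_rgb_image(file_path, size=(128, 128)):
    """Read a JPEG file and return a float32 tensor of shape (height, width, 3)."""
    raw = tf.io.read_file(file_path)
    img = tf.image.decode_jpeg(raw, channels=3)   # always decode to 3 channels
    return tf.image.resize(img, list(size))       # resize() returns float32


# Write a small single-channel (grayscale) JPEG to a temporary directory.
gray = tf.zeros([40, 60, 1], dtype=tf.uint8)
tmp_path = os.path.join(tempfile.mkdtemp(), "gray.jpg")
tf.io.write_file(tmp_path, tf.io.encode_jpeg(gray))

image = load_rgb_image(tmp_path)
print(image.shape, image.dtype)
```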

### Perform different operations on a dataset separately

#### Retrieve the features & label of each image/file simultaneously

A.map(B) applies the function B to each element of the dataset A.

In [15]:
iteration = 0

# The map() applies the process_image function on each sample in train_dataset, so the features and label of each sample in train_dataset is retrieved.
train_dataset_features_label = train_dataset.map(process_image)

# At each iteration, the retrieved features and label of a sample is stored in img and label variables respectively, in sequence.
for img, label in train_dataset_features_label.take(3):
    print('For the train set image/file at index ',iteration)
    print('The features:\n', img.numpy())
    print('The label:\n', label.numpy())
    print('\n\n')
    iteration += 1

For the train set image/file at index  0
The features:
 [[[255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  ...
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]]

 [[255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  ...
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]]

 [[255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  ...
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]]

 ...

 [[255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  ...
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]]

 [[255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  ...
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]]

 [[255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  ...
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]
  [255. 255. 255.   0.]]]
The label:
 b'dog'



For 

#### Scale the features of each image (into the range between 0 and 1)

In [16]:
# Define a function to scale the features of each image into the range between 0 and 1. Since each pixel value of a color channel has a maximum value of 255, we divide every pixel value by 255.
def scale(image, label):
    return image/255, label

# Scale the features of each image stored in train_dataset_features_label, then save the outputs in train_dataset_features_label_scale
train_dataset_features_label_scale = train_dataset_features_label.map(scale)
iteration = 0

# Show the scaled features and labels of provided image
for image, label in train_dataset_features_label_scale.take(3):
    print('For the train set image/file at index ',iteration)
    print('The features:\n', image.numpy())
    print('The label:\n', label.numpy())
    print('\n\n')
    iteration += 1

For the train set image/file at index  0
The features:
 [[[0.90539217 0.90539217 0.9132353 ]
  [0.9112745  0.9112745  0.90882355]
  [0.91764706 0.91764706 0.91764706]
  ...
  [0.8419118  0.8144608  0.79093134]
  [0.8392157  0.8039216  0.78431374]
  [0.8352941  0.80784315 0.78431374]]

 [[0.902451   0.902451   0.9102941 ]
  [0.9112745  0.9112745  0.9191176 ]
  [0.91780025 0.91780025 0.91780025]
  ...
  [0.8392157  0.8117647  0.7882353 ]
  [0.83927697 0.8120098  0.7884804 ]
  [0.8352941  0.80784315 0.78431374]]

 [[0.902451   0.90588236 0.9137255 ]
  [0.90588236 0.90588236 0.90588236]
  [0.91764706 0.91764706 0.91764706]
  ...
  [0.83985907 0.8124081  0.7888787 ]
  [0.84313726 0.8156863  0.7921569 ]
  [0.8352941  0.80784315 0.78431374]]

 ...

 [[0.74289215 0.907598   0.9703431 ]
  [0.7345588  0.9098039  0.9703431 ]
  [0.75818014 0.91847426 0.9655331 ]
  ...
  [0.6884804  0.88848037 0.97083336]
  [0.69240195 0.89240193 0.9747549 ]
  [0.6960478  0.9000306  0.972549  ]]

 [[0.7585478  0.92

### Perform different operations on a dataset in one single line (using Tensorflow input pipeline)

In [17]:
# List the file paths of all image files in the dataset again with tf.data.Dataset.list_files(). shuffle=False keeps the file paths in a deterministic order; shuffle=True would return them in a random order. (images_dataset_new is not used below; the pipeline is built from train_dataset.)
images_dataset_new = tf.data.Dataset.list_files('deep-learning-keras-tf-tutorial/44_tf_data_pipeline/images/*/*', shuffle=False)

# Chain both map() calls on train_dataset in a single line: process_image reads, decodes and resizes each image, then scale divides the pixel values by 255
train_set_new_OperationsInSingleLine = train_dataset.map(process_image).map(scale)
iteration = 0

# Show the scaled features and labels of the first 3 images
for image, label in train_set_new_OperationsInSingleLine.take(3):
    print('For the train set image/file at index ',iteration)
    print('The features:\n', image.numpy())
    print('The label:\n', label.numpy())
    print('\n\n')
    iteration += 1

For the train set image/file at index  0
The features:
 [[[0.7189951  0.7464461  0.36997548]
  [0.7246017  0.74813116 0.3638174 ]
  [0.7137255  0.7411765  0.3675245 ]
  ...
  [0.7019866  0.72833467 0.35015798]
  [0.71059763 0.7302055  0.36942115]
  [0.71115196 0.72683823 0.38566175]]

 [[0.70055145 0.73976713 0.36329657]
  [0.7151482  0.7425992  0.3661286 ]
  [0.70729166 0.74049    0.36492226]
  ...
  [0.69099265 0.72408086 0.35716912]
  [0.704473   0.72408086 0.3711397 ]
  [0.6993959  0.72025985 0.38833582]]

 [[0.6784314  0.7176471  0.34117648]
  [0.69842124 0.7258722  0.34940162]
  [0.68952876 0.72874445 0.35117093]
  ...
  [0.66041666 0.7069853  0.3529412 ]
  [0.68625826 0.7215524  0.37253276]
  [0.6862745  0.7176471  0.38535538]]

 ...

 [[0.5170295  0.68026483 0.38969153]
  [0.32928923 0.53282875 0.00992647]
  [0.56155026 0.744492   0.46907648]
  ...
  [0.6569001  0.7903694  0.56205386]
  [0.56078434 0.7137255  0.29793295]
  [0.57408756 0.73479915 0.38698587]]

 [[0.52640647 0.67