# Optimize TensorFlow Pipeline Performance: Prefetch & Cache

Link to the YouTube tutorial video: https://www.youtube.com/watch?v=MLEKEplgCas&list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO&index=45

This tutorial assumes a machine with both a CPU and a GPU. Before running this script, activate the Anaconda virtual environment that enables and recognizes the GPU (enter `activate GPUEnv` at the Anaconda prompt).

In [44]:
import tensorflow as tf
import time

In [45]:
# Create a class with tf.data.Dataset as its base class, so FileDataSet derives from tf.data.Dataset.
# In this tutorial, we measure the performance (the training time) with and without prefetch, to see how
# prefetch makes better use of the CPU and GPU. FileDataSet is a dummy class that mimics real-world
# latency (e.g. reading files or objects from storage).
class FileDataSet(tf.data.Dataset):
    def read_files_in_batches(num_samples):
        # Open the file (in real code, this is where you would open the file containing your training dataset)
        time.sleep(0.03) # Mimic the delay in opening the file
        for sample_index in range(num_samples): # Mimic reading the files/samples in your training dataset
            time.sleep(0.015) # Mimic the delay in reading each sample
            yield (sample_index,) # yield makes this function a generator; each item is a 1-tuple, matching shape=(1,). Details on yield: https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python

    # __new__ runs when FileDataSet() is called. Instead of a FileDataSet instance, it returns a
    # dataset built from the read_files_in_batches generator. cls is the class reference;
    # num_samples defaults to 3.
    def __new__(cls, num_samples=3):
        return tf.data.Dataset.from_generator( # Build the dataset from a generator
            cls.read_files_in_batches, # The generator function that produces the samples
            output_signature=tf.TensorSpec(shape=(1,), dtype=tf.int64), # Shape and dtype of each element the generator yields
            args=(num_samples,) # Arguments passed to read_files_in_batches (here, num_samples)
        )



In [46]:
# Define a benchmark function to evaluate the training performance (in terms of time). It takes the dataset and the number of epochs (default: 2) as inputs.
def benchmark(dataset, num_epochs=2): # dataset is the object returned by FileDataSet()
    for epoch_num in range(num_epochs): # Go through all epochs
        for sample in dataset: # Go through all samples in the dataset
            time.sleep(0.01) # Mimic the time spent on a training step for each sample
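
The `%%timeit` cells in this notebook only work in IPython/Jupyter. As a rough sketch, assuming you want the same kind of measurement in a plain Python script, the standard-library `timeit` module can time `benchmark` directly. Here `range(3)` is only a stand-in for the dataset (an assumption for illustration; `benchmark` just needs something iterable).

```python
import time
import timeit


def benchmark(dataset, num_epochs=2):
    """Same loop as the notebook's benchmark: 10 ms of simulated training per sample."""
    for _ in range(num_epochs):
        for _ in dataset:
            time.sleep(0.01)


# Stand-in dataset (assumption): any iterable works, e.g. FileDataSet() in the notebook
stand_in_dataset = range(3)

# Run the benchmark 3 times, 1 call per run, and collect the wall-clock times in seconds
runs = timeit.repeat(lambda: benchmark(stand_in_dataset), number=1, repeat=3)
print(f"best of {len(runs)} runs: {min(runs) * 1000:.0f} ms")
```

Reporting the best of several runs is one common convention; `%%timeit` reports the mean and standard deviation instead.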

# Effect of prefetch()

## Benchmark the performance (training time) of FileDataSet()

With a plain FileDataSet(), the CPU fetches a batch of samples, the GPU trains on that batch, and only then does the CPU fetch the next batch. These operations run sequentially (one after another), so each device sits idle while the other works.

<img src="hidden\photo1.png" alt="This image is a representation of the simple neural network" style="width: 450px;"/>  <br />
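
As a back-of-the-envelope check, the simulated delays can be added up by hand. The sketch below uses only the `time.sleep` values from `FileDataSet` and `benchmark`; the "ideal overlap" figure is an assumption about perfect overlap of reading and training, not something TensorFlow guarantees. Measured times will be higher, because the sleeps do not capture generator and framework overhead.

```python
# Delays taken from FileDataSet and benchmark (in seconds)
OPEN_DELAY = 0.03    # opening the file, once per epoch
READ_DELAY = 0.015   # reading one sample
TRAIN_DELAY = 0.01   # simulated training step on one sample
NUM_SAMPLES = 3
NUM_EPOCHS = 2

# Sequential: every read is followed by a training step, nothing overlaps
sequential_epoch = OPEN_DELAY + NUM_SAMPLES * (READ_DELAY + TRAIN_DELAY)

# Ideal overlap (assumption): reads run back to back in the background, and only
# the last training step remains once the final sample has arrived. This formula
# holds because READ_DELAY >= TRAIN_DELAY, so reading is the bottleneck.
overlapped_epoch = OPEN_DELAY + NUM_SAMPLES * READ_DELAY + TRAIN_DELAY

sequential_total = NUM_EPOCHS * sequential_epoch
overlapped_total = NUM_EPOCHS * overlapped_epoch
print(f"sequential:    {sequential_total:.3f} s")
print(f"ideal overlap: {overlapped_total:.3f} s")
```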


In [47]:
%%timeit # Use the timeit cell magic to measure the performance (the training time)
benchmark(FileDataSet()) # Benchmark the performance (training time) of FileDataSet()

# Insights:
# 1) Without prefetching, the run takes about 360 ms (see the output below), the longest of the three configurations in this tutorial (no prefetch, prefetch(1), prefetch(tf.data.AUTOTUNE))

360 ms ± 8.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Benchmark the performance (training time) of FileDataSet().prefetch()

With FileDataSet().prefetch(), while the GPU trains on one batch, the CPU prefetches the next batch at the same time.

 <img src="hidden\photo2.png" alt="This image is a representation of the simple neural network" style="width: 450px;"/>  <br />
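
To make the overlap concrete, here is a minimal pure-Python sketch of the idea behind prefetching: a background thread reads ahead into a small bounded buffer while the consumer works on the current item. It is illustrative only and is not how `tf.data` implements `prefetch()`; the names `slow_reader`, `prefetch`, and `consume` are hypothetical helpers that mirror the delays used in this tutorial.

```python
import queue
import threading
import time


def slow_reader(num_samples=3):
    """Stand-in for a slow data source, using the tutorial's simulated delays."""
    time.sleep(0.03)               # simulated file-open latency
    for i in range(num_samples):
        time.sleep(0.015)          # simulated per-sample read latency
        yield i


def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable` while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()                # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)       # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()        # blocks until the next item is ready
        if item is done:
            return
        yield item


def consume(items):
    """Simulated training loop: 10 ms of work per sample."""
    seen = []
    for item in items:
        time.sleep(0.01)
        seen.append(item)
    return seen


start = time.perf_counter()
plain = consume(slow_reader())
plain_time = time.perf_counter() - start

start = time.perf_counter()
prefetched = consume(prefetch(slow_reader(), buffer_size=1))
prefetched_time = time.perf_counter() - start

print(plain, f"{plain_time:.3f} s without read-ahead")
print(prefetched, f"{prefetched_time:.3f} s with read-ahead")
```

Both loops see the same samples in the same order; only the waiting time changes, because reading the next sample overlaps with the simulated training step.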

### Using prefetch(1), meaning prefetch 1 batch

In [48]:
%%timeit # Use the timeit cell magic to measure the performance (the training time)
# .prefetch(1) prefetches 1 element (here, 1 sample; 1 batch if the dataset is batched) while your GPU is training.
# .prefetch(tf.data.AUTOTUNE) lets tf.data decide on its own how many elements to prefetch while your GPU is training.
# We can call .prefetch() here because FileDataSet() returns a tf.data.Dataset.
benchmark(FileDataSet().prefetch(1)) # Benchmark the performance (training time) of FileDataSet().prefetch(1)

# Insights:
# 1) With prefetch(1), the run takes about 319 ms (see the output below), the shortest of the three configurations in this tutorial (no prefetch, prefetch(1), prefetch(tf.data.AUTOTUNE))

319 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Using prefetch(tf.data.AUTOTUNE), meaning the number of batches to be prefetched is determined by autotune

In [49]:
%%timeit # Use the timeit cell magic to measure the performance (the training time)
# .prefetch(tf.data.AUTOTUNE) lets tf.data decide on its own how many elements to prefetch, instead of a fixed number as in .prefetch(1).
# We can call .prefetch() here because FileDataSet() returns a tf.data.Dataset.
benchmark(FileDataSet().prefetch(tf.data.AUTOTUNE)) # Benchmark the performance (training time) of FileDataSet().prefetch(tf.data.AUTOTUNE)

# Insights:
# 1) With prefetch(tf.data.AUTOTUNE), the run takes about 324 ms (see the output below), between the no-prefetch and prefetch(1) results in this tutorial. The gap to prefetch(1) is within the reported run-to-run variation.

324 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Effect of cache()

1) <img src="hidden\photo3.png" alt="This image is a representation of the simple neural network" style="width: 450px;"/>  <br />
    1) Without cache(), opening the file containing the dataset, reading its files/samples, and mapping them are all repeated in every epoch before training.

2) <img src="hidden\photo4.png" alt="This image is a representation of the simple neural network" style="width: 450px;"/>  <br />
    1) With cache(), opening the file containing the dataset, reading its files/samples, and mapping them are performed only once, during the first epoch. The processed samples are stored in the cache (in memory when cache() is called without a filename). In later epochs (2nd, 3rd, ...), the processed samples are retrieved from the cache for training, without being processed again. Hence, cache() reduces training time.
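
The two cases above can be mimicked in a few lines of plain Python. This is only a sketch of the caching idea, not TensorFlow's implementation: `CachedIterable` and `expensive_square` are hypothetical names, and the call counter shows that the costly step runs only during the first pass.

```python
import time


class CachedIterable:
    """Replays items from memory after the first complete pass (illustrative only)."""

    def __init__(self, make_iterator):
        self._make_iterator = make_iterator  # callable returning a fresh iterator
        self._items = None                   # filled after the first full pass

    def __iter__(self):
        if self._items is not None:
            yield from self._items           # later epochs: served from memory
            return
        collected = []
        for item in self._make_iterator():
            collected.append(item)
            yield item
        self._items = collected              # the cache is complete only now


calls = 0


def expensive_square(x):
    """Stand-in for a costly map step; counts how often it actually runs."""
    global calls
    calls += 1
    time.sleep(0.001)
    return x ** 2


cached = CachedIterable(lambda: (expensive_square(x) for x in range(5)))

# Three "epochs" over the same data
epochs = [list(cached) for _ in range(3)]
print(epochs[0])
print("expensive_square calls:", calls)
```

Across three passes over five samples, `expensive_square` is called 5 times rather than 15, because the second and third passes are replayed from the stored list.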

## Use a simple dataset to explain the concept of cache() in a simple way

In [50]:
# Create a new dataset, consisting of a bunch of numbers
dataset = tf.data.Dataset.range(5)

# Show the samples 
print('The samples in the dataset variable:')
for sample in dataset:
    print(sample.numpy())

# Compute the square of each sample in the dataset
dataset = dataset.map(lambda x: x**2)
print('\nThe samples in the dataset_squared variable:')
for sample in dataset:
    print(sample.numpy())

The samples in the dataset variable:
0
1
2
3
4

The samples in the dataset_squared variable:
0
1
4
9
16


In [51]:
# Cache the dataset, i.e. store its elements in the cache. After the first full pass, iterating over
# dataset again reads the squared samples from the cache, and map(lambda x: x**2) is not executed again.
# Without cache(), map(lambda x: x**2) is re-applied to the samples every time you iterate over dataset.
dataset = dataset.cache()
print('\nThe samples in the dataset_squared variable:')
list(dataset.as_numpy_iterator()) # An alternative way to show the elements (similar to: for sample in dataset: print(sample.numpy()))


The samples in the dataset_squared variable:


[0, 1, 4, 9, 16]

## Use the FileDataSet dataset to explain the concept of cache() in a practical way

In [52]:
# Define a function that adds a delay, to mimic an expensive map step
def mapped_function(s): # s is passed through unchanged; the function only adds a delay
    tf.py_function(lambda: time.sleep(0.03), [], ()) # This delay is the only thing we want in this function.
    return s

### Benchmark the performance of training without cache()

In [53]:
%%timeit -n1 -r1 # %%timeit is a cell magic and must be the first line of the IPython (Jupyter) cell; if anything (even a comment) comes before it, IPython tries to interpret it as a line magic and raises an error. -n1 -r1 means 1 loop, 1 run.
benchmark(FileDataSet().map(mapped_function), 5) # 5 means 5 epochs

# Insights:
# 1) Without cache(), the run takes 1.52 s, longer than the run with cache() in this tutorial

1.52 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Benchmark the performance of training with cache()

In [55]:
%%timeit -n1 -r1
benchmark(FileDataSet().map(mapped_function).cache(), 5) # 5 means 5 epochs

# Insights:
# 1) With cache(), the run takes 567 ms, shorter than the run without cache() in this tutorial
# 2) The reason: we run 5 epochs. In the first epoch, mapped_function is called and introduces its delay. After that, the data is cached, so in the second, third, fourth, and fifth epochs mapped_function is not called again and the mapped data is read from the cache.

567 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
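
As a sanity check on these two measurements, the simulated delays can be summed by hand. The sketch below counts only the `time.sleep` calls in `FileDataSet`, `mapped_function`, and `benchmark`, and assumes that cached epochs skip the open, read, and map delays entirely. The measured 1.52 s and 567 ms are both larger than these ideal figures, since the sleeps do not capture framework overhead, but the measured ratio (roughly 2.7x) is in the same range as the ideal one (roughly 3.1x).

```python
# Simulated delays from the notebook (in seconds)
OPEN_DELAY = 0.03    # FileDataSet: opening the file, once per uncached epoch
READ_DELAY = 0.015   # FileDataSet: reading one sample
MAP_DELAY = 0.03     # mapped_function: delay per sample
TRAIN_DELAY = 0.01   # benchmark: simulated training step per sample
NUM_SAMPLES = 3
NUM_EPOCHS = 5

# An uncached epoch pays for opening, reading, mapping, and training
uncached_epoch = OPEN_DELAY + NUM_SAMPLES * (READ_DELAY + MAP_DELAY + TRAIN_DELAY)

# A cached epoch (assumption) only pays for the training steps
cached_epoch = NUM_SAMPLES * TRAIN_DELAY

without_cache = NUM_EPOCHS * uncached_epoch
with_cache = uncached_epoch + (NUM_EPOCHS - 1) * cached_epoch

print(f"without cache(): {without_cache:.3f} s")
print(f"with cache():    {with_cache:.3f} s")
print(f"ideal speed-up:  {without_cache / with_cache:.1f}x")
```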
