<a href="https://colab.research.google.com/github/rahiakela/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/blob/13-loading-and-preprocessing-data-with-tensorflow/1_tensorflow_data_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading and Preprocessing Data with TensorFlow

Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other **Deep Learning** libraries, but **TensorFlow** makes it easy thanks to the **Data API**: you just create a dataset object, tell it where to get the data, then transform it in any way you want, and **TensorFlow** takes care of all the implementation details, such as multithreading, queuing, batching, prefetching, and so on.

Off the shelf, the **Data API** can read from text files (such as CSV files), binary files with fixed-size records, and binary files that use **TensorFlow’s TFRecord format**, which supports records of varying sizes. TFRecord is a flexible and efficient binary format based on Protocol Buffers (an open source binary format). 

The **Data API** also supports reading from SQL databases. Moreover, many open source extensions are available to read from all sorts of data sources, such as **Google’s BigQuery** service.

However, reading huge datasets efficiently is not the only difficulty: the data also needs to be preprocessed. Indeed, it is not always composed strictly of convenient numerical fields: sometimes there will be text features, categorical features, and so on.

To handle this, TensorFlow provides the **Features API**: it lets you easily convert these
features to numerical features that can be consumed by your neural network. 

For example, categorical features with a large number of categories (such as cities, or words) can be encoded using embeddings (an embedding is a trainable dense vector that represents a category).

This notebook covers the **Data API**, the **TFRecord format**, and the **Features API**. Two related projects from TensorFlow's ecosystem are also worth knowing about:

* **TF Transform (tf.Transform)** makes it possible to write a single preprocessing function that can be run both in batch mode on your full training set, before training (to speed it up), and then exported to a TF Function and incorporated into your trained model, so that once it is deployed in production, it can take care of preprocessing new instances on the fly.
* **TF Datasets (TFDS)** provides a convenient function to download many common datasets of all kinds, including large ones like **ImageNet**, and it provides convenient dataset objects to manipulate them using the **Data API**.

## Setup

In [1]:
import sys
assert sys.version_info >= (3, 5)  # Python ≥3.5 is required

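import sklearn
# Compare numeric (major, minor) parts: plain string comparison misorders versions such as "0.9" and "0.20"
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)  # Scikit-Learn ≥0.20 is required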

# %tensorflow_version only exists in Colab.
try:
  %tensorflow_version 2.x
  IS_COLAB = True
except Exception:
  IS_COLAB = False

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tuple(int(p) for p in tf.__version__.split(".")[:2]) >= (2, 0)  # numeric comparison, not string comparison

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

TensorFlow 2.x selected.
No GPU was detected. LSTMs and CNNs can be very slow without a GPU.
Go to Runtime > Change runtime and select a GPU hardware accelerator.


## The Data API

The whole **Data API** revolves around the concept of a dataset: as you might suspect, this represents a sequence of data items. Usually you will use datasets that gradually read data from disk, but for simplicity let’s just create a dataset entirely in RAM using tf.data.Dataset.from_tensor_slices():

In [2]:
X = tf.range(10)  # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

The from_tensor_slices() function takes a tensor and creates a tf.data.Dataset whose elements are all the slices of X (along the first dimension), so this dataset contains 10 items: tensors 0, 1, 2, …, 9.

In [3]:
for item in dataset:
  print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


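Equivalently, you can build the same dataset with tf.data.Dataset.range(). Note that the items are now int64 tensors rather than int32: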

In [4]:
dataset = tf.data.Dataset.range(10)
for item in dataset:
  print(item)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
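
Slicing along the first dimension applies to tensors of any rank, not just to vectors. The following is a minimal sketch, assuming a small 3x2 matrix (the names `matrix` and `rows` are illustrative and not part of the original notebook): each item of the resulting dataset is one row of the matrix.

```python
import tensorflow as tf

# A 3x2 matrix: slicing along the first dimension yields 3 items of shape (2,)
matrix = tf.constant([[1, 2], [3, 4], [5, 6]])
rows = tf.data.Dataset.from_tensor_slices(matrix)

# Collect each item as a plain Python list for easy inspection
row_values = [row.numpy().tolist() for row in rows]
print(row_values)  # [[1, 2], [3, 4], [5, 6]]
```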


### Chaining Transformations

Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so you can chain transformations like this:

<img src='https://github.com/rahiakela/img-repo/blob/master/hands-on-machine-learning-keras-tensorflow/chaining-dataset-transformations.png?raw=1' width='800'/>

In [5]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
  print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)


We first call the repeat() method on the original dataset, and it returns a new dataset that will repeat the items of the original dataset three times. Of course, this will not copy all the data in memory three times! If you call this method with no arguments, the new dataset will repeat the source dataset forever. Then we call the batch() method on this new dataset, and again this creates a new dataset. This one will group the items of the previous dataset in batches of 7 items.

Finally, we iterate over the items of this final dataset. As you can see, the batch() method had to output a final batch of size 2 instead of 7, but you can call it with drop_remainder=True if you want it to drop this final batch so that all batches have the exact same size.

In [10]:
dataset = tf.data.Dataset.range(10)
dataset = dataset.repeat(5).batch(9)
for item in dataset:
  print(item)

tf.Tensor([0 1 2 3 4 5 6 7 8], shape=(9,), dtype=int64)
tf.Tensor([9 0 1 2 3 4 5 6 7], shape=(9,), dtype=int64)
tf.Tensor([8 9 0 1 2 3 4 5 6], shape=(9,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3 4 5], shape=(9,), dtype=int64)
tf.Tensor([6 7 8 9 0 1 2 3 4], shape=(9,), dtype=int64)
tf.Tensor([5 6 7 8 9], shape=(5,), dtype=int64)


In [11]:
dataset = tf.data.Dataset.range(10)
dataset = dataset.repeat(5).batch(9, drop_remainder=True)  # discard the remainder
for item in dataset:
  print(item)

tf.Tensor([0 1 2 3 4 5 6 7 8], shape=(9,), dtype=int64)
tf.Tensor([9 0 1 2 3 4 5 6 7], shape=(9,), dtype=int64)
tf.Tensor([8 9 0 1 2 3 4 5 6], shape=(9,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3 4 5], shape=(9,), dtype=int64)
tf.Tensor([6 7 8 9 0 1 2 3 4], shape=(9,), dtype=int64)
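
As mentioned earlier, calling repeat() with no arguments produces a dataset that never ends, so iterating over it directly would loop forever. The sketch below is one possible way to explore such a dataset safely: it bounds the iteration with the take() method. The names `endless` and `first_seven` are illustrative only.

```python
import tensorflow as tf

# repeat() with no arguments yields an endless dataset
endless = tf.data.Dataset.range(3).repeat()

# take(7) limits iteration to the first 7 items, so the loop terminates
first_seven = [int(x.numpy()) for x in endless.take(7)]
print(first_seven)  # [0, 1, 2, 0, 1, 2, 0]
```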


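You can also apply any transformation you want to the items by calling the map() method. Since the dataset above is already batched, each item passed to the lambda is a whole batch of 9 integers: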

In [12]:
dataset = dataset.map(lambda x: x * 2)
for item in dataset:
  print(item)

tf.Tensor([ 0  2  4  6  8 10 12 14 16], shape=(9,), dtype=int64)
tf.Tensor([18  0  2  4  6  8 10 12 14], shape=(9,), dtype=int64)
tf.Tensor([16 18  0  2  4  6  8 10 12], shape=(9,), dtype=int64)
tf.Tensor([14 16 18  0  2  4  6  8 10], shape=(9,), dtype=int64)
tf.Tensor([12 14 16 18  0  2  4  6  8], shape=(9,), dtype=int64)


The map() method is the one you will call to apply any preprocessing you want to your data. Sometimes this will include computations that can be quite intensive, such as reshaping or rotating an image, so you will usually want to spawn multiple threads to speed things up: it’s as simple as setting the num_parallel_calls argument.
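
Here is a minimal sketch of setting num_parallel_calls. The `preprocess` function is a hypothetical stand-in for an expensive transformation, and the value 4 is an arbitrary choice for illustration; a good value depends on your hardware.

```python
import tensorflow as tf

def preprocess(x):
    # Hypothetical stand-in for an expensive preprocessing step
    return x * 2

numbers = tf.data.Dataset.range(10)

# Allow up to 4 calls to preprocess() to run in parallel.
# By default the items still come out in their original order.
parallel = numbers.map(preprocess, num_parallel_calls=4)

doubled = [int(x.numpy()) for x in parallel]
print(doubled)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

With a function this cheap, parallelism brings no measurable benefit; the point is only to show where the argument goes.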

While the map() method applies a transformation to each item, the apply() method applies a transformation to the dataset as a whole.

For example, the following code “unbatches” the dataset by applying the unbatch() function to it. Each item in the new dataset will be a single integer tensor instead of a batch of 9 integers:

In [22]:
dataset1 = tf.data.Dataset.range(10)
dataset1 = dataset1.repeat(5).batch(9, drop_remainder=True)  # discard the remainder
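# Note: tf.data.experimental.unbatch() is deprecated in later TensorFlow 2.x releases;
# dataset1.unbatch() is the equivalent method.
dataset1 = dataset1.apply(tf.data.experimental.unbatch())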
for item in dataset1:
  print(item)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
...
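
The apply() method accepts any function that takes a dataset and returns a dataset, so you can also pass your own. The sketch below assumes a hypothetical helper, `square_and_batch`, that bundles two transformations into one reusable step.

```python
import tensorflow as tf

def square_and_batch(ds):
    # Receives the whole dataset and returns a new, transformed dataset
    return ds.map(lambda x: x * x).batch(3)

numbers = tf.data.Dataset.range(6)
squared_batches = numbers.apply(square_and_batch)

batch_values = [batch.numpy().tolist() for batch in squared_batches]
print(batch_values)  # [[0, 1, 4], [9, 16, 25]]
```

Packaging a pipeline fragment this way makes it easy to reuse the same steps on a training set and a validation set.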

It is also possible to simply filter the dataset using the filter() method:

In [24]:
dataset1 = dataset1.filter(lambda x: x < 5)
for item in dataset1:
  print(item)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)


You will often want to look at just a few items from a dataset. You can use the take() method for that:

In [25]:
for item in dataset1.take(3):
  print(item)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
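
Like the other transformations, take() returns a new dataset and leaves the original untouched. As a small sketch (the names `head` and `rest` are illustrative), it can be paired with the skip() method to split a dataset into its first few items and everything else:

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(10)
head = numbers.take(3)  # the first 3 items
rest = numbers.skip(3)  # everything after the first 3 items

head_values = [int(x.numpy()) for x in head]
rest_values = [int(x.numpy()) for x in rest]
print(head_values)  # [0, 1, 2]
print(rest_values)  # [3, 4, 5, 6, 7, 8, 9]
```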


### Shuffling the Data