# Chapter 13: Loading and Preprocessing Data with TensorFlow

In [1]:
# Preliminaries
import sklearn
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os

# Random seeds
np.random.seed(42)
tf.random.set_seed(42)

# Plots
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

__Chapter overview__ :
- Data API
- TFRecord format
- How to create custom preprocessing layers and use the standard Keras ones
- Two related TensorFlow projects:
    1. TF Transform (`tf.Transform`) : write a single preprocessing function that
        - Runs in batch mode on the training set before training
        - Is then exported to a TF Function
        - Is incorporated into the trained model to preprocess new instances
    2. TF Datasets (TFDS)
        - Provides a function to download many common datasets
        - Returns convenient dataset objects to manipulate

## 1. Data API

TensorFlow __dataset__ : a sequence of data items, usually read gradually from disk

In [20]:
# Create dataset entirely in RAM
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

`from_tensor_slices()` takes a tensor and creates a `tf.data.Dataset` object whose elements are the slices of `X` along its first dimension (here, the 10 scalars 0 through 9)

_Alternatively_ : `tf.data.Dataset.range(10)` (same values, but the items are `tf.int64` rather than `tf.int32`)
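
As a quick check of the two points above, here is a minimal sketch assuming TensorFlow 2.x with eager execution. The names `matrix`, `rows`, `from_slices` and `from_range` are illustrative and do not appear in the original notebook. It slices a 2D tensor along its first dimension, then compares the two ways of building the 0 to 9 dataset.

```python
import tensorflow as tf

# Slicing happens along the first dimension:
# a (3, 2) tensor yields 3 items, each of shape (2,)
matrix = tf.constant([[1, 2], [3, 4], [5, 6]])  # illustrative data
rows = [row.numpy().tolist() for row in tf.data.Dataset.from_tensor_slices(matrix)]
print(rows)

# Both constructions yield the values 0..9, but with different dtypes
from_slices = tf.data.Dataset.from_tensor_slices(tf.range(10))
from_range = tf.data.Dataset.range(10)
print([int(x) for x in from_slices] == [int(x) for x in from_range])
print(from_slices.element_spec.dtype, from_range.element_spec.dtype)
```

The dtype difference can matter later, for example when a `map()` function combines the items with tensors of another integer type.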

In [21]:
# Iterate over dataset's items
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


### 1.1 Chaining transformations

#### 1.1.1 Basic transforms

Once you have a dataset, you can apply transformations by calling its transformation methods; each method returns a new dataset, so calls can be chained

In [22]:
dataset1 = dataset.repeat(3).batch(7)
for item in dataset1:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


_Key parts_ :
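- `repeat(3)` : returns a dataset that yields the items of the original dataset 3 times in a row; the data is not copied three times in memory
- `batch(7)` : groups the items of the previous dataset into batches of 7 items
    - The final batch has only 2 items (30 is not a multiple of 7); pass `drop_remainder=True` to drop it so that all batches have the same size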

In [23]:
dataset2 = dataset.repeat(3).batch(7, drop_remainder=True)
for item in dataset2:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
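
Because each method wraps the dataset returned by the previous one, the order of the chain matters. The sketch below is not from the original notebook and its variable names are illustrative. It compares batch sizes for `repeat(3).batch(7)` and `batch(7).repeat(3)` on a 10-item dataset.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Repeat first: batches are cut from one continuous stream of 30 items
repeat_then_batch = dataset.repeat(3).batch(7)
# Batch first: the two batches (7 items, then 3) are themselves repeated 3 times
batch_then_repeat = dataset.batch(7).repeat(3)

sizes_repeat_first = [int(b.shape[0]) for b in repeat_then_batch]
sizes_batch_first = [int(b.shape[0]) for b in batch_then_repeat]
print(sizes_repeat_first)  # [7, 7, 7, 7, 2]
print(sizes_batch_first)   # [7, 3, 7, 3, 7, 3]
```

Repeating before batching lets batches straddle the boundary between two passes over the data, while batching first produces a short batch at the end of every pass.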


- Dataset methods do not modify the dataset they are called on; they return new datasets
- Keep a reference to the returned dataset (e.g. `dataset = dataset.batch(7)`), otherwise the call has no effect
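
A small sketch of this pitfall (illustrative names, assuming TensorFlow 2.x): calling `batch()` without keeping the result leaves the original dataset untouched.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# The returned dataset is discarded, so `dataset` is still unbatched
dataset.batch(5)
n_items_original = len(list(dataset))

# Keeping the reference gives access to the batched dataset
batched = dataset.batch(5)
n_items_batched = len(list(batched))

print(n_items_original, n_items_batched)  # 10 2
```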

#### 1.1.2 Other transformation methods

Items can also be transformed with the `map()` method, which applies a function to each item

In [25]:
dataset1 = dataset1.map(lambda x: x * 2)
for item in dataset1:
    print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
tf.Tensor([16 18], shape=(2,), dtype=int32)


- `map()` : applies a function to each item of the dataset (here each item is a whole batch, so `x * 2` doubles every value in it)
- `apply()` : applies a transformation to the dataset as a whole

In [26]:
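# ex1: unbatch entire dataset (each item becomes a single integer again)
# Note: tf.data.experimental.unbatch() is deprecated in TF 2.x; dataset1.unbatch() is the built-in equivalent
dataset1 = dataset1.apply(tf.data.experimental.unbatch())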
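
To make the `map()` versus `apply()` distinction concrete, here is a hedged sketch. `unbatch_then_skip_first` is a hypothetical helper written only for this illustration; `Dataset.unbatch()` and `Dataset.skip()` are standard `tf.data` methods in TensorFlow 2.x. The function passed to `map()` receives one item at a time, while the function passed to `apply()` receives the whole dataset and must return a dataset.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10).batch(4)

# map(): the lambda is called on one item (here, one batch) at a time
doubled = dataset.map(lambda x: x * 2)

# apply(): the function is called once, on the dataset itself
def unbatch_then_skip_first(ds):  # hypothetical helper, for illustration only
    return ds.unbatch().skip(1)

result = doubled.apply(unbatch_then_skip_first)
values = [int(v) for v in result]
print(values)  # [2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Wrapping several steps in one function and passing it to `apply()` keeps a long chain readable and lets the same sequence of steps be reused on other datasets.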

In [30]:
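# ex2: filter() keeps only the items for which the predicate is True
# (applied to the original `dataset`; all 10 items satisfy x < 10, so none are dropped)
dataset1 = dataset.filter(lambda x: x < 10)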

In [31]:
for item in dataset1:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
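
Since every item of the original dataset satisfies `x < 10`, the output above is identical to the input. The following sketch uses a different predicate, chosen only for illustration, that does remove items:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

# Keep only the even items; the predicate must return a scalar boolean tensor
evens = dataset.filter(lambda x: x % 2 == 0)
even_values = [int(x) for x in evens]
print(even_values)  # [0, 2, 4, 6, 8]
```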


In [33]:
# To look at a few items only:
for item in dataset1.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
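
`take()` is also a way to bound a dataset that never ends. As a sketch (not part of the original notebook): calling `repeat()` with no argument repeats the source indefinitely, so iterating over it directly would never finish, whereas `take(7)` stops after 7 items.

```python
import tensorflow as tf

# repeat() without a count cycles through the source forever
endless = tf.data.Dataset.range(3).repeat()

# take(7) yields only the first 7 items, so the loop terminates
first_seven = [int(x) for x in endless.take(7)]
print(first_seven)  # [0, 1, 2, 0, 1, 2, 0]
```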


### 1.2 Shuffling the data