# Chapter 13: Loading and Preprocessing Data with TensorFlow

In [1]:
# Preliminaries
import sklearn
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os

# Random seeds
np.random.seed(42)
tf.random.set_seed(42)

# Plots
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

__Chapter overview__ :
- Data API
- TFRecord format
- How to create custom preprocessing layers & use standard keras ones
- Two tf projects:
    1. `tf.Transform` : single preprocessing function
        - Runs in batch mode before training
        - Exported to tf function
        - Incorporated in trained model to preprocess new instances
    2. tf datasets (TFDS)
        - Function to download many common datasets
        - Convenient dataset objects to manipulate

## 1. Data API

tf __dataset__ : sequence of data items; usually read in gradually from disk

In [39]:
# Create dataset entirely in RAM
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

`from_tensor_slices()` takes tensor and creates `tf.data.Dataset` object whose elements are all slices of X along first dimension

_Alternatively_ : `tf.data.Dataset.range(10)`
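
A minimal sketch comparing the two constructors, assuming TF 2.x in eager mode (the names `ds_slices` and `ds_range` are illustrative only). Both yield the same values; the element dtype differs:

```python
import tensorflow as tf

# Two ways to build the same in-memory sequence 0..9
ds_slices = tf.data.Dataset.from_tensor_slices(tf.range(10))
ds_range = tf.data.Dataset.range(10)

values_slices = [int(v) for v in ds_slices.as_numpy_iterator()]
values_range = [int(v) for v in ds_range.as_numpy_iterator()]

# Same values, but tf.range gives int32 items while Dataset.range gives int64
print(values_slices == values_range)
print(ds_slices.element_spec.dtype, ds_range.element_spec.dtype)
```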

In [40]:
# Iterate over dataset's items
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


### 1.1 Chaining transformations

#### 1.1.1 Basic transforms

Once have dataset, can apply transformations by calling its transformation methods; each method returns a new dataset

In [41]:
dataset_og = dataset
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


_Key parts_ :
- `repeat` : repeats elements of dataset 3 times; does not copy all the data in memory three times
- `batch` : groups items in previous dataset in batches of 7 items
    - Final batch only has 2; can drop this with `drop_remainder = True`

In [42]:
dataset2 = dataset_og.repeat(3).batch(7, drop_remainder = True)
for item in dataset2:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)


- Dataset methods don't modify datasets; create new ones! 
- Make sure to keep reference to new datasets
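
A small sketch of this point, assuming TF 2.x eager mode (`base` and `batched` are illustrative names): calling a method without keeping the result leaves the original dataset untouched.

```python
import tensorflow as tf

base = tf.data.Dataset.range(5)
base.batch(2)            # result discarded: `base` itself is not modified
batched = base.batch(2)  # keep a reference to the new dataset instead

base_items = [int(v) for v in base.as_numpy_iterator()]
batched_items = [b.tolist() for b in batched.as_numpy_iterator()]
print(base_items)     # still individual items
print(batched_items)  # batches of up to 2 items
```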

#### 1.1.2 Methods

Can also transform by `map()` method

In [43]:
dataset = dataset.map(lambda x: x * 2)
for item in dataset:
    print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
tf.Tensor([16 18], shape=(2,), dtype=int32)


- `map()`: transforms each item (elementwise)
- `apply()` : transformation to dataset as a whole
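
To make the difference concrete, a hedged sketch (the helper `keep_even_then_double` is hypothetical, not part of TensorFlow): `map()` receives one item at a time, whereas the function passed to `apply()` receives the whole dataset and returns a new one.

```python
import tensorflow as tf

def keep_even_then_double(ds):
    # Hypothetical helper: takes a whole dataset, returns a new dataset
    return ds.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)

dataset_demo = tf.data.Dataset.range(6)
mapped = dataset_demo.map(lambda x: x + 1)            # per-item transformation
applied = dataset_demo.apply(keep_even_then_double)   # whole-dataset transformation

mapped_items = [int(v) for v in mapped.as_numpy_iterator()]
applied_items = [int(v) for v in applied.as_numpy_iterator()]
print(mapped_items)
print(applied_items)
```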

In [45]:
# ex1: unbatch entire dataset
dataset = dataset.unbatch()

In [46]:
# ex2: filter dataset to items less than 10
dataset = dataset.filter(lambda x: x < 10)

In [48]:
# To look at a few items only:
for item in dataset.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)


### 1.2 Shuffling the data

_How shuffling works_
1. Creates new dataset
2. Starts by filling buffer with first items of source dataset
3. Whenever asked for item, pulls one at random from buffer and replaces it with fresh one from source dataset, until iterated entirely through source dataset
4. Continues to pull items randomly from buffer until empty

Need large enough buffer to have effective shuffling!
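
The steps above can be sketched in plain Python. This is an illustration of the buffer idea only, not TensorFlow's actual implementation; `buffered_shuffle` is a hypothetical name.

```python
import random

def buffered_shuffle(source, buffer_size, seed=None):
    """Yield items of `source` in buffer-shuffled order (illustrative sketch)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be >= 1")
    rng = random.Random(seed)
    source_iter = iter(source)
    buffer = []
    # Fill the buffer with the first items of the source
    for item in source_iter:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Pull a random item from the buffer, replace it with a fresh source item
    for item in source_iter:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: keep pulling randomly until the buffer is empty
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=42))
print(shuffled)
```

With a buffer of 3, the item at output position `i` can come from source position `i + 2` at most, which is why a small buffer gives only a weak shuffle.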

In [51]:
tf.random.set_seed(42)

# Repeat 0-9, 3 times
dataset = tf.data.Dataset.range(10).repeat(3)
# Buffer size 3, batches of 7
dataset = dataset.shuffle(buffer_size = 3, seed = 42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([1 3 0 4 2 5 6], shape=(7,), dtype=int64)
tf.Tensor([8 7 1 0 3 2 5], shape=(7,), dtype=int64)
tf.Tensor([4 6 9 8 9 7 0], shape=(7,), dtype=int64)
tf.Tensor([3 1 4 5 2 8 7], shape=(7,), dtype=int64)
tf.Tensor([6 9], shape=(2,), dtype=int64)


If call `repeat` on shuffled dataset, generates new order at each iteration

If want to turn this off, set `reshuffle_each_iteration = False`
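
A short sketch of the flag, assuming TF 2.x eager mode (variable names are illustrative): with `reshuffle_each_iteration = False`, both passes produced by `repeat(2)` use the same shuffled order.

```python
import tensorflow as tf

fixed = (tf.data.Dataset.range(5)
         .shuffle(buffer_size=5, seed=42, reshuffle_each_iteration=False)
         .repeat(2))

fixed_items = [int(v) for v in fixed.as_numpy_iterator()]
first_pass, second_pass = fixed_items[:5], fixed_items[5:]
print(first_pass)
print(second_pass)  # same order as the first pass
```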

### 1.3 Dealing with large files

#### 1.3.1  Split larger dataset into smaller datasets

_Example_ : consider the California housing dataset

In [52]:
# Load, split train/val/test, scale
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()

x_train_full, x_test, y_train_full, y_test = train_test_split(housing.data,
                                                             housing.target.reshape(-1, 1),
                                                             random_state = 42)
x_train, x_valid, y_train, y_valid = train_test_split(x_train_full, y_train_full,
                                                     random_state = 42)

scaler = StandardScaler()
scaler.fit(x_train)
x_mean = scaler.mean_
x_std = scaler.scale_

One option for large dataset: split into many files, then have tf read these in parallel

_Example_ : split housing dataset & save into 20 .csv's

In [58]:
def save_to_multiple_csv_files(data, name_prefix, header = None, n_parts = 10):
    housing_dir = "housing"
    os.makedirs(housing_dir, exist_ok = True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")
    
    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding = "utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths

In [59]:
train_data = np.c_[x_train, y_train]
valid_data = np.c_[x_valid, y_valid]
test_data = np.c_[x_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts = 20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

Examine first few lines of csv file

In [60]:
import pandas as pd

pd.read_csv(train_filepaths[0]).head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
0,3.5214,15.0,3.049945,1.106548,1447.0,1.605993,37.63,-122.43,1.442
1,5.3275,5.0,6.49006,0.991054,3464.0,3.44334,33.69,-117.39,1.687
2,3.1,29.0,7.542373,1.591525,1328.0,2.250847,38.44,-122.98,1.621
3,7.1736,12.0,6.289003,0.997442,1054.0,2.695652,33.55,-117.7,2.621
4,2.0549,13.0,5.312457,1.085092,3297.0,2.244384,33.93,-116.93,0.956


Text mode:

In [61]:
with open(train_filepaths[0]) as f:
    for i in range(5):
        print(f.readline(), end = "")

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621


Examine the filepaths:

In [62]:
train_filepaths

['housing/my_train_00.csv',
 'housing/my_train_01.csv',
 'housing/my_train_02.csv',
 'housing/my_train_03.csv',
 'housing/my_train_04.csv',
 'housing/my_train_05.csv',
 'housing/my_train_06.csv',
 'housing/my_train_07.csv',
 'housing/my_train_08.csv',
 'housing/my_train_09.csv',
 'housing/my_train_10.csv',
 'housing/my_train_11.csv',
 'housing/my_train_12.csv',
 'housing/my_train_13.csv',
 'housing/my_train_14.csv',
 'housing/my_train_15.csv',
 'housing/my_train_16.csv',
 'housing/my_train_17.csv',
 'housing/my_train_18.csv',
 'housing/my_train_19.csv']

#### 1.3.2 Interleaving files

Create training dataset containing only file paths

In [63]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed = 42)

`list_files` returns dataset that shuffles the filepaths
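
If the shuffling is not wanted, `list_files` accepts `shuffle = False`. A self-contained sketch using throwaway files in a temporary directory (the file names are made up for illustration):

```python
import os
import tempfile
import tensorflow as tf

# Create three small placeholder files so that list_files has something to match
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmp_dir, "part_{:02d}.csv".format(i))
    with open(path, "w", encoding="utf-8") as f:
        f.write("x\n")
    paths.append(path)

unshuffled = tf.data.Dataset.list_files(paths, shuffle=False)
listed = [p.decode("utf-8") for p in unshuffled.as_numpy_iterator()]
print(listed)
```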

In [64]:
for filepath in filepath_dataset:
    print(filepath)

tf.Tensor(b'housing/my_train_15.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_08.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_03.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_01.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_10.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_05.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_19.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_16.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_02.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_09.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_00.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_07.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_12.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_04.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_17.csv', shape=(), dtype=string)
tf.Tensor(b'housing/my_train_11.csv', shape=(), dtype=string)
tf.Tenso

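Can call `interleave` to create dataset that pulls five file paths at a time from `filepath_dataset`; for each one, calls given function (e.g. lambda) to create a new dataset (here a `TextLineDataset`), then cycles through these five datasets, reading one line at a time from each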

In [66]:
n_readers = 5
dataset = filepath_dataset.interleave(lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
                                     cycle_length = n_readers)

In [67]:
for line in dataset.take(5):
    print(line.numpy())

b'4.6477,38.0,5.03728813559322,0.911864406779661,745.0,2.5254237288135593,32.64,-117.07,1.504'
b'8.72,44.0,6.163179916317992,1.0460251046025104,668.0,2.794979079497908,34.2,-118.18,4.159'
b'3.8456,35.0,5.461346633416459,0.9576059850374065,1154.0,2.8778054862842892,37.96,-122.05,1.598'
b'3.3456,37.0,4.514084507042254,0.9084507042253521,458.0,3.2253521126760565,36.67,-121.7,2.526'
b'3.6875,44.0,4.524475524475524,0.993006993006993,457.0,3.195804195804196,34.04,-118.15,1.625'


Thus far, seven datasets:
- filepath dataset
- interleave dataset
- five `TextLineDatasets` created internally by interleave dataset

_Notes_ : 
- Preferable for files to have same length for interleaving; otherwise ends of the longest files are not interleaved
- Can read files in parallel by setting `num_parallel_calls` argument of `interleave()`
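
The cycling behaviour is easier to see on a tiny in-memory example. A sketch assuming TF 2.x eager mode, with each "file" replaced by a dataset that repeats one number twice (names are illustrative):

```python
import tensorflow as tf

# Each source item x stands in for a file; its "lines" are [x, x]
source = tf.data.Dataset.range(1, 4)
interleaved = source.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(2),
    cycle_length=2,   # read from two sub-datasets at a time
    block_length=1)   # take one item from each before moving on

interleaved_items = [int(v) for v in interleaved.as_numpy_iterator()]
print(interleaved_items)
```

The first two sub-datasets alternate until they run out; only then is the third one opened.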

#### 1.3.3 Details

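`tf.io.decode_csv` takes type of each field from `record_defaults`; e.g. fourth field interpreted as string since its default (`"Hello"`) is a string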

In [70]:
record_defaults = [0, np.nan, tf.constant(np.nan, dtype = tf.float64), "Hello", tf.constant([])]
parsed_fields = tf.io.decode_csv('1,2,3,4,5', record_defaults)
parsed_fields

[<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=float32, numpy=2.0>,
 <tf.Tensor: shape=(), dtype=float64, numpy=3.0>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'4'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]

All missing fields replaced with default when provided

In [71]:
parsed_fields = tf.io.decode_csv(',,,,5', record_defaults)
parsed_fields

[<tf.Tensor: shape=(), dtype=int32, numpy=0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=nan>,
 <tf.Tensor: shape=(), dtype=float64, numpy=nan>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'Hello'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]

Fifth field compulsory since provided `tf.constant([])` as "default value" - get exception if not provided

In [72]:
try:
    parsed_fields = tf.io.decode_csv(',,,,', record_defaults)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

Field 4 is required but missing in record 0! [Op:DecodeCSV]


Number of fields should exactly match number of fields in `record_defaults`

In [73]:
try:
    parsed_fields = tf.io.decode_csv('1,2,3,4,5,6,7', record_defaults)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]


### 1.4 Preprocessing the data

#### 1.4.1 Preprocessing function

Need to parse and scale the data from all of the .csv's

In [76]:
print("x_mean:", x_mean)
print("x_std:", x_std)

x_mean: [ 3.89175860e+00  2.86245478e+01  5.45593655e+00  1.09963474e+00
  1.42428122e+03  2.95886657e+00  3.56464315e+01 -1.19584363e+02]
x_std: [1.90927329e+00 1.26409177e+01 2.55038070e+00 4.65460128e-01
 1.09576000e+03 2.36138048e+00 2.13456672e+00 2.00093304e+00]


In [77]:
n_inputs = 8

# Assume mean, stddev known
@tf.function
def preprocess(line):
    # Default values
    defs = [0.] * n_inputs + [tf.constant([], dtype = tf.float32)]
    # Takes one csv line at a time and parses
    # Arg1: line to parse
    # Arg2: array containing default value for each csv column
    fields = tf.io.decode_csv(line, record_defaults = defs)
    # Output: list of scalar tensors -- need 1D arrays!
    # Turn these into 1D arrays via tf.stack
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    # Return scaled features
    return (x - x_mean) / x_std, y

In [78]:
preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')

(<tf.Tensor: shape=(8,), dtype=float32, numpy=
 array([ 0.16579157,  1.216324  , -0.05204565, -0.39215982, -0.5277444 ,
        -0.2633488 ,  0.8543046 , -1.3072058 ], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.782], dtype=float32)>)
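
The scaling step is ordinary standardization, so the result can be cross-checked with NumPy alone. A sketch that copies the rounded `x_mean` / `x_std` values printed above (the `_demo` names are illustrative):

```python
import numpy as np

# Mean and std copied (rounded) from the printed x_mean / x_std output
x_mean_demo = np.array([3.89175860e+00, 2.86245478e+01, 5.45593655e+00, 1.09963474e+00,
                        1.42428122e+03, 2.95886657e+00, 3.56464315e+01, -1.19584363e+02])
x_std_demo = np.array([1.90927329e+00, 1.26409177e+01, 2.55038070e+00, 4.65460128e-01,
                       1.09576000e+03, 2.36138048e+00, 2.13456672e+00, 2.00093304e+00])

line = "4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782"
fields = np.array([float(v) for v in line.split(",")])
x_raw, y = fields[:-1], fields[-1:]          # 8 features, 1 target

x_scaled = (x_raw - x_mean_demo) / x_std_demo  # same formula as in preprocess()
print(np.round(x_scaled, 4))
print(y)
```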

#### 1.4.2 Putting it all together

_Helper function_ : create & return dataset
- Efficiently load CA housing data from multiple .csv's
- Preprocess, shuffle
- Optional: repeat, batch

In [79]:
def csv_reader_dataset(filepaths, repeat = 1, n_readers = 5,
                      n_read_threads = None, shuffle_buffer_size = 10000,
                      n_parse_threads = 5, batch_size = 32):
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    dataset = dataset.interleave(lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
                                cycle_length = n_readers,
                                num_parallel_calls = n_read_threads)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls = n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

In [80]:
tf.random.set_seed(42)

train_set = csv_reader_dataset(train_filepaths, batch_size = 3)
for x_batch, y_batch in train_set.take(2):
    print("x =", x_batch)
    print("y =", y_batch)
    print()

x = tf.Tensor(
[[ 0.5804519  -0.20762321  0.05616303 -0.15191229  0.01343246  0.00604472
   1.2525111  -1.3671792 ]
 [ 5.818099    1.8491895   1.1784915   0.28173092 -1.2496178  -0.3571987
   0.7231292  -1.0023477 ]
 [-0.9253566   0.5834586  -0.7807257  -0.28213993 -0.36530012  0.27389365
  -0.76194876  0.72684526]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[1.752]
 [1.313]
 [1.535]], shape=(3, 1), dtype=float32)

x = tf.Tensor(
[[-0.8324941   0.6625668  -0.20741376 -0.18699841 -0.14536144  0.09635526
   0.9807942  -0.67250353]
 [-0.62183803  0.5834586  -0.19862501 -0.3500319  -1.1437552  -0.3363751
   1.107282   -0.8674123 ]
 [ 0.8683102   0.02970133  0.3427381  -0.29872298  0.7124906   0.28026953
  -0.72915536  0.86178064]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[0.919]
 [1.028]
 [2.182]], shape=(3, 1), dtype=float32)



#### 1.4.3 Prefetching

`prefetch(1)` :
- While training, dataset works in parallel to get next batch ready
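- If loading & preprocessing multithreaded (`num_parallel_calls` in `interleave()` and `map()`), can exploit multiple CPU cores so that preparing a batch ideally takes less time than running a training step on the GPU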

If dataset small enough to fit in memory, can speed up training by using `cache()` method to cache content into RAM
- Generally done after loading & preprocessing, but before shuffling/repeating/batching/prefetching
- Allows each instance to be read/preprocessed only once (rather than once per epoch); data still shuffled differently at each epoch
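
A sketch of that ordering on a toy in-memory dataset, assuming TF 2.x eager mode (the doubling `map` merely stands in for expensive preprocessing; names are illustrative):

```python
import tensorflow as tf

dataset_cached = tf.data.Dataset.range(10)
dataset_cached = dataset_cached.map(lambda x: x * 2)   # stand-in for preprocessing
dataset_cached = dataset_cached.cache()                # cache preprocessed items in RAM
dataset_cached = dataset_cached.shuffle(buffer_size=10, seed=42)
dataset_cached = dataset_cached.batch(4).prefetch(1)

# Two passes: the second one is served from the cache, then shuffled again
epoch_items = []
for epoch in range(2):
    epoch_items.append(sorted(int(v) for batch in dataset_cached
                              for v in batch.numpy()))
print(epoch_items[0])
```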

### 1.5 Using the dataset with tf.keras

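Use `csv_reader_dataset()` to create datasets; training set repeated indefinitely here (`repeat = None`), so `fit()` needs `steps_per_epoch`; validation & test sets do not need to repeat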

In [81]:
train_set = csv_reader_dataset(train_filepaths, repeat = None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

In [82]:
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

In [83]:
model = keras.models.Sequential([
    keras.layers.Dense(30, activation = "relu", input_shape = x_train.shape[1:]),
    keras.layers.Dense(1)
])

In [84]:
model.compile(loss = "mse", optimizer = keras.optimizers.SGD(learning_rate = 1e-3))

In [85]:
batch_size = 32
model.fit(train_set, 
          steps_per_epoch = len(x_train) // batch_size,
          epochs = 10,
          validation_data = valid_set)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f97be7d5bd0>

In [86]:
model.evaluate(test_set, steps = len(x_test) // batch_size)



0.4787752032279968

In [90]:
model.predict(test_set, steps = len(x_test) // batch_size)

array([[2.3576407],
       [2.255291 ],
       [1.4437605],
       ...,
       [0.5654393],
       [3.9442453],
       [1.0232248]], dtype=float32)

#### 1.5.1 Custom training loop

In [91]:
optimizer = keras.optimizers.Nadam(learning_rate = 0.01)
loss_fn = keras.losses.mean_squared_error

n_epochs = 5
batch_size = 32
n_steps_per_epoch = len(x_train) // batch_size
total_steps = n_epochs * n_steps_per_epoch
global_step = 0

for x_batch, y_batch in train_set.take(total_steps):
    global_step += 1
    print("\rGlobal step {}/{}".format(global_step, total_steps), end = "")
    with tf.GradientTape() as tape:
        y_pred = model(x_batch)
        main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
        loss = tf.add_n([main_loss] + model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

Global step 1810/1810

#### 1.5.2 tf.function version

In [92]:
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

In [94]:
optimizer = keras.optimizers.Nadam(learning_rate = 0.01)
loss_fn = keras.losses.mean_squared_error

@tf.function
def train(model, n_epochs, batch_size = 32, n_readers = 5,
          n_read_threads = 5, shuffle_buffer_size = 10000,
          n_parse_threads = 5):
    
    train_set = csv_reader_dataset(train_filepaths,
                                  repeat = n_epochs,
                                  n_readers = n_readers,
                                  n_read_threads = n_read_threads,
                                  shuffle_buffer_size = shuffle_buffer_size,
                                  n_parse_threads = n_parse_threads,
                                  batch_size = batch_size)
    
    for x_batch, y_batch in train_set:
        with tf.GradientTape() as tape:
            y_pred = model(x_batch)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        
train(model, 5)

In [95]:
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

In [97]:
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.mean_squared_error

@tf.function
def train(model, n_epochs, batch_size=32,
          n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
    train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs, n_readers=n_readers,
                       n_read_threads=n_read_threads, shuffle_buffer_size=shuffle_buffer_size,
                       n_parse_threads=n_parse_threads, batch_size=batch_size)
    n_steps_per_epoch = len(x_train) // batch_size
    total_steps = n_epochs * n_steps_per_epoch
    global_step = 0
    for X_batch, y_batch in train_set.take(total_steps):
        global_step += 1
        if tf.equal(global_step % 100, 0):
            tf.print("\rGlobal step", global_step, "/", total_steps)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

train(model, 5)

Global step 100 / 1810
Global step 200 / 1810
Global step 300 / 1810
Global step 400 / 1810
Global step 500 / 1810
Global step 600 / 1810
Global step 700 / 1810
Global step 800 / 1810
Global step 900 / 1810
Global step 1000 / 1810
Global step 1100 / 1810
Global step 1200 / 1810
Global step 1300 / 1810
Global step 1400 / 1810
Global step 1500 / 1810
Global step 1600 / 1810
Global step 1700 / 1810
Global step 1800 / 1810


Description of each method in `Dataset` class:

In [98]:
for m in dir(tf.data.Dataset):
    if not (m.startswith("_") or m.endswith("_")):
        func = getattr(tf.data.Dataset, m)
        if hasattr(func, "__doc__"):
            print("● {:21s}{}".format(m + "()", func.__doc__.split("\n")[0]))

● apply()              Applies a transformation function to this dataset.
● as_numpy_iterator()  Returns an iterator which converts all elements of the dataset to numpy.
● batch()              Combines consecutive elements of this dataset into batches.
● cache()              Caches the elements in this dataset.
● cardinality()        Returns the cardinality of the dataset, if known.
● concatenate()        Creates a `Dataset` by concatenating the given dataset with this dataset.
● element_spec()       The type specification of an element of this dataset.
● enumerate()          Enumerates the elements of this dataset.
● filter()             Filters this dataset according to `predicate`.
● flat_map()           Maps `map_func` across this dataset and flattens the result.
● from_generator()     Creates a `Dataset` whose elements are generated by `generator`.
● from_tensor_slices() Creates a `Dataset` whose elements are slices of the given tensors.
● from_tensors()       Creates a `Dataset` 

## 2. TFRecord Format

TODO

## 3. Preprocessing Input Features

TODO