![BTS](https://github.com/vfp1/bts-mbds-data-science-foundations-2019/raw/master/sessions/img/Logo-BTS.jpg)

# Session 05: Other TF APIs
### Victor F. Pajuelo Madrigal <victor.pajuelo@bts.tech> - Advanced Data Analysis (23-04-2020)

Open this notebook in Google Colaboratory: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vfp1/bts-advanced-data-analysis-2020/blob/master/S05_Other_TF_APIs/05_Advanced_Data_Analysis_Other_TF_APIs_NOTsolved.ipynb)

**Resources (code patched and updated from):**
* TensorFlow Authors
* Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" (O'Reilly)

# Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (Python 2.x may work, but it is deprecated, so we strongly recommend Python 3), as well as Scikit-Learn ≥0.20 and TensorFlow ≥2.0.

In [0]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    !pip install -q -U tfx==0.21.2
    print("You can safely ignore the package incompatibility errors.")
except Exception:
    pass

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "data"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)



---

---
# TensorFlow as NumPy

UUID - #S5C1

## Tensors and operations

### Tensors

You can create a tensor with `tf.constant()`. For example, here is a tensor representing a **matrix** with **two rows** and **three columns of floats**:

In [0]:
tf.constant([[1., 2., 3.], [4., 5., 6.]]) # matrix

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

A tf.Tensor can also be a **scalar**:

In [0]:
tf.constant(42) # scalar

<tf.Tensor: shape=(), dtype=int32, numpy=42>

Just like a NumPy `ndarray`, a `tf.Tensor` has a shape, a data type (`dtype`) and a number of dimensions (`ndim`):

In [0]:
t = tf.constant([[1., 2., 3.], [4., 5., 6.]])
t

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [0]:
t.shape

TensorShape([2, 3])

In [0]:
t.dtype

tf.float32

In [0]:
t.ndim

2

### Indexing

Indexing works in a similar way to NumPy:

In [0]:
# Get all the rows from column 1 to the end (remember that indexing starts at 0)
t[:, 1:]

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[2., 3.],
       [5., 6.]], dtype=float32)>

In [0]:
"""
Using an ellipsis to stand for as many : objects as needed to
make the selection tuple the same length as t.ndim. Remember,
only one ellipsis can be present in an indexing expression.
"""
t[..., 1]

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2., 5.], dtype=float32)>

In [0]:
"""
Using an ellipsis with tf.newaxis. 
This serves to expand the dimensions of the resulting selection by one unit-length dimension. 
The added dimension is the position of the newaxis object in the selection tuple.
"""
t[..., 1, tf.newaxis]

<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[2.],
       [5.]], dtype=float32)>

In [0]:
t[tf.newaxis, ..., 1]

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[2., 5.]], dtype=float32)>
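
To make the NumPy parallel concrete, here is a minimal sketch that applies the same selections to a NumPy array. The name `a` is introduced here for illustration only. The resulting shapes match the tensor results above, with `np.newaxis` playing the role of `tf.newaxis`.

```python
import numpy as np

# Same values as the tensor t, stored as a NumPy array (illustrative name)
a = np.array([[1., 2., 3.], [4., 5., 6.]], dtype=np.float32)

cols_from_1 = a[:, 1:]                    # all rows, columns 1 to the end
col_1 = a[..., 1]                         # the ellipsis stands for the leading axes
col_1_as_column = a[..., 1, np.newaxis]   # new unit-length axis added last
col_1_as_row = a[np.newaxis, ..., 1]      # new unit-length axis added first

print(cols_from_1.shape, col_1.shape, col_1_as_column.shape, col_1_as_row.shape)
```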

### Ops (operations)

All sorts of operations on tensors (multidimensional arrays) are available.

`t + 10` is equivalent to calling `tf.add(t, 10)` (indeed, Python calls the magic method `t.__add__(10)`, which just calls `tf.add(t, 10)`):

In [0]:
t + 10

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[11., 12., 13.],
       [14., 15., 16.]], dtype=float32)>

In [0]:
tf.add(t, 10)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[11., 12., 13.],
       [14., 15., 16.]], dtype=float32)>

Other operators like `-` and `*` are also supported:

In [0]:
t - 10

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-9., -8., -7.],
       [-6., -5., -4.]], dtype=float32)>

In [0]:
t * 10

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[10., 20., 30.],
       [40., 50., 60.]], dtype=float32)>

Element-wise math functions such as `tf.square()` are also available:

In [0]:
tf.square(t)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)>

The `@` operator for matrix multiplication was added in Python 3.5; it is equivalent to calling the `tf.matmul()` function:

In [0]:
t @ tf.transpose(t)

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

In [0]:
tf.matmul(t, tf.transpose(t))

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

#### Other operations

You will find **all the basic math operations you need** (`tf.add()`, `tf.multiply()`, `tf.square()`, `tf.exp()`, `tf.sqrt()`, etc.) and most operations that you can find in NumPy (e.g., `tf.reshape()`, `tf.squeeze()`, `tf.tile()`). 

Some functions **have a different name than in NumPy**; for instance, `tf.reduce_mean()`, `tf.reduce_sum()`, `tf.reduce_max()`, and `tf.math.log()` are the equivalent of `np.mean()`, `np.sum()`, `np.max()` and `np.log()`. 

**When the name differs, there is often a good reason for it.** For example, in TensorFlow you must write `tf.transpose(t)`; you cannot just write `t.T` like in NumPy. 

The reason is that **the `tf.transpose()` function does not do exactly the same thing as NumPy’s `T` attribute**: 

 * In TensorFlow, a new tensor is created with its own copy of the transposed data
 * In NumPy, `t.T` is just a transposed view on the same data. 
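
The NumPy half of this claim is easy to check. The sketch below (array names are illustrative, not taken from the notebook) shows that `a.T` shares memory with `a`, so a change made to `a` is visible through the transposed view.

```python
import numpy as np

a = np.array([[1., 2., 3.], [4., 5., 6.]])
a_t = a.T  # no data is copied: a_t is a view on a's buffer

shares_buffer = np.shares_memory(a, a_t)

# Modifying the original array is visible through the transposed view
a[0, 1] = 42.
seen_through_view = a_t[1, 0]

print(shares_buffer, seen_through_view)
```

A `tf.Tensor` is immutable, so the same experiment cannot be run on the result of `tf.transpose(t)`.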
 
Similarly, **the `tf.reduce_sum()` operation is named this way because its GPU kernel (i.e., GPU implementation) uses a *reduce algorithm* that does not guarantee the order in which the elements are added**: because 32-bit floats have limited precision, the result may change ever so slightly every time you call this operation. 

The same is true of `tf.reduce_mean()` (but of course `tf.reduce_max()` is deterministic).
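
To see why the order of additions matters in 32-bit arithmetic, here is a small NumPy sketch. The three values are chosen purely for illustration. It does not reproduce TensorFlow's GPU kernel, only the rounding effect that makes the summation order matter.

```python
import numpy as np

big = np.float32(1e8)
neg_big = np.float32(-1e8)
one = np.float32(1.0)

# Near 1e8, consecutive float32 values are 8 apart, so adding 1 gets rounded away
left_to_right = (big + neg_big) + one   # 0 + 1
right_to_left = big + (neg_big + one)   # 1e8 + (-1e8): the 1 was already lost

print(left_to_right, right_to_left)
```

The same three numbers give 1.0 with the first grouping and 0.0 with the second, which is why a reduction that does not fix the order of additions can return slightly different results from one call to the next.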

### From/To NumPy

You can create a tensor from a NumPy array and you can create a NumPy array from a tensor.

In [0]:
# Turn a NumPy array into a tensor
a = np.array([2., 4., 5.])
tf.constant(a)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([2., 4., 5.])>

In [0]:
# Turn a tensor to NumPy array
t.numpy()

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

In [0]:
# Turn a tensor to NumPy array
np.array(t)

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

You can even apply TensorFlow operations to NumPy arrays and NumPy operations to tensors:

In [0]:
# Apply TensorFlow square operation to NumPy array
# Returns a tensor
tf.square(a)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([ 4., 16., 25.])>

In [0]:
# Apply NumPy square operation to a tensor
# Returns an array
np.square(t)

array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)

**WARNING!!!** NumPy uses 64-bit precision by default, while TensorFlow uses 32-bit.

This is because 32-bit precision is generally more than enough for neural networks, plus it runs faster and uses less RAM. So when you create a tensor from a NumPy array, make sure to set `dtype=tf.float32`.

In [0]:
b = tf.constant(a, dtype=tf.float32)
b

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([2., 4., 5.], dtype=float32)>

In [0]:
tf.square(b)

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 4., 16., 25.], dtype=float32)>

### Type Conversions

This can be a nightmare at first, but it becomes extremely useful. Just be patient.

**Type conversions can significantly hurt performance**, and they can easily go unnoticed when they are done automatically. To avoid this, **TensorFlow does not perform any type conversions automatically**: 
* TensorFlow just *raises an exception if you try to execute an operation on tensors with incompatible types*. 
* You cannot add a float tensor and an integer tensor, and you cannot even add a 32-bit float and a 64-bit float

In [0]:
tf.constant(2.0) + tf.constant(40)

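InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a int32 tensor [Op:AddV2]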

In [0]:
try:
    tf.constant(2.0) + tf.constant(40)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a int32 tensor [Op:AddV2]


In [0]:
try:
    tf.constant(2.0) + tf.constant(40., dtype=tf.float64)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a double tensor [Op:AddV2] name: add/
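
For contrast, NumPy does convert types automatically. This sketch (variable names are illustrative) shows a 32-bit float array silently becoming 64-bit when combined with other arrays, which is exactly the kind of unnoticed conversion TensorFlow refuses to perform.

```python
import numpy as np

f32 = np.array([2.0], dtype=np.float32)
i32 = np.array([40], dtype=np.int32)
f64 = np.array([40.0], dtype=np.float64)

# NumPy promotes both operands to a common type without any warning
mixed_int = f32 + i32
mixed_float = f32 + f64

print(mixed_int.dtype, mixed_float.dtype)
```

Both results are `float64`, so the memory and speed benefits of 32-bit data are lost without any visible signal.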


This can be very, very, very annoying at first. But this is done for a good reason. 

And you can always use `tf.cast()` if you really need to convert types:

In [0]:
t2 = tf.constant(40., dtype=tf.float64)
tf.constant(2.0) + tf.cast(t2, tf.float32)

<tf.Tensor: shape=(), dtype=float32, numpy=42.0>

In [0]:
t3 = tf.cast(t2, tf.int32)
tf.constant(2.0) + tf.cast(t3, tf.float32)

<tf.Tensor: shape=(), dtype=float32, numpy=42.0>

### Variables

The `tf.Tensor` values we’ve seen so far **are immutable**: you cannot modify them. 

This means that **we cannot use regular tensors to implement weights in a neural network, since they need to be tweaked by backpropagation**.

Plus, other parameters may also need to change over time (e.g., a momentum optimizer keeps track of past gradients). What we need is a `tf.Variable`:

In [0]:
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

**A `tf.Variable` acts much like a `tf.Tensor`**: 

* You can perform the same operations with it
* It plays nicely with NumPy as well
* It is just as picky with data types. 

However, unlike a `tf.Tensor`, a `tf.Variable` can be modified:

* For instance, a `tf.Variable` can be modified in place using the `assign()` method (or `assign_add()` or `assign_sub()`, which increment or decrement the variable by the given value).
* You can also **modify individual cells** (or slices), by using the cell’s (or slice’s) `assign()` method (direct item assignment will not work) or by using the `scatter_nd_update()` method:

In [0]:
# Assign a value of 2*v to itself
# => [[2., 4., 6.], [8., 10., 12.]]
v.assign(2 * v)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [0]:
# Assign a value of 42 to the cell of v at row 0, column 1
# => [[2., 42., 6.], [8., 10., 12.]]
v[0, 1].assign(42)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [0]:
# Assign 0 and 1 to the last column
# => [[2., 42., 0.], [8., 10., 1.]]
v[:, 2].assign([0., 1.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  0.],
       [ 8., 10.,  1.]], dtype=float32)>

Direct item assignment does not work:

In [0]:
v[1]

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 8., 10.,  1.], dtype=float32)>

In [0]:
try:
    v[1] = [7., 8., 9.]
except TypeError as ex:
    print(ex)

'ResourceVariable' object does not support item assignment


In [0]:
# Update given index location with update values
# => [[100., 42., 0.], [8., 10., 200.]]
v.scatter_nd_update(indices=[[0, 0], [1, 2]],
                    updates=[100., 200.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   0.],
       [  8.,  10., 200.]], dtype=float32)>

Understanding `assign()`, `assign_add()` and `assign_sub()`:

In [0]:
v2 = tf.Variable(1.)
v2

<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>

In [0]:
v2.assign(4.) 

<tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=4.0>

In [0]:
v2.assign_add(1.5)

<tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=5.5>

In [0]:
v2.assign_sub(1.5)

<tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=4.0>

**WARNING !!** In practice you will rarely need to update variables manually, since model parameters are generally updated directly by the optimizers.

### Tensor Arrays

A `tf.TensorArray` is a list of tensors. It has a fixed size by default but can optionally be made dynamic. All the tensors it contains must have the same shape and data type.

In [0]:
# Creating an array of Tensors
array = tf.TensorArray(dtype=tf.float32, size=3)
array = array.write(0, tf.constant([1., 2.]))
array = array.write(1, tf.constant([3., 10.]))
array = array.write(2, tf.constant([5., 7.]))

Stacking the list of tensors into a single tensor:

In [0]:
array.stack()

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[ 1.,  2.],
       [ 3., 10.],
       [ 5.,  7.]], dtype=float32)>



---



---


# The Data API

## Datasets

UUID - #S5C2

In [0]:
X = tf.range(10) #This could be any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

Equivalently (except that the items are now 64-bit integers instead of 32-bit ones):

In [0]:
dataset = tf.data.Dataset.range(10)

In [0]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)


## Chaining transformations

UUID - #S5C3

Repeat the dataset three times and split it into batches of 7:

In [0]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)


Repeat the dataset three times and split it into batches of 10:


In [0]:
dataset2 = tf.data.Dataset.range(10)
dataset2 = dataset2.repeat(3).batch(10)
for item in dataset2:
    print(item)

tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int64)
tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int64)
tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int64)


Use `drop_remainder=True` to make all batches the same size (the final, shorter batch is dropped):

In [0]:
dataset3 = tf.data.Dataset.range(10)
dataset3 = dataset3.repeat(3).batch(7, drop_remainder=True)
for item in dataset3:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
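
The batch counts in these outputs follow from simple arithmetic. As a sketch in plain Python (the variable names are illustrative): 10 items repeated 3 times give 30 items, which fill four batches of 7 and leave 2 items over.

```python
n_items = 10 * 3      # range(10) repeated 3 times
batch_size = 7

full_batches, leftover = divmod(n_items, batch_size)

# Without drop_remainder, the leftover items form one last, shorter batch
n_batches_default = full_batches + (1 if leftover else 0)
# With drop_remainder=True, only the full batches are kept
n_batches_dropped = full_batches

print(full_batches, leftover, n_batches_default, n_batches_dropped)
```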


Use the **`map()`** method to transform the items in a dataset:

In [0]:
dataset = dataset.map(lambda x: x * 2)

In [0]:
for item in dataset:
    print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int64)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int64)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int64)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int64)
tf.Tensor([16 18], shape=(2,), dtype=int64)


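Use the **`map()`** method with **`num_parallel_calls`** to transform several items in parallel. Note that `dataset` was already doubled by the previous `map()` call, so the values below are four times the original ones: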

In [0]:
dataset4 = dataset.map(lambda x: x * 2, num_parallel_calls=2)

In [0]:
for item in dataset4:
    print(item)

tf.Tensor([ 0  4  8 12 16 20 24], shape=(7,), dtype=int64)
tf.Tensor([28 32 36  0  4  8 12], shape=(7,), dtype=int64)
tf.Tensor([16 20 24 28 32 36  0], shape=(7,), dtype=int64)
tf.Tensor([ 4  8 12 16 20 24 28], shape=(7,), dtype=int64)
tf.Tensor([32 36], shape=(2,), dtype=int64)


Use the **`apply()`** method to apply a transformation to the dataset as a whole, rather than item by item:

In [0]:
# Create dataset with range
dataset5 = tf.data.Dataset.range(100) 

# Function to create a filter
def dataset_fn(ds): 
  return ds.filter(lambda x: x < 5) 

# Apply the function
dataset5 = dataset5.apply(dataset_fn) 

# Print items filtered within dataset
list(dataset5.as_numpy_iterator()) 

[0, 1, 2, 3, 4]

Use **`unbatch()`** to split batches back into individual items (the 2 items removed by `drop_remainder=True` are not recovered):

In [0]:
# Create dataset with range
dataset6 = tf.data.Dataset.range(10) 

# Apply the repeat and batch
dataset6 = dataset6.repeat(3).batch(7, drop_remainder=True)

# Apply the unbatch
dataset6 = dataset6.unbatch() 

# Print the items of the unbatched dataset
list(dataset6.as_numpy_iterator()) 

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
 0, 1, 2, 3, 4, 5, 6, 7]

The **`filter()`** method also allows us to filter elements in an unbatched dataset:

In [0]:
# Create dataset with range
dataset7 = tf.data.Dataset.range(10) 

# Apply the repeat and batch
dataset7 = dataset7.repeat(3).batch(7, drop_remainder=True)

# Apply an unbatch function
dataset7 = dataset7.unbatch()

# Apply a filter
dataset7 = dataset7.filter(lambda x: x < 10)  # keep only items < 10

In [0]:
for item in dataset7.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)


## Shuffling the data

UUID - #S5C4

In [0]:
tf.random.set_seed(42)

dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=3, seed=42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([1 3 0 4 2 5 6], shape=(7,), dtype=int64)
tf.Tensor([8 7 1 0 3 2 5], shape=(7,), dtype=int64)
tf.Tensor([4 6 9 8 9 7 0], shape=(7,), dtype=int64)
tf.Tensor([3 1 4 5 2 8 7], shape=(7,), dtype=int64)
tf.Tensor([6 9], shape=(2,), dtype=int64)
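
It may help to picture how a shuffle buffer works. The following is a simplified pure-Python model, not TensorFlow's implementation: the function name `buffered_shuffle` and its details are assumptions made for illustration. It keeps up to `buffer_size` items in memory, repeatedly emits a random one, and refills that slot from the source. This also suggests why a small buffer (3 here) only shuffles locally: an item cannot be emitted before it has entered the buffer.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered item...
        buffer[idx] = item           # ...and put the new item in its slot
    rng.shuffle(buffer)              # source exhausted: drain what is left
    yield from buffer

source = list(range(10)) * 3         # same items as range(10).repeat(3)
shuffled = list(buffered_shuffle(source, buffer_size=3, seed=42))
print(shuffled)
```

With this model, the item emitted at position `k` always comes from one of the first `k + buffer_size` source items, so a larger buffer is needed for a more thorough shuffle.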


In [0]:
tf.random.set_seed(42)

dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=3, seed=42, reshuffle_each_iteration=True).batch(7).repeat(4)
for item in dataset:
    print(item)

tf.Tensor([1 3 0 4 2 5 6], shape=(7,), dtype=int64)
tf.Tensor([8 7 1 0 3 2 5], shape=(7,), dtype=int64)
tf.Tensor([4 6 9 8 9 7 0], shape=(7,), dtype=int64)
tf.Tensor([3 1 4 5 2 8 7], shape=(7,), dtype=int64)
tf.Tensor([6 9], shape=(2,), dtype=int64)
tf.Tensor([1 3 0 2 6 7 5], shape=(7,), dtype=int64)
tf.Tensor([8 9 4 0 2 1 3], shape=(7,), dtype=int64)
tf.Tensor([4 6 7 8 9 1 2], shape=(7,), dtype=int64)
tf.Tensor([0 5 4 6 3 5 9], shape=(7,), dtype=int64)
tf.Tensor([7 8], shape=(2,), dtype=int64)
tf.Tensor([0 1 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([2 8 1 2 0 9 3], shape=(7,), dtype=int64)
tf.Tensor([6 5 7 4 0 9 2], shape=(7,), dtype=int64)
tf.Tensor([1 3 5 8 4 7 8], shape=(7,), dtype=int64)
tf.Tensor([9 6], shape=(2,), dtype=int64)
tf.Tensor([2 3 1 0 5 6 8], shape=(7,), dtype=int64)
tf.Tensor([7 9 0 4 3 4 5], shape=(7,), dtype=int64)
tf.Tensor([1 2 8 7 6 9 0], shape=(7,), dtype=int64)
tf.Tensor([2 1 4 5 3 7 6], shape=(7,), dtype=int64)
tf.Tensor([9 8], shape=(2,), dtype=int64)


In [0]:
tf.random.set_seed(42)

dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=3, seed=42, reshuffle_each_iteration=False).batch(7).repeat(4)
for item in dataset:
    print(item)

tf.Tensor([0 2 3 5 6 4 8], shape=(7,), dtype=int64)
tf.Tensor([9 0 1 1 7 3 2], shape=(7,), dtype=int64)
tf.Tensor([4 5 7 8 9 0 6], shape=(7,), dtype=int64)
tf.Tensor([1 2 4 6 7 3 9], shape=(7,), dtype=int64)
tf.Tensor([8 5], shape=(2,), dtype=int64)
tf.Tensor([0 2 3 5 6 4 8], shape=(7,), dtype=int64)
tf.Tensor([9 0 1 1 7 3 2], shape=(7,), dtype=int64)
tf.Tensor([4 5 7 8 9 0 6], shape=(7,), dtype=int64)
tf.Tensor([1 2 4 6 7 3 9], shape=(7,), dtype=int64)
tf.Tensor([8 5], shape=(2,), dtype=int64)
tf.Tensor([0 2 3 5 6 4 8], shape=(7,), dtype=int64)
tf.Tensor([9 0 1 1 7 3 2], shape=(7,), dtype=int64)
tf.Tensor([4 5 7 8 9 0 6], shape=(7,), dtype=int64)
tf.Tensor([1 2 4 6 7 3 9], shape=(7,), dtype=int64)
tf.Tensor([8 5], shape=(2,), dtype=int64)
tf.Tensor([0 2 3 5 6 4 8], shape=(7,), dtype=int64)
tf.Tensor([9 0 1 1 7 3 2], shape=(7,), dtype=int64)
tf.Tensor([4 5 7 8 9 0 6], shape=(7,), dtype=int64)
tf.Tensor([1 2 4 6 7 3 9], shape=(7,), dtype=int64)
tf.Tensor([8 5], shape=(2,), dtype=int64)


# Split the California dataset to multiple CSV files

## Fetching input

UUID - #S5C5

Let's start by loading and preparing the California housing dataset. We first load it, then split it into a training set, a validation set and a test set, and finally we scale it:

In [0]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /root/scikit_learn_data
  


For a very large dataset that does not fit in memory, you will typically want to split it into many files first, then have TensorFlow read these files in parallel. To demonstrate this, let's start by splitting the housing dataset and saving it to multiple CSV files (20 for the training set, 10 each for the validation and test sets):

In [0]:
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths
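
The function relies on `np.array_split()` to divide the row indices into `n_parts` nearly equal chunks. A minimal sketch with toy sizes (11 rows into 3 parts, chosen for illustration) shows that it accepts a row count that is not a multiple of the number of parts:

```python
import numpy as np

m, n_parts = 11, 3    # toy values, not the housing dataset's sizes
chunks = np.array_split(np.arange(m), n_parts)

# Each chunk holds the row indices that would go into one CSV file
for file_idx, row_indices in enumerate(chunks):
    print(file_idx, row_indices)

chunk_sizes = [len(c) for c in chunks]
```

The first chunks receive one extra row when the split is uneven, so every row ends up in exactly one file.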

In [0]:
# np.c_ concatenates arrays along the second axis (here: features and target side by side)
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

Okay, now let's take a peek at the first few lines of one of these CSV files:

In [0]:
with open(train_filepaths[0]) as f:
    for i in range(5):
        print(f.readline(), end="")

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621


In [0]:
train_filepaths

['datasets/housing/my_train_00.csv',
 'datasets/housing/my_train_01.csv',
 'datasets/housing/my_train_02.csv',
 'datasets/housing/my_train_03.csv',
 'datasets/housing/my_train_04.csv',
 'datasets/housing/my_train_05.csv',
 'datasets/housing/my_train_06.csv',
 'datasets/housing/my_train_07.csv',
 'datasets/housing/my_train_08.csv',
 'datasets/housing/my_train_09.csv',
 'datasets/housing/my_train_10.csv',
 'datasets/housing/my_train_11.csv',
 'datasets/housing/my_train_12.csv',
 'datasets/housing/my_train_13.csv',
 'datasets/housing/my_train_14.csv',
 'datasets/housing/my_train_15.csv',
 'datasets/housing/my_train_16.csv',
 'datasets/housing/my_train_17.csv',
 'datasets/housing/my_train_18.csv',
 'datasets/housing/my_train_19.csv']

In [0]:
valid_filepaths

['datasets/housing/my_valid_00.csv',
 'datasets/housing/my_valid_01.csv',
 'datasets/housing/my_valid_02.csv',
 'datasets/housing/my_valid_03.csv',
 'datasets/housing/my_valid_04.csv',
 'datasets/housing/my_valid_05.csv',
 'datasets/housing/my_valid_06.csv',
 'datasets/housing/my_valid_07.csv',
 'datasets/housing/my_valid_08.csv',
 'datasets/housing/my_valid_09.csv']

In [0]:
test_filepaths

['datasets/housing/my_test_00.csv',
 'datasets/housing/my_test_01.csv',
 'datasets/housing/my_test_02.csv',
 'datasets/housing/my_test_03.csv',
 'datasets/housing/my_test_04.csv',
 'datasets/housing/my_test_05.csv',
 'datasets/housing/my_test_06.csv',
 'datasets/housing/my_test_07.csv',
 'datasets/housing/my_test_08.csv',
 'datasets/housing/my_test_09.csv']

## Building an Input Pipeline

UUID - #S5C6


In [0]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

In [0]:
for filepath in filepath_dataset:
    print(filepath)

tf.Tensor(b'datasets/housing/my_train_15.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_08.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_03.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_01.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_10.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_05.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_19.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_16.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_02.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_09.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_00.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_07.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_12.csv', shape=(), dtype=string)
tf.Tensor(b'datasets/housing/my_train_04.csv', shape=(), dtype=string)
tf.Ten

Use the **`interleave()`** method to read lines from `n_readers` files at a time, skipping the header row of each file:

In [0]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers)

In [0]:
for line in dataset.take(5):
    print(line.numpy())

b'4.6477,38.0,5.03728813559322,0.911864406779661,745.0,2.5254237288135593,32.64,-117.07,1.504'
b'8.72,44.0,6.163179916317992,1.0460251046025104,668.0,2.794979079497908,34.2,-118.18,4.159'
b'3.8456,35.0,5.461346633416459,0.9576059850374065,1154.0,2.8778054862842892,37.96,-122.05,1.598'
b'3.3456,37.0,4.514084507042254,0.9084507042253521,458.0,3.2253521126760565,36.67,-121.7,2.526'
b'3.6875,44.0,4.524475524475524,0.993006993006993,457.0,3.195804195804196,34.04,-118.15,1.625'
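
Conceptually, `interleave()` with `cycle_length=n_readers` keeps that many files open at once and takes one line from each in turn, opening a new file when one runs out. The pure-Python sketch below models this round-robin behaviour on in-memory lists. The function `round_robin_interleave` and the toy "files" are illustrative assumptions, and this is a simplified model rather than how `tf.data` is implemented.

```python
from itertools import islice

def round_robin_interleave(sources, cycle_length):
    """Yield one item at a time from up to `cycle_length` sources in turn."""
    pending = iter(sources)                      # sources not opened yet
    active = [iter(src) for src in islice(pending, cycle_length)]
    while active:
        still_active = []
        for it in active:
            try:
                yield next(it)
                still_active.append(it)
            except StopIteration:
                # This source is exhausted: open the next pending one, if any
                replacement = next(pending, None)
                if replacement is not None:
                    still_active.append(iter(replacement))
        active = still_active

# Toy "files": the first entry of each list plays the role of the CSV header
files = [["header", "a1", "a2", "a3"],
         ["header", "b1", "b2"],
         ["header", "c1", "c2", "c3"]]

# f[1:] skips the header, like .skip(1) in the cell above
lines = list(round_robin_interleave((f[1:] for f in files), cycle_length=2))
print(lines)
```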


## Preprocessing Inputs

UUID - #S5C7

What is a byte string? Let's check with the code below. Notice that the fourth field is parsed as a string.

In [0]:
# One default value per field; each default also determines the field's type
record_defaults=[0, np.nan, tf.constant(np.nan, dtype=tf.float64), "Hello", tf.constant([])]

# Parse a fake record '1,2,3,4,5' using these record defaults
parsed_fields = tf.io.decode_csv('1,2,3,4,5', record_defaults)

# Notice that every tensor holds a number, except the fourth, whose value has a b prefix.
# That is because its default, "Hello", is a string
parsed_fields

[<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=float32, numpy=2.0>,
 <tf.Tensor: shape=(), dtype=float64, numpy=3.0>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'4'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]
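
The `b` prefix marks a Python `bytes` object: TensorFlow string tensors hold raw bytes rather than Unicode text. A short plain-Python sketch (no TensorFlow involved; the variable names are illustrative) shows how such a value relates to an ordinary string:

```python
raw = b'4'                    # the kind of value a string tensor's .numpy() returns
text = raw.decode("utf-8")    # bytes -> str
number = float(text)          # parse the text if a number is needed

print(type(raw).__name__, type(text).__name__, number)
```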

Notice that all missing fields are replaced with their default value, when provided:

In [0]:
parsed_fields = tf.io.decode_csv(',,,,5', record_defaults)
parsed_fields

[<tf.Tensor: shape=(), dtype=int32, numpy=0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=nan>,
 <tf.Tensor: shape=(), dtype=float64, numpy=nan>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'Hello'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]

The 5th field is compulsory (since we provided `tf.constant([])` as the "default value"), so we get an exception if we do not provide it:

In [0]:
try:
    parsed_fields = tf.io.decode_csv(',,,,', record_defaults)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

Field 4 is required but missing in record 0! [Op:DecodeCSV]


The number of fields should match exactly the number of fields in the `record_defaults`:

In [0]:
try:
    parsed_fields = tf.io.decode_csv('1,2,3,4,5,6,7', record_defaults)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]


### Preprocessing function

UUID - #S5C8

In [0]:
n_inputs = 8 # X_train.shape[-1]

@tf.function
def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y

In [0]:
preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')

(<tf.Tensor: shape=(8,), dtype=float32, numpy=
 array([ 0.16579157,  1.216324  , -0.05204565, -0.39215982, -0.5277444 ,
        -0.2633488 ,  0.8543046 , -1.3072058 ], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.782], dtype=float32)>)
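
The last line of `preprocess()` standardizes each feature using the training-set mean and standard deviation computed by `StandardScaler` earlier. As a minimal NumPy sketch on toy data (the array `X_toy` and its values are made up for illustration), subtracting the column means and dividing by the column standard deviations yields features with zero mean and unit standard deviation:

```python
import numpy as np

X_toy = np.array([[1., 10.], [2., 20.], [3., 60.]])

toy_mean = X_toy.mean(axis=0)
toy_std = X_toy.std(axis=0)   # population std (ddof=0), which StandardScaler also uses

X_scaled = (X_toy - toy_mean) / toy_std

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```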

## EXERCISES IN CLASS: Building an entire ETL function in TF

UUID - #S5E1

Let's try to build an entire pipeline ourselves! What will it look like?

1. List the filepaths and repeat if necessary
2. Interleave the lines read from those filepaths
3. Shuffle the data with a given buffer size
4. Call a preprocessing function (perhaps the one that we just built?) and use `map()` to transform the dataset accordingly
5. Create and return batches

## EXERCISE IN CLASS: Use the ETL pipeline with Keras

UUID - #S5E2

Just a hint: `input_shape=X_train.shape[1:]`

But you need to tell me why :) 

Build a network that predicts house prices using the ETL you just created. How does it perform? Try to visualize the network in TensorBoard.

# Bonus: extra Dataset functions

Here is a short description of each method in the `Dataset` class:

In [0]:
for m in dir(tf.data.Dataset):
    if not (m.startswith("_") or m.endswith("_")):
        func = getattr(tf.data.Dataset, m)
        if hasattr(func, "__doc__"):
            print("● {:21s}{}".format(m + "()", func.__doc__.split("\n")[0]))

● apply()              Applies a transformation function to this dataset.
● as_numpy_iterator()  Returns an iterator which converts all elements of the dataset to numpy.
● batch()              Combines consecutive elements of this dataset into batches.
● cache()              Caches the elements in this dataset.
● concatenate()        Creates a `Dataset` by concatenating the given dataset with this dataset.
● element_spec()       The type specification of an element of this dataset.
● enumerate()          Enumerates the elements of this dataset.
● filter()             Filters this dataset according to `predicate`.
● flat_map()           Maps `map_func` across this dataset and flattens the result.
● from_generator()     Creates a `Dataset` whose elements are generated by `generator`.
● from_tensor_slices() Creates a `Dataset` whose elements are slices of the given tensors.
● from_tensors()       Creates a `Dataset` with a single element, comprising the given tensors.
● interleave()      