<a href="https://colab.research.google.com/github/michelucci/TF20-Notes/blob/master/TF_2_0_Notes_Working_with_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

tf.keras.backend.clear_session()  # For easy reset of notebook state

In [0]:
from tensorflow import keras


# Use of ```tf.data.Dataset```

Reference

https://www.tensorflow.org/beta/guide/data

The ```tf.data``` API introduces two new abstractions to TensorFlow:

- A ```tf.data.Dataset``` represents a sequence of elements, in which each element contains one or more Tensor objects. For example, in an image pipeline, an element might be a single training example, with a pair of tensors representing the image data and a label. There are two distinct ways to create a dataset:

  - Creating a __source__ (e.g. ```Dataset.from_tensor_slices()```) constructs a dataset from one or more ```tf.Tensor``` objects.

  - Applying a transformation (e.g. ```Dataset.batch()```) constructs a dataset from one or more ```tf.data.Dataset``` objects.

- A ```tf.data.Iterator``` provides the main way to extract elements from a dataset. In TensorFlow 1.x graph mode, the operation returned by ```Iterator.get_next()``` yields the next element of a Dataset when executed, and typically acts as the interface between input pipeline code and your model. The simplest iterator is a "one-shot iterator", which is associated with a particular Dataset and iterates through it once. For more sophisticated uses, the ```Iterator.initializer``` operation enables you to reinitialize and parameterize an iterator with different datasets, so that you can, for example, iterate over training and validation data multiple times in the same program. In TensorFlow 2.x with eager execution, a ```Dataset``` is a Python iterable, so a ```for``` loop or ```iter(dataset)``` replaces these explicit iterator operations.
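
As a rough sketch of the "iterate over training and validation data multiple times" case, assuming TensorFlow 2.x with eager execution: each ```for``` loop over a dataset starts a fresh pass, so no explicit initializer is needed. The toy data and the names ```train_ds```, ```val_ds``` and ```history``` are made up for illustration.

```python
import tensorflow as tf

# Hypothetical toy data, only for illustration.
train_ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4]).batch(2)
val_ds = tf.data.Dataset.from_tensor_slices([10, 20]).batch(2)

history = []
for epoch in range(2):
    # Each `for` loop (here inside a comprehension) creates a new iterator,
    # so the dataset is read again from its first element.
    train_batches = [batch.numpy().tolist() for batch in train_ds]
    val_batches = [batch.numpy().tolist() for batch in val_ds]
    history.append((train_batches, val_batches))
    print(epoch, train_batches, val_batches)
```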

To work with a ```Dataset```, follow these steps:

1. First define a source, for example from data you already have in memory, with ```tf.data.Dataset.from_tensors()``` or ```tf.data.Dataset.from_tensor_slices()```.

2. Now you have a ```Dataset``` object. You can transform it into another ```Dataset``` object by chaining methods on it, for example ```Dataset.map()``` (to apply a function to each element) or a multi-element transformation such as ```Dataset.batch()```.

3. Then you create an iterator that provides access to one element of the Dataset at a time. In TensorFlow 2.x, use a Python ```for``` loop or ```iter(dataset)```; ```Dataset.make_one_shot_iterator()``` belongs to the TensorFlow 1.x API.
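
The three steps can be sketched as follows. This is a minimal illustration, assuming TensorFlow 2.x eager execution, where Python's ```iter()``` and ```next()``` play the role of the iterator; the variable names and the doubling function are arbitrary choices, not part of the original notes.

```python
import tensorflow as tf

# Step 1: define a source from in-memory data.
# from_tensor_slices() splits along the first axis: six scalar elements.
source = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
# from_tensors() keeps the input whole: a single element of shape (3,).
whole = tf.data.Dataset.from_tensors([1, 2, 3])

# Step 2: chain transformations; each call returns a new Dataset.
pipeline = source.map(lambda x: x * 2).batch(4)

# Step 3: pull elements out one at a time.
iterator = iter(pipeline)
first = next(iterator).numpy()
second = next(iterator).numpy()
print(first, second)
print(len(list(whole)))
```

The last batch is shorter than the others because six elements do not divide evenly into batches of four.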

In [0]:
dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])

In [0]:
for elem in dataset:
  print(elem)
  print(elem.numpy())

tf.Tensor(8, shape=(), dtype=int32)
8
tf.Tensor(3, shape=(), dtype=int32)
3
tf.Tensor(0, shape=(), dtype=int32)
0
tf.Tensor(8, shape=(), dtype=int32)
8
tf.Tensor(2, shape=(), dtype=int32)
2
tf.Tensor(1, shape=(), dtype=int32)
1


# Creating an iterator

In [0]:
it = iter(dataset)

print(next(it).numpy())

8
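
An iterator created this way can be consumed only once. The sketch below, which assumes TensorFlow 2.x eager execution and uses a shortened copy of the list above, drains an iterator with the two-argument form of Python's built-in ```next()``` so that exhaustion returns ```None``` instead of raising ```StopIteration```.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0])
it = iter(dataset)

values = []
while True:
    # The second argument is returned once the iterator is exhausted.
    elem = next(it, None)
    if elem is None:
        break
    values.append(int(elem.numpy()))

print(values)
```

Calling ```iter(dataset)``` again yields a new iterator that starts from the first element.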


# Dataset structure

In [0]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))

dataset1.element_spec

TensorSpec(shape=(10,), dtype=tf.float32, name=None)

In [0]:
dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2.element_spec

(TensorSpec(shape=(), dtype=tf.float32, name=None),
 TensorSpec(shape=(100,), dtype=tf.int32, name=None))

In [0]:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3.element_spec

(TensorSpec(shape=(10,), dtype=tf.float32, name=None),
 (TensorSpec(shape=(), dtype=tf.float32, name=None),
  TensorSpec(shape=(100,), dtype=tf.int32, name=None)))
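
The examples above use single tensors and tuples. As a small additional sketch, ```element_spec``` also describes dictionary-structured elements; the keys ```"image"``` and ```"label"``` and the zero-filled data are hypothetical, chosen only to illustrate the structure.

```python
import tensorflow as tf

# Hypothetical feature dictionary: 4 examples, each with a 10-value
# "image" vector and an integer "label".
features = tf.data.Dataset.from_tensor_slices(
    {"image": tf.zeros([4, 10]), "label": tf.constant([0, 1, 0, 1])})

spec = features.element_spec
print(spec)

# Each element is a dict with the same keys as the source.
for elem in features.take(1):
    print(elem["image"].shape, elem["label"].numpy())
```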

# Other examples

In [0]:
dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))

dataset1

<TensorSliceDataset shapes: (10,), types: tf.int32>

In [0]:
for z in dataset1:
  print(z.numpy())

[9 9 8 9 2 6 9 8 6 1]
[2 7 7 2 2 5 4 2 9 8]
[4 8 4 8 9 7 2 2 4 3]
[1 3 2 1 2 9 7 4 2 4]


In [0]:
dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

In [0]:
for a, (b,c) in dataset3:
  print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))

shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)


# Fashion MNIST with ```tf.data.Dataset```

In [0]:
train, test = tf.keras.datasets.fashion_mnist.load_data()

In [0]:
images, labels = train
# Scale pixel values from [0, 255] to [0, 1]
images = images/255

images.shape

(60000, 28, 28)

In [0]:
print(labels[:10])

[9 0 0 3 0 2 7 2 5 5]


In [0]:

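# Flatten each 28x28 image into a 784-element vector
images = images.reshape((60000, 784))
print(images.shape)

# Pair each image with its label, then shuffle and batch for training
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
train_dataset = dataset.shuffle(buffer_size=1024).batch(64)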

(60000, 784)


In [0]:
# Small fully connected classifier: 784 inputs, 10 output classes
model = keras.models.Sequential()
model.add(keras.layers.Dense(10, input_shape=(784,), activation="relu"))
model.add(keras.layers.Dense(10, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))

# Labels are integer class ids, hence the sparse categorical loss
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(), optimizer='adam', metrics=['accuracy'])

In [0]:
train_dataset

<BatchDataset shapes: ((None, 784), (None,)), types: (tf.float64, tf.uint8)>

In [0]:
model.fit(train_dataset, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f5ba71c92b0>
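
The ```test``` split loaded earlier is not used in this section. One possible way to evaluate on a held-out set is to batch it, without shuffling, and pass it to ```model.evaluate()```. The sketch below is an assumption about how that could look: it uses small random arrays as a stand-in for Fashion MNIST so that it runs without a download, and a smaller model than the one above.

```python
import numpy as np
import tensorflow as tf

# Stand-in data (random), NOT Fashion MNIST: 256 flattened 28x28 "images".
rng = np.random.default_rng(0)
x = rng.random((256, 784)).astype("float32")
y = rng.integers(0, 10, size=256).astype("int64")

# Training data is shuffled and batched.
train_ds = tf.data.Dataset.from_tensor_slices((x[:192], y[:192])).shuffle(192).batch(64)
# Evaluation data needs batching but no shuffling.
test_ds = tf.data.Dataset.from_tensor_slices((x[192:], y[192:])).batch(64)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])

model.fit(train_ds, epochs=1, verbose=0)
# With one metric configured, evaluate() returns [loss, accuracy].
loss, accuracy = model.evaluate(test_ds, verbose=0)
print(loss, accuracy)
```

On random data the accuracy is not meaningful; the point is only the shape of the pipeline.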

# Python generators

In [0]:
def count(stop):
  i = 0
  while i < stop:
    yield i
    i += 1

In [0]:
for n in count(5):
  print(n)

0
1
2
3
4


In [0]:
ds_counter = tf.data.Dataset.from_generator(
    count, args=[25], output_types=tf.int32, output_shapes=())

In [0]:
for count_batch in ds_counter.repeat().batch(10).take(10):
  print(count_batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
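
Newer TensorFlow 2.x releases also accept an ```output_signature``` argument in place of ```output_types``` and ```output_shapes```. The sketch below assumes such a release is installed; it repeats the ```count``` generator so that it is self-contained, and uses a shorter count and smaller batches than the cell above.

```python
import tensorflow as tf

def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

# One TensorSpec describes both the shape and the dtype of each element.
ds_counter = tf.data.Dataset.from_generator(
    count,
    args=[5],
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

# repeat() restarts the generator once it is exhausted,
# so batches can span the boundary between two passes.
batches = [b.numpy().tolist() for b in ds_counter.repeat().batch(4).take(3)]
print(batches)
```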


# Example: flower photos

In [0]:
flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz


In [0]:
img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)

In [0]:
images, labels = next(img_gen.flow_from_directory(flowers))

Found 3670 images belonging to 5 classes.


In [0]:
print(images.dtype, images.shape)
print(labels.dtype, labels.shape)

float32 (32, 256, 256, 3)
float32 (32, 5)


In [0]:
ds = tf.data.Dataset.from_generator(
    img_gen.flow_from_directory, args=[flowers], 
    output_types=(tf.float32, tf.float32), 
    output_shapes = ([32,256,256,3],[32,5])
)

ds

<DatasetV1Adapter shapes: ((32, 256, 256, 3), (32, 5)), types: (tf.float32, tf.float32)>