In [4]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pathlib

In [13]:
# This should point to the small dataset of the Kaggle Dogs vs Cats competition that was created in a previous notebook
data_folder = pathlib.Path('../data/kaggle_dogs_vs_cats_small')

### Loading Image Data into a `Dataset` class for Training

The TensorFlow `Dataset` class (`tf.data.Dataset`) is TF's default way to load data. From its [docs](https://www.tensorflow.org/api_docs/python/tf/data/Dataset):

> The tf.data.Dataset API supports writing descriptive and efficient input pipelines. Dataset usage follows a common pattern:
>1. Create a source dataset from your input data.
>2. Apply dataset transformations to preprocess the data.
>3. Iterate over the dataset and process the elements.

>Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.

To read more about the `Dataset` class, see TensorFlow's [Data Guide](https://www.tensorflow.org/guide/data) and [Data Performance Guide](https://www.tensorflow.org/guide/data_performance).
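The three-step pattern quoted above can be sketched in a few lines. This is only a minimal illustration under assumed inputs: the array `features`, the scaling factor, and the batch size of 4 are made up for the example and are not part of the notebook's data.

```python
import numpy as np
import tensorflow as tf

# 1. Create a source dataset from in-memory data
#    (hypothetical toy data: 10 samples with 4 features each).
features = np.arange(40, dtype=np.float32).reshape(10, 4)
source = tf.data.Dataset.from_tensor_slices(features)

# 2. Apply transformations: scale every sample, then group samples into batches.
pipeline = source.map(lambda x: x / 10.0).batch(4)

# 3. Iterate over the dataset; elements are produced one batch at a time.
batch_shapes = [tuple(batch.shape) for batch in pipeline]
print(batch_shapes)
```

Each transformation returns a new `Dataset`, so the steps can be chained as shown.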

Keras offers several utility functions to load data into TF's `Dataset` class. **See the Keras [docs](https://keras.io/api/data_loading/) for a list of all the data types that it can load**. One such utility function is `image_dataset_from_directory`, which we use here. Note that there are additional ways to load image data (see this [tutorial](https://www.tensorflow.org/tutorials/load_data/images)).

### The `Dataset` class
Before we go into loading images, let's take a look at the `Dataset` class. It behaves like a Python iterable: each iteration step yields data for either training or evaluation. An element can be a single data sample or a batch. TensorFlow optimizes the performance of the data loading, and during training each epoch makes one full pass over the dataset.

Here we define some data as a NumPy array:

In [14]:
random_numbers = np.random.normal(size=(1000, 16))

In [19]:
print(type(random_numbers))
print(random_numbers.shape)
print(random_numbers.dtype)
print(random_numbers[:4])

<class 'numpy.ndarray'>
(1000, 16)
float64
[[ 0.28163258 -0.46810717  0.12479456  1.27036153  0.99383731  1.20054864
  -0.00316159 -0.30338952  0.89201479 -0.95646887  0.0287871  -0.2557161
  -0.73116838  0.40990051 -0.14357482  0.03638614]
 [ 0.31092888 -0.5429373  -0.52020321  1.15555237  0.66089029  1.24451936
   0.1914797   0.40182437 -0.05717869 -1.10696724  0.92676361  0.79312175
  -0.94112144 -0.61725669 -0.82011052 -0.25030754]
 [-0.68873302 -1.49687682 -0.4459783   2.54131492 -1.53784571 -0.32989175
   0.11651781 -0.32314156 -1.00848958 -0.95996169  0.66446594 -0.71444421
   1.8433425  -1.20494051  0.37084589  0.31745678]
 [ 0.54387202  1.95881492  0.14069934 -0.14561345 -0.75947224 -1.35716761
   0.23288172  1.80303935  0.4051701   1.16463857  1.04769526 -0.17474879
  -0.78197064  0.43326643  1.15578448  0.37671455]]


Next, we create a `Dataset` instance using the `from_tensor_slices` method.

The tensor is sliced along its first dimension, so each element of the dataset corresponds to one index along that dimension (see the [docs](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices)).

In [20]:
dataset = tf.data.Dataset.from_tensor_slices(random_numbers)

In [21]:
type(dataset)

tensorflow.python.data.ops.from_tensor_slices_op._TensorSliceDataset

In [22]:
for i, element in enumerate(dataset):
    print(element.shape)
    if i >= 2:
        break

(16,)
(16,)
(16,)


In [23]:
for i, element in enumerate(dataset):
    print(element)
    if i >= 2:
        break

tf.Tensor(
[ 0.28163258 -0.46810717  0.12479456  1.27036153  0.99383731  1.20054864
 -0.00316159 -0.30338952  0.89201479 -0.95646887  0.0287871  -0.2557161
 -0.73116838  0.40990051 -0.14357482  0.03638614], shape=(16,), dtype=float64)
tf.Tensor(
[ 0.31092888 -0.5429373  -0.52020321  1.15555237  0.66089029  1.24451936
  0.1914797   0.40182437 -0.05717869 -1.10696724  0.92676361  0.79312175
 -0.94112144 -0.61725669 -0.82011052 -0.25030754], shape=(16,), dtype=float64)
tf.Tensor(
[-0.68873302 -1.49687682 -0.4459783   2.54131492 -1.53784571 -0.32989175
  0.11651781 -0.32314156 -1.00848958 -0.95996169  0.66446594 -0.71444421
  1.8433425  -1.20494051  0.37084589  0.31745678], shape=(16,), dtype=float64)
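Because `from_tensor_slices` slices along the first dimension, it also accepts a tuple of arrays that share that dimension, which is a common way to pair samples with labels. The sketch below assumes a hypothetical `labels` array of random binary values; it is not part of the data used in this notebook.

```python
import numpy as np
import tensorflow as tf

features = np.random.normal(size=(1000, 16))
# Hypothetical binary labels, one per sample (an assumption for illustration).
labels = np.random.randint(0, 2, size=(1000,))

# Both arrays are sliced along their first dimension in lockstep,
# so every element is a (sample, label) pair.
labeled_dataset = tf.data.Dataset.from_tensor_slices((features, labels))

for sample, label in labeled_dataset.take(3):
    print(sample.shape, label.shape)
```

Here each sample is a vector of 16 values and each label is a scalar tensor.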


Using this `Dataset`, we can define a `BatchDataset` instance:

In [24]:
batched_dataset = dataset.batch(32)
for i, element in enumerate(batched_dataset):
    print(element.shape)
    if i >= 2:
        break

(32, 16)
(32, 16)
(32, 16)


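Here each batch shown has 32 samples, each with 16 elements (floats). Because 1000 is not a multiple of 32, the final batch of the dataset holds only the remaining 8 samples.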

In [26]:
type(batched_dataset)

tensorflow.python.data.ops.batch_op._BatchDataset
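When the number of samples is not a multiple of the batch size, the last batch is smaller than the rest. The following sketch, which rebuilds a dataset of the same shape as above from fresh random numbers, compares the default behaviour with the `drop_remainder=True` argument of `batch`:

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.random.normal(size=(1000, 16)))

# Default behaviour: the final batch holds whatever samples are left over.
sizes = [int(batch.shape[0]) for batch in dataset.batch(32)]
print(len(sizes), sizes[-1])

# With drop_remainder=True the partial final batch is discarded,
# so every batch has exactly 32 samples.
full_sizes = [int(batch.shape[0]) for batch in dataset.batch(32, drop_remainder=True)]
print(len(full_sizes), full_sizes[-1])
```

Dropping the remainder can be useful when a model requires a fixed batch size, at the cost of skipping a few samples per pass.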

### Using Keras Utility Functions to Create a `Dataset` for Images
Keras offers a utility function, `image_dataset_from_directory`, that loads images into a `Dataset` instance.

**To the Student**: According to the [docs](https://keras.io/api/data_loading/image/) of `image_dataset_from_directory`:
* What folder structure does `image_dataset_from_directory` expect? 
* What is the output of `image_dataset_from_directory`?
* What is the function of the `image_size` argument?

In [30]:
from tensorflow.keras.utils import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    data_folder / "train",
    image_size=(180, 180),
    batch_size=32)
validation_dataset = image_dataset_from_directory(
    data_folder / "validation",
    image_size=(180, 180),
    batch_size=32)
test_dataset = image_dataset_from_directory(
    data_folder / "test",
    image_size=(180, 180),
    batch_size=32)

Found 2000 files belonging to 2 classes.
Found 1000 files belonging to 2 classes.
Found 2000 files belonging to 2 classes.


In [31]:
type(train_dataset)

tensorflow.python.data.ops.batch_op._BatchDataset

Displaying the shapes of the data and labels yielded by the `Dataset`:

In [32]:
for data_batch, labels_batch in train_dataset:
    print("data batch shape:", data_batch.shape)
    print("labels batch shape:", labels_batch.shape)
    break

data batch shape: (32, 180, 180, 3)
labels batch shape: (32,)
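The same inspection can be done without a `break` statement. The sketch below is an assumption-laden stand-in: instead of reading the Kaggle images from disk, it builds a hypothetical `fake_dataset` from random arrays with the same per-sample shape, so that it runs without the data folder. It uses `element_spec` to read the element structure without iterating, and `take(1)` to fetch only the first batch.

```python
import numpy as np
import tensorflow as tf

# Stand-in data (an assumption): 64 random "images" and binary labels
# in place of the image files that are read from disk in the notebook.
fake_images = np.random.uniform(0, 255, size=(64, 180, 180, 3)).astype(np.float32)
fake_labels = np.random.randint(0, 2, size=(64,)).astype(np.int32)
fake_dataset = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels)).batch(32)

# element_spec describes each element without iterating. The batch axis
# is None because the final batch of a dataset may be smaller.
image_spec, label_spec = fake_dataset.element_spec
print(image_spec.shape, label_spec.shape)

# take(1) yields only the first batch, so no break statement is needed.
for data_batch, labels_batch in fake_dataset.take(1):
    first_shapes = (tuple(data_batch.shape), tuple(labels_batch.shape))
print(first_shapes)
```

The same two calls should work on `train_dataset`, since it is also a batched `Dataset`.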
