# Load images with `tf.data`

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/load_data/images"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/load_data/images.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/load_data/images.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Setup

Start by [installing TensorFlow](https://www.tensorflow.org/install/).

Then test the installation:

In [0]:
import tensorflow as tf
tf.enable_eager_execution()
tf.VERSION

## Retrieve the images

Before you start any training, you'll need a set of images to teach the network about the new classes you want to recognize. We've created an archive of Creative Commons-licensed flower photos to use initially.

In [0]:
import pathlib
data_root = tf.keras.utils.get_file('flower_photos','https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz', untar=True)
data_root = pathlib.Path(data_root)

After downloading 218 MB, you should have a copy of the flower photos available in your working directory.

In [0]:
for item in data_root.iterdir():
  print(item)

## Inspect the images
Now let's have a quick look at a couple of the images, so we know what we're dealing with:

In [0]:
attributions = (data_root/"LICENSE.txt").read_text().splitlines()[4:]
attributions = [line.split(' CC-BY') for line in attributions]
attributions = dict(attributions)

In [0]:
import IPython.display as display

def show_image(image_path):
    display.display(display.Image(str(image_path)))
    
    image_rel = pathlib.Path(image_path).relative_to(data_root)
    caption = "Image (CC BY 2.0) " + ' - '.join(attributions[str(image_rel)].split(' - ')[:-1])
    display.display(display.HTML("<div>%s</div>" % caption))

In [0]:
import random
all_image_paths = list(data_root.glob('*/*'))
random.shuffle(all_image_paths)

show_image(random.choice(all_image_paths))
show_image(random.choice(all_image_paths))
show_image(random.choice(all_image_paths))

## Determine the label for each image

List the available labels:

In [0]:
label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
label_names

Assign an index to each label:

In [0]:
label_to_index = dict((name, index) for index,name in enumerate(label_names))
label_to_index

Create a list of the label index for every file:

In [0]:
all_image_labels = [label_to_index[path.parent.name] for path in all_image_paths]
all_image_labels[:10]
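
The label of an image depends only on the name of its parent directory. As a minimal sketch of that logic, the following uses a few hypothetical paths (the file names are made up for illustration and are not taken from the archive), so it runs without touching the disk:

```python
import pathlib

# Hypothetical paths, for illustration only; the real ones come from data_root.glob('*/*').
example_paths = [
    pathlib.Path("flower_photos/roses/a.jpg"),
    pathlib.Path("flower_photos/daisy/b.jpg"),
    pathlib.Path("flower_photos/roses/c.jpg"),
]

# Sorted, de-duplicated directory names become the label names.
example_label_names = sorted({path.parent.name for path in example_paths})
example_label_to_index = {name: index for index, name in enumerate(example_label_names)}

# Each image's label is the index of its parent directory name.
example_labels = [example_label_to_index[path.parent.name] for path in example_paths]

print(example_label_names)  # ['daisy', 'roses']
print(example_labels)       # [1, 0, 1]
```

Because the names are sorted before they are enumerated, the mapping is stable across runs, even though the list of image paths itself is shuffled.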

## A basic `tf.data.Dataset`

The easiest way to build a `tf.data.Dataset` is using the `from_tensor_slices` method.

Slicing the array of strings results in a dataset of strings:

In [0]:
string_paths = [str(path) for path in all_image_paths]

path_ds = tf.data.Dataset.from_tensor_slices(string_paths)

The `output_shapes` and `output_types` fields describe the content of each item in the dataset. In this case, each item is a scalar binary string:

In [0]:
print('shape: ', repr(path_ds.output_shapes))
print('type: ', path_ds.output_types)
print()
print(path_ds)

## A dataset of `(image, label)` pairs

Slicing both the paths and labels together gives a dataset of `(path, label)` pairs.

In [0]:
path_ds = tf.data.Dataset.from_tensor_slices((string_paths, all_image_labels))

In [0]:
for path,cls in path_ds.take(3):
  print(path.numpy(), " : ", label_names[cls.numpy()])
  print()

The `output_shapes` and `output_types` are now tuples of shapes and types as well, describing each field:

In [0]:
print('shape: ', path_ds.output_shapes)
print('type: ', path_ds.output_types)
print()
print(path_ds)
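
As a rough mental model (plain Python, not how `tf.data` is implemented), slicing a tuple of equal-length lists along the first axis behaves like `zip`: element `i` of the dataset is the tuple of the `i`-th entries. The values below are made up for illustration:

```python
# Plain-Python analogy for from_tensor_slices((paths, labels)); the values are made up.
toy_paths = ["roses/a.jpg", "daisy/b.jpg", "tulips/c.jpg"]
toy_labels = [2, 0, 4]

# Element i of the resulting sequence is the pair (toy_paths[i], toy_labels[i]).
toy_pairs = list(zip(toy_paths, toy_labels))

for path, label in toy_pairs:
    print(path, ":", label)
```

The analogy covers only the pairing. Unlike a Python list, a dataset also carries the shape and type information printed above.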

## Load and format the images

Define a simple function to decode and format the raw image data.

In [0]:
def format_image(image):
  # `image` holds the encoded JPEG bytes (a scalar string tensor), not a file path.
  image = tf.image.decode_jpeg(image, channels=3)
  image = tf.image.resize_images(image, [192, 192])
  # Rescale pixel values from [0, 255] to approximately [-1, 1).
  image = (image/128.0) - 1

  return image

In [0]:
def process_image_label(path,label):
  image = tf.read_file(path)
  image = format_image(image)

  label = tf.cast(label, dtype=tf.int64)
  return image, label

Use the `map` method to convert the dataset of `(path,label)` pairs to `(image,label)` pairs.

Like all dataset methods, `map` does not execute the transformation immediately. It only executes as needed.

In [0]:
# many threads will be waiting on disk reads.
image_ds = path_ds.map(process_image_label, num_parallel_calls=16)
image_ds

In [0]:
import matplotlib.pyplot as plt

for image,label in image_ds.take(3):
  plt.figure()
  plt.imshow((image+1)/2)
  plt.title(label_names[label.numpy()])
  plt.grid(False)
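
The rescaling in `format_image` and the inverse used for plotting are simple affine maps. Here is a small sketch of the arithmetic in plain Python; the helper names `to_model_range` and `to_display_range` are hypothetical, introduced only for this illustration:

```python
def to_model_range(pixel):
    """Rescale an 8-bit pixel value with the same formula as format_image."""
    return pixel / 128.0 - 1


def to_display_range(value):
    """Shift a rescaled value back into [0, 1] for plotting."""
    return (value + 1) / 2


for pixel in (0, 128, 255):
    scaled = to_model_range(pixel)
    print(pixel, scaled, to_display_range(scaled))
# 0 -1.0 0.0
# 128 0.0 0.5
# 255 0.9921875 0.99609375
```

The largest pixel value maps to just under 1, so the model sees inputs in roughly `[-1, 1)` and `plt.imshow` receives floats within the `[0, 1]` range it expects.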

## Pipe to a model for training

To train a model with this dataset, you will want the data:

* To be well shuffled.
* To be batched.
* To repeat forever.
* To have batches available as soon as they are needed.

These features can be easily added using the `tf.data` API.

In [0]:
BATCH_SIZE = 32
# Shuffling the paths takes less memory than shuffling the images.
# Setting a buffer size larger than the dataset ensures that the data is completely shuffled.
ds = path_ds.shuffle(buffer_size=10000) 
ds = ds.map(process_image_label, num_parallel_calls=16)
ds = ds.batch(BATCH_SIZE).prefetch(1).repeat()
ds
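
To see why the buffer size matters, consider a simplified model of a shuffle buffer in plain Python. This is a sketch of the documented behaviour, not the actual `tf.data` implementation, and the function name `buffered_shuffle` is hypothetical. The buffer is filled first. After that, each new input element replaces a randomly chosen element, which is emitted:

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order produced by a fixed-size shuffle buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


print(list(buffered_shuffle(range(10), buffer_size=1)))    # order unchanged
print(list(buffered_shuffle(range(10), buffer_size=3)))    # only locally shuffled
print(list(buffered_shuffle(range(10), buffer_size=100)))  # fully shuffled
```

With a buffer of size `k`, the element at output position `j` can only come from the first `j + k` inputs, so a small buffer mixes nearby elements only. When the buffer is at least as large as the dataset, every element is in the buffer before anything is emitted, which gives a uniform shuffle.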

### Quick transfer learning with `tf.keras.applications`

Grab a copy of MobileNetV2 from `tf.keras.applications` and set it to be non-trainable:

In [0]:
mobile_net = tf.keras.applications.MobileNetV2(input_shape=[192, 192, 3], include_top=False)
mobile_net.trainable=False

Drop that into a `tf.keras.Sequential` model.

MobileNet returns a `6x6` spatial grid of features for each image, so use `GlobalAveragePooling2D` to average over the spatial dimensions before the output `Dense` layer:

In [0]:
model = tf.keras.Sequential([
  mobile_net,
  # This MobileNet returns a 6x6 feature map; take the spatial average.
  tf.keras.layers.GlobalAveragePooling2D(),
  # Softmax produces the class probabilities that the loss below expects.
  tf.keras.layers.Dense(len(label_names), activation="softmax")
])
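
The `6x6` grid follows from MobileNetV2's overall downsampling factor of 32 (192 / 32 = 6). Global average pooling then reduces each channel of that grid to a single number. Here is a plain-Python sketch of the operation on a toy `2x2` feature map with made-up values; the function name `global_average_pool` is hypothetical and only mirrors what the Keras layer computes:

```python
def global_average_pool(feature_map):
    """Average a [height][width][channels] nested list over height and width."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for cell in row:
            for channel, value in enumerate(cell):
                totals[channel] += value
    return [total / (height * width) for total in totals]


# Toy 2x2 feature map with 3 channels (made-up values).
toy_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
print(global_average_pool(toy_map))  # [4.0, 1.0, 2.0]
```

The output has one value per channel regardless of the spatial size, which is what lets a single `Dense` layer sit directly on top of the convolutional features.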

Compile the model to describe the training procedure:

In [0]:
model.compile(optimizer=tf.train.AdamOptimizer(), 
              loss=tf.keras.losses.sparse_categorical_crossentropy,
              metrics=["accuracy"])

In [0]:
model.summary()

In [0]:
len(model.trainable_variables) # Dense `weights` and `bias`

Start the model training:

In [0]:
model.fit(ds, epochs=1, steps_per_epoch=3)  # steps_per_epoch=math.ceil(len(all_image_paths)/BATCH_SIZE)
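
Because the dataset repeats forever, `steps_per_epoch` is what defines the length of an epoch. Here is a short sketch of the calculation mentioned in the comment above. The image count is an assumed value for illustration; in the notebook you would use `len(all_image_paths)`:

```python
import math

example_batch_size = 32
example_num_images = 3670  # assumed count, for illustration only

# Round up so that the final, partially filled batch is also counted.
example_steps = math.ceil(example_num_images / example_batch_size)
print(example_steps)  # 115
```

Rounding up rather than down means every image is seen about once per epoch, at the cost of a few repeated images in the last step.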

## Performance
The simple pipeline used above reads each file individually, on each epoch. This is fine for local training on CPU, but may not be sufficient for GPU training, and is totally inappropriate for any sort of distributed training.

Streaming files sequentially can be much more efficient.

To investigate, first build a simple function to check the performance of our datasets:

In [0]:
import time

def timeit(ds, batches=100):
  overall_start = time.time()
  # Fetch a single batch to prime the pipeline (fill the shuffle buffer),
  # before starting the timer
  it = iter(ds.take(batches+1))
  next(it)

  start = time.time()
  for i,(images,labels) in enumerate(it):
    if i%10 == 0:
      print('.',end='')
  print()
  end = time.time()

  duration = end-start
  print("{} batches: {} s".format(batches, duration))
  print("{:0.5f} Images/s".format(BATCH_SIZE*batches/duration))
  print("Total time: {}s".format(end-overall_start))

The performance of our current dataset is:

In [0]:
timeit(ds)

### Cache

Use `tf.data.Dataset.cache` to take advantage of the performance boost of working with in-memory data.

Here the images are cached after being preprocessed (decoded and resized):

In [0]:
ds = path_ds.map(process_image_label, num_parallel_calls=16)
ds = ds.cache()
ds = ds.shuffle(buffer_size=10000) 
ds = ds.batch(BATCH_SIZE).prefetch(1).repeat()
ds

In [0]:
timeit(ds)

In [0]:
timeit(ds)
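
The idea behind caching can be sketched in plain Python. This is a simplified model, not the `tf.data` implementation, and the names `CachedMap` and `expensive` are hypothetical. The first complete pass runs the expensive function and stores the results; later passes replay them:

```python
class CachedMap:
    """Apply `fn` to each item on the first pass; replay stored results afterwards."""

    def __init__(self, items, fn):
        self._items = items
        self._fn = fn
        self._cache = None

    def __iter__(self):
        if self._cache is not None:
            # A finished cache exists: replay it without calling fn again.
            yield from self._cache
            return
        results = []
        for item in self._items:
            value = self._fn(item)
            results.append(value)
            yield value
        # Keep the results only once the first pass has run to completion.
        self._cache = results


calls = []


def expensive(x):
    calls.append(x)  # record every real computation
    return x * x


cached = CachedMap([1, 2, 3], expensive)
first_pass = list(cached)
second_pass = list(cached)
print(first_pass, second_pass, len(calls))  # [1, 4, 9] [1, 4, 9] 3
```

In this sketch the second pass costs no calls to `expensive`. In the pipeline above, the equivalent saving is that file reads, decoding and resizing are skipped.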

If the data doesn't fit in memory, use a cache file. 

The cache file also has the advantage that it can be reused to restart the dataset quickly, without rebuilding the cache:

In [0]:
ds = path_ds.map(process_image_label, num_parallel_calls=16)
ds = ds.cache(filename='./cache.tf-data')
ds = ds.shuffle(buffer_size=10000)
ds = ds.batch(BATCH_SIZE).prefetch(1).repeat()
ds

In [0]:
timeit(ds)

In [0]:
timeit(ds)

### TFRecord File

TFRecord files are a simple format to store a sequence of binary blobs. In this case, it gives most of the performance boost of using the `.cache` method.

In [0]:
images_ds = tf.data.Dataset.from_tensor_slices(string_paths).map(tf.read_file)
tfrec = tf.contrib.data.TFRecordWriter('images.tfrec')
tfrec.write(images_ds)

In [0]:
images_ds = tf.data.TFRecordDataset('images.tfrec').map(format_image, num_parallel_calls=16)
labels_ds = tf.data.Dataset.from_tensor_slices(all_image_labels)

ds = tf.data.Dataset.zip((images_ds, labels_ds))
ds = ds.shuffle(buffer_size=10000).batch(BATCH_SIZE).prefetch(1)
ds

In [0]:
timeit(ds)