##### Copyright 2018 The TensorFlow Authors.

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Using TFRecords and TF Examples

The `Example` structure and the TFRecord format are useful for describing input data in the TensorFlow API. They allow developers to preprocess their data once for multiple purposes and to store it locally.

The [`tf.Example`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto) [protocol buffer](https://developers.google.com/protocol-buffers/) (a protocol buffer is also called a message) is specifically designed for use with TensorFlow, as well as higher-level APIs such as [TFX](https://www.tensorflow.org/tfx/) and [Keras](https://www.tensorflow.org/guide/keras). This notebook will demonstrate how to create, parse, and use the `tf.Example` message, and then store, read, and write this data in the `.tfrecords` format. This tutorial includes an end-to-end example of reading/writing image data as TF Examples in the TFRecord format. 

Note that these structures, while useful, are optional: if feeding your data directly through the [tf.data API](https://www.tensorflow.org/api_docs/python/tf/data) makes more sense, there is no need to convert it to TFRecords first.

In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
tf.enable_eager_execution()

import numpy as np

## Data Types In `tf.Example`

The `tf.Example` type is generic enough to accept a wide range of data types. Although only the following [three types](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L4) of features are compatible with `tf.Example`, most other generic types can be coerced into one of them.

1. [`bytes_list`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L65) (the following types can be coerced)

  - `string`
  - `byte`

1. [`float_list`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L68) (the following types can be coerced)

  - `float` (`float32`)
  - `double` (`float64`)

1. [`int64_list`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L65) (the following types can be coerced)

  - `bool`
  - `enum`
  - `int32`
  - `uint32`
  - `int64`
  - `uint64`
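
The mapping in the list above can be sketched as a small dispatch on Python scalar types. `feature_kind` is a hypothetical helper written for this illustration, not a TensorFlow function, and it only covers built-in Python types (NumPy scalars would need extra handling).

```python
def feature_kind(value):
    """Return the name of the tf.Example list type suited to a Python scalar.

    Hypothetical helper for illustration only; it is not part of TensorFlow.
    """
    # bool is a subclass of int, so this branch also covers True and False.
    if isinstance(value, int):
        return 'int64_list'
    if isinstance(value, float):
        return 'float_list'
    # Text would still have to be encoded to bytes before it is stored.
    if isinstance(value, (bytes, str)):
        return 'bytes_list'
    raise TypeError('Unsupported type: %s' % type(value).__name__)


for sample in [True, 42, 2.718, b'raw bytes']:
    print(repr(sample), '->', feature_kind(sample))
```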

In order to convert a standard type to a `tf.Example`-compatible type, we can use the following functions. Each function takes a single input value and returns one of the 3 `list` types above.

In [0]:
# The following functions can be used to convert a value to a type compatible
# with tf.Example.

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
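
The helpers above wrap a single scalar, but each of the three list types can hold several values. As a minimal sketch, the hypothetical `int64_list_feature` below (a name introduced here, not part of this notebook or of TensorFlow) stores a whole sequence of integers in one feature; the sample shape values are made up.

```python
import tensorflow as tf


def int64_list_feature(values):
    """Hypothetical variant of _int64_feature that accepts an iterable of ints."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))


# Made-up values standing in for an image shape (height, width, depth).
shape_feature = int64_list_feature([213, 320, 3])
print(shape_feature)
```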

Below are some examples of how these functions work. Note the varying input types and the standardized output types. If the input type for a function does not match one of the coercible types stated above, the function will raise an exception (e.g. `_int64_feature(1.0)` will error out, since `1.0` is a float and should be used with the `_float_feature` function instead).

In [0]:
# BytesList expects bytes, so pass a bytes literal or encode text explicitly.
print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

## Creating A `tf.Example` Message

Suppose you want to create a `tf.Example` message from existing data. In practice, the dataset may come from anywhere, but the procedure of creating the `tf.Example` message from a single observation will be the same. 

1. Within each observation, each value needs to be converted to one of the 3 compatible types, using one of the functions above. 

1. We create a map (dictionary) from the feature name string to the encoded feature value produced in #1.

1. The map produced in #2 is converted to a [`Features` message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L85).

In this notebook, we will create a dataset using NumPy. 

This dataset will have 4 features.
- a boolean feature, `False` or `True` with equal probability
- a random bytes feature, uniform across the entire support
- an integer feature uniformly randomly chosen from `[-10000, 10000)`
- a float feature from a standard normal distribution

Consider a sample consisting of 10,000 independently and identically distributed observations from each of the above distributions.

In [0]:
# the number of observations in the dataset
n_observations = int(1e4)

# boolean feature, encoded as False or True
feature0 = np.random.choice([False, True], n_observations)

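# bytes feature: one random byte per observation. A list of 1-byte strings is
# used so that feature1[i] is a bytes object on both Python 2 and Python 3.
feature1 = [np.random.bytes(1) for _ in range(n_observations)]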

# integer feature, random between -10000 and 10000
feature2 = np.random.randint(-10000, 10000, n_observations)

# float feature, from a standard normal distribution
feature3 = np.random.randn(n_observations)

Each of these features can be coerced into a `tf.Example`-compatible type using one of `_bytes_feature`, `_float_feature`, `_int64_feature`. We can then create a `tf.Example` message from these encoded features.

In [0]:
def create_example(features):
  """
  Creates a tf.Example message ready to be written to a file.
  
  Inputs:
    - features: a 4-list of the values in the observation
  """
  
  # Create a dictionary mapping the feature name to the tf.Example-compatible
  # data type.
  
  feature = {
      'feature0': _int64_feature(features[0]),
      'feature1': _bytes_feature(features[1]),
      'feature2': _int64_feature(features[2]),
      'feature3': _float_feature(features[3]),
  }
  
  # Create a Features message using tf.train.Example.
  
  return tf.train.Example(features=tf.train.Features(feature=feature))

For example, suppose we have a single observation from the dataset, `[False, b'example', -1234, 0.9876]`. We can create and print the `tf.Example` message for this observation using `create_example()`. Each single observation will be written as a `Features` message as per the above. Note that the `tf.Example` [message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto#L88) is just a wrapper around the `Features` message.

In [0]:
# This is an example observation from the dataset.

example_observation = [False, b'example', -1234, 0.9876]

print(create_example(example_observation))
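
Because `tf.Example` is a protocol buffer message, it can be serialized to a binary string and parsed back. The sketch below shows that round trip using the standard protocol buffer methods `SerializeToString` and `FromString`; the feature name and value are taken from the example observation above.

```python
import tensorflow as tf

# Build a small message with one integer feature.
example = tf.train.Example(features=tf.train.Features(feature={
    'feature2': tf.train.Feature(int64_list=tf.train.Int64List(value=[-1234])),
}))

# Serialize to a binary string, then parse that string into a new message.
serialized = example.SerializeToString()
restored = tf.train.Example.FromString(serialized)

print(restored == example)
```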

## Writing `tf.Example` Messages To A `.tfrecords` File

We now write the 10,000 observations to the file `test.tfrecords`. Each observation is converted to a `tf.Example` message, then written to file. We can then verify that the file `test.tfrecords` has been created.

In [0]:
# Write the tf.Example observations to test.tfrecords.

writer = tf.python_io.TFRecordWriter('test.tfrecords')

for i in range(n_observations):
  example = create_example([feature0[i], feature1[i], feature2[i], feature3[i]])
  writer.write(example.SerializeToString())

writer.close()
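
As an alternative sketch, the writer can be used as a context manager so the file is closed even if an exception is raised inside the loop. This assumes a TensorFlow release that exposes `tf.io.TFRecordWriter` (the `tf.python_io.TFRecordWriter` used above is the older name); the output path and the three sample values are placeholders.

```python
import os
import tempfile

import tensorflow as tf

# Placeholder path in a temporary directory, to avoid overwriting test.tfrecords.
output_path = os.path.join(tempfile.mkdtemp(), 'sketch.tfrecords')

# The with-block closes the writer automatically on exit.
with tf.io.TFRecordWriter(output_path) as writer:
    for value in [1, 2, 3]:
        example = tf.train.Example(features=tf.train.Features(feature={
            'value': tf.train.Feature(int64_list=tf.train.Int64List(value=[value])),
        }))
        writer.write(example.SerializeToString())

print(os.path.getsize(output_path), 'bytes written')
```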

In [0]:
!ls

## Reading A `.tfrecords` File

Suppose we now want to read this data back, to be input as data into a model.

The following example imports the data as is, as a `tf.Example` message. This can be useful to verify that the file contains the data we expect. It can also be useful if the input data is stored as TFRecords but you would prefer to input NumPy data (or some other input data type), for example [here](https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays), since this approach lets us read the values themselves.

We iterate through the TFRecords in the infile, extract the `tf.Example` message, and can read/store the values within.

In [0]:
record_iterator = tf.python_io.tf_record_iterator(path='test.tfrecords')

for string_record in record_iterator:
  example = tf.train.Example()
  example.ParseFromString(string_record)
  
  print(example)
  
  # Exit after 1 iteration as this is purely demonstrative.
  break

The features of the `example` object (created above, of type `tf.Example`) can be accessed using its getters, as with any protocol buffer message. `example.features` returns the `Features` message, and its `feature` field is a map from feature name to feature value (which behaves like a Python dictionary).

In [0]:
print(dict(example.features.feature))

You can look up any feature in this map by name, just as with a dictionary.

In [0]:
print(example.features.feature['feature3'])

Now, we can access the value using the getters again.

In [0]:
print(example.features.feature['feature3'].float_list.value)
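
To pull every value out at once (for instance, to hand the data to NumPy), the getters above can be wrapped in a loop. The sketch below relies on the fact that the three list fields of a `Feature` message form a protocol buffer `oneof` named `kind`; `example_to_dict` is a hypothetical helper name, and the sample message is built inline so the block runs on its own.

```python
import tensorflow as tf


def example_to_dict(example):
    """Hypothetical helper: flatten a tf.train.Example into {name: list of values}."""
    result = {}
    for name, feature in example.features.feature.items():
        # WhichOneof returns 'bytes_list', 'float_list', 'int64_list', or None.
        kind = feature.WhichOneof('kind')
        result[name] = list(getattr(feature, kind).value) if kind else []
    return result


sample = tf.train.Example(features=tf.train.Features(feature={
    'feature0': tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    'feature1': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'example'])),
    'feature3': tf.train.Feature(float_list=tf.train.FloatList(value=[0.5])),
}))

values = example_to_dict(sample)
print(values)
```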

## Using The `Dataset` Object

We can also read the `.tfrecords` file into a [`dataset` object](https://www.tensorflow.org/api_docs/python/tf/data/Dataset). More information on consuming TFRecord files as a Dataset can be found [here](https://www.tensorflow.org/guide/datasets#consuming_tfrecord_data). Using this dataset structure can be useful for standardizing input data and optimizing performance, and it is usually more convenient than iterating over the raw records by hand.

In [0]:
filenames = ['test.tfrecords']
dataset = tf.data.TFRecordDataset(filenames)

Each record in this dataset is an `EagerTensor` type, as [eager execution](https://www.tensorflow.org/guide/eager) was enabled at the start of this notebook. These tensors can be parsed using the function below.

In [0]:
def _parse_function(example_proto):
 
  # Create a dictionary of features.
  
  features = {
      'feature0': tf.FixedLenFeature([], tf.int64, default_value=0),
      'feature1': tf.FixedLenFeature([], tf.string, default_value=''),
      'feature2': tf.FixedLenFeature([], tf.int64, default_value=0),
      'feature3': tf.FixedLenFeature([], tf.float32, default_value=0.0),
  }
  
  # Parse the input tf.Example proto using the dictionary above.
  
  return tf.parse_single_example(example_proto, features)

Now, we can use eager execution to display the observations in the dataset. Note that there are 10,000 observations in this dataset, but we only display the first 10. The data is displayed as a dictionary of features. Each item is a `tf.Tensor`, and the `numpy` element of this tensor displays the value of the feature.

In [0]:
for record in dataset.take(10):
  print(_parse_function(record))
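
Instead of calling the parse function on each record by hand, it can also be applied to the whole dataset with `map`. The following is a self-contained sketch under a few assumptions: it uses the `tf.io` names (`tf.io.TFRecordWriter`, `tf.io.FixedLenFeature`, `tf.io.parse_single_example`), which may not exist in older releases, it writes three made-up records to a temporary file rather than reusing `test.tfrecords`, and `parse_record` is a name introduced here.

```python
import os
import tempfile

import tensorflow as tf

# Iterating over a dataset in Python requires eager execution.
if not tf.executing_eagerly():
    tf.compat.v1.enable_eager_execution()

# Write three small made-up records to a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'sketch.tfrecords')
with tf.io.TFRecordWriter(path) as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            'feature2': tf.train.Feature(int64_list=tf.train.Int64List(value=[i * 10])),
            'feature3': tf.train.Feature(float_list=tf.train.FloatList(value=[i + 0.5])),
        }))
        writer.write(example.SerializeToString())

feature_spec = {
    'feature2': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}


def parse_record(serialized):
    # Parse one serialized tf.Example into a dictionary of tensors.
    return tf.io.parse_single_example(serialized, feature_spec)


# map applies the parser lazily to every record in the file.
parsed_dataset = tf.data.TFRecordDataset([path]).map(parse_record)

# .numpy() converts each scalar tensor to a plain NumPy value.
values = [(int(r['feature2'].numpy()), float(r['feature3'].numpy()))
          for r in parsed_dataset]
print(values)
```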

## Reading/Writing Image Data

This is an example of how to read and write image data using TFRecords. The purpose is to show, end to end, how to take input data (in this case an image), write it to a `.tfrecords` file, then read the file back and display the image.

This can be useful if, for example, you want to use several models on the same input dataset. Instead of storing the image data raw, it can be preprocessed into the TFRecords format, and that can be used in all further processing and modelling. 

First, let's download [this](https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg) adorable image of a cat in the snow, and [this](https://upload.wikimedia.org/wikipedia/commons/f/fe/New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg) awesome picture of the Williamsburg Bridge, NYC under construction.

In [0]:
# These imports are relevant for displaying and encoding image strings.

import base64

from IPython.display import Image

In [0]:
!wget -O 'cat_in_snow.jpg' 'https://upload.wikimedia.org/wikipedia/commons/b/b6/Felis_catus-cat_on_snow.jpg'
!wget -O 'williamsburg_bridge.jpg' 'https://upload.wikimedia.org/wikipedia/commons/f/fe/New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg'

In [0]:
!ls

As we did earlier, we can now encode the features as types compatible with `tf.Example`. In this case, we will not only store the raw image string as a feature, but we will store the height, width, depth, and an arbitrary `label` feature, which is used when we write the file to distinguish between the cat image and the bridge image. We will use `0` for the cat image, and `1` for the bridge image. 

In [0]:
image_labels = {
    'cat_in_snow.jpg': 0,
    'williamsburg_bridge.jpg': 1,
}

In [0]:
# This is an example, just using the cat image.

file = open('cat_in_snow.jpg', 'rb').read()

image_shape = tf.image.decode_jpeg(file).shape
image_string = base64.b64encode(file)

label = image_labels['cat_in_snow.jpg']

# Create a dictionary with features that may be relevant.

feature = {
    'height': _int64_feature(image_shape[0]),
    'width': _int64_feature(image_shape[1]),
    'depth': _int64_feature(image_shape[2]),
    'label': _int64_feature(label),
    'image_raw': _bytes_feature(image_string),
}

tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
print(tf_example)
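
The example above base64-encodes the file contents before storing them. As a minimal standard-library sketch of what that step does, the block below encodes and decodes a stand-in byte string (not a real JPEG). Note that base64 output is about one third larger than its input; since a `bytes_list` can hold arbitrary bytes, storing the raw file contents directly should also be possible, although this notebook does not do so.

```python
import base64

# Stand-in for the contents of a JPEG file: every byte value from 0 to 255.
raw_bytes = bytes(bytearray(range(256)))

encoded = base64.b64encode(raw_bytes)   # ASCII-only bytes, safe to embed in text
decoded = base64.b64decode(encoded)     # recovers the original bytes exactly

print(len(raw_bytes), len(encoded))
print(decoded == raw_bytes)
```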

We see that all of the features are now stored in the `tf.Example` message. Now, we repeat the code above for each image and write the example messages to a file, `images.tfrecords`.

In [0]:
# Write the raw image files to images.tfrecords.
# First, process the two images into tf.Example messages.
# Then, write to a .tfrecords file.

writer = tf.python_io.TFRecordWriter('images.tfrecords')

for filename, label in image_labels.items():
  
  file = open(filename, 'rb').read()

  image_shape = tf.image.decode_jpeg(file).shape
  image_string = base64.b64encode(file)

  feature = {
      'height': _int64_feature(image_shape[0]),
      'width': _int64_feature(image_shape[1]),
      'depth': _int64_feature(image_shape[2]),
      'label': _int64_feature(label),
      'image_raw': _bytes_feature(image_string),
  }
  
  tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
  writer.write(tf_example.SerializeToString())

writer.close()

In [0]:
!ls

We now have the file `images.tfrecords`. We can iterate over the records in the file to read back what we wrote. Since we only want to reproduce the images, the only feature we need is the raw image string. We can extract it using the getters described above, namely `example.features.feature['image_raw'].bytes_list.value[0]`. We also use the labels to determine which record is the cat and which is the bridge.

In [0]:
record_iterator = tf.python_io.tf_record_iterator(path='images.tfrecords')

# Create a dictionary mapping the image label to the bytes string.

image_bytes = {}

for string_record in record_iterator:
  example = tf.train.Example()
  example.ParseFromString(string_record)
  
  label = example.features.feature['label'].int64_list.value[0]
  
  image_bytes[label] = example.features.feature['image_raw'].bytes_list.value[0]

Now, we create new JPEG files and write the decoded image strings to them.

In [0]:
# Open the files in binary mode ('wb'): the decoded image data is bytes, not text.
with open('cat_in_snow_from_tfrecords.jpg', 'wb') as f:
  f.write(base64.b64decode(image_bytes[image_labels['cat_in_snow.jpg']]))

with open('williamsburg_bridge_from_tfrecords.jpg', 'wb') as f:
  f.write(base64.b64decode(image_bytes[image_labels['williamsburg_bridge.jpg']]))
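
One way to check that the round trip through the `.tfrecords` file was lossless is to compare the restored file with the original byte for byte. The sketch below uses the standard-library `filecmp` module; `files_identical` is a hypothetical helper name, and the demonstration uses two small stand-in files so that it runs without the downloaded images.

```python
import filecmp
import os
import tempfile


def files_identical(path_a, path_b):
    """Hypothetical helper: True when both files contain exactly the same bytes."""
    # shallow=False forces a content comparison instead of comparing os.stat data.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demonstration with stand-in files rather than the real images.
tmp_dir = tempfile.mkdtemp()
original_path = os.path.join(tmp_dir, 'original.bin')
restored_path = os.path.join(tmp_dir, 'restored.bin')
for path in (original_path, restored_path):
    with open(path, 'wb') as f:
        f.write(b'\xff\xd8\xff\xe0 stand-in bytes')

print(files_identical(original_path, restored_path))
```

In this notebook, the analogous call would be `files_identical('cat_in_snow.jpg', 'cat_in_snow_from_tfrecords.jpg')`.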

Let's display these images! Remember that these are not the original image files: they have been encoded into a `.tfrecords` file and then read back into raw image format.

In [0]:
Image(filename='cat_in_snow_from_tfrecords.jpg', width=500)

In [0]:
Image(filename='williamsburg_bridge_from_tfrecords.jpg', width=500)

In practice, however, working directly with the raw TFRecords format is less convenient than working with the `dataset` object. This is also true for images, where we can easily load image files into a dataset that is ready to use. This example follows the documentation [here](https://www.tensorflow.org/guide/datasets#decoding_image_data_and_resizing_it). We first define another `_parse_function` to parse an image file into a decoded image, then leverage the `from_tensor_slices` method to load these images into a dataset.

In [0]:
def _parse_function(filename, label):
  image_string = tf.read_file(filename)
  image_decoded = tf.image.decode_jpeg(image_string)
  
  return image_decoded, label

In [0]:
filenames = tf.constant(['cat_in_snow.jpg', 'williamsburg_bridge.jpg'])
labels = tf.constant([0, 1])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)

Again using eager execution, we can print the records in the dataset. Each record is a tuple of the decoded image and its label. Both have type `tf.Tensor`: the first is a three-dimensional array of pixel values (height, width, channels), of which only part is shown when printed, and the second is the label.

From the below, we see that the first record comes from the cat image (as the label is `0`) and the second is from the bridge image (as the label is `1`).

In [0]:
for record in dataset.take(2):
  print(record)