diff --git a/site/en/tutorials/load_data/tf-records.ipynb b/site/en/tutorials/load_data/tf-records.ipynb
index 632441b8748..3e4cec99b26 100644
--- a/site/en/tutorials/load_data/tf-records.ipynb
+++ b/site/en/tutorials/load_data/tf-records.ipynb
@@ -6,14 +6,20 @@
"name": "tf-records.ipynb",
"version": "0.3.2",
"provenance": [],
- "collapsed_sections": []
+ "private_outputs": true,
+ "collapsed_sections": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
}
},
"cells": [
{
"metadata": {
- "colab_type": "text",
- "id": "t09eeeR5prIJ"
+ "id": "pL--_KGdYoBz",
+ "colab_type": "text"
},
"cell_type": "markdown",
"source": [
@@ -22,9 +28,9 @@
},
{
"metadata": {
- "cellView": "form",
+ "id": "uBDvXpYzYnGj",
"colab_type": "code",
- "id": "GCCk8_dHpuNf",
+ "cellView": "form",
"colab": {}
},
"cell_type": "code",
@@ -51,7 +57,37 @@
},
"cell_type": "markdown",
"source": [
- "# Using TFRecords and TF Examples"
+ "# Using TFRecords and `tf.Example`\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "3pkUd_9IZCFO",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "To read data efficiently it can be helpful to serialize your data and store it in a set of files (100-200MB each) that can each be read linearly. This is especially true if the data is being streamed over a network. This can also be useful for caching any data-preprocessing.\n",
+ "\n",
+ "The TFRecord format is a simple format for storing a sequence of binary records.\n",
+ "\n",
+ "[Protocol buffers](https://developers.google.com/protocol-buffers/) are a cross-platform, cross-language library for efficient serialization of structured data.\n",
+ "\n",
+ "Protocol messages are defined by `.proto` files, these are often the easiest way to understand a message type. \n",
+ "\n",
+ "The `tf.Example` message (or protobuf) is a flexible message type that represents a `{\"string\": value}` mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as [TFX](https://www.tensorflow.org/tfx/)."
]
},
{
@@ -61,11 +97,20 @@
},
"cell_type": "markdown",
"source": [
- "The Example structure as well as the TFRecord format are extremely useful for describing input data in the TensorFlow API. It allows developers to preprocess their data only once for multiple purposes, and allows developers to store their data locally. \n",
"\n",
- "The [`tf.Example`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto) [protocol buffer](https://developers.google.com/protocol-buffers/) (a protocol buffer is also called a message) is specifically designed for use with TensorFlow, as well as higher-level APIs such as [TFX](https://www.tensorflow.org/tfx/) and [Keras](https://www.tensorflow.org/guide/keras). This notebook will demonstrate how to create, parse, and use the `tf.Example` message, and then store, read, and write this data in the `.tfrecords` format. This tutorial includes an end-to-end example of reading/writing image data as TF Examples in the TFRecord format. \n",
+ "This notebook will demonstrate how to create, parse, and use the `tf.Example` message, and then serialize, write, and read `tf.Example` messages to and from `.tfrecord` files.\n",
"\n",
- "Note that, while extremely useful, using these structures is ultimately optional if using the [tf.data API](https://www.tensorflow.org/api_docs/python/tf/data) makes more sense."
+ "Note: While useful, these structures are optional. There is no need to convert existing code to use TFRecords, unless you are using [`tf.data`](https://www.tensorflow.org/guide/datasets) and reading data is still the bottleneck to training. See [Data Input Pipeline Performance](https://www.tensorflow.org/guide/performance/datasets) for dataset performance tips."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "WkRreBf1eDVc",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Setup"
]
},
{
@@ -83,11 +128,22 @@
"import tensorflow as tf\n",
"tf.enable_eager_execution()\n",
"\n",
- "import numpy as np"
+ "import numpy as np\n",
+ "import IPython.display as display"
],
"execution_count": 0,
"outputs": []
},
+ {
+ "metadata": {
+ "id": "e5Kq88ccUWQV",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## `tf.Example`"
+ ]
+ },
{
"metadata": {
"id": "VrdQHgvNijTi",
@@ -95,7 +151,7 @@
},
"cell_type": "markdown",
"source": [
- "## Data Types In `tf.Example`"
+ "### Data types for `tf.Example`"
]
},
{
@@ -105,19 +161,21 @@
},
"cell_type": "markdown",
"source": [
- "The `tf.Example` type is generic enough to accept a wide range of data types. While the following [three types](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L4) of features are compatible with `tf.Example`, most other generic types can be coerced into one of these.\n",
+ "Fundamentally a `tf.Example` is a `{\"string\": tf.train.Feature}` mapping.\n",
+ "\n",
+ "The `tf.train.Feature` message type can accept one of the following three types (See the [`.proto` file]((https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto) for reference). Most other generic types can be coerced into one of these.\n",
"\n",
- "1. [`bytes_list`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L65) (the following types can be coerced)\n",
+ "1. `tf.train.BytesList` (the following types can be coerced)\n",
"\n",
" - `string`\n",
" - `byte`\n",
"\n",
- "1. [`float_list`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L68) (the following types can be coerced)\n",
+ "1. `tf.train.FloatList` (the following types can be coerced)\n",
"\n",
" - `float` (`float32`)\n",
" - `double` (`float64`)\n",
"\n",
- "1. [`int64_list`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L65) (the following types can be coerced)\n",
+ "1. `tf.train.Int64List` (the following types can be coerced)\n",
"\n",
" - `bool`\n",
" - `enum`\n",
@@ -134,7 +192,9 @@
},
"cell_type": "markdown",
"source": [
- "In order to convert a standard type to a `tf.Example`-compatible type, we can use the following functions. Each function takes a single input value and returns one of the 3 `list` types above."
+ "In order to convert a standard TensorFlow type to a `tf.Example`-compatible `tf.train.Feature`, you can use the following shortcut functions:\n",
+ "\n",
+ "Each function takes a scalar input value and returns a `tf.train.Feature` containing one of the three `list` types above."
]
},
{
@@ -163,6 +223,16 @@
"execution_count": 0,
"outputs": []
},
+ {
+ "metadata": {
+ "id": "Wst0v9O8hgzy",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Note: To stay simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use `tf.serialize_tensor` to convert tensors to binary-strings. Strings are scalars in tensorflow. Use `tf.parse_tensor` to convert the binary-string back to a tensor."
+ ]
+ },
{
"metadata": {
"id": "vsMbkkC8xxtB",
@@ -181,8 +251,8 @@
},
"cell_type": "code",
"source": [
- "print(_bytes_feature('test_string'))\n",
- "print(_bytes_feature(bytes('test_bytes')))\n",
+ "print(_bytes_feature(b'test_string'))\n",
+ "print(_bytes_feature(u'test_bytes'.encode('utf-8')))\n",
"\n",
"print(_float_feature(np.exp(1)))\n",
"\n",
@@ -192,6 +262,31 @@
"execution_count": 0,
"outputs": []
},
+ {
+ "metadata": {
+ "id": "nj1qpfQU5qmi",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "All proto messages can be serialized to a binary-string using the `.SerializeToString` method."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "5afZkORT5pjm",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "feature = _float_feature(np.exp(1))\n",
+ "\n",
+ "feature.SerializeToString()"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
{
"metadata": {
"id": "laKnw9F3hL-W",
@@ -199,7 +294,7 @@
},
"cell_type": "markdown",
"source": [
- "## Creating A `tf.Example` Message"
+ "### Creating a `tf.Example` message"
]
},
{
@@ -211,7 +306,7 @@
"source": [
"Suppose you want to create a `tf.Example` message from existing data. In practice, the dataset may come from anywhere, but the procedure of creating the `tf.Example` message from a single observation will be the same. \n",
"\n",
- "1. Within each observation, each value needs to be converted to one of the 3 compatible types, using one of the functions above. \n",
+ "1. Within each observation, each value needs to be converted to a `tf.train.Feature` containing one of the 3 compatible types, using one of the functions above. \n",
"\n",
"1. We create a map (dictionary) from the feature name string to the encoded feature value produced in #1.\n",
"\n",
@@ -250,11 +345,12 @@
"# boolean feature, encoded as False or True\n",
"feature0 = np.random.choice([False, True], n_observations)\n",
"\n",
- "# bytes feature\n",
- "feature1 = np.random.bytes(n_observations)\n",
- "\n",
"# integer feature, random between -10000 and 10000\n",
- "feature2 = np.random.randint(-10000, 10000, n_observations)\n",
+ "feature1 = np.random.randint(0, 5, n_observations)\n",
+ "\n",
+ "# bytes feature\n",
+ "strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])\n",
+ "feature2 = strings[feature1]\n",
"\n",
"# float feature, from a standard normal distribution\n",
"feature3 = np.random.randn(n_observations)"
@@ -280,27 +376,25 @@
},
"cell_type": "code",
"source": [
- "def create_example(features):\n",
+ "def serialize_example(feature0, feature1, feature2, feature3):\n",
" \"\"\"\n",
" Creates a tf.Example message ready to be written to a file.\n",
- " \n",
- " Inputs:\n",
- " - features: a 4-list of the values in the observation\n",
" \"\"\"\n",
" \n",
" # Create a dictionary mapping the feature name to the tf.Example-compatible\n",
" # data type.\n",
" \n",
" feature = {\n",
- " 'feature0': _int64_feature(features[0]),\n",
- " 'feature1': _bytes_feature(features[1]),\n",
- " 'feature2': _int64_feature(features[2]),\n",
- " 'feature3': _float_feature(features[3]),\n",
+ " 'feature0': _int64_feature(feature0),\n",
+ " 'feature1': _int64_feature(feature1),\n",
+ " 'feature2': _bytes_feature(feature2),\n",
+ " 'feature3': _float_feature(feature3),\n",
" }\n",
" \n",
" # Create a Features message using tf.train.Example.\n",
" \n",
- " return tf.train.Example(features=tf.train.Features(feature=feature))"
+ " example_proto = tf.train.Example(features=tf.train.Features(feature=feature))\n",
+ " return example_proto.SerializeToString()"
],
"execution_count": 0,
"outputs": []
@@ -312,7 +406,7 @@
},
"cell_type": "markdown",
"source": [
- "For example, suppose we have a single observation from the dataset, `[False, bytes('example'), -1234, 0.9876]`. We can create and print the `tf.Example` message for this observation using `create_message()`. Each single observation will be written as a `Features` message as per the above. Note that the `tf.Example` [message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto#L88) is just a wrapper around the `Features` message."
+ "For example, suppose we have a single observation from the dataset, `[False, 4, bytes('goat'), 0.9876]`. We can create and print the `tf.Example` message for this observation using `create_message()`. Each single observation will be written as a `Features` message as per the above. Note that the `tf.Example` [message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto#L88) is just a wrapper around the `Features` message."
]
},
{
@@ -325,190 +419,216 @@
"source": [
"# This is an example observation from the dataset.\n",
"\n",
- "example_observation = [False, bytes('example'), -1234, 0.9876]\n",
+ "example_observation = []\n",
"\n",
- "print(create_example(example_observation))"
+ "serialized_example = serialize_example(False, 4, b'goat', 0.9876)\n",
+ "serialized_example"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "CKn5uql2lAaN",
+ "id": "_pbGATlG6u-4",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "## Writing `tf.Example` Messages To A `.tfrecords` File"
+ "To decode the message use the `tf.train.Example.FromString` method."
]
},
{
"metadata": {
- "id": "LNW_FA-GQWXs",
+ "id": "dGim-mEm6vit",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "example_proto = tf.train.Example.FromString(serialized_example)\n",
+ "example_proto"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "y-Hjmee-fbLH",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "We now write the 10,000 observations to the file `test.tfrecords`. Each observation is converted to a `tf.Example` message, then written to file. We can then verify that the file `test.tfrecords` has been created."
+ "## TFRecord files using `tf.data`"
]
},
{
"metadata": {
- "id": "MKPHzoGv7q44",
- "colab_type": "code",
- "colab": {}
+ "id": "GmehkCCT81Ez",
+ "colab_type": "text"
},
- "cell_type": "code",
+ "cell_type": "markdown",
"source": [
- "# Write the tf.Example observations to test.tfrecords.\n",
- "\n",
- "writer = tf.python_io.TFRecordWriter('test.tfrecords')\n",
+ "The `tf.data` module also provides tools for reading and writing data in tensorflow."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "1FISEuz8ubu3",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### Writing a TFRecord file\n",
"\n",
- "for i in range(n_observations):\n",
- " example = create_example([feature0[i], feature1[i], feature2[i], feature3[i]])\n",
- " writer.write(example.SerializeToString())\n",
+ "The easiest way to get the data into a dataset is to use the `from_tensor_slices` method.\n",
"\n",
- "writer.close()"
- ],
- "execution_count": 0,
- "outputs": []
+ "Applied to an array, it returns a dataset of scalars."
+ ]
},
{
"metadata": {
- "id": "EjdFHHJMpUUo",
+ "id": "mXeaukvwu5_-",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "!ls"
+ "tf.data.Dataset.from_tensor_slices(feature1)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "wtQ7k0YWQ1cz",
+ "id": "f-q0VKyZvcad",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "## Reading A `.tfrecords` File"
+ "Applies to a tuple of arrays, it returns a dataset of tuples:"
]
},
{
"metadata": {
- "id": "utkozytkQ-2K",
- "colab_type": "text"
+ "id": "H5sWyu1kxnvg",
+ "colab_type": "code",
+ "colab": {}
},
- "cell_type": "markdown",
+ "cell_type": "code",
"source": [
- "Suppose we now want to read this data back, to be input as data into a model.\n",
- "\n",
- "The following example imports the data as is, as a `tf.Example` message. This can be useful to verify that a the file contains the data that we expect. This can also be useful if the input data is stored as TFRecords but you would prefer to input NumPy data (or some other input data type), for example [here](https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays), since this example allows us to read the values themselves.\n",
- "\n",
- "We iterate through the TFRecords in the infile, extract the `tf.Example` message, and can read/store the values within."
- ]
+ "features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))\n",
+ "features_dataset"
+ ],
+ "execution_count": 0,
+ "outputs": []
},
{
"metadata": {
- "id": "36ltP9B8OezA",
+ "id": "m1C-t71Nywze",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "record_iterator = tf.python_io.tf_record_iterator(path='test.tfrecords')\n",
- "\n",
- "for string_record in record_iterator:\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(string_record)\n",
- " \n",
- " print(example)\n",
- " \n",
- " # Exit after 1 iteration as this is purely demonstrative.\n",
- " break"
+ "# Use `take(1)` to only pull one example from the dataset.\n",
+ "for f0,f1,f2,f3 in features_dataset.take(1):\n",
+ " print(f0)\n",
+ " print(f1)\n",
+ " print(f2)\n",
+ " print(f3)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "i3uquiiGTZTK",
+ "id": "mhIe63awyZYd",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "The features of the `example` object (created above of type `tf.Example`) can be accessed using its getters (similarly to any protocol buffer message). `example.features` returns a `repeated feature` message, then getting the `feature` message returns a map of feature name to feature value (stored in Python as a dictionary)."
+ "Use the `tf.data.Dataset.map` method to apply a function to each element of a `Dataset`.\n",
+ "\n",
+ "The mapped function must operate in TensorFlow graph mode: It must operate on and return `tf.Tensors`. A non-tensor function, like `create_example`, can be wrapped with `tf.py_func` to make it compatible.\n",
+ "\n",
+ "Using `tf.py_func` requires that you specify the shape and type information that is otherwise unavailable:"
]
},
{
"metadata": {
- "id": "-UNzS7vsUBs0",
+ "id": "apB5KYrJzjPI",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "print(dict(example.features.feature))"
+ "def tf_serialize_example(f0,f1,f2,f3):\n",
+ " tf_string = tf.py_func(\n",
+ " serialize_example, \n",
+ " (f0,f1,f2,f3), # pass these args to the above function.\n",
+ " tf.string) # the return type is `tf.string`.\n",
+ " return tf.reshape(tf_string, ()) # The result is a scalar"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "u1M-WrbqUUVW",
+ "id": "CrFZ9avE3HUF",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "From this dictionary, you can get any given value as with a dictionary."
+ "Apply this function to each element in the dataset:"
]
},
{
"metadata": {
- "id": "2yCBu70IUb2H",
+ "id": "VDeqYVbW3ww9",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "print(example.features.feature['feature3'])"
+ "serialized_features_dataset = features_dataset.map(tf_serialize_example)\n",
+ "serialized_features_dataset"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "4dw6_OI9UiNZ",
+ "id": "p6lw5VYpjZZC",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "Now, we can access the value using the getters again."
+ "And write them to a TFRecord file:"
]
},
{
"metadata": {
- "id": "BdDYjDnDUlFe",
+ "id": "vP1VgTO44UIE",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "print(example.features.feature['feature3'].float_list.value)"
+ "filename = 'test.tfrecord'\n",
+ "writer = tf.data.experimental.TFRecordWriter(filename)\n",
+ "writer.write(serialized_features_dataset)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "y-Hjmee-fbLH",
+ "id": "6aV0GQhV8tmp",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "## Using The `Dataset` Object"
+ "### Reading a TFRecord file"
]
},
{
@@ -518,7 +638,11 @@
},
"cell_type": "markdown",
"source": [
- "We can also read the `.tfrecords` file into a [`dataset` object](https://www.tensorflow.org/api_docs/python/tf/data/Dataset). More information on consuming the TFRecord object into a Dataset can be found [here](https://www.tensorflow.org/guide/datasets#consuming_tfrecord_data). Using this datatset structure can be useful for standardizing input data and optimizing performance. It is also easier and quicker to use this object."
+ "We can also read the TFRecord file using the `tf.data.TFRecordDataset` class. \n",
+ "\n",
+ "More information on consuming TFRecord files using `tf.data` can be found [here](https://www.tensorflow.org/guide/datasets#consuming_tfrecord_data). \n",
+ "\n",
+ "Using `TFRecordDataset`s can be useful for standardizing input data and optimizing performance."
]
},
{
@@ -529,8 +653,37 @@
},
"cell_type": "code",
"source": [
- "filenames = ['test.tfrecords']\n",
- "dataset = tf.data.TFRecordDataset(filenames)"
+ "filenames = [filename]\n",
+ "raw_dataset = tf.data.TFRecordDataset(filenames)\n",
+ "raw_dataset"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "6_EQ9i2E_-Fz",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "At this point the dataset contains serialized `tf.train.Example` messages. When iterated over it returns these as scalar string tensors. \n",
+ "\n",
+ "Use the `.take` method to only show the first 10 records.\n",
+ "\n",
+ "Note: iterating over a `tf.data.Dataset` only works with eager execution enabled."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "hxVXpLz_AJlm",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "for raw_record in raw_dataset.take(10):\n",
+ " print(repr(raw_record))"
],
"execution_count": 0,
"outputs": []
@@ -542,7 +695,9 @@
},
"cell_type": "markdown",
"source": [
- "Each record in this dataset is an `EagerTensor` type, as [eager execution](https://www.tensorflow.org/guide/eager) was enabled at the start of this notebook. These tensors can be parsed using the function below."
+ "These tensors can be parsed using the function below.\n",
+ "\n",
+ "Note: The `feature_description` is necessary here because datasets use graph-execution, and need this description to build their shape and type signature."
]
},
{
@@ -553,20 +708,51 @@
},
"cell_type": "code",
"source": [
+ "# Create a description of the features. \n",
+ "feature_description = {\n",
+ " 'feature0': tf.FixedLenFeature([], tf.int64, default_value=0),\n",
+ " 'feature1': tf.FixedLenFeature([], tf.int64, default_value=0),\n",
+ " 'feature2': tf.FixedLenFeature([], tf.string, default_value=''),\n",
+ " 'feature3': tf.FixedLenFeature([], tf.float32, default_value=0.0),\n",
+ "}\n",
+ "\n",
"def _parse_function(example_proto):\n",
- " \n",
- " # Create a dictionary of features.\n",
- " \n",
- " features = {\n",
- " 'feature0': tf.FixedLenFeature([], tf.int64, default_value=0),\n",
- " 'feature1': tf.FixedLenFeature([], tf.string, default_value=''),\n",
- " 'feature2': tf.FixedLenFeature([], tf.int64, default_value=0),\n",
- " 'feature3': tf.FixedLenFeature([], tf.float32, default_value=0.0),\n",
- " }\n",
- " \n",
" # Parse the input tf.Example proto using the dictionary above.\n",
- " \n",
- " return tf.parse_single_example(example_proto, features)"
+ " return tf.parse_single_example(example_proto, feature_description)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "gWETjUqhEQZf",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Or use `tf.parse example` to parse a whole batch at once."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "AH73hav6Bnmg",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Apply this finction to each item in the dataset using the `tf.data.Dataset.map` method:"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "6Ob7D-zmBm1w",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "parsed_dataset = raw_dataset.map(_parse_function)\n",
+ "parsed_dataset "
],
"execution_count": 0,
"outputs": []
@@ -578,7 +764,7 @@
},
"cell_type": "markdown",
"source": [
- "Now, we can use eager execution to display the observations in the dataset. Note that there are 10,000 observations in this dataset, but we only display the first 10. The data is displayed as a dictionary of features. Each item is a `tf.Tensor`, and the `numpy` element of this tensor displays the value of the feature."
+ "Use eager execution to display the observations in the dataset. There are 10,000 observations in this dataset, but we only display the first 10. The data is displayed as a dictionary of features. Each item is a `tf.Tensor`, and the `numpy` element of this tensor displays the value of the feature."
]
},
{
@@ -589,70 +775,82 @@
},
"cell_type": "code",
"source": [
- "for record in dataset.take(10):\n",
- " print(_parse_function(record))"
+ "for parsed_record in parsed_dataset.take(10):\n",
+ " print(repr(raw_record))"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "S0tFDrwdoj3q",
+ "id": "Cig9EodTlDmg",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "## Reading/Writing Image Data"
+ "Here, the `tf.parse_example` function unpacks the `tf.Example` fields into standard tensors."
]
},
{
"metadata": {
- "id": "rjN2LFxFpcR9",
+ "id": "jyg1g3gU7DNn",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "This is an example of how to read and write image data using TFRecords. The purpose of this is to show how, end to end, input data (in this case an image) and write the data as a `.tfrecords` file, then read the file back and display the image.\n",
- "\n",
- "This can be useful if, for example, you want to use several models on the same input dataset. Instead of storing the image data raw, it can be preprocessed into the TFRecords format, and that can be used in all further processing and modelling. \n",
- "\n",
- "First, let's download [this](https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg) adorable image of a cat in the snow, and [this](https://upload.wikimedia.org/wikipedia/commons/f/fe/New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg) awesome picture of the Williamsburg Bridge, NYC under construction."
+ "## TFRecord files using tf.python_io"
]
},
{
"metadata": {
- "id": "BbK8nGxvU9d0",
- "colab_type": "code",
- "colab": {}
+ "id": "3FXG3miA7Kf1",
+ "colab_type": "text"
},
- "cell_type": "code",
+ "cell_type": "markdown",
"source": [
- "# These imports are relevant for displaying and encoding image strings.\n",
- "\n",
- "import base64\n",
- "\n",
- "from IPython.display import Image"
- ],
- "execution_count": 0,
- "outputs": []
+ "The `tf.python_io` module also contains pure-Python functions for reading and writing TFRecord files. "
+ ]
},
{
"metadata": {
- "id": "3a0fmwg8lHdF",
+ "id": "CKn5uql2lAaN",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### Writing a TFRecord file"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "LNW_FA-GQWXs",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Now write the 10,000 observations to the file `test.tfrecords`. Each observation is converted to a `tf.Example` message, then written to file. We can then verify that the file `test.tfrecords` has been created."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "MKPHzoGv7q44",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "!wget -O 'cat_in_snow.jpg' 'https://upload.wikimedia.org/wikipedia/commons/b/b6/Felis_catus-cat_on_snow.jpg'\n",
- "!wget -O 'williamsburg_bridge.jpg' 'https://upload.wikimedia.org/wikipedia/commons/f/fe/New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg'"
+ "# Write the `tf.Example` observations to the file.\n",
+ "with tf.python_io.TFRecordWriter(filename) as writer:\n",
+ " for i in range(n_observations):\n",
+ " example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])\n",
+ " writer.write(example)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "ELE4ueh4o3OM",
+ "id": "EjdFHHJMpUUo",
"colab_type": "code",
"colab": {}
},
@@ -665,287 +863,371 @@
},
{
"metadata": {
- "id": "Azx83ryQEU6T",
+ "id": "wtQ7k0YWQ1cz",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "As we did earlier, we can now encode the features as types compatible with `tf.Example`. In this case, we will not only store the raw image string as a feature, but we will store the height, width, depth, and an arbitrary `label` feature, which is used when we write the file to distinguish between the cat image and the bridge image. We will use `0` for the cat image, and `1` for the bridge image. "
+ "### Reading a TFRecord file"
]
},
{
"metadata": {
- "id": "kC4TS1ZEONHr",
+ "id": "utkozytkQ-2K",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Suppose we now want to read this data back, to be input as data into a model.\n",
+ "\n",
+ "The following example imports the data as is, as a `tf.Example` message. This can be useful to verify that a the file contains the data that we expect. This can also be useful if the input data is stored as TFRecords but you would prefer to input NumPy data (or some other input data type), for example [here](https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays), since this example allows us to read the values themselves.\n",
+ "\n",
+ "We iterate through the TFRecords in the infile, extract the `tf.Example` message, and can read/store the values within."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "36ltP9B8OezA",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "image_labels = {\n",
- " 'cat_in_snow.jpg': 0,\n",
- " 'williamsburg_bridge.jpg': 1,\n",
- "}"
+ "record_iterator = tf.python_io.tf_record_iterator(path=filename)\n",
+ "\n",
+ "for string_record in record_iterator:\n",
+ " example = tf.train.Example()\n",
+ " example.ParseFromString(string_record)\n",
+ " \n",
+ " print(example)\n",
+ " \n",
+ " # Exit after 1 iteration as this is purely demonstrative.\n",
+ " break"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "c5njMSYNEhNZ",
+ "id": "i3uquiiGTZTK",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "The features of the `example` object (created above of type `tf.Example`) can be accessed using its getters (similarly to any protocol buffer message). `example.features` returns a `repeated feature` message, then getting the `feature` message returns a map of feature name to feature value (stored in Python as a dictionary)."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "-UNzS7vsUBs0",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "# This is an example, just using the cat image.\n",
- "\n",
- "file = open('cat_in_snow.jpg', 'rb').read()\n",
- "\n",
- "image_shape = tf.image.decode_jpeg(file).shape\n",
- "image_string = base64.b64encode(file)\n",
- "\n",
- "label = image_labels['cat_in_snow.jpg']\n",
- "\n",
- "# Create a dictionary with features that may be relevant.\n",
- "\n",
- "feature = {\n",
- " 'height': _int64_feature(image_shape[0]),\n",
- " 'width': _int64_feature(image_shape[1]),\n",
- " 'depth': _int64_feature(image_shape[2]),\n",
- " 'label': _int64_feature(label),\n",
- " 'image_raw': _bytes_feature(image_string),\n",
- "}\n",
- "\n",
- "tf_example = tf.train.Example(features=tf.train.Features(feature=feature))\n",
- "print(tf_example)"
+ "print(dict(example.features.feature))"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "2G_o3O9MN0Qx",
+ "id": "u1M-WrbqUUVW",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "We see that all of the features are now stores in the `tf.Example` message. Now, we functionalize the code above and write the example messages to a file, `images.tfrecords`."
+ "From this dictionary, you can get any given value as with a dictionary."
]
},
{
"metadata": {
- "id": "qcw06lQCOCZU",
+ "id": "2yCBu70IUb2H",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "# Write the raw image files to images.tfrecords.\n",
- "# First, process the two images into tf.Example messages.\n",
- "# Then, write to a .tfrecords file.\n",
- "\n",
- "writer = tf.python_io.TFRecordWriter('images.tfrecords')\n",
- "\n",
- "for filename, label in image_labels.items():\n",
- " \n",
- " file = open(filename, 'rb').read()\n",
- "\n",
- " image_shape = tf.image.decode_jpeg(file).shape\n",
- " image_string = base64.b64encode(file)\n",
- "\n",
- " feature = {\n",
- " 'height': _int64_feature(image_shape[0]),\n",
- " 'width': _int64_feature(image_shape[1]),\n",
- " 'depth': _int64_feature(image_shape[2]),\n",
- " 'label': _int64_feature(label),\n",
- " 'image_raw': _bytes_feature(image_string),\n",
- " }\n",
- " \n",
- " tf_example = tf.train.Example(features=tf.train.Features(feature=feature))\n",
- " writer.write(tf_example.SerializeToString())\n",
- "\n",
- "writer.close()"
+ "print(example.features.feature['feature3'])"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "yJrTe6tHPCfs",
+ "id": "4dw6_OI9UiNZ",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Now, we can access the value using the getters again."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "BdDYjDnDUlFe",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "!ls"
+ "print(example.features.feature['feature3'].float_list.value)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "jJSsCkZLPH6K",
+ "id": "S0tFDrwdoj3q",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "We now have the file `images.tfrecords`. We can now iterate over the records in the file to read back what we wrote. Since, for our use case we will just reproduce the image, the only feature we need is the raw image string. We can extract that using the getters described above, namely `example.features.feature['image_raw'].bytes_list.value[0]`. We also use the labels to determine which record is the cat as opposed to the bridge."
+ "## Walkthrough: Reading/Writing Image Data"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "rjN2LFxFpcR9",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "This is an example of how to read and write image data using TFRecords. The purpose of this is to show how, end to end, input data (in this case an image) and write the data as a TFRecord file, then read the file back and display the image.\n",
+ "\n",
+ "This can be useful if, for example, you want to use several models on the same input dataset. Instead of storing the image data raw, it can be preprocessed into the TFRecords format, and that can be used in all further processing and modelling. \n",
+ "\n",
+ "First, let's download [this image](https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg) of a cat in the snow and [this photo](https://upload.wikimedia.org/wikipedia/commons/f/fe/New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg) of the Williamsburg Bridge, NYC under construction."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "5Lk2qrKvN0yu",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### Fetch the images"
]
},
{
"metadata": {
+ "id": "3a0fmwg8lHdF",
"colab_type": "code",
- "id": "M6Cnfd3cTKHN",
"colab": {}
},
"cell_type": "code",
"source": [
- "record_iterator = tf.python_io.tf_record_iterator(path='images.tfrecords')\n",
- "\n",
- "# Create a dictionary mapping the image label to the bytes string.\n",
- "\n",
- "image_bytes = {}\n",
- "\n",
- "for string_record in record_iterator:\n",
- " example = tf.train.Example()\n",
- " example.ParseFromString(string_record)\n",
- " \n",
- " label = example.features.feature['label'].int64_list.value[0]\n",
- " \n",
- " image_bytes[label] = example.features.feature['image_raw'].bytes_list.value[0]"
+ "cat_in_snow = tf.keras.utils.get_file('320px-Felis_catus-cat_on_snow.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Felis_catus-cat_on_snow.jpg/320px-Felis_catus-cat_on_snow.jpg')\n",
+ "williamsburg_bridge = tf.keras.utils.get_file('194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg','https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg')"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "qTkNHH9pid40",
- "colab_type": "text"
+ "id": "7aJJh7vENeE4",
+ "colab_type": "code",
+ "colab": {}
},
- "cell_type": "markdown",
+ "cell_type": "code",
"source": [
- "Now, we create new blank JPEG files, that we will write the decoded image strings to."
- ]
+ "display.display(display.Image(filename=cat_in_snow))\n",
+ "display.display(display.HTML('Image cc-by: Von.grzanka'))"
+ ],
+ "execution_count": 0,
+ "outputs": []
},
{
"metadata": {
- "id": "eSzTHYZkVGTd",
+ "id": "KkW0uuhcXZqA",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "with open('cat_in_snow_from_tfrecords.jpg', 'w') as f:\n",
- " f.write(base64.b64decode(image_bytes[image_labels['cat_in_snow.jpg']]))\n",
- "\n",
- "with open('williamsburg_bridge_from_tfrecords.jpg', 'w') as f:\n",
- " f.write(base64.b64decode(image_bytes[image_labels['williamsburg_bridge.jpg']]))"
+ "display.display(display.Image(filename=williamsburg_bridge))\n",
+ "display.display(display.HTML('source'))"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "azilbA7Pjeu2",
+ "id": "VSOgJSwoN5TQ",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "Let's display these images! Remember these are not the raw images, these have been encoded as a `.tfrecords` file and then read back into raw image format."
+ "### Write the TFRecord file"
]
},
{
"metadata": {
+ "id": "Azx83ryQEU6T",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "As we did earlier, we can now encode the features as types compatible with `tf.Example`. In this case, we will not only store the raw image string as a feature, but we will store the height, width, depth, and an arbitrary `label` feature, which is used when we write the file to distinguish between the cat image and the bridge image. We will use `0` for the cat image, and `1` for the bridge image. "
+ ]
+ },
+ {
+ "metadata": {
+ "id": "kC4TS1ZEONHr",
"colab_type": "code",
- "id": "sQq8cG07U6NG",
"colab": {}
},
"cell_type": "code",
"source": [
- "Image(filename='cat_in_snow_from_tfrecords.jpg', width=500)"
+ "image_labels = {\n",
+ " cat_in_snow : 0,\n",
+ " williamsburg_bridge : 1,\n",
+ "}"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "KVoldzEVjqIX",
+ "id": "c5njMSYNEhNZ",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "Image(filename='williamsburg_bridge_from_tfrecords.jpg', width=500)"
+ "# This is an example, just using the cat image.\n",
+ "image_string = open(cat_in_snow, 'rb').read()\n",
+ "\n",
+ "label = image_labels[cat_in_snow]\n",
+ "\n",
+ "# Create a dictionary with features that may be relevant.\n",
+ "def image_example(image_string, label):\n",
+ " image_shape = tf.image.decode_jpeg(image_string).shape\n",
+ "\n",
+ " feature = {\n",
+ " 'height': _int64_feature(image_shape[0]),\n",
+ " 'width': _int64_feature(image_shape[1]),\n",
+ " 'depth': _int64_feature(image_shape[2]),\n",
+ " 'label': _int64_feature(label),\n",
+ " 'image_raw': _bytes_feature(image_string),\n",
+ " }\n",
+ "\n",
+ " return tf.train.Example(features=tf.train.Features(feature=feature))\n",
+ "\n",
+ "for line in str(image_example(image_string, label)).split('\\n')[:15]:\n",
+ " print(line)\n",
+ "print('...')"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "OiP8jBE44mEF",
+ "id": "2G_o3O9MN0Qx",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "In practice however, it is less practical to work directly with the raw TFRecords format than with the `dataset` object. This is also true for images, where we can easily load image files into a dataset that is ready to use. This example follows the documentation [here](https://www.tensorflow.org/guide/datasets#decoding_image_data_and_resizing_it). We first define another `_parse_function` to parse an image file into a decoded image, then leverage the `from_tensor_slices` method to load these images into a dataset."
+ "We see that all of the features are now stores in the `tf.Example` message. Now, we functionalize the code above and write the example messages to a file, `images.tfrecords`."
]
},
{
"metadata": {
- "id": "kG66dtkRpNFQ",
+ "id": "qcw06lQCOCZU",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "def _parse_function(filename, label):\n",
- " image_string = tf.read_file(filename)\n",
- " image_decoded = tf.image.decode_jpeg(image_string)\n",
- " \n",
- " return image_decoded, label"
+ "# Write the raw image files to images.tfrecords.\n",
+ "# First, process the two images into tf.Example messages.\n",
+ "# Then, write to a .tfrecords file.\n",
+ "\n",
+ "with tf.python_io.TFRecordWriter('images.tfrecords') as writer:\n",
+ " for filename, label in image_labels.items():\n",
+ " image_string = open(filename, 'rb').read()\n",
+ " tf_example = image_example(image_string, label)\n",
+ " writer.write(tf_example.SerializeToString())"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "Q25yC0VlwLeG",
+ "id": "yJrTe6tHPCfs",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "filenames = tf.constant(['cat_in_snow.jpg', 'williamsburg_bridge.jpg'])\n",
- "labels = tf.constant([0, 1])\n",
- "\n",
- "dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))\n",
- "dataset = dataset.map(_parse_function)"
+ "!ls"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
- "id": "2FC8JOld5Ar-",
+ "id": "jJSsCkZLPH6K",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
- "Again using eager execution, we can print the records in the dataset. Each record is a tuple of a part of the image and the label. Each has type `tf.Tensor`, the first is an array of non-trivial shape (as it is part of an image), and the second is the label.\n",
+ "### Read the TFRecord file\n",
+ "\n",
+ "We now have the file `images.tfrecords`. We can now iterate over the records in the file to read back what we wrote. Since, for our use case we will just reproduce the image, the only feature we need is the raw image string. We can extract that using the getters described above, namely `example.features.feature['image_raw'].bytes_list.value[0]`. We also use the labels to determine which record is the cat as opposed to the bridge."
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "M6Cnfd3cTKHN",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "raw_image_dataset = tf.data.TFRecordDataset('images.tfrecords')\n",
"\n",
- "From the below, we see that the first record comes from the cat image (as the label is `0`) and the second is from the bridge image (as the label is `1`)."
+ "# Create a dictionary describing the features. \n",
+ "image_feature_description = {\n",
+ " 'height': tf.FixedLenFeature([], tf.int64),\n",
+ " 'width': tf.FixedLenFeature([], tf.int64),\n",
+ " 'depth': tf.FixedLenFeature([], tf.int64),\n",
+ " 'label': tf.FixedLenFeature([], tf.int64),\n",
+ " 'image_raw': tf.FixedLenFeature([], tf.string),\n",
+ "}\n",
+ "\n",
+ "def _parse_image_function(example_proto):\n",
+ " # Parse the input tf.Example proto using the dictionary above.\n",
+ " return tf.parse_single_example(example_proto, image_feature_description)\n",
+ "\n",
+ "parsed_image_dataset = raw_image_dataset.map(_parse_image_function)\n",
+ "parsed_image_dataset"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "0PEEFPk4NEg1",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Recover the images from the TFRecord file:"
]
},
{
"metadata": {
- "id": "s5zte_y3yGq8",
+ "id": "yZf8jOyEIjSF",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
- "for record in dataset.take(2):\n",
- " print(record)"
+ "for image_features in parsed_image_dataset:\n",
+ " image_raw = image_features['image_raw'].numpy()\n",
+ " display.display(display.Image(data=image_raw))"
],
"execution_count": 0,
"outputs": []