

##### REF: https://www.tensorflow.org/tutorials/load_data/tfrecord

> The TFRecord format is a simple format for storing a sequence of binary records.

> [Protocol buffers](https://developers.google.com/protocol-buffers/) are a cross-platform, cross-language library for efficient serialization of structured data.

> Protocol messages are defined by `.proto` files, which are often the easiest way to understand a message type.

> The `tf.Example` message (or protobuf) is a flexible message type that represents a `{"string": value}` mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as [TFX](https://www.tensorflow.org/tfx/).

# How to create, parse, and use the `tf.Example` message, and then serialize, write, and read `tf.Example` messages to and from `.tfrecord` files.

## Install and import

In [0]:
!pip install tf-nightly
import tensorflow as tf
import numpy as np
import IPython.display as display

## `tf.Example`

### Data types for `tf.Example`

Fundamentally, a `tf.Example` is a `{"string": tf.train.Feature}` mapping.

The `tf.train.Feature` message type can accept one of the following three types (see the [`.proto` file](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto) for reference). Most other generic types can be coerced into one of these:

1. `tf.train.BytesList` (the following types can be coerced)

  - `string`
  - `byte`

1. `tf.train.FloatList` (the following types can be coerced)

  - `float` (`float32`)
  - `double` (`float64`)

1. `tf.train.Int64List` (the following types can be coerced)

  - `bool`
  - `enum`
  - `int32`
  - `uint32`
  - `int64`
  - `uint64`

## Functions to convert a standard TensorFlow type to a `tf.Example`-compatible `tf.train.Feature`.

> Note: Each function takes a scalar input value and returns a `tf.train.Feature` containing one of the three `list` types above.

In [0]:
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Note: To stay simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use `tf.io.serialize_tensor` to convert tensors to binary-strings. Strings are scalars in TensorFlow. Use `tf.io.parse_tensor` to convert the binary-string back to a tensor.
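
As a minimal sketch of the non-scalar case described above, the snippet below round-trips a small matrix through a `BytesList`. The matrix contents and the variable names are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import tensorflow as tf

# A non-scalar value: a 2x3 float32 matrix (illustrative data).
matrix = tf.constant(np.arange(6, dtype=np.float32).reshape(2, 3))

# Serialize the whole tensor into a single scalar byte-string.
serialized = tf.io.serialize_tensor(matrix)

# Store the byte-string in a BytesList, as `_bytes_feature` would.
feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized.numpy()]))

# Recover the tensor. The dtype has to be supplied by the caller.
restored = tf.io.parse_tensor(feature.bytes_list.value[0], out_type=tf.float32)
print(restored)
```

The shape travels inside the serialized bytes, but the dtype does not, so it has to be recorded somewhere (for example, by convention or in a separate feature).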

Below are some examples of how these functions work. Note the varying input types and the standardized output types. If the input type for a function does not match one of the coercible types stated above, the function will raise an exception (e.g. `_int64_feature(1.0)` will error out, since `1.0` is a float, so should be used with the `_float_feature` function instead):

#### Testing

In [0]:
print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

All proto messages can be serialized to a binary-string using the `.SerializeToString` method:

In [0]:
feature = _float_feature(np.exp(1))

feature.SerializeToString()
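
The reverse direction uses `FromString`, the counterpart of `SerializeToString`. The following is a small round-trip sketch; the variable names `raw` and `decoded` are illustrative.

```python
import numpy as np
import tensorflow as tf

# Build a float feature directly (equivalent to `_float_feature(np.exp(1))`).
feature = tf.train.Feature(float_list=tf.train.FloatList(value=[np.exp(1)]))

# Serialize to a binary-string, then decode it back into a message.
raw = feature.SerializeToString()
decoded = tf.train.Feature.FromString(raw)

print(type(raw))
print(decoded)
```

The value is stored as a 32-bit float, so the decoded number matches `np.exp(1)` only to float32 precision.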

### Creating a `tf.Example` message

Suppose you want to create a `tf.Example` message from existing data. In practice, the dataset may come from anywhere, but the procedure of creating the `tf.Example` message from a single observation will be the same:

1. Within each observation, each value needs to be converted to a `tf.train.Feature` containing one of the 3 compatible types, using one of the functions above.

1. Create a map (dictionary) from the feature name string to the encoded feature value produced in step 1.

1. The map produced in step 2 is converted to a [`Features` message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L85).

The dataset below is created with NumPy and has 4 features:

* a boolean feature, `False` or `True` with equal probability
* an integer feature uniformly randomly chosen from `[0, 5)`, that is, one of 0 through 4
* a string feature generated from a string table by using the integer feature as an index
* a float feature from a standard normal distribution

Consider a sample consisting of 10,000 independently and identically distributed observations from each of the above distributions:

In [0]:
# The number of observations in the dataset.
n_obsers = int(1e4)

# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_obsers)

# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_obsers)

# String feature
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# Float feature, from a standard normal distribution
feature3 = np.random.randn(n_obsers)

Each of these features can be coerced into a `tf.Example`-compatible type using one of the functions (`_bytes_feature`, `_float_feature`, `_int64_feature`). You can then create a `tf.Example` message from these encoded features:

In [0]:
from tensorflow.train import Example, Features

In [0]:
def serialize_example(feature0, feature1, feature2, feature3):
  """
  Creates a tf.Example message ready to be written to a file.
  """
  feature = { 'feature0': _int64_feature(feature0),
      'feature1': _int64_feature(feature1),
      'feature2': _bytes_feature(feature2),
      'feature3': _float_feature(feature3), }

  # Create a Features message using tf.train.Example.

  example_proto = Example(features=Features(feature=feature))
  return example_proto.SerializeToString()

Each single observation will be written as a `Features` message as per the above.

> Note: The `tf.Example` [message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto#L88) is just a wrapper around the `Features` message:

In [0]:
# This is an example observation from the dataset.
serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example

To decode the message, use the `tf.train.Example.FromString` method.

In [0]:
example_proto = Example.FromString(serialized_example)
example_proto
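
Once decoded, individual values can be read through the nested `features.feature` map. The sketch below builds a two-feature example so that it runs on its own; the variable names are illustrative, and `kind` is the name of the oneof group in `feature.proto`.

```python
import tensorflow as tf

# Build and serialize a small example (a subset of the features used above).
example = tf.train.Example(features=tf.train.Features(feature={
    'feature1': tf.train.Feature(int64_list=tf.train.Int64List(value=[4])),
    'feature2': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'goat'])),
}))
decoded = tf.train.Example.FromString(example.SerializeToString())

# `features.feature` behaves like a dict from feature name to Feature message.
feature_map = decoded.features.feature
int_value = feature_map['feature1'].int64_list.value[0]
str_value = feature_map['feature2'].bytes_list.value[0]

# `WhichOneof` reports which of the three list types a feature holds.
kind = feature_map['feature2'].WhichOneof('kind')
print(int_value, str_value, kind)
```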

## TFRecords format details

A TFRecord file contains a sequence of records. The file can only be read sequentially.

Each record contains a byte-string for the data payload, plus the data length, and CRC32C (32-bit CRC using the Castagnoli polynomial) hashes for integrity checking.

Each record is stored in the following format:

    uint64 length
    uint32 masked_crc32_of_length
    byte   data[length]
    uint32 masked_crc32_of_data

The records are concatenated together to produce the file. CRCs (cyclic redundancy checks) are
[described here](https://en.wikipedia.org/wiki/Cyclic_redundancy_check), and
the mask of a CRC is:

    masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul

> Note: There is no requirement to use `tf.Example` in TFRecord files. `tf.Example` is just a method of serializing dictionaries to byte-strings. Lines of text, encoded image data, or serialized tensors (using `tf.io.serialize_tensor`, and `tf.io.parse_tensor` when loading) can be stored as well. See the `tf.io` module for more options.
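
To make the record layout concrete, here is a dependency-free sketch that frames and reads records by hand. It assumes little-endian byte order for the integer fields and uses a slow bitwise CRC32C; the helper names (`crc32c`, `masked_crc`, `frame_record`, `read_records`) are hypothetical and are not TensorFlow APIs. In practice, `tf.io.TFRecordWriter` and `tf.data.TFRecordDataset` handle this for you.

```python
import struct

_MASK_DELTA = 0xA282EAD8


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli polynomial, reflected form)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """Applies the rotate-and-add mask from the formula above (mod 2**32)."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + _MASK_DELTA) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wraps a payload as: length, masked CRC of length, data, masked CRC of data."""
    length = struct.pack('<Q', len(payload))
    return (length + struct.pack('<I', masked_crc(length))
            + payload + struct.pack('<I', masked_crc(payload)))


def read_records(buffer: bytes):
    """Yields payloads sequentially, checking both CRCs of every record."""
    offset = 0
    while offset < len(buffer):
        length_bytes = buffer[offset:offset + 8]
        (length,) = struct.unpack('<Q', length_bytes)
        (length_crc,) = struct.unpack_from('<I', buffer, offset + 8)
        if length_crc != masked_crc(length_bytes):
            raise ValueError('corrupted length field')
        start = offset + 12
        payload = buffer[start:start + length]
        (data_crc,) = struct.unpack_from('<I', buffer, start + length)
        if data_crc != masked_crc(payload):
            raise ValueError('corrupted payload')
        yield payload
        offset = start + length + 4


blob = frame_record(b'first') + frame_record(b'second record')
records = list(read_records(blob))
print(records)
```

Each record carries 16 bytes of overhead (8 for the length and 4 for each CRC), and a reader has to walk the records in order because no index is stored.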

## TFRecord files using `tf.data`

### Writing a TFRecord file

The easiest way to get the data into a dataset is to use the `from_tensor_slices` method.

Applied to an array, it returns a dataset of scalars:

In [0]:
from tensorflow.data import Dataset

In [0]:
Dataset.from_tensor_slices(feature1)

Applied to a tuple of arrays, it returns a dataset of tuples:

In [0]:
features_ds = Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))
features_ds

In [0]:
# Use `take(1)` to only pull one example from the dataset.
for f0,f1,f2,f3 in features_ds.take(1):
  print(f0)
  print(f1)
  print(f2)
  print(f3)

Use the `tf.data.Dataset.map` method to apply a function to each element of a `Dataset`.

> Note: The mapped function must operate in TensorFlow graph mode: it must operate on and return `tf.Tensors`. A non-tensor function, like `serialize_example`, can be wrapped with `tf.py_function` to make it compatible.

> Using `tf.py_function` requires specifying the shape and type information that is otherwise unavailable:

In [0]:
def tf_Serialize_example(f0,f1,f2,f3):
  tf_string = tf.py_function(serialize_example,
    (f0,f1,f2,f3),  # pass these args to the above function.
    tf.string)      # the return type is `tf.string`.
  return tf.reshape(tf_string, ()) # The result is a scalar

In [0]:
tf_Serialize_example(f0,f1,f2,f3)

Apply this function to each element in the dataset:

In [0]:
serialized_features_ds = features_ds.map(tf_Serialize_example)
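serialized_features_ds

The same dataset can also be built from a Python generator that calls `serialize_example` on each element: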

In [0]:
def generator():
  for features in features_ds:
    yield serialize_example(*features)

In [0]:
serialized_features_ds = Dataset.from_generator(
    generator, output_types=tf.string, output_shapes=())

In [0]:
serialized_features_ds

And write them to a TFRecord file:

In [0]:
from tensorflow.data.experimental import TFRecordWriter

In [0]:
file_path = 'test.tfrecord'
writer = TFRecordWriter(file_path)
writer.write(serialized_features_ds)

### Reading a TFRecord file

You can also read the TFRecord file using the `TFRecordDataset` class.

More information on consuming TFRecord files using `tf.data` can be found [here](https://www.tensorflow.org/guide/datasets#consuming_tfrecord_data).

Using `TFRecordDataset`s can be useful for standardizing input data and optimizing performance.

In [0]:
from tensorflow.data import TFRecordDataset

In [0]:
filenames = [file_path]
raw_dataset = TFRecordDataset(filenames)
raw_dataset

At this point the dataset contains serialized `tf.train.Example` messages. When iterated over it returns these as scalar string tensors.

Use the `.take` method to only show the first 10 records.

> Note: Iterating over a `tf.data.Dataset` only works with eager execution enabled.

In [0]:
for record in raw_dataset.take(10):
  print(repr(record))

#### Function to parse tensors

> Note: The `feature_description` is necessary here because datasets use graph-execution, and need this description to build their shape and type signature:

In [0]:
# Create a description of the features.
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def _parse_function(example_proto):
  # Parse the input `tf.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, feature_description)

In [0]:
parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset

Use eager execution to display the observations in the dataset. There are 10,000 observations in this dataset, but we will only display the first 10. The data is displayed as a dictionary of features. Each item is a `tf.Tensor`, and the `numpy` element of this tensor displays the value of the feature:

In [0]:
for parsed_record in parsed_dataset.take(10):
  print(repr(parsed_record))

Here, the `tf.io.parse_single_example` function unpacks the `tf.Example` fields into standard tensors.
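
When records are batched before parsing, `tf.io.parse_example` can parse a whole batch in one call. The sketch below is self-contained and builds two serialized examples in memory instead of reading `test.tfrecord`; the helper `make_example` and the sample values are illustrative assumptions.

```python
import tensorflow as tf


def make_example(flag, count, name, score):
    """Hypothetical helper: serializes one observation with the four features."""
    feature = {
        'feature0': tf.train.Feature(int64_list=tf.train.Int64List(value=[flag])),
        'feature1': tf.train.Feature(int64_list=tf.train.Int64List(value=[count])),
        'feature2': tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
        'feature3': tf.train.Feature(float_list=tf.train.FloatList(value=[score])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


serialized = [make_example(False, 4, b'goat', 0.5),
              make_example(True, 1, b'dog', -1.25)]

feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

# Batch first, then parse each batch of serialized strings at once.
batch_ds = tf.data.Dataset.from_tensor_slices(serialized).batch(2)
parsed_ds = batch_ds.map(
    lambda batch: tf.io.parse_example(batch, feature_description))

parsed = next(iter(parsed_ds))
print(parsed)
```

Each entry of the parsed dictionary now has a leading batch dimension, whereas `tf.io.parse_single_example` returns scalars for these features.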

## TFRecord files in Python

### Writing a TFRecord file

Write the 10,000 observations to the file `test.tfrecord`. Each observation is converted to a `tf.Example` message, then written to file.

In [0]:
from tensorflow.io import TFRecordWriter

In [0]:
# Write the `tf.Example` observations to the file.
with TFRecordWriter(file_path) as writer:
  for i in range(n_obsers):
    example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])
    writer.write(example)

In [0]:
!du -sh {file_path}

### Reading a TFRecord file

These serialized tensors can be easily parsed using `tf.train.Example.ParseFromString`:

In [0]:
filenames = [file_path]
raw_dataset2 = TFRecordDataset(filenames)
raw_dataset2

In [0]:
for record2 in raw_dataset2.take(1):
  example = Example()
  example.ParseFromString(record2.numpy())
  print(example)

## Walkthrough: Reading and writing image data

### Fetch the images

In [0]:
cat_in_snow  = tf.keras.utils.get_file('320px-Felis_catus-cat_on_snow.jpg', 'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')
williamsburg_bridge = tf.keras.utils.get_file('194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg','https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg')

In [0]:
display.display(display.Image(filename=cat_in_snow))
display.display(display.HTML('Image cc-by: <a href="https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg">Von.grzanka</a>'))

In [0]:
display.display(display.Image(filename=williamsburg_bridge))
display.display(display.HTML('<a href="https://commons.wikimedia.org/wiki/File:New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg">From Wikimedia</a>'))

### Write the TFRecord file

In [0]:
img_labels = {cat_in_snow : 0,
              williamsburg_bridge : 1,}

In [0]:
# This is an example, just using the williamsburg_bridge image.
img_str = open(williamsburg_bridge, 'rb').read()

label = img_labels[williamsburg_bridge]

# Create a dictionary with features that may be relevant.
def img_example(img_str, label):
  image_shape = tf.image.decode_jpeg(img_str).shape

  feature = {'height': _int64_feature(image_shape[0]),
      'width': _int64_feature(image_shape[1]),
      'depth': _int64_feature(image_shape[2]),
      'label': _int64_feature(label),
      'image_raw': _bytes_feature(img_str),
  }
  return Example(features=Features(feature=feature))

for line in str(img_example(img_str, label)).split('\n')[:15]:
  print(line)
print('...')

> Note: All of the features are now stored in the `tf.Example` message.

#### Functionalize the code above and write the example messages to a file named `images.tfrecords`:

In [0]:
# Write the raw image files to `images.tfrecords`.
# First, process the two images into `tf.Example` messages.
# Then, write to a `.tfrecords` file.
record_file = 'images.tfrecords'
with TFRecordWriter(record_file) as writer:
  for filename, label in img_labels.items():
    img_str = open(filename, 'rb').read()
    tf_example = img_example(img_str, label)
    writer.write(tf_example.SerializeToString())

In [0]:
!du -sh {record_file}

### Read from TFRecord file

In [0]:
raw_image_dataset = TFRecordDataset('images.tfrecords')

# Create a dictionary describing the features.
image_feature_description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}



In [0]:
def _parse_image_function(example_proto):
  # Parse the input tf.Example proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, image_feature_description)



In [0]:
parsed_image_dataset = raw_image_dataset.map(_parse_image_function)
parsed_image_dataset

Recover the images from the TFRecord file:

In [0]:
for image_features in parsed_image_dataset:
  image_raw = image_features['image_raw'].numpy()
  display.display(display.Image(data=image_raw))
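
The loop above displays the still-encoded JPEG bytes. To get pixel data back as a tensor, the bytes can be decoded after parsing. The following is a self-contained sketch that uses a small synthetic image instead of the downloaded files; the 8x12 image and the variable names are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# Synthetic 8x12 RGB image: left half red, right half black.
pixels = np.zeros((8, 12, 3), dtype=np.uint8)
pixels[:, :6, 0] = 255
img_bytes = tf.io.encode_jpeg(pixels).numpy()


def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


# Same feature layout as `img_example` above, minus the label.
example = tf.train.Example(features=tf.train.Features(feature={
    'height': int64_feature(pixels.shape[0]),
    'width': int64_feature(pixels.shape[1]),
    'depth': int64_feature(pixels.shape[2]),
    'image_raw': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[img_bytes])),
}))

description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}
parsed = tf.io.parse_single_example(example.SerializeToString(), description)

# Decode the JPEG bytes into a uint8 tensor of shape [height, width, depth].
decoded = tf.io.decode_jpeg(parsed['image_raw'])
print(decoded.shape)
```

Because JPEG is lossy, the decoded pixel values may differ slightly from the original array, while the shape matches the stored `height`, `width`, and `depth` features.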