##### Copyright 2019 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# TFRecord and tf.SequenceExample

If your dataset consists of features where each feature is a list of values of the same type, the [`tf.train.Example`](https://www.tensorflow.org/api_docs/python/tf/train/Example) proto message is a great choice for storing your data.
However, this message format has some shortcomings when it comes to features that consist of lists of lists of identically typed data. That is because, in TensorFlow, [`tf.train.Example`](https://www.tensorflow.org/api_docs/python/tf/train/Example) protos are read in row-major format, so any data of rank 2 or above is harder to represent.

For example, to store an `M x N` matrix of bytes values, the [`tf.train.BytesList`](https://www.tensorflow.org/api_docs/python/tf/train/BytesList) must contain `M * N` values, laid out as `M` rows of `N` contiguous values each. That is, the `tf.train.BytesList` value must store the matrix as:

```
    .... row 0 .... .... row 1 .... // ...........  // ... row M-1 ....
```
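The row-major layout can be sketched in plain Python. This is an illustrative sketch only: the 2 x 3 matrix, the `flat` list, and the `element_at` helper are hypothetical names made up here, and the flat list stands in for what would be passed as the `value` of a `tf.train.BytesList`.

```python
# A hypothetical 2 x 3 matrix of bytes values.
M, N = 2, 3
matrix = [[b'a', b'b', b'c'],
          [b'd', b'e', b'f']]

# Row-major flattening: row 0 first, then row 1, and so on.
flat = [value for row in matrix for value in row]

def element_at(flat_values, n_cols, i, j):
    """Recovers element (i, j) from a row-major flat list."""
    return flat_values[i * n_cols + j]

print(flat)
print(element_at(flat, N, 1, 0))
```

Note that the number of columns `N` is not stored anywhere in the flat list, so the reader has to know it in advance to recover the rows.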

As you can see, this is not ideal. A more elegant approach would be to use [`tf.train.SequenceExample`](https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample).

While a `tf.train.Example` is fundamentally a mapping from feature names to `tf.train.Feature`, a `tf.train.SequenceExample` extends this data structure by adding a second mapping from feature names to [`tf.train.FeatureList`](https://www.tensorflow.org/api_docs/python/tf/train/FeatureList).

A `tf.train.FeatureList` is practically a container for sequential data, as it contains lists of `tf.train.Feature`. And here lies the key: if you need to use __lists of lists__, it's better to use `tf.train.SequenceExample`.
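As a minimal sketch of the "lists of lists" idea, the snippet below stores a small integer matrix as a `tf.train.FeatureList` with one `tf.train.Feature` per row. The matrix and the variable names are made up for illustration.

```python
import tensorflow as tf

# A hypothetical 2 x 3 integer matrix, i.e. a list of lists.
matrix = [[1, 2, 3], [4, 5, 6]]

# Each row becomes one Feature; the FeatureList keeps the rows separate.
rows = tf.train.FeatureList(feature=[
    tf.train.Feature(int64_list=tf.train.Int64List(value=row))
    for row in matrix
])

# The row structure can be read back without knowing the row length.
recovered = [list(feature.int64_list.value) for feature in rows.feature]
print(recovered)
```

Unlike a single flat list, the row boundaries survive the round trip, and the rows are not required to have the same length.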

This notebook demonstrates how to create, parse, and use [`tf.train.SequenceExample`](https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample) messages, and how to serialize, write, and read them to and from `.tfrecord` files.

Note: [`tf.train.Example`](https://www.tensorflow.org/api_docs/python/tf/train/Example) is very generic and can be used to represent any sort of data. [`tf.train.SequenceExample`](https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample) does not provide any extra power, but it gives your data a little more structure and is more pleasant to work with in certain scenarios.

## Setup

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

## Creating a `tf.train.SequenceExample`

To create a `tf.train.SequenceExample` you generally have to take care of:
* Context-level values:
    1. each context-level value needs to be converted to a `tf.train.Feature` containing one of the 3 compatible types (`tf.train.BytesList`, `tf.train.FloatList` or `tf.train.Int64List`)
    2. create a map from the context-level feature name string to the encoded feature value produced in the previous step
    3. the map produced in step 2 is converted to a [`Features` message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L85).
* Sequences of values:
    1. for each sequence, create a list of `tf.train.Feature` messages by converting every value in the sequence to a `tf.train.Feature`
    2. the lists produced in the previous step are used to create `tf.train.FeatureList` messages
    3. create a map from the sequence feature name string to the corresponding `tf.train.FeatureList` produced in step 2
    4. the map produced in step 3 is converted to a [`FeatureLists` message](https://www.tensorflow.org/api_docs/python/tf/train/FeatureLists)

In this notebook, you will create `tf.train.SequenceExample` messages from JSON records.

The records represent sentences as they are ingested by an NLP model and each sentence has the following features:
- 2 context-level features:
    - the length of the sentence, an integer, representing the number of tokens
    - the sentiment of the sentence, either `negative`, `neutral` or `positive`, as strings
- 2 sequence features:
    - the raw values of the words in the sentence, as strings
    - [BIO tags](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), as strings, to indicate if the corresponding word is part of a named entity (in this case an academic institution name)

In [None]:
import json

sentences_json = '[ \
    { \
        "sentiment": "positive", \
        "length": 14, \
        "words": [ \
            { "value": "The",         "ner-tag": "B-inst"}, \
            { "value": "University",  "ner-tag": "I-inst"}, \
            { "value": "of",          "ner-tag": "I-inst"}, \
            { "value": "Manchester",  "ner-tag": "I-inst"}, \
            { "value": "is",          "ner-tag": "O"}, \
            { "value": "a",           "ner-tag": "O"}, \
            { "value": "leading",     "ner-tag": "O"}, \
            { "value": "academic",    "ner-tag": "O"}, \
            { "value": "institution", "ner-tag": "O"}, \
            { "value": "in",          "ner-tag": "O"}, \
            { "value": "the",         "ner-tag": "O"}, \
            { "value": "United",      "ner-tag": "O"}, \
            { "value": "Kingdom",     "ner-tag": "O"}, \
            { "value": ".",           "ner-tag": "O"} \
        ] \
    } \
]'

sentences = json.loads(sentences_json)
sentences

To convert our values, we can define some helper functions:

In [None]:
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

The context-level features can be converted directly to `tf.train.Feature` using the helper functions and then a `tf.train.Features` message can be created:

In [None]:
def create_context_features(sentence):
    return tf.train.Features(feature={
        'length': _int64_feature(sentence['length']),
        'sentiment': _bytes_feature(sentence['sentiment'].encode('utf-8')) # get the bytes of the string using .encode()
    })

The sequences need to be collated first into lists, and then the latter can be converted to `tf.train.FeatureList`. Finally, using the `tf.train.FeatureList` protos, you can create the `tf.train.FeatureLists`.

In [None]:
def create_sequence_features(sentence):
    word_features = []
    bio_tag_features = []
    
    for word in sentence['words']:
        # create each of the features, then add them to the corresponding feature list
        word_feature = _bytes_feature(word['value'].encode('utf-8'))
        word_features.append(word_feature)
        
        bio_tag_feature = _bytes_feature(word['ner-tag'].encode('utf-8'))
        bio_tag_features.append(bio_tag_feature)
        
    words = tf.train.FeatureList(feature=word_features)
    bio_tags = tf.train.FeatureList(feature=bio_tag_features)
    
    return tf.train.FeatureLists(feature_list={
        'words': words,
        'bio-tags': bio_tags
    })

Once we have the `tf.train.Features` and `tf.train.FeatureLists` messages, we can create the `tf.train.SequenceExample` message:

In [None]:
def make_sequence_example(sentence):
    context_features = create_context_features(sentence)
    sequence_features = create_sequence_features(sentence)
    
    return tf.train.SequenceExample(
        context = context_features,
        feature_lists = sequence_features
    )

We can now serialize our sentences to prepare them for being written to `.tfrecord` files:

In [None]:
serialized_sequence_example = make_sequence_example(sentences[0]).SerializeToString()
serialized_sequence_example

To decode it:

In [None]:
example_proto = tf.train.SequenceExample.FromString(serialized_sequence_example)
example_proto
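Individual values can be read from a decoded message through its `context` and `feature_lists` fields. The sketch below is self-contained and uses a tiny hand-built message; the two-word record and the variable names are illustrative assumptions, not part of the data above.

```python
import tensorflow as tf

# A tiny hand-built message; the values are illustrative only.
example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'length': tf.train.Feature(int64_list=tf.train.Int64List(value=[2])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'words': tf.train.FeatureList(feature=[
            tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'Hello'])),
            tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'world'])),
        ]),
    }),
)

# Round-trip through the serialized form, as in the cells above.
decoded = tf.train.SequenceExample.FromString(example.SerializeToString())

# Context features live in a map from name to Feature.
length = decoded.context.feature['length'].int64_list.value[0]

# Sequence features live in a map from name to FeatureList.
words = [
    feature.bytes_list.value[0].decode('utf-8')
    for feature in decoded.feature_lists.feature_list['words'].feature
]

print(length, words)
```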

## Writing to TFRecord files using Python

This is exactly the same as writing `tf.train.Example` messages. The `tf.io` module contains pure-Python functions for reading and writing TFRecord files.

In [None]:
filename = 'test.tfrecord'

with tf.io.TFRecordWriter(filename) as writer:
    for sentence in sentences:
        seq_example = make_sequence_example(sentence).SerializeToString()
        writer.write(seq_example)

Reading it back:

In [None]:
filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

Note: iterating over a `tf.data.Dataset` only works with eager execution enabled, which is the default in TensorFlow 2.x.

In [None]:
# Eager execution is on by default in TensorFlow 2.x. In TensorFlow 1.x,
# call tf.enable_eager_execution() at program startup.

for raw_record in raw_dataset.take(1):
  print(repr(raw_record))
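The raw records are still serialized strings. One way to turn them into tensors is `tf.io.parse_single_sequence_example`, which takes one spec for the context features and another for the sequence features. The following is a minimal sketch, not part of the original notebook: the two-word record, the `_bytes` helper, and the spec variable names are hypothetical, while the feature names mirror the ones used earlier.

```python
import tensorflow as tf

def _bytes(value):
    # Hypothetical helper for this sketch: wraps one bytes value in a Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# A small serialized record with the same feature names as above.
serialized = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'length': tf.train.Feature(int64_list=tf.train.Int64List(value=[2])),
        'sentiment': _bytes(b'neutral'),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'words': tf.train.FeatureList(
            feature=[_bytes(b'Hello'), _bytes(b'world')]),
        'bio-tags': tf.train.FeatureList(
            feature=[_bytes(b'O'), _bytes(b'O')]),
    }),
).SerializeToString()

# Context features hold a single scalar value each.
context_spec = {
    'length': tf.io.FixedLenFeature([], tf.int64),
    'sentiment': tf.io.FixedLenFeature([], tf.string),
}

# Sequence features hold one scalar per step, for a variable number of steps.
sequence_spec = {
    'words': tf.io.FixedLenSequenceFeature([], tf.string),
    'bio-tags': tf.io.FixedLenSequenceFeature([], tf.string),
}

# Returns two dicts of tensors: one for the context, one for the sequences.
context, sequences = tf.io.parse_single_sequence_example(
    serialized,
    context_features=context_spec,
    sequence_features=sequence_spec)

print(context['length'].numpy())
print(sequences['words'].numpy())
```

To parse every record in a `tf.data.TFRecordDataset`, the same call can be wrapped in a function and passed to the dataset's `map` method.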