
support partially-known and unknown shape specification in decode_json #918

@dholland42

Hi,

I have been using tfio.experimental.serialization.decode_json along with tf.data.TextLineDataset to train on newline-delimited JSON files. It feels to me like a slightly less efficient, but human-readable, version of tf.data.TFRecordDataset. This has been a huge boon to my workflow, so first off, thank you for this contribution! I have, however, run into a minor issue recently: my data has variable-length lists as values. For example:

{"foo": [1, 2, 3, 4]}
{"foo": [1, 2, 3, 4, 5]}

In order to parse these records, I would expect to be able to do something like the following:

import json

import tensorflow as tf
import tensorflow_io as tfio

r = json.dumps({"foo": [1, 2, 3, 4, 5]})


def parse_json(json_text):
    specs = {
        "foo": tf.TensorSpec(tf.TensorShape([None]), tf.int32)
    }
    parsed = tfio.experimental.serialization.decode_json(json_text, specs)
    return parsed["foo"]

parse_json(r)

However, I receive the following error:

2020-04-24 15:56:41.529651: W tensorflow/core/framework/op_kernel.cc:1632] OP_REQUIRES failed at serialization_kernels.cc:36 : Invalid argument: Shape [?] is not fully defined
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in parse_json
  File "/opt/miniconda/miniconda3/envs/insurance/lib/python3.6/site-packages/tensorflow_io/core/python/experimental/serialization_ops.py", line 74, in decode_json
    values = core_ops.io_decode_json(data, names, shapes, dtypes, name=name)
  File "<string>", line 6397, in io_decode_json
  File "<string>", line 6460, in io_decode_json_eager_fallback
  File "/opt/miniconda/miniconda3/envs/insurance/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape [?] is not fully defined [Op:IO>DecodeJSON]

If I specify the full length of the list element, parsing will work as expected:

import json

import tensorflow as tf
import tensorflow_io as tfio

r = json.dumps({"foo": [1, 2, 3, 4, 5]})


def parse_json(json_text):
    specs = {
        "foo": tf.TensorSpec(tf.TensorShape([5]), tf.int32)
    }
    parsed = tfio.experimental.serialization.decode_json(json_text, specs)
    return parsed["foo"]

parse_json(r)

This results in:

<tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 2, 3, 4, 5], dtype=int32)>

I can currently hack around this by preprocessing my data ahead of time, padding everything to the same length and then masking the padding elements, but having decode_json handle undefined shapes would save me time and just generally be much nicer :)
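
For reference, a minimal sketch of the kind of pad-and-mask preprocessing described above, using only the standard library. The helper name pad_records, the "foo_mask" key, and the pad value 0 are hypothetical choices made for illustration, not part of tensorflow-io:

```python
import json


def pad_records(lines, key, pad_value=0):
    """Pad the list stored under `key` in every JSON line to a common length.

    Hypothetical helper. Returns (padded_lines, max_len). Each output record
    also gets a `<key>_mask` list: 1 for real elements, 0 for padding.
    """
    records = [json.loads(line) for line in lines]
    max_len = max(len(rec[key]) for rec in records)

    padded = []
    for rec in records:
        values = rec[key]
        n_pad = max_len - len(values)
        out = dict(rec)  # shallow copy so the parsed input is left untouched
        out[key] = values + [pad_value] * n_pad
        out[key + "_mask"] = [1] * len(values) + [0] * n_pad
        padded.append(json.dumps(out))
    return padded, max_len


# The two example records from the top of this issue.
lines = ['{"foo": [1, 2, 3, 4]}', '{"foo": [1, 2, 3, 4, 5]}']
padded_lines, max_len = pad_records(lines, "foo")
for line in padded_lines:
    print(line)
```

After this step every "foo" list has length max_len, so the fully defined spec from the working example (tf.TensorShape([max_len])) applies, presumably with a matching spec for the mask. The cost is an extra pass over the data and larger files, which is what native support for unknown shapes would avoid.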

Environment information (in case it matters/helps):

Python 3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54)
[GCC 7.3.0] on linux
---
Name: tensorflow
Version: 2.1.0
---
Name: tensorflow-io
Version: 0.12.0

Thanks again!

Labels: feature (New feature request)
