Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for shapes in dtype definitions #296

Merged
merged 21 commits into from Oct 19, 2018
Merged

Add support for shapes in dtype definitions #296

merged 21 commits into from Oct 19, 2018

Conversation

ghost
Copy link

@ghost ghost commented Aug 31, 2018

The objective of this PR is to resolve #111. Metadata encoding and
decoding is modified to accept shaped fields of structured dtypes and
shaped arrays of primitives (e.g. (10,10)f8).

Tests not utilizing optional requirements pass on Python 2.7 and 3.6. I
am working on setting up my environment with the optional requirements,
and hope to have those tests running very soon. If needed, I can expand
the test suite to more rigorously test the new encoding functionality.

docs/tutorial.rst will also be updated to include information about
shaped fields in dtypes.

Thank you for your time and consideration.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@ghost ghost changed the title [In-Progress] Add support for shapes in dtype definitions [WIP] Add support for shapes in dtype definitions Aug 31, 2018
zarr/meta.py Outdated Show resolved Hide resolved
@alimanfoo
Copy link
Member

Thank you @onalant.

There's some complexities here around structured versus unstructured arrays which I'd like to understand a bit better.

In numpy, if you make an unstructured array with a dtype with a shape, the shape from the dtype gets pushed into the shape of the array. E.g.:

In [12]: a = np.zeros(100, dtype='(10, 10)<f4')

In [13]: a.shape
Out[13]: (100, 10, 10)

In [14]: a.dtype.descr
Out[14]: [('', '<f4')]

In [15]: a.dtype
Out[15]: dtype('float32')

But if you are creating a structured array, then the dtype is preserved, e.g.:

In [16]: a = np.zeros(100, dtype=[('foo', '(10, 10)<f4'), ('bar', 'i8')])

In [17]: a.dtype
Out[17]: dtype([('foo', '<f4', (10, 10)), ('bar', '<i8')])

In [18]: a.dtype.descr
Out[18]: [('foo', '<f4', (10, 10)), ('bar', '<i8')]

In [19]: a.shape
Out[19]: (100,)

In zarr I think we want to follow this behaviour, or at least support the structured array case (which I think #111 is targeting).

Currently I think your PR would fail on the structured array case, and would not emulate numpy behaviour in the unstructured case? (Apologies if I am missing or misunderstanding anything here.)

cc @jakirkham

@ghost
Copy link
Author

ghost commented Aug 31, 2018

Re: unstructured array consistency with numpy - The inconsistency was an oversight on my part. My apologies about this. I will push a fix as soon as possible.

Re: structured arrays - Here is what I am seeing (with some cursory testing). Initializing an array from a pre-allocated numpy ndarray of a structured dtype functions as I would expect:

a = np.zeros(100, dtype=[('foo', '(10, 10)<f4'), ('bar', 'i8')])
z = zarr.array(a)
z.dtype
> dtype([('foo', '<f4', (10, 10)), ('bar', '<i8')])

Using zarr.create with the default fill_value (0) causes the following error:

>>> import zarr
>>> z = zarr.create(100, dtype=[('foo', '(10, 10)<f4'), ('bar', 'i8')])
>>> z["foo"]
Traceback (most recent call last):
  File "/home/onalant/source/zarr/zarr/core.py", line 1566, in _chunk_getitem
    cdata = self.chunk_store[ckey]
KeyError: '0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/onalant/source/zarr/zarr/core.py", line 559, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "/home/onalant/source/zarr/zarr/core.py", line 685, in get_basic_selection
    fields=fields)
  File "/home/onalant/source/zarr/zarr/core.py", line 727, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/home/onalant/source/zarr/zarr/core.py", line 1015, in _get_selection
    drop_axes=indexer.drop_axes, fields=fields)
  File "/home/onalant/source/zarr/zarr/core.py", line 1571, in _chunk_getitem
    out[out_selection] = self._fill_value
ValueError: Can't cast from structure to non-structure, except if the structure only has a single field.

This error appears in master for structured dtypes in general (e.g. [('foo', '<f4'), ('bar', 'i8')]).

However, when fill_value=None:

>>> import zarr
>>> z = zarr.create(100, dtype=[('foo', '(10, 10)<f4'), ('bar', 'i8')], fill_value=None)
>>> z["foo"]
array([[[-2.20392570e+33,  4.59163468e-41, -2.20392570e+33, ...,
          0.00000000e+00,  1.35631564e-19,  1.95059528e-19],
        [ 1.94316151e-19,  4.74289253e+30,  7.02922598e+28, ...,
          7.77811654e+31,  1.12896907e+27,  6.67408064e+22],
        [ 1.24087546e+28,  7.80232061e+34,  4.80081852e+30, ...,
          1.85216880e+28,  1.80569552e+28,  1.35559845e-19],
        ...,
...

The dtype is preserved after writing to and fetching from disk.

Please let me know if I have misunderstood something, or if I can provide any additional information.

@alimanfoo
Copy link
Member

Thanks @onalant, I'll follow up asap.

ghost pushed a commit to uw-ipd/privileged_residues that referenced this pull request Aug 31, 2018
ghost pushed a commit to uw-ipd/privileged_residues that referenced this pull request Aug 31, 2018
ghost pushed a commit to uw-ipd/privileged_residues that referenced this pull request Aug 31, 2018
@ghost
Copy link
Author

ghost commented Aug 31, 2018

At the moment, core.Array.append causes a core dump. I suspect that
some of the other functions are broken as well. Working on a fix.

Not running into this issue after reinstalling the package.

@ghost
Copy link
Author

ghost commented Sep 4, 2018

@alimanfoo Following up on the PR; is there anything I can do to help?

I am working on getting the documentation and additional unit tests
written. I envision completion in the next day.

@alimanfoo
Copy link
Member

Thanks @onalant. Tests for the structured array cases would be good. @jakirkham what structured dtype features do you think we need to support? Field with shape. Also should we support nested fields (i.e.., fields within fields)?

@ghost
Copy link
Author

ghost commented Sep 5, 2018

@alimanfoo In 24b56cf you will find metadata encode-decode tests for
unstructured and structured arrays (with field subshapes), as well
as a smoke test for whether or not array data are preserved after
writing to an external store and being reloaded
(zarr/tests/test_crude.py).

I have one question, however: which test modules require expansion
to consider subfield shapes? Should the expansions extend down to
the core tests, or could the subfield-shape tests be in a separate
file testing consistency with numpy ndarray semantics?

As an aside, I believe the Travis failure is a consequence of adding
a new test file. A similar phenomenon occured in 55e725d, but looking
at the logs shows that all of the tests passed, and an error was raised
elsewhere.

Please let me know if any of this is unclear. Thank you for your time.

@ghost
Copy link
Author

ghost commented Sep 10, 2018

@alimanfoo Just wanted to follow up. Is there anything else you need
for functionality verification?

@alimanfoo
Copy link
Member

alimanfoo commented Sep 10, 2018 via email

@alimanfoo
Copy link
Member

Hi @onalant,

Just took a look through. The implementation in zarr/meta.py looks good. The new tests in zarr/tests/test_meta.py also look good.

The tests in zarr/tests/test_crude.py cover some important cases, but it would be better if these could be incorporated into existing dtype tests. In zarr/tests/test_core.py there is a test_dtypes function, it would be great if you could extend this to cover the dtypes currently tested in test_crude.py.

Many thanks!

@ghost
Copy link
Author

ghost commented Sep 13, 2018

@alimanfoo Understood. I will move the tests in test_crude.py
over to test_core.py.

zarr/tests/test_core.py Outdated Show resolved Hide resolved
@@ -1631,6 +1627,10 @@ def test_astype(self):
expected = data.astype(astype)
assert_array_equal(expected, z2)

def test_unstructured_array(self):
# skip this one, cannot do delta on unstructured array
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technicality: numcodecs cannot do deltas on unstructured array dtypes
in string form e.g. "(4, 4)f4". Since the expanded dtype is already
tested in test_dtypes, skip instead.

@ghost
Copy link
Author

ghost commented Sep 18, 2018

@alimanfoo Sorry to disturb you, but I want to confirm whether the
new tests are satisfactory.

Would you like me to append this feature to the documentation?

@jakirkham
Copy link
Member

Sorry to be largely absent here.

Skimmed the tests (though haven't looked at any code in detail). The test cases covered seem about right. Basically we want to make sure these cases are covered and likely mixed together in different ways.

  1. Single record
  2. Multiple records
  3. Different types
  4. Some records with non-trivial shape 1-D/N-D

Will try to take a closer look, but can't promise anything before Friday unfortunately.

ghost pushed a commit to uw-ipd/privileged_residues that referenced this pull request Sep 18, 2018
@ghost
Copy link
Author

ghost commented Sep 24, 2018

Hello, I would just like to follow-up on this issue. I am currently
slightly bogged down with other work, so I do not think I can write
documentation of the quality I see in the repository. However,
I would be willing to assist in expanding the test suite, if the
current additions are insufficient.

Is there anything else I can do to expedite this PR?

@alimanfoo
Copy link
Member

alimanfoo commented Sep 24, 2018 via email

@ghost
Copy link
Author

ghost commented Oct 2, 2018

Following up on the PR.

My apologies if I am coming across as terse.

@alimanfoo
Copy link
Member

alimanfoo commented Oct 2, 2018 via email

Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @onalant, I very much appreciate the work put into this, especially the thorough testing. It all looks good to me modulo some very minor stylistic comments. CI is passing and coverage is 100% so I think this is basically good to go.

@@ -826,6 +826,27 @@ def test_nchunks_initialized(self):
z[:] = 42
assert 10 == z.nchunks_initialized

def test_unstructured_array(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename this "test_array_dtype_shape()" or something like that, just to be a bit more specific about what this test is about.

assert_array_equal(a['bar'], z['bar'])
assert_array_equal(a['baz'], z['baz'])
else:
# BUG(onalant): https://www.github.com/numpy/numpy/issues/11946
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but I don't think you need to sign your comments, and this could be slightly more descriptive. E.g.:

# workaround for numpy bug https://www.github.com/numpy/numpy/issues/11946

assert_array_equal(a['bar'], z['bar'])
assert_array_equal(a['baz'], z['baz'])
else:
# BUG(onalant): https://www.github.com/numpy/numpy/issues/11946
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

@@ -116,6 +116,92 @@ def test_encode_decode_array_2():
assert [df.get_config()] == meta_dec['filters']


def test_encode_decode_array_unstructured():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename to "test_encode_decode_array_dtype_shape()" to be slightly more specific.

# test decoding
meta_dec = decode_array_metadata(meta_enc)
assert ZARR_FORMAT == meta_dec['zarr_format']
# NOTE(onalant): https://github.com/zarr-developers/zarr/pull/296#issuecomment-417608487
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to include this link, the sentence in the next comment line explains the rationale.

# To maintain consistency with numpy unstructured arrays, unpack dimensions into shape.
assert meta['shape'] + meta['dtype'].shape == meta_dec['shape']
assert meta['chunks'] == meta_dec['chunks']
# NOTE(onalant): https://github.com/zarr-developers/zarr/pull/296#issuecomment-417608487
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to include this link, the sentence in the next comment line explains the rationale.

# test decoding
meta_dec = decode_array_metadata(meta_enc)
assert ZARR_FORMAT == meta_dec['zarr_format']
# NOTE(onalant): https://github.com/zarr-developers/zarr/pull/296#issuecomment-417608487
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to include this link, the sentence in the next comment line explains the rationale.

# To maintain consistency with numpy unstructured arrays, unpack dimensions into shape.
assert meta['shape'] + meta['dtype'].shape == meta_dec['shape']
assert meta['chunks'] == meta_dec['chunks']
# NOTE(onalant): https://github.com/zarr-developers/zarr/pull/296#issuecomment-417608487
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to include this link, the sentence in the next comment line explains the rationale.

@alimanfoo alimanfoo changed the title [WIP] Add support for shapes in dtype definitions Add support for shapes in dtype definitions Oct 11, 2018
@alimanfoo alimanfoo added this to the v2.3 milestone Oct 11, 2018
@alimanfoo
Copy link
Member

@jakirkham one thought that occurs is that we should probably add some clarification to the storage spec to state how we expect dtypes for structured arrays with subshape and/or nested fields are encoded in array metadata, given that these are now supported. I think this is another case where we are clarifying an area of ambiguity rather than changing anything, and so can edit the spec without requiring a spec version change.

@onalant if you are happy to finish up the coding work, I'd be happy to deal with the spec changes and release notes.

@jakirkham
Copy link
Member

Yeah that sounds reasonable to me as well.

It’s probably also worth noting that chunking does not happen within the structured type. IIRC HDF5 handles this the same way.

@ghost
Copy link
Author

ghost commented Oct 12, 2018

@alimanfoo Thank you for the style suggestions. I have made the
requested changes.

Remark:
I am not sure why coverage decreased.

Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super, many thanks @onalant. I'll push some spec edits to your branch, to keep this all in one place, before merging.

@alimanfoo
Copy link
Member

Rebased on upstream master, which should fix coverage issues. Working on docs now...

@alimanfoo
Copy link
Member

I've pushed some work on the spec to describe how structured arrays with subshape and/or nested fields are encoded.

In the process of playing with this I also found and fixed a small bug that came up when trying to read data from a structured array which had not been initialised with any data yet (and refactored the structured array tests in the process).

Just release notes to do now, otherwise should be good to go.

@alimanfoo
Copy link
Member

Merging, thanks @onalant for seeing this one through, much appreciated.

@alimanfoo alimanfoo merged commit a4d5656 into zarr-developers:master Oct 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Supporting structured array records with non-trivial shape
2 participants