Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attrs kwarg #1299

Closed
wants to merge 12 commits into from
Closed

Attrs kwarg #1299

wants to merge 12 commits into from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Dec 15, 2022

This PR adds attrs as a keyword argument to init_array and init_group. In both cases, the value assigned to attrs is inserted into the attributes of the group / array upon initialization.

Resolves #538

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Dec 15, 2022
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Dec 16, 2022
@codecov
Copy link

codecov bot commented Dec 16, 2022

Codecov Report

Merging #1299 (7eb7c12) into main (1af77b6) will decrease coverage by 0.00%.
The diff coverage is 99.67%.

❗ Current head 7eb7c12 differs from pull request most recent head 66928b2. Consider uploading reports for the commit 66928b2 to get more accurate results

@@             Coverage Diff             @@
##              main    #1299      +/-   ##
===========================================
- Coverage   100.00%   99.99%   -0.01%     
===========================================
  Files           35       35              
  Lines        14130    14158      +28     
===========================================
+ Hits         14130    14157      +27     
- Misses           0        1       +1     
Impacted Files Coverage Δ
zarr/_storage/store.py 99.60% <88.88%> (-0.40%) ⬇️
zarr/attrs.py 100.00% <100.00%> (ø)
zarr/core.py 100.00% <100.00%> (ø)
zarr/creation.py 100.00% <100.00%> (ø)
zarr/hierarchy.py 100.00% <100.00%> (ø)
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_creation.py 100.00% <100.00%> (ø)
zarr/tests/test_hierarchy.py 100.00% <100.00%> (ø)
zarr/tests/test_storage.py 100.00% <100.00%> (ø)
zarr/tests/test_sync.py 100.00% <100.00%> (ø)
... and 1 more

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this! I think this optimization should really help speed up the creation of hierarchies with metadata. (I did some diagnostics of this on s3 here from the perspective of xarray.)

I have on question that is quite relevant to latency on object storage. Basically I am trying to understand whether this implementation truly reduces the number of I/O operations, or whether it simply provides a more convenient way to specify metadata at array creation time.

zarr/storage.py Show resolved Hide resolved
@@ -633,6 +663,9 @@ def init_group(
_init_group_metadata(store=store, overwrite=overwrite, path=path,
chunk_store=chunk_store)

# initialize attrs
_init_group_attrs(store, path, attrs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 16, 2022

Zarr.creation has lots of functions that independently hit the array / group creation routines, so with attrs as a keyword argument with a default, it would be very easy for a future creation routine to invoke init_group or init_array without supplying attrs, and thus providing an incomplete API (this has bitten me before with the many different creation routines). How do people think about making attrs a positional argument, or keyword only but with no default? This would ensure that any function calling init_array must deal with the attrs argument.

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 16, 2022

I have on question that is quite relevant to latency on object storage. Basically I am trying to understand whether this implementation truly reduces the number of I/O operations, or whether it simply provides a more convenient way to specify metadata at array creation time.

This change is motivated by the convenience. I haven't explicitly checked whether this actually saves I/O budget (but I doubt it).

@rabernat
Copy link
Contributor

How do people think about making attrs a positional argument, or keyword only but with no default? This would ensure that any function calling init_array must deal with the attrs argument.

I think this would be a breaking API change, so probably best to avoid that.

it would be very easy for a future creation routine to invoke init_group or init_array without supplying attrs, and thus providing an incomplete API (this has bitten me before with the many different creation routines)

Can you elaborate more on the downside here? I'm not sure I fully understand the problem.

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 17, 2022

it would be very easy for a future creation routine to invoke init_group or init_array without supplying attrs, and thus providing an incomplete API (this has bitten me before with the many different creation routines)

Can you elaborate more on the downside here? I'm not sure I fully understand the problem.

There are a lot of places in the code that end up invoking init_group: e.g., Group.create_group, Group.create_groups Group.require_group, Group.require_groups, group, open_group, and in principle each added optional keyword argument to init_group is a new opportunity for the group creation routines to diverge slightly, in case the developer adding the new optional keyword argument does not remember to expose that kwarg to all the different group creation routines (i'm pretty sure I added attrs to all of them for this PR). I think for array creation all the high-level creation routines take **kwargs, which can defend against this kind of thing at the cost of an opaque function signature.

Alternatively, we make a breaking change to init_group and add attrs as a positional argument, and then any higher-level routine that invokes it will break, but this will ensure consistency.

For this PR I'm fine with attrs as a kwarg with a default, but for posterity I want to emphasize the advantages of making it positional.

@rabernat
Copy link
Contributor

Alternatively, we make a breaking change to init_group and add attrs as a positional argument, and then any higher-level routine that invokes it will break, but this will ensure consistency.

I favor this options. It allows us solve the problem without changing the public API in a way that could break user code. I don't consider init_group part of the public API. At https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.init_group it says

Note that this is a low-level function and there should be no need to call this directly from user code.

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 21, 2022

Tests would require a lot of redundant changes to test_core.py, so I opened #1305 to set the conditiosn for a much more graceful introduction of the attrs kwarg to the test suite.

@d-v-b d-v-b mentioned this pull request Jul 11, 2023
@jhamman
Copy link
Member

jhamman commented Dec 7, 2023

@d-v-b - should we close this in favor of v3?

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 7, 2023

closing in favor of v3

@d-v-b d-v-b closed this Dec 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

attrs as kwarg
3 participants