Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarr DirectoryStore: getsize is wrong #253

Open
Kriechi opened this issue Apr 4, 2018 · 12 comments · May be fixed by #1815
Open

Zarr DirectoryStore: getsize is wrong #253

Kriechi opened this issue Apr 4, 2018 · 12 comments · May be fixed by #1815
Labels
bug Potential issues with the zarr-python library help wanted Issue could use help from someone with familiarity on the topic
Milestone

Comments

@Kriechi
Copy link

Kriechi commented Apr 4, 2018

https://github.com/zarr-developers/zarr/blob/3c8e9291e96e090e20caf6950da69a3cc180652f/zarr/storage.py#L864-L871

This code does not account for subfolders created by this example:

store = zarr.TempStore(dir='/tmpfs')
for name, values in data.items():
    zarr.array(values, store=store, path=name, compressor=compressor, filters=filter)
total_size = store.getsize()

total_size is 26, (24+2 for .zattrs and .zgroup), which is wrong. Subfolders are not accounted for.

My TempStore has the following structure:

.zattrs
.group
name1/
  1
  2
  3
name2/
  1
  2
  3

The file sizes of name1 and name2 are unaccounted for.

Tested with Zarr 2.2.0.

@alimanfoo
Copy link
Member

alimanfoo commented Apr 4, 2018 via email

@alimanfoo
Copy link
Member

Just to say the getsize() method was originally added to provide information about stored size of a single array, and so recursing into subdirectories was not needed. However I think it would be better in general if it did report recursive size. I don't have time to work on this right now but a PR would be welcome.

Also if this was changed to recursive size in DirectoryStore then might want to also review implementation in NestedDirectoryStore and ZipStore classes.

@jakirkham
Copy link
Member

So I guess NestedDirectoryStore would technically be wrong too or does that already recurse when determining array size.

@alimanfoo
Copy link
Member

Yes NestedDirectoryStore is wrong, only counts immediate children of a directory (actually doesn't override getsize() so uses implementation from parent class DirectoryStore).

@alimanfoo alimanfoo added this to the v2.3 milestone Oct 19, 2018
@alimanfoo alimanfoo added the bug Potential issues with the zarr-python library label Nov 20, 2018
@jakirkham
Copy link
Member

It would be pretty simple to change this to os.walk.

If we want to get fancier, we use scandir, which includes the size while walking the tree. There's a backport package for Python 2 as well.

ref: https://stackoverflow.com/a/1392549

@jakirkham
Copy link
Member

PR ( #362 ) should fix this. Just used os.walk for simplicity given we need to go through the directory recursively and scandir behaves a bit more like listdir, which would complicate this a bit.

@jakirkham
Copy link
Member

jakirkham commented Dec 10, 2018

When looking into this further, it seems that getsize's behavior is currently expected to ignore nested children. This is tested here (note this should be 15 instead of 6 if recursion is used) as well as on the following lines. This test is run for all stores (e.g. DictStore also has this behavior). While I agree it probably should change to be recursive, that's a behavior change. So it would be good to make sure we are not missing something before changing this.

@jakirkham
Copy link
Member

@alimanfoo, do you have any thoughts on the comment above? Basically the current behavior seems intentional (as it is tested) so am trying to understand how we should proceed here.

@alimanfoo
Copy link
Member

Sorry for slow follow up. FWIW I think it would be reasonable to change the behaviour and make it recursive. Should not affect results for reporting stored size of an array in a DirectoryStore, as there shouldn't be any sub-directories there anyway. (E.g., I often call .info on an array to see how big it is on disk.) I'm happy for the tests to be modified.

@jakirkham
Copy link
Member

Sorry I missed your comment, @alimanfoo. Thanks for the clarification. Happy to break the current tests as well. Just wanted to make sure there wasn't something I was missing. Will try to update PR ( #362 ) when I have a chance.

@jakirkham jakirkham modified the milestones: v2.3, v2.4 Mar 23, 2019
@hmaarrfk
Copy link

Is the only way to get the current estimated size by parsing the returned value of info_items in our own dictionary then indexing No. bytes stored?

@Carreau Carreau modified the milestones: v2.4, v2.5 Sep 9, 2020
@alimanfoo alimanfoo modified the milestones: v2.5, v2.6 Sep 24, 2020
@Carreau Carreau mentioned this issue Nov 30, 2020
@Carreau Carreau modified the milestones: v2.6, v2.7 Dec 1, 2020
@joshmoore joshmoore added the help wanted Issue could use help from someone with familiarity on the topic label Sep 22, 2021
@aliaksei-chareshneu
Copy link

Dear all,

Are there any updates on this functionality (getting size of subfolders and arrays recursively)?

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library help wanted Issue could use help from someone with familiarity on the topic
Projects
None yet
7 participants