Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DirectoryStore .keys() optimisation. #562

Open
Carreau opened this issue May 8, 2020 · 6 comments
Open

DirectoryStore .keys() optimisation. #562

Carreau opened this issue May 8, 2020 · 6 comments

Comments

@Carreau
Copy link
Contributor

Carreau commented May 8, 2020

It looks like the keys() method of directory store reimplement a lot of the logic of os.walk() that could be used to increase efficiency.

in particular using os.walk() will directly return directory and files separately and will avoid the two extra expensive call to os.path.isfile(path), and os.path.isdir(path).

forming the keys will be a tiny bit less straightforward than current method as os.walk() returns the full relative path you gave it whether or not you terminate it by a slash, so we might have to be careful with this.

(I also note that this methods does list os specific files, like .DStore.. not sure this is on purpose.

@jakirkham
Copy link
Member

Good point! The other thing we might consider is using os.scandir, which we already use in other places and also checks metadata of the path as part of walking.

@Carreau
Copy link
Contributor Author

Carreau commented May 9, 2020

Notes, os.walk() may also return path with \ instead of /,

And yes os.scandir is used inside os.walk()

@Carreau
Copy link
Contributor Author

Carreau commented May 9, 2020

I do have a prototype which appear to be a bit faster on small zarr on my laptop.

In [31]: %timeit list(sorted(z.store.keys_2()))
621 µs ± 8.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [32]: %timeit list(sorted(z.store.keys()))
2.05 ms ± 47.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Some tests do seem to rely on keys ordering though...

@jakirkham
Copy link
Member

How much does it cost us to sort ourselves?

@Carreau
Copy link
Contributor Author

Carreau commented May 9, 2020

How much does it cost us to sort ourselves?

I don't know and we should try on a particularly big dataset.

I've restarted and it might have been an actual bug in my implementation as test passed this time...

I've send #563 with a rough implementation to see if it fails on windows.

We may also want to not be lazy and compute also compute the length while we loop, as currently __len__ consume the iterator just get the number of keys.

@jakirkham
Copy link
Member

It could also be a bug in Zarr. Really ordering shouldn't matter, but I may be proven wrong ;)

Sounds good.

Yeah that seems like something that could be more efficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants