
Proposal: New API to replace existing arrays in npz files #68

Open
zhihaoy opened this issue Feb 27, 2020 · 3 comments

@zhihaoy
Contributor

zhihaoy commented Feb 27, 2020

Currently, dump_npz either destroys all existing arrays in an npz file or appends arrays whose names already exist in the npz file as duplicated entries. These are rather strange semantics, especially since loading individual arrays with load_npz doesn't follow them.

A more reasonable semantic would be to replace existing arrays. This requires a context object with a role similar to HighFive::File.

NumPy doesn't support this either; it only overwrites all arrays at once.

I checked out libzippp and libzip++; neither works with streams. This is primarily because dump_npy_stream and libzip are both "push" interfaces (thus, both need to run the main loop), so they cannot work together without writing a special stream class that serves as a pipe from which libzip can pull data.

I think there is a rather simple way to support replacement semantics given a context object. When an npz_file is opened for update, append as usual while keeping the central directory as a data structure in memory. When closing, write a temporary file containing only the up-to-date arrays, finish it with the central directory record, and do an atomic move to replace the old file. (It's possible to shrink a file in place, but I suspect that would invalidate too many I/O buffers.)
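To make the flow concrete, here is a minimal sketch of that idea. The names (npz_file, dump, close) are hypothetical and the actual zip/npy serialization is replaced by placeholders; it only models the proposed replace-on-close semantics, it is not xtensor-io code:

```cpp
#include <cstdio>    // std::rename
#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

class npz_file
{
public:
    explicit npz_file(std::string path)
        : m_path(std::move(path))
    {
        // A real implementation would read the existing archive's central
        // directory here instead of starting from an empty map.
    }

    // Replace-or-insert semantics: the last dump for a given name wins,
    // because the in-memory "central directory" is keyed by array name.
    void dump(const std::string& name, const std::vector<char>& npy_bytes)
    {
        m_entries[name] = npy_bytes;
    }

    // Write only the up-to-date entries to a temporary file, then atomically
    // move it over the original archive.
    void close()
    {
        if (m_closed)
        {
            return;
        }
        const std::string tmp = m_path + ".tmp";
        {
            std::ofstream out(tmp, std::ios::binary);
            for (const auto& entry : m_entries)
            {
                // Placeholder: a real implementation writes a zip local file
                // header followed by the npy payload for entry.first.
                out.write(entry.second.data(),
                          static_cast<std::streamsize>(entry.second.size()));
            }
            // Placeholder: write the zip central directory record here.
        }
        std::rename(tmp.c_str(), m_path.c_str());
        m_closed = true;
    }

    ~npz_file()
    {
        close();
    }

private:
    std::string m_path;
    std::map<std::string, std::vector<char>> m_entries;  // in-memory central directory
    bool m_closed = false;
};
```

Since the rename only happens after the temporary file is fully written, a failure partway through the rewrite leaves the original archive untouched.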

@JohanMabille
Member

Thanks for your proposal; I think having this feature would be really nice! Regarding the API, the boolean parameter append_to_existing_file should be replaced with a mode parameter (ideally a simple enum containing overwrite and replace), so that it is clear the change is backward incompatible and that upgrading to the latest version will break existing code (instead of failing only at runtime).
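For illustration, the suggested mode parameter might look roughly like this; the enum name, its values, and the exact dump_npz signature below are assumptions, not the shipped xtensor-io API:

```cpp
#include <string>
#include <xtensor/xexpression.hpp>

namespace xt
{
    // Hypothetical mode enum replacing the append_to_existing_file boolean.
    enum class npz_mode
    {
        overwrite,  // recreate the archive from scratch (current default behavior)
        replace     // replace arrays with matching names, keep the others
    };

    // Possible future signature; the other parameters are kept as-is.
    template <class E>
    void dump_npz(const std::string& filename,
                  const std::string& array_name,
                  const xexpression<E>& e,
                  bool compression = false,
                  npz_mode mode = npz_mode::overwrite);
}
```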

@zhihaoy
Contributor Author

zhihaoy commented Feb 27, 2020

Regarding the API, the boolean parameter append_to_existing_file should be replaced with a mode parameter (ideally a simple enum containing overwrite and replace), so that it is clear the change is backward incompatible and that upgrading to the latest version will break existing code (instead of failing only at runtime).

I may have failed to explain this part well: I was not proposing to change the existing dump_npz. I suggest adding a context class named xt::npz_file, then adding a dump member function or a free-function overload of xt::dump that takes xt::npz_file&. The reason is that shrinking the file after replacing each array would be inefficient, while doing it once at context closing time is fine.

After doing that, we can consider whether to change dump_npz as you suggested. Although it wouldn't be as efficient as the xt::dump API when updating many arrays, a user may want to update only one array.
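To make the intent concrete, usage of the proposed context object might read something like the following; xt::npz_file, its dump member, the xt::dump overload, and the header are all proposed names in this sketch, not existing functions:

```cpp
#include <xtensor/xarray.hpp>
// #include <xtensor-io/xnpz_file.hpp>   // hypothetical header for the proposed context class

void update_two_arrays()
{
    xt::xarray<double> weights = {{1.0, 2.0}, {3.0, 4.0}};
    xt::xarray<int> labels = {1, 2, 3};

    // Open an existing archive for update; arrays not written below are kept.
    xt::npz_file archive("data.npz");

    archive.dump("weights", weights);     // proposed member: replaces "weights" if present
    xt::dump(archive, "labels", labels);  // or the proposed free-function overload

    // On close (or destruction), the archive is rewritten once with only the
    // up-to-date entries and atomically moved over the old file.
    archive.close();
}
```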

(I'm going to start implementing the necessary bits.)

PS: Switching to a mode enum has an additional benefit: we can deprecate append_to_existing_file with a migration suggestion: https://godbolt.org/z/rTPgbo
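One standard C++ mechanism for such a deprecation is the [[deprecated]] attribute on the old overload. A sketch along those lines, reusing the assumed npz_mode enum from above (none of these signatures are the actual library API):

```cpp
#include <string>
#include <xtensor/xexpression.hpp>

namespace xt
{
    enum class npz_mode { overwrite, replace };  // assumed enum from the sketch above

    // New mode-based overload (declaration only in this sketch).
    template <class E>
    void dump_npz(const std::string& filename, const std::string& array_name,
                  const xexpression<E>& e, bool compression = false,
                  npz_mode mode = npz_mode::overwrite);

    // Old boolean overload kept for a transition period; callers get a
    // compile-time warning pointing at the new parameter instead of a
    // silent behavior change.
    template <class E>
    [[deprecated("pass an xt::npz_mode instead of append_to_existing_file")]]
    void dump_npz(const std::string& filename, const std::string& array_name,
                  const xexpression<E>& e, bool compression,
                  bool append_to_existing_file)
    {
        // How the old flag maps onto the new modes is a migration design choice.
        dump_npz(filename, array_name, e, compression,
                 append_to_existing_file ? npz_mode::replace : npz_mode::overwrite);
    }
}
```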

@JohanMabille
Member

Thanks for the clarification. Let's proceed in two steps then: first adding an overload to dump_npz, and then seeing how we can improve the existing API.
