Potential synchronization optimization #386

Open
alimanfoo opened this issue Jan 9, 2019 · 0 comments

I get the impression that most people who are doing concurrent writes into a zarr array are managing writes so they align with chunk boundaries, and therefore don't need to use any synchronization. However, this may not be possible or easy in all circumstances, and so zarr has a synchronization API and two synchronizer implementations, one based on thread locking and one based on file locking. The synchronization is done on a per-chunk basis, i.e., locks are used to synchronize writes to any given chunk, but different chunks are protected by different locks and so may be written concurrently.
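To make the per-chunk locking idea concrete, here is a minimal standard-library sketch. It is not zarr's own synchronizer code: `PerChunkLocks`, `store` and `increment` are hypothetical names invented for illustration, and the chunk keys are just example strings.

```python
import threading
from collections import defaultdict


class PerChunkLocks:
    """Hypothetical registry handing out one lock per chunk key."""

    def __init__(self):
        self._mutex = threading.Lock()  # guards the lock table itself
        self._locks = defaultdict(threading.Lock)

    def __getitem__(self, chunk_key):
        # Return the lock for this chunk, creating it on first use.
        with self._mutex:
            return self._locks[chunk_key]


locks = PerChunkLocks()
store = {"0.0": 0, "0.1": 0}  # stand-in for chunk contents


def increment(chunk_key, n):
    # Read-modify-write on one chunk, serialized by that chunk's lock only.
    for _ in range(n):
        with locks[chunk_key]:
            store[chunk_key] += 1


threads = [
    threading.Thread(target=increment, args=(key, 1000))
    for key in ("0.0", "0.0", "0.1")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The two writers on chunk "0.0" take turns, while the writer on chunk "0.1" never waits for them because it uses a different lock.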

If a zarr array is instantiated with a synchronizer, currently a lock is obtained for any write to a chunk. However, some writes will completely overwrite (replace) a chunk, whereas some writes may only partially update the content of a chunk. In the main use cases that zarr aims to satisfy, data are being written concurrently to an array, and each concurrent writer is writing to a separate region of the array. For any chunk that falls completely within a region being written, and which will therefore be completely replaced, there will never be any contention between workers. The only time there could be contention is for chunks that only partially overlap a region being written, and which will therefore be only partially updated.
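The distinction between fully replaced and partially updated chunks can be illustrated in one dimension. This is only a sketch: `classify_chunks` is a hypothetical helper, not part of zarr, and it assumes a non-empty write region `[start, stop)` and an array length that is a multiple of the chunk length.

```python
def classify_chunks(start, stop, chunk_len):
    """Split the chunks touched by a write to [start, stop) into
    (fully_replaced, partially_updated) lists of chunk indices.

    Assumes stop > start and that every chunk has exactly chunk_len items.
    """
    full, partial = [], []
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    for i in range(first, last + 1):
        c_start = i * chunk_len
        c_stop = c_start + chunk_len
        if start <= c_start and stop >= c_stop:
            full.append(i)      # the write covers the whole chunk
        else:
            partial.append(i)   # the write covers only part of the chunk
    return full, partial


# A write to items 5..34 with chunks of 10 items touches chunks 0 to 3.
full, partial = classify_chunks(5, 35, 10)
```

Here chunks 1 and 2 lie entirely inside the region, while chunks 0 and 3 are only partly covered, so only those two could be shared with a neighbouring writer.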

Thus it would be possible to reduce the number of locks being used, by detecting for each chunk whether it is being partially updated or fully replaced, and only acquiring a lock if it is being partially updated. We already detect whether a chunk write is full or partial because this determines whether or not we need to read the chunk before writing (partial) or can just overwrite it (full). So we'd just need to make use of this information when deciding whether or not to acquire a lock.
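A rough sketch of what that could look like, under stated assumptions: `write_chunk` is a hypothetical function, chunks are modelled as plain lists in a dict, and `locks` is any mapping from chunk key to a lock. None of this is zarr's actual code path, and skipping the lock is only safe if concurrent writers really do target separate regions, as described above.

```python
import threading
from contextlib import nullcontext


def write_chunk(chunks, locks, key, chunk_len, offset, values):
    """Write values into chunk `key`, locking only for partial updates.

    Returns True if a lock was acquired, False if it was skipped.
    """
    is_full = offset == 0 and len(values) == chunk_len
    # Look up the lock only when the write is partial.
    guard = nullcontext() if is_full else locks[key]
    with guard:
        if is_full:
            # Full replacement: no need to read the existing chunk.
            chunks[key] = list(values)
        else:
            # Partial update: read, modify, write back under the lock.
            data = list(chunks.get(key, [0] * chunk_len))
            data[offset:offset + len(values)] = values
            chunks[key] = data
    return not is_full


chunks = {}
locks = {"0": threading.Lock()}
locked_full = write_chunk(chunks, locks, "0", 4, 0, [1, 2, 3, 4])
locked_partial = write_chunk(chunks, locks, "0", 4, 1, [9, 9])
```

The same `is_full` flag drives both decisions, read-before-write and lock acquisition, which mirrors the point that the information is already available at write time.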
