Improving per-chunk latency #72

Open
chris-allan opened this issue Aug 11, 2020 · 3 comments

chris-allan commented Aug 11, 2020

Hope you are well @axtimwalde, @igorpisarev.

First, some background.

We have a few projects now where we are using N5. The most obvious open source example is https://github.com/glencoesoftware/bioformats2raw. These projects are:

  • Cross platform (predominantly Windows and Linux)
  • Predominantly multi-threaded single node
  • Often executing on recent, high clock rate CPUs with local NVMe SSD storage
  • Heavy on I/O, light on compute

Several of our users are utilizing bioformats2raw to convert 100s of TBs of whole slide imaging data either to an N5/Zarr intermediate or in a pipeline with pyramidal OME-TIFF (via https://github.com/glencoesoftware/raw2ometiff) as the end goal. Consequently, time to conversion as well as resource utilization matter greatly to them; milliseconds matter.

While evaluating throughput we came across a few design decisions in N5 that we'd like to validate:

  1. The use of file locking

```java
final OpenOption[] options = readOnly
        ? new OpenOption[]{StandardOpenOption.READ}
        : new OpenOption[]{StandardOpenOption.READ, StandardOpenOption.WRITE, StandardOpenOption.CREATE};
channel = FileChannel.open(path, options);
for (boolean waiting = true; waiting;) {
    waiting = false;
    try {
        channel.lock(0L, Long.MAX_VALUE, readOnly);
    } catch (final OverlappingFileLockException e) {
        waiting = true;
        try {
            Thread.sleep(100);
        } catch (final InterruptedException f) {
            waiting = false;
            Thread.currentThread().interrupt();
        }
    } catch (final IOException e) {}
}
```

Under load, particularly on Windows, we have seen several scenarios where file locking consumes the largest share of time during chunk reads and writes.

As N5 and Zarr are quite similar, we have been operating under the concurrency assumptions outlined here:

Is it N5's desire to offer concurrency guarantees beyond those currently offered by Zarr? Would you accept PRs that remove or make file locking optional?
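
For illustration, an opt-out could look roughly like the sketch below; the `lockFiles` flag is an assumption on our part, not an existing N5 option:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: a channel opener whose file locking can be disabled.
// The lockFiles flag is hypothetical and not part of the current N5 API.
class OptionalLockChannels {

    private final boolean lockFiles;

    OptionalLockChannels(final boolean lockFiles) {
        this.lockFiles = lockFiles;
    }

    FileChannel open(final Path path, final boolean readOnly) throws IOException {
        final OpenOption[] options = readOnly
                ? new OpenOption[]{StandardOpenOption.READ}
                : new OpenOption[]{StandardOpenOption.READ, StandardOpenOption.WRITE, StandardOpenOption.CREATE};
        final FileChannel channel = FileChannel.open(path, options);
        if (!lockFiles)
            return channel; // caller accepts Zarr-style last-writer-wins semantics

        // Same retry loop as the snippet above, entered only when locking is enabled.
        for (boolean waiting = true; waiting;) {
            waiting = false;
            try {
                channel.lock(0L, Long.MAX_VALUE, readOnly);
            } catch (final OverlappingFileLockException e) {
                waiting = true;
                try {
                    Thread.sleep(100);
                } catch (final InterruptedException f) {
                    waiting = false;
                    Thread.currentThread().interrupt();
                }
            }
        }
        return channel;
    }
}
```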

  2. Checks for the existence of files

```java
final Path path = Paths.get(basePath, getDataBlockPath(pathName, gridPosition).toString());
if (!Files.exists(path))
    return null;
```

Similar to [1], we have seen scenarios where checking for file existence takes longer than the actual read/write. Would you accept PRs that change the semantics surrounding missing chunks?
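
For example, a minimal sketch of an exception-based approach, assuming a missing chunk should simply yield null:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Sketch: fold the existence check into the read itself. Opening a missing
// file throws NoSuchFileException, so the separate Files.exists() round trip
// to the filesystem can be dropped.
static byte[] readChunkOrNull(final Path path) throws IOException {
    try {
        return Files.readAllBytes(path);
    } catch (final NoSuchFileException e) {
        return null; // same "missing chunk" semantics, one filesystem call fewer
    }
}
```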

  3. Just-in-time allocation

Again, similar to [1], we have seen several scenarios where memory allocation, array copying, and GC pressure dominate the runtime. Would you accept PRs that allow API consumers to perform their own memory management, perhaps using pre-allocated DataBlock instances, and allow DataBlock<byte> to be used for any data type to avoid copies to and from typed Java arrays?
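
As a sketch of what caller-managed memory might look like (the readInto method and surrounding loop are hypothetical, not current N5 API):

```java
import java.nio.ByteBuffer;
import java.util.List;

// Sketch: one direct buffer allocated up front and refilled per chunk,
// instead of a fresh typed array per DataBlock.
class PreallocatedReadLoop {

    // Hypothetical reader interface that fills a caller-owned buffer in place.
    interface ChunkReader {
        void readInto(long[] gridPosition, ByteBuffer target);
    }

    static void processAll(final ChunkReader reader, final List<long[]> positions, final int blockSizeInBytes) {
        final ByteBuffer chunk = ByteBuffer.allocateDirect(blockSizeInBytes);
        for (final long[] gridPosition : positions) {
            chunk.clear();                        // reuse, don't reallocate
            reader.readInto(gridPosition, chunk); // fill in place
            chunk.flip();
            // ... hand the filled buffer to the consumer without copying
        }
    }
}
```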

Thanks!

/cc @joshmoore, @kkoz, @melissalinkert

axtimwalde (Collaborator) commented Aug 11, 2020

> Is it N5's desire to offer concurrency guarantees beyond those currently offered by Zarr? Would you accept PRs that remove or make file locking optional?

Fine with me; we wanted to avoid partially overlapping reads and writes of blocks and metadata. A writer with file locking optionally turned off would be a welcome PR.

> Similar to [1], we have seen scenarios where checking for file existence takes longer than the actual read/write. Would you accept PRs that change the semantics surrounding missing chunks?

Depends. What would the semantics be? If you mean failing by catching the equivalent exceptions, and that being faster than the exists check, I am all ears. If it means a failure mode that does not distinguish between files that are locked and files that do not exist, I am against it. On what file systems are these issues relevant? I am hesitant to give up guarantees to speed up operation on some niche Windows file system that will never be used in HPC environments. I usually find that single-workstation use is well served by HDF5, for which we have an N5 driver.

> Again, similar to [1], we have seen several scenarios where memory allocation, array copying, and GC pressure dominate the runtime. Would you accept PRs that allow API consumers to perform their own memory management, perhaps using pre-allocated DataBlock instances, and allow DataBlock<byte> to be used for any data type to avoid copies to and from typed Java arrays?

Absolutely! This is a left-over from the very first day and a thorn in my side. For compressed data, however, the advantages can quickly fizzle out because the existing compressors often create byte arrays from byte arrays. I still think that converting the DataBlock API from <T> to ByteBuffer can offer some advantages. This includes Buffer-backed pixel access in ImgLib2.
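
For illustration, a ByteBuffer-backed block interface might look roughly like this (names are illustrative, not a committed design):

```java
import java.nio.ByteBuffer;
import java.nio.ShortBuffer;

// Sketch: a block that carries one untyped payload; typed views are created
// on demand rather than copied into typed Java arrays.
interface BufferBackedBlock {

    long[] getGridPosition();

    ByteBuffer getData(); // raw bytes, regardless of data type

    // Typed access without copying, e.g. for 16-bit data:
    default ShortBuffer asShortBuffer() {
        return getData().asShortBuffer();
    }
}
```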

chris-allan (Author) commented Aug 11, 2020

> Fine with me; we wanted to avoid partially overlapping reads and writes of blocks and metadata. A writer with file locking optionally turned off would be a welcome PR.

👍

We'll get on that.

> Depends. What would the semantics be? If you mean failing by catching the equivalent exceptions, and that being faster than the exists check, I am all ears.

Definitely this. I was concerned that the original intent may have been to prevent a read/write from happening at all for some particular reason.

> On what file systems are these issues relevant? I am hesitant to give up guarantees to speed up operation on some niche Windows file system that will never be used in HPC environments. I usually find that single-workstation use is well served by HDF5, for which we have an N5 driver.

NTFS on Windows 10. It is also a problem when working with network filesystems.

We definitely don't consider Windows niche. A sizeable portion of our user base works on fat Windows workstations. Being able to use the same data layout at the workstation, local HPC (parallel filesystem), and object storage levels is hugely beneficial.

With respect to the current structure of N5 backends, there is the separate issue of composability, which probably goes beyond what I wanted to raise in this issue. For example, we'd really like to be able to use n5-zarr with object storage. Have you thought about refactoring N5 along the lines of what has been discussed in zarr-developers/zarr-python#540?

> Absolutely! This is a left-over from the very first day and a thorn in my side. For compressed data, however, the advantages can quickly fizzle out because the existing compressors often create byte arrays from byte arrays. I still think that converting the DataBlock API from <T> to ByteBuffer can offer some advantages. This includes Buffer-backed pixel access in ImgLib2.

👍

Great. I'll give it some thought and try to put something together for review.

Edit: Forgot one comment.

axtimwalde (Collaborator) commented Aug 11, 2020

> NTFS on Windows 10. It is also a problem when working with network filesystems.
>
> We definitely don't consider Windows niche. A sizeable portion of our user base works on fat Windows workstations. Being able to use the same data layout at the workstation, local HPC (parallel filesystem), and object storage levels is hugely beneficial.

Absolutely, I should have skipped that comment. However, my main concern is that not every file system is well suited to the N5/Zarr approach: minimum block sizes, limits on the number of files per directory, ls speed, etc. can all get in your way. I therefore strongly believe that HDF5 files are an excellent solution for data that fits on big workstations, which often run Windows. The transfer into cloud land can then be performed via copy-conversion with tools such as n5-copy. The API of the consuming code remains N5, regardless of whether it uses HDF5 or another N5 backend. I find this superior to guaranteeing data compatibility by making everything look like a filesystem. That is why I am suggesting caution when aiming for performance on platforms that may not be a relevant target.

> With respect to the current structure of N5 backends, there is the separate issue of composability, which probably goes beyond what I wanted to raise in this issue. For example, we'd really like to be able to use n5-zarr with object storage. Have you thought about refactoring N5 along the lines of what has been discussed in zarr-developers/zarr-python#540?

Yes, we have had this discussion. File-format filters and storage primitives should be decoupled. We haven't done it yet.
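
For illustration only, such a decoupling could put a minimal key-value storage primitive underneath the format layer, so that N5 or Zarr semantics sit on top of a filesystem, object store, or anything else (a sketch, not the planned design):

```java
import java.io.IOException;

// Sketch: the format layer (N5, Zarr) speaks only to this primitive;
// implementations could be a local filesystem, S3, GCS, etc.
interface KeyValueStore {

    byte[] read(String key) throws IOException;

    void write(String key, byte[] value) throws IOException;

    boolean exists(String key) throws IOException;

    void remove(String key) throws IOException;
}
```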
