Quick out-of-core downsampling to provide an overview of large data #560
As suggested in #553, we need to improve our raster support so that it can access large rasters chunk by chunk. This will let us work with raster datasets larger than system memory, just as we can currently work with larger-than-memory columnar Dask dataframes, and should make it very practical to work with a small, zoomed-in region of a very large raster dataset.
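For reference, this is roughly what chunked, lazy access already looks like on the xarray/dask side; the file name, variable name, and chunk sizes below are only placeholders:

```python
import xarray as xr

# Open a (hypothetical) large raster lazily; each chunk becomes a dask task,
# so only the chunks actually touched by a computation are read from disk.
ds = xr.open_dataset("large_raster.nc", chunks={"x": 1000, "y": 1000})
subset = ds["band"].isel(x=slice(0, 5000), y=slice(0, 5000)).mean().compute()
```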
However, even if full resolution is needed only on deep zooms, very often it is not clear which region of the large dataset should be viewed at full resolution, until a zoomed-out overview of the data has been provided to the user. And unfortunately, constructing such an overview using datashader will currently require making a complete pass through the entire dataset, reading all of the data into Python in order to aggregate it into a coarse grid.
Even if the Python code were to explicitly downsample the data, accessing only a strided subset of the data from disk, with most file formats and storage technologies the entire dataset will still end up being accessed.
After the above issue is addressed, it would be great if we could work on some specific combinations of aggregations and file formats where constructing such an overview would be relatively cheap. We could also consider making examples of having some pre-processed overview of the data stored with it, and then using HoloViews to switch dynamically between the overview and the full data depending on zoom level.
I am currently implementing something similar for our data. I came across #553 and then started working on this myself. So far I haven't found a way to wrap xarray's `apply_ufunc` or `dask.array.map_blocks`, even though the latter supports ghosting, which, as mentioned by @philippjfr, will probably be important for the sampling. The problem is that one needs to know the size of the processed data chunks, and these will be chunk-dependent for non-uniformly chunked datasets (e.g. an array of size 2055 along one axis is chunked into 1000, 1000, 55).
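For illustration, the uneven chunking looks like this in dask (sizes taken from the example above):

```python
import numpy as np
import dask.array as da

data = da.from_array(np.zeros((2055, 2055)), chunks=(1000, 1000))
print(data.chunks)              # ((1000, 1000, 55), (1000, 1000, 55))
print(data.to_delayed().shape)  # (3, 3): one delayed object per chunk
```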
1. `data = dask.array.from_array(hdf5_file, chunks=(1000, 1000))`
2. Estimate the number of plot tiles.
3. Wrap a `resampling_function` for each chunk in `dask.delayed`.
Inside the `resampling_function`, I initialize an array of zeros with the size of the plot canvas and fill one tile with the sampled array from `gridtools.resample_2d`. In the end, when I sum over the `dask.delayed` results, I end up with the complete agg. I am still testing the equivalence, but so far it seems to work very well: on my laptop I can downsample more than 7 GB in a few seconds.
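A rough, self-contained sketch of that pattern, in case it helps others; `downsample_chunk` and `overview` are hypothetical names, and the strided slice only stands in for `gridtools.resample_2d`:

```python
import numpy as np
import dask
import dask.array as da

def downsample_chunk(chunk, offset, tile_shape, canvas_shape):
    """Resample one chunk into an otherwise-zero, canvas-sized array."""
    canvas = np.zeros(canvas_shape)
    # Crude strided stand-in for gridtools.resample_2d; the real function
    # would compute a proper area mean rather than picking every n-th value.
    sr = max(chunk.shape[0] // tile_shape[0], 1)
    sc = max(chunk.shape[1] // tile_shape[1], 1)
    tile = chunk[::sr, ::sc][:tile_shape[0], :tile_shape[1]]
    r, c = offset
    canvas[r:r + tile.shape[0], c:c + tile.shape[1]] = tile
    return canvas

def overview(data, canvas_shape=(400, 400)):
    """Sum per-chunk canvases, each built lazily with dask.delayed."""
    H, W = data.shape
    ch, cw = canvas_shape
    row_off = np.cumsum((0,) + data.chunks[0][:-1])
    col_off = np.cumsum((0,) + data.chunks[1][:-1])
    pieces = []
    for (i, j), block in np.ndenumerate(data.to_delayed()):
        # Tile size and offset are proportional to the chunk's position;
        # integer rounding can leave small gaps at tile boundaries, which a
        # real implementation would handle more carefully.
        tile_shape = (max(data.chunks[0][i] * ch // H, 1),
                      max(data.chunks[1][j] * cw // W, 1))
        offset = (row_off[i] * ch // H, col_off[j] * cw // W)
        pieces.append(dask.delayed(downsample_chunk)(
            block, offset, tile_shape, canvas_shape))
    # Each piece is zero everywhere except its own tile, so the sum is the agg.
    return dask.delayed(sum)(pieces).compute()

data = da.from_array(np.random.rand(2055, 2055), chunks=(1000, 1000))
agg = overview(data)
```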
In addition, I added a cache of the full downsampled dataset in my plot class, so that resetting to the full data extent, by far the biggest computation, is only performed once.
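A minimal sketch of that caching, assuming a plot class along these lines (the class and method names are hypothetical, and `overview` refers to the sketch above):

```python
class OverviewPlot:
    def __init__(self, data, canvas_shape=(400, 400)):
        self.data = data
        self.canvas_shape = canvas_shape
        self._full_overview = None  # cache of the full-extent downsample

    def full_overview(self):
        # The expensive full-extent pass runs only once; resets reuse the cache.
        if self._full_overview is None:
            self._full_overview = overview(self.data, self.canvas_shape)
        return self._full_overview
```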
Thinking a little more about it, the above works well for me without any chunk overlap because of the nature of the data (a matrix of probabilities): values are of similar magnitude within their respective neighborhoods and will converge to an area mean regardless of where exactly the split occurs.