Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Implementation of DiskArray interface #105

Merged
merged 25 commits into from
Aug 15, 2020
Merged

Conversation

meggart
Copy link
Collaborator

@meggart meggart commented Jan 15, 2020

As discussed here there was the suggestion to implement the DiskArray interface for the raster types in ArchGDAL. For a single RasterBand this was straightforward. However I had to jump through some hoops to make it possible to treat a whole Dataset as a AbstractDiskArray. The major problems were:

  • data type is not guaranteed to be constant across bands
  • band size is not guaranteed to be constant across bands
  • there are datasets that do not contain any raster data (like all vector data)

So I decided to make a new Dataset wrapper (RasterDataset) which does some checks during construction and then wraps a Dataset into a new AbstractDiskArray. I just tested this and the following functionality would come for free (plus everything else we decide to put on top of DiskArrays.jl)

  • efficient getindex/setindex for Bands and Datasets
  • efficient reductions over blocks that are defined by windows
  • reductions over dimensions

Some examples that I tried

r = AG.read("/home/fgans/julia_dev/ArchGDAL/test/ospy/data4/aster.img") do ds
  b = AG.getband(ds,2)
  mean(b)
end
r = AG.readraster("/home/fgans/julia_dev/ArchGDAL/test/ospy/data4/aster.img") do ds
  r2 = view(ds,1:1000,1:1000,:)
  r2[20:22,20:22,1]
end
r = AG.readraster("/home/fgans/julia_dev/ArchGDAL/test/ospy/data4/aster.img") do ds
  maximum(ds,dims=(1,2))
end

@rafaqz I was confused by a few tests in test_array.jl because it looks like you allowed to omit the last index when accessing a Dataset, even if the number of bands is greater than one. I marked these tests as broken here, what was the rationale behind this behavior?

@rafaqz
Copy link
Collaborator

rafaqz commented Jan 15, 2020

I think the underlying read and write methods allow treating the array as 2d like that. But you are right that shouldn't be allowed in the array interface.

But I'm not sure how bands should be treated really. Is a single band raster 2d or still 3d? In GeoData.jl you always get a 3d array with a Band dim no matter how many bands there are, for consistency.

@meggart
Copy link
Collaborator Author

meggart commented Jan 23, 2020

I think the underlying read and write methods allow treating the array as 2d like that. But you are right that shouldn't be allowed in the array interface.

Thanks for the clarification. For now I removed the tests that were marked as broken

But I'm not sure how bands should be treated really. Is a single band raster 2d or still 3d? In GeoData.jl you always get a 3d array with a Band dim no matter how many bands there are, for consistency.

I tend to go for consistency as well, the number of dimensions of the array should always be predictable. Since I already introduced a new function name (maybe @yeesian can comment if this is ok), should we make a distinction between readraster and readstack, where readraster opens a dataset and returns a handle to the first band while readstack returns a 3D-handle to all bands? Or is this too R-rish...

@felixcremer
Copy link
Contributor

I tried to use this implementation as input to an ESDL data cube see https://github.com/esa-esdl/ESDL.jl/pull/193.
At the moment the functions which use the pointer to apply GDAL functions doesn't work.
For me it was the getgeotransform and the nraster functions.

Could we implement a nicer way to get an interactive RasterDataset?
I would suggest, that readraster("path/to/file") would return the RasterDataset, because this would make it very nice for interactive work.
At the moment I can get this with

AG.RasterDataset(AG.read("/path/to/file"))

I don't like the name readraster, if this returns single bands. I would suggest the name readband and to always expect an Integer to indicate which band should be read.

@meggart
Copy link
Collaborator Author

meggart commented Apr 17, 2020

So I updated the PR according to the comments. @yeesian maybe you can give a short comment if you would be interested in this PR?

@meggart
Copy link
Collaborator Author

meggart commented Apr 17, 2020

To be more specific, AG.readraster("raster.tif") now returns a wrapped IDataset and I added forwarding for a bunch of methods like width, height etc.

@yeesian yeesian requested a review from visr April 18, 2020 02:29
src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
Copy link
Owner

@yeesian yeesian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took me so long to get around to this. I was thrown off by "WIP" in the title. I like the decisions of

  1. the correspondence between a
  • RasterDataset
  • (GDAL)Dataset
  • 3D (Julia) Array
  1. the correspondence between
  • RasterBand
  • DiskArray{T,2}
  • 2D (Julia) Array
  1. that read returns a GDAL Dataset while readraster returns a RasterDataset.

Since we're introducing RasterDataset as a new data type, it's worth documenting it well, so that we don't have to revisit these design decisions in future PRs. For that reason, it might also be worth putting some thought into the sets of methods that we'll like to support (and not support) on it, and to provide docstrings for those functions that are endorsed for use.


The comments I left are more of minor questions -- for that reason, I don't require that they be addressed before merging it. Nonetheless, I think it'll be good to have visr@ and rafaqz@ (and evetion@) weigh in on this PR too.

@meggart meggart changed the title WIP: Implementation of DiskArray interface RFC: Implementation of DiskArray interface Apr 20, 2020
@meggart
Copy link
Collaborator Author

meggart commented Apr 20, 2020

Thanks for the comments @yeesian I tried to better document the RasterDataset in the source code which should address many of your comments. Later this week I will try to update https://github.com/yeesian/ArchGDAL.jl/blob/master/docs/src/rasters.md and add a paragraph about dealing with RasterDatasets.

Maybe we get some more opinions by @rafaqz or @visr or @evetion on which methods should be forwarded for RasterDatasets or other design decisions that should be taken into account.

@visr
Copy link
Collaborator

visr commented Apr 20, 2020

Great to see this coming together. I'll go over it later this week.

src/raster/array.jl Outdated Show resolved Hide resolved
Copy link
Collaborator

@visr visr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things that would be good to address still:

  • Could you use a 4 space indent, for consistency with the rest of the ArchGDAL code?
  • Some tests for the specific functions and types that you added would be good to have
  • Doc updates: this probably needs some updates to the raster, dataset and windowed reads and writes pages.

For the forwarding of AbstractDataset functions, I quickly went over dataset.jl and came up with these functions:

copywholeraster(source::AbstractDataset, dest::AbstractDataset; <keyword arguments>)
copy(dataset::AbstractDataset; [filename, [driver, [<keyword arguments>]]])
write(dataset::AbstractDataset, filename::AbstractString; kwargs...)
getdriver(dataset::AbstractDataset)
filelist(dataset::AbstractDataset)
testcapability(dataset::AbstractDataset, capability::AbstractString)
listcapability
setgeotransform!
ngcp(dataset::AbstractDataset)
setproj!(dataset::AbstractDataset, projstring::AbstractString)
buildoverviews!

Plus from rasterio.jl functions like rasterio!/read/write/read!/write!.
These are quite a lot though. We could consider using a delegation/forwarding macro like discussed in this thread.

Although since there are ths many, can't we just make RasterDataset an AbstractDataset instead of an AbstractDiskArray{T,3}, and add the needed DiskArrays method we would otherwise get? The DiskArrays readme mentions that instead of subtyping you can also use interpret_indices_disk. Or is there another reason you didn't use that here?

src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
src/raster/array.jl Outdated Show resolved Hide resolved
@visr
Copy link
Collaborator

visr commented Apr 27, 2020

The RasterDataset is also very interesting in relation to https://gdal.org/development/rfc/rfc75_multidimensional_arrays.html.

This is landing in GDAL 3.1.0 which will come out in a few days:

Implement RFC 75: support for multidirectional arrays in MEM, VRT, netCDF, HDF4, HDF5 and GRIB drivers. Read/write for MEM and netCDF. Read/only for others. Add gdalmdiminfo and gdalmdimtranslate utilities.

Haven't dived much in that RFC yet though.

@meggart
Copy link
Collaborator Author

meggart commented Aug 5, 2020

Thanks a lot for the commments @visr.

Plus from rasterio.jl functions like rasterio!/read/write/read!/write!.
These are quite a lot though. We could consider using a delegation/forwarding macro like discussed in this thread.
Although since there are ths many, can't we just make RasterDataset an AbstractDataset instead of an AbstractDiskArray{T,3}, and add the needed DiskArrays method we would otherwise get? The DiskArrays readme mentions that instead of subtyping you can also use interpret_indices_disk. Or is there another reason you didn't use that here?

The main reason is that I would like a RasterDataset to be a subtype of AbstractArray, because ideally it should just feel like an array, and they should be naturally wrappable by packages like AxisArrays and DImensionalData which usually expect to wrap and AbstractArray.

Initially I also intentionally did not implement the methods from rasterio.jl, because I see the RasterDataset as an alternative interface to treating a Dataset with much more Julian syntax getindex/setindex syntax. So my thinking was that a user either uses the classical open_dataset to obtain a domain-specific object with special methods for read, write, and rasterio or they can use open_raster to obtain something array-like that is accessed by square-brackets. I would try to reflect this in the documentation, if we agree to go this way.

@coveralls
Copy link

coveralls commented Aug 5, 2020

Coverage Status

Coverage decreased (-0.3%) to 92.389% when pulling 3417d94 on meggart:diskarrays into 47cc556 on yeesian:master.

@visr
Copy link
Collaborator

visr commented Aug 6, 2020

Ok, I see the advantages of subtyping AbstractArray, and agree we should go with that for the DiskArray interface. I'm a bit worried about splitting our API in two halves. Perhaps we could migrate in time to a single one that is AbstractArray? Probably shouldn't hold up this PR though, but is worth thinking about.

@meggart
Copy link
Collaborator Author

meggart commented Aug 6, 2020

Good, so I have added a paragraph to the documentation and added additional unit tests. I think I have to check coveralls to find out why there is still a decrease in coverage. Otherwise I think I have addressed the major comments and this should be ready to go, please let me know if I forgot something.

docs/src/rasters.md Outdated Show resolved Hide resolved
docs/src/rasters.md Outdated Show resolved Hide resolved
@meggart
Copy link
Collaborator Author

meggart commented Aug 10, 2020

Yes I just created the PR on the registry: JuliaRegistries/General#19258

@visr visr reopened this Aug 10, 2020
@meggart
Copy link
Collaborator Author

meggart commented Aug 10, 2020

Ok, the new DiskArrays version got tagged. I think @felixcremer wanted to look at the reduced test coverage and add a few tests for the WindowIterator, after that, test coverage should also be fine again.

@felixcremer
Copy link
Contributor

felixcremer commented Aug 11, 2020

I made a pull request against your fork to add tests for WindowIterator. I am wondering, whether we should rename the RasterDataset to RasterArray, because it might be confusing, that the RasterDataset is not a subtype of AbstractDataset.

Also the display function for RasterBand and RasterDataset fall back to the show methods of a DiskArray. I think it would be nice to fallback to the show method of the underlying dataset for RasterDataset and to indicate that it is a RasterBand for the RasterBand.

(cherry picked from commit 9f731d5)
@felixcremer
Copy link
Contributor

I just looked at the coveralls results and the missing lines in iterators.jl are tested, but not picked up correctly by coveralls.

@visr
Copy link
Collaborator

visr commented Aug 12, 2020

I made a pull request against your fork to add tests for WindowIterator.

Thanks, I didn't see a PR but found the commit and put it on this branch with git cherry-pick -x 9f731d57.

I am wondering, whether we should rename the RasterDataset to RasterArray

I see your point, but can also see the advantages of having Dataset in the name, since it wraps a Dataset. I'd say let's keep it like this.

Also the display function for RasterBand and RasterDataset fall back to the show methods of a DiskArray.

Indeed. I don't quite understand why it hits those fallbacks, since in display.jl there is Base.show(io::IO, rasterband::AbstractRasterBand). And for RasterDataSet it would be nice to show something similar to what Base.show(io::IO, dataset::AbstractDataset) shows, minus the feature layers I guess. @meggart could you have a look at the show methods? Otherwise I think this is good to go for me.

I added a few commits. Some of the doc changes are unrelated, sorry for that, but needed to get the docs to build again.

Since this changes the existing types, I don't think we can release this as a patch release, so it would be 0.5.0.

docs/src/rasters.md Outdated Show resolved Hide resolved
@felixcremer
Copy link
Contributor

I think, the problem is, that DiskArrays defines a method of show with a MIME type,

[1] show(io::IO, ::MIME{Symbol("text/plain")}, X::DiskArrays.AbstractDiskArray) in DiskArrays at src/DiskArrays.jl:210

and this is the function to which the plain show call from the repl is dispatched to.
I am not sure, whether we would need to define the show function in ArchGDAL with a MIME type as well, or whether this method should be deleted in DiskArrays.

For the RasterDataset, we can include this line to use the show method of the underlying dataset:

Base.show(io::IO, ::MIME"text/plain", raster::RasterDataset) = show(io, raster.ds)

meggart and others added 2 commits August 14, 2020 09:48
Co-authored-by: Felix Cremer <felix.cremer@uni-jena.de>
@meggart
Copy link
Collaborator Author

meggart commented Aug 14, 2020

Thanks @felixcremer for pointing this out. I remember I had to do this for DiskArrays because otherwise sometimes the generic AbstractArray show method would still leak through, which would be very bad for a DiskArray. I can not remember the exact reason, why this was necessary, we might revisit this in DiskArrays.

@visr
Copy link
Collaborator

visr commented Aug 14, 2020

Thanks. I still want to watch JuliaCon 2020 | Display, show and print -- how Julia's display system works | Fredrik Ekre.

I think the same patch is needed for RasterBand if I remember correctly.

This avoids the usage of the DiskArray show method for rasterbands
We need the version without MIME type so that we can use the print
function.
@felixcremer
Copy link
Contributor

I added the MIME type patch to the RasterBand show function.
I also added, show methods without the MIME type, so that print(band) works as well. I am not entirely sure, whether this is the correct way to overload these show functions, but this should work for now.

@meggart
Copy link
Collaborator Author

meggart commented Aug 14, 2020

I just watched the video and it looks like the preferred way is not to overload show(io,x::MyType) but to explicitly pick a MIME type for your output. So I added methods at least for RasterBand and RasterDataset, but one might think about revisiting display.jl at one point.

@visr
Copy link
Collaborator

visr commented Aug 14, 2020

This is a very nice addition to ArchGDAL. Everybody good to merge?

Copy link
Owner

@yeesian yeesian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great addition, thanks for seeing it through!

I appreciate the work it took to standardize on the methods for ArchGDAL across array.jl and rasterband.jl (away from the Base index methods).

@yeesian yeesian merged commit c224818 into yeesian:master Aug 15, 2020
@visr
Copy link
Collaborator

visr commented Aug 15, 2020

This PR also makes it easier to convert multiple bands into a RGB image:

using ImageCore, ArchGDAL
ds = ArchGDAL.readraster("/vsicurl/https://github.com/yeesian/ArchGDALDatasets/raw/master/gdalworkshop/world.tif")
img = colorview(RGB, normedview(PermutedDimsArray(ds, (3,2,1))))

size(ds) is (2048, 1024, 3), so (width, height, bands).
With permutation (3,2,1) the bands are brought to the beginning, because Images.jl expects all colors together in a pixel, and the width and height are swapped to plot the image upright in VS Code. (cc @pritamd47)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants