Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AbstractDimStack <: AbstractArray #576

Open
rafaqz opened this issue Dec 11, 2023 · 5 comments
Open

AbstractDimStack <: AbstractArray #576

rafaqz opened this issue Dec 11, 2023 · 5 comments

Comments

@rafaqz
Copy link
Owner

rafaqz commented Dec 11, 2023

The idea came up trying to write rand samplers for AbstractDimStack.

Why isn't an AbstractDimArray an AbstractArray{<:NamedTuple} ?

With rand, we would always want the sampler to return a NamedTuple of values from each layer. But what does that mean with mixed-size layers? Currently indexing a DimStack with a linear Int doesn't do something sensible - it just indexes every layer at the same index in map. We should instead treat a DimStack exactly like the DimTable it can be unrolled into. Linear indexing a DimStack should always give the same value as indexing a row of its DimTable - that is it really indexes with:

getindex(st::AbstractDimStack, i::Int) = st[DimIndices(st)[i]...]

I had actually forgotten that it doesn't do that already and consider it almost a bug. When all layers are the same size there will be no difference, and we can still just use regular linear indexing on all layers.

With that change DimStack can basically be an AbstractArray. This means rand will work already without adding a sampler, and broadcast and iteration will also "just work", closing #412.

The main breaking change is to map that currently works over stack Array layers, not NamedTuple values. We will need to replace it with e.g. maplayers. This will need a lot of changes here and also in Rasters.jl. So probably in other packages too.

Any thoughts @sethaxen @felixcremer @JoshuaBillson ?

@JoshuaBillson
Copy link
Contributor

I largely support the idea. However, I think this raises a few questions:

  1. Would indexing into a DimStack return the dimensional values, as it does with a DimTable?
  2. Are DimStacks going to be stored as an array of NamedTuples, or is this simply a type-level abstraction as in StructArrays.jl? I'm concerned that the former would be highly inefficient for many operations.
  3. Could we use this as an opportunity to support wide stacks? The current issue is that NamedTuple has extremely long compile times for DimStacks with many layers. If we choose to represent a standard stack as an AbstractArray{<:NamedTuple}, perhaps we could represent a wide stack as an AbstractArray{<:AbstractDict}. Indexing into a stack would then be analogous to a Tables.rowtable or Tables.dictrowtable, respectively.
  4. A feature I've often found myself wishing for is union and intersection operators for combining stacks and arrays. For example, I could take a RasterStack representing a satellite image and a Raster containing a DEM layer and take their intersection to produce a new RasterStack with an extent equal to the area covered by both data sources. Similarly, a union operation could be used to combine two satellite images covering different geographic regions. One problem is that NamedTuples are immutable, so adding new layers on the fly is difficult. Is there any way we could support dynamically adding layers to an existing stack?
  5. Would broadcast operate over NamedTuples or their individual fields? I feel like the former would be tedious for many operations, like scaling all values by some constant. Also, what would happen if the broadcast function returns a NamedTuple with different fields than the input? Would we get a new DimStack with the same layers as the names in the output NamedTuple? So broadcasting the function f(x) = (sum = x.a + x.b, product = x.a * x.b) would take a stack with the layers a and b and return a new stack with layers sum and product?

@rafaqz
Copy link
Owner Author

rafaqz commented Dec 12, 2023

  1. No, just the values, not the dim values
  2. No storage stays the same. Its very similar to struct arrrays. (Its already how it works with dim/selector indexing too, so very small change).
  3. This is mostly already how it works! NamedTuple returned from indexing is not a new idea Im adding. But in Julia 1.11 NamedTuples should compile a lot faster and reduce your wide stack problem. If we still want Dicts we need a new stack mode that we can switch to. We could just change the new AbstractArray T in AbstractDimArray{T} from NamedTuble to Dict (with matching internals) and dispatch on that in getindex. We would never switch completely to Dict (or Dictionary) as it's much slower for small stack use cases.
  4. You can do intersect/union with crop or extend. If we had a mode switch to Dict we could do in-place adding layers. But what are the issues with using merge to make a new stack (besides wide stacks...)?
  5. Broadcast would return an AbstractDimArray of your return values. You could do things like:
A = (x -> x.a * x.b).(st) .* C

Which is different to

A = st.a .* st.b .* C

Because it will handle mixed-size layers. But yes its a little clunky accessing nested values in broadcasts.

@rafaqz
Copy link
Owner Author

rafaqz commented Dec 12, 2023

Probably a Dictionaries.jl Dictionary would have nicer properties as a mode switch from NamedTuple.

See: andyferris/Dictionaries.jl#43

We really need getproperty on dictionary keys for this to work seamlessly.

@sethaxen
Copy link
Collaborator

For InferenceObjects.jl, we for the most part ignore that getindex returns a NamedTuple and primarily use DimStack as a NamedTuple of DimArrays with the extra check that Dims are consistent. The only exception is the Tables interface implementation. So I don't have a strong feeling either way, as I don't think it will affect my users much. My only concern is whether making an AbstractDimStack an AbstractArray might cause annoying defaults to be hit in other packages or method ambiguities.

@JoshuaBillson
Copy link
Contributor

  1. This is mostly already how it works! NamedTuple returned from indexing is not a new idea Im adding. But in Julia 1.11 NamedTuples should compile a lot faster and reduce your wide stack problem. If we still want Dicts we need a new stack mode that we can switch to. We could just change the new AbstractArray T in AbstractDimArray{T} from NamedTuble to Dict (with matching internals) and dispatch on that in getindex. We would never switch completely to Dict (or Dictionary) as it's much slower for small stack use cases.

It would be a huge relief if 1.11 can address the compilation issue with NamedTuples, as it's been one of my major performance gripes. I wasn't suggesting that we replace NamedTuple with a Dictionary entirely, only that we make it possible to substitute one for the other in order to accommodate wide stacks. However, perhaps Julia 1.11 will solve this issue?

  1. Broadcast would return an AbstractDimArray of your return values. You could do things like:
A = (x -> x.a * x.b).(st) .* C

Which is different to

A = st.a .* st.b .* C

Because it will handle mixed-size layers. But yes its a little clunky accessing nested values in broadcasts.

That makes sense and would work well enough in general. However, would it be possible to return an AbstractDimStack if the broadcasted function returns a NamedTuple and an AbstractDimArray if it returns a scalar?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants