ak.unflatten does not respect IndexedArray w.r.t counts #910
Comments
I think you may be right, though I'm confused by the examples that use the layout itself (through The one high-level point I can make is that IndexedArrays must be invisible at high-level. It's essentially applying a lazy " The intended implementation would have eagerly applied the "
Hi Jim, thanks for chiming in. The I agree that I've been playing around with a kind of simplification that merges successive layouts together. For example

    layout = ak.layout.IndexedArray64(
        ak.layout.Index64([1, 0]),
        ak.layout.ListOffsetArray64(
            ak.layout.Index64([0, 2, 4]),
            ak.layout.ListOffsetArray64(
                ak.layout.Index64([0, 3, 7, 9, 12]),
                ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
            ),
        ),
    )

can be simplified top down (once) to become (via project)

    layout1 = ak.layout.ListArray64(
        ak.layout.Index64([2, 0]),
        ak.layout.Index64([4, 2]),
        ak.layout.ListOffsetArray64(
            ak.layout.Index64([0, 3, 7, 9, 12]),
            ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
        ),
    )

and

    layout2 = ak.layout.ListOffsetArray64(
        ak.layout.Index64([0, 2, 4]),
        ak.layout.ListOffsetArray64(
            ak.layout.Index64([0, 2, 5, 8, 12]),
            ak.layout.IndexedArray64(
                ak.layout.Index64([7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6]),
                ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
            ),
        ),
    )

I don't know whether this approach is the "right" one - it avoids copying the contents (which is good for a record array), but involves creating a number of intermediate layouts.
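To make the claimed equivalence easy to check by hand, here is a minimal sketch that models the three layouts with plain Python lists. The helpers `indexed`, `list_offset` and `list_array` are hypothetical stand-ins written for this illustration; they are not part of the Awkward Array API and they materialize everything eagerly.

```python
# Plain-Python stand-ins for the node types (hypothetical helpers, NOT the Awkward API).

def indexed(index, content):
    """Model of an IndexedArray: element k is content[index[k]]."""
    return [content[i] for i in index]

def list_offset(offsets, content):
    """Model of a ListOffsetArray: list k is content[offsets[k]:offsets[k + 1]]."""
    return [content[offsets[k]:offsets[k + 1]] for k in range(len(offsets) - 1)]

def list_array(starts, stops, content):
    """Model of a ListArray: list k is content[starts[k]:stops[k]]."""
    return [content[a:b] for a, b in zip(starts, stops)]

data = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3]

# Original: reordering at the top.
layout = indexed(
    [1, 0],
    list_offset([0, 2, 4], list_offset([0, 3, 7, 9, 12], data)),
)

# Simplified once: the index is absorbed into starts/stops.
layout1 = list_array([2, 0], [4, 2], list_offset([0, 3, 7, 9, 12], data))

# Simplified fully: the reordering lives at the innermost level.
layout2 = list_offset(
    [0, 2, 4],
    list_offset(
        [0, 2, 5, 8, 12],
        indexed([7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6], data),
    ),
)

# All three describe the same nested lists.
assert layout == layout1 == layout2
```

Only the placement of the reordering differs between the three; the visible nested lists are identical.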
So here's an incomplete simplifier that handles this particular layout (and makes a copy whilst it's at it). It's probably wrong e.g. with the assertion, but it's a PoC rather than production code. I don't know whether this approach is the best one. The idea here is that by moving all reordering to the inner-most level, then the flattened counts will correlate to the layout. One way to do this is:

    def replace_content(layout, content):
        def getfunction(this, depth):
            if this is layout:
                return
            return lambda: content

        return ak._util.recursively_apply(layout, getfunction)


    def simplify(layout):
        if hasattr(layout, "project"):
            return simplify(layout.project())

        # Now we don't have an IndexedArray
        if not hasattr(layout, "content"):
            return layout

        # We now have only listtypes
        assert isinstance(layout, ak._util.listtypes)

        # ListArrays can be converted to another case (reduce dimension of problem)
        if isinstance(layout, ak.layout.ListArray64):
            layout = layout.toListOffsetArray64(True)
            return simplify(layout)

        # When arrays are
        # - structural
        # - contiguous
        # then we can't optimise any further, and should move on
        if isinstance(layout, (ak.layout.ListOffsetArray64, ak.layout.RegularArray)):
            return replace_content(layout, simplify(layout.content))

        raise NotImplementedError

You know much better what is going on under the hood with respect to copying, RecordArrays, etc. Do you think this approach is the right one here? Is this kind of multi-layout modification possible w.r.t all of the possible layouts (including partitioned arrays)?
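The idea of moving the reordering to the innermost level can be sketched on bare offsets, independently of Awkward. The function `push_index_down` below is a hypothetical helper written for this illustration: given an index that reorders the lists of an offsets-based list layout, it returns contiguous offsets plus the index that must then be applied to the content. Applying it twice to the example layout from the earlier comment reproduces the offsets and index of `layout2`.

```python
def push_index_down(index, offsets):
    """Move a reordering of lists one level down (hypothetical helper).

    `index` selects lists described by `offsets`; the result is a pair
    (new_offsets, content_index) where new_offsets is contiguous and
    content_index reorders the content to match.
    """
    new_offsets = [0]
    content_index = []
    for i in index:
        start, stop = offsets[i], offsets[i + 1]
        content_index.extend(range(start, stop))
        new_offsets.append(new_offsets[-1] + (stop - start))
    return new_offsets, content_index

# Outer level: IndexedArray [1, 0] over offsets [0, 2, 4].
outer_offsets, inner_index = push_index_down([1, 0], [0, 2, 4])

# Inner level: the index pushed down from above, over offsets [0, 3, 7, 9, 12].
inner_offsets, leaf_index = push_index_down(inner_index, [0, 3, 7, 9, 12])

assert outer_offsets == [0, 2, 4]
assert inner_offsets == [0, 2, 5, 8, 12]
assert leaf_index == [7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6]
```

This sketch only covers offsets-based lists and a plain index; it says nothing about option types, records, unions, or partitioned arrays.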
We shouldn't simplify more than we need to. For example, replacing ListArrays with ListOffsetArrays doesn't have anything to do with the current problem, and ListArrays exist to avoid excessive reordering. I think your current problem with It's better to make a chain of Unfortunately for making things local and factorizable, implementing any method generally requires the author to know about all node types. OOP is usually against that, but there was no way to avoid it and do what Awkward Array does. This is the Nth iteration on the set of node types and very final; the only possible new node type is RedirectArray (#178), but that may be unnecessary now that the C++ is being translated into Python (thanks to the garbage collector). So this problem can be fixed by considering each of the possible node types just above the node that changes its length: IndexedArrays need to be projected, etc.
Yes, in my PoC I just convert things to reduce the number of combinations to consider.
For posterity, I'll add that it's both length and ordering that we need to handle here.
@jpivarski that's true - we can just project the
Noted, thanks. Old habits die hard I suppose :) I noticed while writing the visitor that it would be useful if the
Maybe I'm thinking of
Yes, there is a
@jpivarski once #912 is ready to go, what do you think we should do here? My feeling is that unless you actively care about what's going on under the hood, the average user should not need to be aware that
Possibly producing wrong results is always bad. If
I think it is caused by any layout that re-orders / resizes the internal array - Because unflatten can be called for any depth, the "fix" needs to be able to apply to any depth. I think the fact that UnionArrays break the axis wrapping assumption may not be a problem ... because unflatten can't handle unions above the axis if my assumptions are correct.
Reproducer 🐛
Same-depth IndexedArrays
Given the following layout:
which has the list representations
the act of unflattening by the run lengths produces
If we apply run_lengths to the ListOffsetArray layout, then the result is "correct":
Upper IndexedArrays
This is not just true for the same-depth index layouts, but also for any layout at any depth above the current depth. Consider this layout:
which has
Its run lengths are
Unflattening as before, we have
instead of
My expectation is that these public APIs should respect the depth-preserving (IndexedArray) layouts.
Cause 🔍
The cause is simply that we don't transform the counts array with respect to preceding layouts.
Same-depth IndexedArrays
ak.unflatten uses recursively_apply to find the layout that corresponds to the axis above the location indicated by the user. When there are ak.layout.IndexedArray layouts at the same depth, we move past them until we find a list type. https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/operations/structure.py#L2093
e.g. walking the above example
Upper IndexedArrays
Effectively the getfunction skips over these layouts until the correct depth is reached.
Solution 🔧
I think we need to aggregate the non list-type layouts that precede the current list-type layout (that getfunction operates upon) and create a new IndexedArray layout that wraps the current layout's content. Then, we create the ListOffsetArray over this layout, and restore the structural layouts above.