`ak.unflatten` does not respect `IndexedArray` w.r.t `counts` #910

agoose77 · 2021-06-11T08:39:37Z

Reproducer 🐛

Same-depth `IndexedArray`s

Given the following layout:

layout = ak.layout.IndexedArray64(
    ak.layout.Index64(np.array([3, 1, 0, 2])),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64(np.array([0, 3, 6, 9, 12])),
        ak.layout.NumpyArray(np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
    ),
)

which has the list representations

>>> display(
...     ak.to_list(layout),
...     ak.to_list(layout.content),
...     ak.to_list(layout.content.content),
... )
[[3, 3, 3], [1, 1, 1], [0, 0, 0], [2, 2, 3]]
[[0, 0, 0], [1, 1, 1], [2, 2, 3], [3, 3, 3]]
[0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3]

the act of unflattening by the run lengths produces

>>> ak.unflatten(
...     layout,
...     ak.flatten(ak.run_lengths(layout)),
...     axis=1
... ).tolist()
[[[3, 3], [3]], [[1, 1, 1]], [[0, 0, 0]], [[2, 2, 3]]]

If we apply run_lengths to the ListOffsetArray layout, then the result is "correct":

>>> ak.unflatten(
...     layout,
...     ak.flatten(ak.run_lengths(layout.content)),
...     axis=1
... ).tolist()
[[[3, 3, 3]], [[1, 1, 1]], [[0, 0, 0]], [[2, 2], [3]]]

Upper `IndexedArray`s

This is not just true for thee same-depth index layouts, but also,any layout at any depth above the current depth. Consider this layout:

layout = ak.layout.IndexedArray64(
    ak.layout.Index64([1, 0]),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64([0, 2, 4]),
        ak.layout.ListOffsetArray64(
            ak.layout.Index64(([0, 3, 6, 9, 12])),
            ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
        ),
    ),
)

which has

>>> display(
...     ak.to_list(layout),
...     ak.to_list(layout.content),
...     ak.to_list(layout.content.content),
...     ak.to_list(layout.content.content.content),
... )
[[[2, 2, 3], [3, 3, 3]], [[0, 0, 0], [1, 1, 1]]]
[[[0, 0, 0], [1, 1, 1]], [[2, 2, 3], [3, 3, 3]]]
[[0, 0, 0], [1, 1, 1], [2, 2, 3], [3, 3, 3]]
[0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3]

Its run lengths are

>>> ak.run_lengths(layout).to_list()
[[[2, 1], [3]], [[3], [3]]]

Unflattening as before, we have

>>> ak.unflatten(layout, ak.flatten(ak.run_lengths(layout), axis=None), axis=-1).tolist()
[[[[2, 2, 3]], [[3, 3, 3]]], [[[0, 0], [0]], [[1, 1, 1]]]]

instead of

[[[[2, 2], [3]], [[3, 3, 3]]], [[[0, 0, 0]], [[1, 1, 1]]]]

My expectation is that these public APIs should respect the depth-preserving (IndexedArray) layouts.

Cause 🔍

The cause is simply that we don't transform the counts array with respect to preceding layouts.

Same-depth `IndexedArray`s

ak.unflatten uses recursively_apply to find the layout corresponds to the axis above the location indicated by the user. When there are ak.layout.IndexedArray layouts at the same depth, we move past them until we find a list type. https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/operations/structure.py#L2093
e.g. walking the above example

Layout	Depth	List Type?
IndexedArray	1	False
ListOffsetArray	1	True

Upper `IndexedArray`s

Effectively the getfunction skips over these layouts until the correct depth is reached.

Solution 🔧

I think we need to aggregate the non list-type layouts that precede the current list-type layout (that getfunction operates upon) and create a new IndexedArray layout that wraps the current layout's content. Then, we create the ListOffsetArray over this layout, and restore the structural layouts above.

The text was updated successfully, but these errors were encountered:

jpivarski · 2021-06-11T11:20:19Z

I think you may be right, though I'm confused by the examples that use the layout itself (through run_lengths) to generate counts.

The one high-level point I can make is that IndexedArrays must be invisible at high-level. It's essentially applying a lazy "carry" (non-negative integer array slice) at some point in the tree, and whatever happens, such as unflatten, should not depend on whether the "carry" is lazily applied or eagerly applied.

The intended implementation would have eagerly applied the "carry" during the first recursive descent. If a new option, off by default, needs to be added to ak.util.recursively_apply to implement that, that's okay.

agoose77 · 2021-06-11T11:36:02Z

Hi Jim, thanks for chiming in. The run_lengths on content just skips the outer IndexedArray. This works because the IndexedArray doesn't modify the ordering within sublists. Usually the user would be running this on the actual Array, but I was being a bit lazy.

I agree that unflatten should not care about the IndexedArray vs carry, and indeed run_lengths behaves as expected - it respects all of the layouts. It's just the current implementation for unflatten that doesn't.

I've been playing around with a kind of simplification that merges successive layouts together. For example

layout = ak.layout.IndexedArray64(
    ak.layout.Index64([1, 0]),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64([0, 2, 4]),
        ak.layout.ListOffsetArray64(
            ak.layout.Index64([0, 3, 7, 9, 12]),
            ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
        ),
    ),
)

can be simplified top down (once) to become (via project)

layout1 = ak.layout.ListArray64(
    ak.layout.Index64([2, 0]),
    ak.layout.Index64([4, 2]),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64([0, 3, 7, 9, 12]),
        ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
    ),
)

and

layout2 = ak.layout.ListOffsetArray64(
    ak.layout.Index64([0, 2, 4]),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64([0, 2, 5, 8, 12]),
        ak.layout.IndexedArray64(
            ak.layout.Index64([7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6]),
            ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
        ),
    ),
)

I don't know whether this approach is the "right" one - it avoids copying the contents (which is good for a record array), but involves creating a number of intermediate layouts.

agoose77 · 2021-06-11T12:33:52Z

So here's an incomplete simplifier that handles this particular layout (and makes a copy whilst it's at it). It's probably wrong e.g. with the assertion, but it's a PoC rather than production code. I don't know whether this approach is the best one. The idea here is that by moving all reordering to the inner-most level, then the flattened counts will correlate to the layout. One way to do this is:

Move all IndexedArray routines to the inner buffer
Change ListArrays to ListOffsetArrays.

def replace_content(layout, content):
    def getfunction(this, depth):
        if this is layout:
            return
        return lambda: content

    return ak._util.recursively_apply(layout, getfunction)


def simplify(layout):
    if hasattr(layout, "project"):
        return simplify(layout.project())

    # Now we don't have an IndexedArray
    if not hasattr(layout, "content"):
        return layout

    # We now have only listtypes
    assert isinstance(layout, ak._util.listtypes)

    # ListArrays can be converted to another case (reduce dimension of problem)
    if isinstance(layout, ak.layout.ListArray64):
        layout = layout.toListOffsetArray64(True)
        return simplify(layout)

    # When arrays are
    # - structural
    # - contiguous
    # then we can't optimise any further, and should move on
    if isinstance(layout, (ak.layout.ListOffsetArray64, ak.layout.RegularArray)):
        return replace_content(layout, simplify(layout.content))
    
    raise NotImplementedError

You know much better what is going on under the hood with respect to copying, RecordArrays, etc. Do you think this approach is the right one here? Is this kind of multi-layout modification possible w.r.t all of the possible layouts (including partitioned arrays)?

jpivarski · 2021-06-11T13:06:08Z

We shouldn't simplify more than we need to. For example, replacing ListArrays with ListOffsetArrays doesn't have anything to do with the current problem, and ListArrays exist to avoid excessive reordering.

I think your current problem with unflatten is because the non-list nodes directly above the node whose length changes need to be simplified: e.g. IndexedArrays at this level only need to be projected.

It's better to make a chain of elif isinstance(layout, node_classes) than to test for the existence of attributes like hasattr(layout, "project"). IndexedOptionArray also has a project method and it doesn't do what you want: it removes None values.

Unfortunately for making things local and factorizable, implementing any method generally requires the author to know about all node types. OOP is usually against that, but there was no way to avoid it and do what Awkward Array does. This is the Nth iteration on the set of node types and very final—the only possible new node type is RedirectArray (#178), but that may be unnecessary now that the C++ is being translated into Python (thanks to the garbage collector).

So this problem can be fixed by considering each of the possible node types just above the node that changes its length: IndexedArrays need to be projected, etc.

agoose77 · 2021-06-11T13:20:53Z

We shouldn't simplify more than we need to. For example, replacing ListArrays with ListOffsetArrays doesn't have anything to do with the current problem, and ListArrays exist to avoid excessive reordering.

Yes, in my PoC I just convert things to reduce the number of combinations to consider.

So this problem can be fixed by considering each of the possible node types just above the node that changes its length: IndexedArrays need to be projected, etc.

For posterity, I'll add that it's both length and ordering that we need to handle here.

I think your current problem with unflatten is because the non-list nodes directly above the node whose length changes need to be simplified: e.g. IndexedArrays at this level only need to be projected.

@jpivarski that's true - we can just project the IndexedArrays because the buffers below the axis depth won't be at risk of being copied.

It's better to make a chain of elif isinstance(layout, node_classes) than to test for the existence of attributes like hasattr(layout, "project"). IndexedOptionArray also has a project method and it doesn't do what you want: it removes None values.

Noted, thanks. Old habits die hard I suppose :) I noticed while writing the visitor that it would be useful if the apply method of recursively_apply were available to call by getfunction, in the case where it wants to operate at multiple depths sequentially. Perhaps it should be an optional argument to getfunction?

jpivarski · 2021-06-11T14:17:44Z

I noticed while writing the visitor that it would be useful if the apply method of recursively_apply were available to call by getfunction, in the case where it wants to operate at multiple depths sequentially. Perhaps it should be an optional argument to getfunction?

Maybe I'm thinking of ak._util.broadcast_and_apply, but I think there's a way to pass down user data. I used that elsewhere to make it applicable at multiple depths. There were other cases (ak.combinations) in which we wanted to act at multiple levels and I did it through getfunctions that launch new ak._util.broadcast_and_apply, though that got complicated fast.

ak._util.broadcast_and_apply and ak._util.recursively_apply are purely internal functions. We can add whatever we need to them to handle new use-cases, though it has to not break existing users (within our codebase) of these functions. That is, the API can be ugly, but it has to keep working.

agoose77 · 2021-06-11T14:21:29Z

Yes, there is a pass_user which is currently used for the axis parameter. I will add a pass_apply argument to recursively_apply so that the getfunction can call the implementation for sub-trees. It will be False by default to ensure that existing use cases are not broken!

agoose77 · 2021-06-14T08:27:19Z

@jpivarski once #912 is ready to go, what do you think we should do here? Might feeling is that unless you actively care about what's going on under the hood, the average user should not need to be aware that unflatten might produce the wrong results if their array has experienced a particular set of transformations. This leads me to conclude that packed should be called by unflatten, right up to the axis that is being unflattened. This does make unflatten more expensive, but I think that's an inevitability rather than an implementation issue.

jpivarski · 2021-06-14T14:36:27Z

Possibly producing wrong results is always bad. If unflatten has to become more expensive (doing a _packed pass on it) to ensure that it is correct, then that's what needs to be done. It possibly only needs to be done if you see an IndexedArray in unflatten though: that's a decision that can be made there, and the backdoor in _packed could be much less general than an axis argument, it could be a is_shallow argument, saying to only pack at the top node level or the top dimension (always easier than trying to limit an operation to one axis somewhere deep within a structure). Such a thing might fix the error in unflatten without having to rewrite the array all the way down to the leaves of the tree.

agoose77 · 2021-06-14T14:39:32Z

I think it is caused by any layout that re-orders / resizes the internal array - unflatten introduces a new layout over the internal one, which uses offsets derived from the counts array. So, any layouts at a higher level that re-arrange things, e.g. ListArray, or ListOffsetArray with offsets[0]!=0 break things.

Because unflatten can be called for any depth, the "fix" needs to be able to apply to any depth. I think the fact that UnionArray's break the axis wrapping assumption may not be a problem ... because unflatten can't handle unions above the axis if my assumptions are correct.

* Test: add test for #910 * Bugfix: fix #910 with `ak.packed`

agoose77 added the bug (unverified) The problem described would be a bug, but needs to be triaged label Jun 11, 2021

agoose77 changed the title ~~ak.unflatten does not respect IndexedArray on given axis~~ ak.unflatten does not respect IndexedArray w.r.t counts s Jun 11, 2021

agoose77 changed the title ~~ak.unflatten does not respect IndexedArray w.r.t counts s~~ ak.unflatten does not respect IndexedArray w.r.t counts Jun 11, 2021

agoose77 closed this as completed Jun 11, 2021

agoose77 reopened this Jun 11, 2021

agoose77 mentioned this issue Jun 14, 2021

Feature: add ak.packed #912

Merged

13 tasks

agoose77 added a commit that referenced this issue Jun 15, 2021

Test: add test for #910

b8f81d0

agoose77 added a commit that referenced this issue Jun 15, 2021

Bugfix: fix #910 with ak.packed

7d41dfd

agoose77 mentioned this issue Jun 15, 2021

Bugfix: use ak.packed in ak.unflatten #922

Merged

agoose77 added bug The problem described is something that must be fixed and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Jun 15, 2021

jpivarski closed this as completed in #922 Jun 15, 2021

jpivarski pushed a commit that referenced this issue Jun 15, 2021

Bugfix: use ak.packed in ak.unflatten (#922)

d9eceac

* Test: add test for #910 * Bugfix: fix #910 with `ak.packed`

agoose77 mentioned this issue Apr 11, 2022

ListArray._getitem_next_jagged fails for slicecontent where starts[0] != 0 (v1 & v2) #1406

Closed

agoose77 mentioned this issue Jan 11, 2024

feat: add recursive flag to Content.to_packed #2933

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`ak.unflatten` does not respect `IndexedArray` w.r.t `counts` #910

`ak.unflatten` does not respect `IndexedArray` w.r.t `counts` #910

agoose77 commented Jun 11, 2021 •

edited

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021 •

edited

agoose77 commented Jun 11, 2021 •

edited

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021 •

edited

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021

agoose77 commented Jun 14, 2021

jpivarski commented Jun 14, 2021

agoose77 commented Jun 14, 2021 •

edited

ak.unflatten does not respect IndexedArray w.r.t counts #910

ak.unflatten does not respect IndexedArray w.r.t counts #910

Comments

agoose77 commented Jun 11, 2021 • edited

Reproducer 🐛

Same-depth IndexedArrays

Upper IndexedArrays

Cause 🔍

Same-depth IndexedArrays

Upper IndexedArrays

Solution 🔧

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021 • edited

agoose77 commented Jun 11, 2021 • edited

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021 • edited

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021

agoose77 commented Jun 14, 2021

jpivarski commented Jun 14, 2021

agoose77 commented Jun 14, 2021 • edited

`ak.unflatten` does not respect `IndexedArray` w.r.t `counts` #910

`ak.unflatten` does not respect `IndexedArray` w.r.t `counts` #910

agoose77 commented Jun 11, 2021 •

edited

Same-depth `IndexedArray`s

Upper `IndexedArray`s

Same-depth `IndexedArray`s

Upper `IndexedArray`s

agoose77 commented Jun 11, 2021 •

edited

agoose77 commented Jun 11, 2021 •

edited

agoose77 commented Jun 11, 2021 •

edited

agoose77 commented Jun 14, 2021 •

edited