Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

ak.unflatten does not respect IndexedArray w.r.t counts #910

Closed
agoose77 opened this issue Jun 11, 2021 · 10 comments 路 Fixed by #922
Closed

ak.unflatten does not respect IndexedArray w.r.t counts #910

agoose77 opened this issue Jun 11, 2021 · 10 comments 路 Fixed by #922
Labels
bug The problem described is something that must be fixed

Comments

@agoose77
Copy link
Collaborator

agoose77 commented Jun 11, 2021

Reproducer 馃悰

Same-depth IndexedArrays

Given the following layout:

layout = ak.layout.IndexedArray64(
    ak.layout.Index64(np.array([3, 1, 0, 2])),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64(np.array([0, 3, 6, 9, 12])),
        ak.layout.NumpyArray(np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
    ),
)

which has the list representations

>>> display(
...     ak.to_list(layout),
...     ak.to_list(layout.content),
...     ak.to_list(layout.content.content),
... )
[[3, 3, 3], [1, 1, 1], [0, 0, 0], [2, 2, 3]]
[[0, 0, 0], [1, 1, 1], [2, 2, 3], [3, 3, 3]]
[0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3]

the act of unflattening by the run lengths produces

>>> ak.unflatten(
...     layout,
...     ak.flatten(ak.run_lengths(layout)),
...     axis=1
... ).tolist()
[[[3, 3], [3]], [[1, 1, 1]], [[0, 0, 0]], [[2, 2, 3]]]

If we apply run_lengths to the ListOffsetArray layout, then the result is "correct":

>>> ak.unflatten(
...     layout,
...     ak.flatten(ak.run_lengths(layout.content)),
...     axis=1
... ).tolist()
[[[3, 3, 3]], [[1, 1, 1]], [[0, 0, 0]], [[2, 2], [3]]]

Upper IndexedArrays

This is not just true for thee same-depth index layouts, but also,any layout at any depth above the current depth. Consider this layout:

layout = ak.layout.IndexedArray64(
    ak.layout.Index64([1, 0]),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64([0, 2, 4]),
        ak.layout.ListOffsetArray64(
            ak.layout.Index64(([0, 3, 6, 9, 12])),
            ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
        ),
    ),
)

which has

>>> display(
...     ak.to_list(layout),
...     ak.to_list(layout.content),
...     ak.to_list(layout.content.content),
...     ak.to_list(layout.content.content.content),
... )
[[[2, 2, 3], [3, 3, 3]], [[0, 0, 0], [1, 1, 1]]]
[[[0, 0, 0], [1, 1, 1]], [[2, 2, 3], [3, 3, 3]]]
[[0, 0, 0], [1, 1, 1], [2, 2, 3], [3, 3, 3]]
[0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3]

Its run lengths are

>>> ak.run_lengths(layout).to_list()
[[[2, 1], [3]], [[3], [3]]]

Unflattening as before, we have

>>> ak.unflatten(layout, ak.flatten(ak.run_lengths(layout), axis=None), axis=-1).tolist()
[[[[2, 2, 3]], [[3, 3, 3]]], [[[0, 0], [0]], [[1, 1, 1]]]]

instead of

[[[[2, 2], [3]], [[3, 3, 3]]], [[[0, 0, 0]], [[1, 1, 1]]]]

My expectation is that these public APIs should respect the depth-preserving (IndexedArray) layouts.

Cause 馃攳

The cause is simply that we don't transform the counts array with respect to preceding layouts.

Same-depth IndexedArrays

ak.unflatten uses recursively_apply to find the layout corresponds to the axis above the location indicated by the user. When there are ak.layout.IndexedArray layouts at the same depth, we move past them until we find a list type. https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/operations/structure.py#L2093
e.g. walking the above example

Layout Depth List Type?
IndexedArray 1 False
ListOffsetArray 1 True

Upper IndexedArrays

Effectively the getfunction skips over these layouts until the correct depth is reached.

Solution 馃敡

I think we need to aggregate the non list-type layouts that precede the current list-type layout (that getfunction operates upon) and create a new IndexedArray layout that wraps the current layout's content. Then, we create the ListOffsetArray over this layout, and restore the structural layouts above.

@agoose77 agoose77 added the bug (unverified) The problem described would be a bug, but needs to be triaged label Jun 11, 2021
@agoose77 agoose77 changed the title ak.unflatten does not respect IndexedArray on given axis ak.unflatten does not respect IndexedArray w.r.t counts s Jun 11, 2021
@agoose77 agoose77 changed the title ak.unflatten does not respect IndexedArray w.r.t counts s ak.unflatten does not respect IndexedArray w.r.t counts Jun 11, 2021
@jpivarski
Copy link
Member

I think you may be right, though I'm confused by the examples that use the layout itself (through run_lengths) to generate counts.

The one high-level point I can make is that IndexedArrays must be invisible at high-level. It's essentially applying a lazy "carry" (non-negative integer array slice) at some point in the tree, and whatever happens, such as unflatten, should not depend on whether the "carry" is lazily applied or eagerly applied.

The intended implementation would have eagerly applied the "carry" during the first recursive descent. If a new option, off by default, needs to be added to ak.util.recursively_apply to implement that, that's okay.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 11, 2021

Hi Jim, thanks for chiming in. The run_lengths on content just skips the outer IndexedArray. This works because the IndexedArray doesn't modify the ordering within sublists. Usually the user would be running this on the actual Array, but I was being a bit lazy.

I agree that unflatten should not care about the IndexedArray vs carry, and indeed run_lengths behaves as expected - it respects all of the layouts. It's just the current implementation for unflatten that doesn't.

I've been playing around with a kind of simplification that merges successive layouts together. For example

layout = ak.layout.IndexedArray64(
    ak.layout.Index64([1, 0]),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64([0, 2, 4]),
        ak.layout.ListOffsetArray64(
            ak.layout.Index64([0, 3, 7, 9, 12]),
            ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
        ),
    ),
)

can be simplified top down (once) to become (via project)

layout1 = ak.layout.ListArray64(
    ak.layout.Index64([2, 0]),
    ak.layout.Index64([4, 2]),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64([0, 3, 7, 9, 12]),
        ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
    ),
)

and

layout2 = ak.layout.ListOffsetArray64(
    ak.layout.Index64([0, 2, 4]),
    ak.layout.ListOffsetArray64(
        ak.layout.Index64([0, 2, 5, 8, 12]),
        ak.layout.IndexedArray64(
            ak.layout.Index64([7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6]),
            ak.layout.NumpyArray(([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3])),
        ),
    ),
)

I don't know whether this approach is the "right" one - it avoids copying the contents (which is good for a record array), but involves creating a number of intermediate layouts.

@agoose77 agoose77 reopened this Jun 11, 2021
@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 11, 2021

So here's an incomplete simplifier that handles this particular layout (and makes a copy whilst it's at it). It's probably wrong e.g. with the assertion, but it's a PoC rather than production code. I don't know whether this approach is the best one. The idea here is that by moving all reordering to the inner-most level, then the flattened counts will correlate to the layout. One way to do this is:

  • Move all IndexedArray routines to the inner buffer
  • Change ListArrays to ListOffsetArrays.
def replace_content(layout, content):
    def getfunction(this, depth):
        if this is layout:
            return
        return lambda: content

    return ak._util.recursively_apply(layout, getfunction)


def simplify(layout):
    if hasattr(layout, "project"):
        return simplify(layout.project())

    # Now we don't have an IndexedArray
    if not hasattr(layout, "content"):
        return layout

    # We now have only listtypes
    assert isinstance(layout, ak._util.listtypes)

    # ListArrays can be converted to another case (reduce dimension of problem)
    if isinstance(layout, ak.layout.ListArray64):
        layout = layout.toListOffsetArray64(True)
        return simplify(layout)

    # When arrays are
    # - structural
    # - contiguous
    # then we can't optimise any further, and should move on
    if isinstance(layout, (ak.layout.ListOffsetArray64, ak.layout.RegularArray)):
        return replace_content(layout, simplify(layout.content))
    
    raise NotImplementedError

You know much better what is going on under the hood with respect to copying, RecordArrays, etc. Do you think this approach is the right one here? Is this kind of multi-layout modification possible w.r.t all of the possible layouts (including partitioned arrays)?

@jpivarski
Copy link
Member

We shouldn't simplify more than we need to. For example, replacing ListArrays with ListOffsetArrays doesn't have anything to do with the current problem, and ListArrays exist to avoid excessive reordering.

I think your current problem with unflatten is because the non-list nodes directly above the node whose length changes need to be simplified: e.g. IndexedArrays at this level only need to be projected.

It's better to make a chain of elif isinstance(layout, node_classes) than to test for the existence of attributes like hasattr(layout, "project"). IndexedOptionArray also has a project method and it doesn't do what you want: it removes None values.

Unfortunately for making things local and factorizable, implementing any method generally requires the author to know about all node types. OOP is usually against that, but there was no way to avoid it and do what Awkward Array does. This is the Nth iteration on the set of node types and very final鈥攖he only possible new node type is RedirectArray (#178), but that may be unnecessary now that the C++ is being translated into Python (thanks to the garbage collector).

So this problem can be fixed by considering each of the possible node types just above the node that changes its length: IndexedArrays need to be projected, etc.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 11, 2021

We shouldn't simplify more than we need to. For example, replacing ListArrays with ListOffsetArrays doesn't have anything to do with the current problem, and ListArrays exist to avoid excessive reordering.

Yes, in my PoC I just convert things to reduce the number of combinations to consider.

So this problem can be fixed by considering each of the possible node types just above the node that changes its length: IndexedArrays need to be projected, etc.

For posterity, I'll add that it's both length and ordering that we need to handle here.

I think your current problem with unflatten is because the non-list nodes directly above the node whose length changes need to be simplified: e.g. IndexedArrays at this level only need to be projected.

@jpivarski that's true - we can just project the IndexedArrays because the buffers below the axis depth won't be at risk of being copied.

It's better to make a chain of elif isinstance(layout, node_classes) than to test for the existence of attributes like hasattr(layout, "project"). IndexedOptionArray also has a project method and it doesn't do what you want: it removes None values.

Noted, thanks. Old habits die hard I suppose :) I noticed while writing the visitor that it would be useful if the apply method of recursively_apply were available to call by getfunction, in the case where it wants to operate at multiple depths sequentially. Perhaps it should be an optional argument to getfunction?

@jpivarski
Copy link
Member

I noticed while writing the visitor that it would be useful if the apply method of recursively_apply were available to call by getfunction, in the case where it wants to operate at multiple depths sequentially. Perhaps it should be an optional argument to getfunction?

Maybe I'm thinking of ak._util.broadcast_and_apply, but I think there's a way to pass down user data. I used that elsewhere to make it applicable at multiple depths. There were other cases (ak.combinations) in which we wanted to act at multiple levels and I did it through getfunctions that launch new ak._util.broadcast_and_apply, though that got complicated fast.

ak._util.broadcast_and_apply and ak._util.recursively_apply are purely internal functions. We can add whatever we need to them to handle new use-cases, though it has to not break existing users (within our codebase) of these functions. That is, the API can be ugly, but it has to keep working.

@agoose77
Copy link
Collaborator Author

Yes, there is a pass_user which is currently used for the axis parameter. I will add a pass_apply argument to recursively_apply so that the getfunction can call the implementation for sub-trees. It will be False by default to ensure that existing use cases are not broken!

@agoose77
Copy link
Collaborator Author

@jpivarski once #912 is ready to go, what do you think we should do here? Might feeling is that unless you actively care about what's going on under the hood, the average user should not need to be aware that unflatten might produce the wrong results if their array has experienced a particular set of transformations. This leads me to conclude that packed should be called by unflatten, right up to the axis that is being unflattened. This does make unflatten more expensive, but I think that's an inevitability rather than an implementation issue.

@agoose77 agoose77 mentioned this issue Jun 14, 2021
13 tasks
@jpivarski
Copy link
Member

Possibly producing wrong results is always bad. If unflatten has to become more expensive (doing a _packed pass on it) to ensure that it is correct, then that's what needs to be done. It possibly only needs to be done if you see an IndexedArray in unflatten though: that's a decision that can be made there, and the backdoor in _packed could be much less general than an axis argument, it could be a is_shallow argument, saying to only pack at the top node level or the top dimension (always easier than trying to limit an operation to one axis somewhere deep within a structure). Such a thing might fix the error in unflatten without having to rewrite the array all the way down to the leaves of the tree.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 14, 2021

I think it is caused by any layout that re-orders / resizes the internal array - unflatten introduces a new layout over the internal one, which uses offsets derived from the counts array. So, any layouts at a higher level that re-arrange things, e.g. ListArray, or ListOffsetArray with offsets[0]!=0 break things.

Because unflatten can be called for any depth, the "fix" needs to be able to apply to any depth. I think the fact that UnionArray's break the axis wrapping assumption may not be a problem ... because unflatten can't handle unions above the axis if my assumptions are correct.

agoose77 added a commit that referenced this issue Jun 15, 2021
agoose77 added a commit that referenced this issue Jun 15, 2021
@agoose77 agoose77 added bug The problem described is something that must be fixed and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Jun 15, 2021
jpivarski pushed a commit that referenced this issue Jun 15, 2021
* Test: add test for #910

* Bugfix: fix #910 with `ak.packed`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The problem described is something that must be fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants