Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndexedArrays can be absorbed into a UnionArray's index (unless categorical) #2192

Closed
jpivarski opened this issue Feb 2, 2023 · 2 comments · Fixed by #2527
Closed

IndexedArrays can be absorbed into a UnionArray's index (unless categorical) #2192

jpivarski opened this issue Feb 2, 2023 · 2 comments · Fixed by #2527
Labels
one-hour-fix It can probably be done in an hour performance Works, but not fast enough or uses too much memory

Comments

@jpivarski
Copy link
Member

Version of Awkward Array

HEAD

Description and code to reproduce

Currently, UnionArrays are allowed to contain IndexedArrays:

array = ak.Array(
    ak.contents.UnionArray.simplified(
        ak.index.Index8([0, 0, 0, 1, 1, 1, 1, 1]),
        ak.index.Index64([0, 1, 2, 0, 1, 2, 3, 4]),
        [
            ak.from_iter(["one", "two", "three"], highlevel=False),
            ak.contents.IndexedArray(
                ak.index.Index64([4, 3, 2, 1, 0]),
                ak.from_iter([1.1, 2.2, 3.3, 4.4, 5.5], highlevel=False),
            )
        ],
    )
)

has layout

>>> array.layout
<UnionArray len='8'>
    <tags><Index dtype='int8' len='8'>[0 0 0 1 1 1 1 1]</Index></tags>
    <index><Index dtype='int64' len='8'>[0 1 2 0 1 2 3 4]</Index></index>
    <content index='0'>
        <ListOffsetArray len='3'>
            <parameter name='__array__'>'string'</parameter>
            <offsets><Index dtype='int64' len='4'>
                [ 0  3  6 11]
            </Index></offsets>
            <content><NumpyArray dtype='uint8' len='11'>
                <parameter name='__array__'>'char'</parameter>
                [111 110 101 116 119 111 116 104 114 101 101]
            </NumpyArray></content>
        </ListOffsetArray>
    </content>
    <content index='1'>
        <IndexedArray len='5'>
            <index><Index dtype='int64' len='5'>[4 3 2 1 0]</Index></index>
            <content><NumpyArray dtype='float64' len='5'>[1.1 2.2 3.3 4.4 5.5]</NumpyArray></content>
        </IndexedArray>
    </content>
</UnionArray>

But because of

the application of the index[tags == tag] array in UnionArray is associative with the application of the index in the IndexedArray. The latter can be merged into the former with no loss of information, and then we're using less memory and performing less indirection in computations. (Is it enough to matter for any applications? Unclear, because we like to avoid UnionArrays in general, but it is a net win.)

This issue would be resolved by doing a final pass at the end of UnionArray.simplified to check for any IndexedArrays in contents, not IndexedOptionArrays, that are not categorical. (IndexedOptionArrays can't be merged into UnionArray.index because they have -1 for missing values, which UnionArray.index does not, and we don't want to apply this to categorical IndexedArrays because then we'd have to put the __array__: "categorical" parameter on the UnionArray, which downstream code is not expecting, and anyway it would lose information because if some of the UnionArray.contents were categorical and others not, we'd lose knowledge of which is which.)

If the outindex (index of the UnionArray that will be returned) is a new array, created inside the UnionArray.simplified function, it can be modified in place, something like

for tag, content in enumerate(outcontents):
    if isinstance(content, IndexedArray) and content.parameter("__array__") != "categorical":
        # only want to affect the part of the UnionArray.index for this tag
        selection = (outtags == tag)
        # function-composition of this part of the UnionArray.index with the IndexedArray.index
        outindex[selection] = content.index[outindex[selection]]
        # now we don't have an IndexedArray anymore
        outcontents[tag] = content.content

If not, then outindex can be copied before the first modification.

Then UnionArray.__init__ can forbid non-categorical IndexedArrays as contents. Even fewer combinations to worry about: yay!

We should consider the case of IndexedArray-of-RecordArray, which is created by _carry with allow_lazy=True. This would eliminate the IndexedArray that lazily carries the RecordArray, but the RecordArray would still be lazily carried (that is, the _carry would still not be propagated to all of the RecordArray's contents) because that laziness is in the UnionArray.index now. When/if the RecordArray ever gets projected out of this UnionArray, if that happens with allow_lazy=True, then it's still lazy (not any worse for having made this optimization). But if it gets projected out with allow_lazy=False, then that specific case could have a performance degradation due to this change. That seems hyperspecialized, though.

@jpivarski jpivarski added performance Works, but not fast enough or uses too much memory one-hour-fix It can probably be done in an hour labels Feb 2, 2023
@agoose77
Copy link
Collaborator

@jpivarski I was looking at tackling and noticed the following:

  • UnionArray.simplified accepts a merge parameters that can be set to False
  • UnionArray._validity_error ensures that no two contents are mergeable

I believe we should therefore deprecate UnionArray.simplified(..., merge=), as merge=True is allowed, but merge=False will only work for non-mergeable contents.

@jpivarski
Copy link
Member Author

Unless this is related to from_buffers's back door (because ArrayBuilder would build invalid layouts, so it has a back door to change the Form to a valid one)—unless this is related to that, it's a relic. We assume that any UnionArray that exists must be valid/must be merged.

Also, unless it's really mergebool, which is optional. UnionArrays are allowed to include both numerical and boolean contents without merging them. (We agree with NumPy that booleans are not integers. If only Python would agree...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
one-hour-fix It can probably be done in an hour performance Works, but not fast enough or uses too much memory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants