IndexedArrays can be absorbed into a UnionArray's index (unless categorical) #2192

jpivarski · 2023-02-02T18:24:44Z

Version of Awkward Array

HEAD

Description and code to reproduce

Currently, UnionArrays are allowed to contain IndexedArrays:

array = ak.Array(
    ak.contents.UnionArray.simplified(
        ak.index.Index8([0, 0, 0, 1, 1, 1, 1, 1]),
        ak.index.Index64([0, 1, 2, 0, 1, 2, 3, 4]),
        [
            ak.from_iter(["one", "two", "three"], highlevel=False),
            ak.contents.IndexedArray(
                ak.index.Index64([4, 3, 2, 1, 0]),
                ak.from_iter([1.1, 2.2, 3.3, 4.4, 5.5], highlevel=False),
            )
        ],
    )
)

has layout

>>> array.layout
<UnionArray len='8'>
    <tags><Index dtype='int8' len='8'>[0 0 0 1 1 1 1 1]</Index></tags>
    <index><Index dtype='int64' len='8'>[0 1 2 0 1 2 3 4]</Index></index>
    <content index='0'>
        <ListOffsetArray len='3'>
            <parameter name='__array__'>'string'</parameter>
            <offsets><Index dtype='int64' len='4'>
                [ 0  3  6 11]
            </Index></offsets>
            <content><NumpyArray dtype='uint8' len='11'>
                <parameter name='__array__'>'char'</parameter>
                [111 110 101 116 119 111 116 104 114 101 101]
            </NumpyArray></content>
        </ListOffsetArray>
    </content>
    <content index='1'>
        <IndexedArray len='5'>
            <index><Index dtype='int64' len='5'>[4 3 2 1 0]</Index></index>
            <content><NumpyArray dtype='float64' len='5'>[1.1 2.2 3.3 4.4 5.5]</NumpyArray></content>
        </IndexedArray>
    </content>
</UnionArray>

But because of

https://github.com/scikit-hep/awkward/blob/0.1.15/docs/theory/arrays-are-functions.pdf
page 22‒23 https://indico.cern.ch/event/833895/contributions/3577882/
page 25 https://www.thestrangeloop.com/2019/jagged-ragged-awkward-arrays.html, starting at 18:24 in https://youtu.be/2NxWpU7NArk?t=1104

the application of the index[tags == tag] array in UnionArray is associative with the application of the index in the IndexedArray. The latter can be merged into the former with no loss of information, and then we're using less memory and performing less indirection in computations. (Is it enough to matter for any applications? Unclear, because we like to avoid UnionArrays in general, but it is a net win.)

This issue would be resolved by doing a final pass at the end of UnionArray.simplified to check for any IndexedArrays in contents, not IndexedOptionArrays, that are not categorical. (IndexedOptionArrays can't be merged into UnionArray.index because they have -1 for missing values, which UnionArray.index does not, and we don't want to apply this to categorical IndexedArrays because then we'd have to put the __array__: "categorical" parameter on the UnionArray, which downstream code is not expecting, and anyway it would lose information because if some of the UnionArray.contents were categorical and others not, we'd lose knowledge of which is which.)

If the outindex (index of the UnionArray that will be returned) is a new array, created inside the UnionArray.simplified function, it can be modified in place, something like

for tag, content in enumerate(outcontents):
    if isinstance(content, IndexedArray) and content.parameter("__array__") != "categorical":
        # only want to affect the part of the UnionArray.index for this tag
        selection = (outtags == tag)
        # function-composition of this part of the UnionArray.index with the IndexedArray.index
        outindex[selection] = content.index[outindex[selection]]
        # now we don't have an IndexedArray anymore
        outcontents[tag] = content.content

If not, then outindex can be copied before the first modification.

Then UnionArray.__init__ can forbid non-categorical IndexedArrays as contents. Even fewer combinations to worry about: yay!

We should consider the case of IndexedArray-of-RecordArray, which is created by _carry with allow_lazy=True. This would eliminate the IndexedArray that lazily carries the RecordArray, but the RecordArray would still be lazily carried (that is, the _carry would still not be propagated to all of the RecordArray's contents) because that laziness is in the UnionArray.index now. When/if the RecordArray ever gets projected out of this UnionArray, if that happens with allow_lazy=True, then it's still lazy (not any worse for having made this optimization). But if it gets projected out with allow_lazy=False, then that specific case could have a performance degradation due to this change. That seems hyperspecialized, though.

The text was updated successfully, but these errors were encountered:

agoose77 · 2023-06-15T19:33:03Z

@jpivarski I was looking at tackling and noticed the following:

UnionArray.simplified accepts a merge parameters that can be set to False
UnionArray._validity_error ensures that no two contents are mergeable

I believe we should therefore deprecate UnionArray.simplified(..., merge=), as merge=True is allowed, but merge=False will only work for non-mergeable contents.

jpivarski · 2023-06-16T16:29:51Z

Unless this is related to from_buffers's back door (because ArrayBuilder would build invalid layouts, so it has a back door to change the Form to a valid one)—unless this is related to that, it's a relic. We assume that any UnionArray that exists must be valid/must be merged.

Also, unless it's really mergebool, which is optional. UnionArrays are allowed to include both numerical and boolean contents without merging them. (We agree with NumPy that booleans are not integers. If only Python would agree...)

jpivarski added performance Works, but not fast enough or uses too much memory one-hour-fix It can probably be done in an hour labels Feb 2, 2023

agoose77 mentioned this issue Jun 16, 2023

feat!: canonicalise union of indexed #2527

Merged

agoose77 closed this as completed in #2527 Jun 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IndexedArrays can be absorbed into a UnionArray's index (unless categorical) #2192

IndexedArrays can be absorbed into a UnionArray's index (unless categorical) #2192

jpivarski commented Feb 2, 2023

agoose77 commented Jun 15, 2023

jpivarski commented Jun 16, 2023

IndexedArrays can be absorbed into a UnionArray's index (unless categorical) #2192

IndexedArrays can be absorbed into a UnionArray's index (unless categorical) #2192

Comments

jpivarski commented Feb 2, 2023

Version of Awkward Array

Description and code to reproduce

agoose77 commented Jun 15, 2023

jpivarski commented Jun 16, 2023