IndexedArrays can be absorbed into a UnionArray's index (unless categorical) #2192
Labels: one-hour-fix (It can probably be done in an hour), performance (Works, but not fast enough or uses too much memory)
Version of Awkward Array
HEAD
Description and code to reproduce
Currently, UnionArrays are allowed to contain IndexedArrays:
has layout
But because of
the application of the `index[tags == tag]` array in UnionArray is associative with the application of the `index` in the IndexedArray. The latter can be merged into the former with no loss of information, and then we're using less memory and performing less indirection in computations. (Is it enough to matter for any applications? Unclear, because we like to avoid UnionArrays in general, but it is a net win.)

This issue would be resolved by doing a final pass at the end of `UnionArray.simplified` to check for any IndexedArrays in `contents`, not IndexedOptionArrays, that are not categorical. (IndexedOptionArrays can't be merged into `UnionArray.index` because they have `-1` for missing values, which `UnionArray.index` does not, and we don't want to apply this to categorical IndexedArrays because then we'd have to put the `__array__: "categorical"` parameter on the UnionArray, which downstream code is not expecting, and anyway it would lose information because if some of the `UnionArray.contents` were categorical and others not, we'd lose knowledge of which is which.)

If the `outindex` (`index` of the UnionArray that will be returned) is a new array, created inside the `UnionArray.simplified` function, it can be modified in place. If not, then `outindex` can be copied before the first modification.

Then `UnionArray.__init__` can forbid non-categorical IndexedArrays as `contents`. Even fewer combinations to worry about: yay!

We should consider the case of IndexedArray-of-RecordArray, which is created by `_carry` with `allow_lazy=True`. This would eliminate the IndexedArray that lazily carries the RecordArray, but the RecordArray would still be lazily carried (that is, the `_carry` would still not be propagated to all of the RecordArray's `contents`) because that laziness is in the `UnionArray.index` now. When/if the RecordArray ever gets projected out of this UnionArray, if that happens with `allow_lazy=True`, then it's still lazy (not any worse for having made this optimization). But if it gets projected out with `allow_lazy=False`, then that specific case could have a performance degradation due to this change. That seems hyperspecialized, though.
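
To make the index merging described above concrete, here is a minimal sketch that models the idea with plain NumPy arrays instead of Awkward's real layout classes. The `Indexed` class and the `project` and `absorb_indexed` functions are hypothetical stand-ins invented for this illustration; they are not part of Awkward Array, and the real change would presumably live inside `UnionArray.simplified`. The sketch assumes every `Indexed` content is non-categorical and has no `-1` entries.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Indexed:
    """Hypothetical stand-in for a non-categorical IndexedArray."""

    index: np.ndarray    # positions into `content`; assumed to contain no -1
    content: np.ndarray


def project(tags, index, contents, tag):
    """Pick out the elements belonging to one content, in union order."""
    picked = index[tags == tag]
    content = contents[tag]
    if isinstance(content, Indexed):
        # Two levels of indirection: union index first, then the inner index.
        return content.content[content.index[picked]]
    return content[picked]


def absorb_indexed(tags, index, contents):
    """Fold each Indexed content's inner index into the union's index."""
    # Copy so the caller's index is untouched; if the index were freshly
    # created by the caller, it could be modified in place instead.
    outindex = index.copy()
    outcontents = []
    for tag, content in enumerate(contents):
        if isinstance(content, Indexed):
            mask = tags == tag
            # Compose the two index applications into a single one.
            outindex[mask] = content.index[outindex[mask]]
            outcontents.append(content.content)
        else:
            outcontents.append(content)
    return outindex, outcontents


tags = np.array([0, 1, 1, 0, 1], dtype=np.int8)
index = np.array([0, 0, 1, 1, 2], dtype=np.int64)
contents = [
    np.array([1.1, 2.2]),
    Indexed(index=np.array([2, 0, 1], dtype=np.int64),
            content=np.array([10, 20, 30])),
]

outindex, outcontents = absorb_indexed(tags, index, contents)

# Every projection gives the same values before and after the merge.
for tag in range(len(contents)):
    before = project(tags, index, contents, tag)
    after = project(tags, outindex, outcontents, tag)
    assert np.array_equal(before, after)
```

After the pass, no content in `outcontents` is wrapped in `Indexed`, so projecting a content needs only one index lookup instead of two. The unconditional `index.copy()` corresponds to the "copy before the first modification" case; skipping it would be safe only when the index is known to be a new array.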