Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: add after_option argument to ak.zip #1308

Merged
merged 5 commits into from
Feb 23, 2022
Merged

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Feb 23, 2022

This broadcasts any option types at the depth limit, and forms the RecordArray layout node below the result.

Effectively, this produces ?(int64, float64) instead of (?int64, ?float64)

This feature is particularly useful when rezipping RecordArrays with ?{...} types. Without it, after zipping the contents of a RecordArray back together, we would end up with {?..., ?...} types.

P.S. I'm not sure on the naming of the argument to ak.zip. Feel free to suggest something clearer / shorter :)

This broadcasts any option types at the depth limit,
and forms the RecordArray layout node below the result.

Effectively, this produces `?(int64, float64)` instead
of `(?int64, ?float64)`
@codecov
Copy link

codecov bot commented Feb 23, 2022

Codecov Report

Merging #1308 (38c2fc1) into main (5613d44) will decrease coverage by 2.01%.
The diff coverage is 38.17%.

Impacted Files Coverage Δ
src/awkward/_v2/_connect/cling.py 0.00% <0.00%> (ø)
src/awkward/_v2/contents/listoffsetarray.py 80.55% <ø> (+0.13%) ⬆️
src/awkward/_v2/_lookup.py 97.50% <97.50%> (ø)
src/awkward/_v2/__init__.py 100.00% <100.00%> (ø)
src/awkward/_v2/_connect/numba/arrayview.py 96.76% <100.00%> (+0.51%) ⬆️
src/awkward/_v2/_connect/numba/layout.py 84.34% <100.00%> (-2.67%) ⬇️
src/awkward/_v2/operations/structure/ak_argsort.py 100.00% <100.00%> (+25.00%) ⬆️
src/awkward/_v2/operations/structure/ak_zip.py 89.58% <100.00%> (+0.45%) ⬆️
... and 8 more

@agoose77 agoose77 marked this pull request as ready for review February 23, 2022 10:29
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all is great, except maybe for the name. after_option implies a temporal sequence, which makes sense to people who know about how it's implemented—that the control flow starts at the root of the tree and walks downward, so "after" means "deep" or "nested". What this is doing is sharing the option-type among the merged arrays, so maybe "shared"?

How about this very explicit name: optiontype_outside_record? This function always creates a record/tuple at some level, so we know which record it's talking about. Including the "type" in "optiontype" disambiguates other possible uses of "option", which can also suggest user choices, nothing to do with a type in which some values can be null. So far, this is my best suggestion.

I'm leaving this open so that you can pick another name, but once you've done that, you can squash-and-merge it. The implementation is great.

@agoose77
Copy link
Collaborator Author

implies a temporal sequence
Agreed, this was my main concern with the name. If it occurs to you too, then I think it's clearly a bad name.

I like the explicitness of optiontype_outside_record, though I'm unsure about how verbose it is. I suppose there isn't really a better name that comes to mind, so I'm going to squash and merge it once I've made the necessary changes and copied the implementation (subject to changes) to v2

@agoose77 agoose77 enabled auto-merge (squash) February 23, 2022 20:30
@jpivarski
Copy link
Member

Long argument and function names is a cultural thing, maybe coming from languages that don't have modularization and namespaces in class types. In particular, I'm thinking about function names in Emacs Lisp which-are-often-six-words-long! For years, this appalled me, but I've started using it more, particularly in Uproot 4. Any identifiers with connectives like "_and_" in them are guilty of this. But maybe it's alright, especially if there's tab-complete...

@agoose77
Copy link
Collaborator Author

Long argument and function names is a cultural thing, maybe coming from languages that don't have modularization and namespaces in class types. In particular, I'm thinking about function names in Emacs Lisp which-are-often-six-words-long! For years, this appalled me, but I've started using it more, particularly in Uproot 4. Any identifiers with connectives like "_and_" in them are guilty of this. But maybe it's alright, especially if there's tab-complete...

I'm generally in favour of this - I prefer descriptive names given that we mainly read code rather than write it. Exploratory code is perhaps an exception to this rule but I can also think of cases where short parameters are just meaningless without reading the docs. 🤷

@agoose77 agoose77 merged commit 5e4e4b9 into main Feb 23, 2022
@agoose77 agoose77 deleted the feat-zip-after-option branch February 23, 2022 21:03
@jpivarski
Copy link
Member

This empty file in the tests directory was probably unintended, right?

https://github.com/scikit-hep/awkward-1.0/blob/38c2fc1fdd50d7ea3b56fa49702f19517fed43b7/tests/.tmpHaXt1g

@agoose77
Copy link
Collaborator Author

agoose77 commented Feb 23, 2022

Er, I didn't see that in the commit; I'm not sure how I managed to do that! Would you prefer a revert or separate commit?

@jpivarski
Copy link
Member

No reason to do that! I'll just remove it in #1312. It can go under the category of "general cleanup."

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 1, 2022

@jpivarski digging this up again, I realised that I didn't handle the case of IndexedArray(..., IndexedOptionArray), which we definitely want to support. 🤦

I think the simplest solution here is to replace

if optiontype_outside_record and any(
    isinstance(x, ak._util.optiontypes) for x in inputs
):
    return None

with

if optiontype_outside_record and any(
    isinstance(x, ak._util.optiontypes + ak._util.indexedtypes) for x in inputs
):
    return None

But, this makes me wonder if there's more to do here - what about Union types? Does this warrant another flag uniontype_outside_record?

if optiontype_outside_record and any(
    isinstance(x, ak._util.optiontypes + ak._util.indexedtypes) for x in inputs
):
    return None
if uniontype_outside_record and any(
    isinstance(x, ak._util.uniontype + ak._util.indexedtypes) for x in inputs
):
    return None

In other words, we treat indexedtypes as invisible, which aligns with their behaviour in most (all) user-facing routines.

Perhaps the most explicit behaviour is

if optiontype_outside_record and not all(
    isinstance(x, ak._util.listtypes + (ak.layout.NumpyArray,)) for x in inputs
):
    return None

But this aggregates both options under a single flag, and also would mean reverting this PR. Thoughts?

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 1, 2022

Furthermore, what is our policy on non-functional unions? E.g. union[int64, int64]?
I would perhaps imagine that we want to support these being passed in as input, and defensively simplify them before testing for UnionArray.

In counterpoint to this, we do show them in the type string, e.g.

>>> array = ak.Array(
...     ak.layout.UnionArray8_64(
...         ak.layout.Index8(np.r_[0, 0, 1, 1]),
...         ak.layout.Index64(np.r_[0, 1, 0, 1]),
...         [
...             ak.layout.NumpyArray(np.r_[0,1,2,3]),
...             ak.layout.NumpyArray(np.r_[0,1,2,3]*2),
...         ]
...     
...     )
... )
>>> array.type
4 * union[int64, int64]

So I think this is really a policy question

@jpivarski
Copy link
Member

I realised that I didn't handle the case of IndexedArray(..., IndexedOptionArray), which we definitely want to support.

IndexedArray and IndexedOptionArray shouldn't be nested directly within each other. If you're encountering that, then there's a missing simplify_optiontype somewhere.

Do you mean IndexedArray → ListArray → IndexedOptionArray or similar?

Or do you mean that IndexedArrays should be considered equivalent to IndexedOptionArray when deciding whether to ak.zip through them? IndexedArrays are supposed to be invisible: it's a Form-level detail, not a Type-level detail. So it would be acceptable to not have any user option for this and always project the IndexedArrays away.

Why not decide whether to project them or collect a common IndexedArray around a RecordArray if all the fields being zipped have the same IndexedArray? Because then, the Form would depend on values observed in data and type-tracers wouldn't be able to predict Forms for Dask. So, whereas I might have considered such an optimization in the past, we can't do that sort of thing anymore because we need Forms to be predictable without looking at data. (For the same reason that compilers make such a distinction between types and values.)

@jpivarski
Copy link
Member

But, this makes me wonder if there's more to do here - what about Union types? Does this warrant another flag uniontype_outside_record?

We don't need a lot of support for manipulating unions, only as much as users ask for. They turn out to be very limited in how they can be used. It might have been a reasonable choice at the beginning of Awkward Array to have ignored them entirely.

Unions and records are the two types that can have lots of children. Thinking along these lines would probably make the combinatorics explode. Let's just leave the unions-in-ak.zip behavior naive unless forced otherwise.

@jpivarski
Copy link
Member

Unions whose children are mergeable are not fully valid, in the sense that our functions should not be returning such things to users. If a function is returning such a thing, then it's missing a simplify_uniontype call.

>>> array.layout
<UnionArray8_64>
    <tags><Index8 i="[0 0 1 1]" offset="0" length="4" at="0x55d886a12130"/></tags>
    <index><Index64 i="[0 1 0 1]" offset="0" length="4" at="0x55d887155a80"/></index>
    <content tag="0">
        <NumpyArray format="l" shape="4" data="0 1 2 3" at="0x55d886f9d370"/>
    </content>
    <content tag="1">
        <NumpyArray format="l" shape="4" data="0 2 4 6" at="0x55d886dc6f30"/>
    </content>
</UnionArray8_64>
>>> array.layout.simplify()
<NumpyArray format="l" shape="4" data="0 1 0 2" at="0x55d886f8c870"/>

After flip-flopping around on this question in the early days, I've decided it's better to leave expressions like union[int64, int64] in the type string because it helps us spot these cases where the simplify_uniontype is missing.

Why allow unions whose children are mergeable to exist at all? They're needed as intermediate states in calculations. If the UnionArray constructor raised an error when it sees them, there are some operations that would be difficult or impossible to implement. The same is true of simplify_optiontype and the set {IndexedArray, IndexedOptionArray, ByteMaskedArray, BitMaskedArray, UnmaskedArray}.

@jpivarski
Copy link
Member

Is the IndexedArray change significant enough to stop the 1.8.0 release? If this were going into v2, I would say, "no," but it's affecting v1, which people are using. Does it introduce the wrong behavior, which we would then have to backtrack?

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 1, 2022

Thanks for the detailed replies!

IndexedArray and IndexedOptionArray shouldn't be nested directly within each other. If you're encountering that, then there's a missing simplify_optiontype somewhere.

I think this gets to the crux of the issue - do we want to support "quirky" user-crafted layouts? I can construct IndexedArray→IndexedOptionArray, so do we want to support that, or blame the user?

If we do want to support "quirky" layouts, then yes, we could also simplify_optiontype, although in this instance I think it's simpler to just broadcast through the indexed type, and let the Awkward code handle this.

Why not decide whether to project them or collect a common IndexedArray around a RecordArray if all the fields being zipped have the same IndexedArray?

Yes, agreed. That's too brittle! And, for posterity, I haven't proposed it :)

Is the IndexedArray change significant enough to stop the 1.8.0 release?

w.r.t the need to test indexedtypes, I think the current state of master is probably fine if we assume that Awkward doesn't produce these quirky layouts. However, I also didn't handle virtual arrays (I initially thought about this w.r.t v2, so I didn't consider them). Given that we want to support these, I'll need to materialise any virtual arrays too.

@jpivarski
Copy link
Member

Well, if the user is constructing things with ak.layout.*, then yeah. They could also make an IndexedArray with negative index values or a ListArray with stops that precede starts. Maybe we need better documentation (e.g. a nice view-graphic) on which parts are more internal than others: the high-level tier for data analysts (ak.Array and ak.whatever), the mid-level tier for downstream libraries (ak.Array.layout and ak.layout.*), and the low-level tier for Awkward developers (anything starting with an underscore, kernels, etc.). If there were only two tiers, then underscore-prefixing would be enough.

We're not supporting quirky layouts in the sense of doing reasonable things if we encounter them. This is also related to the "inside Awkward/outside Awkward" boundary in #1327. It somehow needs to be a less permeable boundary than there is between any other function calls, though I don't think any language has such a concept built-in. In addition to highlighting that edge in the stack trace, we'd also like to allow "semi-invalid" IndexedArray states inside and not outside.

Maybe that's something the ak.Array wrapper could do: when an ak.Array is constructed, it could crawl over the whole layout and verify that none of the nodes are "semi-invalid," but that's probably too much processing for something as common as wrapping a layout with the high-level layer.

Maybe instead, layouts could note whether they're "semi-invalid" when they're constructed—just an extra boolean for each node—and ak.Array should complain if the top node has such a flag set? (That's something we could only consider in v2, since it would involve a lot of extra C++ code in v1.)

Actually, the closest to this concept of "temporarily invalid" that I've seen is in Clojure, which is a strictly functional language that allows temporarily mutable data within the scope of a function.

I think the current state of master is probably fine if we assume that Awkward doesn't produce these quirky layouts.

Okay, we do assume (not just here) that there are no quirky/semi-invalid layouts. So I'll continue with the 1.8.0 release.

Virtual arrays are another thing. You know, Awkward 0.x had more node types that caused even more troubles than these. It's a slow whittling process of finding a set that play nicely with each other.

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 1, 2022

Virtual arrays are another thing. You know, Awkward 0.x had more node types that caused even more troubles than these. It's a slow whittling process of finding a set that play nicely with each other.

I can only imagine.

Virtual arrays are another thing.

Does this mean that we just ignore virtual arrays here? The user will only notice this is they pass this new flag and all option-containing arrays are virtual.

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 1, 2022

This is also related to the "inside Awkward/outside Awkward" boundary

Yes, it sounds like we're discussing safe and unsafe layouts which, for the most part, corresponds to "inside" or "outside" awkward. I think something like this would be quite useful actually. Given that layouts are immutable, we could just use a flag to determine whether the layout was generated by Awkward or by a user. In the latter case, we would need to validate the structure (like packing, but not so concerned about the packing part) whenever Awkward touches such a layout.

Is this what you're thinking of?

@jpivarski
Copy link
Member

Does this mean that we just ignore virtual arrays here?

Ignoring them probably means they'll be materialized, right? That was always the problem with VirtualArrays, that the least thing would cause them to materialize unintentionally.

This would only come up if someone uses the new optiontype_outside_record argument and has VirtualArrays, and if something actually failed in this case, we'd consider that as a bug report. The thing is, it's not going to cause someone's currently working workflow to stop working because their Awkward version updated.

Given that layouts are immutable, we could just use a flag to determine whether the layout was generated by Awkward or by a user.

Right: the value of the flag at the top node tells you whether there are any interior nodes that are not correctly constructed, regardless of whether it was made by us (a function returns without calling simplify_optiontype/simplify_uniontype) or a user. The flag turns an O(log n) time check (where n is the number of nodes in the tree) into an O(1) check by adding O(n) more flags everywhere: trading time for memory (neither of which scales with the number of elements in an array). With such a thing, we'd find more of our simplify_optiontype/simplify_uniontype bugs and inform users that they're constructing invalid arrays.

Every option-type and IndexedArray constructor would contain:

    self._quirky = content._quirky or content.is_OptionType or content.is_IndexedType

and the ak._v2.highlevel.Array constructor can raise an exception if the given layout is quirky.

I'm not saying that this is something that we should do. As I said, there are other ways to construct invalid arrays by filling Indexes with bad data, and there is no way to check these that isn't O(N) in the length of the array (big N). The check_valid argument controls whether a ak._v2.highlevel.Array performs that expensive check, and I think that checks for quirkiness, too.

Yes, it does: here's IndexedOptionArray._validityerror:

https://github.com/scikit-hep/awkward-1.0/blob/a108ed61894d3ad5fe2e575f38f2c237cd5b330b/src/awkward/_v2/contents/indexedoptionarray.py#L1646-L1675

The last test checks for this condition.

@agoose77
Copy link
Collaborator Author

agoose77 commented Mar 1, 2022

Ignoring them probably means they'll be materialized, right

Only if one of inputs is an option. Otherwise, no inputs are an option, and so the record is built with the current set of inputs as contents.

RE quirkiness (forgive me for a terrible name, I'm going to call it "trust" from now on, though it's really just a question of does 'simplify' return the same layout), my motivation is simplifying layout visiting code (I.e most of awkward). If we don't need to handle cases like this and the recent nested option type is_none issue (by enforcing rules like an option can never contain another option/index), we make our lives easier. In fact, the is_none issue makes this even more desirable. I feel like I've run into this in a few places, come to think of it.

Indeed this is something we can compute on the fly, but given that most users should never encounter such layouts (awkward should never produce them itself), it makes sense to keep trust around rather than recomputing.

It's out of scope for this PR of course, but are you comfortable with imposing such constraints on layouts (I.e you can't just nest any content inside any other)?

@jpivarski
Copy link
Member

Most of this is outside the scope of the PR.

We can already assume that option-types and IndexedArrays are not directly nested within each other. It is the responsibility of operations to return valid layouts, not to check incoming layouts for validity, because the latter (defensive programming) would require the same checks to be performed repeatedly.

Reading what you and I wrote on #1193, I'm not sure that I recognized that the error only occurred when ak.is_none was given invalid inputs. It might not have been necessary to handle that case. (It's also not a problem that the code is more robust now, and that particular check only has to descend one level of depth beyond the node it's concerned with—the error checking is bounded.)

If we're going to check it at all, the internal booleans is the least expensive in time. The flag should be completely transient, not taken as an argument (as nplike is), exposed as a property, pickled, etc. because it can always be recreated. It's like a kind of cache. I think "trust" is also an unclear word for describing what it is—how about redundant_optiontype?

The equivalent redundant_uniontype is when any of a UnionArray's contents are mergeable or are themselves UnionArrays. When passing these flags up, it's not important to distinguish them, just to know that there's an error, so it can be a single boolean (nested_redundancy?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants