
feat: support record reducer overloads #2458

Merged: 12 commits into main from agoose77/feat-record-reducer, May 19, 2023
Conversation

@agoose77 (Collaborator) commented May 12, 2023

Fixes #1423

  • Implement non-positional reducer support
  • Implement positional reducer support
  • Changed the ak.behavior documentation to use MyST Markdown

@agoose77 force-pushed the agoose77/feat-record-reducer branch from 885d882 to 2f2f9f7 on May 12, 2023 13:57
@agoose77 (Collaborator, Author) commented May 12, 2023

This seems simple enough: build a list over the RecordArray, which the custom reducer then reduces along axis=-1. However, the open question is how to handle masking/identity. The simplest interface is to always require the custom reducer to produce an identity element for empty sublists, which the dispatcher can then mask out. We currently do this for our normal reduction pathways.

However, doing this with existing high-level primitives may not be trivial, e.g. an implementation of ak.max for a record structure that looks at the first slot:

import awkward as ak


def _my_record_max(array):
    # Reduce according to the record's first slot (tuple field "0"):
    # the position of the maximum key selects the whole record.
    key = array["0"]
    j = ak.from_regular(
        ak.argmax(key, keepdims=True, mask_identity=False, axis=-1)
    )
    return array[j][..., 0]


ak.behavior[(ak.max, "my_record")] = _my_record_max

This will fail if we have an empty sublist and we don't mask the reducer.

One of the constraints that we are operating under is the desire not to have third party code using our nplike abstraction. Whilst for NumPy, CuPy, and JAX that should be possible (just type-check the buffers), typetracer would prove more fiddly (incidentally, this does add a vote in favour of making our typetracer nplike L2).

The above can be extended to support empty sublists:

def _option_is_trivial(array):
    # True only if we know concretely that no entry is missing
    any_is_none = ak.any(ak.is_none(array, axis=0), axis=None)
    return not (ak.typetracer.is_unknown_scalar(any_is_none) or any_is_none)


def _my_record_max(array):
    key = array["0"]
    j = ak.from_regular(
        ak.argmax(key, keepdims=True, mask_identity=True, axis=-1)
    )
    out = ak.to_layout(
        array[j][..., 0]
    )
    assert out.is_option

    if _option_is_trivial(out):
        # Drop the (all-valid) option type without copying the content
        out = out.to_IndexedOptionArray64()
        return ak.contents.IndexedArray(
            out.index,
            out.content
        )
    else:
        # identity_element is a placeholder for the reducer's identity value
        return ak.fill_none(out, identity_element)

but where there are any empty sublists, it will involve a copy of out in ak.fill_none even if the output is subsequently masked.

I think the proper API therefore includes the mask_identity flag. Here's an attempt to write a simple max that doesn't copy the RecordArray unless it has no choice, without using nplike.

def _option_is_trivial(array):
    any_is_none = ak.any(ak.is_none(array, axis=0), axis=None)
    return not (ak.typetracer.is_unknown_scalar(any_is_none) or any_is_none)


def _my_record_max(array: ak.Array, mask_identity: bool):
    key = array["0"]
    j = ak.from_regular(ak.argmax(key, keepdims=True, mask_identity=True, axis=-1))
    out = ak.to_layout(
        array[j][..., 0]
    )

    if mask_identity:
        return out
    # Avoid content `_carry` for options with empty masks
    elif _option_is_trivial(out):
        out = out.to_IndexedOptionArray64()
        return ak.contents.IndexedArray(
            out.index,
            out.content
        )
    # _carry the content (and fill missing values)
    else:
        return ak.fill_none(out, identity_element)

Meanwhile, argmax is trivial!

def _my_record_argmax(array: ak.Array, mask_identity: bool):
    return ak.argmax(array["0"], keepdims=False, mask_identity=False, axis=-1)
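
For completeness, a minimal sketch of how these overloads might be registered and exercised once this lands (the record name "my_record", the ragged input, and the placeholder identity_element in _my_record_max are illustrative only):

import awkward as ak

# Register the sketched overloads for records named "my_record"; the
# behavior key pairs the reducer with the record name.
ak.behavior[(ak.max, "my_record")] = _my_record_max
ak.behavior[(ak.argmax, "my_record")] = _my_record_argmax

# Zipping a sequence of arrays builds tuple-like records (fields "0", "1");
# naming them "my_record" makes the behavior apply.
array = ak.zip(
    [
        ak.Array([[1, 3, 2], [], [5]]),
        ak.Array([[10, 30, 20], [], [50]]),
    ],
    with_name="my_record",
)

# With this PR, reducing over the inner lists dispatches to the overloads.
result = ak.max(array, axis=-1)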

@agoose77 changed the title from "wip: support record reducer overloads" to "fea: support record reducer overloads" on May 14, 2023
@agoose77 changed the title from "fea: support record reducer overloads" to "feat: support record reducer overloads" on May 14, 2023
codecov bot commented May 14, 2023

Codecov Report

Merging #2458 (872de48) into main (4468cf5) will decrease coverage by 0.01%.
The diff coverage is 75.86%.

Additional details and impacted files:

Impacted Files                             Coverage Δ
src/awkward/operations/ak_with_name.py     100.00% <ø> (ø)
src/awkward/_do.py                          83.22% <50.00%> (ø)
src/awkward/contents/recordarray.py         84.68% <77.77%> (-0.44%) ⬇️

@agoose77 marked this pull request as ready for review May 15, 2023 08:49
@agoose77 requested a review from jpivarski May 15, 2023 08:49
@jpivarski (Member) left a comment

This is a good argument for the API requiring users to write functions that include a mask_identity argument. (Since these are functions users write, it's much harder to add required arguments later!) I don't suppose it needs a keepdims argument, since keeping a dimension should always be done in a regular way—behavior developers shouldn't have control over that.

In principle, the behavior developers might want to raise an exception on mask_identity=False.
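
For example, such a reducer could simply refuse (a rough sketch, not something this PR prescribes; the function name is made up):

def _my_identityless_reducer(array, mask_identity):
    # Hypothetical: a record reducer with no meaningful identity can refuse
    # to produce unmasked output for empty sublists.
    if not mask_identity:
        raise ValueError("this reducer has no identity; use mask_identity=True")
    ...  # the actual reduction would go here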

Although you have unit tests for the sum-of-vectors case, it could be good to test-implement it in Vector (here) to be sure that the API is what we need. (Notice all of the __numba_typer__/__numba_lower__ behaviors defined in this file: the API in Awkward and its use in Vector were developed in tandem.)

@agoose77 (Collaborator, Author) commented

I agree. In truth, I'm currently thinking about how vector should support this; the dispatch machinery is fairly long. I suspect it will simply require writing the same code!

@agoose77 (Collaborator, Author) commented May 15, 2023

@jpivarski: inspired by the existing (and more complex) binary addition, I took a stab at sum here.

For vector, I can't initially think of any reducers that we'd want to implement that do not have identities. Can you think of any?

In the event that there are none, the idea here is that we can re-use the sum implementation for numpy-backed vectors too (using moveaxis, reduce, moveaxis to prepare the array for reduction).
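
Roughly, I mean something like the following NumPy pattern (just a sketch; the helper name and the last-axis reducer are stand-ins, not vector's actual API):

import numpy as np


def _reduce_along_axis(array, axis, reduce_last):
    # Move the requested axis to the end so that a last-axis-only
    # implementation can be reused, then move the (kept) axis back.
    moved = np.moveaxis(array, axis, -1)
    reduced = reduce_last(moved)  # assumed to keep its reduced dimension
    return np.moveaxis(reduced, -1, axis)


# e.g. reusing a last-axis sum that keeps its reduced dimension
out = _reduce_along_axis(
    np.arange(24.0).reshape(2, 3, 4),
    axis=1,
    reduce_last=lambda a: a.sum(axis=-1, keepdims=True),
)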

@agoose77 (Collaborator, Author) commented

My implementation here implicitly assumes that records are atoms, i.e. one cannot reduce deeper than the record itself. This was motivated by RecordArray.purelist_depth, which returns 1 and justifies not passing axis to the custom reducer. I believe this holds, but weigh in if you think that's a misstep.
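
For reference, a small check of that property (assuming a plain record of numeric fields):

import awkward as ak

records = ak.Array([{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}])
# The outermost node is a RecordArray; per the above it reports depth 1,
# i.e. records are atoms and a reducer never descends into their fields.
print(records.layout.purelist_depth)  # expected: 1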

@jpivarski (Member) commented May 15, 2023

For vector, I can't initially think of any reducers that we'd want to implement that do not have identities. Can you think of any?

In Vector (and elsewhere), the only reducer I can think of wanting to override at all is sum. The any/all pair applies to booleans, which are not geometrical vectors, the min/max pair applies to objects with an ordering, and $n>1$-dimensional vectors do not have an ordering; same for argmin/argmax. I can't even see a reason to have prod (scalar product is not reducible with more than two arguments; vector product only has a definition for 3D vectors; and neither of these should be confused with *, which only combines a scalar with a vector).

Two reducers that can be implemented for Vectors: count and count_nonzero (where "nonzero" means "nonzero magnitude"). They have identities, though.
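
Such an overload might look roughly like this (a hypothetical sketch; the 2D field names "x"/"y" and the record name "Vector2D" are placeholders, not Vector's actual API):

import awkward as ak


def _vector2d_count_nonzero(array, mask_identity):
    # "Nonzero" meaning nonzero magnitude: count records whose squared
    # magnitude is nonzero along the innermost list axis.
    magnitude2 = array["x"] ** 2 + array["y"] ** 2
    return ak.count_nonzero(magnitude2, axis=-1, mask_identity=mask_identity)


ak.behavior[(ak.count_nonzero, "Vector2D")] = _vector2d_count_nonzero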

My implementation here implicitly assumes that records are atoms, i.e. one cannot reduce deeper than the record itself.

That's right. We can enshrine that as a rule.

@agoose77 force-pushed the agoose77/feat-record-reducer branch from 1914bdc to b454555 on May 16, 2023 20:36
Comment on lines +38 to +44
def _apply_record_reducer(
reducer, layout: RecordArray, mask: bool, offsets: ak.index.Index, behavior
) -> Content:
# Build a 1D list over these contents
array = wrap_layout(ak.contents.ListOffsetArray(offsets, layout), behavior=behavior)
# Perform the reduction
return ak.to_layout(reducer(array, mask))
@agoose77 (Collaborator, Author) commented:

I don't yet know where to put this. It feels a bit odd creating a highlevel array inside code written in a Content class definition, so I've moved it to a top-level function, much like the other reducer overloads are defined.

@agoose77 enabled auto-merge (squash) May 19, 2023 11:55
@agoose77 merged commit e674dcb into main May 19, 2023
@agoose77 deleted the agoose77/feat-record-reducer branch May 19, 2023 12:11
Merging this pull request may close the linked issue: Allow reducers to be overridden in v2.