Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ak.drop_none() #1904

Merged
merged 25 commits into from
Dec 16, 2022
Merged

feat: add ak.drop_none() #1904

merged 25 commits into from
Dec 16, 2022

Conversation

ioanaif
Copy link
Collaborator

@ioanaif ioanaif commented Nov 21, 2022

This PR adds the drop_none functionality. Requested in #832

ak.drop_none(array, axis) - removes missing values (None) from a given array.

For example, in the following array,

a = ak.Array([[[0]], [[None]], [[1], None], [[2, None]]])

The None values will be removed, resulting in

>>> ak.drop_none(a)
<Array [[[0]], [[]], [[1]], [[2]]] type='4 * var * var * int64'>

The default axis is None. However, an axis can be specified:

>>> ak.drop_none(a,axis=1)
<Array [[[0]], [[None]], [[1]], [[2, None]]] type='4 * var * var * ?int64'>

@ioanaif ioanaif force-pushed the ioanaif/add-drop-none-feature branch 3 times, most recently from db3a86f to ef611f4 Compare November 21, 2022 15:38
@ioanaif ioanaif force-pushed the ioanaif/add-drop-none-feature branch from a625376 to 76e72c9 Compare November 21, 2022 15:49
@ioanaif ioanaif force-pushed the ioanaif/add-drop-none-feature branch from 28c7ad7 to c671c46 Compare November 22, 2022 10:30
@codecov
Copy link

codecov bot commented Nov 22, 2022

Codecov Report

Merging #1904 (b19bed8) into main (3fc4adb) will increase coverage by 0.07%.
The diff coverage is 97.40%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/contents/regulararray.py 88.62% <50.00%> (-0.16%) ⬇️
src/awkward/contents/content.py 72.95% <75.00%> (+0.01%) ⬆️
src/awkward/contents/bitmaskedarray.py 69.85% <100.00%> (+0.57%) ⬆️
src/awkward/contents/bytemaskedarray.py 88.39% <100.00%> (+0.06%) ⬆️
src/awkward/contents/indexedoptionarray.py 88.55% <100.00%> (+0.04%) ⬆️
src/awkward/contents/listarray.py 90.45% <100.00%> (+0.08%) ⬆️
src/awkward/contents/listoffsetarray.py 79.82% <100.00%> (+0.33%) ⬆️
src/awkward/contents/unmaskedarray.py 73.00% <100.00%> (+0.24%) ⬆️
src/awkward/operations/__init__.py 100.00% <100.00%> (ø)
src/awkward/operations/ak_drop_none.py 100.00% <100.00%> (ø)
... and 1 more

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good implementation and I can see that the tests are exhaustive, using the full suite of layouts.

I think it's good and ready to merge, as soon as the build-docs issue is fixed. If it's fixed in #1905 first, we'll merge that into main and then merge main into here. (One way or the other.)

src/awkward/operations/ak_drop_none.py Show resolved Hide resolved
Copy link
Collaborator

@agoose77 agoose77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @ioanaif!! I needed this feature so much during my analysis work.

I'm on holiday, but I know that this is a big feature and wanted to try and offer an additional set of eyes.

src/awkward/contents/listoffsetarray.py Show resolved Hide resolved
@agoose77
Copy link
Collaborator

agoose77 commented Nov 22, 2022

I'm done touching this PR now (hands off) - I fixed the whitespace in my suggestion that didn't merge properly :)

@ioanaif ioanaif enabled auto-merge (squash) November 22, 2022 17:49
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the tests too hastily: this is not wrapping its result as a high-level array, which it should by default.

There's also an error (below) that I noticed when trying the RecordArray example, which should be possible. I don't know why RecordArray reveals it but NumpyArray doesn't. I'm looking more closely at this now...

src/awkward/operations/ak_drop_none.py Outdated Show resolved Hide resolved
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm sorry, but I missed some things. The to_list tests in test_1904-drop-none.py are insensitive to the difference between high-level and low-level arrays, so that's why we didn't catch that issue.

I suspect that having _drop_none return different types for different nodes (Content vs (Index, Content)) is probably the indirect cause, though I don't know what mechanism is responsible for it. Anyway, it will be safer to have all the _drop_none methods return the same type; the top-level drop_none (no underscore) is in a good position to drop the unnecessary outindex at the end.

Also, this should at least pass through records: when users want to remove missing values, they'll want to remove them from within records just as much as from recordless arrays. A good guide for this could be is_none, which also looks inside of arrays, and is_none and drop_none should be pretty similar.

>>> ak.is_none(ak.Array([[{"x": [1]}], [{"x": [None]}]]), axis=2).show()
[[{x: [False]}],
 [{x: [True]}]]

>>> ak.is_none(ak.Array([[{"x": [1], "y": [[2]]}], [{"x": [None], "y": [[None]]}]]), axis=-1).show()
[[{x: False, y: [False]}],
 [{x: False, y: [False]}]]

In the above, x and y have lists of different depths, but axis=-1 counts up from the bottom. That's why most functions keep calling wrap_axis_if_negative until it finally becomes positive (after the recursion has passed through the record and it's seeing a single, unambiguous depth).

I think these are the only two issues, though. There are tests in test_1904-drop-none.py that involve records; I'll take a look at what happened there now.

@pre-commit-ci pre-commit-ci bot temporarily deployed to docs-preview December 13, 2022 17:32 Inactive
@ioanaif ioanaif temporarily deployed to docs-preview December 13, 2022 17:47 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it's almost completely done. I see that you're using the new maybe_posaxis.

There are some cases that the tests didn't (couldn't) cover because the tests are based on slicing with is_none, which isn't possible if a record introduces different depths. I've worked through some examples that could be used as direct tests.

Starting from this array, which has missing values at all levels, with different levels for the x and y branches of the record.

array = ak.Array(
    [None,
     [None, {"x": [1], "y": [[2]]}],
     [{"x": [3], "y": [None]}, {"x": [None], "y": [[None]]}]
    ])
>>> array.show()
[None,
 [None, {x: [1], y: [[2]]}],
 [{x: [3], y: [None]}, {x: [None], y: [[None]]}]]

First, with axis=0, the result from is_none and drop_none are correct because only the None outside of any lists is at axis=0.

>>> ak.is_none(array, axis=0).show()
[True,
 False,
 False]
>>> ak.drop_none(array, axis=0).show()
[[None, {x: [1], y: [[2]]}],
 [{x: [3], y: [None]}, {x: [None], y: [[None]]}]]

Next, with axis=1, is_none and drop_none are still both correct.

>>> ak.is_none(array, axis=1).show()
[None,
 [True, False],
 [False, False]]
>>> ak.drop_none(array, axis=1).show()
[None,
 [{x: [1], y: [[2]]}],
 [{x: [3], y: [None]}, {x: [None], y: [[None]]}]]

Next, with axis=2, is_none is correct because it is only making x: [True] or y: [True] for a None at this level; anything deeper is hidden inside a False and anything shallower is visible, so is_none is correct.

However, drop_none is producing the right values but putting them in the wrong places: the x: [3] should be in the previous record, so drop_none is incorrect in this example.

>>> ak.is_none(array, axis=2).show()
[None,
 [None, {x: [False], y: [False]}],
 [{x: [False], y: [True]}, {x: [True], y: [False]}]]
>>> ak.drop_none(array, axis=2).show()
[None,
 [None, {x: [1], y: [[2]]}],
 [{x: [], y: []}, {x: [3], y: [[None]]}]]

Next, with axis=-1, the deepest x values have one [, ] while the deepest y values have two [[, ]]. The is_none function reports x to be [False] or [True] with one bracket, and it reports y to be [[False]] or [[True]] with two brackets, so is_none is correct.

The drop_none correctly leaves the y: [None] (one bracket) because it's above axis=-1 and it keeps x: [3] or removes it, x: [], also with one bracket because that's what axis=-1 means for x. However, the x: [3] should be on the previous record, so it's incorrect for this example.

>>> ak.is_none(array, axis=-1).show()
[None,
 [None, {x: [False], y: [[False]]}],
 [{x: [False], y: [None]}, {x: [True], y: [[True]]}]]
>>> ak.drop_none(array, axis=-1).show()
[None,
 [None, {x: [1], y: [[2]]}],
 [{x: [], y: [None]}, {x: [3], y: [[]]}]]

Finally, with axis=-2, is_none should say x: False and (potentially) x: True (no instances in this array) with zero brackets and y: [False] and y: [True] with one bracket. It does, and is_none is correct.

Thinking about what drop_none should do in this case, it can't remove an x field without also removing a y field, so axis=-2 should raise an exception for this kind of thing, and it doesn't:

>>> ak.is_none(array, axis=-2).show()
[None,
 [None, {x: False, y: [False]}],
 [{x: False, y: [True]}, {x: False, y: [False]}]]
>>> ak.drop_none(array, axis=-2).show()
[None,
 [None, {x: [1], y: [[2]]}],
 [{x: [3], y: []}, {x: [], y: [[None]]}]]

To be concrete: the original array has x: [1], x: [3], and x: [None] at this axis, all of which are not missing, but if one was, then it would want to remove that x value, yet axis=-2 for y does not mean for it to act at this level. You can implement this exception by adding code where the recursion gets to the RecordArray: if posaxis == depth - 1 for some of its fields but not others, then it should raise a np.AxisError.

(is_none raises an np.AxisError if axis=-3, which is impossible for what is_none wants to do.)

I need a better array2 for this case, so that I'm not talking in hypotheticals.

array2 = ak.Array(
    [None,
     [None, {x: [1], y: [[2]]}],
     [{x: None, y: [None]}, {x: [None], y: [[None]]}]
    ])
>>> array2.show()
[None,
 [None, {x: [1], y: [[2]]}],
 [{x: None, y: [None]}, {x: [None], y: [[None]]}]]

Now is_none at axis=-2 is

>>> ak.is_none(array2, axis=-2).show()
[None,
 [None, {x: False, y: [False]}],
 [{x: True, y: [True]}, {x: False, y: [False]}]]

which has x: False wherever x has a list and x: True wherever x has a direct None.

Trying to run drop_none on this raises a ValueError in the RecordArray constructor, but it should have been an np.AxisError before attempting to compute and construct the RecordArray.

These can be directly dropped in as new unit tests. This PR should also increase the awkward-cpp version number to 3 in awkward-cpp/pyproject.toml.

I think the above is just one error in indexing plus needing to add one check-and-raise-exception. It's an enormous amount of work to get this far.

@jpivarski
Copy link
Member

This PR should also increase the awkward-cpp version number to 3 in awkward-cpp/pyproject.toml.

Actually, I'm going to stick to the habit of always changing version numbers as direct commits to main, so that the fact that a version has changed is highly visible. I'll do that right now.

Copy link
Collaborator

@agoose77 agoose77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed some docs fixes, but any code suggestions I've left here! Nice work @ioanaif :)

src/awkward/operations/ak_drop_none.py Outdated Show resolved Hide resolved
src/awkward/operations/ak_drop_none.py Outdated Show resolved Hide resolved
@agoose77 agoose77 temporarily deployed to docs-preview December 13, 2022 22:43 — with GitHub Actions Inactive
@ioanaif ioanaif temporarily deployed to docs-preview December 16, 2022 10:04 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview December 16, 2022 15:33 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is done! (Let's see the tests pass.)

Thanks for all of the hard work on this; it was definitely a lot more involved than I had thought it was going to be when I first suggested it. But now it works in all the extreme cases and we know it's not going to come up again as a bug in somebody's analysis.

It reminds me of something... found it:

This applies a lot more often than I'd like it to.

@jpivarski jpivarski temporarily deployed to docs-preview December 16, 2022 16:18 — with GitHub Actions Inactive
@ioanaif
Copy link
Collaborator Author

ioanaif commented Dec 16, 2022

Yay! Hope all corner cases have been discovered! 🥳🥳

@jpivarski jpivarski merged commit eb13f07 into main Dec 16, 2022
@jpivarski jpivarski deleted the ioanaif/add-drop-none-feature branch December 16, 2022 16:30
@jpivarski jpivarski mentioned this pull request Jan 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants