feat: allow awkward type arrays filtering based on rdfentry #2202

ianna · 2023-02-03T16:26:34Z

As discussed with @vepadulano - the rdfentry_ column is guaranteed to produce unique indexing. This PR takes the column out and checks its length against the Awkward type column data (not copied to or from RDF).

However, there might be a more efficient way to check if the RDF was filtered.

codecov · 2023-02-03T16:36:29Z

Codecov Report

Merging #2202 (5fd40d2) into main (20b3da8) will decrease coverage by 0.06%.
The diff coverage is 0.00%.

Additional details and impacted files

Impacted Files	Coverage Δ
src/awkward/_connect/rdataframe/from_rdataframe.py	`0.00% <0.00%> (ø)`

src/awkward/_connect/rdataframe/from_rdataframe.py

…on_rdfentry

jpivarski

Awesome! I expected that RDF would have its own bookkeeping for filters, and it's great that we can use that instead of creating another. Is this rdfentry_ column only available in certain ROOT versions? If so, then we'll need to restrict to a minimum version of ROOT. (That's generally true of all our strict and non-strict dependencies; once we find a feature we minimally need, we can raise errors if the dependency version is too old to have that feature.)

The test reads an Awkward column back out of the RDF (x has record type—not converted through RVec and such) and you verified that the length is correct. Could you also verify that the values are correct, that it's picking out exactly the right entries? Maybe even better if the filter is not y > 2 but one of the integers % 2 == 0, so that you see that the rdfentry_ is picking out a non-trivial subset.

I'm approving this PR for merging anyway, but a test like that would be better.

vepadulano · 2023-02-06T11:24:04Z

Hi @ianna,

Just adding a comment to further clarify the usage of rdfentry_. That depends on both the data source and the mode of execution. The easiest case is single core execution, with no data source (i.e. empty source), then rdfentry_ will just be the row number (from 0 to N-1 in df = RDataFrame(N)). With a TTree/TChain as a data source, this corresponds to the event number of the tree/chain, but only in sequential execution. In MT execution, it is still a unique indexing, but there is no direct correspondence between rdfentry_ and the event number in the full tree/chain as specified in the docs. Interestingly, we have very recently started discussing the possibility of aligning it to the event number also in MT runs in this issue.

Bottom line, it is a unique index but you can use it as a filtering bookkeeping method for external arrays (like numpy or awkward arrays) only in sequential execution, which maybe it is already the case here. RDataFrame doesn't use this special variable for the bookkeeping of filtered events, there are other internal mechanisms.

I also agree with Jim that it is worth to check also the precise contents of the filtered arrays.

ianna

The rdfentry_-based indexing is done for the Awkward types columns. These can be placed into the RDF only via our own RDF source. IMHO, it is safe to use in MT. The result may not be in an original order, but the same entries from different columns will be the same (MT or not):

assert out["x"].tolist() == [{"x": [2.1, 2.2]}, {"x": [4.1, 4.2, 4.3, 4.4]}]
assert out["y"].tolist() == [2, 4]
assert out["z"].tolist() == [[2.1, 2.3, 2.4], [4.1, 4.2, 4.3]]

or

assert out["x"].tolist() == [{"x": [4.1, 4.2, 4.3, 4.4]}, {"x": [2.1, 2.2]}]
assert out["y"].tolist() == [4, 2]
assert out["z"].tolist() == [[4.1, 4.2, 4.3], [2.1, 2.3, 2.4]]

I think, to restore the original order, the workaround could be to argsort the final rdfentry_ column retrieved as an Awkward array and index the result again.

src/awkward/_connect/rdataframe/from_rdataframe.py

ianna · 2023-02-06T15:46:13Z

Awesome! I expected that RDF would have its own bookkeeping for filters, and it's great that we can use that instead of creating another. Is this rdfentry_ column only available in certain ROOT versions? If so, then we'll need to restrict to a minimum version of ROOT. (That's generally true of all our strict and non-strict dependencies; once we find a feature we minimally need, we can raise errors if the dependency version is too old to have that feature.)

Looking at ROOT documentation: an rdfentry_ column is an alias or a replacement of a tdfentry_ column that was introduced in versions greater than 6.14. The latter legacy column name is still supported. I need to check which ROOT version is recommended to use a production version of RDF.

The test reads an Awkward column back out of the RDF (x has record type—not converted through RVec and such) and you verified that the length is correct. Could you also verify that the values are correct, that it's picking out exactly the right entries? Maybe even better if the filter is not y > 2 but one of the integers % 2 == 0, so that you see that the rdfentry_ is picking out a non-trivial subset.

done.

I'm approving this PR for merging anyway, but a test like that would be better.

Thanks! Please, have a look.

jpivarski

Yes, these are more incisive tests. Thanks!

ianna added 2 commits February 3, 2023 15:18

fix: use generic int type for offsets

2c8e31c

feat: allow awkward type arrays filtering based on rdfentry

f5fec9a

fix: pylint fixes

360c6ef

ianna temporarily deployed to docs-preview February 3, 2023 16:52 — with GitHub Actions Inactive

jpivarski reviewed Feb 3, 2023

View reviewed changes

src/awkward/_connect/rdataframe/from_rdataframe.py Outdated Show resolved Hide resolved

ianna added 2 commits February 4, 2023 14:13

fix: revert type

5c82bd9

Merge branch 'main' into ianna/rdataframe_filter_awkward_type_arrays_…

11b1c88

…on_rdfentry

ianna requested a review from jpivarski February 4, 2023 13:15

ianna temporarily deployed to docs-preview February 4, 2023 13:25 — with GitHub Actions Inactive

jpivarski approved these changes Feb 4, 2023

View reviewed changes

fix: more complex test as suggested by Jim

5fd40d2

ianna temporarily deployed to docs-preview February 6, 2023 14:54 — with GitHub Actions Inactive

ianna commented Feb 6, 2023

View reviewed changes

src/awkward/_connect/rdataframe/from_rdataframe.py Show resolved Hide resolved

jpivarski approved these changes Feb 6, 2023

View reviewed changes

jpivarski merged commit 339896f into main Feb 6, 2023

jpivarski deleted the ianna/rdataframe_filter_awkward_type_arrays_on_rdfentry branch February 6, 2023 16:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: allow awkward type arrays filtering based on rdfentry #2202

feat: allow awkward type arrays filtering based on rdfentry #2202

ianna commented Feb 3, 2023 •

edited

Loading

codecov bot commented Feb 3, 2023 •

edited

Loading

jpivarski left a comment

vepadulano commented Feb 6, 2023 •

edited

Loading

ianna left a comment

ianna commented Feb 6, 2023

jpivarski left a comment

feat: allow awkward type arrays filtering based on rdfentry #2202

feat: allow awkward type arrays filtering based on rdfentry #2202

Conversation

ianna commented Feb 3, 2023 • edited Loading

codecov bot commented Feb 3, 2023 • edited Loading

Codecov Report

jpivarski left a comment

Choose a reason for hiding this comment

vepadulano commented Feb 6, 2023 • edited Loading

ianna left a comment

Choose a reason for hiding this comment

ianna commented Feb 6, 2023

jpivarski left a comment

Choose a reason for hiding this comment

ianna commented Feb 3, 2023 •

edited

Loading

codecov bot commented Feb 3, 2023 •

edited

Loading

vepadulano commented Feb 6, 2023 •

edited

Loading