Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow awkward type arrays filtering based on rdfentry #2202

Merged

Conversation

ianna
Copy link
Collaborator

@ianna ianna commented Feb 3, 2023

As discussed with @vepadulano - the rdfentry_ column is guaranteed to produce unique indexing. This PR takes the column out and checks its length against the Awkward type column data (not copied to or from RDF).

However, there might be a more efficient way to check if the RDF was filtered.

@codecov
Copy link

codecov bot commented Feb 3, 2023

Codecov Report

Merging #2202 (5fd40d2) into main (20b3da8) will decrease coverage by 0.06%.
The diff coverage is 0.00%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_connect/rdataframe/from_rdataframe.py 0.00% <0.00%> (ø)

@ianna ianna temporarily deployed to docs-preview February 3, 2023 16:52 — with GitHub Actions Inactive
@ianna ianna requested a review from jpivarski February 4, 2023 13:15
@ianna ianna temporarily deployed to docs-preview February 4, 2023 13:25 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I expected that RDF would have its own bookkeeping for filters, and it's great that we can use that instead of creating another. Is this rdfentry_ column only available in certain ROOT versions? If so, then we'll need to restrict to a minimum version of ROOT. (That's generally true of all our strict and non-strict dependencies; once we find a feature we minimally need, we can raise errors if the dependency version is too old to have that feature.)

The test reads an Awkward column back out of the RDF (x has record type—not converted through RVec and such) and you verified that the length is correct. Could you also verify that the values are correct, that it's picking out exactly the right entries? Maybe even better if the filter is not y > 2 but one of the integers % 2 == 0, so that you see that the rdfentry_ is picking out a non-trivial subset.

I'm approving this PR for merging anyway, but a test like that would be better.

@vepadulano
Copy link

vepadulano commented Feb 6, 2023

Hi @ianna,

Just adding a comment to further clarify the usage of rdfentry_. That depends on both the data source and the mode of execution. The easiest case is single core execution, with no data source (i.e. empty source), then rdfentry_ will just be the row number (from 0 to N-1 in df = RDataFrame(N)). With a TTree/TChain as a data source, this corresponds to the event number of the tree/chain, but only in sequential execution. In MT execution, it is still a unique indexing, but there is no direct correspondence between rdfentry_ and the event number in the full tree/chain as specified in the docs. Interestingly, we have very recently started discussing the possibility of aligning it to the event number also in MT runs in this issue.

Bottom line, it is a unique index but you can use it as a filtering bookkeeping method for external arrays (like numpy or awkward arrays) only in sequential execution, which maybe it is already the case here. RDataFrame doesn't use this special variable for the bookkeeping of filtered events, there are other internal mechanisms.

I also agree with Jim that it is worth to check also the precise contents of the filtered arrays.

@ianna ianna temporarily deployed to docs-preview February 6, 2023 14:54 — with GitHub Actions Inactive
Copy link
Collaborator Author

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rdfentry_-based indexing is done for the Awkward types columns. These can be placed into the RDF only via our own RDF source. IMHO, it is safe to use in MT. The result may not be in an original order, but the same entries from different columns will be the same (MT or not):

assert out["x"].tolist() == [{"x": [2.1, 2.2]}, {"x": [4.1, 4.2, 4.3, 4.4]}]
assert out["y"].tolist() == [2, 4]
assert out["z"].tolist() == [[2.1, 2.3, 2.4], [4.1, 4.2, 4.3]]

or

assert out["x"].tolist() == [{"x": [4.1, 4.2, 4.3, 4.4]}, {"x": [2.1, 2.2]}]
assert out["y"].tolist() == [4, 2]
assert out["z"].tolist() == [[4.1, 4.2, 4.3], [2.1, 2.3, 2.4]]

I think, to restore the original order, the workaround could be to argsort the final rdfentry_ column retrieved as an Awkward array and index the result again.

@ianna
Copy link
Collaborator Author

ianna commented Feb 6, 2023

Awesome! I expected that RDF would have its own bookkeeping for filters, and it's great that we can use that instead of creating another. Is this rdfentry_ column only available in certain ROOT versions? If so, then we'll need to restrict to a minimum version of ROOT. (That's generally true of all our strict and non-strict dependencies; once we find a feature we minimally need, we can raise errors if the dependency version is too old to have that feature.)

Looking at ROOT documentation: an rdfentry_ column is an alias or a replacement of a tdfentry_ column that was introduced in versions greater than 6.14. The latter legacy column name is still supported. I need to check which ROOT version is recommended to use a production version of RDF.

The test reads an Awkward column back out of the RDF (x has record type—not converted through RVec and such) and you verified that the length is correct. Could you also verify that the values are correct, that it's picking out exactly the right entries? Maybe even better if the filter is not y > 2 but one of the integers % 2 == 0, so that you see that the rdfentry_ is picking out a non-trivial subset.

done.

I'm approving this PR for merging anyway, but a test like that would be better.

Thanks! Please, have a look.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these are more incisive tests. Thanks!

@jpivarski jpivarski merged commit 339896f into main Feb 6, 2023
@jpivarski jpivarski deleted the ianna/rdataframe_filter_awkward_type_arrays_on_rdfentry branch February 6, 2023 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants