feat: allow awkward type arrays filtering based on rdfentry #2202
Awesome! I expected that RDF would have its own bookkeeping for filters, and it's great that we can use that instead of creating another. Is this rdfentry_ column only available in certain ROOT versions? If so, then we'll need to restrict to a minimum version of ROOT. (That's generally true of all our strict and non-strict dependencies; once we find a feature we minimally need, we can raise errors if the dependency version is too old to have that feature.)
The test reads an Awkward column back out of the RDF (x has record type, not converted through RVec and such) and you verified that the length is correct. Could you also verify that the values are correct, that it's picking out exactly the right entries? Maybe even better if the filter is not y > 2 but one of the integers % 2 == 0, so that you see that the rdfentry_ is picking out a non-trivial subset.
I'm approving this PR for merging anyway, but a test like that would be better.
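A minimal sketch of the kind of value check suggested above. Plain Python lists stand in for the Awkward columns and for the entry indices that survive the filter; the data values and the helper name `select_entries` are illustrative assumptions, not code from this PR.

```python
# Plain-Python stand-ins; in the PR these would be Awkward Arrays and an RDataFrame.
x = [
    {"x": [1.1]},
    {"x": [2.1, 2.2]},
    {"x": [3.1, 3.2, 3.3]},
    {"x": [4.1, 4.2, 4.3, 4.4]},
    {"x": [5.1, 5.2, 5.3, 5.4, 5.5]},
]
y = [1, 2, 3, 4, 5]


def select_entries(column, entries):
    """Pick the rows of an external column listed in `entries` (hypothetical helper)."""
    return [column[i] for i in entries]


# Entry indices that pass the filter "y % 2 == 0" (the role rdfentry_ plays here).
passed = [i for i, value in enumerate(y) if value % 2 == 0]

out_x = select_entries(x, passed)
out_y = select_entries(y, passed)

# A length check alone cannot tell this subset from, say, the first two entries.
assert len(out_x) == len(x[:2]) == 2
# Comparing values does distinguish them.
assert out_y == [2, 4]
assert out_x == [{"x": [2.1, 2.2]}, {"x": [4.1, 4.2, 4.3, 4.4]}]
assert out_x != x[:2]
```

Because the even-valued entries are not contiguous, a wrong index mapping would show up in the values even when the length happens to match.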
Hi @ianna, just adding a comment to further clarify the usage of rdfentry_. Bottom line: it is a unique index, but you can use it as a filtering bookkeeping method for external arrays (like NumPy or Awkward arrays) only in sequential execution, which may already be the case here. RDataFrame doesn't use this special variable for the bookkeeping of filtered events; there are other internal mechanisms. I also agree with Jim that it is worth checking the precise contents of the filtered arrays as well.
The rdfentry_-based indexing is done for the Awkward type columns. These can be placed into the RDF only via our own RDF source. IMHO, it is safe to use in MT. The result may not be in the original order, but the same entries from different columns will line up with each other (MT or not):
assert out["x"].tolist() == [{"x": [2.1, 2.2]}, {"x": [4.1, 4.2, 4.3, 4.4]}]
assert out["y"].tolist() == [2, 4]
assert out["z"].tolist() == [[2.1, 2.3, 2.4], [4.1, 4.2, 4.3]]
or
assert out["x"].tolist() == [{"x": [4.1, 4.2, 4.3, 4.4]}, {"x": [2.1, 2.2]}]
assert out["y"].tolist() == [4, 2]
assert out["z"].tolist() == [[4.1, 4.2, 4.3], [2.1, 2.3, 2.4]]
I think, to restore the original order, the workaround could be to argsort the final rdfentry_ column retrieved as an Awkward array and index the result again.
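A rough sketch of that argsort workaround, written with plain Python lists so it runs without ROOT or Awkward installed. The `argsort` helper and the out-of-order sample values are assumptions for illustration; with real Awkward Arrays one would presumably reach for `ak.argsort` and integer-array indexing instead.

```python
def argsort(values):
    """Indices that would sort `values` (pure-Python stand-in for an argsort)."""
    return sorted(range(len(values)), key=values.__getitem__)


# Hypothetical out-of-order result, as a multithreaded run might return it.
entries = [3, 1]  # entry indices retrieved alongside the data columns
out_y = [4, 2]
out_z = [[4.1, 4.2, 4.3], [2.1, 2.3, 2.4]]

# Sort by entry index, then apply the same permutation to every column.
order = argsort(entries)
restored_y = [out_y[i] for i in order]
restored_z = [out_z[i] for i in order]

assert restored_y == [2, 4]
assert restored_z == [[2.1, 2.3, 2.4], [4.1, 4.2, 4.3]]
```

Applying one permutation to all columns keeps the rows aligned, which is the property the assertions above rely on.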
Looking at the ROOT documentation: an rdfentry_ column is an alias for, or a replacement of, a tdfentry_ column that was introduced in versions greater than 6.14. The latter, legacy column name is still supported. I need to check which ROOT version is recommended for production use of RDF.
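If a minimum ROOT version does turn out to be needed, the check could look roughly like the sketch below. The helper names `parse_root_version` and `require_root` are hypothetical, the `(6, 14)` threshold is only a placeholder taken from the comment above and has not been verified, and the version string is assumed to have ROOT's usual "6.26/10" shape.

```python
import re

# Placeholder threshold; the real minimum still needs to be confirmed.
MINIMUM_ROOT = (6, 14)


def parse_root_version(version):
    """Turn a ROOT-style version string such as "6.26/10" into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def require_root(version, minimum=MINIMUM_ROOT):
    """Raise ImportError if `version` is older than `minimum` (hypothetical helper)."""
    if parse_root_version(version) < minimum:
        wanted = ".".join(str(n) for n in minimum)
        raise ImportError(
            f"ROOT {version} is too old for rdfentry_-based filtering; "
            f"version {wanted} or later is assumed to be required"
        )


# Passes silently for a recent version string.
require_root("6.26/10")
```

The version string is passed in as an argument here so that the sketch does not depend on how it is obtained from the installed ROOT.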
done.
Thanks! Please have a look.
Yes, these are more incisive tests. Thanks!
As discussed with @vepadulano, the rdfentry_ column is guaranteed to produce unique indexing. This PR takes the column out and checks its length against the Awkward type column data (not copied to or from RDF). However, there might be a more efficient way to check if the RDF was filtered.
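The length comparison described here can be pictured with the following sketch. It uses plain lists in place of the retrieved rdfentry_ column and the Awkward data, and the function names `was_filtered` and `take_if_filtered` are made up for illustration rather than taken from the PR.

```python
def was_filtered(entries, array_length):
    """True if fewer entries came back than the external array holds."""
    return len(entries) != array_length


def take_if_filtered(array, entries):
    """Index the external array only when a filter actually removed entries."""
    if was_filtered(entries, len(array)):
        return [array[i] for i in entries]
    return array  # nothing was removed, so no indexing is needed


data = [[1.1], [2.1, 2.2], [3.1, 3.2, 3.3]]

assert take_if_filtered(data, [0, 1, 2]) is data
assert take_if_filtered(data, [1]) == [[2.1, 2.2]]
```

Note that the unfiltered branch assumes the entries come back in their original order, which holds for sequential execution as discussed earlier in the thread.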