-
Notifications
You must be signed in to change notification settings - Fork 1.4k
[ntuple] Handle missing values in RNTupleProcessor #18932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
38e21aa
to
bfa6658
Compare
|
||
RNTupleProcessorEntry(std::unique_ptr<ROOT::REntry> entry) : fEntry(std::move(entry)) {} | ||
|
||
void SetValid() { fIsValid = true; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to prefer a SetValid(bool valid)
? I'd say that generally it's a better API as it composes better (e.g. avoids situations where one has to write if (valid) { SetValid() } else { SetInvalid() }
).
} | ||
|
||
public: | ||
bool HasValue() { return fProcessorEntry->IsValid(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should be const
const RNTupleProcessorEntry *fProcessorEntry; | ||
ROOT::RFieldToken fToken; | ||
|
||
RNTupleProcessorValuePtr(std::string_view fieldName, const RNTupleProcessorEntry *processorEntry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can processorEntry
be null here? If not, I'd pass it by reference (but it's fine to store it by pointer ofc)
/// management through RNTupleProcessorValuePtr. The use of pointers returned from a model's default entry (e.g., | ||
/// through RNTupleModel::AddField) may contain incorrect information, specifically when the corresponding field is | ||
/// missing from an entry. | ||
static void EnsureModelIsBare(const RNTupleModel &model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not missing something, this is probably better off as a static function in the cxx file directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! Some suggestions from my side
}; | ||
|
||
template <typename T> | ||
class RNTupleProcessorValuePtr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're introducing this new part of the API, I'm wondering. Does it make sense to think about a naming using terms borrowed from computer science concepts which might be relevant here? I'm thinking about something like RNTupleProcessor[Future,Promise,Delayed]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, but I don't think [Future,Promise,Delayed]
is the most accurate here since there is no asynchronicity involved. I would say semantically this is more similar to a std::optional
(or Haskell's Maybe
for that matter ;)). In that case it could/would be RNTupleProcessor[Optional,Maybe]Value
or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like RNTupleProcessorOptionalValue or just RNTupleProcessorOptional 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also like the analogy to std::optional
, though I'll throw RNTupleProcessorOptionalPtr
into the mix because it's not actually a copy of the value as far as I understand. Unfortunately all class names are quite long, it might be a possibility to nest them into RNTupleProcessor
, but not sure how much this would actually help...
} | ||
|
||
try { | ||
entry++; | ||
iter++; | ||
FAIL() << "having missing fields in subsequent ntuples should throw"; | ||
} catch (const ROOT::RException &err) { | ||
EXPECT_THAT(err.what(), testing::HasSubstr("field \"y\" not found in the current RNTuple")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be extremely useful to point out that field y
is missing from RNTuple named xxx
in file yyy
.
Test Results 19 files 19 suites 3d 17h 5m 4s ⏱️ Results for commit 389a1cc. ♻️ This comment has been updated with latest results. |
When an entry contains missing data for certain fields, for example because they are missing in a subsequent chain or auxiliary ntuple, we need a way to signal to the user that the resulting entry has missing data and is therefore invalid. The `RNTupleProcessorValuePtr`, together with the `RNTupleProcessorEntry` (which is a thin wrapper around `REntry`) capture this requirement by only returning actual data if the entry is marked as "valid" by the processor.
...instead of the full entry. This design was already a bit questionable, because it would allow users to call `GetPtr` or related functions *inside* of the loop, which we want to avoid. Other than that, there is (currently) no additional useful information one can obtain from the full entry. Therefore, instead we now only return the index of the current entry.
With the addition of `RNTupleValuePtr`, pointers to the values of an RNTupleModel's default entry, returned by e.g., `MakeField` should not be used together with the RNTupleProcessor anymore, because they may contain incorrect data. More specifically, when a particular field's data could not be loaded for a certain entry, it will retain its previous value. The `RNTupleValuePtr` adds appropriate checks for this, but to prevent users from using pointers returned from a model's default entry, we only allow RNTupleProcessors to be created from bare models.
bfa6658
to
389a1cc
Compare
|
||
RNTupleProcessorEntry(std::unique_ptr<ROOT::REntry> entry) : fEntry(std::move(entry)) {} | ||
|
||
void SetValid(bool valid) { fIsValid = valid; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a heads-up, down the line we may need to add more semantic information to whether the entry is valid or not, see e.g.
root/tree/treeplayer/inc/TTreeReader.h
Lines 149 to 162 in b4955fc
enum EEntryStatus { | |
kEntryValid = 0, ///< data read okay | |
kEntryNotLoaded, ///< no entry has been loaded yet | |
kEntryNoTree, ///< the tree does not exist | |
kEntryNotFound, ///< the tree entry number does not exist | |
kEntryChainSetupError, ///< problem in accessing a chain element, e.g. file without the tree | |
kEntryChainFileError, ///< problem in opening a chain's file | |
kEntryDictionaryError, ///< problem reading dictionary info from tree | |
kEntryBeyondEnd, ///< last entry loop has reached its end | |
kEntryBadReader, ///< One of the readers was not successfully initialized. | |
kIndexedFriendNoMatch, ///< A friend with TTreeIndex doesn't have an entry for this index | |
kMissingBranchWhenSwitchingTree, ///< A branch was not found when switching to the next TTree in the chain | |
kEntryUnknownError ///< LoadTree return less than -6, likely a 'newer' error code. | |
}; |
return nullptr; | ||
} | ||
|
||
const T &operator*() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may also want to implement operator->
so the user can directly call functions.
}; | ||
|
||
template <typename T> | ||
class RNTupleProcessorValuePtr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also like the analogy to std::optional
, though I'll throw RNTupleProcessorOptionalPtr
into the mix because it's not actually a copy of the value as far as I understand. Unfortunately all class names are quite long, it might be a possibility to nest them into RNTupleProcessor
, but not sure how much this would actually help...
When an entry managed by the
RNTupleProcessor
contains missing data for certain fields, for example because they are missing in a subsequent chain or auxiliary ntuple, we need a way to signal to the user that the resulting entry has missing data and is therefore invalid.The
RNTupleProcessorValuePtr
, together with theRNTupleProcessorEntry
(which is a thin wrapper aroundREntry
) capture this requirement by only returning actual data if the entry is marked as "valid" by the processor. Its use can be seen in action in theMissingEntries
test inntuple_processor_join.cxx
.With the addition of
RNTupleValuePtr
, pointers to the values of an RNTupleModel's default entry, returned by e.g.,MakeField
should not be used together with the RNTupleProcessor anymore, because they may contain incorrect data. More specifically, when a particular field'sdata could not be loaded for a certain entry, it will retain its previous value. The
RNTupleValuePtr
adds appropriate checks for this, but to prevent users from using pointers returned from a model's default entry, we only allow RNTupleProcessors to be created from bare models.Finally, this PR changes the iterator provided by the
RNTupleProcessor
from the full entry to only the (global) entry index. Providing the full entry was already a bit questionable, because it would allow users to callGetPtr
or related functions inside of the loop, which we want to avoid. Other than that, there is (currently) no additional useful information one can obtain from the full entry.