Skip to content

[ntuple] Handle missing values in RNTupleProcessor #18932

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

enirolf
Copy link
Contributor

@enirolf enirolf commented Jun 2, 2025

When an entry managed by the RNTupleProcessor contains missing data for certain fields, for example because they are missing in a subsequent chain or auxiliary ntuple, we need a way to signal to the user that the resulting entry has missing data and is therefore invalid.

The RNTupleProcessorValuePtr, together with the RNTupleProcessorEntry (which is a thin wrapper around REntry) capture this requirement by only returning actual data if the entry is marked as "valid" by the processor. Its use can be seen in action in the MissingEntries test in ntuple_processor_join.cxx.

With the addition of RNTupleValuePtr, pointers to the values of an RNTupleModel's default entry, returned by e.g., MakeField should not be used together with the RNTupleProcessor anymore, because they may contain incorrect data. More specifically, when a particular field's
data could not be loaded for a certain entry, it will retain its previous value. The RNTupleValuePtr adds appropriate checks for this, but to prevent users from using pointers returned from a model's default entry, we only allow RNTupleProcessors to be created from bare models.

Finally, this PR changes the iterator provided by the RNTupleProcessor from the full entry to only the (global) entry index. Providing the full entry was already a bit questionable, because it would allow users to call GetPtr or related functions inside of the loop, which we want to avoid. Other than that, there is (currently) no additional useful information one can obtain from the full entry.

@enirolf enirolf self-assigned this Jun 2, 2025
@enirolf enirolf requested review from jblomer and couet as code owners June 2, 2025 14:17
@enirolf enirolf force-pushed the ntuple-processor-value branch from 38e21aa to bfa6658 Compare June 2, 2025 14:47

RNTupleProcessorEntry(std::unique_ptr<ROOT::REntry> entry) : fEntry(std::move(entry)) {}

void SetValid() { fIsValid = true; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to prefer a SetValid(bool valid)? I'd say that generally it's a better API as it composes better (e.g. avoids situations where one has to write if (valid) { SetValid() } else { SetInvalid() }).

}

public:
bool HasValue() { return fProcessorEntry->IsValid(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should be const

const RNTupleProcessorEntry *fProcessorEntry;
ROOT::RFieldToken fToken;

RNTupleProcessorValuePtr(std::string_view fieldName, const RNTupleProcessorEntry *processorEntry)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can processorEntry be null here? If not, I'd pass it by reference (but it's fine to store it by pointer ofc)

/// management through RNTupleProcessorValuePtr. The use of pointers returned from a model's default entry (e.g.,
/// through RNTupleModel::AddField) may contain incorrect information, specifically when the corresponding field is
/// missing from an entry.
static void EnsureModelIsBare(const RNTupleModel &model)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not missing something, this is probably better off as a static function in the cxx file directly.

Copy link
Member

@vepadulano vepadulano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Some suggestions from my side

};

template <typename T>
class RNTupleProcessorValuePtr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're introducing this new part of the API, I'm wondering. Does it make sense to think about a naming using terms borrowed from computer science concepts which might be relevant here? I'm thinking about something like RNTupleProcessor[Future,Promise,Delayed].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, but I don't think [Future,Promise,Delayed] is the most accurate here since there is no asynchronicity involved. I would say semantically this is more similar to a std::optional (or Haskell's Maybe for that matter ;)). In that case it could/would be RNTupleProcessor[Optional,Maybe]Value or something like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like RNTupleProcessorOptionalValue or just RNTupleProcessorOptional 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the analogy to std::optional, though I'll throw RNTupleProcessorOptionalPtr into the mix because it's not actually a copy of the value as far as I understand. Unfortunately all class names are quite long, it might be a possibility to nest them into RNTupleProcessor, but not sure how much this would actually help...

}

try {
entry++;
iter++;
FAIL() << "having missing fields in subsequent ntuples should throw";
} catch (const ROOT::RException &err) {
EXPECT_THAT(err.what(), testing::HasSubstr("field \"y\" not found in the current RNTuple"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be extremely useful to point out that field y is missing from RNTuple named xxx in file yyy.

Copy link

github-actions bot commented Jun 2, 2025

Test Results

    19 files      19 suites   3d 17h 5m 4s ⏱️
 2 803 tests  2 803 ✅ 0 💤 0 ❌
51 757 runs  51 757 ✅ 0 💤 0 ❌

Results for commit 389a1cc.

♻️ This comment has been updated with latest results.

enirolf added 4 commits June 3, 2025 13:15
When an entry contains missing data for certain fields, for example
because they are missing in a subsequent chain or auxiliary ntuple,
we need a way to signal to the user that the resulting entry has missing
data and is therefore invalid. The `RNTupleProcessorValuePtr`, together
with the `RNTupleProcessorEntry` (which is a thin wrapper around
`REntry`) capture this requirement by only returning actual data if the
entry is marked as "valid" by the processor.
...instead of the full entry. This design was already a bit
questionable, because it would allow users to call `GetPtr` or related
functions *inside* of the loop, which we want to avoid. Other than
that, there is (currently) no additional useful information one can
obtain from the full entry. Therefore, instead we now only return the
index of the current entry.
With the addition of `RNTupleValuePtr`, pointers to the values of an
RNTupleModel's default entry, returned by e.g., `MakeField` should not
be used together with the RNTupleProcessor anymore, because they may
contain incorrect data. More specifically, when a particular field's
data could not be loaded for a certain entry, it will retain its
previous value. The `RNTupleValuePtr` adds appropriate checks for this,
but to prevent users from using pointers returned from a model's default
entry, we only allow RNTupleProcessors to be created from bare models.
@enirolf enirolf force-pushed the ntuple-processor-value branch from bfa6658 to 389a1cc Compare June 4, 2025 13:00
@enirolf enirolf marked this pull request as draft June 4, 2025 13:01

RNTupleProcessorEntry(std::unique_ptr<ROOT::REntry> entry) : fEntry(std::move(entry)) {}

void SetValid(bool valid) { fIsValid = valid; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a heads-up, down the line we may need to add more semantic information to whether the entry is valid or not, see e.g.

enum EEntryStatus {
kEntryValid = 0, ///< data read okay
kEntryNotLoaded, ///< no entry has been loaded yet
kEntryNoTree, ///< the tree does not exist
kEntryNotFound, ///< the tree entry number does not exist
kEntryChainSetupError, ///< problem in accessing a chain element, e.g. file without the tree
kEntryChainFileError, ///< problem in opening a chain's file
kEntryDictionaryError, ///< problem reading dictionary info from tree
kEntryBeyondEnd, ///< last entry loop has reached its end
kEntryBadReader, ///< One of the readers was not successfully initialized.
kIndexedFriendNoMatch, ///< A friend with TTreeIndex doesn't have an entry for this index
kMissingBranchWhenSwitchingTree, ///< A branch was not found when switching to the next TTree in the chain
kEntryUnknownError ///< LoadTree return less than -6, likely a 'newer' error code.
};

return nullptr;
}

const T &operator*() const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may also want to implement operator-> so the user can directly call functions.

};

template <typename T>
class RNTupleProcessorValuePtr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the analogy to std::optional, though I'll throw RNTupleProcessorOptionalPtr into the mix because it's not actually a copy of the value as far as I understand. Unfortunately all class names are quite long, it might be a possibility to nest them into RNTupleProcessor, but not sure how much this would actually help...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants