[ntuple] Handle missing values in RNTupleProcessor #18932

enirolf · 2025-06-02T14:17:36Z

When an entry managed by the RNTupleProcessor contains missing data for certain fields, for example because they are missing in a subsequent chain or auxiliary ntuple, we need a way to signal to the user that the resulting entry has missing data and is therefore invalid.

The RNTupleProcessorValuePtr, together with the RNTupleProcessorEntry (a thin wrapper around REntry), captures this requirement by returning actual data only if the processor has marked the entry as "valid". The MissingEntries test in ntuple_processor_join.cxx shows it in action.

With the addition of RNTupleProcessorValuePtr, pointers to the values of an RNTupleModel's default entry (returned by, e.g., MakeField) should no longer be used together with the RNTupleProcessor, because they may contain incorrect data. More specifically, when a particular field's data could not be loaded for a certain entry, the pointer retains its previous value. The RNTupleProcessorValuePtr adds appropriate checks for this, but to prevent users from using pointers returned from a model's default entry, we only allow RNTupleProcessors to be created from bare models.
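To illustrate the stale-value problem described above, here is a minimal standalone sketch. It does not use ROOT: MockProcessorEntry and MockValuePtr are hypothetical stand-ins for RNTupleProcessorEntry and RNTupleProcessorValuePtr, and throwing on dereference of an invalid entry is an assumption, not necessarily what the PR does.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for RNTupleProcessorEntry: one int field plus a validity flag.
class MockProcessorEntry {
   std::shared_ptr<int> fValue = std::make_shared<int>(0);
   bool fIsValid = false;

public:
   // Simulates loading an entry. When the field is missing (data == nullptr),
   // the stored value is left untouched and the entry is marked invalid.
   void Load(const int *data)
   {
      if (data)
         *fValue = *data;
      fIsValid = (data != nullptr);
   }
   bool IsValid() const { return fIsValid; }
   // Mimics a pointer obtained from a model's default entry: no validity check.
   std::shared_ptr<int> GetRawPtr() const { return fValue; }
};

// Hypothetical stand-in for RNTupleProcessorValuePtr<int>.
class MockValuePtr {
   const MockProcessorEntry *fEntry;
   std::shared_ptr<int> fPtr;

public:
   explicit MockValuePtr(const MockProcessorEntry &entry) : fEntry(&entry), fPtr(entry.GetRawPtr()) {}
   bool HasValue() const { return fEntry->IsValid(); }
   // Returns nullptr instead of stale data when the current entry is invalid.
   const int *GetPtr() const { return HasValue() ? fPtr.get() : nullptr; }
   // Assumption: dereferencing an invalid value throws.
   const int &operator*() const
   {
      if (!HasValue())
         throw std::runtime_error("value is missing for the current entry");
      return *fPtr;
   }
};
```

After a failed load, the raw pointer still reads the previous entry's value, while the checked wrapper reports that no value is available.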

Finally, this PR changes the iterator provided by the RNTupleProcessor from the full entry to only the (global) entry index. Providing the full entry was already a bit questionable, because it would allow users to call GetPtr or related functions inside of the loop, which we want to avoid. Other than that, there is (currently) no additional useful information one can obtain from the full entry.
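As a rough sketch of what an index-only iterator looks like from the user's side, the following standalone mock (not ROOT code; MockProcessor and SumIndices are hypothetical names) dereferences to the global entry index and nothing else, so no entry accessors are reachable inside the loop body.

```cpp
#include <cstdint>

// Hypothetical processor whose iterator dereferences to the global entry index only.
class MockProcessor {
   std::uint64_t fNEntries;

public:
   class Iterator {
      std::uint64_t fIndex;

   public:
      explicit Iterator(std::uint64_t index) : fIndex(index) {}
      std::uint64_t operator*() const { return fIndex; }
      Iterator &operator++()
      {
         ++fIndex;
         return *this;
      }
      bool operator!=(const Iterator &other) const { return fIndex != other.fIndex; }
   };

   explicit MockProcessor(std::uint64_t nEntries) : fNEntries(nEntries) {}
   Iterator begin() const { return Iterator(0); }
   Iterator end() const { return Iterator(fNEntries); }
};

// Sums the indices seen in a range-based for loop over the processor.
inline std::uint64_t SumIndices(const MockProcessor &proc)
{
   std::uint64_t sum = 0;
   for (auto idx : proc)
      sum += idx;
   return sum;
}
```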

silverweed · 2025-06-02T15:09:07Z

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

+
+   RNTupleProcessorEntry(std::unique_ptr<ROOT::REntry> entry) : fEntry(std::move(entry)) {}
+
+   void SetValid() { fIsValid = true; }


Any reason not to prefer a SetValid(bool valid)? I'd say that generally it's a better API as it composes better (e.g. avoids situations where one has to write if (valid) { SetValid() } else { SetInvalid() }).
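The composition argument can be sketched with a small standalone example. MockEntry, UpdateWithBranch and UpdateDirect are hypothetical names used for illustration only; SetInvalid in particular is assumed here to make the contrast visible.

```cpp
// Hypothetical entry type showing the two setter styles under discussion.
struct MockEntry {
   bool fIsValid = false;
   // Style 1: a pair of parameterless setters.
   void SetValid() { fIsValid = true; }
   void SetInvalid() { fIsValid = false; }
   // Style 2: a single setter taking the new state.
   void SetValid(bool valid) { fIsValid = valid; }
};

// With style 1, the caller has to branch on the result of the load.
inline void UpdateWithBranch(MockEntry &entry, bool loadSucceeded)
{
   if (loadSucceeded)
      entry.SetValid();
   else
      entry.SetInvalid();
}

// With style 2, the result is forwarded directly.
inline void UpdateDirect(MockEntry &entry, bool loadSucceeded)
{
   entry.SetValid(loadSucceeded);
}
```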

silverweed · 2025-06-02T15:09:54Z

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

+   }
+
+public:
+   bool HasValue() { return fProcessorEntry->IsValid(); }


Probably should be const

silverweed · 2025-06-02T15:11:49Z

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

+   const RNTupleProcessorEntry *fProcessorEntry;
+   ROOT::RFieldToken fToken;
+
+   RNTupleProcessorValuePtr(std::string_view fieldName, const RNTupleProcessorEntry *processorEntry)


Can processorEntry be null here? If not, I'd pass it by reference (but it's fine to store it by pointer ofc)

silverweed · 2025-06-02T15:12:53Z

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

+   /// management through RNTupleProcessorValuePtr. The use of pointers returned from a model's default entry (e.g.,
+   /// through RNTupleModel::AddField) may contain incorrect information, specifically when the corresponding field is
+   /// missing from an entry.
+   static void EnsureModelIsBare(const RNTupleModel &model)


If I'm not missing something, this is probably better off as a static function in the cxx file directly.

vepadulano

Very nice! Some suggestions from my side

vepadulano · 2025-06-02T19:19:14Z

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

+};
+
+template <typename T>
+class RNTupleProcessorValuePtr {


Since we're introducing this new part of the API, I'm wondering. Does it make sense to think about a naming using terms borrowed from computer science concepts which might be relevant here? I'm thinking about something like RNTupleProcessor[Future,Promise,Delayed].

I see your point, but I don't think [Future,Promise,Delayed] is the most accurate here since there is no asynchronicity involved. I would say semantically this is more similar to a std::optional (or Haskell's Maybe for that matter ;)). In that case it could/would be RNTupleProcessor[Optional,Maybe]Value or something like that.

I like RNTupleProcessorOptionalValue or just RNTupleProcessorOptional 👍

I also like the analogy to std::optional, though I'll throw RNTupleProcessorOptionalPtr into the mix because it's not actually a copy of the value as far as I understand. Unfortunately all class names are quite long, it might be a possibility to nest them into RNTupleProcessor, but not sure how much this would actually help...

vepadulano · 2025-06-02T19:38:30Z

tree/ntuple/test/ntuple_processor_chain.cxx

   }

   try {
-      entry++;
+      iter++;
      FAIL() << "having missing fields in subsequent ntuples should throw";
   } catch (const ROOT::RException &err) {
      EXPECT_THAT(err.what(), testing::HasSubstr("field \"y\" not found in the current RNTuple"));


It would be extremely useful to point out that field y is missing from RNTuple named xxx in file yyy.
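One possible shape for such a message, as a standalone sketch: MakeMissingFieldMessage is a hypothetical helper, and the exact wording is an assumption rather than what RNTupleProcessor emits.

```cpp
#include <string>

// Hypothetical helper that builds an error message naming the field,
// the RNTuple it was expected in, and the file containing that RNTuple.
inline std::string MakeMissingFieldMessage(const std::string &fieldName, const std::string &ntupleName,
                                           const std::string &fileName)
{
   return "field \"" + fieldName + "\" not found in RNTuple \"" + ntupleName + "\" in file \"" + fileName + "\"";
}
```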

github-actions · 2025-06-02T22:20:32Z

Test Results

19 files 19 suites 3d 17h 5m 4s ⏱️
2 803 tests 2 803 ✅ 0 💤 0 ❌
51 757 runs 51 757 ✅ 0 💤 0 ❌

Results for commit 389a1cc.

♻️ This comment has been updated with latest results.

vepadulano · 2025-06-10T07:52:32Z

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

+
+   RNTupleProcessorEntry(std::unique_ptr<ROOT::REntry> entry) : fEntry(std::move(entry)) {}
+
+   void SetValid(bool valid) { fIsValid = valid; }


Just as a heads-up, down the line we may need to add more semantic information to whether the entry is valid or not, see e.g.

root/tree/treeplayer/inc/TTreeReader.h

Lines 149 to 162 in b4955fc

enum EEntryStatus {
   kEntryValid = 0, ///< data read okay
   kEntryNotLoaded, ///< no entry has been loaded yet
   kEntryNoTree, ///< the tree does not exist
   kEntryNotFound, ///< the tree entry number does not exist
   kEntryChainSetupError, ///< problem in accessing a chain element, e.g. file without the tree
   kEntryChainFileError, ///< problem in opening a chain's file
   kEntryDictionaryError, ///< problem reading dictionary info from tree
   kEntryBeyondEnd, ///< last entry loop has reached its end
   kEntryBadReader, ///< One of the readers was not successfully initialized.
   kIndexedFriendNoMatch, ///< A friend with TTreeIndex doesn't have an entry for this index
   kMissingBranchWhenSwitchingTree, ///< A branch was not found when switching to the next TTree in the chain
   kEntryUnknownError ///< LoadTree return less than -6, likely a 'newer' error code.
};
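A status-based variant could look roughly like the sketch below. Everything in it is hypothetical: the enumerator names are illustrative and not proposed API, and the point is only that a boolean IsValid() can remain as a view on a richer status.

```cpp
// Hypothetical status values; the names are illustrative, not part of the PR.
enum class EMockEntryStatus {
   kValid,       // all requested fields were loaded
   kNotLoaded,   // no entry has been loaded yet
   kMissingField // at least one field could not be loaded for this entry
};

class MockStatusEntry {
   EMockEntryStatus fStatus = EMockEntryStatus::kNotLoaded;

public:
   void SetStatus(EMockEntryStatus status) { fStatus = status; }
   EMockEntryStatus GetStatus() const { return fStatus; }
   // The boolean view stays available for callers that only care about validity.
   bool IsValid() const { return fStatus == EMockEntryStatus::kValid; }
};
```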

hahnjo · 2025-06-11T15:37:56Z

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

+      return nullptr;
+   }
+
+   const T &operator*() const


You may also want to implement operator-> so the user can directly call functions.
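For illustration, a minimal standalone wrapper with both operators might look as follows. MockOptionalPtr is a hypothetical class, not the PR's RNTupleProcessorValuePtr, and it assumes the caller checks HasValue() before dereferencing.

```cpp
#include <string>

// Hypothetical non-owning, optional-like pointer wrapper.
template <typename T>
class MockOptionalPtr {
   const T *fPtr;

public:
   explicit MockOptionalPtr(const T *ptr) : fPtr(ptr) {}
   bool HasValue() const { return fPtr != nullptr; }
   // Both operators assume HasValue() is true; they perform no check themselves.
   const T &operator*() const { return *fPtr; }
   const T *operator->() const { return fPtr; }
};

// Example of calling a member function directly through operator->.
inline std::size_t NameLength(const MockOptionalPtr<std::string> &name)
{
   return name.HasValue() ? name->size() : 0;
}
```

With operator-> in place, `name->size()` replaces the more verbose `(*name).size()`.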

hahnjo · 2025-06-11T15:48:53Z

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

+};
+
+template <typename T>
+class RNTupleProcessorValuePtr {


I also like the analogy to std::optional, though I'll throw RNTupleProcessorOptionalPtr into the mix because it's not actually a copy of the value as far as I understand. Unfortunately all class names are quite long, it might be a possibility to nest them into RNTupleProcessor, but not sure how much this would actually help...

enirolf requested review from hahnjo, pcanal, silverweed and vepadulano June 2, 2025 14:17

enirolf self-assigned this Jun 2, 2025

enirolf requested review from jblomer and couet as code owners June 2, 2025 14:17

enirolf added the in:RNTuple label Jun 2, 2025

enirolf force-pushed the ntuple-processor-value branch from 38e21aa to bfa6658 Compare June 2, 2025 14:47

silverweed reviewed Jun 2, 2025

vepadulano requested changes Jun 2, 2025

enirolf added 4 commits June 3, 2025 13:15

[ntuple] Update tutorials

389a1cc

enirolf force-pushed the ntuple-processor-value branch from bfa6658 to 389a1cc Compare June 4, 2025 13:00

enirolf marked this pull request as draft June 4, 2025 13:01

vepadulano reviewed Jun 10, 2025

hahnjo reviewed Jun 11, 2025

