
TSAN-reported data races with RNTuple reader (triggered by podio reader inside JANA2 reconstruction) #22153

@wdconinc


Check duplicate issues.

  • Checked for duplicates

Description

In the EIC reconstruction framework CI we encounter TSAN reports of data races (e.g. in https://github.com/eic/EICrecon/actions/runs/25331418512/job/74270527718 after the merge of eic/EICrecon#2469). This involves v1.0.1.0 RNTuple files generated by DD4hep simulations in the EDM4hep (podio data model) format, written with ROOT v6.38.00.

The issue appears to be in the reader inside our concurrent EICrecon reconstruction framework, which is built on JANA2 and reads events from multiple threads.

After encountering the issue, I asked Copilot to minimize the reproduction from our full stack; it came up with the reproducer below (some gratuitous comments included).

Reproducer

repro_rclusterpool_race.cpp

To be compiled and run as follows (at least in our environments):

g++ -fsanitize=thread -std=c++17 -O1 $(root-config --cflags --libs) -lROOTNTuple -o repro repro_rclusterpool_race.cpp
TSAN_OPTIONS="halt_on_error=0" setarch $(uname -m) -R ./repro

(ThreadSanitizer enabled, halt_on_error disabled so that all reports are printed, and setarch -R to disable ASLR)

Output (only first data race report included):

$ TSAN_OPTIONS="halt_on_error=0" setarch $(uname -m) -R ./repro
Writing RNTuple (~16 clusters, 1 entry each)... done.
Reading back (this may trigger a TSAN report)...==================
WARNING: ThreadSanitizer: data race (pid=3371455)
  Read of size 8 at 0x72040000af20 by main thread:
    #0 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115 (libtsan.so.2+0x82708) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
    #1 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:107 (libtsan.so.2+0x82708)
    #2 ROOT::Internal::RPageSource::UnsealPage(ROOT::Internal::RPageStorage::RSealedPage const&, ROOT::Internal::RColumnElementBase const&, ROOT::Internal::RPageAllocator&) <null> (libROOTNTuple.so.6.38+0x2029cf) (BuildId: 9d3fc48eb8a60aad7661d81e7b6cc979d8db5a8c)
    #3 __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 (libc.so.6+0x29ca7) (BuildId: 58749c528985eab03e6700ebc1469fa50aa41219)

  Previous write of size 8 at 0x72040000af20 by thread T1:
    #0 pread64 ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1025 (libtsan.so.2+0x5a0ed) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
    #1 ROOT::Internal::RRawFileUnix::ReadAtImpl(void*, unsigned long, unsigned long) <null> (libRIO.so.6.38+0x126ab4) (BuildId: 49074ce67f254ea5495c58bbaf8f9c6ab8f0ffe2)

  Location is heap block of size 16 at 0x72040000af20 allocated by thread T1:
    #0 operator new[](unsigned long) ../../../../src/libsanitizer/tsan/tsan_new_delete.cpp:70 (libtsan.so.2+0x9c5f6) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
    #1 ROOT::Internal::RPageSourceFile::PrepareSingleCluster(ROOT::Internal::RCluster::RKey const&, std::vector<ROOT::Internal::RRawFile::RIOVec, std::allocator<ROOT::Internal::RRawFile::RIOVec> >&) <null> (libROOTNTuple.so.6.38+0x21a8c4) (BuildId: 9d3fc48eb8a60aad7661d81e7b6cc979d8db5a8c)

  Thread T1 (tid=3371461, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1022 (libtsan.so.2+0x568a6) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) <null> (libstdc++.so.6+0xe12f8) (BuildId: 133b71e0013695cc7832680a74edb51008c4fc4c)
    #2 __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 (libc.so.6+0x29ca7) (BuildId: 58749c528985eab03e6700ebc1469fa50aa41219)

SUMMARY: ThreadSanitizer: data race ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115 in memcpy
==================

ROOT version

  ------------------------------------------------------------------
  | Welcome to ROOT 6.38.00                        https://root.cern |
  | (c) 1995-2025, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on Apr 26 2026, 19:35:33                 |
  | From tags/6-38-00@6-38-00                                        |
  | With g++ (Debian 14.2.0-19) 14.2.0 std202002                     |
  | Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q'  |
   ------------------------------------------------------------------

Installation method

Spack

Operating system

Linux

Additional context

Copilot session with all its details

Appendix: RClusterPool Data Race — Root-Cause Analysis and Proposed Fix

1. Summary

ROOT::Internal::RClusterPool has a data race between the background I/O
thread (T_io, started by RClusterPool::StartBackgroundThread()) and the
consumer thread (the thread that calls GetCluster()). The race is triggered
whenever an RNTuple is read sequentially and manifests as a
write/write race on allocator-managed memory detected by ThreadSanitizer.

The bug is present in ROOT 6.38.00 and in current master (despite the partial
mitigation in commit ecf6205ce54).


2. How the Bug Is Triggered in EICrecon

EICrecon added RNTuple output in PR
eic/EICrecon#2579
(merged 2026-04-27). The TSAN CI job did not fire on that PR because the
simulation-file cache was still valid (old TTree files were re-used). One
week later a containers update (10733ab8) changed the geometry hash, forcing
npsim to regenerate simulation files in RNTuple format. Reading those
RNTuple files in the TSAN CI job then triggered the race.


3. The Race — Step-by-Step

Scheduling

With kDefaultClusterBunchSize = 1, a single call to GetCluster(N) enqueues
two cluster bunches into the work queue:

Bunch    Cluster ID    bunchId
N        N             B
N+1      N+1           B+1

Execution timeline

consumer:  GetCluster(N)
             → enqueue [cluster_N/bunchB, cluster_N+1/bunchB+1]  (lock held)
             → fPool.erase(cluster_N-1)                           (NO lock) ← eviction
             → WaitFor(N): blocks on future_N

T_io:      wakes (one cond_var notify)
           ┌──────────────────────────────────────────────────────────────┐
           │  inner while (!readItems.empty()) — NO lock between bunches  │
           │                                                              │
           │  Iteration 1 (bunchB):                                       │
           │    LoadClusters({cluster_N})                                 │
           │      → RCluster::Adopt(pageMap)                              │
           │      → fOnDiskPages.insert() → operator new(node_A)  ←(1)  │
           │    future_N.set_value(cluster_N)                     ←(HB)  │
           │                                                              │
           │  Iteration 2 (bunchB+1):  ← NO lock re-acquired here        │
           │    LoadClusters({cluster_N+1})                               │
           │      → RCluster::Adopt(pageMap)                              │
           │      → fOnDiskPages.insert() → operator new(node_?)  ←(2)  │
           └──────────────────────────────────────────────────────────────┘

consumer:  WaitFor(N) unblocks (HB from T_io's set_value)
           GetCluster(N+1)
             → fPool.erase(cluster_N)   (NO lock)
             → ~RCluster() → ~unordered_map()
             → operator delete(node_A)                            ←(3)

The race

Steps (2) and (3) are concurrent with no synchronisation:

  • T_io (2) writes to node address A (initialising the hash-map node for
    cluster_N+1's page map). This happens after the set_value HB boundary,
    so the consumer's clock does not observe it.
  • consumer (3) writes to address A via operator delete (freeing
    cluster_N's hash-map nodes, which the allocator recycles back to A for
    T_io's next malloc).

ThreadSanitizer reports:

WARNING: ThreadSanitizer: data race
  Write of size 8 by main thread:
    #0 operator delete
    #N ROOT::Internal::RCluster::Adopt(ROOT::Internal::RCluster&&)
    #N ROOT::Internal::RClusterPool::WaitFor(...)
    #N ROOT::Internal::RClusterPool::GetCluster(...)
  Previous write of size 8 by thread T_io:
    #0 operator new
    #N std::_Hashtable::_M_insert_unique(...)
    #N ROOT::Internal::RCluster::Adopt(std::unique_ptr<ROnDiskPageMap>)
    #N ROOT::Internal::RClusterPool::ExecReadClusters()

4. Why the Existing Mitigation (ecf6205ce54) Is Insufficient

Commit ecf6205ce54 changed RClusterPool to start T_io lazily (on the
first GetCluster() call) rather than eagerly in the constructor.

This accidentally avoids the race for short-lived readers (e.g., EICrecon's
non-events categories read in Finish() with only a handful of clusters) because:

  1. T_io processes all clusters in a single LoadClusters batch before the
    consumer gets back to GetCluster().
  2. T_io goes idle.
  3. The consumer's eviction (fPool.erase) executes; the next GetCluster()
    call then wakes T_io via notify_one — establishing a happens-before edge.

However, the structural cause — ExecReadClusters looping across bunch
boundaries without re-acquiring fLockWorkQueue
— remains. For any reader
with more than kDefaultClusterBunchSize clusters in flight, the race window
is still present.


5. Proposed Fix

The root cause is that GetCluster() schedules two bunches per call
(2 * fClusterBunchSize clusters), causing ExecReadClusters to process
bunch N+1 without synchronising with the consumer's eviction of bunch N.

Minimal fix: schedule only one bunch per GetCluster() call.

--- a/tree/ntuple/src/RClusterPool.cxx
+++ b/tree/ntuple/src/RClusterPool.cxx
@@ -207,8 +207,7 @@ ROOT::Internal::RCluster *ROOT::Internal::RClusterPool::GetCluster(
-   for (ROOT::DescriptorId_t i = 0, next = clusterId; i < 2 * fClusterBunchSize; ++i) {
-      if (i == fClusterBunchSize)
-         provideInfo.fBunchId = ++fBunchId;
+   for (ROOT::DescriptorId_t i = 0, next = clusterId; i < fClusterBunchSize; ++i) {

With fClusterBunchSize = 1 (the default), this schedules exactly one cluster
per GetCluster() call. ExecReadClusters delivers it and goes idle. The
consumer's eviction runs, then notify_one wakes T_io for the next call —
establishing the required happens-before edge.

Correctness argument

Property                                   Before fix        After fix
Bunches per GetCluster()                   2                 1
T_io idle between consumer calls           No (inner loop)   Yes
HB: consumer eviction → T_io next alloc    Missing           Via cond_var notify/wait
Lookahead depth (default settings)         4 clusters        2 clusters

The lookahead depth halves (from 2 × bunchSize × nThreads to
bunchSize × nThreads), which is a small performance trade-off for
correctness. The fBunchId member and its increment can be removed as a
follow-up cleanup.


6. Reproducers

Two minimal reproducers are provided in
tree/ntuple/test/ of the ROOT master worktree.

6a. Standalone C++ reproducer (repro_rclusterpool_race.cpp)

Uses the public RNTupleWriter/RNTupleReader API. Writes 16 entries with a
per-entry cluster budget, then reads them back; each cluster boundary exercises
the race window.

Compile and run (ASLR must be disabled to avoid TSAN shadow-map conflicts):

g++ -fsanitize=thread -std=c++17 -O1 \
    $(root-config --cflags --libs) -lROOTNTuple \
    -o repro repro_rclusterpool_race.cpp

TSAN_OPTIONS="halt_on_error=0" setarch $(uname -m) -R ./repro

Key write options required to avoid a "page buffer memory budget too small" error:

// Page-buffer budget = 2 × ApproxZippedClusterSize; initial page size must fit.
opts.SetInitialUnzippedPageSize(sizeof(double)); // 8 bytes = 1 element/page
opts.SetApproxZippedClusterSize(sizeof(double)); // 8 bytes ≈ 1 entry/cluster

6b. GTest regression test (ntuple_cluster_race.cxx)

Uses an internal mock (RPageSourceSlowMock) that sleeps 20 ms on every
LoadClusters() call after the first. The sleep holds T_io inside bunch N+1
long enough for the consumer to evict cluster N, opening the race window.

Register in tree/ntuple/test/CMakeLists.txt:

ROOT_ADD_GTEST(ntuple_cluster_race ntuple_cluster_race.cxx LIBRARIES ROOTNTuple)

Run with TSAN:

cmake -DCMAKE_CXX_FLAGS="-fsanitize=thread" ...
ctest -R ClusterPool_NoRaceBetweenEvictionAndPrefetch

Note on TSAN detection reliability: the race manifests when the system
allocator reuses freed node addresses across threads. With glibc's per-thread
tcache enabled (default, glibc ≥ 2.26), cross-thread reuse may be suppressed
for small allocations. For reliable triggering either build ROOT with TSAN
(so the full allocation path is instrumented) or use jemalloc with
MALLOC_CONF=tcache:false.


7. Short-Term Workaround for EICrecon

Until the ROOT fix lands in a container release, add a TSAN suppression to
EICrecon/.github/tsan.supp:

race:ROOT::Internal::RCluster::Adopt

This suppresses the race report without affecting normal test output.
