
TSAN-reported data races with RNTuple reader (triggered by podio reader inside JANA2 reconstruction) #22153

@wdconinc


Check duplicate issues.

  • Checked for duplicates

Description

In the EIC reconstruction framework CI we encounter TSAN reports of data races (e.g. in https://github.com/eic/EICrecon/actions/runs/25331418512/job/74270527718 after the merge of eic/EICrecon#2469). This involves v1.0.1.0 RNTuple files generated by DD4hep simulations in the EDM4hep (podio data model) format, written with ROOT v6.38.00.

The issue appears to be in the reader inside our concurrent EICrecon reconstruction framework, which is built on JANA2 and reads events from multiple threads.

After encountering the issue, I asked Copilot to minimize the reproduction from our full stack; it came up with the reproducer below (some gratuitous comments included).

Reproducer

repro_rclusterpool_race.cpp

To be compiled and run as follows (at least in our environments):

g++ -fsanitize=thread -std=c++17 -O1 $(root-config --cflags --libs) -lROOTNTuple -o repro repro_rclusterpool_race.cpp
TSAN_OPTIONS="halt_on_error=0" setarch $(uname -m) -R ./repro

(ThreadSanitizer enabled, halt_on_error disabled so that all reports are printed, and setarch -R to disable ASLR)

Output (only first data race report included):

$ TSAN_OPTIONS="halt_on_error=0" setarch $(uname -m) -R ./repro
Writing RNTuple (~16 clusters, 1 entry each)... done.
Reading back (this may trigger a TSAN report)...==================
WARNING: ThreadSanitizer: data race (pid=3371455)
  Read of size 8 at 0x72040000af20 by main thread:
    #0 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115 (libtsan.so.2+0x82708) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
    #1 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:107 (libtsan.so.2+0x82708)
    #2 ROOT::Internal::RPageSource::UnsealPage(ROOT::Internal::RPageStorage::RSealedPage const&, ROOT::Internal::RColumnElementBase const&, ROOT::Internal::RPageAllocator&) <null> (libROOTNTuple.so.6.38+0x2029cf) (BuildId: 9d3fc48eb8a60aad7661d81e7b6cc979d8db5a8c)
    #3 __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 (libc.so.6+0x29ca7) (BuildId: 58749c528985eab03e6700ebc1469fa50aa41219)

  Previous write of size 8 at 0x72040000af20 by thread T1:
    #0 pread64 ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1025 (libtsan.so.2+0x5a0ed) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
    #1 ROOT::Internal::RRawFileUnix::ReadAtImpl(void*, unsigned long, unsigned long) <null> (libRIO.so.6.38+0x126ab4) (BuildId: 49074ce67f254ea5495c58bbaf8f9c6ab8f0ffe2)

  Location is heap block of size 16 at 0x72040000af20 allocated by thread T1:
    #0 operator new[](unsigned long) ../../../../src/libsanitizer/tsan/tsan_new_delete.cpp:70 (libtsan.so.2+0x9c5f6) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
    #1 ROOT::Internal::RPageSourceFile::PrepareSingleCluster(ROOT::Internal::RCluster::RKey const&, std::vector<ROOT::Internal::RRawFile::RIOVec, std::allocator<ROOT::Internal::RRawFile::RIOVec> >&) <null> (libROOTNTuple.so.6.38+0x21a8c4) (BuildId: 9d3fc48eb8a60aad7661d81e7b6cc979d8db5a8c)

  Thread T1 (tid=3371461, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1022 (libtsan.so.2+0x568a6) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) <null> (libstdc++.so.6+0xe12f8) (BuildId: 133b71e0013695cc7832680a74edb51008c4fc4c)
    #2 __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 (libc.so.6+0x29ca7) (BuildId: 58749c528985eab03e6700ebc1469fa50aa41219)

SUMMARY: ThreadSanitizer: data race ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115 in memcpy
==================

ROOT version

  ------------------------------------------------------------------
  | Welcome to ROOT 6.38.00                        https://root.cern |
  | (c) 1995-2025, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on Apr 26 2026, 19:35:33                 |
  | From tags/6-38-00@6-38-00                                        |
  | With g++ (Debian 14.2.0-19) 14.2.0 std202002                     |
  | Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q'  |
   ------------------------------------------------------------------

Installation method

Spack

Operating system

Linux

Additional context

Copilot session with all its details

Appendix: RClusterPool Data Race — Root-Cause Analysis and Proposed Fix

1. Summary

ROOT::Internal::RClusterPool has a data race between the background I/O
thread (T_io, started by RClusterPool::StartBackgroundThread()) and the
consumer thread (the thread that calls GetCluster()). The race is triggered
whenever an RNTuple is read sequentially and manifests as a
write/write race on allocator-managed memory detected by ThreadSanitizer.

The bug is present in ROOT 6.38.00 and in current master (despite the partial
mitigation in commit ecf6205ce54).


2. How the Bug Is Triggered in EICrecon

EICrecon added RNTuple output in PR
eic/EICrecon#2579
(merged 2026-04-27). The TSAN CI job did not fire on that PR because the
simulation-file cache was still valid (old TTree files were re-used). One
week later a containers update (10733ab8) changed the geometry hash, forcing
npsim to regenerate simulation files in RNTuple format. Reading those
RNTuple files in the TSAN CI job then triggered the race.


3. The Race — Step-by-Step

Scheduling

With kDefaultClusterBunchSize = 1, a single call to GetCluster(N) enqueues
two cluster bunches into the work queue:

Bunch    Cluster ID    bunchId
N        N             B
N+1      N+1           B+1

Execution timeline

consumer:  GetCluster(N)
             → enqueue [cluster_N/bunchB, cluster_N+1/bunchB+1]  (lock held)
             → fPool.erase(cluster_N-1)                           (NO lock) ← eviction
             → WaitFor(N): blocks on future_N

T_io:      wakes (one cond_var notify)
           ┌──────────────────────────────────────────────────────────────┐
           │  inner while (!readItems.empty()) — NO lock between bunches  │
           │                                                              │
           │  Iteration 1 (bunchB):                                       │
           │    LoadClusters({cluster_N})                                 │
           │      → RCluster::Adopt(pageMap)                              │
           │      → fOnDiskPages.insert() → operator new(node_A)  ←(1)  │
           │    future_N.set_value(cluster_N)                     ←(HB)  │
           │                                                              │
           │  Iteration 2 (bunchB+1):  ← NO lock re-acquired here        │
           │    LoadClusters({cluster_N+1})                               │
           │      → RCluster::Adopt(pageMap)                              │
           │      → fOnDiskPages.insert() → operator new(node_?)  ←(2)  │
           └──────────────────────────────────────────────────────────────┘

consumer:  WaitFor(N) unblocks (HB from T_io's set_value)
           GetCluster(N+1)
             → fPool.erase(cluster_N)   (NO lock)
             → ~RCluster() → ~unordered_map()
             → operator delete(node_A)                            ←(3)

The race

Steps (2) and (3) are concurrent with no synchronisation:

  • T_io (2) writes to node address A (initialising the hash-map node for
    cluster_N+1's page map). This happens after the set_value HB boundary,
    so the consumer's clock does not observe it.
  • consumer (3) writes to address A via operator delete (freeing
    cluster_N's hash-map nodes, which the allocator recycles back to A for
    T_io's next malloc).

ThreadSanitizer reports:

WARNING: ThreadSanitizer: data race
  Write of size 8 by main thread:
    #0 operator delete
    #N ROOT::Internal::RCluster::Adopt(ROOT::Internal::RCluster&&)
    #N ROOT::Internal::RClusterPool::WaitFor(...)
    #N ROOT::Internal::RClusterPool::GetCluster(...)
  Previous write of size 8 by thread T_io:
    #0 operator new
    #N std::_Hashtable::_M_insert_unique(...)
    #N ROOT::Internal::RCluster::Adopt(std::unique_ptr<ROnDiskPageMap>)
    #N ROOT::Internal::RClusterPool::ExecReadClusters()

4. Why the Existing Mitigation (ecf6205ce54) Is Insufficient

Commit ecf6205ce54 changed RClusterPool to start T_io lazily (on the
first GetCluster() call) rather than eagerly in the constructor.

This accidentally avoids the race for short-lived readers (e.g., EICrecon's
non-events categories read in Finish() with only a handful of clusters) because:

  1. T_io processes all clusters in a single LoadClusters batch before the
    consumer gets back to GetCluster().
  2. T_io goes idle.
  3. The consumer's eviction (fPool.erase) executes; the next GetCluster()
    call then wakes T_io via notify_one — establishing a happens-before edge.

However, the structural cause — ExecReadClusters looping across bunch
boundaries without re-acquiring fLockWorkQueue
— remains. For any reader
with more than kDefaultClusterBunchSize clusters in flight, the race window
is still present.


5. Proposed Fix

The root cause is that GetCluster() schedules two bunches per call
(2 * fClusterBunchSize clusters), causing ExecReadClusters to process
bunch N+1 without synchronising with the consumer's eviction of bunch N.

Minimal fix: schedule only one bunch per GetCluster() call.

--- a/tree/ntuple/src/RClusterPool.cxx
+++ b/tree/ntuple/src/RClusterPool.cxx
@@ -207,8 +207,7 @@ ROOT::Internal::RCluster *ROOT::Internal::RClusterPool::GetCluster(
-   for (ROOT::DescriptorId_t i = 0, next = clusterId; i < 2 * fClusterBunchSize; ++i) {
-      if (i == fClusterBunchSize)
-         provideInfo.fBunchId = ++fBunchId;
+   for (ROOT::DescriptorId_t i = 0, next = clusterId; i < fClusterBunchSize; ++i) {

With fClusterBunchSize = 1 (the default), this schedules exactly one cluster
per GetCluster() call. ExecReadClusters delivers it and goes idle. The
consumer's eviction runs, then notify_one wakes T_io for the next call —
establishing the required happens-before edge.

Correctness argument

Property                                   Before fix        After fix
Bunches per GetCluster()                   2                 1
T_io idle between consumer calls           No (inner loop)   Yes
HB: consumer eviction → T_io next alloc    Missing           Via cond_var notify/wait
Lookahead depth (default settings)         4 clusters        2 clusters

The lookahead depth halves (from 2 × bunchSize × nThreads to
bunchSize × nThreads), which is a small performance trade-off for
correctness. The fBunchId member and its increment can be removed as a
follow-up cleanup.


6. Reproducers

Two minimal reproducers are provided in
tree/ntuple/test/ of the ROOT master worktree.

6a. Standalone C++ reproducer (repro_rclusterpool_race.cpp)

Uses the public RNTupleWriter/RNTupleReader API. Writes 16 entries with a
per-entry cluster budget, then reads them back; each cluster boundary exercises
the race window.

Compile and run (ASLR must be disabled to avoid TSAN shadow-map conflicts):

g++ -fsanitize=thread -std=c++17 -O1 \
    $(root-config --cflags --libs) -lROOTNTuple \
    -o repro repro_rclusterpool_race.cpp

TSAN_OPTIONS="halt_on_error=0" setarch $(uname -m) -R ./repro

Key write options required to avoid a "page buffer memory budget too small" error:

// Page-buffer budget = 2 × ApproxZippedClusterSize; initial page size must fit.
opts.SetInitialUnzippedPageSize(sizeof(double)); // 8 bytes = 1 element/page
opts.SetApproxZippedClusterSize(sizeof(double)); // 8 bytes ≈ 1 entry/cluster

6b. GTest regression test (ntuple_cluster_race.cxx)

Uses an internal mock (RPageSourceSlowMock) that sleeps 20 ms on every
LoadClusters() call after the first. The sleep holds T_io inside bunch N+1
long enough for the consumer to evict cluster N, opening the race window.

Register in tree/ntuple/test/CMakeLists.txt:

ROOT_ADD_GTEST(ntuple_cluster_race ntuple_cluster_race.cxx LIBRARIES ROOTNTuple)

Run with TSAN:

cmake -DCMAKE_CXX_FLAGS="-fsanitize=thread" ...
ctest -R ClusterPool_NoRaceBetweenEvictionAndPrefetch

Note on TSAN detection reliability: the race manifests when the system
allocator reuses freed node addresses across threads. With glibc's per-thread
tcache enabled (default, glibc ≥ 2.26), cross-thread reuse may be suppressed
for small allocations. For reliable triggering either build ROOT with TSAN
(so the full allocation path is instrumented) or use jemalloc with
MALLOC_CONF=tcache:false.


7. Short-Term Workaround for EICrecon

Until the ROOT fix lands in a container release, add a TSAN suppression to
EICrecon/.github/tsan.supp:

race:ROOT::Internal::RCluster::Adopt

This suppresses the race report without affecting normal test output.
