fix(tracer): O_TRUNC opens are write-only for lineage (P1-43) #100
Merged
`roar show <artifact>` was listing every output file as both
"produced by" and "consumed by" the same job. The data layer was
honest — `job_inputs` and `job_outputs` really did both contain the
same (path, artifact_id) for one job — but the artifact_id was the
*post-write* hash, so the input edge was structurally a lie: it
claimed the job consumed content that came into being during the
same run.
## Root cause
Empirically, `np.savez(path, ...)` (via `zipfile.ZipFile(path, "w")`)
opens the output file with `O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC` so the
zip central directory can be seek-patched at close. The preload
tracer's `flags_imply_read` correctly returned true for any
`O_RDWR`, so it emitted an `OpenRead` event for the truncating open.
As a result, the file was classified as both read and written.
But `O_TRUNC` destroys the file's prior content atomically at open
time. Any subsequent read through that fd returns post-write output,
not an input. The classification is semantically wrong regardless of
whether actual `read()` syscalls ever fire.
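
The truncation claim can be checked in isolation. The following is a minimal sketch using only the Rust standard library, not code from this PR: `OpenOptions` with `read`, `write`, `create`, and `truncate` set is the std equivalent of the `O_RDWR|O_CREAT|O_TRUNC` open described above. The function names and the scratch file name are hypothetical.

```rust
use std::fs::{self, OpenOptions};
use std::io::{Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Opens `path` read-write with truncation, writes `new`, then reads the
/// file back through the same handle.
/// Returns (file length right after open, bytes read back).
fn truncating_roundtrip(path: &Path, new: &[u8]) -> std::io::Result<(u64, Vec<u8>)> {
    let mut f = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    // The prior content is already gone here, before any write happens.
    let len_after_open = f.metadata()?.len();
    f.write_all(new)?;
    f.seek(SeekFrom::Start(0))?;
    let mut buf = Vec::new();
    f.read_to_end(&mut buf)?;
    Ok((len_after_open, buf))
}

/// Seeds a scratch file with "old" content, then runs the roundtrip on it.
fn demo_truncating_open() -> std::io::Result<(u64, Vec<u8>)> {
    // Hypothetical scratch file, used only for this illustration.
    let path = std::env::temp_dir().join(format!("o_trunc_demo_{}.bin", std::process::id()));
    fs::write(&path, b"old content from an earlier run")?;
    let result = truncating_roundtrip(&path, b"new output");
    let _ = fs::remove_file(&path);
    result
}
```

Everything read through the handle is content the same process just wrote, which is why treating the fd as an input source is wrong even when `read()` calls do occur.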
The user's mmap counterexample (a file with `O_RDWR` *without*
`O_TRUNC` can legitimately be both read and written via mmap)
correctly rules out a path-level dedup. The right fix is per-fd:
when `O_TRUNC` is observed at open time, the fd is flagged and any
subsequent read-marking is dropped. Non-truncating `O_RDWR` opens
keep both classifications.
## Changes
- `tracer-fd` (shared crate): `FdState` gets `was_truncated`.
`handle_open(pid, fd, path, flags)` now actually inspects the
flags it was already given and sets the bit when `O_TRUNC` is set.
`mark_read_internal`, `handle_read_internal`, and
`handle_pread_internal` all short-circuit on a truncated fd.
- preload (`lib.rs`): `flags_imply_read` returns false on `O_TRUNC`,
so the upstream `OpenRead` event is never emitted for the
zipfile-style open. This catches the case before it reaches the
daemon — the preload daemon's path-keyed state means the per-fd
suppression in `tracer-fd` wouldn't naturally apply to its
`record_read` path.
- ebpf userspace: new cross-crate test
(`test_o_trunc_open_then_read_does_not_classify_as_input`) wires
through `state.handle_open(pid, fd, path, O_TRUNC_LINUX)` →
`process_small_event(Read)` and asserts the fd is *not* promoted
to `read_files`. Verifies ebpf inherits the fix via the shared
crate without code changes of its own.
- ptrace already routes `(pid, fd, path, flags)` through the same
`tracer-fd::handle_open`, so it inherits the fix identically.
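
The per-fd suppression described in the list above can be sketched as follows. This is a simplified illustration, not the actual `tracer-fd` code: `TracerState`, its fields, and the simplified method signatures are assumptions, and the flag constants are the Linux x86_64 values hard-coded to avoid a `libc` dependency. Only `FdState`, `was_truncated`, and the `handle_open(pid, fd, path, flags)` argument order come from the description.

```rust
use std::collections::{HashMap, HashSet};

// Linux x86_64 flag values, hard-coded so the sketch needs no `libc` crate.
const O_RDWR: i32 = 0o2;
const O_CREAT: i32 = 0o100;
const O_TRUNC: i32 = 0o1000;

/// Per-fd record; `was_truncated` is the bit named in the change list.
struct FdState {
    path: String,
    was_truncated: bool,
}

/// Hypothetical stand-in for the shared tracer state.
#[derive(Default)]
struct TracerState {
    fds: HashMap<(i32, i32), FdState>, // keyed by (pid, fd)
    read_files: HashSet<String>,
    written_files: HashSet<String>,
}

impl TracerState {
    /// Records the fd and remembers whether the open truncated the file.
    fn handle_open(&mut self, pid: i32, fd: i32, path: &str, flags: i32) {
        let was_truncated = (flags & O_TRUNC) != 0;
        self.fds.insert((pid, fd), FdState { path: path.to_string(), was_truncated });
    }

    /// Read-marking short-circuits on a truncated fd.
    fn mark_read(&mut self, pid: i32, fd: i32) {
        if let Some(state) = self.fds.get(&(pid, fd)) {
            if state.was_truncated {
                return;
            }
            self.read_files.insert(state.path.clone());
        }
    }

    /// Write-marking is unaffected by truncation.
    fn mark_written(&mut self, pid: i32, fd: i32) {
        if let Some(state) = self.fds.get(&(pid, fd)) {
            self.written_files.insert(state.path.clone());
        }
    }
}

/// One zipfile-style truncating open and one in-place read-write open.
fn demo_per_fd_suppression() -> TracerState {
    let mut s = TracerState::default();
    s.handle_open(100, 3, "train_feats.npz", O_RDWR | O_CREAT | O_TRUNC);
    s.mark_read(100, 3);
    s.mark_written(100, 3);
    s.handle_open(100, 4, "index.db", O_RDWR); // hypothetical in-place editor
    s.mark_read(100, 4);
    s.mark_written(100, 4);
    s
}
```

After `demo_per_fd_suppression()`, the truncated file appears only in `written_files`, while the non-truncating `O_RDWR` file keeps both classifications.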
## Tests
- `tracer-fd` (3 new): O_TRUNC suppresses both `handle_read` and
`mark_read`; non-O_TRUNC O_RDWR keeps both classifications.
- preload (`flags_classification_tests` mod, 5 cases): exhaustive
matrix over RDWR/WRONLY/RDONLY × {TRUNC, no-TRUNC}.
- ebpf userspace (1 cross-crate test): open+read sequence with
`O_TRUNC_LINUX` does not promote to read_files.
- End-to-end: re-ran `roar run --tracer preload python extract.py
...` on the user's mnist repo. The new job's `job_inputs` now
lists only `train-...parquet` (the upstream); `train_feats.npz`
appears only in `job_outputs`.
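
For reference, the flag classification exercised by the preload tests can be approximated as below. This is a sketch under assumptions: the real `flags_imply_read` in `lib.rs` may be structured differently, the constants are Linux x86_64 values, and `classification_matrix` is a hypothetical helper that enumerates all six mode/truncation combinations rather than reproducing the five test cases.

```rust
// Linux x86_64 flag values, hard-coded so the sketch needs no `libc` crate.
const O_RDONLY: i32 = 0o0;
const O_WRONLY: i32 = 0o1;
const O_RDWR: i32 = 0o2;
const O_ACCMODE: i32 = 0o3;
const O_TRUNC: i32 = 0o1000;

/// Returns true when an open with `flags` should count as a potential read.
/// Any open carrying O_TRUNC is treated as write-only for lineage.
fn flags_imply_read(flags: i32) -> bool {
    if (flags & O_TRUNC) != 0 {
        return false;
    }
    let mode = flags & O_ACCMODE;
    mode == O_RDONLY || mode == O_RDWR
}

/// Access mode x {no-TRUNC, TRUNC}, labelled for readability.
fn classification_matrix() -> Vec<(&'static str, bool)> {
    vec![
        ("O_RDONLY", flags_imply_read(O_RDONLY)),
        ("O_RDONLY|O_TRUNC", flags_imply_read(O_RDONLY | O_TRUNC)),
        ("O_WRONLY", flags_imply_read(O_WRONLY)),
        ("O_WRONLY|O_TRUNC", flags_imply_read(O_WRONLY | O_TRUNC)),
        ("O_RDWR", flags_imply_read(O_RDWR)),
        ("O_RDWR|O_TRUNC", flags_imply_read(O_RDWR | O_TRUNC)),
    ]
}
```

Only the two non-truncating modes that permit reading (`O_RDONLY` and `O_RDWR`) classify as read; every `O_TRUNC` row is false.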
## What this does NOT fix
- Historical bad rows in existing roar databases survive — they
were recorded by the pre-fix tracer. A `roar reset` or selective
cleanup is the user-facing remedy. Future runs are clean.
- The preload tracer's `mmap` hook (`lib.rs:1336`) still emits
`emit_fd_read` based on `PROT_READ` without consulting the
underlying fd's `O_TRUNC` state. A process that opens with
`O_RDWR|O_TRUNC` and then mmaps the same fd with `PROT_READ`
would still re-pollute via the path-keyed daemon state. This
isn't hit by the zipfile path (strace showed pure writes, no
mmap on the truncated fd) but is a real hole. Logged as a P1-15
follow-up; the cleanest fix probably routes the mmap hook through
the shared `tracer-fd` so its per-fd `was_truncated` check
applies.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous P1-15 commit gated reads in the shared `tracer-fd`
crate, which ebpf inherited automatically (all its read paths are
fd-keyed). ptrace mostly did too — its `SYS_READ`, `SYS_PREAD64`,
and sendfile/copy_file_range handlers all use the fd-keyed
`mark_read_with_thread` — but `SYS_MMAP` was using path-keyed
`mark_path_read_with_thread`. Path-keyed calls don't consult
`fd_state.was_truncated`, so an mmap on a truncated fd would still
classify the file as read.
Switch the mmap branch to fd-keyed `mark_read_with_thread` /
`mark_written_with_thread` so the suppression in `tracer-fd` applies.
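
The difference between the two keying strategies can be illustrated with a small model. This is a sketch, not the ptrace tracer's code: `State`, `mmap_path_keyed`, and `mmap_fd_keyed` are hypothetical names, thread tracking is omitted, and mapping `PROT_WRITE` to a write mark (ignoring `MAP_SHARED` vs `MAP_PRIVATE`) is an assumption made for brevity.

```rust
use std::collections::{HashMap, HashSet};

// Linux values for the mmap protection bits.
const PROT_READ: i32 = 0x1;
const PROT_WRITE: i32 = 0x2;

struct FdState {
    path: String,
    was_truncated: bool,
}

/// Hypothetical, simplified tracer state keyed by fd only.
#[derive(Default)]
struct State {
    fds: HashMap<i32, FdState>,
    read_files: HashSet<String>,
    written_files: HashSet<String>,
}

impl State {
    fn open(&mut self, fd: i32, path: &str, was_truncated: bool) {
        self.fds.insert(fd, FdState { path: path.to_string(), was_truncated });
    }

    /// Path-keyed marking: has no fd, so it cannot see `was_truncated`.
    fn mark_path_read(&mut self, path: &str) {
        self.read_files.insert(path.to_string());
    }

    /// Fd-keyed marking: suppressed on a truncated fd.
    fn mark_read(&mut self, fd: i32) {
        if let Some(s) = self.fds.get(&fd) {
            if !s.was_truncated {
                self.read_files.insert(s.path.clone());
            }
        }
    }

    fn mark_written(&mut self, fd: i32) {
        if let Some(s) = self.fds.get(&fd) {
            self.written_files.insert(s.path.clone());
        }
    }

    /// Pre-fix shape: resolve the fd to a path, then mark by path.
    fn mmap_path_keyed(&mut self, fd: i32, prot: i32) {
        let path = match self.fds.get(&fd) {
            Some(s) => s.path.clone(),
            None => return,
        };
        if (prot & PROT_READ) != 0 {
            self.mark_path_read(&path);
        }
        if (prot & PROT_WRITE) != 0 {
            self.written_files.insert(path);
        }
    }

    /// Post-fix shape: mark by fd so the truncation check applies.
    fn mmap_fd_keyed(&mut self, fd: i32, prot: i32) {
        if (prot & PROT_READ) != 0 {
            self.mark_read(fd);
        }
        if (prot & PROT_WRITE) != 0 {
            self.mark_written(fd);
        }
    }
}

/// Same truncated fd and same mmap call, handled both ways.
fn demo_mmap_keying() -> (State, State) {
    let mut before = State::default();
    before.open(5, "out.bin", true);
    before.mmap_path_keyed(5, PROT_READ | PROT_WRITE);

    let mut after = State::default();
    after.open(5, "out.bin", true);
    after.mmap_fd_keyed(5, PROT_READ | PROT_WRITE);
    (before, after)
}
```

In the path-keyed variant the truncated file still lands in `read_files`; in the fd-keyed variant it is recorded as written only.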
Tracer-symmetry status after this commit:
- ebpf: full coverage (all reads fd-keyed).
- ptrace: full coverage (this commit closes the mmap gap).
- preload: partial. `flags_imply_read` returning false on O_TRUNC
catches the OpenRead intent event (and so the user's
zipfile case). But the preload daemon is path-keyed
by design — `TraceEvent::Read` arrives with only the
path, and the preload-side hooks don't track per-fd
O_TRUNC state. So a process that does
`open(O_RDWR|O_TRUNC); read(fd); ...` or
`open(O_RDWR|O_TRUNC); mmap(fd, PROT_READ)` would
still mark the file as read under preload. Closing
this requires either a schema tweak (Read event
carries an "fd-truncated" bit set by the tracee
hooks) or per-fd state tracking in the tracee. Logged
as P1-15 follow-up; not hit by the zipfile path.
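
One possible shape for the schema tweak mentioned in the preload item is sketched below. It is purely illustrative and not part of this commit: the `fd_truncated` field, the `Daemon` type, and the `record` method are hypothetical, and the real `TraceEvent` enum has variants and fields not shown here. The sketch also assumes the tracee hooks already know, per fd, whether the open carried `O_TRUNC`.

```rust
use std::collections::HashSet;

/// Hypothetical event schema: Read carries a bit set by the tracee hooks
/// when the originating fd was opened with O_TRUNC.
enum TraceEvent {
    Read { path: String, fd_truncated: bool },
    Write { path: String },
}

/// Hypothetical path-keyed daemon state.
#[derive(Default)]
struct Daemon {
    read_files: HashSet<String>,
    written_files: HashSet<String>,
}

impl Daemon {
    fn record(&mut self, ev: TraceEvent) {
        match ev {
            // Reads through a truncated fd return post-write output: drop them.
            TraceEvent::Read { fd_truncated: true, .. } => {}
            TraceEvent::Read { path, .. } => {
                self.read_files.insert(path);
            }
            TraceEvent::Write { path } => {
                self.written_files.insert(path);
            }
        }
    }
}

/// A genuine input read plus a read-back on a truncated output fd.
fn demo_truncated_read_event() -> Daemon {
    let mut d = Daemon::default();
    // "input.parquet" is a placeholder name, not the file from the e2e run.
    d.record(TraceEvent::Read { path: "input.parquet".into(), fd_truncated: false });
    d.record(TraceEvent::Write { path: "train_feats.npz".into() });
    d.record(TraceEvent::Read { path: "train_feats.npz".into(), fd_truncated: true });
    d
}
```

With this shape the daemon stays path-keyed; the per-fd knowledge lives in the tracee and travels with each event.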
Covered by the existing `tracer-fd::test_o_trunc_suppresses_mark_read_too`
test — that's exactly what the ptrace mmap branch now calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TrevorBasinger
approved these changes
May 13, 2026
## Summary
`roar show <artifact>` was double-listing every output file as both producer and consumer. Root cause: `np.savez` / `zipfile.ZipFile("w")` opens the output with `O_RDWR|O_CREAT|O_TRUNC` (for seek-patching the central directory), and the preload tracer's `flags_imply_read` honored the `O_RDWR` access mode → false input edge.
`O_TRUNC` destroys prior content atomically at open time, so any read through that fd returns post-write output, not an input. The recorded "input" hash equals the output hash, which is structurally impossible without time travel.
## Cross-tracer policy
- `tracer-fd` (shared crate): `FdState.was_truncated` flag, set on `handle_open` when the `O_TRUNC` bit is in flags. Read-marking functions (`mark_read`, `handle_read`, `handle_pread`) short-circuit on truncated fds. ebpf + ptrace inherit this automatically; both already pass flags to `handle_open`.
- `flags_imply_read` (preload) returns false on `O_TRUNC` so the `OpenRead` event is never emitted in the first place (the preload daemon is path-keyed, not fd-keyed, so the per-fd suppression in `tracer-fd` wouldn't naturally apply to it).
Non-truncating `O_RDWR` opens (genuine in-place editors, mmap-style dual-purpose) keep both classifications; the mmap counterexample explicitly rules out a path-level dedup.
## Verified
`roar run --tracer preload python extract.py ...` now records `train-...parquet` in `job_inputs` and `train_feats.npz` in `job_outputs` only. No phantom self-input.
## Known follow-ups
- `mmap` hook still emits `emit_fd_read` purely from `PROT_READ` without consulting `was_truncated` on the underlying fd. Not hit by the zipfile path (strace confirmed no mmap on the truncated fd) but is a real hole; separate P1-43 follow-up.
## Test plan
- `cargo test -p tracer-fd -p roar-tracer-preload -p roar-tracer-ebpf`
- `roar run` on user's mnist extract.py
🤖 Generated with Claude Code