Skip to content

✨ feat(rw): add SoftReadWriteLock for NFS and HPC clusters#528

Draft
gaborbernat wants to merge 1 commit intotox-dev:mainfrom
gaborbernat:feat/soft-read-write-lock
Draft

✨ feat(rw): add SoftReadWriteLock for NFS and HPC clusters#528
gaborbernat wants to merge 1 commit intotox-dev:mainfrom
gaborbernat:feat/soft-read-write-lock

Conversation

@gaborbernat
Copy link
Copy Markdown
Member

Slurm and other HPC deployments run across multiple compute nodes sharing an NFS mount and need reader/writer locking semantics there. The existing ReadWriteLock is SQLite-backed and cannot run on NFS because sqlite.org itself warns against POSIX fcntl locking on network filesystems. SoftFileLock works on NFS but is exclusive-only, and its pid+hostname stale check only evicts crashes on the same host, so a dead writer on node-42 would wedge readers on every other node until manual cleanup. Closes #526.

SoftReadWriteLock fills both gaps. It stores state as sidecar files next to the lock path: a short-held .state SoftFileLock mutex, a .write marker for the exclusive writer, and a .readers/ directory with one file per active reader. Each marker carries a random 128-bit token plus the holder's pid and hostname; a daemon heartbeat thread refreshes mtime every heartbeat_interval seconds (default 30), and any process on any host may evict a marker whose mtime has not advanced in stale_threshold seconds (default 90, the etcd LeaseKeepAlive ratio of TTL/3). 🩺 Eviction uses an atomic read → rename-to-unique-break-name → re-verify → unlink pattern borrowed from SoftFileLock._try_break_stale_lock, so a heartbeat firing mid-break cannot double-own the lock. Writer acquisition is two-phase and writer-preferring: phase one atomically claims .write (which immediately blocks new readers), phase two waits for existing readers to drain. A 99/1 reader-heavy workload cannot starve a writer.

Every marker open uses O_CREAT | O_EXCL | O_NOFOLLOW with mode 0o600 to block the symlink-overwrite attack where an attacker pre-creates the marker as a symlink to a victim file. 🔒 The readers directory gets 0o700 plus an lstat check and a dirfd-relative open, because mkdir has no O_NOFOLLOW equivalent; every subsequent reader-file operation flows through the dirfd. The marker parser is bounded (1 KiB cap, regex-validated token, bounded pid, ASCII hostname) so attacker-controlled content cannot crash the library or mis-authenticate as a live holder. The class runs the same three trust-boundary classes SoftFileLock already handles (same-UID non-cooperating processes, cross-host same-UID, same-host different-UID via permissions) and does not claim protection against root compromise, NTP tampering, or multi-tenant mounts where hostile co-tenants share the UID.

Fork safety was the last hazard worth calling out. Python threads do not survive fork(), so a child would inherit the marker files and the lock-level state without the heartbeat thread; the parent would keep refreshing while the child would not, and both processes would believe they hold the lock. 🍴 os.register_at_fork(after_in_child=...) replaces the inherited threading.Lock objects with fresh ones and marks the instance fork-invalidated, so release() on an inherited with block is a silent no-op and the child must call SoftReadWriteLock(path) again before acquiring. This matches PyMongo's connection-pool semantics and is documented in the concepts page alongside the heartbeat/TTL explanation.

For reviewers: the docs touch all four Diátaxis dimensions (tutorials, how-to, concepts, autodoc-driven API reference) plus a new card in the lock-types grid on docs/index.rst. The changelog file is intentionally not edited — the release workflow regenerates it from PR titles at tag time.

Context and design discussion lives in the issue thread: #526

@gaborbernat gaborbernat force-pushed the feat/soft-read-write-lock branch 6 times, most recently from acb0094 to 0afd22f Compare April 14, 2026 03:48
Slurm and other HPC deployments run across multiple compute nodes sharing
an NFS mount, and need reader/writer locking semantics there. The existing
ReadWriteLock is SQLite-backed and cannot run on NFS because SQLite's own
docs warn against POSIX fcntl locking on network filesystems. SoftFileLock
works on NFS but is exclusive-only and its pid+hostname stale check only
evicts crashes on the same host, so a dead writer on one node would wedge
readers on every other node until manual cleanup.

SoftReadWriteLock fills both gaps. It stores state as sidecar files next
to the lock path: a short-held .state SoftFileLock mutex, a .write marker
for the exclusive writer, and a .readers/ directory with one file per
active reader. Each marker carries a random 128-bit token plus the
holder's pid and hostname; a daemon heartbeat thread refreshes mtime
every 30 seconds, and any process on any host may evict a marker whose
mtime has not advanced in 90 seconds. This is the etcd LeaseKeepAlive
TTL/3 ratio adapted to filesystem primitives.

Eviction uses an atomic read-then-rename-then-verify pattern borrowed
from SoftFileLock so a heartbeat firing mid-break cannot double-own the
lock. Writer acquisition is two-phase and writer-preferring: phase one
claims .write, which immediately blocks new readers, phase two drains
existing readers. A 99/1 reader-heavy workload cannot starve a writer.

Every marker open uses O_CREAT | O_EXCL | O_NOFOLLOW with mode 0o600 to
block the symlink-overwrite attack where an attacker pre-creates the
marker as a symlink to a victim file. The readers directory gets 0o700
plus an lstat check and a dirfd-relative open, because mkdir has no
O_NOFOLLOW equivalent. The marker parser is bounded (1 KiB cap, regex-
validated token, bounded pid, ASCII hostname) so attacker-controlled
content cannot crash or mis-authenticate.

Fork safety is the last hazard. Python threads do not survive fork(),
so the child would inherit the marker files and lock-level state but
no heartbeat thread; the parent would keep refreshing while the child
would not, and both would believe they hold the lock. An os.register_
at_fork(after_in_child=...) hook replaces the inherited threading.Lock
objects with fresh ones and marks the instance fork-invalidated, making
release() a no-op so inherited `with` blocks unwind cleanly in the
child. PyMongo's connection-pool handling is the reference.

Tests cover behavior, reentrancy, stale detection, symlink refusal,
permissions, transaction-lock thread contention, and fork semantics
entirely through the public API. Coverage on the new modules is 100%
on async and 99% on sync, with pragmas on race-only branches. The full
suite runs stably across multiple back-to-back iterations.

Closes tox-dev#526
@gaborbernat gaborbernat force-pushed the feat/soft-read-write-lock branch from 0afd22f to 5ed7f6d Compare April 14, 2026 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature request: ReadWriteLock that is compatible with network filesystems

1 participant