
feat(runtime): file-lock the TRT-RTX runtime cache for cross-runtime safety #1

Draft

tp5uiuc wants to merge 1 commit into feat/trtrtx-cpp-runtime from feat/runtime-cache-file-lock

Conversation


tp5uiuc commented Apr 28, 2026

Description

Adds a cross-platform RAII file-lock primitive (core/util/file_lock.{h,cpp}) so the Python and C++ TRT-RTX runtimes sharing a runtime_cache_path cannot race the rename and silently drop compiled kernels. Wires it into load_runtime_cache (shared lock) and save_runtime_cache_impl (exclusive lock) in core/runtime/TRTRuntimeConfig.cpp.

This is the follow-up to a reviewer comment on the parent C++-runtime port PR, which intentionally landed without locking and asked for a separate "platform-independent file-locking" pass.

Backend choices

The Python library is filelock (tox-dev/filelock; imported as from filelock import FileLock). Source-verified behavior:

  • Linux/macOS: fcntl.flock(fd, LOCK_EX | LOCK_NB) — BSD flock(2), NOT POSIX fcntl(F_SETLK) record locks.
  • Windows: msvcrt.locking(fd, LK_NBLCK, 1) — wraps Win32 LockFile, locks 1 byte at offset 0.

For the C++ side to interoperate with filelock on the same <cache>.lock file, the lock-file convention has to match exactly (see the sketch after this list):

  • Unix backend MUST use flock(2), not fcntl(F_SETLK) — on Linux the two primitives live in independent namespaces, so fcntl would silently fail to interop.
  • Windows backend uses LockFileEx on byte range (0, 1) — locking the whole file would not conflict with the Python side's 1-byte lock.
  • Lock file: <cache_path>.lock next to the cached artifact (matches filelock's default).
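A sketch of the Unix side of this convention, as a hypothetical free-standing helper for illustration (the actual implementation is the FileLock class in core/util/file_lock.cpp):

```cpp
// Hypothetical helper illustrating the lock-file convention above; the
// actual implementation is the FileLock class in core/util/file_lock.cpp.
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#include <stdexcept>
#include <string>

// Open (creating if needed) <cache_path>.lock and take a non-blocking
// flock(2) lock on it. Returns the fd on success, -1 on contention.
int try_lock_cache(const std::string& cache_path, bool exclusive) {
  const std::string lock_path = cache_path + ".lock";  // filelock's default
  int fd = ::open(lock_path.c_str(), O_RDWR | O_CREAT, 0644);
  if (fd < 0) {
    throw std::runtime_error("cannot open lock file: " + lock_path);
  }
  // flock(2), not fcntl(F_SETLK): only flock shares a lock namespace with
  // Python's fcntl.flock, so only flock actually contends with filelock.
  const int op = (exclusive ? LOCK_EX : LOCK_SH) | LOCK_NB;
  if (::flock(fd, op) != 0) {
    ::close(fd);
    return -1;  // held elsewhere, possibly by the Python runtime
  }
  return fd;  // closing the fd releases the lock; the .lock file is kept
}
```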

API shape

Single class with explicit lock / try_lock / try_lock_for(timeout). Move-only, RAII destructor. Mode { Shared, Exclusive } — load takes shared, save takes exclusive. flock(2) has no native timeout, so try_lock_for is a 50ms-cadence poll loop. 10s default timeout matches the Python side's acquire(timeout=10).
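A header-style sketch of that shape (signatures approximated from this description, not copied from core/util/file_lock.h):

```cpp
// Approximate API shape; signatures are reconstructed from the PR
// description, not copied from core/util/file_lock.h.
#include <chrono>
#include <string>
#include <thread>

class FileLock {
 public:
  enum class Mode { Shared, Exclusive };

  // `lock_path` is the <cache_path>.lock file, already suffixed.
  FileLock(std::string lock_path, Mode mode);
  ~FileLock();  // releases the lock if held (RAII); never unlinks the file

  FileLock(FileLock&&) noexcept;  // move-only
  FileLock& operator=(FileLock&&) noexcept;
  FileLock(const FileLock&) = delete;
  FileLock& operator=(const FileLock&) = delete;

  void lock();      // block until acquired
  bool try_lock();  // one non-blocking attempt

  // flock(2) has no native timeout, so poll try_lock on a 50ms cadence
  // until the deadline. Default matches filelock's acquire(timeout=10).
  bool try_lock_for(
      std::chrono::milliseconds timeout = std::chrono::seconds(10)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!try_lock()) {
      if (std::chrono::steady_clock::now() >= deadline) return false;
      std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    return true;
  }
};
```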

Errors propagate via TORCHTRT_CHECK; the existing try/catch in ensure_initialized and the noexcept save_runtime_cache wrapper catch and log them, so external behavior on contention is unchanged.
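Roughly how the save path is wired, as a simplified paraphrase (write_file is a hypothetical stand-in for the actual serialization, and the real code in core/runtime/TRTRuntimeConfig.cpp reports errors via TORCHTRT_CHECK rather than a bare throw):

```cpp
// Simplified paraphrase of the save-path wiring. `write_file` is a
// hypothetical stand-in; the real code uses TORCHTRT_CHECK, not throw.
#include <chrono>
#include <stdexcept>
#include <string>
#include <vector>

void write_file(const std::string& path, const std::vector<char>& blob);

void save_runtime_cache_impl(const std::string& cache_path,
                             const std::vector<char>& blob) {
  FileLock lock(cache_path + ".lock", FileLock::Mode::Exclusive);
  if (!lock.try_lock_for(std::chrono::seconds(10))) {
    throw std::runtime_error("timed out waiting for runtime cache lock");
  }
  // Write in place while holding the exclusive lock; readers take the
  // shared lock, so they cannot observe a partial write.
  write_file(cache_path, blob);
}  // ~FileLock releases; the noexcept save_runtime_cache wrapper catches
   // and logs any throw, so behavior on contention is unchanged.
```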

Type of change

  • New feature (non-breaking change which adds functionality)

Tests

C++ unit tests (tests/cpp/test_file_lock.cpp): 12 cases covering ctor, exclusive/shared contention, mixed-mode contention, timeout edges, RAII release, move semantics, no-unlink-on-release, open-failure throw, and a same-namespace flock(2) interop check that verifies the C++ primitive conflicts with raw flock locks (the operation filelock uses).
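A condensed sketch of that interop check, assuming the FileLock API sketched earlier (the real test body in tests/cpp/test_file_lock.cpp may assert differently; POSIX-only):

```cpp
// Condensed sketch of the flock(2) interop check; assumes the FileLock
// API sketched above. The real test body may differ.
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#include <gtest/gtest.h>

TEST(FileLockInterop, ConflictsWithRawFlock) {
  const char* lock_path = "/tmp/trt_cache.lock";
  int fd = ::open(lock_path, O_RDWR | O_CREAT, 0644);
  ASSERT_GE(fd, 0);
  // Hold a raw exclusive flock -- the same operation Python's filelock
  // performs -- and check that the C++ primitive sees the conflict.
  ASSERT_EQ(::flock(fd, LOCK_EX | LOCK_NB), 0);

  FileLock lock(lock_path, FileLock::Mode::Exclusive);
  EXPECT_FALSE(lock.try_lock());  // same namespace, so contention is visible

  ASSERT_EQ(::flock(fd, LOCK_UN), 0);
  EXPECT_TRUE(lock.try_lock());  // acquirable once the raw lock is released
  ::close(fd);
}
```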

Python E2E (tests/py/dynamo/runtime/test_000_runtime_cache.py):

  • Parameterizes test_filelock_works and test_sequential_save_load over both runtimes.
  • New test_python_lock_blocks_cpp_save: an externally held filelock causes the C++ save to time out silently (the noexcept save_runtime_cache wrapper catches and logs), the cache is unmodified while the lock is held, and a fresh save after release succeeds.
  • New test_filelock_cross_runtime_parallel: two subprocesses (one Python-runtime, one C++-runtime) compile against a shared runtime_cache_path and both succeed without deadlock or corruption. Subprocesses rather than threads because torch.export has thread-unsafe TLS, but cross-process is the real-world locking scenario anyway.

Local runs (RTX, A100):

  • bazel test //tests/cpp:test_file_lock: 12 passed, 353ms total
  • pytest tests/py/dynamo/runtime/test_000_runtime_cache.py: 21 passed, 2 skipped (non-RTX), 0 failed

Note on base branch

This PR is opened against feat/trtrtx-cpp-runtime rather than main because the parent C++-runtime port has not landed yet. Once that merges, this branch will be rebased onto main before going to upstream review.

Checklist

  • My code follows the style guidelines of this project (pre-commit clean: clang-format, black, isort, ruff, buildifier, typos, NVIDIA-INTERNAL all pass)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation (no public-facing API change)
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

Commits:

  1. feat(runtime): file-lock the TRT-RTX runtime cache for cross-runtime safety — the feature.
  2. style(bazel): buildifier sweep on touched BUILD files — drive-by buildifier reformatting on pre-existing BUILD code, kept separate so the functional diff is easy to review.
  3. docs(file_lock): correct package name and rename "wire protocol" to "lock-file convention" — comment-only fix: the Python library is filelock (tox-dev/filelock), not py-filelock.

Review comment threads (now outdated) on core/util/file_lock.h, core/util/file_lock.cpp, and core/runtime/TRTRuntimeConfig.cpp.
tp5uiuc force-pushed the feat/trtrtx-cpp-runtime branch from e852123 to 2705f49 on May 6, 2026.

Commit message:

feat(runtime): file-lock the TRT-RTX runtime cache for cross-runtime safety

Adds a cross-platform RAII file-lock primitive (core/util/file_lock.{h,cpp})
matching filelock's lock-file convention so the Python and C++ runtimes
sharing a runtime_cache_path do not race the rename and silently drop
compiled kernels.

- Unix backend uses BSD flock(2) -- the primitive filelock uses, not
  POSIX fcntl record locks (which live in an independent namespace and
  would silently fail to interop on Linux).
- Windows backend uses LockFileEx on byte (0,1) -- matches the byte range
  msvcrt.locking(..., 1) locks on the Python side.
- Platform branch is hidden behind a LockHandle struct with move-and-swap
  semantics, so callers only see a single FileLock RAII type.
- Shared/exclusive modes: load takes shared (multiple readers OK), save
  takes exclusive. Python's FileLock is exclusive-only but conflicts
  correctly against C++ shared holders since both use the flock namespace.
- 10s acquire timeout via 50ms-cadence poll loop, matching the Python
  side's timeout=10. Lock-file path is <cache_path>.lock.

Wired into load_runtime_cache and save_runtime_cache_impl, with the
FileLock scoped to just the I/O block (save writes in-place under the
lock, no tmp+rename). Errors propagate via TORCHTRT_CHECK; the existing
try/catch in ensure_initialized and the noexcept save_runtime_cache
wrapper catch and log, so external behavior on contention is unchanged.

Tests:
- tests/cpp/test_file_lock.cpp: 12 unit tests covering exclusive/shared
  contention, timeout edges, RAII release, move semantics, no-unlink-on-
  release, and a same-namespace flock(2) interop check that verifies the
  C++ primitive conflicts with raw flock locks (what filelock uses).
- tests/py/dynamo/runtime/test_000_runtime_cache.py:
  - parameterizes test_filelock_works and test_sequential_save_load over
    both runtimes
  - test_python_lock_blocks_cpp_save: an externally held filelock causes
    the C++ save to time out silently, leaving the cache file unmodified;
    a fresh save after release succeeds
  - test_filelock_cross_runtime_parallel: two subprocesses (one Python-
    runtime, one C++-runtime) compile against a shared cache_path and both
    succeed. Subprocesses rather than threads because torch.export has
    thread-unsafe TLS, but cross-process is the real-world locking
    scenario anyway.