feat(runtime): file-lock the TRT-RTX runtime cache for cross-runtime safety #1
Draft
tp5uiuc wants to merge 1 commit into feat/trtrtx-cpp-runtime
tp5uiuc
commented
Apr 28, 2026
e852123 to 2705f49
…safety
Adds a cross-platform RAII file-lock primitive (core/util/file_lock.{h,cpp})
matching py-filelock's lock-file convention so the Python and C++ runtimes
sharing a runtime_cache_path do not race the rename and silently drop
compiled kernels.
- Unix backend uses BSD flock(2) -- the primitive py-filelock uses, not
  POSIX fcntl record locks (which live in an independent namespace and
  would silently fail to interop on Linux).
- Windows backend uses LockFileEx on byte (0,1) -- matches the byte range
  msvcrt.locking(..., 1) locks on the Python side.
- Platform branch is hidden behind a LockHandle struct with move-and-swap
  semantics, so callers only see a single FileLock RAII type.
- Shared/exclusive modes: load takes shared (multiple readers OK), save
  takes exclusive. Python's FileLock is exclusive-only but conflicts
  correctly against C++ shared holders since both use the flock namespace.
- 10s acquire timeout via 50ms-cadence poll loop, matching the Python
  side's timeout=10. Lock-file path is <cache_path>.lock.
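The timeout and mode bullets above can be illustrated with a small Unix-only sketch. It is written in Python against `fcntl.flock` and is not the PR's C++ code; `try_lock_for` and `demo` are hypothetical names, and the short 0.2s timeout in `demo` is chosen only to keep the example fast.

```python
import fcntl
import os
import tempfile
import time


def try_lock_for(fd, exclusive, timeout=10.0, poll=0.05):
    """Poll a non-blocking flock(2) until it succeeds or `timeout` elapses."""
    op = (fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH) | fcntl.LOCK_NB
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(fd, op)
            return True
        except BlockingIOError:
            # flock(2) has no timeout of its own, so retry at a fixed cadence.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)


def demo():
    """Return (first reader, second reader, writer) acquisition results."""
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "cache.bin.lock")
        # Separate open() calls create separate open file descriptions,
        # which is what flock(2) locks are attached to.
        a = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
        b = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            first_reader = try_lock_for(a, exclusive=False, timeout=0.2)
            second_reader = try_lock_for(b, exclusive=False, timeout=0.2)
            fcntl.flock(b, fcntl.LOCK_UN)
            # `a` still holds a shared lock, so an exclusive request times out.
            writer = try_lock_for(b, exclusive=True, timeout=0.2)
            return first_reader, second_reader, writer
        finally:
            os.close(a)
            os.close(b)
```

Two shared holders coexist, while an exclusive request against a live shared holder gives up once the deadline passes, which is the load/save split the commit describes.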
Wired into load_runtime_cache and save_runtime_cache_impl, with the
FileLock scoped to just the I/O block (save writes in-place under the
lock, no tmp+rename). Errors propagate via TORCHTRT_CHECK; the existing
try/catch in ensure_initialized and the noexcept save_runtime_cache
wrapper catch and log, so external behavior on contention is unchanged.
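As a rough illustration of the wiring described above (lock scoped to the I/O block, in-place write, contention swallowed by the outer wrapper), here is a Unix-only Python sketch. It is an assumption-laden stand-in for the C++ code: `file_lock`, `save_cache`, `load_cache`, and `LockTimeout` are hypothetical names, and returning `False` stands in for the catch-and-log behavior.

```python
import contextlib
import fcntl
import os
import tempfile
import time


class LockTimeout(Exception):
    """Raised when the lock cannot be acquired before the deadline."""


@contextlib.contextmanager
def file_lock(cache_path, exclusive, timeout=10.0, poll=0.05):
    """Hold a flock(2) on <cache_path>.lock for the duration of the block."""
    fd = os.open(cache_path + ".lock", os.O_RDWR | os.O_CREAT, 0o644)
    op = (fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH) | fcntl.LOCK_NB
    deadline = time.monotonic() + timeout
    try:
        while True:
            try:
                fcntl.flock(fd, op)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise LockTimeout(cache_path)
                time.sleep(poll)
        yield
    finally:
        # Closing the descriptor releases the lock; the .lock file itself
        # is deliberately left on disk.
        os.close(fd)


def save_cache(cache_path, blob, timeout=10.0):
    """Write in place under an exclusive lock; report contention as False."""
    try:
        with file_lock(cache_path, exclusive=True, timeout=timeout):
            with open(cache_path, "wb") as f:
                f.write(blob)
        return True
    except LockTimeout:
        return False  # stand-in for "catch and log"; the cache is untouched


def load_cache(cache_path, timeout=10.0):
    """Read under a shared lock so concurrent readers do not block each other."""
    with file_lock(cache_path, exclusive=False, timeout=timeout):
        with open(cache_path, "rb") as f:
            return f.read()
```

Because the cache file is only opened for writing after the lock is held, a save that times out never truncates or modifies the existing cache.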
Tests:
- tests/cpp/test_file_lock.cpp: 12 unit tests covering exclusive/shared
  contention, timeout edges, RAII release, move semantics,
  no-unlink-on-release, and a same-namespace flock(2) interop check that
  verifies the C++ primitive conflicts with raw flock locks (what
  py-filelock uses).
- tests/py/dynamo/runtime/test_000_runtime_cache.py:
  - parameterizes test_filelock_works and test_sequential_save_load over
    both runtimes
  - test_python_lock_blocks_cpp_save: an externally-held py-filelock causes
    the C++ save to time out silently, leaving the cache file unmodified;
    a fresh save after release succeeds
  - test_filelock_cross_runtime_parallel: two subprocesses (one
    Python-runtime, one C++-runtime) compile against a shared cache_path
    and both succeed. Subprocesses rather than threads because torch.export
    has thread-unsafe TLS, but cross-process is the real-world locking
    scenario anyway.
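The same-namespace interop check mentioned in the test list can be sketched in a few lines. This is a hypothetical Unix-only Python probe, not the PR's C++ test: `probe_namespaces` is an invented name, and `fcntl.lockf` is used here as the standard-library route to POSIX `fcntl` record locks.

```python
import fcntl
import os
import tempfile


def probe_namespaces():
    """Hold a flock(2) lock, then try both lock families on a second fd."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cache.bin.lock")
        holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        other = os.open(path, os.O_RDWR)
        try:
            # Exclusive flock(2), the primitive the commit says filelock uses.
            fcntl.flock(holder, fcntl.LOCK_EX)

            def blocked(acquire):
                """Return True if a non-blocking exclusive request is refused."""
                try:
                    acquire(other, fcntl.LOCK_EX | fcntl.LOCK_NB)
                except OSError:
                    return True
                return False

            return {
                "flock": blocked(fcntl.flock),  # same namespace as the holder
                "fcntl": blocked(fcntl.lockf),  # POSIX record lock
            }
        finally:
            os.close(holder)
            os.close(other)
```

On Linux with a local filesystem, the `flock` entry is expected to report a conflict and the `fcntl` entry none, which is the silent interop failure the commit message warns about. Other platforms and network filesystems may behave differently.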
bb1f4e8 to e61bc12
Description
Adds a cross-platform RAII file-lock primitive (`core/util/file_lock.{h,cpp}`) so the Python and C++ TRT-RTX runtimes sharing a `runtime_cache_path` cannot race the rename and silently drop compiled kernels. Wires it into `load_runtime_cache` (shared lock) and `save_runtime_cache_impl` (exclusive lock) in `core/runtime/TRTRuntimeConfig.cpp`.

This is the follow-up to a reviewer comment on the parent C++-runtime port PR, which intentionally omitted locking and asked for a separate "platform-independent file-locking" pass.
Backend choices
The Python library is `filelock` (tox-dev/filelock; imported as `from filelock import FileLock`). Source-verified behavior:

- Unix: `fcntl.flock(fd, LOCK_EX | LOCK_NB)`, which is BSD `flock(2)`, NOT POSIX `fcntl(F_SETLK)` record locks.
- Windows: `msvcrt.locking(fd, LK_NBLCK, 1)`, which wraps Win32 `LockFile` and locks 1 byte at offset 0.

For the C++ side to interoperate with `filelock` on the same `<cache_path>.lock` file, the lock-file convention has to match exactly:

- Unix: `flock(2)`, not `fcntl(F_SETLK)`. On Linux the two primitives live in independent namespaces, so `fcntl` would silently fail to interop.
- Windows: `LockFileEx` on byte range `(0, 1)`. Locking the whole file would not conflict with the Python side's 1-byte lock.
- Lock-file path: `<cache_path>.lock` next to the cached artifact (matches `filelock`'s default).

API shape
- Single class with explicit `lock`/`try_lock`/`try_lock_for(timeout)`. Move-only, RAII destructor.
- `Mode { Shared, Exclusive }`: load takes shared, save takes exclusive.
- `flock(2)` has no native timeout, so `try_lock_for` is a 50ms-cadence poll loop. The 10s default timeout matches the Python side's `acquire(timeout=10)`.

Errors propagate via `TORCHTRT_CHECK`; the existing `try`/`catch` in `ensure_initialized` and the `noexcept save_runtime_cache` wrapper catch and log them, so external behavior on contention is unchanged.

Type of change
Tests
C++ unit tests (`tests/cpp/test_file_lock.cpp`): 12 cases covering ctor, exclusive/shared contention, mixed-mode contention, timeout edges, RAII release, move semantics, no-unlink-on-release, open-failure throw, and a same-namespace `flock(2)` interop check that verifies the C++ primitive conflicts with raw `flock` locks (the operation `filelock` uses).

Python E2E (`tests/py/dynamo/runtime/test_000_runtime_cache.py`):

- Parameterizes `test_filelock_works` and `test_sequential_save_load` over both runtimes.
- `test_python_lock_blocks_cpp_save`: an externally-held `filelock` causes the C++ save to time out silently (the noexcept member catches), the cache is unmodified while the lock is held, and a fresh save after release succeeds.
- `test_filelock_cross_runtime_parallel`: two subprocesses (one Python-runtime, one C++-runtime) compile against a shared `runtime_cache_path` and both succeed without deadlock or corruption. Subprocesses rather than threads because `torch.export` has thread-unsafe TLS, but cross-process is the real-world locking scenario anyway.

Local runs (RTX, A100):

- `bazel test //tests/cpp:test_file_lock`: 12 passed, 353ms total
- `pytest tests/py/dynamo/runtime/test_000_runtime_cache.py`: 21 passed, 2 skipped (non-RTX), 0 failed

Note on base branch
This PR is opened against `feat/trtrtx-cpp-runtime` rather than `main` because the parent C++-runtime port has not landed yet. Once that merges, this branch will be rebased onto `main` before going to upstream review.

Checklist
Commits:
- `feat(runtime): file-lock the TRT-RTX runtime cache for cross-runtime safety`: the feature.
- `style(bazel): buildifier sweep on touched BUILD files`: drive-by buildifier reformatting on pre-existing BUILD code, kept separate so the functional diff is easy to review.
- `docs(file_lock): correct package name and rename "wire protocol" to "lock-file convention"`: comment-only fix. The Python library is `filelock` (tox-dev/filelock), not py-filelock, and "wire protocol" is reworded to "lock-file convention".