feat(runtime): add TensorRT-RTX runtime cache, dynamic shapes strategy, and native CUDA graph support to C++ runtime #4202
Draft
tp5uiuc wants to merge 9 commits into pytorch:main from
Conversation
tp5uiuc
commented
Apr 21, 2026
Lay the shared infrastructure used by three upcoming TensorRT-RTX-only
runtime features (runtime cache, dynamic shapes kernel specialization
strategy, native CUDA graph strategy) in the C++ runtime path.
Core changes
- Bump ABI_VERSION from "8" to "9" and add three new SerializedInfoIndex
entries (RUNTIME_CACHE_PATH_IDX, DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX,
CUDA_GRAPH_STRATEGY_IDX). One bump covers all three feature fields.
- Add an IRuntimeConfig + IRuntimeCache shared_ptr pair to TRTEngine
behind TRT_MAJOR_RTX, plus three plain string/int fields that remain
serializable on non-RTX builds so the ABI is stable across both.
- Extract a private recreate_execution_context() helper that is the
single site where exec_ctx is built. On RTX builds it creates (once)
the IRuntimeConfig, invokes per-feature appliers, and then creates
the execution context via createExecutionContext(IRuntimeConfig*).
Replaces four prior direct createExecutionContext call sites in the
constructor, disable_profiling, set_device_memory_budget, and
set_resource_allocation_strategy so each automatically inherits the
runtime-config path on RTX.
- Declare apply_runtime_cache / apply_dynamic_shapes_kernel_strategy /
apply_cuda_graph_strategy as private RTX-only helpers with empty
bodies; follow-up commits fill these in per feature. The empty
stubs keep this commit behavior-neutral.
- Extend TRTEngine::serialize, the deserialization constructor, the
__obj_flatten__ tuple, and to_str so the new fields round-trip.
- Expose RUNTIME_CACHE_PATH_IDX, DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX,
and CUDA_GRAPH_STRATEGY_IDX via torch.ops.tensorrt.
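The single-site pattern above can be sketched in Python; `TRTEngineSketch`, `FakeEngine`, and the method names on the engine stub are hypothetical stand-ins for the nvinfer1 types, not the actual Torch-TensorRT API:

```python
# Illustrative sketch of the "single site where exec_ctx is built" pattern.
# All names here are hypothetical stand-ins, not the real C++ API.

class TRTEngineSketch:
    def __init__(self, engine, is_rtx: bool):
        self.engine = engine
        self.is_rtx = is_rtx
        self.runtime_config = None  # built lazily, exactly once, on RTX
        self.exec_ctx = None
        self.recreate_execution_context()

    def recreate_execution_context(self):
        # The only place exec_ctx is (re)built, so the constructor,
        # disable_profiling, set_device_memory_budget, and
        # set_resource_allocation_strategy all inherit the config path.
        if self.is_rtx:
            if self.runtime_config is None:
                self.runtime_config = self.engine.create_runtime_config()
                # Per-feature appliers: empty stubs in the scaffolding commit.
                self.apply_runtime_cache()
                self.apply_dynamic_shapes_kernel_strategy()
                self.apply_cuda_graph_strategy()
            self.exec_ctx = self.engine.create_execution_context(self.runtime_config)
        else:
            self.exec_ctx = self.engine.create_execution_context(None)

    # Behavior-neutral stubs, filled in by the follow-up feature commits.
    def apply_runtime_cache(self):
        pass

    def apply_dynamic_shapes_kernel_strategy(self):
        pass

    def apply_cuda_graph_strategy(self):
        pass


class FakeEngine:
    """Minimal stand-in so the sketch runs without TensorRT."""

    def create_runtime_config(self):
        return object()

    def create_execution_context(self, config):
        return ("ctx", config)
```

Recreating the context (e.g. after an allocator change) reuses the already-built config rather than constructing a second one.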
Python side
- Add dynamic_shapes_kernel_specialization_strategy ("lazy" default)
and cuda_graph_strategy ("disabled" default) to _defaults.py,
CompilationSettings, and the three compile() entry points.
- Thread runtime_cache_path, dynamic_shapes_kernel_specialization_strategy,
and cuda_graph_strategy through _TorchTensorRTModule._pack_engine_info
with string-to-int maps so the C++ engine sees validated integer codes
(0/1/2 for strategies) and raises ValueError for unknown strings.
No behavior change yet: the RTX appliers are empty and all new strategy
defaults select the prior code paths.
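The string-to-int validation described above can be sketched as a minimal map-based helper; `pack_strategy_codes` and the exact map contents are illustrative (the 0/1/2 kernel-strategy codes follow the commit text; the 0/1 CUDA-graph codes are an assumption):

```python
# Illustrative sketch of _pack_engine_info-style strategy validation.
# The helper name and the CUDA-graph code values are hypothetical.
_DYNAMIC_SHAPES_KERNEL_STRATEGY = {"lazy": 0, "eager": 1, "none": 2}
_CUDA_GRAPH_STRATEGY = {"disabled": 0, "whole_graph_capture": 1}


def pack_strategy_codes(kernel_strategy: str, cuda_graph_strategy: str):
    """Validate strategy names; return the integer codes the engine expects."""
    try:
        ks = _DYNAMIC_SHAPES_KERNEL_STRATEGY[kernel_strategy]
    except KeyError:
        raise ValueError(
            f"Unknown dynamic shapes kernel strategy {kernel_strategy!r}; "
            f"expected one of {sorted(_DYNAMIC_SHAPES_KERNEL_STRATEGY)}"
        ) from None
    try:
        cg = _CUDA_GRAPH_STRATEGY[cuda_graph_strategy]
    except KeyError:
        raise ValueError(
            f"Unknown CUDA graph strategy {cuda_graph_strategy!r}; "
            f"expected one of {sorted(_CUDA_GRAPH_STRATEGY)}"
        ) from None
    return ks, cg
```

Validating at packing time means typos surface as a Python ValueError before anything is serialized toward the C++ engine.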
Implement TensorRT-RTX runtime cache persistence in the C++ runtime path
(TorchTensorRTModule / TRTEngine). Mirrors the Python-runtime feature
landed in pytorch#4180.
What
- apply_runtime_cache() (no-op stub from the prior commit) now creates an
IRuntimeCache from the IRuntimeConfig, loads any existing cache file from
the configured path, and attaches the cache to the config via
IRuntimeConfig::setRuntimeCache (taken by const reference).
- load_runtime_cache() reads the cache under an advisory shared lock
(flock LOCK_SH) on POSIX. Concurrent readers coexist; transient failures
downgrade to warnings so inference never blocks on cache IO.
- save_runtime_cache() writes the serialized cache atomically via
tmp-file + rename under an exclusive lock (flock LOCK_EX). The write path
creates intermediate directories as needed. On Windows the save falls
back to a best-effort write without advisory locking and emits a warning;
LockFileEx support is a follow-up.
- ~TRTEngine() now invokes save_runtime_cache() before tearing down the
cache, config, and execution context so JIT compilation results survive
process exits.
Why
- TensorRT-RTX JIT-compiles specialized kernels at inference time. The
runtime cache lets those compilations persist across runs and across
processes, which was measured at ~8x warm-vs-cold speedup in the
Python-runtime implementation.
- Without this commit, users relying on the C++ runtime (TorchScript
deployments, use_python_runtime=False) would have no way to retain JIT
work and would pay the cold-start cost on every process start.
Tests
- tests/py/dynamo/runtime/test_000_runtime_cache_cpp.py exercises the C++
runtime path (use_python_runtime=False) with cache save on destructor,
directory creation, warm-cache roundtrip correctness via
cosine-similarity, and ABI/index registration.
…++ runtime
Wire the dynamic_shapes_kernel_specialization_strategy compile setting
into the C++ runtime path on TensorRT-RTX by filling in the
apply_dynamic_shapes_kernel_strategy() body introduced in the
scaffolding commit.
What
- apply_dynamic_shapes_kernel_strategy() now calls
IRuntimeConfig::setDynamicShapesKernelSpecializationStrategy with
the integer code (0=lazy, 1=eager, 2=none) that was validated at
engine construction.
- The setting is applied once when the IRuntimeConfig is first built
inside recreate_execution_context(); the value is serialized with
the engine so deserialized modules restore the same strategy.
Why
- "lazy" (the default) compiles specialized kernels in the background
and uses generic fallbacks until they are ready - good for first-call
latency, at the cost of a fallback phase before steady-state
throughput is reached.
- "eager" compiles the specialized kernel synchronously on first use,
blocking inference but eliminating the fallback phase.
- "none" disables kernel specialization entirely and always uses the
generic fallback. Useful in combination with outer CUDA graph
capture where a stable set of kernels is required.
Tests
- tests/py/dynamo/runtime/test_000_dynamic_shapes_kernel_strategy.py
validates the setting default, the full {lazy, eager, none}
matrix through the C++ runtime (use_python_runtime=False), dynamic
shape traversal under "eager", and ValueError rejection of unknown
strategy names at engine-packing time.
…time
Wire cuda_graph_strategy into the C++ runtime and make the execute_engine
CUDA graph path TensorRT-RTX-aware. Fills in the apply_cuda_graph_strategy
stub and adds coexistence handling for outer whole-graph capture.
What
- apply_cuda_graph_strategy() now calls IRuntimeConfig::setCudaGraphStrategy
with either kDISABLED (default) or kWHOLE_GRAPH_CAPTURE. On RTX this
hands capture/replay off to the TRT-RTX runtime, avoiding the lazy-kernel
and dynamic-shape hazards of wrapping enqueueV3 in at::cuda::CUDAGraph.
- is_monolithic_capturable(stream) returns whether an engine can safely
be captured by an outer torch.cuda.CUDAGraph: RTX builds check
IExecutionContext::isStreamCapturable and require a non-lazy kernel
strategy; non-RTX builds always return true.
- disable_rtx_native_cudagraphs() is a one-shot switch that turns off
the engine internal capture and recreates the execution context so
that outer stream captures contain the kernel launches directly.
- execute_engine.cpp now computes effective_cudagraphs. On RTX, if a
cuda_graph_strategy is set or SUBGRAPH cudagraphs is enabled, it
bypasses the manual at::cuda::CUDAGraph path (the TRT-RTX runtime
handles that inside enqueueV3). It also polls cudaStreamIsCapturing
on the engine stream and, if an outer capture is already running,
invokes disable_rtx_native_cudagraphs() so the outer capture proceeds
without collision.
Why
- On TRT-RTX, the manual at::cuda::CUDAGraph wrapper around enqueueV3
can freeze fallback kernels in the captured graph (kLAZY specialization
would swap them later), and fails outright when the engine needs
runtime allocation, DDS, control flow, or weight streaming.
- Letting the TRT-RTX runtime own capture fixes both problems, and the
outer-capture detection keeps the feature compatible with the
existing CudaGraphsTorchTensorRTModule whole-graph wrapper without
requiring it to know anything about RTX internals.
Tests
- tests/py/dynamo/runtime/test_000_cuda_graph_strategy.py validates the
setting default, both {disabled, whole_graph_capture} through the
C++ runtime, the RTX-native override when set_cudagraphs_mode(True)
is combined with a strategy, repeated inference correctness, and
ValueError rejection of unknown strategy names.
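The capture-path decision in execute_engine.cpp can be modeled as a small pure function; the function name and boolean inputs are illustrative, and the real code additionally polls cudaStreamIsCapturing and may call disable_rtx_native_cudagraphs():

```python
# Models the execute_engine "effective cudagraphs" decision as a pure
# function. All names are illustrative stand-ins for the C++ logic.

def use_manual_cudagraph_wrapper(
    is_rtx: bool,
    rtx_strategy_enabled: bool,  # cuda_graph_strategy != disabled
    subgraph_cudagraphs: bool,   # SUBGRAPH cudagraphs mode enabled
) -> bool:
    """Return True when the manual at::cuda::CUDAGraph path should run."""
    if is_rtx and (rtx_strategy_enabled or subgraph_cudagraphs):
        # TRT-RTX owns capture/replay inside enqueueV3; skip the wrapper.
        return False
    return subgraph_cudagraphs
```

On non-RTX builds the manual wrapper is used exactly when SUBGRAPH cudagraphs is on; on RTX any graph-related setting routes capture to the TRT-RTX runtime instead.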
Force-pushed 37ba9f5 to 2b630e8
Address the structural PR feedback by extracting TensorRT-RTX-specific
IRuntimeConfig state into its own type and collapsing the per-feature
appliers that previously scattered `#ifdef TRT_MAJOR_RTX` through
TRTEngine.
What
- New core/runtime/TRTRuntimeConfig.{h,cpp} owns the IRuntimeConfig
shared_ptr plus (on TRT-RTX) the IRuntimeCache, runtime-cache path,
dynamic shapes kernel strategy, CUDA graph strategy, and the
rtx_native_cudagraphs_disabled one-shot flag. All per-feature
appliers live there as public members and are no-ops on non-RTX
builds, keeping the only `#ifdef TRT_MAJOR_RTX` scatter contained
in this new file.
- Strategy fields are now strongly-typed enums
(`DynamicShapesKernelStrategy`, `CudaGraphStrategyOption`) with
matching `to_string`/`to_int` helpers, validated at engine
construction via `to_dynamic_shapes_kernel_strategy` /
`to_cuda_graph_strategy_option` rather than raw int ranges.
- `TRTEngine::recreate_execution_context` is now backend-agnostic:
it calls `runtime_cfg.ensure_initialized`, applies the allocation
strategy, and creates the execution context via
`createExecutionContext(IRuntimeConfig*)`. Both standard TensorRT
and TRT-RTX go through this uniform path; only the three RTX-only
setters (`setRuntimeCache`, `setDynamicShapesKernelSpecializationStrategy`,
`setCudaGraphStrategy`) stay behind an
`#ifdef TRT_MAJOR_RTX` guard inside the struct.
- `~TRTEngine` now wraps cleanup in try/catch and delegates cache
persistence to `TRTRuntimeConfig::save_runtime_cache_nothrow`, so
stack unwinding can no longer propagate a cache-save failure out
of the destructor.
- `save_runtime_cache_nothrow` uses `std::filesystem` + atomic
`tmp+rename` only; file locking is out of scope for this PR and
will be introduced in a follow-up once we pick a portable
mechanism.
- `is_monolithic_capturable` asserts `exec_ctx` is non-null; the
three RTX-only appliers `TORCHTRT_ASSERT` that `config` is live
before dereferencing.
- `disable_rtx_native_cudagraphs` persists the runtime cache before
flipping the strategy so any kernels compiled under the internal
capture survive to the next reload.
- `TRTEngine::to_str` now emits human-readable strategy names (via
`to_string(enum)`) instead of integer codes.
- New serialization indices (`RUNTIME_CACHE_PATH_IDX`,
`DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX`, `CUDA_GRAPH_STRATEGY_IDX`) are now
`#ifdef TRT_MAJOR_RTX`-gated in runtime.h, register_jit_hooks.cpp,
the FlattenedState tuple, the serialize/deserialize constructors,
and `__obj_flatten__`. Standard TRT builds keep `SERIALIZATION_LEN == 11`
so engines serialized there do not carry RTX-only slots.
- Python `_TorchTensorRTModule` reads the RTX-only index accessors
and writes the RTX-only engine-info slots only when
`ENABLED_FEATURES.tensorrt_rtx` is true. Standard TRT users see
no new behavior at runtime.
- Deduplicated `_compiler.py` arguments after rebase on upstream
main where PR pytorch#4184 had already added
`dynamic_shapes_kernel_specialization_strategy`. Kept one copy of
each arg; `cuda_graph_strategy` is threaded through all three
compile() entry points.
Build + tests
- RTX build on A100 / L40S: libtorchtrt.so and libtorchtrt_runtime.so
link clean, no `#ifdef` diagnostics. Pre-commit checks pass
(clang-format, black, isort, ruff, mypy, typos, buildifier).
- All 35 runtime-cache/strategy tests pass; regression across
test_000_runtime_cache.py (Python runtime), test_002_cudagraphs_cpp.py,
test_005_dynamic_allocation.py is green.
Addresses review comments on PR pytorch#4202:
- Guarding of new IDX entries and Python accessors on
TRT_MAJOR_RTX / ENABLED_FEATURES.tensorrt_rtx.
- Encapsulation of RTX-specific state in a dedicated type with
enumerated strategies and transparent standard-TRT/RTX behavior.
- Destructor exception safety.
- Unification of the execution-context creation path via
IRuntimeConfig.
- Removal of file locking for runtime-cache persistence.
- Debug asserts before dereferencing the live IRuntimeConfig.
- Human-readable to_str output.
- save_runtime_cache invoked from disable_rtx_native_cudagraphs.
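A rough Python analogue of the strongly-typed enum + conversion helpers described above (the real types are C++ `DynamicShapesKernelStrategy` / `CudaGraphStrategyOption`; this IntEnum sketch is illustrative only):

```python
# Illustrative IntEnum analogue of the C++ strategy enum and its
# validated conversion / to_string helpers.
from enum import IntEnum


class DynamicShapesKernelStrategy(IntEnum):
    LAZY = 0
    EAGER = 1
    NONE = 2


def to_dynamic_shapes_kernel_strategy(code: int) -> DynamicShapesKernelStrategy:
    try:
        return DynamicShapesKernelStrategy(code)
    except ValueError:
        # Mirrors the TORCHTRT_CHECK on an unexpected integer value.
        raise ValueError(f"Invalid dynamic shapes kernel strategy code: {code}")


def to_string(strategy: DynamicShapesKernelStrategy) -> str:
    # Human-readable name for to_str output instead of an integer code.
    return strategy.name.lower()
```

The point of the typed enum is that serialization boundaries carry a validated value, and to_str can print "lazy"/"eager"/"none" rather than 0/1/2.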
Address PR review comments that asked the new C++ runtime tests be folded
into existing feature-level files rather than shipped as parallel
`*_cpp.py` files.
What
- Merge `test_000_runtime_cache_cpp.py` into the existing
`test_000_runtime_cache.py`. The file already covered the Python runtime
path; two new classes (`TestRuntimeCacheCppPersistence`,
`TestCppSerializationIndices`) cover the C++ runtime path via
`use_python_runtime=False`, and the serialization-index assertions. Skip
on non-RTX builds.
- Fold the C++ runtime cases for dynamic shapes kernel specialization
strategy into `test_001_dynamic_shapes_kernel_strategy.py` (introduced
upstream in PR pytorch#4184). Two new classes
(`TestDynamicShapesKernelStrategyCpp`,
`TestDynamicShapesKernelStrategyCppInvalidValue`) exercise
lazy/eager/none end-to-end and reject invalid strategy names. The
pre-existing Python runtime tests remain untouched.
- Rename `test_000_cuda_graph_strategy.py` to
`test_001_cuda_graph_strategy.py` to match the `test_001_*` convention
used for L1 RTX-only features. When upstream lands the Python runtime
counterpart (PR pytorch#4187), both sets fold into the same file.
- Add model-level tests: `test_runtime_cache_models.py` gains a
`TestRuntimeCacheCppModels` class exercising ResNet18 through the C++
runtime with warm-cache roundtrip.
`test_dynamic_shapes_kernel_strategy_models.py` gains
`TestDynamicShapesKernelStrategyCppModels` covering lazy/eager/none on
ResNet18 via the C++ runtime.
Verified
- 35 passed / 3 skipped in the runtime/ tests (merged file plus test_001
strategy files).
- No regression in test_002_cudagraphs_cpp.py (8 passed) or
test_005_dynamic_allocation.py (1 passed).
Addresses PR pytorch#4202 review comments asking for test file merges and
the addition of model-level runtime_cache_models.py /
dynamic_shapes_kernel_strategy_models.py coverage.
Follow-up to 54f9ccd / 1fa8c82 addressing the second batch of PR
pytorch#4202 review feedback. Pure refactor with no user-visible behavior
change; all tests green on A100 (35 passed / 3 skipped + 9 regression
passed).
TRTEngine
- Constructor signature simplified: three separate `runtime_cache_path` /
`dynamic_shapes_kernel_strategy` / `cuda_graph_strategy` parameters
collapsed into a single `TRTRuntimeConfig runtime_cfg` sink parameter.
The forwarding ctor std::moves it into the primary ctor, which std::moves
it into the member.
- String sink parameters (mod_name, serialized_engine,
serialized_metadata) taken by value and moved into members / slugify.
- Deserialization constructor routes through the new free function
make_runtime_config_from_serialized, which internalizes the
TRT_MAJOR_RTX-gated index reads so the constructor itself stays
unguarded.
- FlattenedState uses a single TRTRTX_FLATTENED_STATE_EXTRAS macro for
the three RTX-only tuple entries instead of duplicating the first eleven
entries across two branches.
- Destructor restored to the pre-refactor structure:
torch::cuda::synchronize runs outside a try block and
runtime_cfg.save_runtime_cache (now noexcept by signature) is called
directly. Exception safety is guaranteed by the member's type, not by a
defensive try/catch.
- __obj_flatten__ and serialize cast enum values via
std::underlying_type_t<...> instead of int so serialization stays in
lockstep with any future underlying-type change on the enums.
TRTRuntimeConfig
- Conversion helpers take std::underlying_type_t<Enum> (the declared
32-bit integer type) instead of raw int. Callers at serialization
boundaries explicitly std::stoi / static_cast into the right type.
- [[nodiscard]] added to to_string, to_dynamic_shapes_kernel_strategy,
to_cuda_graph_strategy_option, uses_internal_capture,
is_monolithic_capturable, to_str, and
make_runtime_config_from_serialized.
- to_string default cases now TORCHTRT_CHECK(false, ...) with the
unexpected integer value; std::unreachable is C++23.
- set_execution_context_allocation_strategy is now const.
- Cache I/O split into two layers:
  - Free functions load_runtime_cache(path, cache) and
  save_runtime_cache(path, cache) perform the raw std::filesystem I/O and
  use TORCHTRT_CHECK on failure -- exception-propagating, easier to test
  in isolation.
  - Member TRTRuntimeConfig::save_runtime_cache() is a noexcept wrapper
  that calls the free function and swallows exceptions via try/catch --
  safe from a destructor. The _nothrow suffix is dropped from the member
  name (the signature now carries that contract).
- write_to_str(ostream&) replaced by two functions: a const-correct
to_str() -> std::string, and a free operator<<(ostream&, const
TRTRuntimeConfig&) that wraps it with "Runtime cfg { ... }" delimiters.
TRTEngine::to_str streams the config via the free operator.
Python
- _settings.py: removed a duplicated
dynamic_shapes_kernel_specialization_strategy field and its duplicated
docstring left over from the upstream rebase of PR pytorch#4184 into our
changes.
Covers review comments 3126538200, 3126541782, 3126547529, 3126549147,
3126682329, 3126683329, 3126693226, 3126715369, 3126725953, 3126736626,
3126738422, 3126745230, 3126747553, 3126749405, 3126764831, 3126772536,
3126786564, 3126803652, 3126816780, 3126818065, 3126818561, 3126819429,
3126823781, 3126840987, 3126846827.
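The two-layer cache I/O split can be sketched in Python as a raw, exception-propagating function plus a never-throwing wrapper; the names and the boolean return are illustrative (the C++ member returns void and logs a warning):

```python
# Sketch of the two-layer split: a raw layer that propagates errors and
# a destructor-safe wrapper that swallows them. Names are illustrative.
import os
import tempfile


def save_runtime_cache_raw(path: str, payload: bytes) -> None:
    """Exception-propagating layer: easy to test in isolation."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    os.replace(tmp, path)  # atomic tmp + rename


def save_runtime_cache(path: str, payload: bytes) -> bool:
    """noexcept-style wrapper: safe to call from teardown; never raises."""
    try:
        save_runtime_cache_raw(path, payload)
        return True
    except Exception:
        return False  # the real implementation logs a warning here
```

Keeping the raw layer separate is what makes the failure modes unit-testable while the wrapper keeps destructors exception-safe.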
…tion
Follow-up to a4989c7 addressing the second batch of comments on PR
pytorch#4202 plus verification that the non-RTX (standard TensorRT) build
path still compiles and tests correctly skip RTX-only suites.
Reviewer feedback
- FlattenedState: the TRTRTX_FLATTENED_STATE_EXTRAS macro is inlined
directly into the tuple parameter pack with a nested
`#ifdef TRT_MAJOR_RTX`; no preprocessor macro is introduced, per the
reviewer's "Inline and fix" note.
- TRTEngine::to_str now calls `runtime_cfg.to_str()` directly rather than
relying on the free `operator<<` framing; keeps the engine's existing
two-space indentation consistent.
- TRTRuntimeConfig free-function I/O helpers (`load_runtime_cache`,
`save_runtime_cache`) moved to an anonymous namespace inside
TRTRuntimeConfig.cpp and removed from the public header; the member
wrapper `TRTRuntimeConfig::save_runtime_cache()` stays in the header
(noexcept, catches exceptions from the raw helper). Renamed the internal
free save helper to `save_runtime_cache_impl` to avoid clashing with the
member of the same name.
- Enum conversion helpers `to_string(...)` /
`to_dynamic_shapes_kernel_strategy` / `to_cuda_graph_strategy_option`
moved to an anonymous namespace in the cpp; nothing outside this
translation unit needs them now that TRTEngine holds a TRTRuntimeConfig
directly.
- Replaced the `(void)param;` suppression pattern with `TORCHTRT_UNUSED`
on the parameter declaration in five places.
- Removed the nested `defined(ENABLE_FEATURE_DISABLE_RUNTIME_ALLOCATION)`
guard on `isStreamCapturable`. Instead, the Bazel rule for
`//core/runtime:runtime` now sets
`ENABLE_FEATURE_DISABLE_RUNTIME_ALLOCATION` as a local_define for the
`:rtx_win` and `:rtx_x86_64` configs so the RTX header's feature gate is
always on when we're building for RTX, matching the reviewer's invariant.
Cross-backend
- Python `_TorchTensorRTModule._pack_engine_info` now always validates
`dynamic_shapes_kernel_specialization_strategy` and `cuda_graph_strategy`
against the allowed name lists, regardless of whether the build is RTX or
standard TRT. The engine-info serialization slots are only written on
RTX, but the validation runs universally so typos surface early on any
backend.
Build + test
- RTX A100: 35 passed / 3 skipped on new + merged suites; 9 passed
regression (test_002_cudagraphs_cpp.py + test_005_dynamic_allocation.py).
Wheel `torch_tensorrt_rtx-2.12.0.dev0+a4989c760`.
- Standard TRT A100: wheel `torch_tensorrt-2.12.0.dev0+a4989c760` builds
clean without `--use-rtx`. Import smoke shows `tensorrt_rtx=False`,
`SERIALIZATION_LEN=11`. 7 passed / 31 skipped (all skips with clean
"Runtime cache is only available with TensorRT-RTX" / "CUDA graph
strategy is a TensorRT-RTX feature" messages); 9 regression passed.
Covers review comments 3126975981, 3127004055, 3127028393, 3127038410,
3127076231, and 3127100282.
Follow-up to 612556b addressing the latest batch of comments on
pytorch/TensorRT PR pytorch#4202. Two categories of changes:
Reviewer-suggested C++ simplifications (TRTRuntimeConfig.cpp)
- load_runtime_cache: inlined the deserialize() call directly into
TORCHTRT_CHECK instead of going through an intermediate bool.
- ensure_initialized / setRuntimeCache: flipped the if/else so the
success branch comes first and the warning + reset lands in the else,
matching the reviewer's diff suggestion.
- ensure_initialized / setCudaGraphStrategy: inlined the call into the
if-condition and dropped the intermediate `bool ok` local.
- disable_rtx_native_cudagraphs: same shape fix for the disable-path
setCudaGraphStrategy call.
Runtime cache durability (TRTEngine.cpp)
- recreate_execution_context now flushes the runtime cache before
rebuilding the IExecutionContext. The destructor already saves at
teardown, but recreate can happen mid-lifetime around profiling toggles
and allocator changes; without flushing there, a process kill between an
allocator flip and teardown would lose any kernels compiled during the
previous context. No-op on standard TensorRT and when no cache path is
configured.
Test deduplication
(tests/py/dynamo/**/test_*{runtime_cache,dynamic_shapes_kernel_strategy}*.py)
The reviewer asked to stop copy-pasting bodies between the Python- and
C++-runtime test classes. The persistence, model, and dynamic-shape
suites now share one parameterized body that runs on both runtimes:
- test_000_runtime_cache.py: TestRuntimeCachePersistence holds the single
body; parameterized.expand(_RUNTIMES) fans out over ("python", True) and
("cpp", False). The CppPersistence class, its helpers, and CppSimpleModel
are gone; a shared ConvModel with seeded init drives both paths. The C++
parameter skips itself via self.skipTest when torch_tensorrt_runtime is
off.
- test_001_dynamic_shapes_kernel_strategy.py: the lazy/eager/none test
trio in TestDynamicShapesKernelStrategyCpp collapses into a single
parameterized test_strategy_inference. Same parameter sweep on
TestDynamicShapesKernelStrategySetup.test_strategy_applied.
- test_runtime_cache_models.py: TestRuntimeCacheModels,
TestRuntimeCacheDynamicShapes, and TestRuntimeCachePerformance are
parameterized over (runtime, use_python_runtime); the Cpp* sibling class
is removed.
- test_dynamic_shapes_kernel_strategy_models.py: one parameter product
(strategy × runtime) drives both the resnet18 and dynamic-batch tests;
the Cpp* sibling class is removed.
Net: ~200 fewer lines of test code, same coverage, plus symmetry between
Python- and C++-runtime test execution.
Build + verification
- RTX A100 (ipp1-2162, cuda13.0 dev container), wheel
torch_tensorrt_rtx-2.12.0.dev0+612556ba0.
- runtime/test_000_runtime_cache.py +
runtime/test_001_dynamic_shapes_kernel_strategy.py +
runtime/test_001_cuda_graph_strategy.py: 36 passed / 3 skipped (up from
35 pre-dedup — the param expansion picks up one extra per-runtime variant
on the strategy applied test).
- runtime/test_005_dynamic_allocation.py +
runtime/test_002_cudagraphs_cpp.py: 9 passed (regression clean).
- Model-level subset (resnet18 + dynamic-batch sweep across both runtimes
and all three strategies): 10 passed.
- Dedicated C++-runtime verification script confirms that
use_python_runtime=False produces TorchTensorRTModule (not
PythonTorchTensorRTModule), and that the runtime cache is populated and
flushed through the C++ path (file size > 0 on engine destruction).
Covers review comments 3128480385, 3128493651, 3128747920, 3128754155,
3128759096, and 3128764510.
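The shared parameterized body can be sketched as follows; the real suite uses parameterized.expand(_RUNTIMES) over compiled modules, while this dependency-free sketch uses unittest.subTest and a stubbed roundtrip, so the class and model names are illustrative:

```python
# Sketch of one test body fanned out over both runtimes, instead of
# sibling Python-/C++-runtime classes with copy-pasted bodies.
import unittest

_RUNTIMES = [("python", True), ("cpp", False)]


class TestRuntimeCachePersistenceSketch(unittest.TestCase):
    def _run_roundtrip(self, use_python_runtime: bool) -> bool:
        # Stand-in for compile + warm-cache roundtrip on the chosen
        # runtime; the real body would skip via self.skipTest when the
        # C++ runtime library is unavailable.
        return True

    def test_roundtrip_both_runtimes(self):
        for name, use_python_runtime in _RUNTIMES:
            with self.subTest(runtime=name):
                self.assertTrue(self._run_roundtrip(use_python_runtime))
```

One body, two parameterizations: the coverage matrix stays identical while the duplicated sibling classes disappear.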
Description
Extends three TensorRT-RTX runtime features that landed on the Python
runtime (PythonTorchTensorRTModule) to the C++ runtime path
(TorchTensorRTModule → core/runtime/TRTEngine). All three features center
on nvinfer1::IRuntimeConfig, which the C++ runtime previously did not use
— it called createExecutionContext(...) directly at four sites.
Features ported in this stack:
Without this PR, users on the C++ runtime path (TorchScript deployments,
use_python_runtime=False) cannot access any of these TRT-RTX features,
and runtime-cache warm-start savings (~8× measured in #4180) are
unavailable on that path.
Commits
Original stack (reviewable independently):
- feat(runtime): introduce IRuntimeConfig scaffolding and bump ABI to v9
— shared infra. Adds the IRuntimeConfig/IRuntimeCache members (RTX-only),
a private recreate_execution_context() helper replacing 4 direct
createExecutionContext call sites, three new SerializedInfoIndex entries
(RUNTIME_CACHE_PATH_IDX, DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX,
CUDA_GRAPH_STRATEGY_IDX) with a single ABI bump 8 → 9. Python settings +
compile() parameter threading. No behavior change — the per-feature
apply_* appliers are empty stubs filled in by subsequent commits.
- feat(runtime): add runtime cache to C++ runtime for TensorRT-RTX —
mirror of "feat: add runtime cache API for TensorRT-RTX" #4180. Load on
engine setup, atomic save on destructor (tmp + rename).
- feat(runtime): add dynamic shapes kernel specialization strategy to C++
runtime — mirror of "feat: add dynamic shapes kernel specialization
strategy for TRT-RTX" #4184. Wires
IRuntimeConfig::setDynamicShapesKernelSpecializationStrategy.
- feat(runtime): add TensorRT-RTX native CUDA graph strategy to C++
runtime — mirror of "feat: add TRT-RTX native CUDA graph support" #4187.
Wires IRuntimeConfig::setCudaGraphStrategy and makes execute_engine.cpp
TensorRT-RTX-aware: bypasses manual at::cuda::CUDAGraph capture on RTX
(TRT-RTX handles it internally) and uses cudaStreamIsCapturing to detect
outer whole-graph capture.
- test: consolidate C++ runtime tests — folds C++ runtime cases into the
existing test_000_runtime_cache.py and
test_001_dynamic_shapes_kernel_strategy.py; adds model-level coverage in
test_runtime_cache_models.py and
test_dynamic_shapes_kernel_strategy_models.py.
Review-response commits (on top of the feature stack):
- refactor(runtime): extract TRTRuntimeConfig — moves all
TensorRT-RTX-specific IRuntimeConfig state into
core/runtime/TRTRuntimeConfig.{h,cpp}, collapses three separate
constructor args into a single TRTRuntimeConfig sink, uses enums
(DynamicShapesKernelStrategy, CudaGraphStrategyOption), makes the
destructor exception-safe, and contains the #ifdef TRT_MAJOR_RTX scatter
to the new TU.
- refactor(runtime): third-round review polish + cross-backend
verification — inlines the FlattenedState macro, moves enum helpers +
cache I/O into an anonymous namespace in the cpp, replaces (void)param;
with TORCHTRT_UNUSED, uses std::underlying_type_t throughout, adds
[[nodiscard]], adds ENABLE_FEATURE_DISABLE_RUNTIME_ALLOCATION as a
local_define for the RTX Bazel configs so the runtime-allocation feature
gate matches the RTX header's expectation.
Why bundle three features in one PR
All three features require an IRuntimeConfig on the engine, a single ABI
bump, and extensions to the same serialization/deserialization code
paths. Splitting into three independent PRs would trigger three
consecutive ABI bumps and triple the surface area for backward-compat
fallout. Keeping them in one stack keeps ABI changes atomic while still
giving reviewers clean per-feature diffs.
Type of change
- ABI bump "8" to "9" — old .pt/.ep files targeting the C++ runtime will
fail verify_serialization_fmt with a clear error, as with every prior ABI
bump.
- New defaults ("lazy", "disabled") keep existing behavior; existing docs
for the Python-runtime runtime cache already cover the concept.
Checklist
- New settings live in CompilationSettings; full user-guide updates
pending the feature PRs on the Python path.
- Tests: test_000_runtime_cache.py,
test_001_dynamic_shapes_kernel_strategy.py,
test_001_cuda_graph_strategy.py; model-level coverage in
test_runtime_cache_models.py,
test_dynamic_shapes_kernel_strategy_models.py.
Test plan
Cross-backend verification on an A100 (ipp1 node) inside the
main-native-x86_64-ubuntu24.04-cuda13.0 dev container, PyTorch nightly
2.13.0.dev20260420, CUDA 13.0.
TensorRT-RTX build
Wheel: torch_tensorrt_rtx-2.12.0.dev0+612556ba0 (built with
python3 setup.py bdist_wheel --use-rtx, TensorRT-RTX 1.4.0.76).
Import smoke:
- ENABLED_FEATURES.tensorrt_rtx = True
- ABI_VERSION = 9
- SERIALIZATION_LEN = 14
- RUNTIME_CACHE_PATH_IDX = 11, DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX = 12,
CUDA_GRAPH_STRATEGY_IDX = 13
Breakdown of the 35 RTX-side passes:
- Runtime cache: setup (TestRuntimeCacheSetup), persistence and
warm-cache roundtrip on both Python and C++ runtimes
(TestRuntimeCachePersistence, TestRuntimeCacheCppPersistence),
concurrency/filelock (TestRuntimeCacheConcurrency), timing-cache skip
(TestTimingCacheSkipped), and serialization-index registration
(TestCppSerializationIndices).
- Dynamic shapes kernel strategy: setup
(TestDynamicShapesKernelStrategySetup), the full {lazy, eager, none}
matrix end-to-end through both runtimes
(TestDynamicShapesKernelStrategyCpp), dynamic-shape traversal, and
invalid-value rejection (TestDynamicShapesKernelStrategyCppInvalidValue).
- CUDA graph strategy: settings (TestCudaGraphStrategySettings),
{disabled, whole_graph_capture} via the C++ runtime
(TestCudaGraphStrategyCpp), the RTX-native override when
set_cudagraphs_mode(True) is combined with a strategy, repeated
inference, and invalid-value rejection (TestCudaGraphStrategyInvalidValue).
Regression:
Model-level: e2e ResNet18 compilation + inference via the C++ runtime
path with each {lazy, eager, none} strategy and with runtime cache
warm-roundtrip (added in tests/py/dynamo/models/).
Standard TensorRT build
Wheel: torch_tensorrt-2.12.0.dev0+612556ba0 (built with plain
python3 setup.py bdist_wheel, TensorRT 10.16.0).
Import smoke:
- ENABLED_FEATURES.tensorrt_rtx = False
- ABI_VERSION = 9
- SERIALIZATION_LEN = 11 (no RTX-only slots in the FlattenedState)
- register_jit_hooks.cpp
The 7 passes on standard TRT are:
- TestNonRTXUnchanged — confirms the existing Python-runtime paths are
unaffected (runtime cache / timing cache behavior, no runtime_config
member) — 2 tests.
- TestDynamicShapesKernelStrategyNonRTX::test_setting_ignored_on_non_rtx
— confirms the new dynamic_shapes_kernel_specialization_strategy setting
is silently ignored on non-RTX.
- TestCudaGraphStrategySettings::{test_default_value,
test_settable_values} — CompilationSettings accepts the new fields on any
backend.
- TestDynamicShapesKernelStrategyCppInvalidValue::test_invalid_strategy_raises
+ TestCudaGraphStrategyInvalidValue::test_invalid_strategy_raises —
unknown strategy names are rejected at _pack_engine_info time even on
non-RTX (validation is cross-platform; the engine-info serialization
slots themselves are RTX-only).
The 31 skips all carry clean messages (sample):
Regression on standard TRT:
Summary
- RTX wheel: torch_tensorrt_rtx-2.12.0.dev0+612556ba0
- Standard TRT wheel: torch_tensorrt-2.12.0.dev0+612556ba0
- 0 failures on either backend; all RTX-gated suites skip cleanly with
descriptive messages on the standard TRT build.