✨[Feature] Multiple Optimization Profiles for Disjoint Input Shape Regimes #4313

cehongwang · 2026-05-30T00:21:14Z

cehongwang
May 30, 2026
Collaborator

RFC: Multiple Optimization Profiles for Disjoint Input Shape Regimes

Status	Draft
Targets	`torch_tensorrt.dynamo` (AOT `compile` / `torch.compile` backend)
Touches	`_Input`, `_tracer`, `_TRTInterpreter`, `_TRTEngine`, `partitioning/common.py`, engine cache, runtime

1. Problem

Torch-TensorRT builds one TRT IOptimizationProfile per engine (_TRTInterpreter.__init__ line 123; placeholder() always writes optimization_profiles[0] — see TODO at line 712).

That is fine for unimodal shape distributions. It is a poor fit for bimodal workloads like LLM inference:

Phase	`input_ids`	KV cache
Prefill	`[B, 32..2048]`	static shape; valid region via indices
Decode	`[B, 1]`	same static cache

One profile spanning [1, 2048] with opt=1024 picks kernels that are wrong for both phases. TensorRT supports N profiles per engine; we should expose that.

Measured gap (Alpamayo / Edge-LLM)

Decode benchmark on umb-b200-247, nvcr.io/nvidia/pytorch:25.12-py3, TensorRT v10.14 (trtexec --dumpProfile --dumpLayerInfo --profilingVerbosity=detailed --separateProfileRun). All runs at decode shape: batch=6, inputs_embeds=6×1×4096, past_key_values=6×2×8×4096×128 (KV len 4096).

Engine	GPU Compute mean	dumpProfile total
ONNX-TRT (decode profile)	5.18 ms	7.20 ms
Torch-TRT (prefill-like `opt`, seq≈3424)	10.93 ms	11.78 ms
Torch-TRT (decode profile, `inputs_embeds` opt/max `6×1×4096`)	5.14 ms	6.96 ms

Decode-profile Torch-TRT is ~2.1× faster than the prefill-oriented build and matches ONNX-TRT decode. Layer profiles show the original Torch-TRT engine pays a large FC penalty tuned for long context; the decode-profile engine's qkv/o/up/down matmul totals align with ONNX-TRT decode — consistent with opt targeting seq=3424 while the benchmark runs seq=1.

Today this required a separate sanity-check engine with a decode-only profile. Multi-profile support would keep one engine and switch at runtime.

2. Goals / Non-goals

Goals

Declare N named profiles at compile time; one engine, one weight table.
Select the active profile at runtime: manual by default via optimization_profile(trt_gm, name); auto-select opt-in via optimization_profile(trt_gm, "auto") or compile flag when shapes alone should determine the profile.
Backward compatible: no Input.profiles → today's zero/one-profile behavior (min/opt/max_shape or EP-only).
Work with torch.export (AOT path) and torch.compile(backend="tensorrt") (JIT path).

3. Design: export once, specialize in TRT

torch.export gives each dynamic dim one contiguous [min, max]. There is no disjoint-union dim.

Chosen approach: export once using the union of all profile ranges per dim, then attach N TRT profiles at build time.

Example: prefill seq_len ∈ [32, 2048], decode seq_len = 1 → export with Dim("seq_len", min=1, max=2048).

Rejected alternatives:

Two exports / two engines — 2× memory, Python dispatch every forward (users can still do this manually).
Upstream disjoint Dims — out of scope.

The engine only accepts shapes inside a declared profile, even though export accepts the full union envelope.

3.1 Export envelope vs. profile ranges

Multi-profile relies on one torch.export over an envelope that covers every profile. Export traces once at the opt example (Input.torch_tensor = example_tensor("opt_shape") in _Input.py; _tracer.py passes that tensor to export()). Each dynamic axis gets one Dim(min, max) whose range must span all profiles the user (or we) intend to build.

Worked example (two disjoint image-size profiles):

Profile	min	opt	max
`"small"`	`[1, 3, 64, 64]`	`[1, 3, 128, 128]`	`[1, 3, 256, 256]`
`"large"`	`[1, 3, 1024, 1024]`	`[1, 3, 2048, 2048]`	`[1, 3, 4096, 4096]`

Export must declare img_dim = Dim("img_dim", min=64, max=4096) (elementwise union over profiles). TRT then gets two optimization profiles with those disjoint ranges. If the user hands us a profile whose [min, max] falls outside the export envelope, we cannot build it — reject at compile time with a clear error.
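The elementwise union can be sketched in a few lines. This is a minimal illustration rather than the planned `_tracer.py` code: `export_envelope` is a hypothetical helper, and profiles are plain dicts of shape tuples in the `Input.profiles` layout proposed in §4.1.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]
Profile = Dict[str, Shape]  # keys: "min", "opt", "max"


def export_envelope(profiles: Dict[str, Profile]) -> Tuple[Shape, Shape]:
    """Return the elementwise (min, max) shapes that cover every profile."""
    mins = [p["min"] for p in profiles.values()]
    maxs = [p["max"] for p in profiles.values()]
    lo = tuple(min(dims) for dims in zip(*mins))
    hi = tuple(max(dims) for dims in zip(*maxs))
    return lo, hi


profiles = {
    "small": {"min": (1, 3, 64, 64), "opt": (1, 3, 128, 128), "max": (1, 3, 256, 256)},
    "large": {"min": (1, 3, 1024, 1024), "opt": (1, 3, 2048, 2048), "max": (1, 3, 4096, 4096)},
}
lo, hi = export_envelope(profiles)
# Axes where lo == hi are static; each remaining axis needs one Dim(min=lo[d], max=hi[d]).
dynamic_axes = [d for d in range(len(lo)) if lo[d] != hi[d]]
print(lo, hi, dynamic_axes)
```

For the two image-size profiles this gives the range 64 to 4096 on axes 2 and 3, which is the `img_dim` envelope described above.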

Validation rules (compile time):

Check	Action
Profile `[min, max]` exceeds export `Dim` envelope	Error — profile is unreachable; user must re-export with a wider `Dim` or shrink the profile
Export envelope wider than the union of declared profiles (e.g. user exported `Dim(max=8192)` but profiles only use up to `4096`)	Allow — log at info/warn that we are narrowing the effective TRT range to the profile union (prior art: apurba's PR). The exported graph is valid over `[64, 8192]` but the engine only accepts shapes within the declared profiles
Dim specialized to a constant during export	Error when the user declared multiple profiles on that axis

Post-export, for each dynamic dim in Input.profiles, assert every profile corner ⊆ ShapeEnv.var_to_range for the corresponding export symbol. Document clearly: you (or we, when deriving the envelope from Input.profiles) must export with the union range — profiles are subsets of export, never the other way around.
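A minimal sketch of the "profile corner ⊆ export envelope" check, assuming the envelope is already available as two shape tuples. The helper name `check_profile_in_envelope` is hypothetical; a real implementation would presumably read the bounds from `ShapeEnv.var_to_range` instead.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def check_profile_in_envelope(
    name: str, profile: Dict[str, Shape], env_min: Shape, env_max: Shape
) -> None:
    """Raise ValueError if any profile corner leaves the export envelope."""
    for corner in ("min", "opt", "max"):
        for axis, value in enumerate(profile[corner]):
            if not env_min[axis] <= value <= env_max[axis]:
                raise ValueError(
                    f"profile {name!r}: {corner}[{axis}]={value} is outside the "
                    f"export range [{env_min[axis]}, {env_max[axis]}]"
                )


env_min, env_max = (1, 3, 64, 64), (1, 3, 4096, 4096)
ok = {"min": (1, 3, 64, 64), "opt": (1, 3, 128, 128), "max": (1, 3, 256, 256)}
too_big = {"min": (1, 3, 64, 64), "opt": (1, 3, 128, 128), "max": (1, 3, 8192, 8192)}

check_profile_in_envelope("small", ok, env_min, env_max)  # passes silently
try:
    check_profile_in_envelope("huge", too_big, env_min, env_max)
    rejected = False
except ValueError as err:
    rejected = True
    message = str(err)
```

The error message names the offending corner and axis so the user can tell whether to widen the exported `Dim` or shrink the profile.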

Other failure modes (export trace itself):

Risk	What happens	Mitigation
Data-dependent control flow	Python `if seq_len == 1:` specializes to the opt branch; decode path never traced	Model must be shape-polymorphic over the union (§5 static KV cache; no regime-dependent branching on token length)
0/1 specialization	Dynamo specializes dims whose traced value is 0 or 1 to static (`extract_var_range_info` also remaps `min=2 → 1`)	Declare `Dim(min=1, …)` for decode; validate post-export that the symbol was not narrowed

Optional corner check: run the exported / partitioned module at profile corners and confirm shapes match expectations. Not required for v1 if the model satisfies §5.

3.1.1 Overlapping profiles

Overlapping profiles are allowed. Two or more profiles may share part of their [min, max] envelope on the same dynamic dim. This is common in LLM workloads:

Profile	`seq_len` range	Notes
`"prefill"`	`[1, 4096]`	min=1, opt=3424, max=4096
`"decode"`	`[1, 1]`	min=opt=max=1

At runtime, input shape seq_len=1 satisfies both profiles. Overlap is a feature, not an error — it lets one engine cover adjacent regimes without forcing disjoint bounds.

Compile time: no restriction beyond §3.1 validation (each profile must still fit inside the export envelope). Overlapping [min, max] ranges on the same binding are fine.

Runtime (auto-selection enabled, §4.2): after ruling out invalid profiles per input and intersecting survivors:

1. Zero survivors → error.
2. Exactly one survivor → select it.
3. Multiple survivors → select the profile whose opt shape is closest to the actual input shapes (see below). Manual pin via optimization_profile(trt_gm, name) always overrides auto and skips this tie-break.

Distance metric: for each surviving profile p, compute a scalar distance to the current inputs. For every input binding b and every dynamic dim d where the profile defines an opt value:

dist(p) = Σ over (b, d)  | actual_shape[b][d] − opt_shape[p][b][d] |

Use the profile's declared opt tuple per binding (from compile-time Input.profiles, cached in _profile_dim_ranges alongside min/max at load). Static dims contribute 0. Pick p with minimum dist(p); break ties deterministically by lowest profile index.

Example (LLM overlap at decode): inputs_embeds shape (6, 1, 4096).

Profile	`seq` opt	`\|1 − opt\|`	Selected?
`"prefill"`	3424	3423	no
`"decode"`	1	0	yes

Auto-selection picks "decode" even though "prefill" also accepts seq=1, because decode's opt is closest to the actual shape.
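The distance metric and the deterministic tie-break can be sketched as below. `opt_distance` and `closest_opt` are hypothetical names, and opt shapes are passed as plain dicts keyed by profile index rather than read from an engine.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def opt_distance(actual: Dict[str, Shape], opt: Dict[str, Shape]) -> int:
    """Sum of |actual - opt| over every binding and dim (static dims add 0)."""
    return sum(
        abs(a - o)
        for binding, shape in actual.items()
        for a, o in zip(shape, opt[binding])
    )


def closest_opt(actual: Dict[str, Shape], survivors: Dict[int, Dict[str, Shape]]) -> int:
    """Pick the surviving profile index with minimum distance; lowest index wins ties."""
    return min(sorted(survivors), key=lambda p: opt_distance(actual, survivors[p]))


# Profile index -> opt shape per binding (0 = "prefill", 1 = "decode").
opt_shapes = {
    0: {"inputs_embeds": (6, 3424, 4096)},
    1: {"inputs_embeds": (6, 1, 4096)},
}
actual = {"inputs_embeds": (6, 1, 4096)}
print(opt_distance(actual, opt_shapes[0]))  # 3423
print(closest_opt(actual, opt_shapes))      # 1
```

Iterating over the sorted indices makes the lowest-index tie-break fall out of `min`, which returns the first minimal element it sees.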

Example (mid-prefill): inputs_embeds shape (6, 512, 4096).

Profile	`[min, max]` on `seq`	Passes containment?
`"prefill"`	`[1, 4096]`	yes
`"decode"`	`[1, 1]`	no (512 > max)

Only "prefill" survives → selected unambiguously. The closest-opt tie-break (§3.1.1 step 3) applies only among profiles that already pass the [min, max] check.

Rationale: closest-opt picks the profile TensorRT already tuned kernels for, minimizing the gap between runtime shape and the profile's specialization point. Users who need a specific profile regardless of opt distance (e.g. force decode kernels during a warm-up shape that technically fits prefill) should pin manually.

3.2 Graph breaks: propagate profiles to intermediate submodules

Partitioning can produce a mix of Torch and TRT submodules. Each TRT submodule needs the same named profiles as the top-level inputs (e.g. "decode" on every engine, including intermediate graphs). We do not re-export or re-trace per profile.

Where intermediate shapes come from: torch.export attaches symbolic shapes to every placeholder via meta["val"] (FakeTensor / SymInt). After a graph break, construct_submodule_inputs() reads those placeholders and builds TRT Inputs:

construct_submodule_inputs(submodule)
  → placeholder meta["val"].size()     # e.g. [1, 3, s0/4, s0/4]
  → get_input() → construct_dynamic_input()
  → extract_var_range_info(dim)        # per SymInt dim → {min, opt, max}

Today this yields one min/opt/max envelope per submodule input — the union range implied by export's ShapeEnv (see construct_dynamic_input, extract_var_range_info). That is correct for single-profile; multi-profile extends it.

Propagation rule: intermediate tensor shapes are sympy expressions over the same source symbols assigned at export (e.g. top-level s0). User-provided profile bounds apply to those source symbols; intermediate dims inherit profile bounds by evaluating the expression.

Example:

	Top-level input	Intermediate submodule input (after graph break)
Symbolic shape	`[1, 3, s0, s0]`	`[1, 3, s0/4, s0/4]`
Profile `"small"`	`s0: min=64, opt=128, max=256`	dim: min=16, opt=32, max=64
Profile `"large"`	`s0: min=1024, opt=2048, max=4096`	dim: min=256, opt=512, max=1024

For each profile name, substitute the profile's {min, opt, max} value of every free symbol into the intermediate SymInt expression (expr.xreplace(...)) and evaluate to an integer. Shape ops in export typically produce simple expressions (s0, s0/4, 2*s0, …) that are monotonically non-decreasing in each source symbol, so per-corner substitution yields the exact bounds and no separate trace per profile is needed. Expressions that are not monotonic (for example, a modulo) would need separate handling.
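A dependency-free sketch of per-corner evaluation. Plain Python callables stand in for the SymInt/sympy expressions here, so this illustrates the rule rather than the `xreplace` call itself; `propagate` and `DimExpr` are hypothetical names.

```python
from typing import Callable, Dict

Bounds = Dict[str, int]                     # {"min": .., "opt": .., "max": ..}
DimExpr = Callable[[Dict[str, int]], int]   # maps {symbol: value} -> dim size


def propagate(expr: DimExpr, source: Dict[str, Bounds]) -> Bounds:
    """Evaluate an intermediate dim at each profile corner of its source symbols."""
    return {
        corner: expr({sym: bounds[corner] for sym, bounds in source.items()})
        for corner in ("min", "opt", "max")
    }


profile_source_bounds = {
    "small": {"s0": {"min": 64, "opt": 128, "max": 256}},
    "large": {"s0": {"min": 1024, "opt": 2048, "max": 4096}},
}
quarter: DimExpr = lambda env: env["s0"] // 4  # stands in for the SymInt expr s0/4

propagated = {name: propagate(quarter, src) for name, src in profile_source_bounds.items()}
print(propagated)
```

This reproduces the 16/32/64 and 256/512/1024 rows of the table above. Note that integer floor division maps small values to 0 (for example `1 // 4 == 0`), so the rounding semantics of the exported expression matter at the low corner of a profile whose source symbol can be smaller than the divisor.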

Algorithm (per TRT submodule, after partition):

Top level: map each Input.profiles dynamic dim → export symbol name (read from top-level placeholder meta["val"] SymInt nodes). Build profile_source_bounds: {profile_name: {symbol: {min, opt, max}}}.
Per submodule: construct_submodule_inputs(submodule) as today to get symbolic shapes from placeholders; for each SymInt dim, call extract_var_range_info only for the union fallback (or evaluate per profile via substitution as above).
Emit one Input per submodule placeholder with:
- union min_shape / opt_shape / max_shape (backward compatible), and
- profiles={name: {min, opt, max}} using the same profile names as the top level.
_TRTInterpreter: write all N profiles for that submodule (§6).

Shape-tensor (scalar SymInt) inputs: handled by the same symbolic path in construct_submodule_inputs (is_shape_tensor=True). Profile substitution applies to the scalar expression the same way.

4. User API

4.1 Compile — `Input.profiles`

Multi-profile shape ranges live on Input, not a separate compile kwarg. Each dynamic input declares named regimes:

import torch
import torch_tensorrt as torchtrt

B, EMBED, MAX_SEQ = 6, 4096, 4096

inputs_embeds = torchtrt.Input(
    name="inputs_embeds",
    dtype=torch.float16,
    profiles={
        "prefill": {
            "min": (B, 1, EMBED),
            "opt": (B, 3424, EMBED),
            "max": (B, MAX_SEQ, EMBED),
        },
        "decode": {
            "min": (B, 1, EMBED),
            "opt": (B, 1, EMBED),
            "max": (B, 1, EMBED),
        },
    },
)

trt_gm = torchtrt.dynamo.compile(ep, arg_inputs=[inputs_embeds], **settings)

min / opt / max are full shape tuples (same semantics as today's min_shape / opt_shape / max_shape).

Default (single profile) is unchanged. Either pass an Input without a profiles key, or rely on shape info from the EP alone:

# static
Input(shape=(1, 3, 224, 224), dtype=torch.float32)

# one dynamic range (today)
Input(min_shape=(1, 32), opt_shape=(1, 1024), max_shape=(1, 2048))

# or shape info from EP only
torchtrt.dynamo.compile(ep, **settings)

How compile() uses Input.profiles:

Collect the union of profile names across all Inputs that define profiles.
For profile "prefill", zip each input's "prefill" entry into one TRT profile (internal normalized form for _TRTInterpreter).
Derive each input's export envelope as the elementwise min/max over its profile entries → one Dim per dynamic axis in _tracer.py.
Static inputs (no profiles) keep one shape; that shape is reused in every TRT profile.
After partition (§3.2), derive per-submodule Input.profiles by symbolic propagation from the same profile names — no second export.
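The first, second, and fourth steps above (collect names, zip per-profile entries, reuse static shapes) might look roughly like this. It is a sketch over plain dicts, not the planned `collect_optimization_profiles()`; `collect_profiles` and the `InputSpec` alias are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Shape = Tuple[int, ...]
Profile = Dict[str, Shape]                    # {"min", "opt", "max"}
InputSpec = Union[Shape, Dict[str, Profile]]  # static shape or named profiles


def collect_profiles(inputs: Dict[str, InputSpec]) -> Tuple[List[str], List[Dict[str, Profile]]]:
    """Return profile names (index order) and one {input: min/opt/max} dict per profile."""
    names: List[str] = []
    for spec in inputs.values():
        if isinstance(spec, dict):
            for profile_name in spec:
                if profile_name not in names:
                    names.append(profile_name)
    merged = []
    for profile_name in names:
        entry = {}
        for input_name, spec in inputs.items():
            if isinstance(spec, dict):
                if profile_name not in spec:
                    raise ValueError(f"{input_name!r} is missing profile {profile_name!r}")
                entry[input_name] = spec[profile_name]
            else:
                # Static input: the same shape is reused in every profile.
                entry[input_name] = {"min": spec, "opt": spec, "max": spec}
        merged.append(entry)
    return names, merged


inputs = {
    "inputs_embeds": {
        "prefill": {"min": (6, 1, 4096), "opt": (6, 3424, 4096), "max": (6, 4096, 4096)},
        "decode": {"min": (6, 1, 4096), "opt": (6, 1, 4096), "max": (6, 1, 4096)},
    },
    "past_key_values": (6, 2, 8, 4096, 128),
}
names, merged = collect_profiles(inputs)
```

The position of a name in `names` would become its TRT profile index, which is why declaration order is preserved instead of sorting.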

Rules:

profiles and top-level min_shape / opt_shape / max_shape are mutually exclusive on the same Input.
Every dynamic input must define the same set of profile names (or be static).
min ≤ opt ≤ max element-wise; min ≥ 1 on every dim; each profile must fit inside the EP's Dim envelope.
Profile names ("prefill", "decode") are compile-time labels on Input.profiles; the engine blob stores only TRT profile indices 0…N−1.
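The per-profile ordering rules are easy to check at `Input` construction time. A minimal sketch follows, with `validate_profile` as a hypothetical helper; the rank check is an extra assumption that follows from min / opt / max being full shape tuples.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def validate_profile(name: str, profile: Dict[str, Shape]) -> None:
    """Check rank agreement, min >= 1, and min <= opt <= max on every dim."""
    lo, opt, hi = profile["min"], profile["opt"], profile["max"]
    if not len(lo) == len(opt) == len(hi):
        raise ValueError(f"profile {name!r}: min/opt/max ranks differ")
    for axis, (a, b, c) in enumerate(zip(lo, opt, hi)):
        if a < 1:
            raise ValueError(f"profile {name!r}: min[{axis}]={a} must be >= 1")
        if not a <= b <= c:
            raise ValueError(
                f"profile {name!r}: need min <= opt <= max on dim {axis}, got {a}, {b}, {c}"
            )


good = {"min": (6, 1, 4096), "opt": (6, 1, 4096), "max": (6, 1, 4096)}
bad = {"min": (6, 0, 4096), "opt": (6, 1, 4096), "max": (6, 1, 4096)}

validate_profile("decode", good)  # passes silently
try:
    validate_profile("broken", bad)
    error = None
except ValueError as exc:
    error = str(exc)
```

The cross-input rule (every dynamic input defines the same profile names) needs all inputs at once, so it belongs at `compile()` entry rather than in this per-profile check.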

Today compile() still requires inputs= when not inferrable from EP (_compiler.py:668). Plan: derive static Inputs from EP placeholders; multi-profile inputs must still be supplied explicitly.

4.2 Runtime — profile selection (manual default, auto opt-in)

No serialization format change. After deserialize_cuda_engine, rebuild profile bounds from the TRT API (C++ ICudaEngine::getProfileShape / Python get_tensor_profile_shape). Cache once in _setup_engine():

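from typing import Dict, List, Tuple

# binding_name -> dynamic_dim_index -> [(min, max), ...], one tuple per profile index
self._profile_dim_ranges: Dict[str, Dict[int, List[Tuple[int, int]]]] = {}

for p in range(self.cuda_engine.num_optimization_profiles):
    for name in self.in_binding_names:
        # get_tensor_profile_shape returns the [min, opt, max] shapes for profile p
        rmin, _, rmax = self.cuda_engine.get_tensor_profile_shape(name, p)
        for d, (lo, hi) in enumerate(zip(rmin, rmax)):
            # Record dims that are dynamic in this profile or that differ between
            # profiles (_varies_across_profiles is a helper still to be written),
            # so each recorded dim gets one entry per profile index.
            if lo != hi or _varies_across_profiles(name, d):
                self._profile_dim_ranges.setdefault(name, {}).setdefault(d, []).append((lo, hi))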

Works for engines compiled in-process, loaded from cache, or deserialized from disk — no new fields in _serialized_engine_layout.

Selection modes

Profile selection is manual by default. Auto-selection is opt-in so users who know their phase boundaries (e.g. LLM prefill/decode) can avoid per-forward matching overhead.

Mode	How	When to use
Manual (default)	`with optimization_profile(trt_gm, "decode"):` or `optimization_profile(trt_gm, 0)`	Known phase boundaries; lowest runtime overhead; no shape→profile scan
Auto (opt-in)	`with optimization_profile(trt_gm, "auto"):` enables auto-selection for that span; or compile flag `auto_profile_selection=True` for module-wide auto	Overlap disambiguation handled automatically; convenience when phase is inferred from shapes

# Manual — recommended for LLM serving (prefill/decode boundaries are explicit)
with optimization_profile(trt_gm, "prefill"):
    out = trt_gm(inputs_embeds=embeds, past_key_values=kv)
with optimization_profile(trt_gm, "decode"):
    out = trt_gm(inputs_embeds=embeds, past_key_values=kv)

# Auto — opt-in when shapes alone determine the profile
with optimization_profile(trt_gm, "auto"):
    out = trt_gm(x)  # selects profile from input shapes each forward

Priority: pinned name/index > auto (if enabled) > error on ambiguity. Outside an "auto" span and without an explicit pin, do not switch profiles — the active profile from the last optimization_profile(...) call remains in effect (or profile 0 if never set).

Profile-switch overhead: setOptimizationProfileAsync may invalidate CUDA Graph captures and incurs a small device sync. Benchmark switching cost during implementation; document expected overhead in the user guide. Manual selection avoids the per-forward candidate scan entirely.

Optional: persist Input.profiles name→index map in the existing serialized_metadata JSON blob (no C++ layout change) so optimization_profile(m, "decode") works after reload. If absent, the context manager accepts indices only.
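One way the optional name→index map could ride along in a JSON metadata blob is sketched below. The key `profile_name_to_index` and the helper `resolve_profile` are hypothetical; the actual layout of `serialized_metadata` is not specified here.

```python
import json
from typing import Union

profile_names = ["prefill", "decode"]  # compile-time order == TRT profile index
metadata = {"profile_name_to_index": {n: i for i, n in enumerate(profile_names)}}
blob = json.dumps(metadata)  # would be stored inside the existing metadata blob


def resolve_profile(blob: str, profile: Union[int, str]) -> int:
    """Map a name to its index when the map is present; pass indices through."""
    if isinstance(profile, int):
        return profile
    mapping = json.loads(blob).get("profile_name_to_index")
    if mapping is None:
        raise KeyError("engine has no profile names; pin by index instead")
    return mapping[profile]


print(resolve_profile(blob, "decode"))  # 1
print(resolve_profile(blob, 0))         # 0
```

Engines without the map still work with integer indices, which matches the fallback described above.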

Auto-selection algorithm (when enabled)

Runs in setup_input_tensors before set_input_shape:

1. Start with the set of all profile indices P = {0, …, N−1}.
2. For each input binding, eliminate profiles where any dynamic dim of the actual tensor shape falls outside that profile's [min, max] for that binding.
3. Require all inputs to agree on the same surviving profile. If input A leaves {0, 1} and input B leaves {1, 2} with no intersection → raise (conflicting shape signals).
4. Exactly one survivor → set_active_profile(p) if not already active (idempotent).
5. Zero survivors → raise with cached ranges and a hint to fix shapes or pin explicitly.
6. Multiple survivors (overlapping profiles) → select the profile whose opt shape is closest to the actual input shapes (§3.1.1); set_active_profile(p) if not already active.

See §3.1.1 for the distance metric and overlap examples. Manual pin via optimization_profile(trt_gm, name) skips steps 1–6 entirely.
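The rule-out and intersection part of the algorithm can be sketched over the cached `{binding: {dim: [(min, max), ...]}}` structure. `surviving_profiles` is a hypothetical helper, and for brevity it raises a single error for both the "conflicting inputs" and "zero survivors" cases; the closest-opt tie-break would then run on the returned set.

```python
from typing import Dict, List, Set, Tuple

# binding -> dim index -> [(min, max) per profile index], as cached at load time
Ranges = Dict[str, Dict[int, List[Tuple[int, int]]]]


def surviving_profiles(
    shapes: Dict[str, Tuple[int, ...]], ranges: Ranges, num_profiles: int
) -> Set[int]:
    """Intersect, across inputs, the profiles whose [min, max] contain the shape."""
    survivors = set(range(num_profiles))
    for binding, dims in ranges.items():
        fits = {
            p
            for p in range(num_profiles)
            if all(dims[d][p][0] <= shapes[binding][d] <= dims[d][p][1] for d in dims)
        }
        survivors &= fits
    if not survivors:
        raise ValueError(f"no profile accepts shapes {shapes}; cached ranges: {ranges}")
    return survivors


ranges = {"inputs_embeds": {1: [(1, 4096), (1, 1)]}}  # 0 = "prefill", 1 = "decode"
mid_prefill = surviving_profiles({"inputs_embeds": (6, 512, 4096)}, ranges, 2)
decode_step = surviving_profiles({"inputs_embeds": (6, 1, 4096)}, ranges, 2)
print(mid_prefill, decode_step)
```

With the overlapping prefill/decode ranges, a sequence length of 512 leaves only profile 0, while a length of 1 leaves both and hands the decision to the tie-break.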

Context manager

with optimization_profile(trt_gm, "decode"):   # pin by name
    out = trt_gm(...)
with optimization_profile(trt_gm, 1):         # pin by index
    out = trt_gm(...)
with optimization_profile(trt_gm, "auto"):    # enable auto-selection for this span
    out = trt_gm(...)

Required behavior: stack semantics for nested spans, an idempotent switch when the requested profile is already active, CUDA Graph invalidation on change, and parity with the C++ op.
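A toy sketch of the stack and idempotence behavior, using a stand-in class instead of a real compiled module. `FakeEngineModule` and its `switches` counter are hypothetical, and restoring the previous profile on exit is an assumption about what stack semantics should mean here.

```python
from contextlib import contextmanager
from typing import Iterator, Union


class FakeEngineModule:
    """Stand-in for a compiled module; it only tracks the active profile."""

    def __init__(self) -> None:
        self.active_profile: Union[int, str] = 0
        self.switches = 0

    def set_active_profile(self, profile: Union[int, str]) -> None:
        if profile != self.active_profile:  # idempotent: no-op when unchanged
            self.active_profile = profile
            self.switches += 1


@contextmanager
def optimization_profile(module: FakeEngineModule, profile: Union[int, str]) -> Iterator[None]:
    """Pin `profile` for the span, then restore whatever was active before."""
    previous = module.active_profile
    module.set_active_profile(profile)
    try:
        yield
    finally:
        module.set_active_profile(previous)


m = FakeEngineModule()
with optimization_profile(m, "prefill"):
    with optimization_profile(m, "decode"):
        inner = m.active_profile
    outer = m.active_profile
```

Saving the previous profile in a local variable of each span gives nesting for free, and the `finally` clause restores it even if the forward call raises.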

4.3 JIT: `torch.compile(backend="tensorrt")`

The JIT backend (backend/backends.py → compile_module → same _TRTInterpreter) shares conversion and runtime with AOT. Differences are only at how shape ranges are supplied.

	AOT `dynamo.compile(ep)`	JIT `torch.compile(..., backend="tensorrt")`
Shape source	EP `Dim(min, max)` + optional profiles	First-forward tensors are concrete; profiles must carry min/opt/max explicitly
Export	`torch.export` once over union envelope	`aot_export_joint_simple` on dynamo-traced graph
Runtime profile switch	`optimization_profile(...)` context manager	Same — works on the returned `GraphModule`

Usage — pass the same Input(profiles=...) objects via options:

inputs_embeds = torchtrt.Input(name="inputs_embeds", profiles={...})

compiled = torch.compile(
    model,
    backend="tensorrt",
    dynamic=True,
    options={"arg_inputs": [inputs_embeds]},
)

Rules for JIT:

Supply Input(profiles=...) in options["arg_inputs"] (or kwarg_inputs) — do not rely on prepare_inputs(first_tensor) alone.
Use dynamic=True so prefill and decode hit one dynamo compile / one engine; switch profiles at runtime via §4.2.
optimization_profile() unchanged from §4.2.

Not supported in v1: inferring profiles from recompilation (guards building a new engine per shape regime). LLM users who need serialization should prefer AOT dynamo.compile(ep, ...).

5. LLM constraints

No min=0. Reject in Input.profiles validation at Input construction / compile() entry.

Use a static KV cache (tools/llm/static_cache_v1.py): fixed [B, H, max_seq_len, D] tensors; track the valid region with scalar start_idx / end_idx. Multi-profile only needs to specialize token inputs (input_ids, position_ids), not cache shape.

Shape-polymorphic over the union: the exported graph must not branch on token length in Python (§3.1). With static cache + index-based attention, intermediate TRT subgraphs see symbolic shapes derived from token inputs only; profile names propagate automatically (§3.2).

6. Implementation sketch

Area	Change
`_Input.py`	Add `profiles: Dict[str, {min, opt, max}]`; mutual exclusion with `min/opt/max_shape`; validation
`_compiler.py`	`collect_optimization_profiles(inputs)` → normalized list for interpreter + cache hash; build `profile_source_bounds` from top-level `Input.profiles` + EP placeholder symbols; after partition, attach propagated profiles to submodule inputs
`_tracer.py`	Union envelope per dim from `Input.profiles` (AOT); post-export validation: profile ⊆ export envelope, log narrowing when export envelope ⊃ profile union (§3.1); JIT: same in backend when `Input.profiles` set
`partitioning/common.py`	Extend `construct_dynamic_input` / add `construct_dynamic_input_multi_profile`: given symbolic shape + `profile_source_bounds`, evaluate per-profile min/opt/max via `SymInt` expr substitution; extend `construct_submodule_inputs` to emit `Input.profiles` when top-level profiles present
`dynamo/utils.py`	Add `extract_var_range_info_for_profile(symint, profile_bounds, mode)` (or generalize `extract_var_range_info`) for per-profile evaluation; keep existing function for union / single-profile
`_TRTInterpreter.py`	Create N profiles; loop in `placeholder()` instead of writing index 0 only
`_settings.py`	Internal `optimization_profiles` list (built from inputs, picklable for cache/serialization)
`_TRTEngine.py`	After deserialize: build `_profile_dim_ranges` (min/max) and cache per-profile opt shapes for closest-opt tie-break (§3.1.1); opt-in auto-select in `setup_input_tensors`; `set_active_profile`; benchmark profile-switch overhead
`runtime/_optimization_profile.py`	Context manager: pin by name/index, opt-in `"auto"` span; overlap disambiguation
Engine cache	Hash `Input.profiles` content at compile time
Serialization	No new tuple fields; optional name→index in existing `serialized_metadata` only

Prior art: FX converter already loops over N shape_ranges in fx2trt.py. Submodule symbolic shapes already flow from export through construct_submodule_inputs today; multi-profile extends construct_dynamic_input + extract_var_range_info from one envelope to N named profiles.
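For the engine-cache row in the table above, hashing the `Input.profiles` content could be done with a canonical JSON encoding. This is a sketch under assumptions: `profiles_digest` is a hypothetical helper, inputs are sorted by name, and profile declaration order is kept because it determines the TRT profile index.

```python
import hashlib
import json
from typing import Dict, Tuple

Profile = Dict[str, Tuple[int, ...]]


def profiles_digest(inputs: Dict[str, Dict[str, Profile]]) -> str:
    """Stable SHA-256 over profile contents; profile order is significant."""
    canonical = [
        [
            input_name,
            [[pname, [list(p[c]) for c in ("min", "opt", "max")]] for pname, p in profs.items()],
        ]
        for input_name, profs in sorted(inputs.items())
    ]
    payload = json.dumps(canonical, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


prefill = {"min": (6, 1, 4096), "opt": (6, 3424, 4096), "max": (6, 4096, 4096)}
decode = {"min": (6, 1, 4096), "opt": (6, 1, 4096), "max": (6, 1, 4096)}

digest_a = profiles_digest({"inputs_embeds": {"prefill": prefill, "decode": decode}})
digest_b = profiles_digest({"inputs_embeds": {"decode": decode, "prefill": prefill}})
```

Here `digest_a` and `digest_b` differ because swapping the declaration order would swap the profile indices inside the engine.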

7. Example (Alpamayo)

Alpamayo decode benchmark shape: batch=6, inputs_embeds=(6, 1, 4096), static KV (6, 2, 8, 4096, 128). Only inputs_embeds seq len varies between prefill and decode; KV stays static (§5).

inputs_embeds = torchtrt.Input(
    name="inputs_embeds",
    profiles={
        "prefill": {"min": (6, 1, 4096), "opt": (6, 3424, 4096), "max": (6, 4096, 4096)},
        "decode":  {"min": (6, 1, 4096), "opt": (6, 1, 4096),    "max": (6, 1, 4096)},
    },
)

trt_gm = torchtrt.dynamo.compile(ep, arg_inputs=[inputs_embeds])

# Manual (recommended for LLM — explicit phase boundaries, lowest overhead):
with optimization_profile(trt_gm, "decode"):
    out = trt_gm(inputs_embeds=embeds, past_key_values=kv)

# Or opt into auto when shapes determine the profile:
with optimization_profile(trt_gm, "auto"):
    out = trt_gm(inputs_embeds=embeds, past_key_values=kv)

Decode under "decode" should match ONNX-TRT decode (~5.1 ms GPU Compute on the Alpamayo repro) instead of the ~2× slower single-profile engine tuned for prefill opt=3424.

Graph-break example (symbolic propagation)

Suppose partition yields a TRT submodule whose placeholder shape is (B, H, s0, D) where the top-level inputs_embeds seq dim is s0. User profiles on inputs_embeds:

Profile	`s0` (seq)	Submodule dim `s0`
`"prefill"`	min=1, opt=3424, max=4096	same
`"decode"`	min=1, opt=1, max=1	same

Both engines (root and submodule) get TRT optimization profiles "prefill" / "decode" with identical bounds on s0 — derived from the user's top-level Input.profiles, not a second trace.

If an intermediate op divides spatially, e.g. submodule input (B, H, s0/4, D):

Profile	`s0`	`s0/4` (min / opt / max)
`"prefill"`	1 / 3424 / 4096	1 / 856 / 1024
`"decode"`	1 / 1 / 1	1 / 1 / 1

No user action required on the intermediate tensor; propagation is automatic from export's symbolic metadata.

8. Backward compatibility

No Input.profiles → existing zero/one-profile paths unchanged; same cache key.
Old engines load with 0–1 profiles; _profile_dim_ranges empty or single-entry; behavior unchanged.
Multi-profile engines deserialized from an older Torch-TRT build still expose ranges via get_tensor_profile_shape once that build writes N profiles into the blob.

9. Implementation plan

_Input.py: profiles dict + validation; collect_optimization_profiles() in _compiler.py.
_tracer.py: union envelope; post-export profile ⊆ envelope validation (§3.1).
partitioning/common.py + utils.py: per-profile symbolic propagation for submodule inputs (§3.2); wire into _compiler.py partition loop (before / instead of per-submodule single-profile construct_submodule_inputs only).
_TRTInterpreter.py: multi-profile loop.
_TRTEngine: _profile_dim_ranges from get_tensor_profile_shape at load; opt-in auto-select + per-input rule-out algorithm (§4.2); set_active_profile; profile-switch overhead benchmark.
runtime/_optimization_profile.py + C++ op: manual pin, "auto" span, stack semantics.
Cache hash; optional name map in serialized_metadata.
Tests: export envelope validation (profile ⊄ export → error; export ⊃ union → warn/log); symbolic propagation; submodule profile reuse; range reconstruction; manual pin; opt-in auto-select; overlap closest-opt tie-break (§3.1.1); multi-input agreement; Alpamayo e2e.

narendasan · 2026-05-30T00:24:00Z

narendasan
May 30, 2026
Collaborator

@cehongwang is this something we could add to the Input class instead of an additional API, like allowing for disjoint shape ranges?

Something like

Input(
    profiles = {
       "prefill": {
          "min": ..., "max": ..., "opt": ...
       }, "decode": {
          "min": ..., "max": ..., "opt": ...
       }
    }
)

With some cross input error checking?

5 replies

narendasan May 30, 2026
Collaborator

Is there a use case for disjoint profiles? For example, one input always has a default but other inputs may have a few different ones?

cehongwang May 30, 2026
Collaborator Author

We can use Input and require that the number of profiles is the same for all inputs.

> like one input always has a default but other inputs may have a few different ones?

That would cause issues if there are lots of inputs with different optimization profiles; it would become a mess.

narendasan May 30, 2026
Collaborator

Is it something that is possible but that we don't want to deal with, or something TensorRT does not allow?

cehongwang May 30, 2026
Collaborator Author

But since Input is not used anywhere in the current compilation process, it is equivalent to adding a new API; it is just a matter of which API is more user-friendly and intuitive.

narendasan May 30, 2026
Collaborator

IMO Input makes more sense than a new API because there are other use cases for Input that overlap (see below).

narendasan · 2026-05-30T00:27:50Z

narendasan
May 30, 2026
Collaborator

There is a somewhat related thread that @apbose should be working on: named tuples for different dimensions, to allow for cross-input dynamic dimensions. These two features should work together.

3 replies

cehongwang May 30, 2026
Collaborator Author

Is there any RFC or PR for this feature?

apbose Jun 2, 2026
Collaborator

This is the PR: #4233. It decides which dynamic symbols exist and which input dims share them, with the same Dim shared across multiple inputs (e.g. one batch for both input_ids and attention_mask). I assume for this RFC we would want a shared seq_len across input_ids/position_ids/inputs_embeds and then prefill/decode profiles over that shared seq_len. I need to change the above PR so that there is no separate dynamic_shapes dict alongside Input, and all the info lives in the Input objects in kwarg_inputs/arg_inputs. I can push that by today/tomorrow.

cehongwang Jun 4, 2026
Collaborator Author

With this PR, should the new Input object be something like this?

torchtrt.Input(
    profiles={
        "prefill": {
            "min_shape": (1, seq),
            "opt_shape": (4, seq),
            "max_shape": (4, seq),
        },
        "decode": {
            "min_shape": (1, seq),
            "opt_shape": (1, seq),
            "max_shape": (1, seq),
        },
    },
    dtype=torch.int64,
    name="input_ids",
    name_dims={0: "B"},
)

narendasan · 2026-05-30T00:29:00Z

narendasan
May 30, 2026
Collaborator

4.2 Runtime

How do we store the information about the different profiles to be used at runtime?

5 replies

cehongwang May 30, 2026
Collaborator Author

It would be good to store it in the metadata.

narendasan May 30, 2026
Collaborator

How would it work in the C++ runtime then? Don't we need to verify that there is an available profile?

cehongwang May 30, 2026
Collaborator Author

TRT stores optimization profiles as a list, and there is an API to check the number of optimization profiles: https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/c-api/classnvinfer1_1_1_i_cuda_engine.html#ad8778c30ad905b7f7fc02673127557c7:~:text=getNbLayers-,getNbOptimizationProfiles,-getProfileShape

Setting the optimization profile also takes only an index: https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/c-api/classnvinfer1_1_1_i_execution_context.html#a74c361a3d93e70a3164988df7d60a4cc

So I think the only thing that matters is the shape of the inputs, and it is the user's responsibility to make them match.

cehongwang May 30, 2026
Collaborator Author

In Python, we can set an alias associated with each index to make it more user-friendly. The pure C++ runtime only cares about the index.

narendasan Jun 1, 2026
Collaborator

I think we need to think a bit more about this. Should we detect the profile? Or should we add an API for this and make users manage it?

cehongwang · 2026-06-01T20:22:28Z

cehongwang
Jun 1, 2026
Collaborator Author

There are APIs to check the number of profiles and each profile's tensor shapes, so we can detect it from the engine. But if the engine is compiled with torchtrt, why do we need to detect it? Users can manage this via the context manager. Do you have a better way to deal with it?

9 replies

narendasan Jun 1, 2026
Collaborator

We could also reconstruct this here: https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/c-api/classnvinfer1_1_1_i_cuda_engine.html#a37eb21beda0055cc4a442df485b682ed

My thought is that on deserialization we can cache this info and queue a profile switch if the input falls into another profile, in addition to giving people a context manager to disambiguate overlapping cases (if allowed).

That way we don't need to expand the serialization format.

narendasan Jun 1, 2026
Collaborator

Basically, I think we should have a data structure, constructed at deserialization time, that looks something like

{ binding_name : { dynamic_index : [(min, max), (min, max)] }}

As part of setting the tensor address, we should check the input's shape on the dynamic axis to select the profile, as well as verify that we aren't violating the profile across multiple inputs.

cehongwang Jun 1, 2026
Collaborator Author

Do you mean { binding_name : { profile_name1 : { dynamic_index : [(min, max), (min, max)] }, profile_name2 : { dynamic_index : [(min, max), (min, max)] } } }?

cehongwang Jun 1, 2026
Collaborator Author

> My thought is that on deserialization we can cache this info and queue a profile switch if the input falls into another profile, in addition to giving people a context manager to disambiguate overlapping cases (if allowed).

What if one input falls in one profile and another input falls in a different profile? I think overlapping should not be allowed; otherwise it gets too complicated. It is really easy for users to define the ranges they need.

cehongwang Jun 2, 2026
Collaborator Author

https://github.com/NVIDIA/TensorRT-Edge-LLM/blob/364769036fc83351d9d0aac4cc064a8e56a83178/cpp/runtime/llmEngineRunner.cpp#L1236

Here is how EdgeLLM sets the optimization profile at runtime.

Replies: 4 comments · 22 replies