8 changes: 6 additions & 2 deletions docs/api/kernel.md
@@ -88,6 +88,10 @@ bound_static = shape_specialized_kernel.bind((torch.randn(100, 50),))
result = bound_static(torch.randn(100, 50)) # Must be exactly [100, 50]
```

+```{warning}
+Helion shape-specializes kernels by default (`static_shapes=True`) for the best performance. Bound kernels and caches require tensors with the exact same shapes and strides as the examples you compile against. Set `static_shapes=False` if you need the same compiled kernel to serve many shapes.
+```
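
To make the opt-out concrete, here is a minimal sketch of a kernel that serves many shapes; the body follows Helion's element-wise add example, and the sizes and device are illustrative:

```python
import torch
import helion
import helion.language as hl

# Opt out of shape specialization so one compiled kernel serves many sizes.
@helion.kernel(static_shapes=False)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):  # tile over the output shape
        out[tile] = x[tile] + y[tile]
    return out

# Both calls can reuse one compiled kernel: every dimension stays in the
# >=2 size bucket, so they share a cache entry.
a = add(torch.randn(100, 50, device="cuda"), torch.randn(100, 50, device="cuda"))
b = add(torch.randn(64, 32, device="cuda"), torch.randn(64, 32, device="cuda"))
```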

### BoundKernel Methods

The returned BoundKernel has these methods:
@@ -131,9 +135,9 @@ print(triton_code)
Kernels are automatically cached based on:

- **Argument types** (dtype, device)
-- **Tensor shapes** (when using `static_shapes=True`)
+- **Tensor shapes** (default: `static_shapes=True`)

-By default (`static_shapes=False`), kernels only specialize on basic shape categories (0, 1, or ≥2 per dimension) rather than exact shapes, allowing the same compiled kernel to handle different tensor sizes efficiently.
+By default (`static_shapes=True`), Helion treats shapes and strides as compile-time constants, baking them into generated Triton code for the best performance. To reuse a single compiled kernel across size variations, set `static_shapes=False`, which instead buckets each dimension as `{0, 1, ≥2}` and allows more inputs to share the same cache entry.

```python
# These create separate cache entries
# ...
```
2 changes: 1 addition & 1 deletion docs/api/settings.md
@@ -98,7 +98,7 @@ with helion.set_default_settings(

.. autoattribute:: Settings.static_shapes

-When enabled, tensor shapes are treated as compile-time constants for optimization. Default is ``False``.
+When enabled, tensor shapes are treated as compile-time constants for optimization. Default is ``True``. Set this to ``False`` if you need a single compiled kernel instance to serve many shape variants.
```
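
For example, a hedged sketch of flipping this default for a block of code; it assumes `set_default_settings` accepts a `Settings` instance and acts as a context manager, matching the usage shown earlier on this page:

```python
import helion

# Assumption: set_default_settings takes a Settings instance, per the
# set_default_settings docs earlier on this page.
with helion.set_default_settings(helion.Settings(static_shapes=False)):
    # Kernels defined in this block default to dynamic (bucketed) shapes.
    ...

# A per-kernel override on the decorator still applies outside the block:
# @helion.kernel(static_shapes=False)
```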

### Autotuning Settings
31 changes: 17 additions & 14 deletions docs/deployment_autotuning.md
@@ -146,13 +146,16 @@ config and selecting the fastest.
A key detail here is controlling the specialization key, which
determines when to re-benchmark. Options include:

-- **Default (dynamic shapes):** we reuse the timing result as long as
-  tensor dtypes and device types stay constant. Shape changes only trigger
-  a re-selection when a dimension size crosses the buckets `{0, 1, ≥2}`.
+- **Default (`static_shapes=True`):** Helion shape-specializes on the exact
+  shape/stride signature, rerunning the selection whenever those shapes
+  differ. This delivers the best per-shape performance but requires all calls
+  to match the example shapes exactly.

-- **`static_shapes=True`:** add this setting to the decorator to specialize
-  on the exact shape/stride signature, rerunning the selection whenever
-  those shapes differ.
+- **`static_shapes=False`:** switch to bucketed dynamic shapes. Helion
+  reuses results as long as tensor dtypes and device types stay constant.
+  Shape changes only trigger a re-selection when a dimension size crosses
+  the buckets `{0, 1, ≥2}`. Use this when you need one compiled kernel to
+  handle many input sizes.

- **Custom keys:** pass `key=` to group calls however you like.
This custom key is in addition to the above.
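
A hedged sketch of the three options above as decorator arguments; the kernel bodies are stubs, and treating `key=` as a callable over the kernel arguments is an assumption, not confirmed API:

```python
import torch
import helion

# Default: no extra arguments; exact shape/stride specialization.
@helion.kernel()
def static_kernel(x: torch.Tensor) -> torch.Tensor: ...

# Bucketed dynamic shapes: one timing result covers many sizes.
@helion.kernel(static_shapes=False)
def dynamic_kernel(x: torch.Tensor) -> torch.Tensor: ...

# Custom key (assumed callable form): re-select when alignment changes.
@helion.kernel(key=lambda x: x.stride(0) % 16 == 0)
def custom_key_kernel(x: torch.Tensor) -> torch.Tensor: ...
```
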
@@ -197,15 +200,15 @@ input types. You can pre-compile as many configs as you need using
`BoundKernel.compile_config`. **Warning:** `kernel.bind()` specializes,
and the result will only work with the same input types you passed.

-- With `static_shapes=False` (default) it will specialize on the input
-  dtypes, device types, and whether each dynamic dimension falls into the
-  0, 1, or ≥2 bucket. Python types are also specialized. For dimensions
-  that can vary across those buckets, supply representative inputs ≥2
-  to avoid excessive specialization.
+- With `static_shapes=True` (default) the bound kernel only works for the
+  exact shape/stride signature of the example inputs. The generated code
+  has shapes baked in, which often provides a performance boost.

-- With `static_shapes=True` the bound kernel only works for the exact
-  shape/stride signature of the example inputs. The generated code will
-  have shapes baked in, which often provides a performance boost.
+- With `static_shapes=False` it will specialize on the input dtypes,
+  device types, and whether each dynamic dimension falls into the 0, 1,
+  or ≥2 bucket. Python types are also specialized. For dimensions that
+  can vary across those buckets, supply representative inputs ≥2 to avoid
+  excessive specialization.

If you need to support multiple input types, bind multiple times with
representative inputs.
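
As a hedged end-to-end sketch of that flow (the kernel and shapes are placeholders; the `Config` fields and passing it to `compile_config` are assumptions based on the description above):

```python
import torch
import helion
import helion.language as hl

@helion.kernel(static_shapes=False)
def scale(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] * 2.0
    return out

# Bind once per input type you plan to serve. Sizes >=2 keep every dynamic
# dimension in the >=2 bucket, avoiding accidental 0/1 specialization.
bound = scale.bind((torch.randn(2048, 2048, device="cuda"),))

# Pre-compile each config you plan to ship (assumed call form; the config
# would normally come from an earlier autotuning run).
config = helion.Config(block_sizes=[64, 64])
bound.compile_config(config)

result = bound(torch.randn(4096, 1024, device="cuda"))  # reuses the binding
```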