8 changes: 6 additions & 2 deletions docs/api/kernel.md
@@ -88,6 +88,10 @@ bound_static = shape_specialized_kernel.bind((torch.randn(100, 50),))
result = bound_static(torch.randn(100, 50)) # Must be exactly [100, 50]
```

+```{warning}
+Helion shape-specializes kernels by default (`static_shapes=True`) for the best performance. Bound kernels and caches require tensors with the exact same shapes and strides as the examples you compile against. Set `static_shapes=False` if you need the same compiled kernel to serve many shapes.
+```
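
To make the opt-out concrete, here is a minimal sketch of a kernel that serves many shapes; the body follows Helion's element-wise add example, and the sizes and device are illustrative:

```python
import torch
import helion
import helion.language as hl

# Opt out of shape specialization so one compiled kernel serves many sizes.
@helion.kernel(static_shapes=False)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):  # tile over the output shape
        out[tile] = x[tile] + y[tile]
    return out

# Both calls can reuse one compiled kernel: every dimension stays in the
# >=2 size bucket, so they share a cache entry.
a = add(torch.randn(100, 50, device="cuda"), torch.randn(100, 50, device="cuda"))
b = add(torch.randn(64, 32, device="cuda"), torch.randn(64, 32, device="cuda"))
```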

### BoundKernel Methods

The returned BoundKernel has these methods:
@@ -131,9 +135,9 @@ print(triton_code)
Kernels are automatically cached based on:

- **Argument types** (dtype, device)
-- **Tensor shapes** (when using `static_shapes=True`)
+- **Tensor shapes** (default: `static_shapes=True`)

-By default (`static_shapes=False`), kernels only specialize on basic shape categories (0, 1, or ≥2 per dimension) rather than exact shapes, allowing the same compiled kernel to handle different tensor sizes efficiently.
+By default (`static_shapes=True`), Helion treats shapes and strides as compile-time constants, baking them into generated Triton code for the best performance. To reuse a single compiled kernel across size variations, set `static_shapes=False`, which instead buckets each dimension as `{0, 1, ≥2}` and allows more inputs to share the same cache entry.

```python
# These create separate cache entries
# ...
```
2 changes: 1 addition & 1 deletion docs/api/settings.md
@@ -98,7 +98,7 @@ with helion.set_default_settings(

.. autoattribute:: Settings.static_shapes

-When enabled, tensor shapes are treated as compile-time constants for optimization. Default is ``False``.
+When enabled, tensor shapes are treated as compile-time constants for optimization. Default is ``True``. Set this to ``False`` if you need a single compiled kernel instance to serve many shape variants.
```
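
For example, a hedged sketch of flipping this default for a block of code; it assumes `set_default_settings` accepts a `Settings` instance and acts as a context manager, matching the usage shown earlier on this page:

```python
import helion

# Assumption: set_default_settings takes a Settings instance, per the
# set_default_settings docs earlier on this page.
with helion.set_default_settings(helion.Settings(static_shapes=False)):
    # Kernels defined in this block default to dynamic (bucketed) shapes.
    ...

# A per-kernel override on the decorator still applies outside the block:
# @helion.kernel(static_shapes=False)
```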

### Autotuning Settings
31 changes: 17 additions & 14 deletions docs/deployment_autotuning.md
@@ -146,13 +146,16 @@ config and selecting the fastest.
A key detail here is controlling the specialization key, which
determines when to re-benchmark. Options include:

-- **Default (dynamic shapes):** we reuse the timing result as long as
-  tensor dtypes and device types stay constant. Shape changes only trigger
-  a re-selection when a dimension size crosses the buckets `{0, 1, ≥2}`.
+- **Default (`static_shapes=True`):** Helion shape-specializes on the exact
+  shape/stride signature, rerunning the selection whenever those shapes
+  differ. This delivers the best per-shape performance but requires all calls
+  to match the example shapes exactly.

-- **`static_shapes=True`:** add this setting to the decorator to specialize
-  on the exact shape/stride signature, rerunning the selection whenever
-  those shapes differ.
+- **`static_shapes=False`:** switch to bucketed dynamic shapes. Helion
+  reuses results as long as tensor dtypes and device types stay constant.
+  Shape changes only trigger a re-selection when a dimension size crosses
+  the buckets `{0, 1, ≥2}`. Use this when you need one compiled kernel to
+  handle many input sizes.

- **Custom keys:** pass `key=` to group calls however you like.
This custom key is in addition to the above.
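
A hedged sketch of the three options above as decorator arguments; the kernel bodies are stubs, and treating `key=` as a callable over the kernel arguments is an assumption, not confirmed API:

```python
import torch
import helion

# Default: no extra arguments; exact shape/stride specialization.
@helion.kernel()
def static_kernel(x: torch.Tensor) -> torch.Tensor: ...

# Bucketed dynamic shapes: one timing result covers many sizes.
@helion.kernel(static_shapes=False)
def dynamic_kernel(x: torch.Tensor) -> torch.Tensor: ...

# Custom key (assumed callable form): re-select when alignment changes.
@helion.kernel(key=lambda x: x.stride(0) % 16 == 0)
def custom_key_kernel(x: torch.Tensor) -> torch.Tensor: ...
```
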
@@ -197,15 +200,15 @@ input types. You can pre-compile as many configs as you need using
`BoundKernel.compile_config`. **Warning:** `kernel.bind()` specializes,
and the result will only work with the same input types you passed.

-- With `static_shapes=False` (default) it will specialize on the input
-  dtypes, device types, and whether each dynamic dimension falls into the
-  0, 1, or ≥2 bucket. Python types are also specialized. For dimensions
-  that can vary across those buckets, supply representative inputs ≥2
-  to avoid excessive specialization.
+- With `static_shapes=True` (default) the bound kernel only works for the
+  exact shape/stride signature of the example inputs. The generated code
+  has shapes baked in, which often provides a performance boost.

-- With `static_shapes=True` the bound kernel only works for the exact
-  shape/stride signature of the example inputs. The generated code will
-  have shapes baked in, which often provides a performance boost.
+- With `static_shapes=False` it will specialize on the input dtypes,
+  device types, and whether each dynamic dimension falls into the 0, 1,
+  or ≥2 bucket. Python types are also specialized. For dimensions that
+  can vary across those buckets, supply representative inputs ≥2 to avoid
+  excessive specialization.

If you need to support multiple input types, bind multiple times with
representative inputs.
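
As a hedged end-to-end sketch of that flow (the kernel and shapes are placeholders; the `Config` fields and passing it to `compile_config` are assumptions based on the description above):

```python
import torch
import helion
import helion.language as hl

@helion.kernel(static_shapes=False)
def scale(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] * 2.0
    return out

# Bind once per input type you plan to serve. Sizes >=2 keep every dynamic
# dimension in the >=2 bucket, avoiding accidental 0/1 specialization.
bound = scale.bind((torch.randn(2048, 2048, device="cuda"),))

# Pre-compile each config you plan to ship (assumed call form; the config
# would normally come from an earlier autotuning run).
config = helion.Config(block_sizes=[64, 64])
bound.compile_config(config)

result = bound(torch.randn(4096, 1024, device="cuda"))  # reuses the binding
```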