# Helion Puzzles

Programming for accelerators such as GPUs is critical for modern AI systems. This often means programming directly in proprietary low-level languages such as CUDA. Helion is a Python-embedded domain-specific language (DSL) for authoring machine learning kernels, designed to compile down to Triton, a performant backend for programming GPUs and other devices.

Helion aims to raise the level of abstraction compared to Triton, making it easier to write correct and efficient kernels while enabling more automation in the autotuning process.

This set of puzzles teaches you how to use Helion from first principles, interactively. You will start with trivial examples and build your way up to real algorithms like Flash Attention and quantized neural networks.

## Setup

First, let's install the necessary dependencies. Helion requires a recent version of PyTorch and a development version of Triton.

In [None]:
%%capture
# Only need to run the first time.
!pip install torch
# !pip install git+https://github.com/triton-lang/triton.git
!pip install git+https://github.com/pytorch-labs/helion.git

In [None]:
import torch
import helion
import helion.language as hl
from torch import Tensor

Let's also create a simple testing function to verify our implementations.

In [None]:
from triton.testing import do_bench


def test_kernel(kernel_fn, spec_fn, *args):
    """Test a Helion kernel against a reference implementation."""
    # Run our implementation
    result = kernel_fn(*args)
    # Run reference implementation
    expected = spec_fn(*args)

    # Check if results match
    torch.testing.assert_close(result, expected)
    print("✅ Results Match ✅")


def benchmark_kernel(kernel_fn, *args, **kwargs):
    """Benchmark a Helion kernel."""
    no_args = lambda: kernel_fn(*args, **kwargs)
    time_in_ms = do_bench(no_args)
    print(f"⏱ Time: {time_in_ms} ms")


def compare_implementations(kernel_fn, spec_fn, *args, **kwargs):
    """Benchmark a Helion kernel and its reference implementation."""
    kernel_no_args = lambda: kernel_fn(*args, **kwargs)
    spec_no_args = lambda: spec_fn(*args, **kwargs)
    kernel_time = do_bench(kernel_no_args)
    spec_time = do_bench(spec_no_args)
    print(f"⏱ Helion Kernel Time: {kernel_time:.3f} ms, PyTorch Reference Time: {spec_time:.3f} ms, Speedup: {spec_time/kernel_time:.3f}x")

## Introduction to Helion

Helion allows you to write GPU kernels using familiar PyTorch syntax. The code outside the `for` loops is standard PyTorch code executed on the CPU. The code inside the `for` loops is compiled into a Triton kernel, resulting in a single GPU kernel.

Unlike raw Triton, Helion handles memory management, tiling, and other low-level details automatically. This allows you to focus on the algorithm rather than the implementation details.

## Basic Structure of a Helion Kernel

A Helion kernel consists of two main parts:
1. **Host Code**: Standard PyTorch code executed on the CPU (outside the loops)
2. **Device Code**: Operations inside `hl.tile()` loops that execute on the GPU

Let's examine a simple example:

In [None]:
@helion.kernel(config=helion.Config(block_sizes = [128, 128]))  # The @helion.kernel decorator marks this function for compilation
def example_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Host code: Standard PyTorch operations
    m, n = x.size()
    out = torch.empty_like(x)  # Allocate output tensor

    # The hl.tile loop defines the parallel execution structure
    for tile_m, tile_n in hl.tile([m, n]):
        # Device code: Everything inside the hl.tile loop runs on GPU
        out[tile_m, tile_n] = x[tile_m, tile_n] + y[tile_m, tile_n]  # Simple element-wise addition expressed with PyTorch ops

    return out  # Return the result back to the host

# Create some sample data
x = torch.randn(10, 10, device="cuda")
y = torch.randn(10, 10, device="cuda")

# Run the kernel
result = example_add(x, y)

# Verify result
expected = x + y
torch.testing.assert_close(result, expected)
print("✅ Results Match ✅")
benchmark_kernel(example_add, x, y)
compare_implementations(example_add, torch.add, x, y)
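
To build intuition for what a loop like `hl.tile([m, n])` iterates over, here is a pure-Python sketch of splitting a 2-D index space into blocks. It is only an illustration and not Helion's implementation: the helper names `tile_ranges` and `tiles_2d` are hypothetical, and real tiles are processed in parallel on the GPU rather than one after another.

```python
def tile_ranges(size, block):
    """Yield (start, end) index pairs covering range(size) in chunks of `block`."""
    for start in range(0, size, block):
        # The final tile is clipped so it never runs past the end.
        yield start, min(start + block, size)


def tiles_2d(m, n, block_m, block_n):
    """Enumerate every (row_range, col_range) tile of an m x n grid."""
    return [
        (rows, cols)
        for rows in tile_ranges(m, block_m)
        for cols in tile_ranges(n, block_n)
    ]


# A 10 x 10 input with 4 x 4 blocks needs 3 x 3 = 9 tiles; edge tiles are smaller.
tiles = tiles_2d(10, 10, 4, 4)
print(len(tiles))   # 9
print(tiles[-1])    # ((8, 10), (8, 10))
```

With the `[128, 128]` block sizes used above, the 10 x 10 input fits in a single tile, so `tiles_2d(10, 10, 128, 128)` returns one entry.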


## Autotuning in Helion

In the previous example, we explicitly specified a configuration using `config=helion.Config(block_sizes=[128, 128])`. This bypasses Helion's autotuning mechanism and uses our predefined settings. While this is quick to run, manually choosing optimal parameters can be challenging and hardware-dependent.

### What is Autotuning?

Autotuning is Helion's process of automatically finding the best configuration parameters for your specific:
- Hardware (GPU model)
- Problem size
- Operation patterns

When you omit the `config` parameter, Helion searches for a well-performing configuration automatically:

```python
@helion.kernel()  # No config = automatic tuning
def autotuned_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
   m, n = x.size()
   out = torch.empty_like(x)
   for tile_m, tile_n in hl.tile([m, n]):
       out[tile_m, tile_n] = x[tile_m, tile_n] + y[tile_m, tile_n]
   return out
```

Feel free to run the code above to see how it performs compared with the original. Be warned that autotuning might take some time 😃
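
Conceptually, autotuning is a search: try several candidate configurations, time each one, and keep the fastest. The sketch below shows that idea in plain Python on the CPU. The functions `blocked_sum` and `autotune` are hypothetical stand-ins, and the exhaustive loop is an assumption made for illustration; it is not how Helion's autotuner is implemented.

```python
import time


def blocked_sum(data, block):
    """Sum `data` block by block; stands in for a kernel with a tunable block size."""
    total = 0.0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total


def autotune(fn, data, candidates, repeats=3):
    """Time each candidate block size and return (best_block, timings)."""
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn(data, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    return min(timings, key=timings.get), timings


data = [1.0] * 100_000
candidates = [1, 32, 1024]
best_block, timings = autotune(blocked_sum, data, candidates)
print(f"fastest block size on this machine: {best_block}")
```

The winner depends on the machine and the input size, which is the same reason a tuned Helion configuration is specific to a GPU model and problem shape.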

Now let's move on to our puzzles!

## Puzzle 1: Constant Add

Add a constant to a vector.

In [None]:
def add_spec(x: Tensor) -> Tensor:
    """This is the spec that you should implement."""
    return x + 10.

# ---- ✨ Is this the best block size? ----
@helion.kernel(config = helion.Config(block_sizes = [1,]))
def add_kernel(x: torch.Tensor) -> torch.Tensor:
    # ---- ✨ Your Code Here ✨----
    # Set up the output buffer which you will return

    # Use Helion to tile the computation
    for tile_n in hl.tile(TILE_RANGE):
         # ---- ✨ Your Code Here ✨----
         # Get the tile from x and add 10

    return out

# Test the kernel
x = torch.randn(8192, device="cuda")
test_kernel(add_kernel, add_spec, x)
benchmark_kernel(add_kernel, x)
compare_implementations(add_kernel, add_spec, x)

## Puzzle 2: Outer Vector Add

Add two vectors using an outer product pattern.

In [None]:
def broadcast_add_spec(x: Tensor, y: Tensor) -> Tensor:
    return x[None, :] + y[:, None]

# ---- ✨ What should the block sizes be? ----
@helion.kernel(config = helion.Config(block_sizes = []))
def broadcast_add_kernel(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Get tensor sizes
    # ---- ✨ Your Code Here ✨----

    return out

# Test the kernel
x = torch.randn(1142, device="cuda")
y = torch.randn(512, device="cuda")
test_kernel(broadcast_add_kernel, broadcast_add_spec, x, y)
benchmark_kernel(broadcast_add_kernel, x, y)
compare_implementations(broadcast_add_kernel, broadcast_add_spec, x, y)
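
If the broadcasting in the spec is hard to picture, this pure-Python sketch spells out which inputs feed each output element. It only restates the spec on small lists (the helper `outer_add` is a hypothetical name) and is not a Helion solution to the puzzle.

```python
def outer_add(x, y):
    """Pure-Python equivalent of x[None, :] + y[:, None] for 1-D lists."""
    # Row i comes from y[i], column j from x[j].
    return [[xj + yi for xj in x] for yi in y]


x = [1.0, 2.0, 3.0]   # length 3 -> 3 columns
y = [10.0, 20.0]      # length 2 -> 2 rows
out = outer_add(x, y)
print(out)  # [[11.0, 12.0, 13.0], [21.0, 22.0, 23.0]]
```

The output has `len(y)` rows and `len(x)` columns, so the test inputs above produce a 512 x 1142 result.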

## Puzzle 3: Fused Outer Multiplication

Multiply a row vector by a column vector and apply a ReLU to the result.

In [None]:
def mul_relu_block_spec(x: Tensor, y: Tensor) -> Tensor:
    return torch.relu(x[None, :] * y[:, None])


# ---- ✨ Is this the best block size? ----
@helion.kernel(config = helion.Config(block_sizes = [32, 32]))
def mul_relu_block_kernel(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # ---- ✨ Your Code Here ✨----

    return out

# Test the kernel
x = torch.randn(512, device="cuda")
y = torch.randn(512, device="cuda")
test_kernel(mul_relu_block_kernel, mul_relu_block_spec, x, y)
compare_implementations(mul_relu_block_kernel, mul_relu_block_spec, x, y)
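
"Fused" here means the multiply and the ReLU happen in one pass, without storing the intermediate product. The plain-Python sketch below contrasts the two approaches on small lists; `unfused` and `fused` are hypothetical helper names, and the sketch illustrates the idea rather than any GPU behavior.

```python
def unfused(x, y):
    """Two passes: build the full product, then apply ReLU to it."""
    product = [[xj * yi for xj in x] for yi in y]  # intermediate buffer
    return [[max(v, 0.0) for v in row] for row in product]


def fused(x, y):
    """One pass: multiply and clamp each element before storing it."""
    return [[max(xj * yi, 0.0) for xj in x] for yi in y]


x = [1.0, -2.0, 3.0]
y = [2.0, -1.0]
print(fused(x, y))  # [[2.0, 0.0, 6.0], [0.0, 2.0, 0.0]]
```

Both functions return the same values; the fused version simply never materializes `product`.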

## Puzzle 4: Long Sum

Sum each row of a batch of numbers.

In [None]:
def sum_spec(x: Tensor) -> Tensor:
    """Sum along dim 1: shape (4, 200) -> shape (4,)."""
    return x.sum(1)

# ---- ✨ Your Code Here ✨----
@helion.kernel()
def sum_kernel(x: torch.Tensor) -> torch.Tensor:
    # Get tensor sizes
    # ---- ✨ Your Code Here ✨----

    return out

# Test the kernel
x = torch.randn(4, 200, device="cuda")
test_kernel(sum_kernel, sum_spec, x)
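
One way to think about a long reduction is to walk the reduced dimension one block at a time while carrying a running total. The pure-Python sketch below shows that pattern on nested lists. The helper `blocked_row_sum` and the block size of 64 are assumptions chosen for illustration; this is not the Helion solution.

```python
def blocked_row_sum(x, block):
    """Sum each row of a 2-D list by accumulating one column block at a time."""
    out = []
    for row in x:
        acc = 0.0  # running total carried across blocks
        for start in range(0, len(row), block):
            # Slicing clips the last block, so a width of 200 works with block=64.
            acc += sum(row[start:start + block])
        out.append(acc)
    return out


x = [[float(i + j) for j in range(200)] for i in range(4)]
print(blocked_row_sum(x, 64))  # [19900.0, 20100.0, 20300.0, 20500.0]
```

Note that 200 is not a multiple of 64, so the last block covers only 8 columns. A kernel has to handle that partial block correctly.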

## Autotuning a Matrix Multiplication

One of Helion's major advantages is its autotuner. Let's see how it handles a matrix multiplication kernel:

In [None]:
import torch
import helion
import helion.language as hl
import time

# Define a matrix multiplication kernel
@helion.kernel()  # No config means autotuning will be used
def matmul_autotune(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k, n = y.size()
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)

    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = acc + torch.matmul(x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc

    return out

# Create larger tensors for better autotuning results
x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")

# First run will trigger autotuning
print("Running with autotuning (this might take a while)...")
start = time.time()
result = matmul_autotune(x, y)
torch.cuda.synchronize()  # Kernel launches are asynchronous; wait before stopping the timer
end = time.time()
print(f"First run time (including autotuning): {end - start:.2f}s")

# Second run will use the tuned configuration
start = time.time()
result = matmul_autotune(x, y)
torch.cuda.synchronize()  # Wait for the kernel to finish
end = time.time()
print(f"Second run time (using tuned config): {end - start:.2f}s")

# Verify correctness
expected = x @ y
print(f"Result is correct: {torch.allclose(result, expected, rtol=1e-2, atol=1e-2)}")
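
The loop structure of the kernel above (an outer loop over output tiles and an inner loop over blocks of `k` that feeds an accumulator) can be mirrored in plain Python. This sketch uses nested lists and a hypothetical helper `tiled_matmul`; it runs sequentially on the CPU and is meant only to show the accumulation pattern, not how Helion or Triton execute it.

```python
def tiled_matmul(x, y, bm, bn, bk):
    """Multiply list-of-lists matrices tile by tile, accumulating over K blocks."""
    m, k, n = len(x), len(y), len(y[0])
    out = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, bm):
        for n0 in range(0, n, bn):
            rows = range(m0, min(m0 + bm, m))
            cols = range(n0, min(n0 + bn, n))
            # One accumulator per output tile, like `acc` in the kernel.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for k0 in range(0, k, bk):
                for i in rows:
                    for j in cols:
                        for p in range(k0, min(k0 + bk, k)):
                            acc[i, j] += x[i][p] * y[p][j]
            for (i, j), v in acc.items():
                out[i][j] = v  # written once, after the K loop finishes
    return out


x_small = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # 2 x 3
y_small = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 x 2
print(tiled_matmul(x_small, y_small, bm=1, bn=2, bk=2))  # [[4.0, 5.0], [10.0, 11.0]]
```

The block sizes `bm`, `bn`, and `bk` change how the work is split but not the result, which is what lets an autotuner vary them freely.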

## Hardcoding Configurations

After autotuning, you might want to hardcode the best configuration:

In [None]:
# Example of hardcoding a configuration after autotuning
@helion.kernel(config=helion.Config(
    block_sizes=[[64, 128], [16]],
    loop_orders=[[1, 0]],
    num_warps=4,
    num_stages=3,
    indexing='block_ptr',
    l2_grouping=32
))
def matmul_fixed_config(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k, n = y.size()
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)

    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = acc + torch.matmul(x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc

    return out

# Run with fixed configuration (no autotuning)
start = time.time()
result = matmul_fixed_config(x, y)
torch.cuda.synchronize()  # Kernel launches are asynchronous; wait before stopping the timer
end = time.time()
print(f"Run time with fixed config: {end - start:.2f}s")

# Verify correctness
expected = x @ y
print(f"Result is correct: {torch.allclose(result, expected, rtol=1e-2, atol=1e-2)}")

## Conclusion

In this notebook, we've explored how to use Helion to write efficient GPU kernels using a high-level, PyTorch-like syntax. The key advantages of Helion include:

1. **Higher-level abstraction** than raw Triton, making it easier to write correct kernels
2. **Automatic tiling and memory management**, eliminating a common source of bugs
3. **Powerful autotuning** that can explore a wide range of implementations automatically
4. **Familiar PyTorch syntax** that builds on existing knowledge

These puzzles should give you a good foundation for writing your own Helion kernels for a variety of applications.