`torch.compile` was introduced in PyTorch 2.0 and is intended to replace TorchScript (`torch.jit`).

## Overview
Code compiled with `torch.compile` goes through two stages, TorchDynamo and TorchInductor:

1. TorchDynamo
Hooks into Python frame evaluation, analyzes the Python bytecode, and extracts the PyTorch operations into an "FX Graph".

2. TorchInductor
Converts the FX Graph into efficient code, applying operator fusion where possible.

The higher-level FX Graph operators (e.g., `aten.add`) are lowered into a loop-level IR, where operator fusion is performed: compatible operations (e.g., Add + ReLU) are merged into a single loop.

Inductor then decides how to map these loops onto the hardware (e.g., choosing a tile size and indexing by `tl.program_id`).
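The fusion and tiling steps described above can be illustrated without any PyTorch machinery. The following is a conceptual pure-Python sketch, not Inductor's actual IR or generated code: the function names and the `block_size` parameter are made up for illustration, and the `pid` loop merely emulates what `tl.program_id` provides on a GPU.

```python
def add_then_relu_unfused(x, y):
    # Two separate loops: the intermediate `tmp` is fully materialized in memory.
    tmp = [a + b for a, b in zip(x, y)]
    return [max(t, 0.0) for t in tmp]


def add_relu_fused(x, y):
    # One loop: each element is added and clamped before moving to the next one,
    # so no intermediate buffer is needed.
    return [max(a + b, 0.0) for a, b in zip(x, y)]


def add_relu_fused_tiled(x, y, block_size=4):
    # Emulates a GPU launch: each "program" handles one tile of `block_size` elements.
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(num_programs):  # plays the role of tl.program_id(0)
        for lane in range(block_size):
            idx = pid * block_size + lane
            if idx < n:  # mask that guards the ragged last tile
                out[idx] = max(x[idx] + y[idx], 0.0)
    return out


x = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0]
y = [0.5, 0.5, -5.0, 6.0, -5.0, 1.0]
print(add_relu_fused_tiled(x, y))  # [1.5, 0.0, 0.0, 2.0, 0.0, 0.0]
```

All three variants compute the same result; they differ only in how many passes over memory are made and in how the index space is partitioned.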

Finally, in codegen, these loops are processed with a Triton template engine to generate Python `@triton.jit` code (saved at `/tmp/torchinductor_xxx`).
On GPUs, TorchInductor uses Triton as the backend, which further optimizes the generated code.
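To look at the first stage in isolation, `torch.compile` accepts a custom `backend` callable that receives the FX graph captured by TorchDynamo. The sketch below uses a hypothetical `inspect_backend` that prints the graph and then runs it unoptimized, so Inductor is skipped entirely. Small CPU tensors are assumed here so that it runs without a GPU.

```python
import torch
import torch.fx

captured_graphs = []  # hypothetical helper list, used only for this illustration


def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    """Custom torch.compile backend: record the FX graph, then run it as-is."""
    captured_graphs.append(gm)
    print(gm.graph)  # textual dump of the FX graph nodes
    return gm.forward  # no optimization; just execute the captured graph


def fn(x, y):
    z = torch.matmul(x, y)
    return torch.nn.functional.softmax(z, dim=1)


a = torch.rand(8, 8)
b = torch.rand(8, 8)

compiled_fn = torch.compile(fn, backend=inspect_backend)
out = compiled_fn(a, b)  # the first call triggers tracing and invokes the backend
```

Since `fn` contains no control flow, one would expect the backend to be called once, with a single graph containing the matmul and softmax calls.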

## Dissecting an Example
Create a toy example `torch_compile_builtin_fusion.py`:

In [35]:
import torch

a = torch.rand((100, 100), device='cuda')
b = torch.rand((100, 100), device='cuda')

def fn(x, y):
    z = torch.matmul(x, y)
    return torch.nn.functional.softmax(z, dim=1)

compiled_fn = torch.compile(fn)
print(compiled_fn(a, b))

tensor([[0.0060, 0.0038, 0.0091,  ..., 0.0089, 0.0004, 0.0031],
        [0.0034, 0.0020, 0.0045,  ..., 0.0048, 0.0013, 0.0002],
        [0.0031, 0.0067, 0.0035,  ..., 0.0387, 0.0001, 0.0006],
        ...,
        [0.0035, 0.0028, 0.0057,  ..., 0.0206, 0.0006, 0.0003],
        [0.0050, 0.0026, 0.0029,  ..., 0.0086, 0.0005, 0.0034],
        [0.0013, 0.0067, 0.0165,  ..., 0.0294, 0.0002, 0.0005]],
       device='cuda:0')


We can print the outcome of both stages: `graph_code` is the FX Graph rendered as code, and `output_code` is the Triton code generated by Inductor.

In [36]:
!TORCH_LOGS="graph_code,output_code" python torch_compile_builtin_fusion.py

V0113 15:50:31.765000 550912 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code] TRACED GRAPH
V0113 15:50:31.765000 550912 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code]  ===== __compiled_fn_1_ce4531cc_6d02_43c8_8060_d25081027612 =====
V0113 15:50:31.765000 550912 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code]  /home/tk/Desktop/jupyter/simp-intelligence/.pixi/envs/default/lib/python3.13/site-packages/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0113 15:50:31.765000 550912 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code]     def forward(self, L_y_: "f32[100, 100][100, 1]cuda:0", L_x_: "f32[100, 100][100, 1]cuda:0"):
V0113 15:50:31.765000 550912 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code]         l_y_ = L_y_
V0113 15:50:31.765000 550912 site-packages/torch/_dynamo/

In the full output, the generated Triton code uses a single fused kernel, `triton_per_fused__softmax_0`, for the softmax.

We can also pass more specific flags to print the intermediate representation before and after fusion takes place:

In [37]:
!TORCH_LOGS="ir_pre_fusion,ir_post_fusion" TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python torch_compile_builtin_fusion.py

  warn_once(
W0113 15:50:38.425000 550945 site-packages/torch/_inductor/utils.py:1436] [0/0] Not enough SMs to use max_autotune_gemm mode
I0113 15:50:39.608000 550945 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] BEFORE FUSION
I0113 15:50:39.608000 550945 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0: ExternKernelSchedulerNode(ExternKernelOut)
I0113 15:50:39.608000 550945 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0.writes = [StarDep(name='buf0', mode=None)]
I0113 15:50:39.608000 550945 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0.unmet_dependencies = []
I0113 15:50:39.608000 550945 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0.met_dependencies = [StarDep(name='arg0_1', mode=None), StarDep(name='arg1_1', mode=None)]
I0113 15:50:39.608000 550945 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0.outputs = [
I0113 15:50:39.608000 550945 site

Let's write a more customized toy example, `torch_compile_custom_toy.py`, with a data-dependent branch:

In [38]:
import torch

@torch.compile
def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:
        b = b * -1
    return x * b

for _ in range(10):
    res = toy_example(torch.randn(10), torch.randn(10))
    print(res)

tensor([ 1.1075, -0.3664,  0.2113, -0.5729,  0.0149,  0.3904, -0.3129,  1.0964,
         0.1867, -0.4614])
tensor([ 0.0096, -0.2170,  0.3665, -0.0233,  0.2757,  0.0350,  0.0531,  0.3920,
         0.0509,  0.6125])
tensor([ 0.3135, -0.4333, -0.2436, -0.5342, -0.0245, -0.4712,  0.0111,  0.1872,
        -0.1006, -0.1696])
tensor([-0.3448, -0.0287, -1.2579, -0.4756, -0.3052, -0.1545, -1.0011, -0.0348,
        -0.2332, -0.2360])
tensor([-0.3706, -0.1271, -0.1108, -0.4871,  0.1010, -0.3885,  0.1082,  0.0667,
         0.0929,  0.3780])
tensor([-0.5566,  0.0852,  0.6850, -0.1277,  0.1957, -0.1823, -0.5636, -0.2572,
        -0.4672,  0.0193])
tensor([ 0.0365,  0.0038,  0.2167,  0.2694, -0.8846,  0.0393, -0.0992, -0.1545,
         0.1965,  0.2731])
tensor([ 0.1262,  0.6381, -0.5660, -0.0135, -0.0962,  0.1142,  0.0935, -0.0147,
        -0.1915,  0.7921])
tensor([ 0.4005,  0.8971, -0.2554, -0.0670,  0.6740, -1.2135, -0.3693, -2.1402,
         0.0676,  0.1542])
tensor([ 0.1144,  0.0662,  0.8342,  0

Now, the `TORCHINDUCTOR_FORCE_DISABLE_CACHES` flag forces PyTorch to recompile on every run, and setting `TORCH_COMPILE_DEBUG` generates a more comprehensive set of intermediate output files under `./torch_compile_debug`:

In [39]:
!rm -rf ./torch_compile_debug
!TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python torch_compile_custom_toy.py
!ls ./torch_compile_debug/run_*/torchinductor

  warn_once(
W0113 15:50:50.544000 551025 site-packages/torch/_inductor/debug.py:449] [0/0] model__0_inference_0 debug trace: /home/tk/Desktop/jupyter/simp-intelligence/simp_intelligence/torch/torch_compile_debug/run_2026_01_13_15_50_45_140088-pid_551025/torchinductor/model__0_inference_0.0
W0113 15:50:51.256000 551025 site-packages/torch/_inductor/debug.py:449] [1/0] model__1_inference_1 debug trace: /home/tk/Desktop/jupyter/simp-intelligence/simp_intelligence/torch/torch_compile_debug/run_2026_01_13_15_50_45_140088-pid_551025/torchinductor/model__1_inference_1.1
tensor([ 0.0489,  0.1555,  0.0107, -0.2745,  0.0188,  0.5434,  0.0668,  0.0561,
         0.4224, -0.2818])
W0113 15:50:51.845000 551025 site-packages/torch/_inductor/debug.py:449] [2/0] model__2_inference_2 debug trace: /home/tk/Desktop/jupyter/simp-intelligence/simp_intelligence/torch/torch_compile_debug/run_2026_01_13_15_50_45_140088-pid_551025/torchinductor/model__2_inference_2.2
tensor([-0.0043, -0.1765, -0.1555, -0.1065,

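In the `model__X_inference_X.X` and `[X/0] [__ir_post_fusion]` names, $X = 0, 1, 2$ identifies the separately compiled graphs, which correspond to the *basic blocks* of our toy code's control flow: the data-dependent `if b.sum() < 0` causes a graph break, so the code before the branch and each side of the branch are compiled on their own.
To look up the original code positions for each FX graph, take a look at, e.g., `model__2_inference_2.2/fx_graph_readable.py`.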
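The split into several graphs can also be observed directly from Python. The following is a minimal sketch, assuming a hypothetical `counting_backend` that only records each FX graph handed over by TorchDynamo and runs it unoptimized (so no Inductor output is produced). The inputs are chosen deterministically so that both sides of the branch are exercised.

```python
import torch
import torch.fx

graphs = []  # hypothetical helper list, used only for this illustration


def counting_backend(gm: torch.fx.GraphModule, example_inputs):
    # Record every graph TorchDynamo captures, then execute it unchanged.
    graphs.append(gm)
    return gm.forward


@torch.compile(backend=counting_backend)
def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:  # data-dependent condition: cannot live inside one graph
        b = b * -1
    return x * b


a = torch.randn(10)
pos = toy_example(a, torch.ones(10))   # b.sum() >= 0: branch not taken
neg = toy_example(a, -torch.ones(10))  # b.sum() < 0: branch taken

print(f"captured {len(graphs)} graphs")
for i, gm in enumerate(graphs):
    print(f"--- graph {i} ---")
    print(gm.graph)
```

With both branches visited, this should capture three graphs, in line with the three `model__X_inference_X` directories above: one for the code up to the condition, and one for each continuation after it.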

## CUDA Graph Support
One can also enable CUDA Graphs through the `torch.compile` "reduce-overhead" mode:
```python
optimized_model = torch.compile(model, mode="reduce-overhead")
```
However, according to the documentation [1], the overhead reduction is only guaranteed for CUDA-only graphs that do not mutate their inputs, so PyTorch's support for this is limited.

## Custom Operators
In PyTorch 2.4 or later, `torch.compile` supports custom operators as *opaque callables* [2] (such as C++ kernels or Modular Mojo kernels); however, `torch.compile` cannot trace into a custom operator.

Note that code that would otherwise be a “graph-breaker” must be wrapped as a PyTorch custom operator (e.g., with `torch.library.custom_op`). If it mutates any input Tensors, their names must be specified (via `mutates_args`).
And if the operator returns anything, a “FakeTensor kernel” (aka “meta kernel”) must be registered for the custom operator.

## References
[1] https://docs.pytorch.org/docs/stable/generated/torch.compile.html

[2] https://docs.pytorch.org/tutorials/advanced/python_custom_ops.html