`torch.compile` was introduced in PyTorch 2.0 and is intended to replace TorchScript (`torch.jit`).

## Overall
Code wrapped in `torch.compile` goes through two stages, TorchDynamo and TorchInductor:

1. TorchDynamo
Captures the Python bytecode of the function and traces it into an "FX Graph".

2. TorchInductor
Converts the FX Graph into efficient code, applying operator fusion where possible.

The higher-level FX Graph operators (e.g., `aten.add`) are lowered into a loop-level IR, where fusion merges compatible operations into a single loop (e.g., Add + ReLU).
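To make the Add + ReLU case concrete, here is a minimal plain-Python sketch of what fusion buys us. It is only an illustration of the idea, not Inductor's actual IR; the function names `add_then_relu_unfused` and `add_relu_fused` are made up for this example.

```python
def add_then_relu_unfused(a, b):
    """Two separate loops: the intermediate `tmp` is fully written, then read back."""
    tmp = [x + y for x, y in zip(a, b)]          # loop 1: Add
    return [t if t > 0 else 0.0 for t in tmp]    # loop 2: ReLU


def add_relu_fused(a, b):
    """One loop: each element is added and clamped before moving to the next."""
    out = []
    for x, y in zip(a, b):
        t = x + y                        # Add
        out.append(t if t > 0 else 0.0)  # ReLU, applied while `t` is still at hand
    return out


a = [1.0, -2.0, 3.0]
b = [0.5, 1.0, -4.0]
assert add_then_relu_unfused(a, b) == add_relu_fused(a, b)
print(add_relu_fused(a, b))  # prints [1.5, 0.0, 0.0]
```

Both versions compute the same result; the fused one never materializes the intermediate buffer, which is the memory traffic that fusion is meant to save.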

Inductor then decides how to map these loops onto the hardware (e.g., choosing a tile size and indexing blocks with `tl.program_id`).
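The tiling idea can be simulated in plain Python. This is a sketch under assumptions, not Triton code: `pid` stands in for what `tl.program_id(0)` would return, `BLOCK` is a hypothetical tile size (Inductor picks its own), and the `for pid in range(grid)` loop stands in for program instances that would run in parallel on a GPU.

```python
BLOCK = 4  # hypothetical tile size, chosen only for this illustration


def program(pid, a, b, out, n):
    """Work done by one program instance; `pid` selects which tile it owns."""
    start = pid * BLOCK
    for i in range(start, start + BLOCK):
        if i < n:  # mask: the last tile may extend past the end of the data
            t = a[i] + b[i]
            out[i] = t if t > 0 else 0.0  # fused Add + ReLU


def launch(a, b):
    """Launch enough program instances to cover all n elements."""
    n = len(a)
    out = [0.0] * n
    grid = (n + BLOCK - 1) // BLOCK  # ceiling division
    for pid in range(grid):          # sequential here, parallel on real hardware
        program(pid, a, b, out, n)
    return out


a = [float(i) for i in range(10)]
b = [-5.0] * 10
print(launch(a, b))
# prints [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
```

With 10 elements and a tile of 4, three instances are launched and the third one only touches two valid elements, which is why the bounds check is needed.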

Finally, in codegen, these loops are processed with a Triton template engine to generate Python `@triton.jit` code (saved at `/tmp/torchinductor_xxx`).
TorchInductor uses Triton as a backend to further optimize the generated code.
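To see the hand-off between the two stages, we can pass `torch.compile` a custom backend in place of Inductor. The sketch below relies on the documented custom-backend interface (a callable that receives the `GraphModule` and example inputs and returns a callable). The names `inspect_backend`, `captured`, and `add_relu` are made up for this example, and returning `gm.forward` simply runs the graph eagerly, so no Inductor stage is involved.

```python
import torch

captured = []  # FX GraphModules that TorchDynamo hands to the backend


def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    """Stand-in for Inductor: record and print the graph, then run it unchanged."""
    captured.append(gm)
    print(gm.code)     # Python source generated from the FX Graph
    return gm.forward  # no optimization, just execute the traced graph


def add_relu(x, y):
    return torch.relu(x + y)


compiled = torch.compile(add_relu, backend=inspect_backend)
x, y = torch.randn(8), torch.randn(8)
out = compiled(x, y)
```

This runs on CPU and needs neither a GPU nor Triton, which makes it a convenient way to look at what Dynamo captured before any lowering happens.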

## Dissecting an Example
Create a toy example `torch_compile_builtin_fusion.py`:

In [20]:
import torch

a = torch.rand((100, 100), device='cuda')
b = torch.rand((100, 100), device='cuda')

def fn(x, y):
    z = torch.matmul(x, y)
    return torch.nn.functional.softmax(z, dim=1)

compiled_fn = torch.compile(fn)
print(compiled_fn(a, b))

tensor([[0.0042, 0.0006, 0.0176,  ..., 0.0003, 0.0017, 0.0257],
        [0.0402, 0.0009, 0.0153,  ..., 0.0012, 0.0012, 0.0075],
        [0.0427, 0.0011, 0.0052,  ..., 0.0008, 0.0016, 0.0079],
        ...,
        [0.0029, 0.0002, 0.0040,  ..., 0.0004, 0.0002, 0.0206],
        [0.0013, 0.0015, 0.0185,  ..., 0.0024, 0.0001, 0.0043],
        [0.0134, 0.0003, 0.0135,  ..., 0.0007, 0.0002, 0.0140]],
       device='cuda:0')


We can print the output of both stages: `graph_code` shows the FX Graph as code, and `output_code` shows the Triton code generated by Inductor.

In [21]:
!TORCH_LOGS="graph_code,output_code" python torch_compile_builtin_fusion.py

V0113 15:02:14.896000 541798 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code] TRACED GRAPH
V0113 15:02:14.896000 541798 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code]  ===== __compiled_fn_1_764aecdc_de0b_44f3_b87f_7dc1542901a0 =====
V0113 15:02:14.896000 541798 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code]  /home/tk/Desktop/jupyter/simp-intelligence/.pixi/envs/default/lib/python3.13/site-packages/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0113 15:02:14.896000 541798 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code]     def forward(self, L_y_: "f32[100, 100][100, 1]cuda:0", L_x_: "f32[100, 100][100, 1]cuda:0"):
V0113 15:02:14.896000 541798 site-packages/torch/_dynamo/output_graph.py:1667] [0/0] [__graph_code]         l_y_ = L_y_
V0113 15:02:14.896000 541798 site-packages/torch/_dynamo/

In the `output_code` log, the generated Triton code uses a fused kernel function, `triton_per_fused__softmax_0`.
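To get a feel for what such a fused softmax kernel has to cover, we can write softmax out as its elementwise and reduction steps. This is a sketch assuming the usual numerically stable formulation; the generated kernel may organize the work differently, and `softmax_steps` is a name made up for this example.

```python
import torch


def softmax_steps(z):
    """Row-wise softmax written as the individual steps a fused kernel would merge."""
    row_max = z.amax(dim=1, keepdim=True)  # reduction: max per row
    shifted = z - row_max                  # pointwise: subtract for numerical stability
    exp = torch.exp(shifted)               # pointwise: exponentiate
    denom = exp.sum(dim=1, keepdim=True)   # reduction: sum per row
    return exp / denom                     # pointwise: normalize


z = torch.rand(100, 100)
assert torch.allclose(softmax_steps(z), torch.softmax(z, dim=1), atol=1e-6)
```

Run eagerly, each of these lines launches its own operation and produces an intermediate tensor; a fused kernel can keep the intermediates for a row on-chip instead.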

However, the example above only calls built-in torch functions, which leaves little room for `torch.compile` to optimize (and many stages are not applied at all).
Let's write a more customized toy script, `torch_compile_custom_toy.py`, so that we can also look deeper into the Inductor stages.

In [26]:
import torch

@torch.compile
def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:
        b = b * -1
    return x * b

for _ in range(10):
    res = toy_example(torch.randn(10), torch.randn(10))
    print(res)

tensor([ 0.1044, -0.0396, -0.1876, -0.8878,  0.3451,  0.6469, -0.2108, -0.0597,
         0.9823, -0.4830])
tensor([ 0.0235,  0.3957, -0.9331,  0.1595,  0.0335, -0.0365, -0.0452, -0.1091,
         0.3090, -0.0818])
tensor([ 6.4036e-04, -3.9523e-02,  5.7321e-01,  5.8537e-01, -9.4209e-02,
         4.9878e-01,  1.1121e+00, -1.8547e-01,  6.8314e-01,  1.6813e-02])
tensor([ 0.1461,  0.2607, -0.2339, -0.1078, -1.2529, -0.1974, -0.9010, -0.1161,
         0.2254,  0.0440])
tensor([ 0.1684, -0.7341, -0.2091,  0.0287,  0.0245, -0.1058, -0.3187, -0.4545,
        -1.2139,  0.4952])
tensor([ 6.6728e-02, -9.9504e-01, -7.9639e-01, -7.3802e-03,  7.2258e-01,
        -4.4539e-01, -7.6488e-01,  2.3857e-04, -2.2679e-01, -1.9297e-01])
tensor([ 0.2817,  0.7707, -0.3862, -0.1633, -0.7569,  0.0060, -0.0264,  0.2699,
        -0.0318, -0.3182])
tensor([-0.5252, -0.1214, -1.1576,  0.6829, -0.0219,  0.1326, -0.0383,  0.0491,
         0.8157,  0.0070])
tensor([ 0.5871, -0.2340, -0.5601, -0.7857, -0.5513, -0.1814, -0

The `TORCHINDUCTOR_FORCE_DISABLE_CACHES` flag forces PyTorch to recompile every time, and setting `TORCH_COMPILE_DEBUG` generates a more comprehensive set of intermediate output files under `./torch_compile_debug`:

In [24]:
!rm -rf ./torch_compile_debug
!TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python torch_compile_custom_toy.py
!ls ./torch_compile_debug/run_*/torchinductor

  warn_once(
W0113 15:14:27.756000 544086 site-packages/torch/_inductor/debug.py:449] [0/0] model__0_inference_0 debug trace: /home/tk/Desktop/jupyter/simp-intelligence/simp_intelligence/torch/torch_compile_debug/run_2026_01_13_15_14_22_348694-pid_544086/torchinductor/model__0_inference_0.0
W0113 15:14:28.478000 544086 site-packages/torch/_inductor/debug.py:449] [1/0] model__1_inference_1 debug trace: /home/tk/Desktop/jupyter/simp-intelligence/simp_intelligence/torch/torch_compile_debug/run_2026_01_13_15_14_22_348694-pid_544086/torchinductor/model__1_inference_1.1
tensor([-0.5091,  0.1598, -0.3580,  0.1102, -0.4926,  0.3581,  0.0663, -0.0721,
        -0.1218, -0.2372])
W0113 15:14:29.068000 544086 site-packages/torch/_inductor/debug.py:449] [2/0] model__2_inference_2 debug trace: /home/tk/Desktop/jupyter/simp-intelligence/simp_intelligence/torch/torch_compile_debug/run_2026_01_13_15_14_22_348694-pid_544086/torchinductor/model__2_inference_2.2
tensor([-0.0385,  0.0960, -0.0088, -1.0253,
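If we prefer to browse the debug output from Python rather than with `ls`, a small helper can walk the directory. This is a sketch that assumes the `run_*/torchinductor/<graph name>/` layout seen in the log paths above; `list_debug_artifacts` is a hypothetical helper, not part of PyTorch.

```python
from pathlib import Path


def list_debug_artifacts(root="./torch_compile_debug"):
    """Map each compiled-graph directory name to the sorted file names inside it."""
    artifacts = {}
    for graph_dir in sorted(Path(root).glob("run_*/torchinductor/*")):
        if graph_dir.is_dir():
            artifacts[graph_dir.name] = sorted(p.name for p in graph_dir.iterdir())
    return artifacts


for name, files in list_debug_artifacts().items():
    print(name, files)
```

If the directory does not exist yet, the helper returns an empty dictionary and nothing is printed.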

We can also pass more specific flags to print the IR before and after fusion takes place:

In [25]:
!TORCH_LOGS="ir_pre_fusion,ir_post_fusion" TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python torch_compile_custom_toy.py

  warn_once(
I0113 15:17:07.579000 545412 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] BEFORE FUSION
I0113 15:17:07.579000 545412 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0: SchedulerNode(ComputedBuffer)
I0113 15:17:07.579000 545412 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0.writes = [MemoryDep('buf0', 0, {})]
I0113 15:17:07.579000 545412 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0.unmet_dependencies = []
I0113 15:17:07.579000 545412 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0.met_dependencies = [MemoryDep('arg1_1', c0, {c0: 10})]
I0113 15:17:07.579000 545412 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion] op0.outputs = [
I0113 15:17:07.579000 545412 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fusion]     buf0: ComputedBuffer
I0113 15:17:07.579000 545412 site-packages/torch/_inductor/debug.py:672] [0/0] [__ir_pre_fu

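All the `model__X_inference_X.X` and `[X/0] [__ir_post_fusion]` names ($X=0,1,2$) correspond to the *basic blocks* of our toy code's control flow: the data-dependent `if b.sum() < 0` forces a graph break, so the code before the branch and each of the two paths after it are compiled as separate graphs.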
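One way to check this split without reading the logs is to count how many graphs TorchDynamo hands to the backend. The sketch below swaps Inductor for a hypothetical `counting_backend` that only records each graph and runs it eagerly, and it feeds inputs chosen so that both sides of the branch are exercised. If tracing splits the function as described, we should see three recorded graphs, though this has not been pinned to a specific PyTorch version.

```python
import torch

graphs = []  # one entry per graph TorchDynamo compiles


def counting_backend(gm: torch.fx.GraphModule, example_inputs):
    """Record the graph and run it as-is (no Inductor involved)."""
    graphs.append(gm)
    return gm.forward


@torch.compile(backend=counting_backend)
def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:  # data-dependent branch
        b = b * -1
    return x * b


a = torch.randn(10)
pos = toy_example(a, torch.ones(10))   # b.sum() > 0: the branch body is skipped
neg = toy_example(a, -torch.ones(10))  # b.sum() < 0: the branch body runs

print(len(graphs))
for i, gm in enumerate(graphs):
    print(f"--- graph {i} ---")
    print(gm.code)
```

Printing `gm.code` for each entry shows which part of `toy_example` ended up in which graph, which can then be matched against the `model__X_inference_X.X` directories.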