How to inspect artifacts generated by TorchDynamo?
--------------------------------------------------

To inspect the artifacts generated by TorchDynamo, there is an API ``torch._dynamo.eval_frame._debug_get_cache_entry_list`` that retrieves compiled code and guards out of a function's ``__code__`` object. A compiled function can have several cache entries, and each cache entry consists of a generated function to check guards, and a ``types.CodeType`` object to keep the code to be executed if the guarding conditions are satisfied.

.. code-block:: python

    from torch._dynamo.eval_frame import _debug_get_cache_entry_list
    cache_entries = _debug_get_cache_entry_list(toy_example._torchdynamo_orig_callable)
    cache_entry = cache_entries[0]
    guard, code = cache_entry.check_fn, cache_entry.code
    # the guard takes the local variables of an input frame, and tells whether a re-compilation should be triggered.
    import dis
    dis.dis(guard)
    dis.dis(code)

If you know Python bytecode, you can understand the above output. There is also a tool ``depyf`` to convert the bytecode into human-readable source code. If you don't have ``depyf`` already installed, run ``pip install depyf`` before running the code below.

.. code-block:: python

    from depyf import decompile
    print("guard code:")
    print(decompile(guard))
    print("compiled code:")
    print(decompile(code))

The output is:

::

    guard code:
    def guard(L):
        if not getattr(___guarded_code, 'valid'):
            return False
        _var0 = L['a']
        if not hasattr(_var0, '_dynamo_dynamic_indices') == False:
            return False
        _var1 = L['b']
        if not hasattr(_var1, '_dynamo_dynamic_indices') == False:
            return False
        if not ___is_grad_enabled():
            return False
        if ___are_deterministic_algorithms_enabled():
            return False
        if not ___is_torch_function_enabled():
            return False
        if not getattr(utils_device, 'CURRENT_DEVICE') == None:
            return False
        if not ___check_tensors(_var0, _var1, tensor_check_names=tensor_check_names):
            return False
        return True

    compiled code:
    def toy_example(a, b):
        __temp_1 = __compiled_fn_0(a, b)
        x = __temp_1[0]
        if __temp_1[1]:
            return __resume_at_30_1(b, x)
        return __resume_at_38_2(b, x)

Some names referenced in the code are:

- Compiled functions, stored in the global namespace of the module containing the original function ``toy_example``. These include names like ``__compiled_fn_0`` / ``__resume_at_30_1`` / ``__resume_at_38_2``.

- Closure variables used for checking guards. The names can be accessed from ``guard.__code__.co_freevars``, and the values are stored in ``guard.__closure__``. These include names like ``___guarded_code`` / ``___is_grad_enabled`` / ``___are_deterministic_algorithms_enabled`` / ``___is_torch_function_enabled`` / ``utils_device`` / ``___check_tensors`` / ``tensor_check_names``.

- Argument ``L`` of the ``guard`` function. This is a dict mapping the names of the arguments of ``toy_example`` to their values. It is only available when the function is called, which is where the frame evaluation API comes into play. In short, ``L`` is a ``dict`` with the structure ``{'a': value_a, 'b': value_b}``, which is why the code uses ``L['a']`` to refer to the input variable ``a`` (see the sketch after this list).
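
Putting the last two items together, we can inspect the guard's closure and even invoke the guard by hand. The following is a minimal sketch for illustration only, assuming ``guard``, ``a``, and ``b`` come from the steps above; calling guards manually is not part of any documented API.

.. code-block:: python

    # Names the guard closes over (___guarded_code, ___check_tensors, ...)
    print(guard.__code__.co_freevars)
    # The corresponding cell objects holding the guarded values
    print(guard.__closure__)

    # Build the frame-locals dict by hand and ask the guard whether the
    # cached compiled code is still valid for these inputs.
    L = {'a': a, 'b': b}
    print(guard(L))  # True -> reuse cached code; False -> re-compile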

The graph break is visible in the code of the compiled ``toy_example``, where we have to use the Python interpreter to select which of the following graphs to execute.
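
For reference, this is the original ``toy_example`` from earlier in this document (reconstructed here to match the decompiled output above); the data-dependent ``if`` on a tensor value is exactly what causes the graph break:

.. code-block:: python

    def toy_example(a, b):
        x = a / (torch.abs(a) + 1)
        # A branch on a tensor value cannot be captured in a single graph,
        # so TorchDynamo splits the function at this point.
        if b.sum() < 0:
            b = b * -1
        return x * b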

Note that we pass a simple ``my_compiler`` function as the backend compiler, so the subgraph code ``__resume_at_38_2``, ``__resume_at_30_1``, and ``__compiled_fn_0`` remains Python code. This can also be inspected (ignore the function names; only the function signatures and bodies matter):

.. code-block:: python

    print("source code of __compiled_fn_0:")
    print(__compiled_fn_0._torchdynamo_orig_callable.__self__)
    print("=" * 60)
    print("source code of __resume_at_30_1:")
    print(decompile(__resume_at_30_1))
    print("=" * 60)
    print("source code of __resume_at_38_2:")
    print(decompile(__resume_at_38_2))

::

    source code of __compiled_fn_0:
    GraphModule()

    def forward(self, L_a_ : torch.Tensor, L_b_ : torch.Tensor):
        l_a_ = L_a_
        l_b_ = L_b_
        abs_1 = torch.abs(l_a_)
        add = abs_1 + 1; abs_1 = None
        truediv = l_a_ / add; l_a_ = add = None
        sum_1 = l_b_.sum(); l_b_ = None
        lt = sum_1 < 0; sum_1 = None
        return (truediv, lt)

    # To see more debug info, please use ``graph_module.print_readable()``
    ============================================================
    source code of __resume_at_30_1:
    def <resume in toy_example>(b, x):
        b = b * -1
        return x * b

    ============================================================
    source code of __resume_at_38_2:
    def <resume in toy_example>(b, x):
        return x * b

However, if we use other backends like the built-in ``inductor``, the subgraph code will be compiled into CUDA kernels for GPU or C++ code for CPU.
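
As a minimal sketch (assuming the same ``toy_example`` used throughout this document), compiling with the built-in backend only requires changing the backend name; ``"inductor"`` is also the default when no backend is given:

.. code-block:: python

    import torch

    def toy_example(a, b):
        x = a / (torch.abs(a) + 1)
        if b.sum() < 0:
            b = b * -1
        return x * b

    # With inductor, the extracted subgraphs are lowered to generated
    # C++ (CPU) or Triton (GPU) kernels instead of remaining Python code.
    opt_toy_example = torch.compile(toy_example, backend="inductor")
    opt_toy_example(torch.randn(10), torch.randn(10))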

To summarize, the compiled code is conceptually equivalent to the code below:

.. code-block:: python

    def compiled_example(a, b):
        L = {'a': a, 'b': b}
        for guard, code in get_cache_entries():
            if guard(L):
                return code(a, b)
        recompile_and_add_another_cache_entry()

The following diagram demonstrates how ``torch.compile`` transforms and optimizes user-written code: it first extracts computation graphs from the user-written function, compiles these graphs into optimized functions, and then assembles them into a new function that is functionally equivalent to the user-written code but optimized for computation speed.

.. image:: _static/img/dynamo/flowchart.jpg