How to inspect artifacts generated by TorchDynamo?
--------------------------------------------------

To inspect the artifacts generated by TorchDynamo, there is an API ``torch._dynamo.eval_frame._debug_get_cache_entry_list`` that retrieves compiled code and guards out of a function's ``__code__`` object. A compiled function can have several cache entries, and each cache entry consists of a generated function to check guards, and a ``types.CodeType`` object to keep the code to be executed if the guarding conditions are satisfied.

.. code-block:: python

    from torch._dynamo.eval_frame import _debug_get_cache_entry_list
    cache_entries = _debug_get_cache_entry_list(toy_example._torchdynamo_orig_callable)
    cache_entry = cache_entries[0]
    guard, code = cache_entry.check_fn, cache_entry.code
    # the guard takes the local variables of an input frame, and tells whether a re-compilation should be triggered.
    import dis
    dis.dis(guard)
    dis.dis(code)

If you know Python bytecode, you can understand the above output. There is also a tool ``depyf`` to convert the bytecode into human-readable source code. If you don't have ``depyf`` already installed, run ``pip install depyf`` before running the code below.

.. code-block:: python

    from depyf import decompile
    print("guard code:")
    print(decompile(guard))
    print("compiled code:")
    print(decompile(code))

The output is:

::

    guard code:
    def guard(L):
        if not getattr(___guarded_code, 'valid'):
            return False
        _var0 = L['a']
        if not hasattr(_var0, '_dynamo_dynamic_indices') == False:
            return False
        _var1 = L['b']
        if not hasattr(_var1, '_dynamo_dynamic_indices') == False:
            return False
        if not ___is_grad_enabled():
            return False
        if ___are_deterministic_algorithms_enabled():
            return False
        if not ___is_torch_function_enabled():
            return False
        if not getattr(utils_device, 'CURRENT_DEVICE') == None:
            return False
        if not ___check_tensors(_var0, _var1, tensor_check_names=tensor_check_names):
            return False
        return True

    compiled code:
    def toy_example(a, b):
        __temp_1 = __compiled_fn_0(a, b)
        x = __temp_1[0]
        if __temp_1[1]:
            return __resume_at_30_1(b, x)
        return __resume_at_38_2(b, x)

Some names referenced in the code are:

- Compiled functions, stored in the global namespace of the module containing the original function ``toy_example``. These include names like ``__compiled_fn_0`` / ``__resume_at_30_1`` / ``__resume_at_38_2``.

- Closure variables used for checking guards. The names can be accessed from ``guard.__code__.co_freevars``, and the values are stored in ``guard.__closure__``. These include names like ``___guarded_code`` / ``___is_grad_enabled`` / ``___are_deterministic_algorithms_enabled`` / ``___is_torch_function_enabled`` / ``utils_device`` / ``___check_tensors`` / ``tensor_check_names``.

- Argument ``L`` of the ``guard`` function. This is a dict mapping the names of the arguments of ``toy_example`` to their values. It is only available when the function is called, which is where the frame evaluation API comes into play. In short, ``L`` is a ``dict`` with the structure ``{'a': value_a, 'b': value_b}``, which is why the code uses ``L['a']`` to refer to the input variable ``a`` (see the sketch after this list).
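
Putting the last two items together, we can inspect the guard's closure and even invoke the guard by hand. The following is a minimal sketch for illustration only, assuming ``guard``, ``a``, and ``b`` come from the steps above; calling guards manually is not part of any documented API.

.. code-block:: python

    # Names the guard closes over (___guarded_code, ___check_tensors, ...)
    print(guard.__code__.co_freevars)
    # The corresponding cell objects holding the guarded values
    print(guard.__closure__)

    # Build the frame-locals dict by hand and ask the guard whether the
    # cached compiled code is still valid for these inputs.
    L = {'a': a, 'b': b}
    print(guard(L))  # True -> reuse cached code; False -> re-compile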

The graph break is visible in the code of the compiled ``toy_example``, where we have to use the Python interpreter to select which of the following graphs to execute.
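
For reference, this is the original ``toy_example`` from earlier in this document (reconstructed here to match the decompiled output above); the data-dependent ``if`` on a tensor value is exactly what causes the graph break:

.. code-block:: python

    def toy_example(a, b):
        x = a / (torch.abs(a) + 1)
        # A branch on a tensor value cannot be captured in a single graph,
        # so TorchDynamo splits the function at this point.
        if b.sum() < 0:
            b = b * -1
        return x * b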

Note that we pass a simple ``my_compiler`` function as the backend compiler, so the subgraph code ``__resume_at_38_2``, ``__resume_at_30_1``, and ``__compiled_fn_0`` remains Python code. This can also be inspected (ignore the function names; only the function signatures and bodies matter):

.. code-block:: python

    print("source code of __compiled_fn_0:")
    print(__compiled_fn_0._torchdynamo_orig_callable.__self__)
    print("=" * 60)
    print("source code of __resume_at_30_1:")
    print(decompile(__resume_at_30_1))
    print("=" * 60)
    print("source code of __resume_at_38_2:")
    print(decompile(__resume_at_38_2))

::

    source code of __compiled_fn_0:
    GraphModule()

    def forward(self, L_a_ : torch.Tensor, L_b_ : torch.Tensor):
        l_a_ = L_a_
        l_b_ = L_b_
        abs_1 = torch.abs(l_a_)
        add = abs_1 + 1; abs_1 = None
        truediv = l_a_ / add; l_a_ = add = None
        sum_1 = l_b_.sum(); l_b_ = None
        lt = sum_1 < 0; sum_1 = None
        return (truediv, lt)

    # To see more debug info, please use ``graph_module.print_readable()``
    ============================================================
    source code of __resume_at_30_1:
    def <resume in toy_example>(b, x):
        b = b * -1
        return x * b

    ============================================================
    source code of __resume_at_38_2:
    def <resume in toy_example>(b, x):
        return x * b

However, if we use other backends like the built-in ``inductor``, the subgraph code will be compiled into CUDA kernels for GPU or C++ code for CPU.
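
As a minimal sketch (assuming the same ``toy_example`` used throughout this document), compiling with the built-in backend only requires changing the backend name; ``"inductor"`` is also the default when no backend is given:

.. code-block:: python

    import torch

    def toy_example(a, b):
        x = a / (torch.abs(a) + 1)
        if b.sum() < 0:
            b = b * -1
        return x * b

    # With inductor, the extracted subgraphs are lowered to generated
    # C++ (CPU) or Triton (GPU) kernels instead of remaining Python code.
    opt_toy_example = torch.compile(toy_example, backend="inductor")
    opt_toy_example(torch.randn(10), torch.randn(10))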

To summarize, the compiled code is conceptually equivalent to the code below:

.. code-block:: python

    def compiled_example(a, b):
        L = {'a': a, 'b': b}
        for guard, code in get_cache_entries():
            if guard(L):
                return code(a, b)
        recompile_and_add_another_cache_entry()

The following diagram demonstrates how ``torch.compile`` transforms and optimizes user-written code: it first extracts computation graphs from the user-written function, compiles these graphs into optimized functions, and then assembles them into a new function that is functionally equivalent to the user-written code but optimized for computation speed.

.. image:: _static/img/dynamo/flowchart.jpg