<a href="https://colab.research.google.com/github/rdspring1/Autopilot-TensorFlow/blob/master/(Draft)_Acquiring_Deep_Learning_Programs_TorchDynamo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TorchDynamo

TorchDynamo uses CPython's frame evaluation API (from [PEP 523](https://peps.python.org/pep-0523/)) to trace the execution of a Python program. This is distinct from TorchScript Scripting, which reads the Python program using the Python "Abstract Syntax Tree (AST)", and from TorchScript Tracing, which records PyTorch operations as they're performed. 
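To make the distinction concrete, here is a minimal sketch using only the standard library's `ast` and `dis` modules. It shows two views of the same function: the syntax tree that an AST-based approach starts from, and the bytecode that the interpreter evaluates frame by frame, which is the level a PEP 523 hook works at. The `SOURCE` string and the `add` function are illustrative names, and this is not how TorchDynamo itself is implemented.

```python
import ast
import dis

SOURCE = """
def add(a, b):
    return a + b
"""

# View 1: the abstract syntax tree, built from source text without running it
tree = ast.parse(SOURCE)
func_def = tree.body[0]
ast_node_names = [type(node).__name__ for node in ast.walk(func_def)]

# View 2: the bytecode attached to the compiled function, which is what the
# interpreter actually evaluates when a frame for `add` is executed
namespace = {}
exec(compile(SOURCE, "<example>", "exec"), namespace)
add = namespace["add"]
opnames = [instruction.opname for instruction in dis.get_instructions(add)]

print("AST nodes:", ast_node_names)
print("Bytecode: ", opnames)
```

The exact opcode names depend on the Python version, so the printed bytecode will vary between interpreters.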

In this notebook we'll look at some simple examples to begin understanding the capabilities and limitations of TorchDynamo. For more information on TorchDynamo, see the posts on PyTorch Dev Discussions, like [this one](https://dev-discuss.pytorch.org/t/torchdynamo-update-8-torchdynamo-passed-correctness-check-on-7k-github-models/663).

### Getting Started with TorchDynamo in Colab

TorchDynamo is still experimental, and it's designed to work with the nightly version of PyTorch, so we'll start by configuring our Colab environment. This should take a few minutes, and it will build TorchDynamo from source.

In [None]:
# Uninstalls Colab's default PyTorch and installs PyTorch nightly
!pip3 uninstall --yes torch
!pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Found existing installation: torch 1.12.1+cu113
Uninstalling torch-1.12.1+cu113:
  Successfully uninstalled torch-1.12.1+cu113
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/nightly/cpu
Collecting torch
  Downloading https://download.pytorch.org/whl/nightly/cpu/torch-1.13.0.dev20221006%2Bcpu-cp37-cp37m-linux_x86_64.whl (198.6 MB)
Installing collected packages: torch
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.13.1+cu113 requires torch==1.12.1, but you have torch 1.13.0.dev20221006+cpu which is incompatible.
torchtext 0.13.1 requires torch==1.12.1, but you have torch 1.13.0.dev20221006+cpu which is incompatible.
torchaudio 0.12.1+cu113 requires torch==1.12.1, but you have torch 1.13.0.dev2022100

In [None]:
# Verifies we have the right version
import torch
print(torch.__version__)

In [None]:
# Clones the TorchDynamo repository so we can build it from source
!git clone https://github.com/pytorch/torchdynamo.git

Cloning into 'torchdynamo'...
remote: Enumerating objects: 16091, done.
remote: Counting objects: 100% (2944/2944), done.
remote: Compressing objects: 100% (401/401), done.
remote: Total 16091 (delta 2695), reused 2706 (delta 2536), pack-reused 13147
Receiving objects: 100% (16091/16091), 6.25 MiB | 29.88 MiB/s, done.
Resolving deltas: 100% (12510/12510), done.


In [None]:
%cd torchdynamo

/content/torchdynamo


In [None]:
!pip3 install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting black==22.8.0
  Downloading black-22.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
Collecting flake8==5.0.4
  Downloading flake8-5.0.4-py2.py3-none-any.whl (61 kB)
Collecting isort==5.10.1
  Downloading isort-5.10.1-py3-none-any.whl (103 kB)
Collecting mypy==0.960
  Downloading mypy-0.960-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.3 MB)
Collecting click>=8.1
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Collecting expecttest
  Downloading expecttest-0.1.3-py3-none-a

In [None]:
!python setup.py develop

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
running develop
running egg_info
creating torchdynamo.egg-info
writing torchdynamo.egg-info/PKG-INFO
writing dependency_links to torchdynamo.egg-info/dependency_links.txt
writing requirements to torchdynamo.egg-info/requires.txt
writing top-level names to torchdynamo.egg-info/top_level.txt
writing manifest file 'torchdynamo.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'torchdynamo.egg-info/SOURCES.txt'
running build_ext
building 'torchdynamo._eval_frame' extension
creating build
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/torchdynamo
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.7m -c torchdynamo/_eval_frame.c -o build/temp.lin

In [None]:
import torchdynamo

### Introductory Example
We'll start with a very simple example to see how to invoke TorchDynamo and what it produces.

In [None]:
from typing import List

# Clears any previously registered optimizer
# NOTE: this is useful if you want to experiment with tweaking the
#   dynamo_printer function below
torchdynamo.reset()

# A callback to review the FX graphs that TorchDynamo generates
def dynamo_printer(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    # gm.graph.print_tabular()
    print(gm.code)
    return gm.forward

In [None]:
@torchdynamo.optimize(dynamo_printer)
def foo_simple(a, b):
  return a + b

a = torch.ones(4)
b = torch.arange(4)

result = foo_simple(a, b)




def forward(self, a : torch.Tensor, b : torch.Tensor):
    add = a + b;  a = b = None
    return (add,)
    


There's a lot going on here, so let's break it down.

Let's start by considering the `foo_simple` function, which just adds two tensors. In the cell above those two tensors are `a` and `b`. `foo_simple` is decorated with `@torchdynamo.optimize`, and when `foo_simple` is called the Python equivalent of the traced operations is printed.

TorchDynamo observes Python frames to create one or more FX `GraphModules`. The function passed to `@torchdynamo.optimize` is then given these `GraphModules` one at a time, along with the inputs used to generate them. It can do whatever it likes to each graph, but it must return a callable that TorchDynamo will execute instead of the original operations. The intention is that the returned callable performs the same computation faster than the original.

The `dynamo_printer` function doesn't actually do any optimization, however. It just prints the `GraphModule`'s Python code and returns its `forward` function without modification. FX's definition of `forward` is a little different from our `foo_simple`, but it clearly captures the addition of `a` and `b`.

To learn more about FX, see the [FX documentation](https://pytorch.org/docs/stable/fx.html). FX's `GraphModules` are easy to read and transform, and it's convenient that TorchDynamo produces them.
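The callback contract described above (receive a graph, return a callable) can be sketched without PyTorch. In the toy below a "graph" is just a list of `(op, constant)` steps, which is a hypothetical stand-in for an FX `GraphModule`, and `interpret`, `printing_backend`, and `fusing_backend` are illustrative names rather than TorchDynamo APIs. One backend only prints and passes the graph through, like `dynamo_printer`. The other rewrites the graph before returning a callable.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an FX GraphModule: a flat list of (op, constant)
# steps applied in order to a single input value.
Step = Tuple[str, float]


def interpret(steps: List[Step]) -> Callable[[float], float]:
    """Returns a callable that runs the steps one after another."""
    def run(x: float) -> float:
        for op, const in steps:
            if op == "add":
                x = x + const
            elif op == "mul":
                x = x * const
            else:
                raise ValueError(f"unknown op: {op}")
        return x
    return run


def printing_backend(steps: List[Step]) -> Callable[[float], float]:
    """Like dynamo_printer: inspects the graph, returns it unchanged."""
    print(steps)
    return interpret(steps)


def fusing_backend(steps: List[Step]) -> Callable[[float], float]:
    """Merges consecutive additions into one step before returning a callable."""
    fused: List[Step] = []
    for op, const in steps:
        if fused and op == "add" and fused[-1][0] == "add":
            fused[-1] = ("add", fused[-1][1] + const)
        else:
            fused.append((op, const))
    return interpret(fused)


graph = [("add", 1.0), ("add", 2.0), ("mul", 3.0)]
unoptimized = printing_backend(graph)
optimized = fusing_backend(graph)
print(unoptimized(4.0), optimized(4.0))
```

Both callables compute the same result. The only requirement on a backend in this sketch is the one stated above: whatever it does to the graph, it hands back something callable.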

### TorchDynamo Traces Python

TorchDynamo, like TorchScript Tracing, is a tracer. It watches Python frames go by, so, like other tracers, it produces "traces." These "traces" are sequences of operations without control flow, and they represent one "path" through a function. Some functions, like `foo_simple` above, only have one path through them, and so TorchDynamo observes the entire program as it's run. We can see that TorchDynamo is tracing by looking at how it handles control flow, like if/else statements and loops.

In [None]:
@torchdynamo.optimize(dynamo_printer)
def foo_branching(a, b):
  if a.dtype is torch.float32:
    return a + b
  return a - b

result = foo_branching(a, b)




def forward(self, a : torch.Tensor, b : torch.Tensor):
    add = a + b;  a = b = None
    return (add,)
    


When `a` is a float32 tensor it's added to `b`, and the trace that TorchDynamo creates from our sample inputs `a` and `b` shows only that addition. Passing `a` as a float64 tensor reveals the other path.

In [None]:
@torchdynamo.optimize(dynamo_printer)
def foo_branching(a, b):
  if a.dtype is torch.float32:
    return a + b
  return a - b

result = foo_branching(a.double(), b)




def forward(self, a : torch.Tensor, b : torch.Tensor):
    sub = a - b;  a = b = None
    return (sub,)
    


Since TorchDynamo is tracing, it also "unrolls" loops.

In [None]:
@torchdynamo.optimize(dynamo_printer)
def foo_loop(a):
  b = a
  for _ in range(3):
    b = b + a

  return b

result = foo_loop(a)




def forward(self, a : torch.Tensor):
    add = a + a
    add_1 = add + a;  add = None
    add_2 = add_1 + a;  add_1 = a = None
    return (add_2,)
    


TorchDynamo is actually capable of observing the loop since it's looking at Python frames, but the FX `GraphModules` it produces can only represent traces, which don't include control flow. There's a representational trade-off with this approach, as traces may be easier to transform and execute than graphs, which may contain control flow.
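The unrolling can be reproduced with a toy tracer. The sketch below records operations through operator overloading on a hypothetical `Proxy` class, a record-as-you-run approach rather than TorchDynamo's frame-level analysis, but the resulting trace has the same shape: three additions and no loop. `Proxy` and `toy_trace` are illustrative names, not part of any library.

```python
from typing import Callable, List


class Proxy:
    """Stands in for a tensor and records every addition performed on it."""

    def __init__(self, name: str, trace: List[str]):
        self.name = name
        self.trace = trace

    def __add__(self, other: "Proxy") -> "Proxy":
        result = Proxy(f"add_{len(self.trace)}", self.trace)
        self.trace.append(f"{result.name} = {self.name} + {other.name}")
        return result


def toy_trace(fn: Callable[[Proxy], Proxy]) -> List[str]:
    """Runs fn once on a proxy input and returns the recorded operations."""
    trace: List[str] = []
    output = fn(Proxy("a", trace))
    trace.append(f"return {output.name}")
    return trace


def foo_loop(a):
    b = a
    for _ in range(3):
        b = b + a
    return b


loop_trace = toy_trace(foo_loop)
for line in loop_trace:
    print(line)
```

The `for` loop runs in ordinary Python while the proxy records each addition, so the loop itself never appears in the trace.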

### Handling Foreign Functions and Multiple Graphs

So far we've seen TorchDynamo capture a single path through a function into a single FX `GraphModule`, but this is not always possible. Functions may include "foreign" functions that aren't PyTorch operations, and we don't want these appearing in the FX `GraphModule`. TorchDynamo deals with foreign functions by separating them from the `GraphModules` describing a function. Let's look at some examples.

In [None]:
@torchdynamo.optimize(dynamo_printer)
def foo_foreign(a, b):
  a = a + 2
  b = torch.from_numpy(b)
  return a + b

# Passes b as a NumPy array; foo_foreign converts it with torch.from_numpy
result = foo_foreign(a, b.numpy())

    b = torch.from_numpy(b)




def forward(self, a : torch.Tensor):
    add = a + 2;  a = None
    return (add,)
    



def forward(self, _stack0 : torch.Tensor, a : torch.Tensor):
    add = a + _stack0;  a = _stack0 = None
    return (add,)
    


`foo_foreign` expects a PyTorch tensor and a NumPy array. When traced, TorchDynamo warns us that our use of `torch.from_numpy` is causing a "graph break," and instead of one trace we get two. The first performs the `a + 2` addition, and the second performs the `a + b` addition. The middle of the function appears to be missing. We can use `torchdynamo.explain` for more information.

In [None]:
explanation, out_guards, graphs, ops_per_graph, break_reasons = torchdynamo.explain(foo_foreign, a, b.numpy())

In [None]:
print(explanation)

Dynamo produced 2 graphswith 1 graph break and 2 ops
 Break reasons: 

1. call_function args: NumpyVariable() 
  File "<ipython-input-66-bf3e0f894115>", line 4, in foo_foreign
    b = torch.from_numpy(b)
 
TorchDynamo compilation metrics:
Function                                             Runtimes (s)
---------------------------------------------------  --------------
convert_frame_assert.<locals>._convert_frame_assert  0.0063, 0.0041


It seems like TorchDynamo is unhappy about our use of a NumPy array. Even though the function's middle appears to be missing, running it with TorchDynamo produces the expected result:

In [None]:
foo_foreign(a, b.numpy())

    b = torch.from_numpy(b)




def forward(self, a : torch.Tensor):
    add = a + 2;  a = None
    return (add,)
    



def forward(self, _stack0 : torch.Tensor, a : torch.Tensor):
    add = a + _stack0;  a = _stack0 = None
    return (add,)
    


tensor([3., 4., 5., 6.])

Behind the scenes, TorchDynamo orchestrates running each "optimized" callable as well as the regions for which it refuses to produce `GraphModules`. The latter regions are just run by the Python interpreter. 

This is an important feature of TorchDynamo. Other tracing systems, like TorchScript Tracing, are "blind" to many operations. Because TorchDynamo looks at Python frames, however, it's capable of observing everything the interpreter is doing. This lets it observe regions of a function that may not be representable in an FX `GraphModule`, but TorchDynamo can still record and execute these regions using the Python interpreter later.
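One way to picture the split is as a partition of the function's steps into graph regions and plain Python regions. The sketch below is a toy model under that assumption: the steps of `foo_foreign` are tagged by hand as representable in a graph or not, and `split_into_regions` is an illustrative helper, not a TorchDynamo function.

```python
from itertools import groupby
from typing import List, Tuple

# Each step of foo_foreign, tagged with whether this toy model treats it as
# representable in a graph. The tags are assigned by hand for illustration.
STEPS: List[Tuple[str, bool]] = [
    ("a = a + 2", True),
    ("b = torch.from_numpy(b)", False),
    ("return a + b", True),
]


def split_into_regions(steps: List[Tuple[str, bool]]) -> List[Tuple[str, List[str]]]:
    """Groups consecutive steps with the same tag into one region."""
    regions = []
    for traceable, group in groupby(steps, key=lambda step: step[1]):
        kind = "graph" if traceable else "python"
        regions.append((kind, [source for source, _ in group]))
    return regions


regions = split_into_regions(STEPS)
num_graphs = sum(1 for kind, _ in regions if kind == "graph")
num_python_regions = sum(1 for kind, _ in regions if kind == "python")

for kind, sources in regions:
    print(kind, sources)
print(f"{num_graphs} graphs, {num_python_regions} Python region")
```

The two graph regions correspond to the two `forward` functions printed earlier, and the Python region in the middle is the `torch.from_numpy` call that the interpreter runs directly.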

Sometimes, however, TorchDynamo refuses to handle some Python. We can see this by extending our `foo_foreign` slightly.

In [None]:
import numpy as np

@torchdynamo.optimize(dynamo_printer)
def foo_foreign2(a, b):
  a = a + 2
  b = np.add(b, 1)
  b = torch.from_numpy(b)
  return a + b

# Passes b as a NumPy array, as before
result = foo_foreign2(a, b.numpy())

@torchdynamo.optimize(dynamo_printer)
def foo_foreign3(a, b):
  a = a + 2
  b = b + 1
  b = torch.from_numpy(b)
  return a + b

# Passes b as a NumPy array, as before
result = foo_foreign3(a, b.numpy())

    b = np.add(b, 1)




def forward(self, a : torch.Tensor):
    add = a + 2;  a = None
    return (add,)
    
    b = torch.from_numpy(b)




def forward(self, _stack0 : torch.Tensor, a : torch.Tensor):
    add = a + _stack0;  a = _stack0 = None
    return (add,)
    
torchdynamo.convert_frame: [ERROR] WON'T CONVERT foo_foreign3 <ipython-input-72-c87765088ab8> line 13 
due to: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/fx/proxy.py", line 165, in create_arg
    raise NotImplementedError(f"argument of type: {type(a)}")
NotImplementedError: argument of type: <class 'numpy.ndarray'>

from user code:
   File "<ipython-input-72-c87765088ab8>", line 17, in foo_foreign3
    b = torch.from_numpy(b)

Set torchdynamo.config.verbose=True for more information


While TorchDynamo will respect an explicit call to NumPy, it's unhappy with calling `__add__` on a NumPy array. This is probably just a technical glitch, and not an inherent limitation, however.

### Performance and Caching

Like any tracer, TorchDynamo is only interesting if it caches effectively. This is an inherent requirement of tracing, because a tracer has to run a function at least once to observe its behavior. Since the function has already been run, there's no point in optimizing it unless we're going to call it again and the tracer can quickly retrieve the correct trace from its cache. If tracers never used a cache then we'd be paying a performance penalty for tracing and never realizing a benefit!

We can test TorchDynamo's caching empirically by seeing when it calls our "optimizer."

In [None]:
@torchdynamo.optimize(dynamo_printer)
def foo_easy(a, b):
  return a - b

result = foo_easy(a, b)




def forward(self, a : torch.Tensor, b : torch.Tensor):
    sub = a - b;  a = b = None
    return (sub,)
    


In [None]:
# The trace for foo_easy has been cached
result = foo_easy(a, b)

The first call to `foo_easy` invokes `dynamo_printer`, but the second doesn't. TorchDynamo recognizes that it can reuse the callable that `dynamo_printer` previously returned to execute `foo_easy` on the same inputs. 

Changing the dtype of the inputs will trigger another trace, however.

In [None]:
result = foo_easy(a.double(), b)




def forward(self, a : torch.Tensor, b : torch.Tensor):
    sub = a - b;  a = b = None
    return (sub,)
    


This tells us that TorchDynamo's cache encodes properties of the function's inputs, like the dtype of tensors.

Now let's see how it handles Python objects.

In [None]:
result = foo_easy(a, 2)




def forward(self, a : torch.Tensor):
    sub = a - 2;  a = None
    return (sub,)
    


A scalar creates a new trace, and the scalar's value appears as a constant in the trace!

In [None]:
# The same scalar value doesn't cause a retrace
result = foo_easy(a, 2)

In [None]:
# ... but a different scalar value does!
result = foo_easy(a, 3)




def forward(self, a : torch.Tensor):
    sub = a - 3;  a = None
    return (sub,)
    


Different values for the Python scalar cause TorchDynamo to retrace, suggesting that using it for functions that accept frequently changing native Python types is a bad idea. To see the cost, let's compare the performance of running the same function with the Python interpreter and with TorchDynamo as we vary the value of the scalar. For this test we'll use a different "optimizer" that doesn't print, so we're not overwhelmed with output.

In [None]:
torchdynamo.reset()

# Simple TorchDynamo "optimizer" that does nothing
def dynamo_passthrough(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    return gm.forward

def foo_python(a, b):
  return a - b

# Constructs the TorchDynamo traced version of the above
foo_passthrough = torchdynamo.optimize(dynamo_passthrough)(foo_python)

import time

# Times TorchDynamo 
start = time.time()

for b in range(1000):
  foo_passthrough(a, b)

end = time.time()
elapsed = end - start
print(f"TorchDynamo elapsed time: {elapsed}")

# Times using the Python interpreter
start = time.time()

for b in range(1000):
  foo_python(a, b)

end = time.time()
elapsed = end - start
print(f"Python elapsed time: {elapsed}")

   function: 'foo_python' (<ipython-input-82-eaf361744106>:6)
   reasons:  ['b == 0']
to diagnose recompilation issues, see https://github.com/pytorch/torchdynamo/blob/main/TROUBLESHOOTING.md.
TorchDynamo elapsed time: 0.21825957298278809
Python elapsed time: 0.004097938537597656


The profiling in the above cell is very simple, but it highlights TorchDynamo's current issue handling Python objects with changing values. TorchDynamo will even warn about "recompilation" in this case -- which is just another name for "retracing." 
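As a rough sanity check on the numbers printed above, the sketch below converts the two elapsed times into an average per-call cost and a slowdown ratio. The variable names are illustrative, and the figures are copied from this run's output, so they will differ on other machines.

```python
# Elapsed times copied from the output above (seconds for 1000 calls each)
dynamo_elapsed = 0.21825957298278809
python_elapsed = 0.004097938537597656
num_calls = 1000

# Average cost of a single call, in microseconds
dynamo_per_call_us = dynamo_elapsed / num_calls * 1e6
python_per_call_us = python_elapsed / num_calls * 1e6

# How many times slower the TorchDynamo loop was overall
slowdown = dynamo_elapsed / python_elapsed

print(f"TorchDynamo: {dynamo_per_call_us:.1f} us per call")
print(f"Python:      {python_per_call_us:.1f} us per call")
print(f"Slowdown:    {slowdown:.1f}x")
```

On this run the TorchDynamo loop is roughly 53 times slower than the plain Python loop, which is the cost of the recompilation issue described above.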

This is not an inherent limitation of TorchDynamo's approach and I expect the TorchDynamo team to address this issue in the future. We can see what's happening more clearly by using `torchdynamo.explain` again:

In [None]:
explanation, out_guards, graphs, ops_per_graph, break_reasons = torchdynamo.explain(foo_passthrough, a, 2)

In [None]:
print(out_guards)

[{Guard(name='b', source=<GuardSource.LOCAL: 0>, create_fn=<function GuardBuilder.CONSTANT_MATCH at 0x7fa9a7f59050>, is_volatile=False, guard_types=['EQUALS_MATCH'], code_list=['___check_type_id(b, 11105824)', 'b == 2'], obj_weakref=None, guarded_class_weakref=<weakref at 0x7fa9e8a2be90; to 'type' at 0xa97620 (int)>), Guard(name='a', source=<GuardSource.LOCAL: 0>, create_fn=<function GuardBuilder.TENSOR_MATCH at 0x7fa9a7f597a0>, is_volatile=False, guard_types=['TENSOR_MATCH'], code_list=None, obj_weakref=<weakref at 0x7fa9a5cba770; to 'Tensor' at 0x7fa9a5cc7ad0>, guarded_class_weakref=<weakref at 0x7fa9b40b2710; to 'torch._C._TensorMeta' at 0x65387a0 (Tensor)>)}]


Although the formatting above isn't very nice, we can see that there's a "guard" for `b` created by `CONSTANT_MATCH`, and its checks include `b == 2`. This matches what we've seen empirically: TorchDynamo only reuses a previous trace when this value is the same.
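The behavior we observed can be mimicked with a small guard-keyed cache. This is a toy model under stated assumptions, not TorchDynamo's implementation: `FakeTensor`, `guard_key`, and `ToyTraceCache` are hypothetical names, tensors are guarded only on their dtype, and every other argument is guarded on its type and exact value, as the `b == 2` check above suggests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class FakeTensor:
    """Minimal stand-in for a tensor: a dtype label and a single value."""
    dtype: str
    value: float

    def __sub__(self, other):
        other_value = other.value if isinstance(other, FakeTensor) else other
        return FakeTensor(self.dtype, self.value - other_value)


def guard_key(args: Tuple) -> Tuple:
    """Builds the conditions under which a cached trace may be reused."""
    key = []
    for arg in args:
        if isinstance(arg, FakeTensor):
            key.append(("tensor", arg.dtype))        # guard on metadata only
        else:
            key.append((type(arg).__name__, arg))    # guard on type and value
    return tuple(key)


class ToyTraceCache:
    """Counts how often a function would be retraced under these guards."""

    def __init__(self, fn: Callable):
        self.fn = fn
        self.entries: Dict[Tuple, Callable] = {}
        self.trace_count = 0

    def __call__(self, *args):
        key = guard_key(args)
        if key not in self.entries:
            self.trace_count += 1        # cache miss: "retrace"
            self.entries[key] = self.fn  # tracing is a no-op in this sketch
        return self.entries[key](*args)


def foo_easy(a, b):
    return a - b


cached = ToyTraceCache(foo_easy)
a = FakeTensor("float32", 1.0)

counts = []
for b in (2, 2, 3):
    cached(a, b)
    counts.append(cached.trace_count)

cached(FakeTensor("float64", 1.0), 3)
counts.append(cached.trace_count)
print(counts)
```

Repeating the scalar `2` reuses the cached entry, while a new scalar value or a new tensor dtype forces another trace, mirroring the retraces we saw with `foo_easy`.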

### Conclusion

*   Debate over PEP 523
*   If you're going to trace, possibly the best way to do it
*   Great observability
*   Awesome that it produces FX `GraphModules` (easy to transform and execute)
*   Addresses the top problems with TorchScript Tracing (too much metadata, "blind" to other operations)
*   Representation of Python regions seems lacking, and not all Python regions are understood
*   Caching model seems too restrictive, especially when working with native Python types like Numbers
