Added list clearing codegen to AOTAutograd (hidden behind config.aot_clear_list #83137

Chillee · 2022-08-10T01:30:38Z

Stack from ghstack (oldest at bottom):

…clear_list [ghstack-poisoned]

facebook-github-bot · 2022-08-10T01:30:44Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/83137
📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓Need help or want to give feedback on the CI? Visit our office hours

✅ No Failures (0 Pending)

As of commit 0ca972f (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

…config.aot_clear_list" [ghstack-poisoned]

…clear_list ghstack-source-id: f849cfc8045f14bbfb59dd553e6209c9892afa72 Pull Request resolved: #83137

Chillee · 2022-08-10T02:06:57Z

cc: @albanD on changes to python_function.cpp

cc: @jansel and @SherlockNoMad on changes we need to make to the compilers being passed to AOTAutograd.

Chillee · 2022-08-10T02:22:50Z

An example of the changes.

albanD

Small nits only

torch/csrc/autograd/python_function.cpp

albanD · 2022-08-10T15:52:48Z

functorch/functorch/_src/config.py

@@ -20,6 +20,9 @@
 #   fix for complex numbers
 use_fake_tensor = False

+# Changes AOTAutograd to passing a list of tensors that are then cleared


The user is passing the list?

So today, the contract with the compiler is that we pass in a function like

f(a, b, c)

and the compiler returns a callable

compiled_f(a,b,c)

This changes the contract so that was pass in a function

f([a,b,c])

and the compiler returns a callable

compiled_f([a,b,c])

I'm not sure to see the difference? Don't you use pytree to "unpack" any data structure in the args?

@albanD I'll add some more context to this issue/commit, but basically, it's about object lifetimes.

If you call f(a,b,c), then a, b, and c will stay alive for the duration of f, and there's no way around this AFAIK.

OTOH, if you call f([a,b,c]), then we can clear the list and free a/b/c inside of f.

You can do del a, b, c ?

Unfortunately not :( Gimme a sec.

functorch/functorch/_src/aot_autograd.py

…config.aot_clear_list" [ghstack-poisoned]

…clear_list ghstack-source-id: 6206ebc725c7f241ccea15b1ea5fae577dc0b93e Pull Request resolved: #83137

Chillee · 2022-08-10T21:53:42Z

So... what is the problem here?

(cc: @albanD , @eellison)

Let's say you have a function like this

import torch
def get_mem():
    print(f"{torch.cuda.memory_allocated()/1e9} GB")

def f(x):
    del x
    get_mem()
    return None

f(torch.randn(2**30, device='cuda'))

What will this print?

Unfortunately for us, the answer is 4.294967296 GB. This is despite the fact that within f, we no longer have any references to a tensor, and outside f, we are creating the tensor to directly pass into f.

This is because, in Python, when you call a function, the arguments are always borrowed references. Thus, they must be kept alive for the duration of the function call. See this reference.

When you pass an object reference into another function, in general, the function borrows the reference from you

Thus, there is no way for us to free any of the inputs to the function for the duration of the function.

So, how do we solve this? Although the inputs themselves must stay alive for the duration of the function, there's no guarantee that the references they hold must be kept alive. So, if instead, we pass a list to the function, then clear that list, we can ensure that the tensor is freed in the function. So...

def list_f(x):
    val = x[0]
    x.clear()
    del val
    get_mem()
    return None

list_f([torch.randn(2**30, device='cuda')])

Why is this particularly a problem for AOTAutograd?

Now, why is this issue a problem for AOTAutograd? Well, if you think about the signature of the backwards pass, it looks something like...

def backwards(activation_0, activation_1, activation_2, ... activation_million, grad_output)

In other words, you're passing in a ton of activations to the backwards pass! And as activations constitute a significant part of the memory in any deep learning model, this can lead to higher memory usage in the backwards pass. Let's look at the graph for AOTAutograd today for a model like `resnet18.

You can see that in eager, we reach our peak memory in-between forwards + backwards. Then in the backwards pass, we start to free our activations as they're no longer needed, leading to a reduction in memory usage.

With AOTAutograd, however, our memory doesn't drop! In fact, it steadily rises during the backwards pass, as we additionally allocate gradients in addition to all the activations being saved.

So, that's what this PR does. It fixes this problem for AOTAutograd by changing the convention with which we interface with compilers.

…config.aot_clear_list" [ghstack-poisoned]

…graph take in boxed inputs" ### Context In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). ```python # original code def forward(inputs): a, b, c, d, e = inputs out = a out += b del b out += c del c out += d del d out += e del e return out # compiled code: def forward(a, b, c, d, e): # b, c, d, e can't be freed before end of function ``` This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory. ### Solution We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case). This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`. With the inputs as a list in the graph args, our compiled code can free inputs early. ```python def forward(inputs): # a, b, c, d, e can be freed within the function now ``` AOT/Inductor already support this list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478). Which was fixed in the previous PR of this stack. The next step after is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

…xed inputs" ### Context In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). ```python # original code def forward(inputs): a, b, c, d, e = inputs out = a out += b del b out += c del c out += d del d out += e del e return out # compiled code: def forward(a, b, c, d, e): # b, c, d, e can't be freed before end of function ``` This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory. ### Solution We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case). This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`. With the inputs as a list in the graph args, our compiled code can free inputs early. ```python def forward(inputs): # a, b, c, d, e can be freed within the function now ``` AOT/Inductor already support this list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478). Which was fixed in the previous PR of this stack. The next step after is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

…graph take in boxed inputs" ### Context In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). ```python # original code def forward(inputs): a, b, c, d, e = inputs inputs.clear() out = a out += b del b # frees memory out += c del c # frees memory out += d del d # frees memory out += e del e # frees memory return out # compiled code: def forward(a, b, c, d, e): # b, c, d, e can't be freed before end of function ``` This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory. ### Solution We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case). This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`. With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case. ```python def forward(inputs): # a, b, c, d, e can be freed within the function now ``` Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

…xed inputs" ### Context In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). ```python # original code def forward(inputs): a, b, c, d, e = inputs inputs.clear() out = a out += b del b # frees memory out += c del c # frees memory out += d del d # frees memory out += e del e # frees memory return out # compiled code: def forward(a, b, c, d, e): # b, c, d, e can't be freed before end of function ``` This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory. ### Solution We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case). This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`. With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case. ```python def forward(inputs): # a, b, c, d, e can be freed within the function now ``` Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]