
Added list clearing codegen to AOTAutograd (hidden behind config.aot_clear_list) #83137

Closed
wants to merge 12 commits

Conversation

@facebook-github-bot
Contributor

facebook-github-bot commented Aug 10, 2022


✅ No Failures (0 Pending)

As of commit 0ca972f (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


Chillee added a commit that referenced this pull request Aug 10, 2022
…clear_list

ghstack-source-id: f849cfc8045f14bbfb59dd553e6209c9892afa72
Pull Request resolved: #83137
Chillee requested review from albanD, ngimel, jansel, SherlockNoMad, and anijain2305 and removed the review request for albanD and soulitzer on August 10, 2022 at 01:42
@Chillee
Contributor Author

Chillee commented Aug 10, 2022

cc: @albanD on changes to python_function.cpp

cc: @jansel and @SherlockNoMad on changes we need to make to the compilers being passed to AOTAutograd.

@Chillee
Contributor Author

Chillee commented Aug 10, 2022


An example of the changes.

Collaborator

@albanD left a comment

Small nits only

Review threads on torch/csrc/autograd/python_function.cpp (outdated, resolved)
@@ -20,6 +20,9 @@
# fix for complex numbers
use_fake_tensor = False

# Changes AOTAutograd to pass in a list of tensors that is then cleared
Collaborator

The user is passing the list?

Contributor Author

So today, the contract with the compiler is that we pass in a function like

f(a, b, c)

and the compiler returns a callable

compiled_f(a, b, c)

This changes the contract so that we pass in a function

f([a, b, c])

and the compiler returns a callable

compiled_f([a, b, c])
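
For concreteness, here is a toy sketch of the boxed contract. All names here are illustrative only, not the actual AOTAutograd API:

```python
from typing import Callable, List

import torch

def toy_compiler(fn: Callable[[List[torch.Tensor]], torch.Tensor]):
    # The "compiler" receives a function that takes a single list of tensors
    # and must return a callable with the same boxed signature.
    def compiled_f(args: List[torch.Tensor]) -> torch.Tensor:
        return fn(args)
    return compiled_f

def f(args):                 # f([a, b, c]) instead of f(a, b, c)
    a, b, c = args
    args.clear()             # the callee may now drop the inputs early
    return a + b + c

compiled = toy_compiler(f)
out = compiled([torch.ones(2), torch.ones(2), torch.ones(2)])
```

The only change to the contract is that both the traced function and the compiled callable take one list argument instead of separate positional tensors.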

Collaborator

I'm not sure I see the difference. Don't you use pytree to "unpack" any data structure in the args?

Contributor Author

@albanD I'll add some more context to this issue/commit, but basically, it's about object lifetimes.

If you call f(a,b,c), then a, b, and c will stay alive for the duration of f, and there's no way around this AFAIK.

OTOH, if you call f([a,b,c]), then we can clear the list and free a/b/c inside of f.

Collaborator

You can do `del a, b, c`?

Contributor Author

Unfortunately not :( Gimme a sec.

Review thread on functorch/functorch/_src/aot_autograd.py (outdated, resolved)
Chillee added a commit that referenced this pull request Aug 10, 2022
…clear_list

ghstack-source-id: 6206ebc725c7f241ccea15b1ea5fae577dc0b93e
Pull Request resolved: #83137
@Chillee
Contributor Author

Chillee commented Aug 10, 2022

So... what is the problem here?

(cc: @albanD , @eellison)

Let's say you have a function like this

```python
import torch

def get_mem():
    print(f"{torch.cuda.memory_allocated()/1e9} GB")

def f(x):
    del x
    get_mem()
    return None

f(torch.randn(2**30, device='cuda'))
```

What will this print?

Unfortunately for us, the answer is 4.294967296 GB. This is despite the fact that, within f, we no longer hold any reference to the tensor, and outside f, we created the tensor only to pass it directly into f.

This is because, in Python, when you call a function, the arguments are always borrowed references. Thus, they must be kept alive for the duration of the function call. See this reference.

> When you pass an object reference into another function, in general, the function borrows the reference from you

Thus, there is no way for us to free any of the inputs to the function for the duration of the function.

So, how do we solve this? Although the argument objects themselves must stay alive for the duration of the call, the references they hold don't have to be. So if, instead, we pass a list to the function and then clear that list, we can ensure that the tensor is freed inside the function. So...

```python
def list_f(x):
    val = x[0]
    x.clear()
    del val
    get_mem()
    return None

list_f([torch.randn(2**30, device='cuda')])
```
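
The same contrast can be reproduced without a GPU. A small self-contained sketch, where `Payload` stands in for a large tensor and `saved` stands in for wherever the caller keeps its references:

```python
import weakref

class Payload:
    pass

def flat_f(x):
    ref = weakref.ref(x)
    del x                           # only drops the callee's local name
    print("flat:", ref() is None)   # False: the caller's container still owns it
    return ref

def boxed_f(xs):
    ref = weakref.ref(xs[0])
    xs.clear()                      # drop the shared list's reference
    print("boxed:", ref() is None)  # True: freed inside the call
    return ref

saved = (Payload(),)   # the caller holds the arguments, as a calling convention does
flat_f(*saved)
boxed_f([Payload()])
```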

Why is this particularly a problem for AOTAutograd?

Now, why is this issue a problem for AOTAutograd? Well, if you think about the signature of the backwards pass, it looks something like...

def backwards(activation_0, activation_1, activation_2, ... activation_million, grad_output)

In other words, you're passing a ton of activations into the backwards pass! And since activations constitute a significant part of the memory in any deep learning model, this can lead to higher memory usage in the backwards pass. Let's look at the memory graph for AOTAutograd today on a model like `resnet18`.

[memory-over-time plot: eager vs. AOTAutograd for resnet18]

You can see that in eager mode, we reach peak memory between the forwards and backwards passes. Then, during the backwards pass, we start to free activations as they're no longer needed, leading to a reduction in memory usage.

With AOTAutograd, however, our memory doesn't drop! In fact, it steadily rises during the backwards pass, as we allocate gradients on top of all the activations that are still being kept alive.

So, that's what this PR does. It fixes this problem for AOTAutograd by changing the convention with which we interface with compilers.
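
To make that concrete, here is a hand-written sketch of what a boxed backward call can look like. It is purely illustrative (not the code AOTAutograd generates), with made-up activation names:

```python
from typing import List

import torch

def compiled_backward(boxed_args: List[torch.Tensor]) -> torch.Tensor:
    act0, act1, grad_out = boxed_args
    boxed_args.clear()   # the caller's list no longer pins the activations
    grad = grad_out * act1
    del act1             # this activation can be freed here, mid-backward
    grad = grad + act0
    del act0             # and this one here, rather than at the end of the call
    return grad

saved = [torch.randn(4), torch.randn(4), torch.randn(4)]
grad_input = compiled_backward(saved)
print(saved)             # [] -- the callee cleared it
```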

xmfan added a commit that referenced this pull request Apr 2, 2024
…graph take in boxed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  out = a  
  out += b
  del b
  out += c
  del c
  out += d
  del d
  out += e
  del e
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling the forward pass, because a, b, c, d, e all come from user code and should be kept alive. But when compiling the backward pass, a, b, c, d, e may be intermediate results (i.e., activations) that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), added list clearing to inductor codegen, and were careful not to hold references to elements of the input list. We need to do something similar, but for inputs from the user program (the compiled autograd fx graph in this case).

This PR supports lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs early.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

AOT/Inductor already supports this list input via the [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which was fixed in the previous PR of this stack.

The next step after that is to ensure that we are careful to forward the list to inductor codegen without holding additional references.
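
For intuition only, here is a rough sketch of the kind of adapter involved and why holding extra references matters. This is not the actual `flatten_graph_inputs` or inductor code; the names are made up:

```python
from typing import Callable, List

import torch

def boxed_to_flat(compiled_fn: Callable[..., torch.Tensor]):
    # Naive adapter: accept one boxed list, call a compiled fn that wants flat args.
    def run(inputs: List[torch.Tensor]) -> torch.Tensor:
        flat = list(inputs)
        inputs.clear()             # the caller's list no longer pins the tensors...
        return compiled_fn(*flat)  # ...but `flat` (and the positional args) still keep
                                   # them alive for the whole inner call -- the
                                   # extra-reference concern called out above.
    return run
```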

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 2, 2024
…graph take in boxed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling the forward pass, because a, b, c, d, e all come from user code and should be kept alive. But when compiling the backward pass, a, b, c, d, e may be intermediate results (i.e., activations) that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), added list clearing to inductor codegen, and were careful not to hold references to elements of the input list. We need to do something similar, but for inputs from the user program (the compiled autograd fx graph in this case).

This PR supports lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list inputs via the [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper; those are done in the next PR. The next step is to ensure that we are careful to forward the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 2, 2024
…xed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 2, 2024
…graph take in boxed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 2, 2024
…xed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 2, 2024
…graph take in boxed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 2, 2024
…xed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
…graph take in boxed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
…xed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
…graph take in boxed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
…xed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
…graph take in boxed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
…xed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
…graph take in boxed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
…xed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
…graph take in boxed inputs"


### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. 

And [Dynamo generates](https://github.com/pytorch/pytorch/blob/fdc281f2587f9a5a935de1f1368e7ad7ed0f9828/torch/_dynamo/codegen.py#L371) the runtime function's signature using the graph's graphargs. 

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](#83137 (comment)). 

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a  
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out
  
# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](https://github.com/pytorch/pytorch/blob/597f479643f82859307ece38971f1c8e7d657c80/torch/_inductor/compile_fx.py#L1454-L1478), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Apr 9, 2024
xmfan added a commit that referenced this pull request Apr 11, 2024
xmfan added a commit that referenced this pull request Apr 11, 2024
xmfan added a commit that referenced this pull request Apr 11, 2024
xmfan added a commit that referenced this pull request Apr 11, 2024
pytorchmergebot pushed a commit that referenced this pull request Apr 12, 2024
Pull Request resolved: #122353
Approved by: https://github.com/jansel
ghstack dependencies: #123630, #123674
yf225 pushed a commit to yf225/pytorch that referenced this pull request Apr 12, 2024
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Apr 14, 2024
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024