
Conversation

@chenyang78 (Contributor) commented Sep 11, 2023

Summary: This PR adds dynamic-shape support for AOTInductor.

  • On the runtime/interface side, we added two structs, StaticDimInfo
    and DynamicDimInfo, to hold values for static and dynamic dimensions,
    respectively. Dynamic dimensions are tracked in an unordered map
    defined in AOTInductorModelBase. At inference time, the run method
    binds the current concrete value to each dynamic dimension before
    executing any kernel.

  • On the CUDA wrapper codegen side, we generate dynamic symbols
    appropriately for shape computations. We simulate kernel launch grids
    on the C++ side by reusing the grid functions from the Python world.
    The returned grid configs, which may contain symbolic expressions,
    are printed in their C++ forms via the CppPrinter. Note that when
    dynamic shapes are involved, we have to compute grid configs for
    each kernel at runtime in the same way as we do when launching
    the corresponding Triton kernel. Otherwise, we may end up with
    memory-access failures or miscomputations caused by invalid indices
    when fetching or storing data in device memory.
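The runtime bookkeeping described in the first bullet can be sketched as follows. This is a hypothetical Python model of the C++ design (the PR implements it with the StaticDimInfo/DynamicDimInfo structs and an unordered map in AOTInductorModelBase); class names like `Model` and symbols like `"s0"` are illustrative only.

```python
# Hypothetical Python model of the dynamic-dimension bookkeeping; the
# PR implements this in C++ (structs StaticDimInfo / DynamicDimInfo,
# an unordered map field in AOTInductorModelBase).

class StaticDimInfo:
    def __init__(self, value):
        self.value = value          # fixed at compile time

class DynamicDimInfo:
    def __init__(self, name):
        self.name = name            # symbolic name, e.g. "s0"
        self.value = None           # bound at inference time

class Model:
    def __init__(self):
        # dynamic dimensions tracked by symbol name, mirroring the
        # unordered map field in AOTInductorModelBase
        self.dynamic_dims = {"s0": DynamicDimInfo("s0")}

    def run(self, input_shape):
        # before executing any kernel, bind the current concrete value
        # to each dynamic dimension
        self.dynamic_dims["s0"].value = input_shape[0]
        # grid configs must then be recomputed from the bound values,
        # the same way the Python-side grid function computes them
        xblock = 128
        grid_x = -(self.dynamic_dims["s0"].value // -xblock)  # ceil div
        return (grid_x, 1, 1)
```

Recomputing the grid per run is what prevents the invalid-index failures mentioned above: a grid sized for one batch size would over- or under-cover the data for another.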

Differential Revision: D49100472

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @kadeng @muchulee8 @aakhundov

@pytorch-bot bot commented Sep 11, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109012

Note: Links to docs will display an error until the doc builds have completed.

✅ No Failures

As of commit 0f6853b with merge base 025d1a1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D49100472


chenyang78 added a commit to chenyang78/pytorch that referenced this pull request Sep 11, 2023
Summary:

This PR adds dynamic-shape support for AOTInductor

Differential Revision: D49100472


chenyang78 added a commit to chenyang78/pytorch that referenced this pull request Sep 12, 2023
Summary:

This PR adds dynamic-shape support for AOTInductor

Differential Revision: D49100472

chenyang78 added a commit to chenyang78/pytorch that referenced this pull request Sep 12, 2023
Summary:

This PR adds dynamic-shape support for AOTInductor

Differential Revision: D49100472
@ipiszy (Contributor) left a comment:


Thanks @chenyang78! Left some comments, mainly clarification questions. It would be great if you could summarize your changes in the PR description.


return ", ".join(new_args)

def generate_default_grid(self, name, grid, cuda=True):
Contributor:

Is this cuda arg used anywhere?

Contributor:

Seems unnecessary since the cpu backend doesn't need to generate grid

return grid
assert isinstance(grid, list), f"expected {grid=} to be a list"
grid = [e.inner_expr if isinstance(e, SymbolicCallArg) else e for e in grid]
grid_fn = default_grid(*grid)
Contributor:

What is default_grid? I cannot find it in the pytorch repo.

Contributor:

It's from `from ..triton_heuristics import grid as default_grid`
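For context, what a grid helper in the spirit of `triton_heuristics.grid` computes can be sketched as follows: it captures the problem sizes and returns a callable that ceil-divides them by the block sizes chosen at launch time. This is an illustrative reduction to the 1D/2D case, not the actual PyTorch implementation.

```python
def default_grid(*numels):
    """Illustrative sketch of a grid helper: capture the problem sizes
    and return a callable mapping block-size meta-parameters to a CUDA
    launch grid. The real helper handles more cases than this."""
    def cdiv(n, d):
        return -(n // -d)  # ceiling division

    def grid_fn(meta):
        x = cdiv(numels[-1], meta["XBLOCK"])
        y = cdiv(numels[-2], meta["YBLOCK"]) if len(numels) > 1 else 1
        return (x, y, 1)

    return grid_fn
```

For example, `default_grid(1024)({"XBLOCK": 128})` yields `(8, 1, 1)`. Because `numels` may contain symbolic expressions at codegen time, evaluating the returned grid with sympy-backed sizes is what lets the wrapper print the grid config in its C++ form.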

)
stack.enter_context(self.wrapper_call.indent())

def generate_default_grid(self, name, grid_args):
Contributor:

nit: Add more comments to describe what this function does, and the args. Also add type hints.

Author:

Will do. Thanks.

@dataclasses.dataclass
class SymbolicCallArg:
inner: Any
inner_expr: sympy.Expr
Contributor:

Add some comments?

Author:

Done.
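As requested in the thread above, a commented version of the dataclass might look like the following. The field comments are a plausible reading of the two fields, and the annotations are loosened to keep the sketch self-contained (in the PR, `inner_expr` is a `sympy.Expr`).

```python
import dataclasses
from typing import Any

@dataclasses.dataclass
class SymbolicCallArg:
    # The value passed at the kernel call site (in the generated
    # wrapper code, typically the name of a variable holding a size).
    inner: Any
    # The underlying symbolic expression for that value (a sympy.Expr
    # in the PR), kept so later grid computations can substitute and
    # simplify it.
    inner_expr: Any
```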

@desertfire (Contributor) left a comment:

Left some nits. LGTM overall.

Currently the AOTInductor OSS benchmarks are not part of the CI yet. I have a pending PR, #108419, but it won't be merged soon given the recent CI infra instability. So can you manually trigger a run at https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml?

You will need to select the options like in the screenshot:
(screenshot: workflow dispatch options, 2023-09-12)
Unfortunately, the Branch selection does not work with a PR from your forked repo. You can create a mirror PR on a pytorch branch in order to do the testing. Let me know if you need more help with the instructions.

for example_inputs, example_outputs in zip(
list_example_inputs, list_example_outputs
):
output_tensors = [torch.empty_like(output) for output in example_outputs]
Contributor:

In the prod scenario, do we allocate a list of tensors with the max batch size and reuse those? If yes, we need to test that here.

Author:

No, we don't reuse output tensors. Instead, we allocate output tensors for each inference run and return them to the caller of the forward function.
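The allocation strategy described in that reply can be sketched as follows: fresh output buffers per run, sized to the current (possibly dynamic) shapes, never reused across runs. Plain lists stand in for torch tensors so the sketch stays self-contained; all names here are illustrative, not the PR's API.

```python
def run_inference(model_fn, inputs, output_shapes):
    # fresh buffers sized to the *current* shapes of this run,
    # rather than buffers preallocated at the maximum batch size
    outputs = [[0.0] * (rows * cols) for rows, cols in output_shapes]
    model_fn(inputs, outputs)  # kernels write into the new buffers
    return outputs             # ownership passes to the caller

def fake_model(inputs, outputs):
    # stand-in "model" that just fills every output buffer with ones
    for buf in outputs:
        for i in range(len(buf)):
            buf[i] = 1.0
```

Two consecutive runs return distinct buffers, so the caller never observes one run's outputs being overwritten by the next.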




size_t num_inputs,
AOTInductorTensorHandle outputs_handle,
size_t num_outputs,
AOTInductorParamShape* output_shapes,
Contributor:

We need to revisit this for ABI compatibility, but it's OK for now.


chenyang78 added a commit to chenyang78/pytorch that referenced this pull request Sep 13, 2023
Summary:

This PR adds dynamic-shape support for AOTInductor

Reviewed By: khabinov

Differential Revision: D49100472

@chenyang78 (Contributor Author) commented:

(quoting @desertfire's review comment above)

Running it: https://github.com/pytorch/pytorch/actions/runs/6170063555

Hopefully I was doing it right. Thanks.

@desertfire (Contributor) commented:

(quoting the exchange above)

Given the long task queueing time, I ran this PR locally to verify. I haven't checked every model, but the results look good to me. I am ok with landing it.

Summary:

This PR adds dynamic-shape support for AOTInductor

Reviewed By: frank-wei, khabinov

Differential Revision: D49100472


@chenyang78 (Contributor Author) commented:

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Sep 14, 2023
@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator


@facebook-github-bot (Contributor) commented:

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

