Infrastructure for a new CUDA Fuser #34785

csarofeen · 2020-03-15T18:49:53Z

Summary: This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide; however, the implementation is written from the ground up. The fusion pass itself is similar to the default CUDA fuser, but it has undergone some refactoring and uses the new code generation infrastructure. For those interested in how the code generation in this PR works, I would recommend reviewing test/cpp/jit/test_gpu_fusion.cpp as well as the long comment section at the beginning of torch/csrc/jit/codegen/cuda/transform_replay.h.

One of the largest differences between our approach and that of TVM/Halide is the concept of "TensorView". From a high level, a TensorView should be thought of similarly to how we think of working with Tensors in PyTorch. It's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at of TVM: they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.
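To make the effect of split/merge/reorder on dimensionality concrete, here is a minimal Python sketch. It is a toy model only and is not the C++ API added in this PR: the `Axis` class and the `split`, `merge`, `reorder`, and `extents` helpers are hypothetical names invented for illustration, and computeAt is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Axis:
    """One iteration axis of a toy tensor view (hypothetical, for illustration)."""
    name: str
    extent: int


def split(axes: List[Axis], pos: int, factor: int) -> List[Axis]:
    """Replace axis `pos` with an outer axis of ceil(extent / factor) and an inner axis of `factor`."""
    a = axes[pos]
    outer = Axis(a.name + ".o", -(-a.extent // factor))  # ceiling division
    inner = Axis(a.name + ".i", factor)
    return axes[:pos] + [outer, inner] + axes[pos + 1:]


def merge(axes: List[Axis], pos: int) -> List[Axis]:
    """Replace axes `pos` and `pos + 1` with one axis whose extent is their product."""
    a, b = axes[pos], axes[pos + 1]
    merged = Axis(a.name + "*" + b.name, a.extent * b.extent)
    return axes[:pos] + [merged] + axes[pos + 2:]


def reorder(axes: List[Axis], perm: List[int]) -> List[Axis]:
    """Permute axes; perm[i] is the old position of the axis that ends up at position i."""
    assert sorted(perm) == list(range(len(axes)))
    return [axes[p] for p in perm]


def extents(axes: List[Axis]) -> List[int]:
    return [a.extent for a in axes]


tv = [Axis("i", 6), Axis("j", 10)]   # a 2-D view of extents [6, 10]
merged = merge(tv, 0)                # 1-D: [60]
tiled = split(merged, 0, 32)         # 2-D again: [2, 32]
swapped = reorder(tiled, [1, 0])     # same axes, inner first: [32, 2]
```

None of these steps touches the tensor's data; they only change the loop nest used to visit it. In this toy model the split covers 2 * 32 = 64 iterations for 60 elements, so a code generator built on it would have to guard the extra iterations.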

Warning: This PR intentionally does not reach feature parity with the current fuser. We wanted to separate the infrastructure from the fusion capabilities. Once this is merged, smaller incremental PRs will be submitted to expand the capabilities of the fuser.

Short term goals:

Parity with current CUDA fuser (including performance):

- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout
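One common way to treat a broadcasted tensor as a tensor of the broadcasted size is to index it with a stride of 0 along each broadcast dimension. The Python sketch below illustrates that idea only; it is an assumption about how such handling can work and is not taken from this PR's generated code. The helper names `broadcast_strides` and `fused_add` are hypothetical.

```python
def broadcast_strides(shape, out_shape):
    """Row-major strides for `shape` when viewed at `out_shape`.

    A size-1 dimension that is expanded gets stride 0, so every index
    along that dimension reads the same element.
    """
    assert len(shape) == len(out_shape)
    strides = []
    running = 1
    for size, out in zip(reversed(shape), reversed(out_shape)):
        assert size == out or size == 1
        strides.append(0 if (size == 1 and out != 1) else running)
        running *= size
    return list(reversed(strides))


def fused_add(a, a_shape, b, b_shape, out_shape):
    """Pointwise add over a 2-D output; both inputs are indexed at the output size."""
    sa = broadcast_strides(a_shape, out_shape)
    sb = broadcast_strides(b_shape, out_shape)
    rows, cols = out_shape
    out = []
    for i in range(rows):
        for j in range(cols):
            out.append(a[i * sa[0] + j * sa[1]] + b[i * sb[0] + j * sb[1]])
    return out


a = [1, 2, 3, 4, 5, 6]   # flat storage for shape (2, 3)
b = [10, 20, 30]         # flat storage for shape (1, 3), broadcast along rows
result = fused_add(a, (2, 3), b, (1, 3), (2, 3))
```

The loop nest is written once for the output shape, and the broadcast input needs no separate code path.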

Mid-term goals:

- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
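As a rough illustration of what fusing a 1-D reduction with a pointwise operation means, the Python sketch below compares an unfused version, which materializes an intermediate, with a fused version that does both steps in one pass. The functions `unfused` and `fused` are hypothetical and say nothing about the kernels this fuser will actually emit.

```python
def unfused(xs):
    """Two separate passes: a pointwise ReLU writes an intermediate, then a sum reads it back."""
    tmp = [max(x, 0.0) for x in xs]   # pointwise step
    return sum(tmp)                   # reduction step


def fused(xs):
    """One pass: each element is transformed and accumulated without an intermediate buffer."""
    acc = 0.0
    for x in xs:
        acc += max(x, 0.0)
    return acc


data = [-1.0, 2.0, 3.5]
```

Both functions return the same value; the fused form avoids the round trip through the intermediate list, which is the saving a fused GPU kernel aims for in memory traffic.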

These tests have been disabled in OSS CI since #34785. Differential Revision: [D34436844](https://our.internmc.facebook.com/intern/diff/D34436844) [ghstack-poisoned]

1) remove test_jit_cuda_fuser from list of disabled tests 2) make the tests run on cpu (skip the tests instead of erroring) These tests have been disabled in OSS CI since #34785. ghstack-source-id: d54af41 Pull Request resolved: #73322

1) remove test_jit_cuda_fuser from list of disabled tests 2) make the tests run on cpu (skip the tests instead of erroring) These tests have been disabled in OSS CI since #34785. ghstack-source-id: 39ce824 Pull Request resolved: #73322

These tests have been disabled in OSS CI since #34785. This disables the windows tests, which currently aren't passing. Differential Revision: [D34436844](https://our.internmc.facebook.com/intern/diff/D34436844) [ghstack-poisoned]

Summary: Pull Request resolved: #73322 These tests have been disabled in OSS CI since #34785. Test Plan: Imported from OSS Reviewed By: eellison Differential Revision: D34436844 Pulled By: davidberard98 fbshipit-source-id: c5b14b33e7f369a6fa1e9cfbcb484a30dffc659e

Summary: Pull Request resolved: #73322 These tests have been disabled in OSS CI since #34785. Test Plan: Imported from OSS Reviewed By: eellison Differential Revision: D34436844 Pulled By: davidberard98 fbshipit-source-id: c5b14b33e7f369a6fa1e9cfbcb484a30dffc659e (cherry picked from commit b08f515)

Summary: same text as the PR description above. Pull Request resolved: pytorch/pytorch#34785 Reviewed By: ZolotukhinM Differential Revision: D20650977 Pulled By: soumith fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63

Summary: Build fix stemming from pytorch/pytorch#34785 Pull Request resolved: pytorch/pytorch#35917 Differential Revision: D20829353 Pulled By: soumith fbshipit-source-id: 4ba84ecedd354efbc9ac47c9b0f0e3871b404f13

csarofeen and others added 30 commits March 14, 2020 17:31

Generate split/merge/reorder nodes in the IR, fix up printing, and change tests to print just the fusion.

c9be830

Update dispatch for split/merge/reorder. Add comment on adding a Val/Expr.

f8db677

Fill out same_as operators.

6dfaa2b

Update mutator functions.

6177e45

Cleanup mutator, call removeExpr when setting a new origin Expr, and fix mutator test.

ec74f9e

Move to hierarchical dispatch. Add replaceAll mutator. Format ir.cpp.

3d12db2

Add comment on dispatch.

ce9f565

updated tests; minor changes on function returns

7e1669a

Debug spew cleanup.

2c18ffb

Add function Fusion::removeVal(..). Remove old val when replacing all. Clean up replaceAll test.

ea1f6e9

finished TensorContiguity::merge

c6c73d9

Move split/merge/reorder nodes to operate on Tensordomain. [WIP] Transform replay/compute_at.

c2aad3c

Comment tensor.h describing the difference between Tensor, TensorView, TensorDomain.

72dcc37

Add inline operator string printing.

d3d4504

Simple merge fix.

5fb80f1

Implement replay for compute_at operations. Add tests for it (print only for now).

2059792

A couple bugs introduced during cleanup of replay for compute_at.

7d0e8b7

Protect compute_at range from split/merge/reorder.

a92a86c

Quick fix calling tv->getComputeAtView() move reorder function before Tensor* functions.

5b97792

Add traversal from specific values.

63a8bce

Split/merge/reorder replaces tensor view in its origin operation.

bdc2497

Add a utility to check if a value is in the dependency hierarchy of another value.

ce496f0

Add quick note on compute_at and its axis/position.

1996379

Order of traversal was slightly off where lhs of an expression would be handled before rhs. Fixed now.

bfcb05d

Replace all instances of view on split/merge/reorder, otherwise dependency chain of arithmetic operations will be broken.

e8d8655

Fix reorder tests, as views now are hard replaced so can't be referenced after reorder ops. Unrelated changes in dependency test.

a56a3c8

Add dependency analysis, allow pulling dependencies between two values.

7299295

Parser implemented that lowers JIT IR to codegen IR

8112547

Recursive compute_at.

343fae9

Arith returns non const TensorViews.

3f821c4

davidberard98 added a commit that referenced this pull request Mar 3, 2022

Update on "[JIT] Enable NVFuser tests in CI"

45baf6f

These tests have been disabled in OSS CI since #34785. Differential Revision: [D34436844](https://our.internmc.facebook.com/intern/diff/D34436844) [ghstack-poisoned]

davidberard98 added a commit that referenced this pull request Mar 3, 2022

Update on "[JIT] Enable NVFuser tests in CI"

7f3cbd8

These tests have been disabled in OSS CI since #34785. Differential Revision: [D34436844](https://our.internmc.facebook.com/intern/diff/D34436844) [ghstack-poisoned]

davidberard98 added a commit that referenced this pull request Mar 7, 2022

Update on "[JIT] Enable NVFuser tests in CI"

730677e

These tests have been disabled in OSS CI since #34785. Differential Revision: [D34436844](https://our.internmc.facebook.com/intern/diff/D34436844) [ghstack-poisoned]

davidberard98 added a commit that referenced this pull request Mar 7, 2022

Update on "[JIT] Enable NVFuser tests in CI"

67617f2

These tests have been disabled in OSS CI since #34785. Differential Revision: [D34436844](https://our.internmc.facebook.com/intern/diff/D34436844) [ghstack-poisoned]

davidberard98 added a commit that referenced this pull request Mar 8, 2022

Update on "[JIT] Enable NVFuser tests in CI"

35cbae0

These tests have been disabled in OSS CI since #34785. Differential Revision: [D34436844](https://our.internmc.facebook.com/intern/diff/D34436844) [ghstack-poisoned]

12 participants
