
[nvFuser] Working towards reductions, codegen improvements #40864

Closed
wants to merge 71 commits

Conversation

csarofeen
Contributor

Basic reduction fusion is working, and the code generator has been improved to approach the performance of eager-mode reductions. Coming soon are pointwise-reduction fusions, structured so that they should not introduce regressions. We are also working on performant softmax kernels in the code generator, which may be our next fusion target.
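For orientation, a reduction fusion of this kind is set up through the C++ test API. The following is only a rough sketch in that style; the helper `makeDummyTensor`, the header paths, and the namespace are assumptions modeled on the test code of this era, not a verbatim excerpt:

```cpp
// Rough sketch only: helper names, headers, and namespace are assumptions
// modeled on test_gpu.cpp from this era, not a verbatim excerpt.
#include <torch/csrc/jit/codegen/cuda/arith.h>
#include <torch/csrc/jit/codegen/cuda/fusion.h>

using namespace torch::jit::fuser;  // assumed namespace

void reductionFusionSketch() {
  Fusion fusion;
  FusionGuard fg(&fusion);

  // 2D input, summed along axis 1.
  TensorView* tv0 = makeDummyTensor(2);  // assumed test helper
  fusion.addInput(tv0);
  TensorView* tv1 = sum(tv0, {1});
  fusion.addOutput(tv1);

  // Split the reduction axis and bind the inner chunk to threadIdx.x so the
  // generated kernel performs a block reduction.
  tv1->split(1, 128);
  tv1->axis(-1)->parallelize(ParallelType::TIDx);
}
```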

csarofeen and others added 30 commits June 11, 2020 23:43
Fixing the bugs that Kevin finds.

* Small alloc fix.
* Add another reduction example, change fusion printMath.
* Small test fix.
* Change Reduction4 test to use TIDx.x
* Rework allocation and buffer initialization, as init could be placed before alloc. Add lots of comments.
* Fix bug in index compute when replaying reduction transformations for buffer initialization.
* RFactor fix when root domain is reduction but has no rfactor axis.
* Val isConst fix.
* update remote repo to local repo
Co-authored-by: jiej <jiej@nvidia.com>
On behalf of @csarofeen 

Working on breaking up the lowering logic to be more incremental and easier to follow. This is in preparation for fixing predicates after reductions and in combination with broadcasts that follow them.

This is to replace #65 for the updated base branch.

* Start commenting lowering better, break indexing pass into its own class.
* Refactor lowering to break up passes and make logic more incremental.
* removing commented code

Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
On behalf of @naoyam:

blockReduce has a bug when X_THREAD=true, Y_THREAD=false, Z_THREAD=true. This PR adds a test case that exposes the bug as well as a fix.
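To see why that flag combination is tricky, here is a host-side C++ sketch (not the actual runtime code from this PR) of the participation logic blockReduce has to get right: the flagged thread dimensions take part in a reduction, the unflagged ones index independent reductions, and mixing the two (X and Z participate, Y does not) is where the indexing is easiest to get wrong.

```cpp
#include <cstdio>

// Host-side sketch of the grouping logic; NOT the actual blockReduce runtime
// code. Threads that agree on reductionGroup() must be reduced together.
struct Dim3 {
  unsigned x, y, z;
};

unsigned reductionGroup(bool X_THREAD, bool Y_THREAD, bool Z_THREAD,
                        Dim3 tid, Dim3 bdim) {
  unsigned group = 0;
  unsigned stride = 1;
  // Only the non-participating dimensions distinguish independent reductions.
  if (!X_THREAD) { group += tid.x * stride; stride *= bdim.x; }
  if (!Y_THREAD) { group += tid.y * stride; stride *= bdim.y; }
  if (!Z_THREAD) { group += tid.z * stride; stride *= bdim.z; }
  return group;
}

int main() {
  Dim3 bdim{8, 4, 2};
  // X and Z participate, Y does not: threads that differ only in tid.y must
  // land in different groups, which is the case the new test exercises.
  std::printf("%u %u\n",
              reductionGroup(true, false, true, Dim3{3, 1, 1}, bdim),
              reductionGroup(true, false, true, Dim3{3, 2, 1}, bdim));
  return 0;
}
```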

* Add a new test case that hits a bug in blockReduce
* Fix a bug in blockReduce
* clang-format
* Rename test function to avoid a conflict

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
On behalf of @naoyam

Add yet another reduction test:
This test exercises a bug that was recently fixed. Here's the fix: cab53ac.

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
* Rewrite ExpressionEvaluator to use IterVisitor
* renaming variables per review comments

Co-authored-by: jiej <jiej@nvidia.com>
Start working on the issue of not predicating based on threads that were used to parallelize a reduction.

* Add eq to arith.h
* Initial pass at thread predicates for ops after parallelized reductions.
* Remove erroneous print statement.
* update variable name in reference code in cpp tests
* clang-tidy
* fixing typo
* clang format
* clang_tidy again

Co-authored-by: jiej <jiej@nvidia.com>
Support parallel reductions across thread blocks

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
1. Remove all references to TensorContiguity (fixes #83)
2. Various small pieces extracted from an experimental branch
* [nvfuser] Debug flag via env_var

Allow setting the environment variable PYTORCH_CUDA_FUSER_DEBUG to print the
generated CUDA kernel source as a string.
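A minimal sketch of how an environment-variable debug switch like this is typically checked on the C++ side (the actual implementation in this PR may differ):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Sketch only: if PYTORCH_CUDA_FUSER_DEBUG is set (and not "0"), dump the
// generated CUDA kernel source. The real check in this PR may differ.
bool isDebugDumpEnabled() {
  const char* flag = std::getenv("PYTORCH_CUDA_FUSER_DEBUG");
  return flag != nullptr && std::string(flag) != "0";
}

void maybeDumpKernel(const std::string& kernel_source) {
  if (isDebugDumpEnabled()) {
    std::cout << kernel_source << std::endl;
  }
}
```

For example, running a test with `PYTORCH_CUDA_FUSER_DEBUG=1` set in the environment would then print each generated kernel.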
* Remove extra new lines in fusion printMath.

* Fix bug in predicate generation.

* Add fusion.printKernel()

* Fix potential segfault in unroll pass.
* Simplify a few test cases

Replace custom exception checks with ASSERT_THROW macros.

* ExpressionEvaluator

* Stricter EvaluationContext binding rules

1. Don't allow overwriting concrete values
2. Don't allow binding values to expression results
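A minimal sketch of what the two binding rules above might look like in code (the class and member names here are illustrative, not the actual EvaluationContext implementation):

```cpp
#include <cstdint>
#include <stdexcept>
#include <unordered_map>

// Illustrative sketch of the two binding rules; not the real implementation.
struct Val {
  bool is_const = false;        // carries its own concrete value
  bool is_expr_result = false;  // defined by an expression, so it is computed
};

class EvaluationContextSketch {
 public:
  void bind(const Val* value, int64_t concrete) {
    // Rule 2: values produced by expressions are evaluated, never bound.
    if (value->is_expr_result) {
      throw std::runtime_error("cannot bind a value to an expression result");
    }
    // Rule 1: concrete values must not be overwritten.
    if (value->is_const) {
      throw std::runtime_error("cannot overwrite a concrete value");
    }
    bindings_[value] = concrete;
  }

 private:
  std::unordered_map<const Val*, int64_t> bindings_;
};
```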

* Fix clang-format errors

* Switch to Int::ScalarType

The expression evaluator is now using Int::ScalarType instead
of plain int.

* Avoid a fight with clang-tidy

* Check the numbers of kernel input and output parameters

* Add an optional arc from TensorView to its root domain

This is generated for detail_level >= DetailLevel::Explicit

* Checks kernel arguments

* Prefer pointers over references

* Bug fix

* Fix accidental construction of IValue

* Use noReduction

* Add const to const pointer

* Make an integer tensor an error as it is not yet supported

* clang-tidy

* Incorporate review feedback

* added lerp support in parser

* add missing addcmul parser and tests

* clang_format

* Return TensorView* from binary/compound/ternary ops

* clang-format

* Use TensorView* param in reductionOp and sum

* Prefer as instead of static_cast

* Transform replay refactor (#53)

The goal of this work is to make the transformation history specific to IterDomains instead of TensorDomains. This should make it much easier to match up IterDomains during replay, which can be complicated once reduction axes, rfactors, and broadcast axes are taken into consideration.

Co-authored-by: Jie <jiej@nvidia.com>
Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>

* python test fixes (#52)

Fix Python test failures:

1. Put Fusion inside CudaKernel to facilitate the runtime argument check.
2. Relax the rank check for broadcast support in integration.
3. Add shape propagation for the newly added operations: [addcmul, lerp].
4. Add a utility function to create a FusionGuard from a CudaKernel directly.

* [nvFuser] add torch.jit.fuser context manager (pytorch#38993) (#54)

Summary:
1. The `torch.jit.fuser(str)` context manager facilitates switching between backend fusers:
  str - 'fuser0' enables only the legacy fuser;
  str - 'fuser1' enables only NNC;
  str - 'fuser2' enables only nvFuser;
2. Clean up the updated Python tests.
Pull Request resolved: pytorch#38993

Reviewed By: nairbv, pbelevich

Differential Revision: D21800620

Pulled By: soumith

fbshipit-source-id: 7fe855f5a5b97368e5e84c98c28d04b2e1276c85

* Add another reduction example, change fusion printMath.

* Small test fix.

* Change Reduction4 test to use TIDx.x

* Minor cleanup.

* Clean up some noexcepts.

* More cleanup.

* Refactor computeAt, get first broadcast example working.

* Validate first non-trivial broadcast kernel.

* Fix replay when broadcast is merged with non-broadcast dim.

* Add constness in replay and index compute.

* Add another broadcast test. Rework index computation for producers, base on consumer computed indices.

* Val isConst fix.

* Add dot product gemm example.

* Clang.

* Minor bug fixes.

* Format and add comments to GEMM test.

* WIP: Fix for enabling broadcast after reduction plus a Softmax test. (#66)

* Fix for enabling broadcast after reduction plus a Softmax test.

* Cleaner way of fixing checks for matching non-broadcast dims to non-reduction dims.

* Clang.

Co-authored-by: Kevin Stephano <kstephano@nvidia.com>
Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>

* Backout bad merge conflict resolutions.

* More post rebase cleanup.

* Refix a few tests. Some from a bad rebase.

* Address comments.

* Missed some review comments.

* tmp

Co-authored-by: Lemo <lemo1234@gmail.com>
Co-authored-by: Naoya Maruyama <maruyama3@llnl.gov>
Co-authored-by: Jie <jiej@nvidia.com>
Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>
Co-authored-by: Kevin Stephano <kstephano@nvidia.com>
* Fix replace size when a reduction dim is not innermost.

* Clang tidy.

* Remove print statement in test.
This fixes #85

1. Fix Fusion::validateInputs()
2. Cleanup Fusion::~Fusion()
* Fix MSVC (Windows) build

..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '1': conversion from 'size_t' to '_Ty' requires a narrowing conversion
        with
        [
            _Ty=int64_t
        ]
..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '2': conversion from 'size_t' to '_Ty' requires a narrowing conversion
        with
        [
            _Ty=int64_t
        ]
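The error comes from brace-initializing an int64_t element from a size_t, which is a narrowing conversion that MSVC rejects outright. The offending line isn't reproduced here, but the pattern and the usual fix look roughly like this:

```cpp
#include <cstdint>
#include <vector>

int main() {
  size_t n = 16;

  // Rejected by MSVC with C2398: list-initialization does not allow the
  // narrowing conversion from size_t (unsigned) to int64_t (signed).
  // std::vector<int64_t> shape{1, n, n};

  // An explicit cast makes the conversion intentional and portable.
  std::vector<int64_t> shape{1, static_cast<int64_t>(n),
                             static_cast<int64_t>(n)};
  return static_cast<int>(shape.size());
}
```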

* Fix newForReduction()
Allow partition registration to exclude nodes from fusion.
Move computeAt logic to a separate file/class; in-progress rework of the computeAt logic.
* Clang++ warnings.

* Test fix.
@csarofeen csarofeen requested a review from apaszke as a code owner July 1, 2020 15:42
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Jul 1, 2020
@dr-ci

dr-ci bot commented Jul 1, 2020

CI failures summary (as of commit c7c4425): looks good so far, no failures.

Collaborator

@jjsjann123 jjsjann123 left a comment


Looks like we are preparing the upstream PR.
Approval stamp

Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@soumith merged this pull request in b9b4f05.

facebook-github-bot pushed a commit that referenced this pull request Jul 7, 2020
Summary:
It's a known gcc-5.4 bug that an enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash from the key type.
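The standard workaround is to supply the hash explicitly as the map's third template parameter. Illustrated below with a made-up enum; this is not the exact code from #41055:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Illustrative example of the workaround; the enum is made up and the exact
// code in #41055 may differ.
enum class ParallelKind { TIDx, TIDy, TIDz };

// gcc-5.4 does not generate std::hash for enum class keys, so provide the
// hash explicitly as the third template parameter of std::unordered_map.
struct EnumHash {
  template <typename T>
  std::size_t operator()(T t) const {
    return static_cast<std::size_t>(t);
  }
};

int main() {
  std::unordered_map<ParallelKind, std::string, EnumHash> names{
      {ParallelKind::TIDx, "threadIdx.x"}};
  return names.count(ParallelKind::TIDx) == 1 ? 0 : 1;
}
```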

Should fix regression caused by #40864

Pull Request resolved: #41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
csarofeen pushed a commit to csarofeen/pytorch that referenced this pull request Jul 7, 2020
Summary:
It's a known gcc-5.4 bug that an enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash from the key type.

Should fix regression caused by pytorch#40864

Pull Request resolved: pytorch#41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
@csarofeen csarofeen deleted the 20_6_11_devel branch June 9, 2021 13:37
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Have basic reduction fusion working, and have improved code generator to approach performance of eager mode reductions. Coming soon will be pointwise-reduction fusions in a way that should prevent the possibility of hitting regressions. Also working on performant softmax kernels in the code generator which may be our next fusion target.

Pull Request resolved: pytorch/pytorch#40864

Reviewed By: ngimel

Differential Revision: D22392877

Pulled By: soumith

fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
It's a known gcc-5.4 bug that an enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash from the key type.

Should fix regression caused by pytorch/pytorch#40864

Pull Request resolved: pytorch/pytorch#41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
Labels: Merged · oncall: jit (Add this issue/PR to JIT oncall triage queue) · open source

7 participants