
[nvFuser] Working towards reductions, codegen improvements #40864

Closed
wants to merge 71 commits

Conversation

csarofeen
Contributor

Basic reduction fusion is working, and the code generator has been improved to approach the performance of eager-mode reductions. Coming soon are pointwise-reduction fusions, structured so that they should not introduce regressions. We are also working on performant softmax kernels in the code generator, which may be our next fusion target.
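For orientation, a reduction fusion of this kind is set up through the C++ test API. The following is only a rough sketch in that style; the helper `makeDummyTensor`, the header paths, and the namespace are assumptions modeled on the test code of this era, not a verbatim excerpt:

```cpp
// Rough sketch only: helper names, headers, and namespace are assumptions
// modeled on test_gpu.cpp from this era, not a verbatim excerpt.
#include <torch/csrc/jit/codegen/cuda/arith.h>
#include <torch/csrc/jit/codegen/cuda/fusion.h>

using namespace torch::jit::fuser;  // assumed namespace

void reductionFusionSketch() {
  Fusion fusion;
  FusionGuard fg(&fusion);

  // 2D input, summed along axis 1.
  TensorView* tv0 = makeDummyTensor(2);  // assumed test helper
  fusion.addInput(tv0);
  TensorView* tv1 = sum(tv0, {1});
  fusion.addOutput(tv1);

  // Split the reduction axis and bind the inner chunk to threadIdx.x so the
  // generated kernel performs a block reduction.
  tv1->split(1, 128);
  tv1->axis(-1)->parallelize(ParallelType::TIDx);
}
```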

csarofeen and others added 30 commits June 11, 2020 23:43
Fixing the bugs that Kevin finds.

* Small alloc fix.
* Add another reduction example, change fusion printMath.
* Small test fix.
* Change Reduction4 test to use TIDx.x
* Rework allocation and buffer initialization, as init could be placed before alloc. Add lots of comments.
* Fix bug in index compute when replaying reduction transformations for buffer initialization.
* RFactor fix when root domain is reduction but has no rfactor axis.
* Val isConst fix.
* update remote repo to local repo
Co-authored-by: jiej <jiej@nvidia.com>
On behalf of @csarofeen 

Working on breaking up the lowering logic to be more incremental and easier to follow. This is in preparation for fixing predicates after reductions and in combination with broadcasts that follow them.

This is to replace #65 for the updated base branch.

* Start commenting lowering better, break indexing pass into its own class.
* Refactor lowering to break up passes and make logic more incremental.
* removing commented code

Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
On behalf of @naoyam:

blockReduce has a bug when X_THREAD=true, Y_THREAD=false, Z_THREAD=true. This PR adds a test case that exposes the bug as well as a fix.
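To see why that flag combination is tricky, here is a host-side C++ sketch (not the actual runtime code from this PR) of the participation logic blockReduce has to get right: the flagged thread dimensions take part in a reduction, the unflagged ones index independent reductions, and mixing the two (X and Z participate, Y does not) is where the indexing is easiest to get wrong.

```cpp
#include <cstdio>

// Host-side sketch of the grouping logic; NOT the actual blockReduce runtime
// code. Threads that agree on reductionGroup() must be reduced together.
struct Dim3 {
  unsigned x, y, z;
};

unsigned reductionGroup(bool X_THREAD, bool Y_THREAD, bool Z_THREAD,
                        Dim3 tid, Dim3 bdim) {
  unsigned group = 0;
  unsigned stride = 1;
  // Only the non-participating dimensions distinguish independent reductions.
  if (!X_THREAD) { group += tid.x * stride; stride *= bdim.x; }
  if (!Y_THREAD) { group += tid.y * stride; stride *= bdim.y; }
  if (!Z_THREAD) { group += tid.z * stride; stride *= bdim.z; }
  return group;
}

int main() {
  Dim3 bdim{8, 4, 2};
  // X and Z participate, Y does not: threads that differ only in tid.y must
  // land in different groups, which is the case the new test exercises.
  std::printf("%u %u\n",
              reductionGroup(true, false, true, Dim3{3, 1, 1}, bdim),
              reductionGroup(true, false, true, Dim3{3, 2, 1}, bdim));
  return 0;
}
```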

* Add a new test case that hits a bug in blockReduce
* Fix a bug in blockReduce
* clang-format
* Rename test function to avoid a conflict

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
On behalf of @naoyam

Add yet another reduction test:
This test exercises a bug that was recently fixed. Here's the fix: cab53ac.

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
* Rewrite ExpressionEvaluator to use IterVisitor
* renaming variables per review comments

Co-authored-by: jiej <jiej@nvidia.com>
Start working on the issue of not predicating based on threads that were used to parallelize a reduction.

* Add eq to arith.h
* Initial pass at thread predicates for ops after parallelized reductions.
* Remove erroneous print statement.
* update variable name in reference code in cpp tests
* clang-tidy
* fixing typo
* clang format
* clang_tidy again

Co-authored-by: jiej <jiej@nvidia.com>
Support parallel reductions across thread blocks

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
1. Remove all references to TensorContiguity (fixes #83)
2. Various small pieces extracted from an experimental branch
* [nvfuser] Debug flag via env_var

Allow setting the environment variable PYTORCH_CUDA_FUSER_DEBUG to print the
generated CUDA kernel source as a string.
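A minimal sketch of how an environment-variable debug switch like this is typically checked on the C++ side (the actual implementation in this PR may differ):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Sketch only: if PYTORCH_CUDA_FUSER_DEBUG is set (and not "0"), dump the
// generated CUDA kernel source. The real check in this PR may differ.
bool isDebugDumpEnabled() {
  const char* flag = std::getenv("PYTORCH_CUDA_FUSER_DEBUG");
  return flag != nullptr && std::string(flag) != "0";
}

void maybeDumpKernel(const std::string& kernel_source) {
  if (isDebugDumpEnabled()) {
    std::cout << kernel_source << std::endl;
  }
}
```

For example, running a test with `PYTORCH_CUDA_FUSER_DEBUG=1` set in the environment would then print each generated kernel.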
* Remove extra new lines in fusion printMath.

* Fix bug in predicate generation.

* Add fusion.printKernel()

* Fix potential segfault in unroll pass.
* Simplify a few test cases

Replace custom exception checks with ASSERT_THROW macros.

* ExpressionEvaluator

* Stricter EvaluationContext binding rules

1. Don't allow overwriting concrete values
2. Don't allow binding values to expression results
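A minimal sketch of what the two binding rules above might look like in code (the class and member names here are illustrative, not the actual EvaluationContext implementation):

```cpp
#include <cstdint>
#include <stdexcept>
#include <unordered_map>

// Illustrative sketch of the two binding rules; not the real implementation.
struct Val {
  bool is_const = false;        // carries its own concrete value
  bool is_expr_result = false;  // defined by an expression, so it is computed
};

class EvaluationContextSketch {
 public:
  void bind(const Val* value, int64_t concrete) {
    // Rule 2: values produced by expressions are evaluated, never bound.
    if (value->is_expr_result) {
      throw std::runtime_error("cannot bind a value to an expression result");
    }
    // Rule 1: concrete values must not be overwritten.
    if (value->is_const) {
      throw std::runtime_error("cannot overwrite a concrete value");
    }
    bindings_[value] = concrete;
  }

 private:
  std::unordered_map<const Val*, int64_t> bindings_;
};
```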

* Fix clang-format errors

* Switch to Int::ScalarType

The expression evaluator is now using Int::ScalarType instead
of plain int.

* Avoid a fight with clang-tidy

* Check the numbers of kernel input and output parameters

* Add an optional arc from TensorView to its root domain

This is generated for detail_level >= DetailLevel::Explicit

* Checks kernel arguments

* Prefer pointers over references

* Bug fix

* Fix accidental construction of IValue

* Use noReduction

* Add const to const pointer

* Make an integer tensor an error as it is not yet supported

* clang-tidy

* Incorporate review feedback

* added lerp support in parser

* add missing addcmul parser and tests

* clang_format

* Return TensorView* from binary/compound/ternary ops

* clang-format

* Use TensorView* param in reductionOp and sum

* Prefer as instead of static_cast

* Transform replay refactor (#53)

The goal of this work is to make the transformation history specific to IterDomains instead of TensorDomains. This should make it much easier to match up IterDomains during replay, which can be complicated once reduction axes, rfactors, and broadcast axes are taken into consideration.

Co-authored-by: Jie <jiej@nvidia.com>
Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>

* python test fixes (#52)

Fix Python test failures:

1. Put Fusion inside CudaKernel to facilitate the runtime argument check.
2. Relax the rank check for broadcast support in integration.
3. Add shape propagation for the newly added operations: [addcmul, lerp].
4. Add a utility function to create a FusionGuard from a CudaKernel directly.

* [nvFuser] add torch.jit.fuser context manager (pytorch#38993) (#54)

Summary:
1. The `torch.jit.fuser(str)` context manager facilitates switching between backend fusers:
  str - 'fuser0' enables only the legacy fuser;
  str - 'fuser1' enables only NNC;
  str - 'fuser2' enables only nvFuser;
2. Clean up the updated Python tests.
Pull Request resolved: pytorch#38993

Reviewed By: nairbv, pbelevich

Differential Revision: D21800620

Pulled By: soumith

fbshipit-source-id: 7fe855f5a5b97368e5e84c98c28d04b2e1276c85

* Add another reduction example, change fusion printMath.

* Small test fix.

* Change Reduction4 test to use TIDx.x

* Minor cleanup.

* Clean up some noexcepts.

* More cleanup.

* Refactor computeAt, get first broadcast example working.

* Validate first non-trivial broadcast kernel.

* Fix replay when broadcast is merged with non-broadcast dim.

* Add constness in replay and index compute.

* Add another broadcast test. Rework index computation for producers, base on consumer computed indices.

* Val isConst fix.

* Add dot product gemm example.

* Clang.

* Minor bug fixes.

* Format and add comments to GEMM test.

* WIP: Fix for enabling broadcast after reduction plus a Softmax test. (#66)

* Fix for enabling broadcast after reduction plus a Softmax test.

* Cleaner way of fixing checks for matching non-broadcast dims to non-reduction dims.

* Clang.

Co-authored-by: Kevin Stephano <kstephano@nvidia.com>
Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>

* Backout bad merge conflict resolutions.

* More post rebase cleanup.

* Refix a few tests. Some from a bad rebase.

* Address comments.

* Missed some review comments.

* tmp

Co-authored-by: Lemo <lemo1234@gmail.com>
Co-authored-by: Naoya Maruyama <maruyama3@llnl.gov>
Co-authored-by: Jie <jiej@nvidia.com>
Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>
Co-authored-by: Kevin Stephano <kstephano@nvidia.com>
* Fix replace size when a reduction dim is not innermost.

* Clang tidy.

* Remove print statement in test.
This fixes #85

1. Fix Fusion::validateInputs()
2. Cleanup Fusion::~Fusion()
* Fix MSVC (Windows) build

..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '1': conversion from 'size_t' to '_Ty' requires a narrowing conversion
        with
        [
            _Ty=int64_t
        ]
..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '2': conversion from 'size_t' to '_Ty' requires a narrowing conversion
        with
        [
            _Ty=int64_t
        ]
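The error comes from brace-initializing an int64_t element from a size_t, which is a narrowing conversion that MSVC rejects outright. The offending line isn't reproduced here, but the pattern and the usual fix look roughly like this:

```cpp
#include <cstdint>
#include <vector>

int main() {
  size_t n = 16;

  // Rejected by MSVC with C2398: list-initialization does not allow the
  // narrowing conversion from size_t (unsigned) to int64_t (signed).
  // std::vector<int64_t> shape{1, n, n};

  // An explicit cast makes the conversion intentional and portable.
  std::vector<int64_t> shape{1, static_cast<int64_t>(n),
                             static_cast<int64_t>(n)};
  return static_cast<int>(shape.size());
}
```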

* Fix newForReduction()
Allow partition registration to exclude nodes from fusion.
Move computeAt logic to a separate file/class; in-progress rework of the computeAt logic.
* Clang++ warnings.

* Test fix.
@csarofeen csarofeen requested a review from apaszke as a code owner July 1, 2020 15:42
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Jul 1, 2020
@dr-ci

dr-ci bot commented Jul 1, 2020

CI failures summary (as of commit c7c4425): looks good so far, no failures.

Collaborator

@jjsjann123 jjsjann123 left a comment


Looks like we are preparing the upstream PR.
Approval stamp

Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@soumith merged this pull request in b9b4f05.

facebook-github-bot pushed a commit that referenced this pull request Jul 7, 2020
Summary:
It's a known gcc-5.4 bug that an enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash from the key type.
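The standard workaround is to supply the hash explicitly as the map's third template parameter. Illustrated below with a made-up enum; this is not the exact code from #41055:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Illustrative example of the workaround; the enum is made up and the exact
// code in #41055 may differ.
enum class ParallelKind { TIDx, TIDy, TIDz };

// gcc-5.4 does not generate std::hash for enum class keys, so provide the
// hash explicitly as the third template parameter of std::unordered_map.
struct EnumHash {
  template <typename T>
  std::size_t operator()(T t) const {
    return static_cast<std::size_t>(t);
  }
};

int main() {
  std::unordered_map<ParallelKind, std::string, EnumHash> names{
      {ParallelKind::TIDx, "threadIdx.x"}};
  return names.count(ParallelKind::TIDx) == 1 ? 0 : 1;
}
```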

Should fix regression caused by #40864

Pull Request resolved: #41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
csarofeen pushed a commit to csarofeen/pytorch that referenced this pull request Jul 7, 2020
Summary:
It's a known gcc-5.4 bug that an enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash from the key type.

Should fix regression caused by pytorch#40864

Pull Request resolved: pytorch#41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
@csarofeen csarofeen deleted the 20_6_11_devel branch June 9, 2021 13:37
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Have basic reduction fusion working, and have improved code generator to approach performance of eager mode reductions. Coming soon will be pointwise-reduction fusions in a way that should prevent the possibility of hitting regressions. Also working on performant softmax kernels in the code generator which may be our next fusion target.

Pull Request resolved: pytorch/pytorch#40864

Reviewed By: ngimel

Differential Revision: D22392877

Pulled By: soumith

fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
It's a known gcc-5.4 bug that an enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash from the key type.

Should fix regression caused by pytorch/pytorch#40864

Pull Request resolved: pytorch/pytorch#41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
Labels: Merged · oncall: jit (Add this issue/PR to JIT oncall triage queue) · open source

7 participants