[nvFuser] Working towards reductions, codegen improvements #40864
Conversation
Fixing the bugs that Kevin found:
* Small alloc fix.
* Add another reduction example, change fusion printMath.
* Small test fix.
* Change Reduction4 test to use TIDx.x
* Rework allocation and buffer initialization, as init could be placed before alloc. Add lots of comments.
* Fix bug in index compute when replaying reduction transformations for buffer initialization.
* RFactor fix when root domain is reduction but has no rfactor axis.
* Val isConst fix.
* Update remote repo to local repo.

Co-authored-by: jiej <jiej@nvidia.com>
On behalf of @csarofeen: Working on breaking up the lowering logic to be more incremental and easier to follow. This is in preparation to fix predicates after reductions and in combination with following broadcasts. This replaces #65 for the updated base branch.
* Start commenting lowering better, break the indexing pass into its own class.
* Refactor lowering to break up passes and make logic more incremental.
* Remove commented-out code.

Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
On behalf of @naoyam: blockReduce has a bug when X_THREAD=true, Y_THREAD=false, Z_THREAD=true. This PR adds a test case that exposes the bug as well as a fix.
* Add a new test case that hits a bug in blockReduce
* Fix a bug in blockReduce
* clang-format
* Rename test function to avoid a conflict

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
On behalf of @naoyam: Add yet another reduction test. This test exercises a bug that was recently fixed; here's the fix: cab53ac. Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
* Rewrite ExpressionEvaluator to use IterVisitor
* Rename variables per review comments

Co-authored-by: jiej <jiej@nvidia.com>
Start working on the issue of not predicating based on threads that were used to parallelize a reduction.
* Add eq to arith.h
* Initial pass at thread predicates for ops after parallelized reductions.
* Remove erroneous print statement.
* Update variable name in reference code in cpp tests
* clang-tidy
* Fix typo
* clang-format
* clang-tidy again

Co-authored-by: jiej <jiej@nvidia.com>
Fix CI configuration
Support parallel reductions across thread blocks Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
* [nvfuser] Debug flag via env var: allow setting the environment variable PYTORCH_CUDA_FUSER_DEBUG to print the generated CUDA kernel as a string.
* Remove extra new lines in fusion printMath.
* Fix bug in predicate generation.
* Add fusion.printKernel()
* Fix potential segfault in unroll pass.
* Simplify a few test cases: replace custom exception checks with ASSERT_THROW macros.
* ExpressionEvaluator
* Stricter EvaluationContext binding rules: 1. Don't allow overwriting concrete values. 2. Don't allow binding values to expression results.
* Fix clang-format errors
* Switch to Int::ScalarType: the expression evaluator now uses Int::ScalarType instead of plain int.
* Avoid a fight with clang-tidy
* Check the numbers of kernel input and output parameters
* Add an optional arc from TensorView to its root domain; this is generated for detail_level >= DetailLevel::Explicit
* Check kernel arguments
* Prefer pointers over references
* Bug fix
* Fix accidental construction of IValue
* Use noReduction
* Add const to const pointer
* Make an integer tensor an error, as it is not yet supported
* clang-tidy
* Incorporate review feedback
* Add lerp support in parser
* Add missing addcmul parser and tests
* clang-format
* Return TensorView* from binary/compound/ternary ops
* clang-format
* Use TensorView* param in reductionOp and sum
* Prefer as instead of static_cast
* Transform replay refactor (#53): the goal of this work is to have the transformation history be specific to IterDomains instead of TensorDomains. This should make it a lot easier to match up IterDomains during replay, which can be complicated when taking into consideration reduction axes, rfactors, and broadcast axes. Co-authored-by: Jie <jiej@nvidia.com> Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com>
* Python test fixes (#52): 1. Put Fusion inside cudaKernel to facilitate runtime arg check. 2. Relax rank check for broadcast support in integration. 3. Add shape propagation for newly added operations: [addcmul, lerp]. 4. Add a utility function to create FusionGuard from CudaKernel directly.
* [nvFuser] Add torch.jit.fuser context manager (pytorch#38993) (#54). Summary: 1. The `torch.jit.fuser(str)` context manager facilitates switching between backend fusers: 'fuser0' enables only the legacy fuser; 'fuser1' enables only NNC; 'fuser2' enables only nvFuser. 2. Clean up updated python tests. Pull Request resolved: pytorch#38993 Reviewed By: nairbv, pbelevich Differential Revision: D21800620 Pulled By: soumith fbshipit-source-id: 7fe855f5a5b97368e5e84c98c28d04b2e1276c85
* Add another reduction example, change fusion printMath.
* Small test fix.
* Change Reduction4 test to use TIDx.x
* Minor cleanup.
* Clean up some noexcepts.
* More cleanup.
* Refactor computeAt, get first broadcast example working.
* Validate first non-trivial broadcast kernel.
* Fix replay when broadcast is merged with non-broadcast dim.
* Add constness in replay and index compute.
* Add another broadcast test. Rework index computation for producers, based on consumer computed indices.
* Val isConst fix.
* Add dot product gemm example.
* Clang.
* Minor bug fixes.
* Format and add comments to GEMM test.
* WIP: Fix for enabling broadcast after reduction plus a Softmax test (#66): cleaner way of fixing checks for matching non-broadcast dims to non-reduction dims. Co-authored-by: Kevin Stephano <kstephano@nvidia.com> Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
* Back out bad merge conflict resolutions.
* More post-rebase cleanup.
* Re-fix a few tests, some from a bad rebase.
* Address comments.
* Missed some review comments.
* tmp

Co-authored-by: Lemo <lemo1234@gmail.com> Co-authored-by: Naoya Maruyama <maruyama3@llnl.gov> Co-authored-by: Jie <jiej@nvidia.com> Co-authored-by: Kevin Stephano <kevin.stephano@gmail.com> Co-authored-by: Kevin Stephano <kstephano@nvidia.com>
* Fix replace size when a reduction dim is not innermost.
* clang-tidy.
* Remove print statement in test.
This fixes #85:
1. Fix Fusion::validateInputs()
2. Clean up Fusion::~Fusion()
Fix for issue #88
* Fix MSVC (Windows) build:
  ..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '1': conversion from 'size_t' to '_Ty' requires a narrowing conversion with [ _Ty=int64_t ]
  ..\test\cpp\jit\test_gpu.cpp(2609): error C2398: Element '2': conversion from 'size_t' to '_Ty' requires a narrowing conversion with [ _Ty=int64_t ]
* Fix newForReduction()
Allow partition registration to exclude nodes from fusion.
Move computeAt logic to separate file/class. In progress reworking of computeAt logic.
* Fix clang++ warnings.
* Test fix.
Looks like we are preparing upstream PR.
Approval stamp
@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: It's a known gcc-5.4 bug that enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash for the key type. Should fix the regression caused by #40864. Pull Request resolved: #41055 Differential Revision: D22405478 Pulled By: malfet fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
Summary: Basic reduction fusion is working, and the code generator has been improved to approach the performance of eager-mode reductions. Coming soon are pointwise-reduction fusions, done in a way that should prevent the possibility of hitting regressions. We are also working on performant softmax kernels in the code generator, which may be our next fusion target. Pull Request resolved: pytorch/pytorch#40864 Reviewed By: ngimel Differential Revision: D22392877 Pulled By: soumith fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447