[nvfuser_upstream_push] nvfuser code base bump 052422 #78244
Conversation
This PR relaxes the constraint so that arbitrary padding sizes can be used as long as output domains don't get larger than input domains.
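The relaxed constraint can be sketched as a simple per-dimension check. This is a minimal illustration only, not nvfuser's actual implementation; the function name and the shape-tuple representation are assumptions.

```python
def padding_is_valid(input_extents, output_extents):
    """Hypothetical sketch of the relaxed rule: arbitrary padding sizes
    are accepted as long as no output domain grows larger than the
    corresponding input domain."""
    return all(out <= inp for inp, out in zip(input_extents, output_extents))
```

For example, an op that shrinks a domain from 8 to 6 passes the check, while one that would grow it from 8 to 9 is rejected.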
* force segment un-connected graphs
* derive heuristic on empty groups
* add test
* lint
* handled aliased output in batchnorm
* empty tensor
* lint and comment
* clang format
* check reference tv available in pointwise scheduler
* comment
* cleanup test and check utils
* Have Kernel Inherit IrContainer (pytorch#1375)
* Kernel<-Fusion Step 1 - Convert ExprSort to StmtSort (pytorch#1376)
* Kernel<-Fusion Step 2 - Mutator refactor (pytorch#1377)
* Kernel<-Fusion Step 3 - Debug print for expr_eval and type promotion fix (pytorch#1379)
* Kernel<-Fusion Step 4 - Have kernel inherit Fusion (pytorch#1380)
* Kernel<-Fusion Step 5 - Move lowering passes into their own files (pytorch#1382)
* Kernel<-Fusion Step 6 - Remove kir::IrBuilder (pytorch#1383)
* Kernel<-Fusion Step 7 - Remove kir functions from ComputeAtMap (pytorch#1384)
* Kernel<-Fusion Step 8 - Clean up [lower/executor] utils (pytorch#1387)
* Kernel<-Fusion Step 9 - Remove TensorView::fuserTv (pytorch#1388)
* Kernel<-Fusion Step 10 - Remove lowerVal/lowerExpr (pytorch#1389)
* Kernel<-Fusion Step 11 - Finish cleaning up kir (pytorch#1390)
Adds TensorView::doubleBuffer(). See the new tests for how it is used. For an overview of the lowering algorithm, please see lower_double_buffer.h.
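The general pattern that double buffering enables can be sketched in plain Python: while one buffer is being consumed, the next tile is fetched into the other, overlapping load and compute. This is only a conceptual sketch of the technique, not nvfuser's lowering; `load_tile` and the function name are hypothetical.

```python
def double_buffered_sum(load_tile, num_tiles):
    """Sketch of the double-buffering pattern: alternate between two
    buffers, prefetching tile i+1 while consuming tile i."""
    buffers = [None, None]
    buffers[0] = load_tile(0)  # prologue: prefetch the first tile
    total = 0
    for i in range(num_tiles):
        if i + 1 < num_tiles:
            # prefetch the next tile into the buffer not in use
            buffers[(i + 1) % 2] = load_tile(i + 1)
        # compute on the current buffer
        total += sum(buffers[i % 2])
    return total
```

In a real GPU kernel the prefetch is an asynchronous copy into shared memory, so the load latency is hidden behind the compute on the other buffer.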
1. Extend buildShapeExpression for squeeze_copy/unsqueeze_copy ops.
2. Patch the broadcastSizes insertion point for buildShapeExpression to avoid the graph::copy() linter assert.
3. Add tests.
4. Support no-op squeeze (squeezing a dimension that's not size-1).
TODO (in follow-up PRs):
1. Extend buildShapeExpression to view_copy and reshape_copy as well.
2. Refactor broadcastSizesExpression to allow graceful failure instead of a hard assert.
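The no-op squeeze semantics in item 4 can be illustrated with a small shape-level sketch, mirroring `torch.squeeze(dim)` behavior. The helper name and list-of-ints shape representation are assumptions for illustration.

```python
def squeeze_shape(shape, dim):
    """Sketch of squeeze-on-a-dimension semantics: squeezing a
    dimension that is not size-1 is a no-op; a size-1 dimension
    is removed from the shape."""
    if shape[dim] != 1:
        return list(shape)  # no-op: dimension is not size-1
    return list(shape[:dim]) + list(shape[dim + 1:])
```

This is the behavior the shape expression has to reproduce symbolically when the size of the squeezed dimension is not known at trace time.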
* Verify vectorization eligibility for intermediate tensors
Used to avoid unnecessary parallel broadcast operations. The thread predicate info passed to Kernel needed cleanup: the only necessary information is which parallel types are in fact parallel in broadcast ops, so a map holding that information is added to KernelSummary, and ThreadPredicateMap is dropped from Kernel. Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
Fixes pytorch#1311. A scalar tensor is defined as a rank-0, size-1 tensor. PyTorch eager (mostly TensorIterator) supports device promotion of CPU scalar tensors: cross-device tensors (a CPU scalar tensor and CUDA tensors) can feed a single operator, and the CPU scalar tensor is promoted to a scalar. We extended this support to nvfuser. A few changes were required to support this:
1. An API to query whether a given tensor is indeed a scalar tensor, is_scalar. The current criteria are tensor rank and size (utils.h & utils.cpp).
2. An update to the partition logic so that the device of a CPU scalar tensor is ignored. This should avoid accidentally merging an operator of two CPU scalar tensors.
3. Integration code updates: i. map a TS CPU scalar tensor into a codegen scalar; ii. skip the usual tensor checks (vectorization / valid inputs) for CPU scalar tensors; iii. kernel arguments extract the scalar value from the CPU scalar tensor.
4. CPU scalar tests. We need to verify: 1. CPU scalar tensor with GPU tensor; 2. CPU scalar tensor with CPU scalar tensor; 3. CPU scalar tensor with CPU tensor; 4. CPU tensor with GPU scalar tensor.
Note that we briefly tried the alternative approach of moving the CPU scalar tensor to a GPU scalar tensor. The implementation is very straightforward, but a CUDA tensor creation and copy is really slow, hence the motivation to extract it into a scalar argument. More details in issue pytorch#1311.
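The rank-and-size criterion described for `is_scalar` can be sketched at the shape level. This is an illustrative stand-in, not the actual C++ helper in utils.h; the function name and tuple-shape representation are assumptions.

```python
import math

def is_scalar_tensor(shape):
    """Sketch of the is_scalar criterion described above: a scalar
    tensor is rank-0 (empty shape) with exactly one element."""
    return len(shape) == 0 and math.prod(shape) == 1
```

Note that a size-1 rank-1 tensor (shape `(1,)`) is not a scalar under this criterion, since its rank is 1.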
Upstream master bump 012622
* Set nondet_tol to 1e-5 for gradcheck
* Fix random seed for deterministic results
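The second bullet's idea, fixing the seed so a pseudorandom computation becomes reproducible across runs, can be shown with the standard library alone. This is only a generic illustration of the technique, not the gradcheck change itself; the function name is hypothetical.

```python
import random

def noisy_sample(seed, n=3):
    """Sketch: with a fixed seed, the 'nondeterministic' draw is
    reproducible, so repeated test runs compare equal results."""
    rng = random.Random(seed)  # seeded local generator, no global state
    return [rng.random() for _ in range(n)]
```

With a fixed seed, two runs produce identical values, so result comparisons only need to absorb genuine numerical nondeterminism (hence the separate `nondet_tol` bump).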
* Implement operations to support AOTAutograd
hmmm. github is sometimes weird. merging master doesn't really give me any conflict 🤷
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Sorry for the repetitive ping, but I'm wondering if there's any update on the internal tests. Trying to start our next push sooner rather than later to avoid 10k+ lines of code changes.
@jjsjann123 sorry for the delays. Unfortunately I'm still trying to run the land internally; there are a few flaky jobs that have been blocking me from landing.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
@pytorchbot successfully started a merge job. Check the current status here
Merge failed due to list index out of range
@davidberard98 In case you missed this failure: strangely, the job status shows success... 😕
@pytorchbot merge
@pytorchbot successfully started a merge job. Check the current status here
Merge failed due to list index out of range
@pytorchbot merge -f
@pytorchbot successfully started a merge job. Check the current status here
Merge failed due to list index out of range
Not sure if it's the same error: #77471 (comment)
This is weird, but I can reproduce the failure locally. Debugging
@pytorchbot successfully started a merge job. Check the current status here
Merge failed due to list index out of range
Yes, it is the same error. It looks like the GitHub API cannot return more than the first 250 commits. @jjsjann123 is it possible to create those PRs with fewer commits? I will file a support ticket against GH and will merge manually, but this is not maintainable long-term.
errr... I personally hated divergent history, but I lost that battle a while ago with our internal devs 😝 I can squash everything into a single commit for future PRs. I think that should be fine, since we are logging all commits in the description as well.
Hey @jjsjann123. |
Summary: Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate Ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on Turing and Ampere arch: csarofeen#1440
Commits that are actually in this PR from the csarofeen branch:
```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: #78244
Reviewed By: ejguan
Differential Revision: D36678948
Pulled By: davidberard98
fbshipit-source-id: 0ccde965acbd31da67d99c6adb2eaaa888948105