
[nvfuser_upstream_push] nvfuser code base bump 052422 #78244

Closed
wants to merge 808 commits

Conversation

@jjsjann123 (Collaborator) commented May 25, 2022

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:

  1. Initial support of cp.async and cp.async.wait: Async gmem copy support on sm80+ csarofeen/pytorch#1619
  2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: Add turing mma support and test csarofeen/pytorch#1643
  3. Extending the infrastructure to support mma operators on turing and ampere arch: Mma op integration on ampere csarofeen/pytorch#1440
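
These changes live in the nvfuser codegen backend rather than in a new Python API; from Python they are reached through the TorchScript fuser. Below is a minimal, illustrative sketch (not taken from this PR) of driving nvfuser from a script, assuming an sm80+ GPU and that `torch._C._jit_set_nvfuser_enabled` is still the toggle in your build; whether the new cp.async/mma paths are actually selected is entirely up to the scheduler.

```python
import torch

# Hedged sketch: route a scripted function through nvfuser so the updated
# codegen paths *may* be exercised on supported GPUs. The private toggle
# below is an assumption about the PyTorch build in use.
torch._C._jit_set_nvfuser_enabled(True)

@torch.jit.script
def pointwise_chain(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.relu(a * b + 1.0)

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
    y = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
    for _ in range(3):  # profiling runs let the JIT build and compile the fusion
        out = pointwise_chain(x, y)
```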

Commits that are actually in this PR from the csarofeen branch:

* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)

naoyam and others added 30 commits January 6, 2022 18:35
 This PR relaxes the constraint so that arbitrary padding sizes can be used as long as output domains don't get larger than input domains.

* Implement alias_copy operations only for CudaFusionGroup to support fallback path
* Remove alias (a) annotation from alias_copy schema
* force segment un-connected graphs

* derive heuristic on empty groups

* add test

* lint

* handled aliased output in batchnorm

* empty tensor

* lint and comment

* clang format

* check reference tv available in pointwise scheduler

* comment

* cleanup test and check utils
* Have Kernel Inherit IrContainer (pytorch#1375)
* Kernel<-Fusion Step 1 - Convert ExprSort to StmtSort (pytorch#1376)
* Kernel<-Fusion Step 2 - Mutator refactor (pytorch#1377)
* Kernel<-Fusion Step 3 - Debug print for expr_eval and type promotion fix (pytorch#1379)
* Kernel<-Fusion Step 4 - Have kernel inherit Fusion (pytorch#1380)
* Kernel<-Fusion Step 5 - Move lowering passes into their own files (pytorch#1382)
* Kernel<-Fusion Step 6 - Remove kir::IrBuilder (pytorch#1383)
* Kernel<-Fusion Step 7 - Remove kir functions from ComputeAtMap (pytorch#1384)
* Kernel<-Fusion Step 8 - Clean up [lower/executor] utils (pytorch#1387)
* Kernel<-Fusion Step 9 - Remove TensorView::fuserTv (pytorch#1388)
* Kernel<-Fusion Step 10 - Remove lowerVal/lowerExpr (pytorch#1389)
* Kernel<-Fusion Step 11 - Finish cleaning up kir (pytorch#1390)
Adds TensorView::doubleBuffer(). See the new tests for how it is used.

For an overview of the lowering algorithm, please see lower_double_buffer.h.
1. extend buildShapeExpression for squeeze_copy/unsqueeze_copy ops.
2. patching broadcastSizes insertion point for buildShapeExpression to avoid graph::copy() linter assert.
3. adding tests
4. supports no-op squeeze (squeezing on a dimension that is not size-1); a brief eager-mode illustration follows the TODO list below

TODO (in follow up PRs):
1. extend buildShapeExpression to view_copy and reshape_copy as well
2. refactor broadcastSizesExpression to allow graceful failure instead of hard assert
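
As a side note on item 4 above, the no-op squeeze semantics mirror eager PyTorch: squeezing a dimension whose size is not 1 leaves the shape unchanged. A small eager-mode illustration (not code from this PR):

```python
import torch

t = torch.randn(3, 4)
assert t.squeeze(0).shape == (3, 4)       # dim 0 has size 3, so squeeze is a no-op
assert t.unsqueeze(0).shape == (1, 3, 4)  # unsqueeze always inserts a size-1 dim

s = torch.randn(1, 3, 4)
assert s.squeeze(0).shape == (3, 4)       # a real size-1 dim is removed
```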
* Verify vectorization eligibility for intermediate tensors
Used to avoid unnecessary parallel broadcast operations.

Needed to clean up the thread predicate info passed to Kernel. The only
necessary information is which parallel types are in fact parallel in
broadcast ops, so a map holding that information is added to
KernelSummary. ThreadPredicateMap is dropped from Kernel.


Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
Fixes pytorch#1311

A scalar tensor is defined as a rank-0, size-1 tensor. PyTorch eager mode (mostly TensorIterator) supports device promotion of cpu scalar tensors, where cross-device tensors (a cpu scalar tensor alongside cuda tensors) can feed a single operator and the cpu scalar tensor is promoted to a scalar.

We extended this support to nvfuser. A few changes were required to support this:

An API, is_scalar, to query whether a given tensor is indeed a scalar tensor. The current criteria are tensor rank and size (utils.h & utils.cpp).
An update to the partition logic so that the device of a cpu scalar tensor is ignored. This should keep us from accidentally merging an operator whose inputs are two cpu scalar tensors.
Integration code updated:
i. maps a TS cpu scalar tensor onto a codegen scalar;
ii. skips the usual tensor checks (vectorization / valid inputs) for cpu scalar tensors;
iii. kernel arguments extract the scalar value from the cpu scalar tensor.
cpu scalar tests. Need to verify: 1. cpu scalar tensor with gpu tensor; 2. cpu scalar tensor with cpu scalar tensor; 3. cpu scalar tensor with cpu tensor; 4. cpu tensor with gpu scalar tensor.
Note that we briefly tried the alternative approach of moving the cpu scalar tensor to a gpu scalar tensor. The implementation is very straightforward, but cuda tensor creation and copy are really slow; hence the motivation to extract it into a scalar argument. More details in the issue pytorch#1311.
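
For context, the eager-mode behavior being matched can be demonstrated directly; the scripted variant below is only a hedged sketch of the kind of pattern nvfuser can now take, not a test from this PR:

```python
import torch

if torch.cuda.is_available():
    cpu_scalar = torch.tensor(2.0)            # rank-0, size-1 tensor on the CPU
    gpu = torch.randn(8, 8, device="cuda")

    # Eager mode (TensorIterator) already promotes the cpu scalar tensor, so
    # mixing devices like this is legal and the result stays on the GPU.
    out = gpu * cpu_scalar
    assert out.device.type == "cuda"

    # With this change, a scripted function with the same pattern can be fused:
    # the cpu scalar tensor is passed to the generated kernel as a plain scalar
    # argument instead of being copied to the GPU.
    @torch.jit.script
    def scale(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        return x * s

    for _ in range(3):
        out2 = scale(gpu, cpu_scalar)
```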
* Set nondet_tol to 1e-5 for gradcheck (see the sketch after this list)
* Fix random seed for deterministic results
* Implement operations to support AOTAutograd
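
A minimal sketch of how those gradcheck settings are typically used in a test (illustrative only; the actual op, device, and tolerances in this PR's tests may differ):

```python
import torch

torch.manual_seed(0)  # fixed seed for deterministic results

def f(x):
    return (x * x).sum()

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 4, dtype=torch.double, device=device, requires_grad=True)

# nondet_tol tolerates small nondeterministic differences between repeated
# evaluations, which fused CUDA kernels can exhibit.
torch.autograd.gradcheck(f, (x,), nondet_tol=1e-5)
```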
@jjsjann123 (Collaborator, Author):

Hmmm, GitHub is sometimes weird. Merging master doesn't really give me any conflict 🤷

@facebook-github-bot (Contributor):

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jjsjann123 (Collaborator, Author):

Sorry for the repetitive ping; wondering if there's any update on the internal tests.

Trying to start our next push sooner rather than later to avoid 10k+ lines of code changes.

@davidberard98 (Contributor) commented Jun 6, 2022

@jjsjann123 sorry for the delays... unfortunately I'm still trying to land this internally; there are a few flaky jobs that have been blocking me from landing.

@facebook-github-bot (Contributor):

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):

Merge failed due to list index out of range
Raised by https://github.com/pytorch/pytorch/actions/runs/2456816650

@jjsjann123 (Collaborator, Author) commented Jun 7, 2022

Merge failed due to list index out of range Raised by https://github.com/pytorch/pytorch/actions/runs/2456816650

@davidberard98 In case you missed this failure.

Strangely the job status shows success.... 😕

@osalpekar (Member):

@pytorchbot merge

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):

Merge failed due to list index out of range
Raised by https://github.com/pytorch/pytorch/actions/runs/2457813958

@osalpekar (Member):

@pytorchbot merge -f

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):

Merge failed due to list index out of range
Raised by https://github.com/pytorch/pytorch/actions/runs/2457838584

@jjsjann123 (Collaborator, Author):

Not sure if it's the same error: #77471 (comment)

@malfet (Contributor) commented Jun 7, 2022

This is weird, but I can reproduce the failure locally. Debugging

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):

Merge failed due to list index out of range
Raised by https://github.com/pytorch/pytorch/actions/runs/2457838584

@pytorch pytorch deleted a comment from pytorch-bot bot Jun 7, 2022
@malfet (Contributor) commented Jun 7, 2022

Not sure if it's the same error: #77471 (comment)

Yes, it is the same error; it looks like the GitHub API cannot return more than the first 250 commits. @jjsjann123 is it possible to create those PRs with fewer commits? I will file a support ticket against GH and will manually merge, but this is not maintainable long term.

@jjsjann123 (Collaborator, Author):

Not sure if it's the same error: #77471 (comment)

Yes, it is the same error; it looks like the GitHub API cannot return more than the first 250 commits. @jjsjann123 is it possible to create those PRs with fewer commits? I will file a support ticket against GH and will manually merge, but this is not maintainable long term.

errr... I personally hated divergent history but I lost the battle a while ago on that topic with our internal devs 😝

I can squash everything to a single commit for future PRs. I think that should be fine since we are logging all commits in the description as well.

@malfet closed this in 9e52ad2 on Jun 8, 2022
@github-actions bot commented Jun 8, 2022

Hey @jjsjann123.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jun 8, 2022
Summary:
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: csarofeen#1440

Commits that are actually in this PR from the csarofeen branch:
```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```

Pull Request resolved: #78244

Reviewed By: ejguan

Differential Revision: D36678948

Pulled By: davidberard98

fbshipit-source-id: 0ccde965acbd31da67d99c6adb2eaaa888948105
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Pull Request resolved: pytorch/pytorch#78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Labels: ciflow/trunk, cla signed, oncall: jit, open source, triaged