
[nvfuser_upstream_push] nvfuser code base bump 052422 #78244

Closed
wants to merge 808 commits

Conversation

@jjsjann123 (Collaborator) commented May 25, 2022

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:

  1. Initial support of cp.async and cp.async.wait: Async gmem copy support on sm80+ csarofeen/pytorch#1619
  2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: Add turing mma support and test csarofeen/pytorch#1643
  3. Extending the infrastructure to support mma operators on turing and ampere arch: Mma op integration on ampere csarofeen/pytorch#1440
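
These changes live in the nvfuser codegen backend rather than in a new Python API; from Python they are reached through the TorchScript fuser. Below is a minimal, illustrative sketch (not taken from this PR) of driving nvfuser from a script, assuming an sm80+ GPU and that `torch._C._jit_set_nvfuser_enabled` is still the toggle in your build; whether the new cp.async/mma paths are actually selected is entirely up to the scheduler.

```python
import torch

# Hedged sketch: route a scripted function through nvfuser so the updated
# codegen paths *may* be exercised on supported GPUs. The private toggle
# below is an assumption about the PyTorch build in use.
torch._C._jit_set_nvfuser_enabled(True)

@torch.jit.script
def pointwise_chain(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.relu(a * b + 1.0)

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
    y = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
    for _ in range(3):  # profiling runs let the JIT build and compile the fusion
        out = pointwise_chain(x, y)
```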

Commits that are actually in this PR from the csarofeen branch:

* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)

naoyam and others added 30 commits January 6, 2022 18:35
 This PR relaxes the constraint so that arbitrary padding sizes can be used as long as output domains don't get larger than input domains.

* Implement alias_copy operations only for CudaFusionGroup to support fallback path
* Remove alias (a) annotation from alias_copy schema
* force segment un-connected graphs

* derive heuristic on empty groups

* add test

* lint

* handled aliased output in batchnorm

* empty tensor

* lint and comment

* clang format

* check reference tv available in pointwise scheduler

* comment

* cleanup test and check utils
* Have Kernel Inherit IrContainer (pytorch#1375)
* Kernel<-Fusion Step 1 - Convert ExprSort to StmtSort (pytorch#1376)
* Kernel<-Fusion Step 2 - Mutator refactor (pytorch#1377)
* Kernel<-Fusion Step 3 - Debug print for expr_eval and type promotion fix (pytorch#1379)
* Kernel<-Fusion Step 4 - Have kernel inherit Fusion (pytorch#1380)
* Kernel<-Fusion Step 5 - Move lowering passes into their own files (pytorch#1382)
* Kernel<-Fusion Step 6 - Remove kir::IrBuilder (pytorch#1383)
* Kernel<-Fusion Step 7 - Remove kir functions from ComputeAtMap (pytorch#1384)
* Kernel<-Fusion Step 8 - Clean up [lower/executor] utils (pytorch#1387)
* Kernel<-Fusion Step 9 - Remove TensorView::fuserTv (pytorch#1388)
* Kernel<-Fusion Step 10 - Remove lowerVal/lowerExpr (pytorch#1389)
* Kernel<-Fusion Step 11 - Finish cleaning up kir (pytorch#1390)
Adds TensorView::doubleBuffer(). See the new tests for how it is used.

For an overview of the lowering algorithm, please see lower_double_buffer.h.
1. extend buildShapeExpression for squeeze_copy/unsqueeze_copy ops.
2. patching broadcastSizes insertion point for buildShapeExpression to avoid graph::copy() linter assert.
3. adding tests
4. supports no-op squeeze (squeezing on a dimension that is not size-1); a brief eager-mode illustration follows the TODO list below

TODO (in follow up PRs):
1. extend buildShapeExpression to view_copy and reshape_copy as well
2. refactor broadcastSizesExpression to allow graceful failure instead of hard assert
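
As a side note on item 4 above, the no-op squeeze semantics mirror eager PyTorch: squeezing a dimension whose size is not 1 leaves the shape unchanged. A small eager-mode illustration (not code from this PR):

```python
import torch

t = torch.randn(3, 4)
assert t.squeeze(0).shape == (3, 4)       # dim 0 has size 3, so squeeze is a no-op
assert t.unsqueeze(0).shape == (1, 3, 4)  # unsqueeze always inserts a size-1 dim

s = torch.randn(1, 3, 4)
assert s.squeeze(0).shape == (3, 4)       # a real size-1 dim is removed
```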
* Verify vectorization eligibility for intermediate tensors
Used to avoid unnecessary parallel broadcast operations.

Needed to clean up the thread predicate info passed to Kernel. The only
necessary information is which parallel types are in fact parallel in
broadcast ops, so a map holding that information is added to
KernelSummary. ThreadPredicateMap is dropped from Kernel.


Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
Fixes pytorch#1311

A scalar tensor is defined as a rank-0, size-1 tensor. PyTorch eager mode (mostly TensorIterator) supports device promotion of cpu scalar tensors, where cross-device tensors (a cpu scalar tensor alongside cuda tensors) can feed a single operator and the cpu scalar tensor is promoted to a scalar.

We extended this support to nvfuser. A few changes were required to support this:

An API, is_scalar, to query whether a given tensor is indeed a scalar tensor. The current criteria are tensor rank and size (utils.h & utils.cpp).
An update to the partition logic so that the device of a cpu scalar tensor is ignored. This should keep us from accidentally merging an operator whose inputs are two cpu scalar tensors.
Integration code updated:
i. maps a TS cpu scalar tensor onto a codegen scalar;
ii. skips the usual tensor checks (vectorization / valid inputs) for cpu scalar tensors;
iii. kernel arguments extract the scalar value from the cpu scalar tensor.
cpu scalar tests. Need to verify: 1. cpu scalar tensor with gpu tensor; 2. cpu scalar tensor with cpu scalar tensor; 3. cpu scalar tensor with cpu tensor; 4. cpu tensor with gpu scalar tensor.
Note that we briefly tried the alternative approach of moving the cpu scalar tensor to a gpu scalar tensor. The implementation is very straightforward, but cuda tensor creation and copy are really slow; hence the motivation to extract it into a scalar argument. More details in the issue pytorch#1311.
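
For context, the eager-mode behavior being matched can be demonstrated directly; the scripted variant below is only a hedged sketch of the kind of pattern nvfuser can now take, not a test from this PR:

```python
import torch

if torch.cuda.is_available():
    cpu_scalar = torch.tensor(2.0)            # rank-0, size-1 tensor on the CPU
    gpu = torch.randn(8, 8, device="cuda")

    # Eager mode (TensorIterator) already promotes the cpu scalar tensor, so
    # mixing devices like this is legal and the result stays on the GPU.
    out = gpu * cpu_scalar
    assert out.device.type == "cuda"

    # With this change, a scripted function with the same pattern can be fused:
    # the cpu scalar tensor is passed to the generated kernel as a plain scalar
    # argument instead of being copied to the GPU.
    @torch.jit.script
    def scale(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        return x * s

    for _ in range(3):
        out2 = scale(gpu, cpu_scalar)
```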
* Set nondet_tol to 1e-5 for gradcheck (see the sketch after this list)
* Fix random seed for deterministic results
* Implement operations to support AOTAutograd
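
A minimal sketch of how those gradcheck settings are typically used in a test (illustrative only; the actual op, device, and tolerances in this PR's tests may differ):

```python
import torch

torch.manual_seed(0)  # fixed seed for deterministic results

def f(x):
    return (x * x).sum()

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 4, dtype=torch.double, device=device, requires_grad=True)

# nondet_tol tolerates small nondeterministic differences between repeated
# evaluations, which fused CUDA kernels can exhibit.
torch.autograd.gradcheck(f, (x,), nondet_tol=1e-5)
```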
@jjsjann123 (Collaborator, Author):

Hmmm, GitHub is sometimes weird. Merging master doesn't really give me any conflict 🤷

@facebook-github-bot (Contributor):

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jjsjann123 (Collaborator, Author):

Sorry for the repetitive ping; wondering if there's any update on the internal tests.

Trying to start our next push sooner rather than later to avoid 10k+ lines of code changes.

@davidberard98 (Contributor) commented Jun 6, 2022

@jjsjann123 sorry for the delays... unfortunately I'm still trying to land this internally; there are a few flaky jobs that have been blocking me from landing.

@facebook-github-bot (Contributor):

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):

Merge failed due to list index out of range
Raised by https://github.com/pytorch/pytorch/actions/runs/2456816650

@jjsjann123 (Collaborator, Author) commented Jun 7, 2022

Merge failed due to list index out of range Raised by https://github.com/pytorch/pytorch/actions/runs/2456816650

@davidberard98 In case you missed this failure.

Strangely the job status shows success.... 😕

@osalpekar (Member):

@pytorchbot merge

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):

Merge failed due to list index out of range
Raised by https://github.com/pytorch/pytorch/actions/runs/2457813958

@osalpekar (Member):

@pytorchbot merge -f

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):

Merge failed due to list index out of range
Raised by https://github.com/pytorch/pytorch/actions/runs/2457838584

@jjsjann123 (Collaborator, Author):

Not sure if it's the same error: #77471 (comment)

@malfet (Contributor) commented Jun 7, 2022

This is weird, but I can reproduce the failure locally. Debugging

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):

Merge failed due to list index out of range
Raised by https://github.com/pytorch/pytorch/actions/runs/2457838584

@pytorch pytorch deleted a comment from pytorch-bot bot Jun 7, 2022
@malfet (Contributor) commented Jun 7, 2022

Not sure if it's the same error: #77471 (comment)

Yes, it is the same error; it looks like the GitHub API cannot return more than the first 250 commits. @jjsjann123 is it possible to create those PRs with fewer commits? I will file a support ticket against GH and will manually merge, but this is not maintainable long term.

@jjsjann123 (Collaborator, Author):

Not sure if it's the same error: #77471 (comment)

Yes, it is the same error; it looks like the GitHub API cannot return more than the first 250 commits. @jjsjann123 is it possible to create those PRs with fewer commits? I will file a support ticket against GH and will manually merge, but this is not maintainable long term.

errr... I personally hated divergent history but I lost the battle a while ago on that topic with our internal devs 😝

I can squash everything to a single commit for future PRs. I think that should be fine since we are logging all commits in the description as well.

@malfet closed this in 9e52ad2 on Jun 8, 2022
@github-actions bot commented Jun 8, 2022

Hey @jjsjann123.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jun 8, 2022
Summary:
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: csarofeen#1440

Commits that are actually in this PR from the csarofeen branch:
```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```

Pull Request resolved: #78244

Reviewed By: ejguan

Differential Revision: D36678948

Pulled By: davidberard98

fbshipit-source-id: 0ccde965acbd31da67d99c6adb2eaaa888948105
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Pull Request resolved: pytorch/pytorch#78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Labels: ciflow/trunk, cla signed, oncall: jit, open source, triaged