optimize sampled_addmm performance on CPU (SparseCSR) #90978

mingfeima · 2022-12-16T02:53:38Z

Target and Background

This PR improves the performance of `sampled_addmm` on CPU. It is part of the effort to improve PyG performance on CPU for GNN training and inference.

The current implementation is a reference design that converts the `SparseCSR` tensor to a dense tensor, runs `addmm`, and converts the result back to `SparseCSR`. This is very slow and cannot run most of the datasets under https://github.com/snap-stanford/ogb, because the conversion to dense triggers `OOM`.
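For readers unfamiliar with the operator, here is a minimal pure-Python sketch of what a sampled `addmm` computes when it works directly on the CSR structure instead of going through a dense tensor. It is an illustration of the semantics only, not the kernel added in this PR (which is C++ in `SampledAddmmKernel.cpp`); the function name `sampled_addmm_csr` and the list-based layout are assumptions made for the example.

```python
def sampled_addmm_csr(crow_indices, col_indices, values, mat1, mat2,
                      alpha=1.0, beta=1.0):
    """Return new CSR values: alpha * (mat1 @ mat2)[i][j] + beta * values,
    evaluated only at the (i, j) positions stored in the CSR pattern.

    mat1 is M x K and mat2 is K x N, both as nested lists (hypothetical layout).
    """
    n_rows = len(crow_indices) - 1
    k = len(mat2)
    out_values = [0.0] * len(values)
    for i in range(n_rows):
        row = mat1[i]
        # Visit only the stored entries of row i; no dense M x N buffer is built.
        for p in range(crow_indices[i], crow_indices[i + 1]):
            j = col_indices[p]
            dot = sum(row[t] * mat2[t][j] for t in range(k))
            out_values[p] = alpha * dot + beta * values[p]
    return out_values


# 3 x 3 sparse pattern with entries at (0, 1), (1, 0), (1, 2), (2, 2).
crow = [0, 1, 3, 4]
col = [1, 0, 2, 2]
vals = [1.0, 2.0, 3.0, 4.0]
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 x 2
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]    # 2 x 3

result = sampled_addmm_csr(crow, col, vals, a, b)
```

The work is proportional to the number of stored entries times the inner dimension, which is why this formulation avoids the dense intermediate entirely.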

Benchmarks

We don't have a ready-made benchmark or workload to test this yet, since this operator is not used in PyG. I used the `ogb-products` dataset, where:

* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

If the adjacency matrix were stored as a dense tensor at 4 bytes per element, it would take 2.4 * 2.4 * 4 * 10^12 bytes (about 23 TB), so the current code runs out of memory on it. I extracted the first 1k rows for comparison and measured a 1100x speedup:
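The memory estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch: the 4-byte value size comes from the figure in the description, while the 8-byte index size for the CSR estimate is an assumption (int64 indices), and the helper names are hypothetical.

```python
def dense_bytes(n_rows, n_cols, itemsize=4):
    """Bytes needed to store a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, value_itemsize=4, index_itemsize=8):
    """Bytes needed for CSR storage: crow_indices has n_rows + 1 entries,
    col_indices and values have nnz entries each.
    index_itemsize=8 assumes int64 indices."""
    return (n_rows + 1) * index_itemsize + nnz * (index_itemsize + value_itemsize)


n_nodes = 2_400_000    # 2.4 * 10^6
n_edges = 126_000_000  # 1.26 * 10^8

print(f"dense: {dense_bytes(n_nodes, n_nodes) / 1e12:.2f} TB")
print(f"CSR:   {csr_bytes(n_nodes, n_edges) / 1e9:.2f} GB")
```

Under these assumptions the dense form needs tens of terabytes while the CSR form needs on the order of gigabytes, which is the gap the description refers to.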

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.

### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
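As a quick sanity check, the quoted speedup follows directly from the two 1000-row timings in the log above; the variable names below are only for illustration.

```python
# Per-iteration timings (ms) copied from the benchmark log for the first 1000 rows.
before_ms = 1212.000
after_ms = 1.102

speedup = before_ms / after_ms
print(f"speedup on first 1000 rows: {speedup:.1f}x")
```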

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot · 2022-12-16T02:53:40Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90978

✅ No Failures

As of commit b3eab0f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: b41edce7d2e1330bb2e1c00a3ec78308570e63cd Pull Request resolved: #90978

mingfeima · 2022-12-16T04:48:34Z

TODO list for the 2nd stage of PyG performance optimization on CPU:

* optimize `sampled_addmm` on SparseCSR
* enable `sampled_addmm` on SparseCOO: canceled (`coalesce()` takes too much time; on the dataset above it alone would take 7.2 s)
* unify ReduceTypes: GNN relies on several operators that have similar ReduceTypes, such as ScatterReduce, SegmentReduce, SampledReduce, and SpmmReduce
* optimize `segment_reduce` with lengths and offsets
* migrate `sampled_reduce`
* enable multi-aggregation
* new ReduceType `std` (use a stable algorithm such as Welford's?)

mingfeima · 2023-01-03T03:48:25Z

@rusty1s could you please review this one? `sampled_addmm` is mapped to SDDMM on CUDA devices, but there is no equivalent in MKL.

aten/src/ATen/native/cpu/SampledAddmmKernel.cpp

aten/src/ATen/native/sparse/SparseBlas.cpp

pearu

LGTM! Thanks, @mingfeima!

nikitaved

Looks great! Thank you, @mingfeima !

ezyang · 2023-01-10T22:11:49Z

@pytorchbot merge

pytorchmergebot · 2023-01-10T22:13:31Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

seemethere · 2023-01-11T20:10:13Z

@pytorchbot revert -m "This broke internal builds for android due to the new file added being missing in build_variables.bzl"

pytorch-bot · 2023-01-11T20:10:15Z

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

seemethere · 2023-01-11T20:10:28Z

@pytorchbot revert -m "This broke internal builds for android due to the new file added being missing in build_variables.bzl" -c ghfirst

seemethere · 2023-01-11T20:12:01Z

Would suggest making the following change when attempting to re-land:

diff --git a/build_variables.bzl b/build_variables.bzl
index 225fb9fbc7e..0359c5123c7 100644
--- a/build_variables.bzl
+++ b/build_variables.bzl
@@ -1430,6 +1430,7 @@ aten_native_source_non_codegen_list = [
     "aten/src/ATen/native/nested/NestedTensorUnaryOps.cpp",
     "aten/src/ATen/native/nested/NestedTensorUtils.cpp",
     "aten/src/ATen/native/sparse/ParamUtils.cpp",
+    "aten/src/ATen/native/cpu/SampledAddmmKernel.cpp",
     "aten/src/ATen/native/sparse/SoftMax.cpp",
     "aten/src/ATen/native/sparse/SparseBlas.cpp",
     "aten/src/ATen/native/sparse/SparseBlasImpl.cpp",

pytorchmergebot · 2023-01-11T20:12:08Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2023-01-11T20:12:17Z

@mingfeima your PR has been successfully reverted.

This reverts commit 645fb21. Reverted #90978 on behalf of https://github.com/seemethere due to This broke internal builds for android due to the new file added being missing in build_variables.bzl

huydhn · 2023-01-11T20:16:16Z

Thanks @seemethere for reverting this one. I was about to do the same because this also breaks the periodic buck-build-test: https://hud.pytorch.org/commit/pytorch/pytorch/364f526b9cdf9818a7647b5e637efdee825d61a1

ezyang · 2023-01-11T23:56:13Z

Why is this not caught in oss build

huydhn · 2023-01-12T00:43:28Z

> Why is this not caught in oss build

It was caught by the periodic buck build: https://github.com/pytorch/pytorch/actions/runs/3889778963/jobs/6638325506. Now that ciflow/periodic is on the PR, the failure shows up correctly.

@XiaobingSuper

mingfeima · 2023-01-12T02:22:06Z

build_variables.bzl updated!

mingfeima · 2023-01-12T12:02:13Z

@pytorchbot merge

pytorchmergebot · 2023-01-12T12:04:02Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

optimize sampled_addmm performance on CPU (SparseCSR)

3156e37

[ghstack-poisoned]

pytorch-bot bot added the release notes: sparse release notes category label Dec 16, 2022

github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 16, 2022

mingfeima added a commit that referenced this pull request Dec 16, 2022

optimize sampled_addmm performance on CPU (SparseCSR)

ddadbb6

ghstack-source-id: b41edce7d2e1330bb2e1c00a3ec78308570e63cd Pull Request resolved: #90978

mingfeima marked this pull request as draft December 16, 2022 02:54

pytorchbot added the open source label Dec 16, 2022

mingfeima marked this pull request as ready for review December 16, 2022 04:42

mingfeima added release notes: gnn gnn related optimizations intel This tag is for PR from Intel labels Dec 16, 2022

mingfeima mentioned this pull request Dec 16, 2022

[Roadmap] CPU Performance Optimization for PyG pyg-team/pytorch_geometric#4891

Open

32 tasks

mingfeima added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 16, 2022

This was referenced Dec 29, 2022

unify reduction types from different operators: scatter, scatter_reduce, segment_reduce #91499

Closed

optimize segment_reduce forward and backward path on CPU #91500

Closed

implement sampled_reduce for gnn aggregation #91501

Closed

mingfeima requested review from pearu, amjames, bhosmer, malfet, nikitaved, jgong5 and ezyang January 3, 2023 03:42

pearu reviewed Jan 3, 2023

aten/src/ATen/native/cpu/SampledAddmmKernel.cpp (outdated)

nikitaved reviewed Jan 3, 2023

aten/src/ATen/native/sparse/SparseBlas.cpp

pearu approved these changes Jan 9, 2023

nikitaved reviewed Jan 9, 2023

cpuhrsch approved these changes Jan 9, 2023

fzhao3 mentioned this pull request Jan 10, 2023

[PT2.0 Feature Proposal] GNN inference and training optimization on CPU #91951

Closed

pytorchmergebot added the Merged label Jan 10, 2023

pytorchmergebot closed this in 645fb21 Jan 10, 2023

pytorchmergebot added the Reverted label Jan 11, 2023

huydhn reopened this Jan 11, 2023

huydhn added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Jan 11, 2023

pytorchmergebot closed this in 3ab58fd Jan 12, 2023

facebook-github-bot deleted the gh/mingfeima/92/head branch June 8, 2023 18:01
