Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize sampled_addmm performance on CPU (SparseCSR) #90978

Closed
wants to merge 9 commits into from

Conversation

mingfeima
Copy link
Collaborator

@mingfeima mingfeima commented Dec 16, 2022

Stack from ghstack:

Target and Background

This PR is improving the performance of sampled_addmm on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts SparseCSR tensor back to dense tensor and then do the addmm and convert back to SparseCSR again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger OOM).

Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from ogb-products where:

  • number of nodes: 2.4 * 10^6
  • number of edges: 1.26 * 10^8
  • number of features: 128

So if we store the adjacency matrix is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, 1100x speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.

### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot
Copy link

pytorch-bot bot commented Dec 16, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90978

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b3eab0f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: sparse release notes category label Dec 16, 2022
@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 16, 2022
mingfeima added a commit that referenced this pull request Dec 16, 2022
ghstack-source-id: b41edce7d2e1330bb2e1c00a3ec78308570e63cd
Pull Request resolved: #90978
@mingfeima mingfeima marked this pull request as draft December 16, 2022 02:54
@mingfeima mingfeima marked this pull request as ready for review December 16, 2022 04:42
@mingfeima
Copy link
Collaborator Author

mingfeima commented Dec 16, 2022

TODO list for 2nd stage of PyG performance optimization on CPU:

  • optimization of sampled_addmm on SparseCSR
  • enabling of sampled_addm on SparseCOO: canceled (coalesce() takes too much time, the above dataset would take 7.2s to finish this)
  • unify ReduceTypes: GNN would rely on a few operators who have similar ReduceTypes, such as ScatterReduce, SegmentReduce, SampledReduce, SpmmReduce
  • optimization of segment_reduce with lengths and offsets
  • migrate sampled_reduce
  • enable multi aggregation
  • New ReduceType of std (use stable alg. welford?)

@mingfeima mingfeima added release notes: gnn gnn related optimizations intel This tag is for PR from Intel labels Dec 16, 2022
@mingfeima mingfeima added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 16, 2022
### Target and background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`).

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* #nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```




[ghstack-poisoned]
### Target and background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`).

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* #nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```




[ghstack-poisoned]
@mingfeima
Copy link
Collaborator Author

@rusty1s could you please review this one? sampled_addmm is mapped to SDDMM on cuda device but we don't have equivalent in MKL.

### Target and background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`).

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* #nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```




[ghstack-poisoned]
### Target and Background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`).

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```




[ghstack-poisoned]
Copy link
Collaborator

@pearu pearu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks, @mingfeima!

Copy link
Collaborator

@nikitaved nikitaved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you, @mingfeima !

### Target and Background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`).

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```




[ghstack-poisoned]
@ezyang
Copy link
Contributor

ezyang commented Jan 10, 2023

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@seemethere
Copy link
Member

@pytorchbot revert -m "This broke internal builds for android due to the new file added being missing in build_variables.bzl"

@pytorch-bot
Copy link

pytorch-bot bot commented Jan 11, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@seemethere
Copy link
Member

@pytorchbot revert -m "This broke internal builds for android due to the new file added being missing in build_variables.bzl" -c ghfirst

@seemethere
Copy link
Member

Would suggest making the following change when attempting to re-land:

diff --git a/build_variables.bzl b/build_variables.bzl
index 225fb9fbc7e..0359c5123c7 100644
--- a/build_variables.bzl
+++ b/build_variables.bzl
@@ -1430,6 +1430,7 @@ aten_native_source_non_codegen_list = [
     "aten/src/ATen/native/nested/NestedTensorUnaryOps.cpp",
     "aten/src/ATen/native/nested/NestedTensorUtils.cpp",
     "aten/src/ATen/native/sparse/ParamUtils.cpp",
+    "aten/src/ATen/native/cpu/SampledAddmmKernel.cpp",
     "aten/src/ATen/native/sparse/SoftMax.cpp",
     "aten/src/ATen/native/sparse/SparseBlas.cpp",
     "aten/src/ATen/native/sparse/SparseBlasImpl.cpp",

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

@mingfeima your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jan 11, 2023
This reverts commit 645fb21.

Reverted #90978 on behalf of https://github.com/seemethere due to This broke internal builds for android due to the new file added being missing in build_variables.bzl
@huydhn
Copy link
Contributor

huydhn commented Jan 11, 2023

Thank @seemethere for reverting this one, I'm about to do the same because this also breaks periodic buck-build-test https://hud.pytorch.org/commit/pytorch/pytorch/364f526b9cdf9818a7647b5e637efdee825d61a1

@huydhn huydhn reopened this Jan 11, 2023
@huydhn huydhn added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Jan 11, 2023
@ezyang
Copy link
Contributor

ezyang commented Jan 11, 2023

Why is this not caught in oss build

@huydhn
Copy link
Contributor

huydhn commented Jan 12, 2023

Why is this not caught in oss build

It was caught by periodic buck build https://github.com/pytorch/pytorch/actions/runs/3889778963/jobs/6638325506. Now with ciflow/periodic in the PR, the failure has shown up correctly

### Target and Background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`).

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```




cc jgong5 @XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
@mingfeima
Copy link
Collaborator Author

build_variables.bzl updated!

@mingfeima
Copy link
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@facebook-github-bot facebook-github-bot deleted the gh/mingfeima/92/head branch June 8, 2023 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request intel This tag is for PR from Intel Merged module: cpu CPU specific problem (e.g., perf, algorithm) open source release notes: gnn gnn related optimizations release notes: sparse release notes category Reverted
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

9 participants