Improve make_tensor performance for float and complex types #85473
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85473. Note: links to docs will display an error until the doc builds have been completed.

✅ No Failures, 1 Pending as of commit 344d1b8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
For floating types, `make_tensor` calls `rand` and then does a linear interpolation (aka lerp) from `low` to `high`. This makes the lerp step faster by:

- using in-place operations
- using `add`'s `alpha` parameter to avoid an extra kernel
- adding shortcuts for special values of `low` and `high`

For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This reduces overhead by doing a single `rand` + interpolation of double the size, then calling `torch.view_as_complex` at the end. (A sketch of the rescaling step follows the table below.)

| Device | dtype | Size | low | high | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-----|------|-------------|--------------|---------|
| CPU | float32 | 8 | | | 19.4 | 15.1 | 1.3 |
| | | | 0 | | 19.7 | 9.21 | 2.1 |
| | | | 0 | 1 | 19.7 | 5.94 | 3.3 |
| | | 4096 | | | 36.8 | 31.3 | 1.2 |
| | | | 0 | | 37.1 | 24.7 | 1.5 |
| | | | 0 | 1 | 36.9 | 21.0 | 1.8 |
| | | 2**24 | | | 167,000 | 115,000 | 1.5 |
| | | | 0 | | 179,000 | 85,200 | 2.1 |
| | | | 0 | 1 | 180,000 | 80,800 | 2.2 |
| | complex32 | 8 | | | 37.0 | 22.5 | 1.6 |
| | | | 0 | | 37.4 | 14.8 | 2.5 |
| | | | 0 | 1 | 37.5 | 10.5 | 3.6 |
| | | 4096 | | | 73.1 | 56.4 | 1.3 |
| | | | 0 | | 73.5 | 47.1 | 1.6 |
| | | | 0 | 1 | 73.6 | 42.7 | 1.7 |
| | | 2**24 | | | 409,000 | 280,000 | 1.5 |
| | | | 0 | | 411,000 | 219,000 | 1.9 |
| | | | 0 | 1 | 409,000 | 213,000 | 1.9 |
| CUDA | float32 | 8 | | | 40.4 | 30.9 | 1.3 |
| | | | 0 | | 39.2 | 17.6 | 2.2 |
| | | | 0 | 1 | 39.2 | 11.1 | 3.5 |
| | | 4096 | | | 38.7 | 32.2 | 1.2 |
| | | | 0 | | 39.2 | 18.0 | 2.2 |
| | | | 0 | 1 | 39.3 | 11.1 | 3.5 |
| | | 2**24 | | | 2,300 | 1,840 | 1.3 |
| | | | 0 | | 2,300 | 704 | 3.3 |
| | | | 0 | 1 | 2,300 | 242 | 9.5 |
| | complex32 | 8 | | | 78.7 | 45.0 | 1.7 |
| | | | 0 | | 80.8 | 29.3 | 2.8 |
| | | | 0 | 1 | 83.5 | 22.2 | 3.8 |
| | | 4096 | | | 82.7 | 44.8 | 1.8 |
| | | | 0 | | 83.9 | 29.4 | 2.9 |
| | | | 0 | 1 | 81.5 | 22.1 | 3.7 |
| | | 2**24 | | | 5,520 | 4,600 | 1.2 |
| | | | 0 | | 5,520 | 2,470 | 2.2 |
| | | | 0 | 1 | 5,520 | 1,410 | 3.9 |
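To make the bullets above concrete, here is a minimal sketch of the in-place rescaling with shortcuts. The helper name `_rescale_uniform` and the exact branch structure are illustrative, not the PR's actual code, and the `add`-`alpha` micro-optimization is elided:

```python
import torch

def _rescale_uniform(shape, low, high, *, dtype=torch.float32, device="cpu"):
    # torch.rand already samples U(0, 1), so some (low, high) pairs
    # need no extra kernels at all.
    t = torch.rand(shape, dtype=dtype, device=device)
    if low == 0 and high == 1:
        return t                        # shortcut: rand is already U(0, 1)
    if low == 0:
        return t.mul_(high)             # shortcut: a single in-place scale
    # General case: t * (high - low) + low, done in place so no
    # intermediate tensors are allocated.
    return t.mul_(high - low).add_(low)
```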
For floating types, `make_tensor` calls `rand` and then does a linear interpolation (aka lerp) from `low` to `high`. This makes the lerp step faster by:

- using in-place operations
- using `add`'s `alpha` parameter to avoid an extra kernel
- adding shortcuts for special values of `low` and `high`

For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This reduces overhead by doing a single `rand` + interpolation of double the size, then calling `torch.view_as_complex` at the end (sketched after the table below).

My benchmarks show speedups in all cases for float32 and complex64.

| Device | dtype | Size | low | high | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-----|------|-------------|--------------|---------|
| CPU | float32 | 8 | | | 19.4 | 15.1 | 1.3 |
| | | | 0 | | 19.7 | 9.21 | 2.1 |
| | | | 0 | 1 | 19.7 | 5.94 | 3.3 |
| | | 4096 | | | 36.8 | 31.3 | 1.2 |
| | | | 0 | | 37.1 | 24.7 | 1.5 |
| | | | 0 | 1 | 36.9 | 21.0 | 1.8 |
| | | 2**24 | | | 167,000 | 115,000 | 1.5 |
| | | | 0 | | 179,000 | 85,200 | 2.1 |
| | | | 0 | 1 | 180,000 | 80,800 | 2.2 |
| | complex32 | 8 | | | 37.0 | 17.6 | 2.1 |
| | | | 0 | | 37.4 | 11.3 | 3.3 |
| | | | 0 | 1 | 37.5 | 7.66 | 4.9 |
| | | 4096 | | | 73.1 | 49.9 | 1.5 |
| | | | 0 | | 73.5 | 41.5 | 1.8 |
| | | | 0 | 1 | 73.6 | 37.6 | 2.0 |
| | | 2**24 | | | 409,000 | 229,000 | 1.8 |
| | | | 0 | | 411,000 | 170,000 | 2.4 |
| | | | 0 | 1 | 409,000 | 163,000 | 2.5 |
| CUDA | float32 | 8 | | | 40.4 | 30.9 | 1.3 |
| | | | 0 | | 39.2 | 17.6 | 2.2 |
| | | | 0 | 1 | 39.2 | 11.1 | 3.5 |
| | | 4096 | | | 38.7 | 32.2 | 1.2 |
| | | | 0 | | 39.2 | 18.0 | 2.2 |
| | | | 0 | 1 | 39.3 | 11.1 | 3.5 |
| | | 2**24 | | | 2,300 | 1,840 | 1.3 |
| | | | 0 | | 2,300 | 704 | 3.3 |
| | | | 0 | 1 | 2,300 | 242 | 9.5 |
| | complex32 | 8 | | | 78.7 | 34.7 | 2.3 |
| | | | 0 | | 80.8 | 20.5 | 3.9 |
| | | | 0 | 1 | 83.5 | 13.8 | 6.0 |
| | | 4096 | | | 82.7 | 34.8 | 2.4 |
| | | | 0 | | 83.9 | 20.5 | 4.1 |
| | | | 0 | 1 | 81.5 | 13.9 | 5.9 |
| | | 2**24 | | | 5,520 | 3,670 | 1.5 |
| | | | 0 | | 5,520 | 1,400 | 3.9 |
| | | | 0 | 1 | 5,520 | 484 | 11.4 |
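A hedged sketch of the complex-type trick just described: one `rand` of double the size, with a trailing dimension of 2, reinterpreted via `torch.view_as_complex` instead of two `rand` calls plus `torch.complex(real, imag)`. The helper name and the complex64-only dtype handling are assumptions for illustration:

```python
import torch

def _lerp_complex(shape, low, high, *, device="cpu"):
    # One rand of double the size: shape + (2,) float32 values...
    t = torch.rand(*shape, 2, dtype=torch.float32, device=device)
    t.mul_(high - low).add_(low)    # same in-place rescale as the float path
    # ...then reinterpret each trailing pair as (real, imag): no
    # torch.complex(real, imag) kernel and no second rand call.
    return torch.view_as_complex(t)
```

`view_as_complex` requires the last dimension to have size 2 and unit stride, which a freshly allocated `rand` output satisfies.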
For floating types, `make_tensor` calls `rand` and then does a linear interpolation from `low` to `high`. This instead calls `uniform_(low, high)` to cut out the interpolation step.

For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This instead uses `view_as_real` and `uniform_(low, high)` to fuse it all into one operation (sketched after the table below).

My benchmarks show significant speedups in all cases for float32 and complex64.

| Device | dtype | Size | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU | float32 | 8 | 19.4 | 6.34 | 3.1 |
| | | 4096 | 36.8 | 21.3 | 1.7 |
| | | 2**24 | 167,000 | 80,500 | 2.1 |
| | complex32 | 8 | 37.0 | 7.57 | 4.9 |
| | | 4096 | 73.1 | 37.6 | 1.9 |
| | | 2**24 | 409,000 | 161,000 | 2.5 |
| CUDA | float32 | 8 | 40.4 | 11.7 | 3.5 |
| | | 4096 | 38.7 | 11.7 | 3.3 |
| | | 2**24 | 2,300 | 238 | 9.7 |
| | complex32 | 8 | 78.7 | 14 | 5.6 |
| | | 4096 | 82.7 | 13.8 | 6.0 |
| | | 2**24 | 5,520 | 489 | 11.3 |
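A minimal sketch of this revision's approach, with hypothetical helper names; `torch.empty`, `Tensor.uniform_`, and `torch.view_as_real` are real PyTorch APIs, but the wrapper structure here is illustrative rather than the PR's exact code:

```python
import torch

def _make_float(shape, low, high, *, dtype=torch.float32, device="cpu"):
    # Fill uninitialized memory in place with U(low, high): one kernel,
    # no separate rand + interpolation step.
    return torch.empty(shape, dtype=dtype, device=device).uniform_(low, high)

def _make_complex(shape, low, high, *, dtype=torch.complex64, device="cpu"):
    t = torch.empty(shape, dtype=dtype, device=device)
    # view_as_real exposes a (..., 2) float view of the same storage, so a
    # single uniform_ call fills both real and imaginary parts at once.
    torch.view_as_real(t).uniform_(low, high)
    return t
```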
Hey @peterbell10.
@pytorchbot revert -m 'Sorry for reverting your PR, but it seems to cause a bunch of flaky tests in pull and periodic' -c nosignal

Here are some related failures. They look flaky:
@pytorchbot successfully started a revert job. Check the current status here.

@peterbell10 your PR has been successfully reverted.
…85473)" This reverts commit a76995e. Reverted #85473 on behalf of https://github.com/huydhn due to Sorry for revert your PR, but it seems to cause a bunch of flaky test in pull an periodic
/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here.

Hey @peterbell10.
Improve make_tensor performance for float and complex types (#85473)

Summary: Pull Request resolved: #85473
Approved by: https://github.com/mruberry

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3ec71fce79f4e568c48796da4b18a3e6f2c6fc29

Reviewed By: seemethere
Differential Revision: D40166926
Pulled By: seemethere
fbshipit-source-id: 37a5d67f48328e40622dbd32488088d5d9f7ce82
Stack from ghstack (oldest at bottom):

For floating types, `make_tensor` calls `rand` and then does a linear interpolation from `low` to `high`. This instead calls `uniform_(low, high)` to cut out the interpolation step.

For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This instead uses `view_as_real` and `uniform_(low, high)` to fuse it all into one operation.

My benchmarks show significant speedups in all cases for float32 and complex64.
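The thread does not include the benchmark script itself, so the following is only an approximation of how numbers like those in the tables above could be reproduced with `torch.utils.benchmark`; the sizes come from the tables, and everything else is assumed:

```python
import torch
from torch.testing import make_tensor
from torch.utils.benchmark import Timer

for size in (8, 4096, 2**24):
    t = Timer(
        stmt="make_tensor((size,), dtype=torch.float32, device='cpu')",
        globals={"make_tensor": make_tensor, "torch": torch, "size": size},
    )
    # .mean is in seconds; report microseconds to match the tables.
    print(f"size={size}: {t.timeit(100).mean * 1e6:.1f} us")
```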