Improve make_tensor performance for float and complex types #85473
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85473. Note: links to docs will display an error until the doc builds have been completed.

✅ No Failures, 1 Pending as of commit 344d1b8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
For floating types, `make_tensor` calls `rand` and then does a linear interpolation (aka lerp) from `low` to `high`. This makes the lerp step faster by:

- using in-place operations
- using `add`'s `alpha` parameter to avoid an extra kernel
- adding shortcuts for special values of `low` and `high`

For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This reduces overhead by doing a single `rand` + interpolation of double the size, then calling `torch.view_as_complex` at the end. (A sketch of the rescaling step follows the table below.)

| Device | dtype | Size | low | high | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-----|------|-------------|--------------|---------|
| CPU | float32 | 8 | | | 19.4 | 15.1 | 1.3 |
| | | | 0 | | 19.7 | 9.21 | 2.1 |
| | | | 0 | 1 | 19.7 | 5.94 | 3.3 |
| | | 4096 | | | 36.8 | 31.3 | 1.2 |
| | | | 0 | | 37.1 | 24.7 | 1.5 |
| | | | 0 | 1 | 36.9 | 21.0 | 1.8 |
| | | 2**24 | | | 167,000 | 115,000 | 1.5 |
| | | | 0 | | 179,000 | 85,200 | 2.1 |
| | | | 0 | 1 | 180,000 | 80,800 | 2.2 |
| | complex32 | 8 | | | 37.0 | 22.5 | 1.6 |
| | | | 0 | | 37.4 | 14.8 | 2.5 |
| | | | 0 | 1 | 37.5 | 10.5 | 3.6 |
| | | 4096 | | | 73.1 | 56.4 | 1.3 |
| | | | 0 | | 73.5 | 47.1 | 1.6 |
| | | | 0 | 1 | 73.6 | 42.7 | 1.7 |
| | | 2**24 | | | 409,000 | 280,000 | 1.5 |
| | | | 0 | | 411,000 | 219,000 | 1.9 |
| | | | 0 | 1 | 409,000 | 213,000 | 1.9 |
| CUDA | float32 | 8 | | | 40.4 | 30.9 | 1.3 |
| | | | 0 | | 39.2 | 17.6 | 2.2 |
| | | | 0 | 1 | 39.2 | 11.1 | 3.5 |
| | | 4096 | | | 38.7 | 32.2 | 1.2 |
| | | | 0 | | 39.2 | 18.0 | 2.2 |
| | | | 0 | 1 | 39.3 | 11.1 | 3.5 |
| | | 2**24 | | | 2,300 | 1,840 | 1.3 |
| | | | 0 | | 2,300 | 704 | 3.3 |
| | | | 0 | 1 | 2,300 | 242 | 9.5 |
| | complex32 | 8 | | | 78.7 | 45.0 | 1.7 |
| | | | 0 | | 80.8 | 29.3 | 2.8 |
| | | | 0 | 1 | 83.5 | 22.2 | 3.8 |
| | | 4096 | | | 82.7 | 44.8 | 1.8 |
| | | | 0 | | 83.9 | 29.4 | 2.9 |
| | | | 0 | 1 | 81.5 | 22.1 | 3.7 |
| | | 2**24 | | | 5,520 | 4,600 | 1.2 |
| | | | 0 | | 5,520 | 2,470 | 2.2 |
| | | | 0 | 1 | 5,520 | 1,410 | 3.9 |
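To make the bullets above concrete, here is a minimal sketch of the in-place rescaling with shortcuts. The helper name `_rescale_uniform` and the exact branch structure are illustrative, not the PR's actual code, and the `add`-`alpha` micro-optimization is elided:

```python
import torch

def _rescale_uniform(shape, low, high, *, dtype=torch.float32, device="cpu"):
    # torch.rand already samples U(0, 1), so some (low, high) pairs
    # need no extra kernels at all.
    t = torch.rand(shape, dtype=dtype, device=device)
    if low == 0 and high == 1:
        return t                        # shortcut: rand is already U(0, 1)
    if low == 0:
        return t.mul_(high)             # shortcut: a single in-place scale
    # General case: t * (high - low) + low, done in place so no
    # intermediate tensors are allocated.
    return t.mul_(high - low).add_(low)
```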
For floating types, `make_tensor` calls `rand` and then does a linear interpolation (aka lerp) from `low` to `high`. This makes the lerp step faster by:

- using in-place operations
- using `add`'s `alpha` parameter to avoid an extra kernel
- adding shortcuts for special values of `low` and `high`

For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This reduces overhead by doing a single `rand` + interpolation of double the size, then calling `torch.view_as_complex` at the end (sketched after the table below).

My benchmarks show speedups in all cases for float32 and complex64.

| Device | dtype | Size | low | high | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-----|------|-------------|--------------|---------|
| CPU | float32 | 8 | | | 19.4 | 15.1 | 1.3 |
| | | | 0 | | 19.7 | 9.21 | 2.1 |
| | | | 0 | 1 | 19.7 | 5.94 | 3.3 |
| | | 4096 | | | 36.8 | 31.3 | 1.2 |
| | | | 0 | | 37.1 | 24.7 | 1.5 |
| | | | 0 | 1 | 36.9 | 21.0 | 1.8 |
| | | 2**24 | | | 167,000 | 115,000 | 1.5 |
| | | | 0 | | 179,000 | 85,200 | 2.1 |
| | | | 0 | 1 | 180,000 | 80,800 | 2.2 |
| | complex32 | 8 | | | 37.0 | 17.6 | 2.1 |
| | | | 0 | | 37.4 | 11.3 | 3.3 |
| | | | 0 | 1 | 37.5 | 7.66 | 4.9 |
| | | 4096 | | | 73.1 | 49.9 | 1.5 |
| | | | 0 | | 73.5 | 41.5 | 1.8 |
| | | | 0 | 1 | 73.6 | 37.6 | 2.0 |
| | | 2**24 | | | 409,000 | 229,000 | 1.8 |
| | | | 0 | | 411,000 | 170,000 | 2.4 |
| | | | 0 | 1 | 409,000 | 163,000 | 2.5 |
| CUDA | float32 | 8 | | | 40.4 | 30.9 | 1.3 |
| | | | 0 | | 39.2 | 17.6 | 2.2 |
| | | | 0 | 1 | 39.2 | 11.1 | 3.5 |
| | | 4096 | | | 38.7 | 32.2 | 1.2 |
| | | | 0 | | 39.2 | 18.0 | 2.2 |
| | | | 0 | 1 | 39.3 | 11.1 | 3.5 |
| | | 2**24 | | | 2,300 | 1,840 | 1.3 |
| | | | 0 | | 2,300 | 704 | 3.3 |
| | | | 0 | 1 | 2,300 | 242 | 9.5 |
| | complex32 | 8 | | | 78.7 | 34.7 | 2.3 |
| | | | 0 | | 80.8 | 20.5 | 3.9 |
| | | | 0 | 1 | 83.5 | 13.8 | 6.0 |
| | | 4096 | | | 82.7 | 34.8 | 2.4 |
| | | | 0 | | 83.9 | 20.5 | 4.1 |
| | | | 0 | 1 | 81.5 | 13.9 | 5.9 |
| | | 2**24 | | | 5,520 | 3,670 | 1.5 |
| | | | 0 | | 5,520 | 1,400 | 3.9 |
| | | | 0 | 1 | 5,520 | 484 | 11.4 |
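A hedged sketch of the complex-type trick just described: one `rand` of double the size, with a trailing dimension of 2, reinterpreted via `torch.view_as_complex` instead of two `rand` calls plus `torch.complex(real, imag)`. The helper name and the complex64-only dtype handling are assumptions for illustration:

```python
import torch

def _lerp_complex(shape, low, high, *, device="cpu"):
    # One rand of double the size: shape + (2,) float32 values...
    t = torch.rand(*shape, 2, dtype=torch.float32, device=device)
    t.mul_(high - low).add_(low)    # same in-place rescale as the float path
    # ...then reinterpret each trailing pair as (real, imag): no
    # torch.complex(real, imag) kernel and no second rand call.
    return torch.view_as_complex(t)
```

`view_as_complex` requires the last dimension to have size 2 and unit stride, which a freshly allocated `rand` output satisfies.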
For floating types, `make_tensor` calls `rand` and then does a linear interpolation from `low` to `high`. This instead calls `uniform_(low, high)` to cut out the interpolation step.

For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This instead uses `view_as_real` and `uniform_(low, high)` to fuse it all into one operation (sketched after the table below).

My benchmarks show significant speedups in all cases for float32 and complex64.

| Device | dtype | Size | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU | float32 | 8 | 19.4 | 6.34 | 3.1 |
| | | 4096 | 36.8 | 21.3 | 1.7 |
| | | 2**24 | 167,000 | 80,500 | 2.1 |
| | complex32 | 8 | 37.0 | 7.57 | 4.9 |
| | | 4096 | 73.1 | 37.6 | 1.9 |
| | | 2**24 | 409,000 | 161,000 | 2.5 |
| CUDA | float32 | 8 | 40.4 | 11.7 | 3.5 |
| | | 4096 | 38.7 | 11.7 | 3.3 |
| | | 2**24 | 2,300 | 238 | 9.7 |
| | complex32 | 8 | 78.7 | 14 | 5.6 |
| | | 4096 | 82.7 | 13.8 | 6.0 |
| | | 2**24 | 5,520 | 489 | 11.3 |
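A minimal sketch of this revision's approach, with hypothetical helper names; `torch.empty`, `Tensor.uniform_`, and `torch.view_as_real` are real PyTorch APIs, but the wrapper structure here is illustrative rather than the PR's exact code:

```python
import torch

def _make_float(shape, low, high, *, dtype=torch.float32, device="cpu"):
    # Fill uninitialized memory in place with U(low, high): one kernel,
    # no separate rand + interpolation step.
    return torch.empty(shape, dtype=dtype, device=device).uniform_(low, high)

def _make_complex(shape, low, high, *, dtype=torch.complex64, device="cpu"):
    t = torch.empty(shape, dtype=dtype, device=device)
    # view_as_real exposes a (..., 2) float view of the same storage, so a
    # single uniform_ call fills both real and imaginary parts at once.
    torch.view_as_real(t).uniform_(low, high)
    return t
```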
Hey @peterbell10.
@pytorchbot revert -m 'Sorry for reverting your PR, but it seems to cause a bunch of flaky tests in pull and periodic' -c nosignal

Here are some related failures. They look flaky:
@pytorchbot successfully started a revert job. Check the current status here.

@peterbell10 your PR has been successfully reverted.
…85473)" This reverts commit a76995e. Reverted #85473 on behalf of https://github.com/huydhn due to Sorry for revert your PR, but it seems to cause a bunch of flaky test in pull an periodic
/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here.

Hey @peterbell10.
Improve make_tensor performance for float and complex types (#85473)

Summary: Pull Request resolved: #85473
Approved by: https://github.com/mruberry

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3ec71fce79f4e568c48796da4b18a3e6f2c6fc29

Reviewed By: seemethere
Differential Revision: D40166926
Pulled By: seemethere
fbshipit-source-id: 37a5d67f48328e40622dbd32488088d5d9f7ce82
Stack from ghstack (oldest at bottom):

For floating types, `make_tensor` calls `rand` and then does a linear interpolation from `low` to `high`. This instead calls `uniform_(low, high)` to cut out the interpolation step.

For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This instead uses `view_as_real` and `uniform_(low, high)` to fuse it all into one operation.

My benchmarks show significant speedups in all cases for float32 and complex64.
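The thread does not include the benchmark script itself, so the following is only an approximation of how numbers like those in the tables above could be reproduced with `torch.utils.benchmark`; the sizes come from the tables, and everything else is assumed:

```python
import torch
from torch.testing import make_tensor
from torch.utils.benchmark import Timer

for size in (8, 4096, 2**24):
    t = Timer(
        stmt="make_tensor((size,), dtype=torch.float32, device='cpu')",
        globals={"make_tensor": make_tensor, "torch": torch, "size": size},
    )
    # .mean is in seconds; report microseconds to match the tables.
    print(f"size={size}: {t.timeit(100).mean * 1e6:.1f} us")
```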