Improve make_tensor performance for float and complex types #85473

Closed
wants to merge 30 commits

Conversation

peterbell10
Collaborator

@peterbell10 peterbell10 commented Sep 22, 2022

Stack from ghstack (oldest at bottom):

For floating types, `make_tensor` calls `rand` and then does a linear
interpolation from `low` to `high`. This PR instead calls `uniform_(low, high)`
to cut out the interpolation step.

For complex types, `make_tensor` does the `rand` + interpolation step
twice and calls `torch.complex(real, imag)` at the end. This PR instead
uses `view_as_real` and `uniform_(low, high)` to fuse it all into one
operation.
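
A minimal sketch of this approach, for illustration only (the helper names below are made up and are not the actual `make_tensor` internals):

```python
import torch

def _uniform_float(shape, *, dtype=torch.float32, device="cpu", low=0.0, high=1.0):
    # Sample directly in [low, high); the range scaling happens inside the RNG
    # kernel, so no separate interpolation kernels are launched.
    return torch.empty(shape, dtype=dtype, device=device).uniform_(low, high)

def _uniform_complex(shape, *, dtype=torch.complex64, device="cpu", low=0.0, high=1.0):
    # Fill the real and imaginary parts with a single uniform_ call through the
    # real view, avoiding two rand + lerp passes and a torch.complex call.
    result = torch.empty(shape, dtype=dtype, device=device)
    torch.view_as_real(result).uniform_(low, high)
    return result
```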

My benchmarks show significant speedups in all cases for float32 and
complex64.

| Device | dtype     | Size  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU    | float32   | 8     | 19.4        | 6.34         | 3.1     |
|        |           | 4096  | 36.8        | 21.3         | 1.7     |
|        |           | 2**24 | 167,000     | 80,500       | 2.1     |
|        | complex32 | 8     | 37.0        | 7.57         | 4.9     |
|        |           | 4096  | 73.1        | 37.6         | 1.9     |
|        |           | 2**24 | 409,000     | 161,000      | 2.5     |
| CUDA   | float32   | 8     | 40.4        | 11.7         | 3.5     |
|        |           | 4096  | 38.7        | 11.7         | 3.3     |
|        |           | 2**24 | 2,300       | 238          | 9.7     |
|        | complex32 | 8     | 78.7        | 14           | 5.6     |
|        |           | 4096  | 82.7        | 13.8         | 6.0     |
|        |           | 2**24 | 5,520       | 489          | 11.3    |

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Sep 22, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85473

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 1 Pending

As of commit 344d1b8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Sep 22, 2022
ghstack-source-id: 16a096fe1a74619602f1033c244e9240f41bd2b4
Pull Request resolved: pytorch#85473
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Sep 22, 2022
For floating types, `make_tensor` calls `rand` and then does a linear
interpolation (aka lerp) from `low` to `high`. This makes the lerp
step faster by:

- using inplace operations
- using `add`'s `alpha` parameter to avoid an extra kernel
- adding shortcuts for special values of `low` and `high`

For complex types, `make_tensor` does the `rand` + interpolation step
twice and calls `torch.complex(real, imag)` at the end. This reduces
overhead by doing a single `rand` + interpolation of double the size,
then calling `torch.view_as_complex` at the end.
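
A minimal sketch of these optimizations, assuming scalar `low`/`high` (illustrative names only, not the exact `make_tensor` code; the `add` `alpha` detail is omitted for brevity):

```python
import torch

def _rand_lerp(shape, *, dtype=torch.float32, device="cpu", low=0.0, high=1.0):
    t = torch.rand(shape, dtype=dtype, device=device)  # values in [0, 1)
    # Shortcuts for special values of low and high skip redundant kernels.
    if low == 0 and high == 1:
        return t
    if low == 0:
        return t.mul_(high)
    # In-place lerp low + t * (high - low), with no temporary allocations.
    return t.mul_(high - low).add_(low)

def _rand_complex(shape, *, dtype=torch.complex64, device="cpu", low=0.0, high=1.0):
    # A single rand of twice the size viewed as complex, instead of two
    # rand + lerp passes followed by torch.complex(real, imag).
    float_dtype = {torch.complex32: torch.float16,
                   torch.complex64: torch.float32,
                   torch.complex128: torch.float64}[dtype]
    t = _rand_lerp((*shape, 2), dtype=float_dtype, device=device, low=low, high=high)
    return torch.view_as_complex(t)
```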

| Device | dtype     | Size  | low | high | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-----|------|-------------|--------------|---------|
| CPU    | float32   | 8     |     |      | 19.4        | 15.1         | 1.3     |
|        |           |       | 0   |      | 19.7        | 9.21         | 2.1     |
|        |           |       | 0   | 1    | 19.7        | 5.94         | 3.3     |
|        |           | 4096  |     |      | 36.8        | 31.3         | 1.2     |
|        |           |       | 0   |      | 37.1        | 24.7         | 1.5     |
|        |           |       | 0   | 1    | 36.9        | 21.0         | 1.8     |
|        |           | 2**24 |     |      | 167,000     | 115,000      | 1.5     |
|        |           |       | 0   |      | 179,000     | 85,200       | 2.1     |
|        |           |       | 0   | 1    | 180,000     | 80,800       | 2.2     |
|        | complex32 | 8     |     |      | 37.0        | 22.5         | 1.6     |
|        |           |       | 0   |      | 37.4        | 14.8         | 2.5     |
|        |           |       | 0   | 1    | 37.5        | 10.5         | 3.6     |
|        |           | 4096  |     |      | 73.1        | 56.4         | 1.3     |
|        |           |       | 0   |      | 73.5        | 47.1         | 1.6     |
|        |           |       | 0   | 1    | 73.6        | 42.7         | 1.7     |
|        |           | 2**24 |     |      | 409,000     | 280,000      | 1.5     |
|        |           |       | 0   |      | 411,000     | 219,000      | 1.9     |
|        |           |       | 0   | 1    | 409,000     | 213,000      | 1.9     |
| CUDA   | float32   | 8     |     |      | 40.4        | 30.9         | 1.3     |
|        |           |       | 0   |      | 39.2        | 17.6         | 2.2     |
|        |           |       | 0   | 1    | 39.2        | 11.1         | 3.5     |
|        |           | 4096  |     |      | 38.7        | 32.2         | 1.2     |
|        |           |       | 0   |      | 39.2        | 18.0         | 2.2     |
|        |           |       | 0   | 1    | 39.3        | 11.1         | 3.5     |
|        |           | 2**24 |     |      | 2,300       | 1,840        | 1.3     |
|        |           |       | 0   |      | 2,300       | 704          | 3.3     |
|        |           |       | 0   | 1    | 2,300       | 242          | 9.5     |
|        | complex32 | 8     |     |      | 78.7        | 45.0         | 1.7     |
|        |           |       | 0   |      | 80.8        | 29.3         | 2.8     |
|        |           |       | 0   | 1    | 83.5        | 22.2         | 3.8     |
|        |           | 4096  |     |      | 82.7        | 44.8         | 1.8     |
|        |           |       | 0   |      | 83.9        | 29.4         | 2.9     |
|        |           |       | 0   | 1    | 81.5        | 22.1         | 3.7     |
|        |           | 2**24 |     |      | 5,520       | 4,600        | 1.2     |
|        |           |       | 0   |      | 5,520       | 2,470        | 2.2     |
|        |           |       | 0   | 1    | 5,520       | 1,410        | 3.9     |

ghstack-source-id: 16a096fe1a74619602f1033c244e9240f41bd2b4
Pull Request resolved: pytorch#85473
[ghstack-poisoned]
peterbell10 added a commit that referenced this pull request Sep 22, 2022
For floating types, `make_tensor` calls `rand` and then does a linear
interpolation (aka lerp) from `low` to `high`. This makes the lerp
step faster by:

- using inplace operations
- using `add`'s `alpha` parameter to avoid an extra kernel
- adding shortcuts for special values of `low` and `high`

For complex types, `make_tensor` does the `rand` + interpolation step
twice and calls `torch.complex(real, imag)` at the end. This reduces
overhead by doing a single `rand` + interpolation of double the size,
then calling `torch.view_as_complex` at the end.

My benchmarks show speedups in all cases for float32 and complex64.

| Device | dtype     | Size  | low | high | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-----|------|-------------|--------------|---------|
| CPU    | float32   | 8     |     |      | 19.4        | 15.1         | 1.3     |
|        |           |       | 0   |      | 19.7        | 9.21         | 2.1     |
|        |           |       | 0   | 1    | 19.7        | 5.94         | 3.3     |
|        |           | 4096  |     |      | 36.8        | 31.3         | 1.2     |
|        |           |       | 0   |      | 37.1        | 24.7         | 1.5     |
|        |           |       | 0   | 1    | 36.9        | 21.0         | 1.8     |
|        |           | 2**24 |     |      | 167,000     | 115,000      | 1.5     |
|        |           |       | 0   |      | 179,000     | 85,200       | 2.1     |
|        |           |       | 0   | 1    | 180,000     | 80,800       | 2.2     |
|        | complex32 | 8     |     |      | 37.0        | 17.6         | 2.1     |
|        |           |       | 0   |      | 37.4        | 11.3         | 3.3     |
|        |           |       | 0   | 1    | 37.5        | 7.66         | 4.9     |
|        |           | 4096  |     |      | 73.1        | 49.9         | 1.5     |
|        |           |       | 0   |      | 73.5        | 41.5         | 1.8     |
|        |           |       | 0   | 1    | 73.6        | 37.6         | 2.0     |
|        |           | 2**24 |     |      | 409,000     | 229,000      | 1.8     |
|        |           |       | 0   |      | 411,000     | 170,000      | 2.4     |
|        |           |       | 0   | 1    | 409,000     | 163,000      | 2.5     |
| CUDA   | float32   | 8     |     |      | 40.4        | 30.9         | 1.3     |
|        |           |       | 0   |      | 39.2        | 17.6         | 2.2     |
|        |           |       | 0   | 1    | 39.2        | 11.1         | 3.5     |
|        |           | 4096  |     |      | 38.7        | 32.2         | 1.2     |
|        |           |       | 0   |      | 39.2        | 18.0         | 2.2     |
|        |           |       | 0   | 1    | 39.3        | 11.1         | 3.5     |
|        |           | 2**24 |     |      | 2,300       | 1,840        | 1.3     |
|        |           |       | 0   |      | 2,300       | 704          | 3.3     |
|        |           |       | 0   | 1    | 2,300       | 242          | 9.5     |
|        | complex32 | 8     |     |      | 78.7        | 34.7         | 2.3     |
|        |           |       | 0   |      | 80.8        | 20.5         | 3.9     |
|        |           |       | 0   | 1    | 83.5        | 13.8         | 6.0     |
|        |           | 4096  |     |      | 82.7        | 34.8         | 2.4     |
|        |           |       | 0   |      | 83.9        | 20.5         | 4.1     |
|        |           |       | 0   | 1    | 81.5        | 13.9         | 5.9     |
|        |           | 2**24 |     |      | 5,520       | 3,670        | 1.5     |
|        |           |       | 0   |      | 5,520       | 1,400        | 3.9     |
|        |           |       | 0   | 1    | 5,520       | 484          | 11.4    |

ghstack-source-id: b9448bf572abe41c5e63eb8ce2408712d40ce5ae
Pull Request resolved: #85473
@peterbell10 peterbell10 changed the title make_tensor lerp Improve make_tensor performance for float and complex types Sep 22, 2022
peterbell10 added a commit that referenced this pull request Sep 22, 2022
ghstack-source-id: 9c6a6e04f29e49b68718bc60ef3f3f6417415365
Pull Request resolved: #85473
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Sep 22, 2022
ghstack-source-id: 9c6a6e04f29e49b68718bc60ef3f3f6417415365
Pull Request resolved: pytorch#85473
peterbell10 added a commit that referenced this pull request Sep 22, 2022
ghstack-source-id: 6754bb3a9cb7bf9532692760c08f7eff023a1a0f
Pull Request resolved: #85473
peterbell10 added a commit that referenced this pull request Sep 23, 2022
ghstack-source-id: 9422f5262f49dcb8e08f85bdbd83b7e9b4314fe9
Pull Request resolved: #85473
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Sep 23, 2022
ghstack-source-id: 9422f5262f49dcb8e08f85bdbd83b7e9b4314fe9
Pull Request resolved: pytorch#85473
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Sep 23, 2022
For floating types, `make_tensor` calls `rand` and then does a linear
interpolation from `low` to `high`. This instead calls `uniform_(low,
high)` to cut out the interpolation step.

For complex types, `make_tensor` does the `rand` + interpolation step
twice and calls `torch.complex(real, imag)` at the end. This instead
uses `view_as_real` and `uniform_(low, high)` to fuse it all into one
operation.
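
A minimal sketch of this approach, assuming the usual `empty` + `uniform_` pattern (the helper name is made up; the real `make_tensor` handles more dtypes and argument validation):

```python
import torch

def _uniform_tensor(shape, low, high, *, dtype=torch.float32, device="cpu"):
    t = torch.empty(shape, dtype=dtype, device=device)
    if t.is_complex():
        # view_as_real exposes real/imag as a trailing size-2 dimension,
        # so a single uniform_ call fills both parts in one operation.
        torch.view_as_real(t).uniform_(low, high)
    else:
        t.uniform_(low, high)
    return t
```

For example, `_uniform_tensor((4096,), -1.0, 1.0, dtype=torch.complex64)` draws both real and imaginary parts from `[-1, 1)` with one kernel after the allocation.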

My benchmarks show significant speedups in all cases for float32 and
complex64.

| Device | dtype     | Size  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU    | float32   | 8     | 19.4        | 6.34         | 3.1     |
|        |           | 4096  | 36.8        | 21.3         | 1.7     |
|        |           | 2**24 | 167,000     | 80,500       | 2.1     |
|        | complex32 | 8     | 37.0        | 7.57         | 4.9     |
|        |           | 4096  | 73.1        | 37.6         | 1.9     |
|        |           | 2**24 | 409,000     | 161,000      | 2.5     |
| CUDA   | float32   | 8     | 40.4        | 11.7         | 3.5     |
|        |           | 4096  | 38.7        | 11.7         | 3.3     |
|        |           | 2**24 | 2,300       | 238          | 9.7     |
|        | complex32 | 8     | 78.7        | 14           | 5.6     |
|        |           | 4096  | 82.7        | 13.8         | 6.0     |
|        |           | 2**24 | 5,520       | 489          | 11.3    |
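
For context, timings like the ones above can be gathered with something along the lines of the following `torch.utils.benchmark` harness (the sizes and exact invocation are assumptions; the script used for the table may differ):

```python
import torch
from torch.testing import make_tensor
from torch.utils import benchmark

devices = ("cpu", "cuda") if torch.cuda.is_available() else ("cpu",)
for device in devices:
    for dtype in (torch.float32, torch.complex64):
        for size in (8, 4096, 2**24):
            timer = benchmark.Timer(
                stmt="make_tensor((size,), device=device, dtype=dtype)",
                globals={"make_tensor": make_tensor, "size": size,
                         "device": device, "dtype": dtype},
            )
            median_us = timer.blocked_autorange(min_run_time=1).median * 1e6
            print(f"{device} {dtype} {size}: {median_us:.1f} us")
```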

ghstack-source-id: 9422f5262f49dcb8e08f85bdbd83b7e9b4314fe9
Pull Request resolved: pytorch#85473
@github-actions

Hey @peterbell10.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@huydhn
Contributor

huydhn commented Sep 29, 2022

@huydhn huydhn reopened this Sep 29, 2022
@huydhn huydhn added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Sep 29, 2022
@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot
Collaborator

@peterbell10 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Sep 29, 2022
…85473)"

This reverts commit a76995e.

Reverted #85473 on behalf of https://github.com/huydhn due to: Sorry for reverting your PR, but it seems to cause a bunch of flaky tests in pull and periodic
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Oct 3, 2022
@facebook-github-bot
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

peterbell10 added a commit that referenced this pull request Oct 4, 2022
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
peterbell10 added a commit that referenced this pull request Oct 4, 2022
@peterbell10
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the land checks (-l) flag. If you did not specify this flag yourself, you are likely enrolled in the land checks rollout. This means that your change will be merged once all checks on your PR have passed since you have added the ciflow/trunk label to your PR (ETA 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!


@peterbell10 peterbell10 added the topic: not user facing topic category label Oct 5, 2022
facebook-github-bot pushed a commit that referenced this pull request Oct 7, 2022
…85473)

Pull Request resolved: #85473
Approved by: https://github.com/mruberry

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3ec71fce79f4e568c48796da4b18a3e6f2c6fc29

Reviewed By: seemethere

Differential Revision: D40166926

Pulled By: seemethere

fbshipit-source-id: 37a5d67f48328e40622dbd32488088d5d9f7ce82
alvgaona pushed a commit to alvgaona/pytorch that referenced this pull request Oct 11, 2022
alvgaona pushed a commit to alvgaona/pytorch that referenced this pull request Oct 11, 2022
@facebook-github-bot facebook-github-bot deleted the gh/peterbell10/426/head branch June 8, 2023 18:21