Enable oneDNN implementation in LSTM op #91158
Conversation
Dr. CI: ✅ No failures as of commit 836a7dd. See artifacts and rendered test results at hud.pytorch.org/pr/91158.
test/test_mkldnn.py
Outdated
def get_rand_seed():
    return int(time.time() * 1000000000)
This would make the test results nondeterministic. Can we use a fixed seed?
Sure, done. I fixed the seed to 2023.
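One way the fixed-seed variant might look (a sketch, not the exact test change; whether the test keeps the helper or calls torch.manual_seed directly is an assumption, only the value 2023 comes from the reply above):

```python
import torch


def get_rand_seed():
    # Fixed seed (2023, per the discussion above) so test results are reproducible.
    return 2023


torch.manual_seed(get_rand_seed())
```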
torch/_meta_registrations.py
Outdated
    cy = torch.empty(0, device=input.device)
else:
    cy = cx_.new_empty(cx_.shape)
workspace = input.new_empty([hidden_size * 1024], dtype=torch.uint8)
I guess workspace doesn't matter here; just creating an empty tensor would be good enough?
Yes. An empty tensor works correctly here. Done.
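A rough sketch of what the updated meta registration could look like after this change, reusing the names input and cx_ from the snippet above; the function name and exact signature here are hypothetical, not the PR's actual code:

```python
import torch

# Hypothetical, simplified sketch: in a meta function only shapes, dtypes, and
# devices matter, so the workspace can simply be an empty uint8 tensor instead
# of one sized from oneDNN internals.
def lstm_meta_sketch(input, cx_, has_cell_state):
    if not has_cell_state:
        cy = torch.empty(0, device=input.device)
    else:
        cy = cx_.new_empty(cx_.shape)
    workspace = input.new_empty((0,), dtype=torch.uint8)
    return cy, workspace
```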
aten/src/ATen/native/mkldnn/RNN.cpp
Outdated
auto nblks = desc.blocking_desc().inner_nblks;
std::vector<int64_t> at_sizes(ndims + nblks);
auto padded_dims = desc.padded_dims();
auto blk_sizes = desc.blocking_desc().inner_blks;
auto blk_idxs = desc.blocking_desc().inner_idxs;
Do we really have to parse oneDNN's internal blocking descriptors to get the workspace ATen tensor? Can we just model it as a 1D tensor buffer on the ATen side?
Done with desc.get_size().
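To illustrate the simplification in Python rather than the C++ above: the workspace exposed to ATen is just a flat 1D byte buffer, so no blocking-layout parsing is needed. The byte count below is a stand-in for whatever oneDNN's desc.get_size() reports, not a real value:

```python
import torch

workspace_nbytes = 4096  # stand-in for the size reported by oneDNN's desc.get_size()
workspace = torch.empty(workspace_nbytes, dtype=torch.uint8)  # flat 1D byte buffer
assert workspace.dim() == 1 and workspace.numel() == workspace_nbytes
```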
aten/src/ATen/native/mkldnn/RNN.cpp
Outdated
}

auto input = input_;
bool is_input_packed = batch_sizes.size() != 0;
This is always false here, so why bother checking?
Removed this duplicated check.
This PR depends on the Meta-internal ideep/oneDNN upgrade. Do not merge it before that upgrade issue is resolved.
Hi @malfet, could you please help review this PR? Thanks!
Please address the review comments and add more code comments explaining what this code is trying to do; overall it looks fine, though.
Hi @malfet, I have addressed all the comments. The code is much better with the suggested changes. I will go ahead and merge this PR.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
We observe a 40-60% speedup in the tts_angular model on CPU in TorchBench (pytorch/benchmark#1376) thanks to this PR. Congrats!
Description
This PR enables a oneDNN implementation of the LSTM op to improve its performance. Both FP32 and BF16 are supported.
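For context, a minimal usage sketch (not part of the PR; shapes and sizes are arbitrary) exercising the LSTM op in both FP32 and BF16 on CPU:

```python
import torch

lstm = torch.nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)
x = torch.randn(8, 50, 64)  # (batch, seq_len, input_size)

# FP32 inference on CPU
with torch.no_grad():
    out_fp32, (h, c) = lstm(x)

# BF16 inference on CPU via autocast
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out_bf16, _ = lstm(x)
```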
Performance improvement
Measured on a 28-core CPX machine, with iomp and jemalloc configured.
We chose 8 LSTM input configurations (varying input_size, hidden_size, num_layers, bidirectional, bias, batch_first, dropout, batch_size, and seq_len); the final configuration is a real input from train-clean-100 in the LibriSpeech dataset. The performance improvements are shown in the figures below: the LSTM with the oneDNN implementation performs better than the original.
[Figures from the PR: speedup over the original implementation for single-socket and single-core runs]
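A simple way to run this kind of comparison locally is torch.utils.benchmark; the configuration values below are illustrative, not the exact ones benchmarked above:

```python
import torch
from torch.utils import benchmark

lstm = torch.nn.LSTM(input_size=256, hidden_size=512, num_layers=1, batch_first=True)
x = torch.randn(16, 100, 256)  # (batch, seq_len, input_size)

timer = benchmark.Timer(
    stmt="lstm(x)",
    globals={"lstm": lstm, "x": x},
    num_threads=torch.get_num_threads(),
)
print(timer.timeit(100))  # per-call time statistics over 100 runs
```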
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @mcarilli @ptrblck @leslie-fang-intel