
Enable oneDNN implementation in LSTM op #91158

Closed
wants to merge 10 commits

Conversation

yanbing-j (Collaborator) commented Dec 20, 2022

Description

This PR enables the oneDNN implementation in the LSTM op to improve its performance. Both FP32 and BF16 are supported.
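For illustration, a minimal sketch of exercising the LSTM in both precisions (the shapes and sizes below are placeholders, not taken from this PR):

```python
import torch

# Placeholder shapes; the PR's benchmark configurations differ.
lstm = torch.nn.LSTM(input_size=64, hidden_size=128, num_layers=2)
x = torch.randn(35, 8, 64)  # (seq_len, batch, input_size)

# FP32 path
out_fp32, _ = lstm(x)

# BF16 path via CPU autocast
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out_bf16, _ = lstm(x)
```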

Performance improvement

Measured on a 28-core CPX machine, with iomp and jemalloc enabled.
We benchmark 8 LSTM input configurations (each defined by input_size, hidden_size, num_layers, bidirectional, bias, batch_first, dropout, batch_size, and seq_len); the final configuration is a real input from the train-clean-100 split of the LibriSpeech dataset. The performance improvements are shown in the figures below. LSTM with the oneDNN implementation outperforms the original implementation.
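As a rough sketch of how one such configuration can be timed (the parameter values here are illustrative placeholders, not the measured options):

```python
import time
import torch

# One illustrative configuration; the PR sweeps input_size, hidden_size,
# num_layers, bidirectional, bias, batch_first, dropout, batch_size, seq_len.
lstm = torch.nn.LSTM(input_size=256, hidden_size=512, num_layers=1,
                     bias=True, batch_first=False, dropout=0.0,
                     bidirectional=False)
x = torch.randn(50, 16, 256)  # (seq_len, batch_size, input_size)

with torch.no_grad():
    for _ in range(10):   # warmup
        lstm(x)
    start = time.time()
    for _ in range(100):
        lstm(x)
    print(f"{(time.time() - start) / 100 * 1e3:.2f} ms per iteration")
```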

In single socket:
[performance charts]

In single core:
[performance charts]

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @mcarilli @ptrblck @leslie-fang-intel

pytorch-bot (bot) commented Dec 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91158

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 836a7dd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 20, 2022
@yanbing-j yanbing-j force-pushed the yanbing/lstm_onednn branch 5 times, most recently from 0b97b20 to 6c902ea on December 26, 2022 08:17
@yanbing-j yanbing-j force-pushed the yanbing/lstm_onednn branch 3 times, most recently from 66d3bc7 to c3f6977 on January 4, 2023 08:05
@yanbing-j yanbing-j added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 5, 2023
@yanbing-j yanbing-j force-pushed the yanbing/lstm_onednn branch 3 times, most recently from 2cdb1b1 to 03920f7 on January 8, 2023 06:52
@yanbing-j yanbing-j marked this pull request as ready for review January 8, 2023 11:01
Comment on lines 46 to 47
def get_rand_seed():
    return int(time.time() * 1000000000)
Collaborator
This would make the test results non-deterministic. Can we use a fixed seed?

Collaborator Author

Sure. Done. I fixed the seed to 2023.
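For reference, the fixed-seed variant would look roughly like this (a sketch; the actual test code may differ):

```python
import torch

def get_rand_seed():
    # Fixed seed so the test results are reproducible across runs.
    return 2023

torch.manual_seed(get_rand_seed())
```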

    cy = torch.empty(0, device=input.device)
else:
    cy = cx_.new_empty(cx_.shape)
workspace = input.new_empty([hidden_size * 1024], dtype=torch.uint8)
Collaborator

I guess the workspace doesn't matter here; just creating an empty tensor would be good enough?

Collaborator Author

Yes. An empty tensor works correctly here. Done.
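A sketch of the resulting change, assuming (as the comment above suggests) the buffer contents are never inspected on this path:

```python
import torch

input = torch.randn(50, 16, 256)  # stand-in for the real op input

# Before: a buffer whose size was guessed from hidden_size.
# workspace = input.new_empty([hidden_size * 1024], dtype=torch.uint8)

# After: an empty placeholder tensor is sufficient here.
workspace = input.new_empty(0, dtype=torch.uint8)
```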

Comment on lines 241 to 245
auto nblks = desc.blocking_desc().inner_nblks;
std::vector<int64_t> at_sizes(ndims + nblks);
auto padded_dims = desc.padded_dims();
auto blk_sizes = desc.blocking_desc().inner_blks;
auto blk_idxs = desc.blocking_desc().inner_idxs;
Collaborator

Do we really have to parse the internal blocking descriptors of oneDNN to get the workspace ATen tensor? Can we just model it as a 1D tensor buffer on the ATen side?

Collaborator Author

Done with desc.get_size().
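The idea, sketched in Python (the byte count is a stand-in for what `desc.get_size()` returns on the C++ side; the value is illustrative):

```python
import torch

# Stand-in for the byte size reported by the oneDNN memory descriptor
# via desc.get_size(); the actual value depends on the primitive.
desc_size_bytes = 4096

# Model the workspace as an opaque 1D uint8 buffer instead of
# reconstructing the blocked layout (inner_nblks, inner_blks, inner_idxs)
# dimension by dimension.
workspace = torch.empty(desc_size_bytes, dtype=torch.uint8)
assert workspace.numel() == desc_size_bytes
```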

}

auto input = input_;
bool is_input_packed = batch_sizes.size() != 0;
Collaborator

This is always false; why bother checking it?

Collaborator Author

Removed this duplicate check.
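For context, `batch_sizes` is non-empty only for packed inputs, which this path never receives; a quick illustration:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

x = torch.randn(5, 3, 10)  # (seq_len, batch, features), padded
packed = pack_padded_sequence(x, lengths=[5, 4, 2])
print(packed.batch_sizes)  # tensor([3, 3, 2, 2, 1]): non-empty => packed
```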

@yanbing-j yanbing-j added the intel This tag is for PR from Intel label Jan 9, 2023
@yanbing-j yanbing-j requested a review from jgong5 January 9, 2023 07:55
yanbing-j (Collaborator Author)

This PR depends on the Meta-internal ideep/oneDNN upgrade. Please do not merge it until the Meta-internal ideep/oneDNN upgrade issue is resolved.

@atalman atalman added this to the 2.0.0 milestone Jan 11, 2023
yanbing-j (Collaborator Author)

Hi @malfet, could you please help review this PR? Thanks!

malfet (Contributor) left a comment

Please address the review comments and add more comments explaining what this code is trying to do; overall it looks fine.

aten/src/ATen/native/RNN.cpp: 2 review comments (outdated, resolved)
aten/src/ATen/native/mkldnn/RNN.cpp: 8 review comments (outdated, resolved)
@yanbing-j yanbing-j force-pushed the yanbing/lstm_onednn branch 6 times, most recently from d7d0523 to 1c3f210 on January 12, 2023 09:39
yanbing-j (Collaborator Author)

Hi @malfet, I have addressed all the comments; using the suggested changes is much better. I will try to merge this PR now.

yanbing-j (Collaborator Author)

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

xuzhao9 (Contributor) commented Jan 20, 2023

We observe a 40-60% speedup in the tts_angular model on CPU in TorchBench (pytorch/benchmark#1376) because of this PR. Congrats!

Labels: ciflow/trunk, intel, Merged, module: amp (automated mixed precision), module: cpu, open source
Projects: Done

7 participants