[Compile] Add NEON implementation for bf16->fp32 cast #134297
Conversation
Let's trigger a dashboard run for this.

Sure: https://github.com/pytorch/pytorch/actions/runs/10529131469 [Edit] Realized I did this change before the split, so alas it's not really usable. Let's test in trunk.
```c++
int32x4_t shift = vdupq_n_s32(16);
auto u16_low1 = vget_low_u16(u16_8);
auto u16_high1 = vget_high_u16(u16_8);
float32x4_t f32x4_0 = vreinterpretq_f32_u32(vshlq_u32(vmovl_u16(u16_low1), shift));
```
Seems reasonable, but if the input is interleaved then you can just do a vectorized (input & 0xFF00) and the reinterpret for the upper half, and save the vget_high and vmovl instructions for that half. For the lower half you would still need those.
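A sketch of one way to read that suggestion (assuming the intended mask is 0xFFFF0000 applied to the 32-bit reinterpretation, and that the surrounding kernel tolerates even/odd-interleaved output lanes; this is not the code that landed in the PR):

```c++
#include <arm_neon.h>

// Sketch only: each 32-bit lane of the reinterpreted input holds two bf16
// values. The odd-indexed bf16 already sits in the high half-word, which is
// exactly where the bf16 bits live inside an fp32, so a mask plus reinterpret
// converts it with no widening. The even-indexed bf16 is shifted up into
// place. Outputs come out interleaved (lanes 1,3,5,7 and 0,2,4,6).
static inline void bf16x8_to_f32_interleaved(uint16x8_t bf16x8,
                                             float32x4_t* odd,
                                             float32x4_t* even) {
  uint32x4_t pairs = vreinterpretq_u32_u16(bf16x8);
  *odd  = vreinterpretq_f32_u32(vandq_u32(pairs, vdupq_n_u32(0xFFFF0000u)));
  *even = vreinterpretq_f32_u32(vshlq_n_u32(pairs, 16));
}
```

Under that reading, the vget_high_u16/vmovl_u16/vshlq_u32 chain for one half becomes a single AND, at the cost of element order.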
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot merge -f "This is weird: workflow dispatch jobs do not show up in the signal box, but still delay the merge"
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This changes the assembly generated for the bf16->fp32 cast routine (before/after assembly listings not reproduced here), and as a result speeds up

`python3 torchchat.py generate stories110M --num-samples 3 --compile --device cpu --dtype bfloat16`

from 33 to 90 tokens/sec.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
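As a rough, self-contained sketch of what a NEON bf16->fp32 widening loop of this shape looks like (function and buffer names are illustrative, not the PR's actual generated code):

```c++
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Minimal sketch: convert n bf16 values (stored as raw uint16_t) to fp32,
// eight at a time. A bf16 is by construction the upper 16 bits of an
// IEEE-754 fp32, so widening each lane to 32 bits and shifting left by 16
// reproduces the fp32 bit pattern exactly.
void bf16_to_fp32(const uint16_t* src, float* dst, std::size_t n) {
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    uint16x8_t u16_8 = vld1q_u16(src + i);
    // vshll_n_u16 widens u16 lanes to u32 and shifts in a single instruction.
    uint32x4_t lo = vshll_n_u16(vget_low_u16(u16_8), 16);
    uint32x4_t hi = vshll_n_u16(vget_high_u16(u16_8), 16);
    vst1q_f32(dst + i, vreinterpretq_f32_u32(lo));
    vst1q_f32(dst + i + 4, vreinterpretq_f32_u32(hi));
  }
  for (; i < n; ++i) {  // scalar tail for the remainder
    uint32_t bits = static_cast<uint32_t>(src[i]) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    dst[i] = f;
  }
}
```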