Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unimplemented intrinsics for matrixmultiply #1405

Closed
cuviper opened this issue Oct 31, 2023 · 11 comments · Fixed by #1417
Closed

unimplemented intrinsics for matrixmultiply #1405

cuviper opened this issue Oct 31, 2023 · 11 comments · Fixed by #1417
Labels
A-core-arch Area: Necessary for full core::arch support

Comments

@cuviper
Copy link
Member

cuviper commented Oct 31, 2023

I have some code using ndarray dot products, which in turn calls matrixmultiply::sgemm or dgemm, and these trap when built with cranelift. Here's a reproducer:

cargo-features = ["codegen-backend"]

[package]
name = "dot"
edition = "2021"

[dependencies]
ndarray = "0.15.6"

[profile.dev]
codegen-backend = "cranelift"
#[test]
fn dot_f32() {
    let matrix = ndarray::Array2::<f32>::eye(10);
    let _ = matrix.dot(&matrix);
}

#[test]
fn dot_f64() {
    let matrix = ndarray::Array2::<f64>::eye(10);
    let _ = matrix.dot(&matrix);
}
$ cargo test
...
running 2 tests
trap at Instance { def: Item(DefId(2:14266 ~ core[53bd]::core_arch::x86::avx::_mm256_permute2f128_pd)), args: [3_i32] } (_ZN4core9core_arch3x863avx22_mm256_permute2f128_pd17h1ce919b2c8bdf956E): llvm.x86.avx.vperm2f128.pd.256
trap at Instance { def: Item(DefId(2:14264 ~ core[53bd]::core_arch::x86::avx::_mm256_permute2f128_ps)), args: [3_i32] } (_ZN4core9core_arch3x863avx22_mm256_permute2f128_ps17h5d0d25c7962691b9E): llvm.x86.avx.vperm2f128.ps.256
@cuviper
Copy link
Member Author

cuviper commented Oct 31, 2023

FWIW, aarch64 also fails:

trap at Instance { def: Item(DefId(2:48675 ~ core[5761]::core_arch::aarch64::neon::generated::vfmaq_laneq_f32)), args: [0_i32] } (_ZN4core9core_arch7aarch644neon9generated15vfmaq_laneq_f3217h0dd8a28605cc03d6E): llvm.fma.v4f32
trap at Instance { def: Item(DefId(2:48683 ~ core[5761]::core_arch::aarch64::neon::generated::vfmaq_laneq_f64)), args: [0_i32] } (_ZN4core9core_arch7aarch644neon9generated15vfmaq_laneq_f6417ha68b5ef2dcd31872E): llvm.fma.v2f64

@cuviper cuviper changed the title unimplemented avx::_mm256_permute2f128_ps and _pd unimplemented intrinsics for matrixmultiply Oct 31, 2023
@cuviper
Copy link
Member Author

cuviper commented Oct 31, 2023

Directly compiling matrixmultiply shows warnings about these intrinsics, but at least there are no more.

Aarch64:

warning: unsupported llvm intrinsic llvm.fma.v4f32; replacing with trap

warning: unsupported llvm intrinsic llvm.fma.v2f64; replacing with trap

x86_64:

warning: unsupported x86 llvm intrinsic llvm.x86.avx.vperm2f128.pd.256; replacing with trap

warning: unsupported x86 llvm intrinsic llvm.x86.avx.vperm2f128.ps.256; replacing with trap

bjorn3 added a commit that referenced this issue Oct 31, 2023
@bjorn3
Copy link
Member

bjorn3 commented Oct 31, 2023

Implemented llvm.fma.v* in 48ca2d9. On AArch64 with this fix the only remaining ndarray test failures are: insert_axis, insert_axis_f and test_multislice_intersecting. Based on the panic message for those remaining test failures I think there is a miscompilation of those tests though.

Edit: Seems those are actually tests that use catch_unwind, which doesn't work because of panic=abort.

@bjorn3
Copy link
Member

bjorn3 commented Oct 31, 2023

I wrote an entire comment about how I couldn't reproduce any crash on x86 and then I tried using the rustup version instead of the version built from this repo, which did indeed crash with this error message. I'm currently investigating what the difference between the two is that could have caused this.

@cuviper
Copy link
Member Author

cuviper commented Oct 31, 2023

Ah, yes I'm using the rustup component, as of:

$ rustc +nightly -Vv
rustc 1.75.0-nightly (31bc7e2c4 2023-10-30)
binary: rustc
commit-hash: 31bc7e2c47e82798a392c770611975a6883132c8
commit-date: 2023-10-30
host: x86_64-unknown-linux-gnu
release: 1.75.0-nightly
LLVM version: 17.0.3

@bjorn3
Copy link
Member

bjorn3 commented Oct 31, 2023

It seems like is_x86_feature_detected!() is broken when using a cg_clif compiled libstd, causing matrixmultiply to disable some tests because it thinks AVX and FMA are not supported.

@bjorn3
Copy link
Member

bjorn3 commented Oct 31, 2023

I think I know the issue. std_detect::detect::os::x86::detect_features depends on _xgetbv() to see if the OS supports AVX. _xgetbv is implemented using the llvm.x86.xgetbv LLVM intrinsic rather than an asm!() block. Because it isn't supported natively by Cranelift, I implemented it using a dummy value of 1.

@bjorn3
Copy link
Member

bjorn3 commented Nov 2, 2023

Just a quick update. I have _xgetbv correctly implemented now. I've been working on implementing _mm256_permute2f128_ps and _mm256_permute2f128_pd and got a miscompilation right now that I need to fix.

@bjorn3
Copy link
Member

bjorn3 commented Nov 5, 2023

Got matrixmultiply working correctly in the implement_xgetbv branch. You can download a precompiled version from https://github.com/rust-lang/rustc_codegen_cranelift/actions/runs/6763047493 once it is done. I will probably work on implementing the rest of the reported missing intrinsics from other issues before opening a PR.

@bjorn3
Copy link
Member

bjorn3 commented Nov 11, 2023

Should be fixed in the latest nightly.

@cuviper
Copy link
Member Author

cuviper commented Nov 11, 2023

Confirmed, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-core-arch Area: Necessary for full core::arch support
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants