
Use blasoxide for BLAS operations #640

Closed
ghost opened this issue Jun 6, 2019 · 12 comments

@ghost

ghost commented Jun 6, 2019

Hi,

I am writing a BLAS implementation in rust: https://github.com/oezgurmakkurt/blasoxide

It doesn't support row strides yet and ignores the transpose arguments, but performance is good.

Could someone list the features, tests, and benchmarks I need to implement to make blasoxide viable for ndarray?

@SuperFluffy
Contributor

Great effort!

In case you haven't seen it yet, can I point you towards bluss' matrixmultiply library, which provides native *gemm implementations? The kernels are modelled after the BLIS library and are ideally written with the architecture of the specific CPUs they will run on in mind.

@ghost
Author

ghost commented Jun 6, 2019

matrixmultiply is ~5x slower on my CPU, which has 4 cores; I think blasoxide would be better for performance.

I benchmarked using 1024x1024 f32 matrices in contiguous memory.
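For context, a benchmark along these lines can be sketched in a few lines of std-only Rust. This uses a naive triple-loop kernel and a smaller matrix so it finishes quickly; it only illustrates the timing setup, not blasoxide's or matrixmultiply's actual harness:

```rust
use std::time::Instant;

// Naive row-major sgemm accumulation: C += A * B. For timing illustration only.
fn naive_sgemm(n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}

fn main() {
    let n = 256; // the issue used 1024x1024; smaller here so the demo is fast
    let a = vec![1.0f32; n * n];
    let b = vec![1.0f32; n * n];
    let mut c = vec![0.0f32; n * n];
    let t = Instant::now();
    naive_sgemm(n, &a, &b, &mut c);
    println!("{} ms, c[0] = {}", t.elapsed().as_millis(), c[0]);
}
```

A real comparison would time matrixmultiply and blasoxide on the same buffers and take the best of several runs.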

@SuperFluffy
Contributor

SuperFluffy commented Jun 6, 2019 via email

@ghost
Author

ghost commented Jun 6, 2019

[screenshot: benchmark results]
steps to reproduce:

@ghost
Author

ghost commented Jun 8, 2019

Hi again,
I just improved the gemm functions; sgemm_1000 is now around 10.5 ms, which is almost 6 times faster than matrixmultiply on my 4-core CPU.

@jturner314
Member

blasoxide is a really cool project! I'm glad to see someone writing a Rust BLAS implementation.

ndarray exposes a small set of linear algebra operations (dot products and matrix/matrix-vector multiplication); the implementations are in impl_linalg.rs.

In other words, ndarray exposes interfaces to the *gemm and *gemv BLAS operations, so if you provide those operations, blasoxide could be supported by ndarray.

For matrix multiplication with arbitrary element types, ndarray has a general (slow) matrix multiplication implementation. For matrix multiplication with f32 and f64, ndarray always uses matrixmultiply unless (1) the user has enabled the BLAS backend and (2) the arrays have memory layouts compatible with BLAS. If both (1) and (2) are met, then the BLAS backend is used instead.

The blasoxide implementations need to support lda, ldb, and ldc (the leading dimensions) and the transpose arguments (which are used, e.g., for handling arrays in row-major layout). Basically, they should act like normal BLAS implementations.
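For illustration, a reference (unoptimized) column-major *gemm that honors the leading-dimension and transpose arguments might look like this; `ref_sgemm` is a hypothetical name, not blasoxide's or ndarray's API:

```rust
// Reference (unoptimized) column-major sgemm mirroring the BLAS argument
// conventions: C = alpha * op(A) * op(B) + beta * C, where op is optionally
// a transpose and lda/ldb/ldc are the leading dimensions.
fn ref_sgemm(
    transa: bool, transb: bool,
    m: usize, n: usize, k: usize,
    alpha: f32, a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    beta: f32, c: &mut [f32], ldc: usize,
) {
    // element (i, j) of a column-major matrix with leading dimension ld,
    // with indices swapped when the transpose flag is set
    let at = |i, j| if transa { a[j + i * lda] } else { a[i + j * lda] };
    let bt = |i, j| if transb { b[j + i * ldb] } else { b[i + j * ldb] };
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0;
            for p in 0..k {
                acc += at(i, p) * bt(p, j);
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}

fn main() {
    // 2x2 identity times [[1, 2], [3, 4]], all column-major
    let a = [1.0f32, 0.0, 0.0, 1.0];
    let b = [1.0f32, 3.0, 2.0, 4.0];
    let mut c = [0.0f32; 4];
    ref_sgemm(false, false, 2, 2, 2, 1.0, &a, 2, &b, 2, 0.0, &mut c, 2);
    println!("{:?}", c); // identity * b == b
}
```

A row-major caller handles its layout by passing the transpose flags and swapping operands, which is exactly why the transpose arguments matter to ndarray.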

So, to directly answer your question:

  • To be used as an optional BLAS backend, we need cblas_dgemm, cblas_sgemm, cblas_dgemv, and cblas_sgemv with support for all arguments (including stride and transpose arguments). (Fwiw, ndarray currently uses the cblas_* functions, but I wouldn't mind changing ndarray to use the Fortran-style functions instead (e.g. dgemm instead of cblas_dgemm).)

    To simplify things, it would be nice for blasoxide to be added as one of the available backends for blas-src. If you do this, AFAIK no changes are necessary to ndarray; the user can just select blasoxide as their desired BLAS implementation.

  • To be usable as an alternative for matrixmultiply, we need functions analogous to dgemm and sgemm that support arbitrary strides. (See e.g. the signature for matrixmultiply::dgemm.)
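matrixmultiply's sgemm takes independent row and column strides as raw-pointer offsets. A naive sketch with the same argument shape (assuming the signature matrixmultiply documents; the real crate is heavily blocked and vectorized) could be:

```rust
// Naive gemm over raw pointers with independent row/column strides,
// mirroring matrixmultiply's sgemm argument shape:
// C = alpha * A * B + beta * C, with (rs, cs) = (row stride, column stride).
unsafe fn strided_sgemm(
    m: usize, k: usize, n: usize,
    alpha: f32,
    a: *const f32, rsa: isize, csa: isize,
    b: *const f32, rsb: isize, csb: isize,
    beta: f32,
    c: *mut f32, rsc: isize, csc: isize,
) {
    for i in 0..m as isize {
        for j in 0..n as isize {
            let mut acc = 0.0;
            for p in 0..k as isize {
                // strides are in elements, signed so reversed axes also work
                unsafe {
                    acc += *a.offset(i * rsa + p * csa) * *b.offset(p * rsb + j * csb);
                }
            }
            unsafe {
                let cij = c.offset(i * rsc + j * csc);
                *cij = alpha * acc + beta * *cij;
            }
        }
    }
}

fn main() {
    // Row-major 2x2 layout: row stride = 2, column stride = 1.
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [1.0f32, 0.0, 0.0, 1.0]; // identity
    let mut c = [0.0f32; 4];
    unsafe {
        strided_sgemm(2, 2, 2, 1.0, a.as_ptr(), 2, 1,
                      b.as_ptr(), 2, 1, 0.0, c.as_mut_ptr(), 2, 1);
    }
    println!("{:?}", c); // a * identity == a
}
```

Signed element strides are what let the same entry point serve row-major, column-major, and reversed-axis views without copying.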

I'm curious about how much of the performance improvement is due to multithreading, since matrixmultiply is single-threaded. How does the performance compare if you restrict blasoxide to a single thread?

@ghost
Author

ghost commented Jun 11, 2019

I will probably implement the transa and transb arguments in gemm this week. I don't know how to implement both row and column strides, but I might be able to do that next week as well.

Currently, gemm on 1000x1000 matrices takes 10.9 ms on my CPU (Ryzen 2200G).
If I remove rayon and run it single-threaded, it takes 40.1 ms.
matrixmultiply takes around 62 ms.
I measured blasoxide with runtime SIMD detection enabled.

I think the performance difference comes from the microkernel or block sizes, or maybe matrixmultiply doesn't use FMA on my CPU.
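On the FMA point: a fused multiply-add contracts `a * b + c` into a single instruction with one rounding step, which Rust exposes as `f32::mul_add`/`f64::mul_add`. A tiny illustration (values chosen so both forms are exact):

```rust
fn main() {
    let (a, b, c) = (1.5f64, 2.0f64, 0.25f64);
    // fused multiply-add: a * b + c with a single rounding step
    let fused = a.mul_add(b, c);
    // separate multiply then add: two rounding steps
    let unfused = a * b + c;
    println!("{} {}", fused, unfused); // 3.25 3.25 (exact either way here)
}
```

In a gemm inner loop the win is throughput, not rounding: an FMA retires one multiply and one add per instruction, roughly doubling peak FLOPs on CPUs that support it.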

@ghost
Author

ghost commented Jun 11, 2019

Also, it would be great if you cloned the repo and benchmarked on your CPU, or you can benchmark the latest release with your own code (the latest release is faster than previous ones).

@rcarson3
Contributor

Here's another pure Rust BLAS implementation as well, https://github.com/Schultzer/libblas, if y'all want to take a look at it. It was written with no_std support in mind, so it may not be as quick as other implementations out there.

@bluss
Member

bluss commented Sep 4, 2019

Interesting for sure, good luck with the whole project! ❤️ If ndarray can have wishes, it'd love to have 64-bit compatible interfaces for lengths and strides, and have support for both column and row stride, like matrixmultiply. It would also wish for a raw pointer interface that does not depend on contiguous slices (mentioned for completeness).

Since it's a departure from BLAS, I guess it's debatable how useful it would be; at the same time, that kind of flexibility is what new implementations can bring to the table.

Something to look at later could be runtime target-feature detection. matrixmultiply has done this since 0.2, and I think that's the future: no need to set a target CPU feature and so on, so the performance improvements reach everyone. For example, the feature is detected at runtime and the kernel is compiled with #[target_feature(enable="avx")] and so on. https://docs.rs/crate/matrixmultiply/0.2.2/source/src/sgemm_kernel.rs
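The pattern bluss describes can be sketched with `is_x86_feature_detected!` plus a `#[target_feature]`-annotated function and a portable fallback; a toy dot product stands in for a real gemm kernel here:

```rust
// Portable fallback kernel, compiled for the baseline target.
fn dot_fallback(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Same code, but this function alone is compiled with AVX enabled, letting
// the compiler auto-vectorize it. Calling it is unsafe because the caller
// must guarantee the CPU actually has AVX.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Dispatch once per call based on runtime CPU feature detection.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            return unsafe { dot_avx(a, b) };
        }
    }
    dot_fallback(a, b)
}

fn main() {
    let a = vec![1.0f32; 8];
    let b = vec![2.0f32; 8];
    println!("{}", dot(&a, &b)); // 16
}
```

A real library would hoist the detection out of the hot path, e.g. by selecting a kernel function pointer once at startup, which is roughly what matrixmultiply's kernel selection does.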

@ghost
Author

ghost commented Sep 11, 2019

@bluss which type is better for indexes, isize or usize?

@bluss
Member

bluss commented Mar 29, 2021

Closing for inactivity. See @jturner314's comment in the issue for a good take on the general issue.
