
Use blasoxide for BLAS operations #640

Closed
ghost opened this issue Jun 6, 2019 · 12 comments

@ghost

ghost commented Jun 6, 2019

Hi,

I am writing a BLAS implementation in rust: https://github.com/oezgurmakkurt/blasoxide

It doesn't support row strides yet and ignores the transpose arguments, but performance is good.

Could someone list the features, tests, and benchmarks I need to implement to make blasoxide viable for ndarray?

@SuperFluffy
Contributor

Great effort!

In case you haven't seen it yet, can I point you towards bluss' matrixmultiply library, which provides native *gemm implementations? The kernels are modelled after the BLIS library and are ideally written with the architecture of the specific CPUs they will run on in mind.

@ghost
Author

ghost commented Jun 6, 2019

matrixmultiply is ~5x slower on my CPU, which has 4 cores; I think blasoxide would be better for performance.

I benchmarked using 1024x1024 f32 matrices in contiguous memory.
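For context, a benchmark along these lines can be sketched in a few lines of std-only Rust. This uses a naive triple-loop kernel and a smaller matrix so it finishes quickly; it only illustrates the timing setup, not blasoxide's or matrixmultiply's actual harness:

```rust
use std::time::Instant;

// Naive row-major sgemm accumulation: C += A * B. For timing illustration only.
fn naive_sgemm(n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}

fn main() {
    let n = 256; // the issue used 1024x1024; smaller here so the demo is fast
    let a = vec![1.0f32; n * n];
    let b = vec![1.0f32; n * n];
    let mut c = vec![0.0f32; n * n];
    let t = Instant::now();
    naive_sgemm(n, &a, &b, &mut c);
    println!("{} ms, c[0] = {}", t.elapsed().as_millis(), c[0]);
}
```

A real comparison would time matrixmultiply and blasoxide on the same buffers and take the best of several runs.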

@SuperFluffy
Contributor

SuperFluffy commented Jun 6, 2019 via email

@ghost
Author

ghost commented Jun 6, 2019

[screenshot: benchmark results]
steps to reproduce:

@ghost
Author

ghost commented Jun 8, 2019

Hi again,
I just improved the gemm functions; sgemm_1000 is now around 10.5 ms, which is almost 6 times faster than matrixmultiply on my 4-core CPU.

@jturner314
Member

blasoxide is a really cool project! I'm glad to see someone writing a Rust BLAS implementation.

ndarray exposes a small set of linear algebra operations (dot products and matrix/matrix-vector multiplication); the implementations are in impl_linalg.rs.

In other words, ndarray exposes interfaces to the *gemm and *gemv BLAS operations, so if you provide those operations, blasoxide could be supported by ndarray.

For matrix multiplication with arbitrary element types, ndarray has a general (slow) matrix multiplication implementation. For matrix multiplication with f32 and f64, ndarray always uses matrixmultiply unless (1) the user has enabled the BLAS backend and (2) the arrays have memory layouts compatible with BLAS. If both (1) and (2) are met, then the BLAS backend is used instead.

The blasoxide implementations need to support lda, ldb, and ldc (the leading dimensions) and the transpose arguments (which are used, e.g., for handling arrays in row-major layout). Basically, they should act like normal BLAS implementations.
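For illustration, a reference (unoptimized) column-major *gemm that honors the leading-dimension and transpose arguments might look like this; `ref_sgemm` is a hypothetical name, not blasoxide's or ndarray's API:

```rust
// Reference (unoptimized) column-major sgemm mirroring the BLAS argument
// conventions: C = alpha * op(A) * op(B) + beta * C, where op is optionally
// a transpose and lda/ldb/ldc are the leading dimensions.
fn ref_sgemm(
    transa: bool, transb: bool,
    m: usize, n: usize, k: usize,
    alpha: f32, a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    beta: f32, c: &mut [f32], ldc: usize,
) {
    // element (i, j) of a column-major matrix with leading dimension ld,
    // with indices swapped when the transpose flag is set
    let at = |i, j| if transa { a[j + i * lda] } else { a[i + j * lda] };
    let bt = |i, j| if transb { b[j + i * ldb] } else { b[i + j * ldb] };
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0;
            for p in 0..k {
                acc += at(i, p) * bt(p, j);
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}

fn main() {
    // 2x2 identity times [[1, 2], [3, 4]], all column-major
    let a = [1.0f32, 0.0, 0.0, 1.0];
    let b = [1.0f32, 3.0, 2.0, 4.0];
    let mut c = [0.0f32; 4];
    ref_sgemm(false, false, 2, 2, 2, 1.0, &a, 2, &b, 2, 0.0, &mut c, 2);
    println!("{:?}", c); // identity * b == b
}
```

A row-major caller handles its layout by passing the transpose flags and swapping operands, which is exactly why the transpose arguments matter to ndarray.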

So, to directly answer your question:

  • To be used as an optional BLAS backend, we need cblas_dgemm, cblas_sgemm, cblas_dgemv, and cblas_sgemv with support for all arguments (including stride and transpose arguments). (Fwiw, ndarray currently uses the cblas_* functions, but I wouldn't mind changing ndarray to use the Fortran-style functions instead (e.g. dgemm instead of cblas_dgemm).)

    To simplify things, it would be nice for blasoxide to be added as one of the available backends for blas-src. If you do this, AFAIK no changes are necessary to ndarray; the user can just select blasoxide as their desired BLAS implementation.

  • To be usable as an alternative for matrixmultiply, we need functions analogous to dgemm and sgemm that support arbitrary strides. (See e.g. the signature for matrixmultiply::dgemm.)
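matrixmultiply's sgemm takes independent row and column strides as raw-pointer offsets. A naive sketch with the same argument shape (assuming the signature matrixmultiply documents; the real crate is heavily blocked and vectorized) could be:

```rust
// Naive gemm over raw pointers with independent row/column strides,
// mirroring matrixmultiply's sgemm argument shape:
// C = alpha * A * B + beta * C, with (rs, cs) = (row stride, column stride).
unsafe fn strided_sgemm(
    m: usize, k: usize, n: usize,
    alpha: f32,
    a: *const f32, rsa: isize, csa: isize,
    b: *const f32, rsb: isize, csb: isize,
    beta: f32,
    c: *mut f32, rsc: isize, csc: isize,
) {
    for i in 0..m as isize {
        for j in 0..n as isize {
            let mut acc = 0.0;
            for p in 0..k as isize {
                // strides are in elements, signed so reversed axes also work
                unsafe {
                    acc += *a.offset(i * rsa + p * csa) * *b.offset(p * rsb + j * csb);
                }
            }
            unsafe {
                let cij = c.offset(i * rsc + j * csc);
                *cij = alpha * acc + beta * *cij;
            }
        }
    }
}

fn main() {
    // Row-major 2x2 layout: row stride = 2, column stride = 1.
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [1.0f32, 0.0, 0.0, 1.0]; // identity
    let mut c = [0.0f32; 4];
    unsafe {
        strided_sgemm(2, 2, 2, 1.0, a.as_ptr(), 2, 1,
                      b.as_ptr(), 2, 1, 0.0, c.as_mut_ptr(), 2, 1);
    }
    println!("{:?}", c); // a * identity == a
}
```

Signed element strides are what let the same entry point serve row-major, column-major, and reversed-axis views without copying.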

I'm curious about how much of the performance improvement is due to multithreading, since matrixmultiply is single-threaded. How does the performance compare if you restrict blasoxide to a single thread?

@ghost
Author

ghost commented Jun 11, 2019

I will probably implement the transa and transb arguments in gemm this week. I don't know how to implement both row and column strides, but I might be able to do that next week as well.

Currently, gemm on 1000x1000 matrices takes 10.9 ms on my CPU (Ryzen 2200G).
If I remove rayon and run it single-threaded, it takes 40.1 ms.
matrixmultiply takes around 62 ms.
I measured blasoxide with runtime SIMD detection enabled.

I think the performance difference comes from the microkernel or block sizes, or maybe matrixmultiply doesn't use FMA on my CPU.
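On the FMA point: a fused multiply-add contracts `a * b + c` into a single instruction with one rounding step, which Rust exposes as `f32::mul_add`/`f64::mul_add`. A tiny illustration (values chosen so both forms are exact):

```rust
fn main() {
    let (a, b, c) = (1.5f64, 2.0f64, 0.25f64);
    // fused multiply-add: a * b + c with a single rounding step
    let fused = a.mul_add(b, c);
    // separate multiply then add: two rounding steps
    let unfused = a * b + c;
    println!("{} {}", fused, unfused); // 3.25 3.25 (exact either way here)
}
```

In a gemm inner loop the win is throughput, not rounding: an FMA retires one multiply and one add per instruction, roughly doubling peak FLOPs on CPUs that support it.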

@ghost
Author

ghost commented Jun 11, 2019

Also, it would be great if you cloned the repo and benchmarked on your CPU, or you can benchmark the latest release with your own code (the latest release is faster than previous ones).

@rcarson3
Contributor

Here's another pure Rust BLAS implementation as well, https://github.com/Schultzer/libblas, if y'all want to take a look at it. It was written with no_std support in mind, so it may not be as quick as other implementations out there.

@bluss
Member

bluss commented Sep 4, 2019

Interesting for sure, good luck with the whole project! ❤️ If ndarray can have wishes, it'd love to have 64-bit compatible interfaces for lengths and strides, and have support for both column and row stride, like matrixmultiply. It would also wish for a raw pointer interface that does not depend on contiguous slices (mentioned for completeness).

Since it's a departure from BLAS, I guess it's debatable how useful it would be; at the same time, that kind of flexibility is what new implementations can bring to the table.

Something to look at later could be runtime target-feature detection. matrixmultiply has done this since 0.2, and I think that's the future: no need to set a target CPU feature and so on, so the performance improvements reach everyone. For example, the feature is detected at runtime and the kernel is compiled with #[target_feature(enable="avx")] and so on. https://docs.rs/crate/matrixmultiply/0.2.2/source/src/sgemm_kernel.rs
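The pattern bluss describes can be sketched with `is_x86_feature_detected!` plus a `#[target_feature]`-annotated function and a portable fallback; a toy dot product stands in for a real gemm kernel here:

```rust
// Portable fallback kernel, compiled for the baseline target.
fn dot_fallback(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Same code, but this function alone is compiled with AVX enabled, letting
// the compiler auto-vectorize it. Calling it is unsafe because the caller
// must guarantee the CPU actually has AVX.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Dispatch once per call based on runtime CPU feature detection.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            return unsafe { dot_avx(a, b) };
        }
    }
    dot_fallback(a, b)
}

fn main() {
    let a = vec![1.0f32; 8];
    let b = vec![2.0f32; 8];
    println!("{}", dot(&a, &b)); // 16
}
```

A real library would hoist the detection out of the hot path, e.g. by selecting a kernel function pointer once at startup, which is roughly what matrixmultiply's kernel selection does.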

@ghost
Author

ghost commented Sep 11, 2019

@bluss which type is better for indexes, isize or usize?

@bluss
Member

bluss commented Mar 29, 2021

Closing for inactivity. See @jturner314's comment in the issue for a good take on the general issue.
