Use blasoxide for BLAS operations #640
Great effort! In case you haven't seen it yet, can I point you towards bluss' matrixmultiply library, which provides native […]
matrixmultiply is ~5x slower on my CPU, which has 4 cores; I think blasoxide would be better for performance. I benchmarked using 1024x1024 f32 matrices in contiguous memory.
That's awesome and a bit surprising. Can you provide some benchmarks for both cases?
Hi again,

The implementations are in […]. In other words, […]. For matrix multiplication with arbitrary element types, […]. The […]. So, to directly answer your question: […]
I'm curious about how much of the performance improvement is due to multithreading, since […]
I will probably implement the transa and transb arguments in gemm this week. I don't know how to implement both row and column strides, but I might be able to do that next week as well. Currently, gemm on 1000x1000 matrices takes 10.9 ms on my CPU (Ryzen 2200G). I think the performance difference is because of microkernel or block sizes, or maybe matrixmultiply doesn't use FMA on my CPU.
Also, it would be great if you cloned the repo and benchmarked on your CPU, or you can use the latest release to benchmark with your own code (the latest release is faster than previous ones).
Here's another pure-Rust BLAS implementation as well, https://github.com/Schultzer/libblas, if y'all want to take a look at it. It was written with no_std support in mind, so it may not be as quick as other implementations out there.
Interesting for sure, good luck with the whole project! ❤️

If ndarray could have wishes, it'd love to have 64-bit compatible interfaces for lengths and strides, and support for both column and row strides, like matrixmultiply. It would also wish for a raw pointer interface that does not depend on contiguous slices (mentioned for completeness). Since that's a departure from BLAS, I guess it's debatable how useful it would be; at the same time, that kind of flexibility is what new implementations can bring to the table.

Something to look at later could be runtime target feature detection. It's something matrixmultiply has done since 0.2, and I think that's the future: no need to set a target CPU feature and so on, so the performance improvements reach everyone. Example: […] is feature-detected and […]
@bluss which type is better for indexes, isize or usize?
Closing for inactivity. See @jturner314's comment in the issue for a good take on the general question.
Hi,
I am writing a BLAS implementation in Rust: https://github.com/oezgurmakkurt/blasoxide
It doesn't have row strides and ignores the transpose arguments for now, but performance is good.
Can someone list the features, tests, and benchmarks I need to implement to make blasoxide viable for ndarray?