Native BLAS operations slow? #854

Open
kwalcock opened this issue Nov 10, 2023 · 5 comments

@kwalcock

So far, every machine that can load the native BLAS support supplied by the netlib transitive dependency (i.e., Linux-aarch64 or Linux-amd64) runs my app about 3x slower than it runs with the Scala/Java implementation. Do others notice the same? I just have some fairly simple matrix multiplications and vector additions. Is there any way to disable the native code? The Scala interface is really nice and I'd like to keep using it. To compare performance, I removed the .so files from blas-3.0.1.jar so that the native code fails to load, but I can't ship my project with that hack.

@dlwh
Member

dlwh commented Nov 10, 2023

That's surprising to me... I'd hope @luhenry would be open to adding an env variable or something to disable native code.

Is it true even for very large matrices? I'd have thought there is a size above which native wins. If there's a threshold, I'm happy to put one into Breeze (which we already do for dot product).
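For illustration, a size threshold of the kind mentioned here could look roughly like the sketch below. It is not Breeze's or netlib's code: the names `Gemm`, `LoopGemm`, `Dispatcher`, and the threshold value are all hypothetical, and the "large" path is just a second pure-Java loop standing in for a native call.

```java
import java.util.Arrays;

final class ThresholdGemm {
    /** Row-major multiply: C (m x n) = A (m x k) * B (k x n). Hypothetical interface. */
    interface Gemm {
        void multiply(int m, int n, int k, double[] a, double[] b, double[] c);
    }

    /** Plain triple loop; counts its calls so the dispatch decision is observable. */
    static final class LoopGemm implements Gemm {
        int calls = 0;

        @Override
        public void multiply(int m, int n, int k, double[] a, double[] b, double[] c) {
            calls++;
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++) {
                    double sum = 0.0;
                    for (int p = 0; p < k; p++) {
                        sum += a[i * k + p] * b[p * n + j];
                    }
                    c[i * n + j] = sum;
                }
            }
        }
    }

    /** Sends problems with m*n*k below the threshold to one path, the rest to the other. */
    static final class Dispatcher implements Gemm {
        private final Gemm small;
        private final Gemm large;
        private final long threshold;

        Dispatcher(Gemm small, Gemm large, long threshold) {
            this.small = small;
            this.large = large;
            this.threshold = threshold;
        }

        @Override
        public void multiply(int m, int n, int k, double[] a, double[] b, double[] c) {
            long work = (long) m * n * k; // number of multiply-add terms
            Gemm chosen = work < threshold ? small : large;
            chosen.multiply(m, n, k, a, b, c);
        }
    }

    public static void main(String[] args) {
        LoopGemm jvmPath = new LoopGemm();
        LoopGemm nativeStandIn = new LoopGemm(); // assumption: would wrap a native BLAS call
        Gemm gemm = new Dispatcher(jvmPath, nativeStandIn, 1_000_000L); // arbitrary threshold

        double[] a = {1, 2};
        double[] b = {3, 4, 5, 6};
        double[] c = new double[2];
        gemm.multiply(1, 2, 2, a, b, c);
        System.out.println(Arrays.toString(c) + " jvm calls=" + jvmPath.calls
                + " stand-in calls=" + nativeStandIn.calls);
    }
}
```

The right threshold would have to be measured per machine and per operation; the value used in `main` is arbitrary.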

@luhenry
Contributor

luhenry commented Nov 10, 2023

@kwalcock would you have a reproducing case?

Generally, calling into native code is slower for very small matrices (mostly because of the overhead of the call). That would only be visible if you are doing a great many operations. Happy to look into any case you share! :)

@dlwh
Member

dlwh commented Nov 10, 2023

I'll add that "lots of tiny matmuls/matvecs" is imho a valid use case to optimize for, either at the netlib level or the Breeze level

@kwalcock
Author

I suspect the threshold would be different for everyone. In our case we could run the program twice, once with native code and once without, and then pick whichever is faster. In the data I was working with, a typical problem has four multiplications of a 57 x 768 matrix by a 768 x 768 matrix and then 10,000 multiplications of a 1 x 1536 vector by a 1536 x 1536 matrix. That is then repeated over an unbounded number of problems. I don't know whether that is small, medium, or large.
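To put rough numbers on those sizes, here is a small back-of-the-envelope calculation. It assumes the usual count of 2·m·k·n floating-point operations for an (m x k) by (k x n) product and 8-byte doubles; the thread does not say which element type is used, so the byte figure is an assumption. The class and method names are made up for this sketch.

```java
final class WorkloadSize {
    /** Flops for C (m x n) = A (m x k) * B (k x n): one multiply and one add per term. */
    static long gemmFlops(long m, long k, long n) {
        return 2L * m * k * n;
    }

    /** Bytes in a dense rows x cols matrix, assuming 8-byte doubles. */
    static long denseBytes(long rows, long cols) {
        return 8L * rows * cols;
    }

    public static void main(String[] args) {
        // Four products of (57 x 768) by (768 x 768).
        long matMat = 4L * gemmFlops(57, 768, 768);
        // 10,000 products of (1 x 1536) by (1536 x 1536).
        long vecMat = 10_000L * gemmFlops(1, 1536, 1536);
        // Size of the 1536 x 1536 matrix that each vector-matrix product reads.
        long matrixBytes = denseBytes(1536, 1536);

        System.out.println("matrix-matrix flops per problem: " + matMat);
        System.out.println("vector-matrix flops per problem: " + vecMat);
        System.out.println("bytes in one 1536 x 1536 matrix: " + matrixBytes);
    }
}
```

Under those assumptions the four matrix-matrix products come to 268,959,744 flops per problem and the 10,000 vector-matrix products to 47,185,920,000, so the vector-matrix products account for roughly 175 times more arithmetic. Each of them performs only two flops per matrix element read, and the 1536 x 1536 matrix occupies 18,874,368 bytes (18 MiB) as doubles.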

@dlwh
Member

dlwh commented Nov 11, 2023 via email
