Native BLAS operations slow? #854

kwalcock · 2023-11-10T18:13:42Z

So far, every machine I've tried that can load the native BLAS support supplied by the netlib transitive dependency (i.e., Linux-aarch64 or Linux-amd64) runs my app about 3x slower than it does with the Scala/Java implementation. Do others notice the same? I just have some fairly simple matrix multiplications and vector additions. Is there any way to simply disable the native code? The Scala interface is really nice and I'd like to keep using it. To compare performance, I removed the .so files from the blas-3.0.1.jar file so that the native code fails to load, but I can't ship my project with that hack.

dlwh · 2023-11-10T18:59:44Z

That's surprising to me... I'd hope @luhenry would be open to adding an env variable or something to disable native code.

Is it true even for very large matrices? I'd have thought there was a size at which native is going to win. If there's a size threshold, I'm happy to put one into Breeze (we already do that for dot product).
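To make the threshold idea concrete, here is a minimal sketch of size-based dispatch. It is not Breeze's or netlib's code: `MatVec`, `PureJavaMatVec`, `ThresholdMatVec`, and the threshold value are all hypothetical names chosen for illustration, and the cutoff would have to be found by measurement.

```java
// Hypothetical sketch: none of these names come from Breeze or netlib.
interface MatVec {
    // Computes y = A * x for a row-major matrix A with the given shape.
    double[] apply(double[] a, int rows, int cols, double[] x);
}

// Plain-JVM implementation, standing in for the non-native code path.
final class PureJavaMatVec implements MatVec {
    @Override
    public double[] apply(double[] a, int rows, int cols, double[] x) {
        double[] y = new double[rows];
        for (int i = 0; i < rows; i++) {
            double sum = 0.0;
            int base = i * cols;
            for (int j = 0; j < cols; j++) {
                sum += a[base + j] * x[j];
            }
            y[i] = sum;
        }
        return y;
    }
}

// Routes each call to one of two implementations based on matrix size.
final class ThresholdMatVec implements MatVec {
    private final MatVec small;
    private final MatVec large;
    private final long threshold; // element count; an assumed, tunable cutoff

    ThresholdMatVec(MatVec small, MatVec large, long threshold) {
        this.small = small;
        this.large = large;
        this.threshold = threshold;
    }

    @Override
    public double[] apply(double[] a, int rows, int cols, double[] x) {
        long elements = (long) rows * cols;
        // Below the cutoff use the "small" path, otherwise the "large" one.
        MatVec chosen = elements < threshold ? small : large;
        return chosen.apply(a, rows, cols, x);
    }
}
```

In this sketch the `large` slot is where a native-backed implementation would go; which side of the cutoff actually benefits is exactly what would need to be measured.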

luhenry · 2023-11-10T20:54:46Z

@kwalcock would you have a reproducing case?

Generally, calling into native would be slower for very small matrices (overhead of the call mostly). That would only be visible if you are doing many many operations. Happy to look into any case you share! :)
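As a starting point for a reproducing case, a rough timing helper along these lines could show whether per-call overhead dominates. It is only a sketch: `CallTimer` is a hypothetical name, and a hand-rolled loop like this is easily skewed by JIT warm-up and dead-code elimination, so a proper harness such as JMH would give more trustworthy numbers.

```java
// Hypothetical helper for a quick, rough comparison; not part of netlib or Breeze.
final class CallTimer {
    // Runs op `warmup` times untimed, then returns the average
    // wall-clock nanoseconds per call over `iterations` timed runs.
    static double averageNanos(Runnable op, int warmup, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            op.run(); // give the JIT a chance to compile the hot path
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            op.run();
        }
        return (System.nanoTime() - start) / (double) iterations;
    }
}
```

The operation passed in should write its result somewhere observable (for example into a field or a preallocated array) so the JIT cannot discard the work. Timing the same small operation once with native code loaded and once without would give the per-call comparison.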

dlwh · 2023-11-10T20:59:53Z

I'll add that "lots of tiny matmuls/matvecs" is imho a valid use case to optimize for, either at the netlib level or the Breeze level

kwalcock · 2023-11-11T00:10:05Z

I suspect the threshold would be different for everyone. Here we could run our program twice to measure and then pick native on or off. In the data I was working with, a typical problem has four multiplications of (57 x 768) by (768 x 768) and then 10,000 multiplications of (1 x 1536) by (1536 x 1536). That is then scaled to infinitely many problems. I don't know whether that is small, medium, or large.
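For a rough sense of scale, the arithmetic can be sketched as below. The usual estimate of about 2·m·k·n floating-point operations for an (m x k) by (k x n) product is used, and 8-byte doubles are an assumption, since the precision isn't stated here. `WorkloadSize` is a hypothetical name.

```java
// Hypothetical helper for back-of-the-envelope sizing.
final class WorkloadSize {
    // Approximate floating-point operations for an (m x k) by (k x n)
    // product: one multiply and one add per term.
    static long flops(long m, long k, long n) {
        return 2L * m * k * n;
    }

    // Bytes occupied by a rows x cols matrix, assuming 8-byte doubles.
    static long matrixBytes(long rows, long cols) {
        return rows * cols * Double.BYTES;
    }
}
```

With these numbers, `4 * WorkloadSize.flops(57, 768, 768)` is 268,959,744 and `10_000 * WorkloadSize.flops(1, 1536, 1536)` is 47,185,920,000, so the 10,000 vector-matrix products account for roughly 175 times more arithmetic than the four larger multiplications. Each of those products reads all of a 1536 x 1536 matrix, which is `WorkloadSize.matrixBytes(1536, 1536)` = 18,874,368 bytes (18 MiB) under the double-precision assumption, to do about 4.7 million operations.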

dlwh · 2023-11-11T00:13:11Z

That's definitely big enough that I would have thought native would win out.

