
Benchmark numpy vs breeze vs BlockMatrix? #3

Closed · eric-czech opened this issue Jan 7, 2021 · 7 comments

@eric-czech (Collaborator)

Moving this discussion here from related-sciences/gwas-analysis#4 (comment).

I have doubts that claim v in the current outline is one we can substantiate well:

> Xarray, Dask and Zarr constitute a cloud-native, distributed, array-centric data processing framework that has inherent advantages over MapReduce and MPI for genetics

Specifically, I think any empirical comparison of Dask and something like Spark attempting to justify this would be impossible to disentangle from the underlying technical differences between numpy and breeze. @ravwojdyla recently suggested comparing those two directly, or looking again for existing benchmarks. That comparison, if we could find or produce it, would likely be more productive than arguing the tradeoffs of higher-level design choices in Spark and Dask. I say this because both frameworks are ultimately shoveling around array blocks for breeze/numpy to process (that's where the heavy lifting happens), Spark does support 2D chunking, and the case for ND (N > 2) chunking is not one we've made well with sgkit, since none of the methods actually rely on chunking beyond the second dimension, afaik.
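
For that direct comparison, even a single-operation micro-benchmark would give a ballpark. A minimal sketch of the numpy side (the matrix size is arbitrary, and an equivalent Breeze harness would have to be written separately for the JVM):

```python
import time

import numpy as np

# Time a single dense matmul in numpy. The Breeze counterpart
# (DenseMatrix * DenseMatrix) would be timed separately on the JVM,
# with warm-up runs to account for JIT compilation. The size here is
# arbitrary and chosen only for illustration.
n = 4096
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))

start = time.perf_counter()
c = a @ b
print(f"numpy {n}x{n} matmul: {time.perf_counter() - start:.2f}s")
```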

@ravwojdyla (Collaborator)

@eric-czech on the comparison point, another thing that came to mind as a possibly promising data point is the PC Relate implementation in sgkit. It has the same "algorithm" as Hail (it uses the same linear algebra operations), but the performance is better (over an order of magnitude better on a 1kGP dataset containing 629 samples and ~5.8e6 SNPs; see https://discourse.pystatgen.org/t/pc-relate-experiment/51). So it might be that the PC Relate implementation in Hail is particularly bad, or maybe the data is too small to surface the issues we experience with GWAS?
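
Since both implementations lean on the same linear algebra, the comparison largely reduces to how each framework executes chunked matrix products. A rough sketch of the kind of operation that dominates, on the dask side (shapes and chunk sizes are invented for illustration and are not taken from sgkit's PC Relate):

```python
import dask.array as da

# The heavy lifting in PC Relate reduces to chunked matrix products over
# the genotype matrix. Shapes and chunks below are illustrative only,
# not taken from the sgkit implementation.
n_variants, n_samples = 100_000, 629
gt = da.random.random((n_variants, n_samples), chunks=(10_000, n_samples))

# dask schedules one numpy matmul per pair of chunks, much as Hail's
# BlockMatrix dispatches per-block multiplies to Breeze.
sample_cross_product = (gt.T @ gt).compute()  # (n_samples, n_samples)
```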

@eric-czech (Collaborator, Author)

> So it might be that the PC Relate implementation in Hail is particularly bad, or maybe the data is too small to surface the issues we experience with GWAS?

Oh yeah, good point. I wouldn't be surprised if Dask was much better in local mode but much worse in a distributed setting, given my impression that quite a lot of Dask users run it locally (for out-of-core processing only), while a larger fraction of Spark users run code on a cluster. Maybe it's worth simply seeing how much that PC Relate example degrades when run on a cluster vs. on a single node with the same overall resources (as compared to Hail/Spark)?
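
A sketch of how that degradation test could be wired up with dask.distributed; the scheduler address is hypothetical and the PC Relate workload itself is elided:

```python
from dask.distributed import Client, LocalCluster

# Run the same workload twice with equal aggregate resources: once on a
# single node, once against a multi-node scheduler, and compare wall
# clock. The scheduler address below is hypothetical.

# Single-node baseline: e.g. 16 cores on one machine.
client = Client(LocalCluster(n_workers=8, threads_per_worker=2))

# Distributed run: swap in a remote scheduler fronting e.g. eight
# 2-core workers, then rerun the identical PC Relate computation.
# client = Client("tcp://scheduler.example.com:8786")
```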

@hammer commented Jan 11, 2021

🤔 I'm not sure we want to get into the single-node linear algebra benchmarking space; it's quite sophisticated and hardware-specific. It's also sensitive to row-major (numpy) vs. column-major (Breeze) memory layout, benchmarking anything on the JVM is tricky, and we'd have to contend with the zoo of Python accelerators as well.
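
To make the layout sensitivity concrete, a minimal sketch contrasting row-major and column-major copies of the same matrix in numpy; the gap (or lack of one) will depend on the BLAS build and hardware:

```python
import time

import numpy as np

# Compare the same matmul on C-ordered (row-major, numpy's default) and
# F-ordered (column-major, Breeze-style) copies of one matrix. With an
# optimized BLAS the gap may be negligible, which is itself part of the
# hardware/library sensitivity described above.
n = 2048
a = np.random.default_rng(0).standard_normal((n, n))
layouts = {"C-order": np.ascontiguousarray(a), "F-order": np.asfortranarray(a)}

for label, m in layouts.items():
    start = time.perf_counter()
    m @ m
    print(f"{label}: {time.perf_counter() - start:.3f}s")
```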

Breeze delegates to netlib-java, which in turn delegates to BLAS/LAPACK/ARPACK. Breeze is in maintenance mode, and netlib-java is no longer maintained at all. The poor support for libraries at the base of the Scala numerics stack was a big reason we decided to move to Python at Related Sciences.

If anything, let's just refer to an external benchmark. I think it will be more important to give representative performance numbers, without claiming they are perfectly designed benchmarks, and then to give a sense of how those numbers behave in various settings: scale-out, as Eric mentioned, as well as scaling up on a single node with more cores or on a GPU.

@hammer commented Jan 11, 2021

Also, with respect to Eric's comment about the future of Breeze in Spark: I don't think Databricks cares about Scala linear algebra, since their hosted proprietary runtime does not use the JVM and most of their users write Python. The last comment I could find after a brief JIRA search was from 2017:

> There's been lot of discussion around the issue of Spark providing a linear algebra lib and the consensus is generally that it's a huge amount of overhead for Spark to maintain a full-blown linear algebra lib.

See https://issues.apache.org/jira/browse/SPARK-6442 and https://issues.apache.org/jira/browse/SPARK-16365.

@ravwojdyla (Collaborator)

> I'm not sure we want to get into the single-node linear algebra benchmarking space; it's quite sophisticated and hardware-specific. It's also sensitive to row-major (numpy) vs. column-major (Breeze) memory layout, benchmarking anything on the JVM is tricky, and we'd have to contend with the zoo of Python accelerators as well.

@hammer this came up during the last sgkit meeting: none of us was aware of a good external performance benchmark between numpy and breeze to use as a proxy metric for the expected performance of sgkit vs. Hail. I believe the goal would not be to produce a definitive benchmark, but rather a ballpark number for numpy/dask.array vs. breeze/hail.BlockMatrix. The expectation is that Hail's BlockMatrix, backed by Breeze, should be slower. Do you think that's still too much?
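
A crude version of that ballpark could even be driven from one Python script, since Hail exposes BlockMatrix through its Python API. A sketch, assuming hl.linalg.BlockMatrix.from_numpy and deliberately leaving JVM/Spark overhead in the measurement:

```python
import time

import hail as hl
import numpy as np

# Ballpark proxy, not a calibrated benchmark: time the same product
# through Hail's Breeze-backed BlockMatrix and through plain numpy.
# Assumes the hl.linalg.BlockMatrix.from_numpy API; JVM warm-up and
# Spark job overhead are deliberately left in, since sgkit users pay
# numpy's cost and Hail users pay BlockMatrix's.
hl.init()

n = 4096
x = np.random.default_rng(0).standard_normal((n, n))

start = time.perf_counter()
bm = hl.linalg.BlockMatrix.from_numpy(x)
(bm @ bm.T).to_numpy()
print(f"hail BlockMatrix: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
x @ x.T
print(f"numpy:            {time.perf_counter() - start:.2f}s")
```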

@hammer commented Jan 11, 2021

I guess I don't mind if the benchmark is not much more than what Eric did in the original issue: related-sciences/gwas-analysis#4 (comment). I'm just nervous about the thinking and resources required to say anything definitive in the performance space, and I want to be sure we present our numbers as crude.

@jeromekelleher (Collaborator)

Revisiting this, it feels like a lot of work to get into, and work we don't really have the resources for. I don't think we're trying to claim that our tech stack is the "best" (x times faster than package y on dataset z), but rather that it's generally good enough (we can do analysis x on UK Biobank in time t, costing $y).

Comparative benchmarking is a huge timesink, which I'd really rather avoid.
