
Benchmark numpy vs breeze vs BlockMatrix? #3

Closed · eric-czech opened this issue Jan 7, 2021 · 7 comments

@eric-czech (Collaborator)

Moving this discussion here from related-sciences/gwas-analysis#4 (comment).

I have doubts that claim v in the current outline is one we can substantiate well:

> Xarray, Dask and Zarr constitute a cloud-native, distributed, array-centric data processing framework that has inherent advantages over MapReduce and MPI for genetics

Specifically, I think any empirical comparison of Dask and something like Spark attempting to justify this would be impossible to disentangle from the underlying technical differences between numpy and breeze. @ravwojdyla recently suggested comparing those two directly, or looking again for existing benchmarks. That comparison, if we could find or produce it, would likely be more productive than arguing the tradeoffs of higher-level design choices in Spark and Dask. I say this because both frameworks are ultimately shoveling around array blocks for breeze/numpy to process (that's where the heavy lifting happens), Spark does support 2D chunking, and the case for ND (N > 2) chunking is not one we've made well with sgkit, since none of the methods actually rely on chunking beyond the second dimension, afaik.
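
For that direct comparison, even a single-operation micro-benchmark would give a ballpark. A minimal sketch of the numpy side (the matrix size is arbitrary, and an equivalent Breeze harness would have to be written separately for the JVM):

```python
import time

import numpy as np

# Time a single dense matmul in numpy. The Breeze counterpart
# (DenseMatrix * DenseMatrix) would be timed separately on the JVM,
# with warm-up runs to account for JIT compilation. The size here is
# arbitrary and chosen only for illustration.
n = 4096
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))

start = time.perf_counter()
c = a @ b
print(f"numpy {n}x{n} matmul: {time.perf_counter() - start:.2f}s")
```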

@ravwojdyla (Collaborator)

@eric-czech on the comparison point, another thing that came to mind as a possibly promising data point is the PC Relate implementation in sgkit. It has the same "algorithm" as Hail (it uses the same linear algebra operations), but the performance is better (over an order of magnitude better on a 1kGP dataset containing 629 samples and ~5.8e6 SNPs; see https://discourse.pystatgen.org/t/pc-relate-experiment/51). So it might be that the PC Relate implementation in Hail is particularly bad, or maybe the data is too small to surface the issues we experience with GWAS?
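
Since both implementations lean on the same linear algebra, the comparison largely reduces to how each framework executes chunked matrix products. A rough sketch of the kind of operation that dominates, on the dask side (shapes and chunk sizes are invented for illustration and are not taken from sgkit's PC Relate):

```python
import dask.array as da

# The heavy lifting in PC Relate reduces to chunked matrix products over
# the genotype matrix. Shapes and chunks below are illustrative only,
# not taken from the sgkit implementation.
n_variants, n_samples = 100_000, 629
gt = da.random.random((n_variants, n_samples), chunks=(10_000, n_samples))

# dask schedules one numpy matmul per pair of chunks, much as Hail's
# BlockMatrix dispatches per-block multiplies to Breeze.
sample_cross_product = (gt.T @ gt).compute()  # (n_samples, n_samples)
```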

@eric-czech (Collaborator, Author)

> So it might be that the PC Relate implementation in Hail is particularly bad, or maybe the data is too small to surface the issues we experience with GWAS?

Oh yeah, good point. I wouldn't be surprised if Dask was much better in local mode but much worse in a distributed setting, given my impression that quite a lot of Dask users run it locally (for out-of-core processing only), while a larger fraction of Spark users run code on a cluster. Maybe it's worth simply seeing how much that PC Relate example degrades when run on a cluster vs. on a single node with the same overall resources (as compared to Hail/Spark)?
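
A sketch of how that degradation test could be wired up with dask.distributed; the scheduler address is hypothetical and the PC Relate workload itself is elided:

```python
from dask.distributed import Client, LocalCluster

# Run the same workload twice with equal aggregate resources: once on a
# single node, once against a multi-node scheduler, and compare wall
# clock. The scheduler address below is hypothetical.

# Single-node baseline: e.g. 16 cores on one machine.
client = Client(LocalCluster(n_workers=8, threads_per_worker=2))

# Distributed run: swap in a remote scheduler fronting e.g. eight
# 2-core workers, then rerun the identical PC Relate computation.
# client = Client("tcp://scheduler.example.com:8786")
```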

@hammer commented Jan 11, 2021

🤔 I'm not sure we want to get into the single-node linear algebra benchmarking space; it's quite sophisticated and hardware-specific. It's also sensitive to row-major (numpy) vs. column-major (Breeze) memory layout, benchmarking anything on the JVM is tricky, and we'd have to contend with the zoo of Python accelerators as well.
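
To make the layout sensitivity concrete, a minimal sketch contrasting row-major and column-major copies of the same matrix in numpy; the gap (or lack of one) will depend on the BLAS build and hardware:

```python
import time

import numpy as np

# Compare the same matmul on C-ordered (row-major, numpy's default) and
# F-ordered (column-major, Breeze-style) copies of one matrix. With an
# optimized BLAS the gap may be negligible, which is itself part of the
# hardware/library sensitivity described above.
n = 2048
a = np.random.default_rng(0).standard_normal((n, n))
layouts = {"C-order": np.ascontiguousarray(a), "F-order": np.asfortranarray(a)}

for label, m in layouts.items():
    start = time.perf_counter()
    m @ m
    print(f"{label}: {time.perf_counter() - start:.3f}s")
```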

Breeze delegates to netlib-java, which in turn delegates to BLAS/LAPACK/ARPACK. Breeze is in maintenance mode, and netlib-java is no longer maintained at all. The poor support for libraries at the base of the Scala numerics stack was a big reason we decided to move to Python at Related Sciences.

If anything, let's just refer to an external benchmark. I think it will be more important to give representative performance numbers, without claiming they are perfectly designed benchmarks, and then to give a sense of how those numbers behave in various settings: scale-out, as Eric mentioned, as well as scaling up on a single node with more cores or on a GPU.

@hammer commented Jan 11, 2021

Also, with respect to Eric's comment about the future of Breeze in Spark: I don't think Databricks cares about Scala linear algebra, since their hosted proprietary runtime does not use the JVM and most of their users write Python. The last comment I could find after a brief JIRA search was from 2017:

> There's been lot of discussion around the issue of Spark providing a linear algebra lib and the consensus is generally that it's a huge amount of overhead for Spark to maintain a full-blown linear algebra lib.

See https://issues.apache.org/jira/browse/SPARK-6442 and https://issues.apache.org/jira/browse/SPARK-16365.

@ravwojdyla (Collaborator)

> I'm not sure we want to get into the single-node linear algebra benchmarking space; it's quite sophisticated and hardware-specific. It's also sensitive to row-major (numpy) vs. column-major (Breeze) memory layout, benchmarking anything on the JVM is tricky, and we'd have to contend with the zoo of Python accelerators as well.

@hammer this came up during the last sgkit meeting: none of us was aware of a good external performance benchmark between numpy and breeze to use as a proxy metric for the expected performance of sgkit vs. Hail. I believe the goal would not be to produce a definitive benchmark, but rather a ballpark number for numpy/dask.array vs. breeze/hail.BlockMatrix. The expectation is that Hail's BlockMatrix, backed by Breeze, should be slower. Do you think that's still too much?
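
A crude version of that ballpark could even be driven from one Python script, since Hail exposes BlockMatrix through its Python API. A sketch, assuming hl.linalg.BlockMatrix.from_numpy and deliberately leaving JVM/Spark overhead in the measurement:

```python
import time

import hail as hl
import numpy as np

# Ballpark proxy, not a calibrated benchmark: time the same product
# through Hail's Breeze-backed BlockMatrix and through plain numpy.
# Assumes the hl.linalg.BlockMatrix.from_numpy API; JVM warm-up and
# Spark job overhead are deliberately left in, since sgkit users pay
# numpy's cost and Hail users pay BlockMatrix's.
hl.init()

n = 4096
x = np.random.default_rng(0).standard_normal((n, n))

start = time.perf_counter()
bm = hl.linalg.BlockMatrix.from_numpy(x)
(bm @ bm.T).to_numpy()
print(f"hail BlockMatrix: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
x @ x.T
print(f"numpy:            {time.perf_counter() - start:.2f}s")
```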

@hammer commented Jan 11, 2021

I guess I don't mind if the benchmark is not much more than what Eric did in the original issue: related-sciences/gwas-analysis#4 (comment). I'm just nervous about the thinking and resources required to say anything definitive in the performance space, and I want to be sure we present our numbers as crude.

@jeromekelleher (Collaborator)

Revisiting this, it feels like a lot of work to get into, and work we don't really have the resources for. I don't think we're trying to claim that our tech stack is the "best" (x times faster than package y on dataset z), but rather that it's generally good enough (we can do analysis x on UK Biobank in time t, costing $y).

Comparative benchmarking is a huge timesink, which I'd really rather avoid.
