Clean up apache-spark benchmarks #248
Conversation
LGTM!
The restriction on keeping RDDs in memory is certainly up for discussion and can be easily reverted if we don't like it; I just thought that if we can, we should try to avoid disk IO in the benchmarks and instead size the heap properly. As for the brittleness of the JMH runs in CI -- we run the JMH wrapper with …
It looks like different use cases to me. I really like the fact that we minimize disk IO for benchmarking when we have control over it. But for testing purposes, we may not want to be too memory-greedy, and we are OK with paying a price there, since performance is not the point. This typically applies to the CI use case. So my feeling is that the right solution is to expose in-memory caching as a parameter, which would default to true for all benchmarking configurations and to false for the test configurations.
You are right, that's something we should do, regardless of the reason for the JMH failures. I'm trying to find out what the root cause is, because I didn't really change much there. On OracleJDK and OpenJDK, JMH is probably telling the JVM something that results in a stack overflow. The biggest change is the parsing of a CSV-like parameter string with ALS configurations that produce distinct RMSE values (to make the results verifiable), but the stack overflow happens after that.
Creating RDDs from Datasets appears to be quite costly, so we should just read things ourselves while we are using the RDD-based API.
SparkSession appears to be the newer API, and it also provides data readers (even though we don't use them for RDDs).
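To illustrate the difference, a minimal sketch of the two ways of getting an RDD of lines; the input path is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").master("local[4]").getOrCreate()

// Reading through the Dataset API and converting to an RDD goes through
// the query planner and row encoding before handing out plain objects.
val viaDataset = spark.read.textFile("scratch/input.txt").rdd

// Reading directly through the SparkContext stays in the RDD world.
val direct = spark.sparkContext.textFile("scratch/input.txt")
```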
- Use the sparkContext inherited from SparkUtil so that we don't have to handle it explicitly.
- Use the scratch directory to prepare input files and clean up the parsing of CSV files.
- Use the I/O API from the JDK instead of Apache Commons.
- Configure ALS to store intermediate RDDs in memory only and to use a fixed random seed.

There are further changes coming, but these are important to finish the scratch file handling.
- Adds explicit ALS-related parameters.
- Configures ALS to only use memory for intermediate and final RDDs, and sets a fixed random seed.
- Reads CSV using the facilities from SparkUtil.
- Removes the individual ALS algorithm parameters (and their cartesian-product interpretation) and instead adds a CSV-like table of ALS configurations together with the expected RMSE.
- Chooses ALS parameters that lead to different RMSE values, to make it easier to validate the results in the presence of numerical instability.
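A sketch of how such a configuration table might be represented and parsed; the concrete row format (comma-separated fields, semicolon-separated rows) is an assumption for illustration, not the actual parameter syntax:

```scala
// Hypothetical ALS configuration row: algorithm parameters plus the RMSE
// expected for them, so that benchmark results can be validated.
final case class AlsConfig(rank: Int, lambda: Double, iterations: Int, expectedRmse: Double)

def parseConfigs(table: String): Seq[AlsConfig] =
  table.split(";").toSeq.map { row =>
    val Array(rank, lambda, iterations, rmse) = row.split(",").map(_.trim)
    AlsConfig(rank.toInt, lambda.toDouble, iterations.toInt, rmse.toDouble)
  }

// Example: parseConfigs("10,0.1,10,0.95;20,0.05,15,0.92")
```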
Avoid setting the RDD storage level in individual benchmarks. Instead, let the `ensureCached()` method do it, thus relying on a policy defined in a single place.
This might be needed by some benchmarks even though we generally want to avoid that, because checkpointing leads to IO.
Without the checkpoint directory set, JMH runs of the benchmark in Travis CI tend to crash with stack overflow.
Force-pushed from 0b3b276 to 3004bce.
Oddly enough, it appears that not configuring the checkpoint directory was the reason for the stack overflow.
Apparently, the problem in the JMH runs disappeared when I put the checkpoint directory back. Anyway, it seems we might be ready to merge this, so that we can move on.
This was originally intended as a simple follow-up to #246 (making actual use of the scratch directory), but I ended up reorganizing the code around a common pattern so that the benchmarks all look alike. The following benchmarks are being updated:
als
chi-square
dec-tree
gauss-mix
log-regression
movie-lens
naive-bayes
page-rank
Common changes
Code organization
- The `prepareInput()`, `loadData()`, and `ensureCached()` methods return values that need to be passed around (see the sketch below).
- Each benchmark (except `page-rank`, which already has validation) contains a `dumpResult()` method.
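To show the shape of that common pattern, here is a simplified sketch; the trait name and method signatures are hypothetical, not the exact Renaissance API:

```scala
import java.nio.file.Path
import org.apache.spark.rdd.RDD

// Hypothetical outline of the common benchmark pattern.
trait SparkBenchmarkPattern {
  // Prepares input data in the scratch directory and returns its location.
  def prepareInput(): Path

  // Loads the prepared input into an RDD.
  def loadData(input: Path): RDD[String]

  // Applies the single caching policy and returns the cached data.
  def ensureCached(data: RDD[String]): RDD[String]

  // Dumps the computation output after the last iteration.
  def dumpResult(): Unit
}
```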
Benchmark parameters
- Add the `spark_executor_count` parameter, which determines the number of Spark executor instances. It currently defaults to `4`, but making the parameter explicit will (eventually) allow setting it from outside.
- Rename `thread_count` to `spark_executor_thread_count` (to make it clear that it represents the number of threads per Spark executor) and make it available in all Spark benchmarks.
- Use consistent names for the parameters that control the workload size, such as `copy_count` and `input_line_count`.

Spark context handling
- Set up the Spark context in a common `setUpSparkContext()` method.
- Pass the `BenchmarkContext` instance to the method so that it can query the `spark_executor_count` and `spark_executor_thread_count` parameters and the benchmark's scratch directory on its own.
- Use the `@Name` annotation to find out the benchmark's name, instead of having to duplicate it in code.
- Keep the `sparkContext` in the `SparkUtil` trait and have the benchmarks inherit the field.
- Use `SparkSession` and avoid creating `SparkConf` and `SparkContext` directly (see the sketch below).
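A minimal sketch of what such a setup can look like; the builder settings and the mapping of the executor and thread counts onto a local master are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical values that would be read from the benchmark parameters.
val executorCount = 4
val threadsPerExecutor = 2

// Create the session instead of instantiating SparkConf/SparkContext directly;
// the sparkContext field inherited from SparkUtil is then taken from it.
val spark = SparkSession.builder()
  .appName("als") // in the real code, the name comes from the @Name annotation
  .master(s"local[${executorCount * threadsPerExecutor}]")
  .getOrCreate()

val sparkContext = spark.sparkContext
```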
Spark RDD caching
- The `ensureCached()` method now sets the `RDD` or `Dataset` storage level to `MEMORY_ONLY` using the `persist()` method, because the `cache()` method switched to using `MEMORY_AND_DISK` in newer Spark versions.
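A sketch of such a method for the RDD case; the generic signature is an assumption:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Centralized caching policy: keep data in memory only, so that benchmark
// iterations do not silently fall back to disk IO.
def ensureCached[T](rdd: RDD[T]): RDD[T] =
  rdd.persist(StorageLevel.MEMORY_ONLY)

// For RDDs, cache() is still equivalent to persist(StorageLevel.MEMORY_ONLY);
// it is Dataset.cache() that defaults to MEMORY_AND_DISK in newer Spark,
// hence the explicit persist(...) call in the Dataset variant.
```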
Spark logging

- Configure Spark logging at the `WARN` level.
- This replaces the previous approach of starting at the `INFO` level and switching to the `ERROR` level on the `SparkContext`.
- The `movie-lens` benchmark may report a bit more, because it used to report at the `ERROR` level only.
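Log levels of this kind are typically adjusted through log4j; a sketch of the idea (the exact logger names and levels used by the benchmarks are not shown here):

```scala
import org.apache.log4j.{Level, Logger}

// Keep Spark's own logging at WARN so that genuine warnings stay visible.
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

// The embedded Eclipse jetty server is noisy enough that its root logger
// has been turned down separately (see the discussion below).
Logger.getLogger("org.eclipse.jetty").setLevel(Level.ERROR)
```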
Scratch file handling

- Use the benchmark's scratch directory (instead of the `target/<bench-name>` directory) for preparing the input datasets as well as for dumping the computation output after the last iteration.
- Use the I/O API from the JDK; there is no need to pull in Apache Commons (`FileUtils` and `IO`) for such simple things.
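A sketch of the kind of JDK calls that cover these needs; the paths are hypothetical:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical scratch location; the real one comes from the benchmark context.
val scratch: Path = Paths.get("scratch", "als")
Files.createDirectories(scratch)

// Write a prepared input dataset.
val input = scratch.resolve("input.csv")
Files.write(input, "1,2,3\n4,5,6\n".getBytes(StandardCharsets.UTF_8))

// Read it back without Apache Commons.
val lines = Files.readAllLines(input, StandardCharsets.UTF_8)
```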
Benchmark-specific changes
In addition to the common changes above, these are the changes made to the individual benchmarks.
als
- Rename `rating_count` to `user_count`, because that's what it actually is.
- Add the `product_count` parameter to the `als` benchmark, because it also influences the workload size.
- Add the `als_rank`, `als_lambda`, and `als_iterations` parameters of the ALS algorithm (see the sketch below).
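A sketch of how these parameters map onto the RDD-based ALS API, including the in-memory storage levels and the fixed seed from the common changes; the parameter values are illustrative:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// ratings: RDD[Rating] prepared by the benchmark's input-loading code.
def trainAls(ratings: RDD[Rating], rank: Int, lambda: Double, iterations: Int) =
  new ALS()
    .setRank(rank)
    .setLambda(lambda)
    .setIterations(iterations)
    // Keep intermediate and final RDDs in memory only, avoiding disk IO.
    .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_ONLY)
    .setFinalRDDStorageLevel(StorageLevel.MEMORY_ONLY)
    // A fixed random seed makes the resulting RMSE reproducible.
    .setSeed(42L)
    .run(ratings)
```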
chi-square
- Rename the `number_count` parameter to `point_count`, because that's what it actually is.
- Add the `component_count` parameter (which determines the number of components in a point) to expose the other factor influencing the workload size.
dec-tree
gauss-mix
- Rename the `number_count` parameter to `point_count`, because that's what it actually is.
- Add the `component_count` parameter (which determines the number of components in a point) to expose the other factor influencing the workload size.
- Add the `distribution_count` parameter, which determines the desired number of clusters.
- Pass the desired number of clusters to the `GaussianMixture` estimator (see the sketch below).
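A sketch of that wiring with the RDD-based API; the fixed seed is an assumption carried over from the common changes:

```scala
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// points: RDD[Vector], each with component_count entries.
def fitGaussianMixture(points: RDD[Vector], distributionCount: Int) =
  new GaussianMixture()
    .setK(distributionCount) // desired number of clusters
    .setSeed(42L)            // fixed seed for reproducible results (assumption)
    .run(points)
```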
log-regression
- Add the `max_iterations` parameter, which controls the maximum number of iterations of the logistic regression algorithm (see the sketch below).
- This is similar to the `max_iterations` parameter in the `gauss-mix` benchmark.
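A sketch of how such a cap maps onto the RDD-based API; the use of LBFGS is an assumption for illustration:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// training: RDD[LabeledPoint] prepared by the benchmark.
def trainLogReg(training: RDD[LabeledPoint], maxIterations: Int) = {
  val lr = new LogisticRegressionWithLBFGS()
  // The optimizer setting is an upper bound: LBFGS may converge earlier,
  // which is why the parameter is a *maximum* number of iterations.
  lr.optimizer.setNumIterations(maxIterations)
  lr.run(training)
}
```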
movie-lens
naive-bayes
- Add the `copy_count` parameter to allow controlling the workload size, replacing the hard-coded constant.
- The default value of `8000` results in writing an 800MB input file on benchmark setup.
- Add a `test` configuration with `copy_count` set to 5.
- Add a `jmh` configuration (so far without parameter tweaks) so that it at least exists.

page-rank
- Add the `max_iterations` parameter, which controls the number of iterations of the page rank algorithm.
- The name matches the `max_iterations` parameter in the `gauss-mix` and `log-regression` benchmarks, but here it does control the actual number of algorithm iterations.

Things that came up (Spark-related)
- The checkpoint directory is only set in `movie-lens`, but why there and not in the others? Maybe remove it from there?
- Without it, the `movie-lens` benchmark (which is not much different from `als`) kept failing the JMH runs in CI. Super strange.
- Logging used to be set to the `ERROR` level, with the Eclipse jetty server root logger level set to `OFF` rather than `ERROR`.
- We should keep logging at `WARN` for a while and determine (and document) what kinds of warnings we are fine with.
- The benchmarks use the RDD-based `MLlib` API, which has been in maintenance mode since Spark 2.0, with the DataFrame-based `ML` API being considered the recommended one.