
TPC-H benchmarks #33

Merged · 19 commits from first-tpc-h into main · Sep 20, 2021
Conversation

@jonkeane (Contributor) commented Sep 9, 2021

  • Add answer correctness validation
  • Add an option to use IPC files as base
  • Figure out why codecov is unhappy

Possible scope creep (probably not worth doing right now if we're going to split source management out into a separate library):

  • refactor known_sources to use the generator pattern I used with TPC-H, where the download step lives in a function and each known_source has that function available

@jonkeane changed the title from "TPC-H benchmarks" to "[WIP] TPC-H benchmarks" on Sep 14, 2021
.github/workflows/R-CMD-check.yaml (Outdated)
.github/workflows/test-coverage.yaml
R/bm-tpc-h.R (Outdated)
# all queries take an input_func which is a function that will return a dplyr tbl
# referencing the table needed.
#' @export
tpc_h_queries <- list(
Contributor:

You could also do this as:

tpc_h_queries <- list()
tpc_h_queries[[1]] <- function(input_func) ...
tpc_h_queries[[6]] <- function(input_func) ...

Could be more intuitive than stringified number names
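
For instance, a minimal sketch of that pattern (the query body here is a simplified Q1-style aggregation for illustration, not the PR's actual implementation):

```r
library(dplyr)

tpc_h_queries <- list()

# Query 1: input_func("lineitem") is assumed to return a dplyr tbl for that table
tpc_h_queries[[1]] <- function(input_func) {
  input_func("lineitem") %>%
    filter(l_shipdate <= as.Date("1998-09-02")) %>%
    group_by(l_returnflag, l_linestatus) %>%
    summarise(sum_qty = sum(l_quantity)) %>%
    collect()
}
```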

Contributor Author:

Ah, yes, of course. This is much better.

R/bm-tpc-h.R (Outdated)
if (format == "parquet") {
  input_functions[["arrow"]] <- function(name) {
    file <- tpch_files[[name]]
    return(arrow::read_parquet(file, as_data_frame = FALSE))
Contributor:

Why not open_dataset()? I wonder if that has different/better properties, even if it is a single file. (Hey we should benchmark that!) It at least shouldn't be memory constrained.

Contributor Author:

I'll add that as an option and see if there are better properties. You're right, we probably should always use open_dataset() here even if these are typically one-file datasets.
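
For reference, a sketch of what that option might look like (reusing the input_functions and tpch_files names from the snippet above; the PR's actual wiring may differ):

```r
input_functions[["arrow"]] <- function(name) {
  # open_dataset() accepts a single file as well as a directory of files;
  # the scan is lazy, so it shouldn't be memory constrained
  arrow::open_dataset(tpch_files[[name]], format = "parquet")
}
```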

} else if (format == "native") {
  # native is different: read the feather file in first, and pass the table
  tab <- list()
  for (name in names(tpch_files)) {
Contributor:

So we have to read in all tpch_files now? Do these queries share the same dataset, or do we get 22 datasets for 22 queries?

Since each process is only running a single query (right?), it seems like you should be able to read in only the tables that the current query requires.

Contributor Author:

That is what this is doing, yeah. Each query uses the same dataset (well, a dataset of multiple tables). This query only needs one table (lineitem), so we really only need to read that in.

I can make a query-to-table map so that we only read in the tables that are necessary for each query.

Ultimately, I don't think this will have all that much of an impact; lineitem is by far the largest table, so reading the others in shouldn't hurt much even if they aren't used.

Additionally, right now each of the files is memory-mapped here, but not actually paged into memory until the first run (which is why, for arrow, the first run in this condition is so much longer than the others). I'm planning to turn memory mapping off so that we really are starting with a truly in-memory table (though we could parameterize that if we wanted to!)
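
A hypothetical sketch of that query-to-table map (the Q1 and Q6 entries reflect TPC-H, where both queries scan only lineitem; the mmap argument is hedged, since its availability depends on the arrow version):

```r
# Hypothetical map from query number to the TPC-H tables it reads
query_tables <- list()
query_tables[[1]] <- "lineitem"
query_tables[[6]] <- "lineitem"

# Read in only the tables the current query requires; mmap = FALSE forces
# a real read into memory rather than a lazy memory map
read_query_tables <- function(query_num, tpch_files) {
  tab <- list()
  for (name in query_tables[[query_num]]) {
    tab[[name]] <- arrow::read_feather(
      tpch_files[[name]],
      as_data_frame = FALSE,
      mmap = FALSE
    )
  }
  tab
}
```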

@jonkeane (Contributor Author) commented:

Well, it turns out open_dataset("single_file.parquet") is significantly faster than reading the file in (even as an Arrow table) and operating on it. I'm a little bit surprised (especially for the feather files!), but I wonder if the dataset scans of the files are more optimized than a query against a table that is backed by a file?

I'm also pretty surprised that open_dataset(feather_file) is faster than the query engine against the table already resident in memory (the "native" format).

Any thoughts about an explanation for these oddities @bkietz?

With datasets, our performance against parquet files is in line with DuckDB's, though our native query processing is considerably slower.
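
A minimal way to reproduce that comparison (assuming a single scale-1 lineitem.parquet file on disk with a date-typed l_shipdate column; bench::mark is just one convenient timing tool):

```r
library(dplyr)

file <- "lineitem.parquet"  # assumption: a single-file TPC-H table

# the same dplyr pipeline run against a materialized Table vs. a lazy Dataset
q <- function(src) {
  src %>%
    filter(l_shipdate <= as.Date("1998-09-02")) %>%
    select(l_quantity, l_extendedprice) %>%
    collect()
}

bench::mark(
  table   = q(arrow::read_parquet(file, as_data_frame = FALSE)),
  dataset = q(arrow::open_dataset(file)),
  check = FALSE  # row order can differ between the two scans
)
```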

Turning off memory mapping has the impacts I expected:

  • For the native workflow, the first iteration is no longer considerably longer (since the file is already resident in memory before timing starts)
  • For feather and parquet the results are mixed; there, reading the file in happens during the benchmark timing.

TPC-H.html.zip

We have a few questions that we need to answer:

  • should we keep read_parquet|feather() alongside open_dataset(), or just use the dataset processing? I can leave both in and only use one as a default (though if we were to do that, I would change the way they are specified).

  • what our defaults should be (and what should run on every commit in conbench, which should be the same thing, though they could technically differ). I propose:

    • scale: 1, 10, (possibly 100 if we can get the ursa machines to generate that much data)
    • engine: arrow
    • format: native, parquet, feather (with the two files being driven by datasets)
    • mem_map: false (only applies to native; this will make the first iteration more in line with the others)
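
As a sketch, that proposal could be written down as a parameter grid (the names here are illustrative, not a confirmed arrowbench API):

```r
# Hypothetical default parameter matrix following the proposal above
tpch_defaults <- expand.grid(
  scale   = c(1, 10),  # 100 pending data generation on the ursa machines
  engine  = "arrow",
  format  = c("native", "parquet", "feather"),  # files driven by datasets
  mem_map = FALSE,     # only meaningful for format == "native"
  stringsAsFactors = FALSE
)
```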

@jonkeane (Contributor Author) commented:

@nealrichardson I'm planning to merge this today, unless there's anything else you would like to change or you'd like more time to look at it.

@jonkeane jonkeane merged commit 8153f6f into main Sep 20, 2021
@jonkeane jonkeane deleted the first-tpc-h branch September 20, 2021 20:59