
TPC-H benchmarks #33

Merged · 19 commits from first-tpc-h into main · Sep 20, 2021
Conversation

@jonkeane (Contributor) commented Sep 9, 2021

  • Add answer correctness validation
  • Add an option to use IPC files as base
  • Figure out why codecov is unhappy

Possible scope creep (probably not worth doing right now if we're going to split source management out into a separate library):

  • refactor known_sources to use the generator pattern I used with TPC-H, where the download step lives in a function and each known_source has that function available

@jonkeane changed the title from "TPC-H benchmarks" to "[WIP] TPC-H benchmarks" on Sep 14, 2021
.github/workflows/R-CMD-check.yaml (Outdated)
.github/workflows/test-coverage.yaml
R/bm-tpc-h.R (Outdated)
# all queries take an input_func which is a function that will return a dplyr tbl
# referencing the table needed.
#' @export
tpc_h_queries <- list(
Contributor:

You could also do this as:

tpc_h_queries <- list()
tpc_h_queries[[1]] <- function(input_func) ...
tpc_h_queries[[6]] <- function(input_func) ...

Could be more intuitive than stringified number names
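
For instance, a minimal sketch of that pattern (the query body here is a simplified Q1-style aggregation for illustration, not the PR's actual implementation):

```r
library(dplyr)

tpc_h_queries <- list()

# Query 1: input_func("lineitem") is assumed to return a dplyr tbl for that table
tpc_h_queries[[1]] <- function(input_func) {
  input_func("lineitem") %>%
    filter(l_shipdate <= as.Date("1998-09-02")) %>%
    group_by(l_returnflag, l_linestatus) %>%
    summarise(sum_qty = sum(l_quantity)) %>%
    collect()
}
```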

Contributor Author:

Ah, yes, of course. This is much better.

R/bm-tpc-h.R (Outdated)
if (format == "parquet") {
  input_functions[["arrow"]] <- function(name) {
    file <- tpch_files[[name]]
    return(arrow::read_parquet(file, as_data_frame = FALSE))
Contributor:

Why not open_dataset()? I wonder if that has different/better properties, even if it is a single file. (Hey we should benchmark that!) It at least shouldn't be memory constrained.

Contributor Author:

I'll add that as an option and see if there are better properties. You're right, we probably should always use open_dataset() here even if these are typically one-file datasets.
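
For reference, a sketch of what that option might look like (reusing the input_functions and tpch_files names from the snippet above; the PR's actual wiring may differ):

```r
input_functions[["arrow"]] <- function(name) {
  # open_dataset() accepts a single file as well as a directory of files;
  # the scan is lazy, so it shouldn't be memory constrained
  arrow::open_dataset(tpch_files[[name]], format = "parquet")
}
```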

} else if (format == "native") {
  # native is different: read the feather file in first, and pass the table
  tab <- list()
  for (name in names(tpch_files)) {
Contributor:

So we have to read in all tpch_files now? Do these queries share the same dataset, or do we get 22 datasets for 22 queries?

Since each process is only running a single query (right?), it seems like you should be able to read in only the tables that the current query requires.

Contributor Author:

That is what this is doing, yeah. Each query uses the same dataset (well, a dataset of multiple tables). This query only needs one table (lineitem), so we really only need to read that in.

I can make a query-to-table map so that we only read in the tables that are necessary for each query.

Ultimately, I don't think this will have all that much of an impact; lineitem is by far the largest table, so reading the others in shouldn't hurt much even if they aren't used.

Additionally, right now each of the files is memory-mapped here, but not actually paged into memory until the first run (which is why, for arrow, the first run in this condition is so much longer than the others). I'm planning to turn memory mapping off so that we really are starting with a truly in-memory table (though we could parameterize that if we wanted to!)
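
A hypothetical sketch of that query-to-table map (the Q1 and Q6 entries reflect TPC-H, where both queries scan only lineitem; the mmap argument is hedged, since its availability depends on the arrow version):

```r
# Hypothetical map from query number to the TPC-H tables it reads
query_tables <- list()
query_tables[[1]] <- "lineitem"
query_tables[[6]] <- "lineitem"

# Read in only the tables the current query requires; mmap = FALSE forces
# a real read into memory rather than a lazy memory map
read_query_tables <- function(query_num, tpch_files) {
  tab <- list()
  for (name in query_tables[[query_num]]) {
    tab[[name]] <- arrow::read_feather(
      tpch_files[[name]],
      as_data_frame = FALSE,
      mmap = FALSE
    )
  }
  tab
}
```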

@jonkeane (Contributor Author) commented:

Well, it turns out open_dataset("single_file.parquet") is significantly faster than reading the file in (even as an Arrow table) and operating on it. I'm a little bit surprised (especially for the feather files!), but I wonder if the dataset scans of the files are more optimized than a query against a table that is backed by a file?

I'm also pretty surprised that open_dataset(feather_file) is faster than the query engine against the table already resident in memory (the "native" format).

Any thoughts about an explanation for these oddities @bkietz?

With datasets, our performance against parquet files is in line with DuckDB's, though our native query processing is considerably slower.
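
A minimal way to reproduce that comparison (assuming a single scale-1 lineitem.parquet file on disk with a date-typed l_shipdate column; bench::mark is just one convenient timing tool):

```r
library(dplyr)

file <- "lineitem.parquet"  # assumption: a single-file TPC-H table

# the same dplyr pipeline run against a materialized Table vs. a lazy Dataset
q <- function(src) {
  src %>%
    filter(l_shipdate <= as.Date("1998-09-02")) %>%
    select(l_quantity, l_extendedprice) %>%
    collect()
}

bench::mark(
  table   = q(arrow::read_parquet(file, as_data_frame = FALSE)),
  dataset = q(arrow::open_dataset(file)),
  check = FALSE  # row order can differ between the two scans
)
```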

Turning off memory mapping has the impacts I expected:

  • For the native workflow, the first iteration is no longer considerably longer (since the file is already resident in memory before timing starts)
  • For feather and parquet the results are mixed; there, reading the file in happens during the benchmark timing.

TPC-H.html.zip

We have a few questions that we need to answer:

  • should we keep read_parquet|feather() alongside open_dataset(), or just use the dataset processing? I can leave both in and only use one as a default (though if we were to do that, I would change the way they are specified).

  • what our defaults should be (and what should run on every commit in conbench, which should be the same thing, though they could technically differ). I propose:

    • scale: 1, 10, (possibly 100 if we can get the ursa machines to generate that much data)
    • engine: arrow
    • format: native, parquet, feather (with the two files being driven by datasets)
    • mem_map: false (only applies to native; this will make the first iteration more in line with the others)
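
As a sketch, that proposal could be written down as a parameter grid (the names here are illustrative, not a confirmed arrowbench API):

```r
# Hypothetical default parameter matrix following the proposal above
tpch_defaults <- expand.grid(
  scale   = c(1, 10),  # 100 pending data generation on the ursa machines
  engine  = "arrow",
  format  = c("native", "parquet", "feather"),  # files driven by datasets
  mem_map = FALSE,     # only meaningful for format == "native"
  stringsAsFactors = FALSE
)
```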

@jonkeane (Contributor Author) commented:

@nealrichardson I'm planning to merge this today, unless there's anything else you would like to change or you'd like more time to look at it.

@jonkeane jonkeane merged commit 8153f6f into main Sep 20, 2021
@jonkeane jonkeane deleted the first-tpc-h branch September 20, 2021 20:59