
Run TPC-H on Databricks #315

Open
jacobtomlinson opened this issue Jan 4, 2024 · 8 comments
jacobtomlinson commented Jan 4, 2024

Reproduce, on Databricks, the TPC-H work that @rjzamora has been doing with Dask+cuDF, as well as the TPC-H work Coiled has been doing with Dask.

https://tpch.coiled.io/


rjzamora commented Jan 4, 2024

cc @charlesbluca

skirui-source self-assigned this Feb 13, 2024
skirui-source commented Feb 13, 2024

Initially I was running into this bug, which I resolved by downgrading pytest from v8.0.1 to v7.4.4.

jacobtomlinson changed the title from "Run TCP-H on Databricks" to "Run TPC-H on Databricks" on Feb 14, 2024


skirui-source commented Apr 25, 2024

@jacobtomlinson just to clarify, is the goal here to:

  1. run TPC-H benchmarks with dask-expr + cudf in Databricks?
  2. run TPC-H benchmarks with dask + coiled in Databricks?

You also mentioned something about running benchmarks for cudf.pandas; is that part of this?

@skirui-source

@rjzamora with your recent work merged into rapidsai/cudf, what are the steps to recreate the test environment you used to produce the benchmark results?

@jacobtomlinson

The primary goal is dask-expr + cudf on Databricks. Secondary goal is cudf.pandas on Databricks.


rjzamora commented Apr 25, 2024

Thanks for working on this @skirui-source!

what are the steps to recreate the test environment you used to produce the benchmark results?

I haven't updated my tpch-rapids branch of coiled/benchmarks to align with main, but the latest version of cudf-24.06 seems to work fine with that branch. You can also use cudf-24.04, but you need to set the DASK_DATAFRAME__QUERY_PLANNING=True environment variable before running in order to use dask-expr under the hood (setting it to False will disable dask-expr in both 24.04 and 24.06; it is enabled by default in 24.06).

To run locally, I started with a 24.04 or 24.06 RAPIDS environment, then did the following:

  • Cloned https://github.com/rjzamora/coiled-benchmarks/tree/tpch-rapids
  • Installed further conda dependencies: conda install -c conda-forge coiled s3fs altair alembic
  • Installed further pip dependencies: pip install pytest-timeout pytest-xdist
  • Changed the LOCAL_SF*_PATH variables in benchmarks/tests/tpch/utils.py to point to your local data
    • For sf100, feel free to use my data on the ase machines: /datasets/rzamora/tpch-data/tables_scale_100

Then, from within benchmarks, you can execute a query:

export DASK_DATAFRAME__QUERY_PLANNING=True
pytest --benchmark tests/tpch/test_dask.py::test_query_1 --local --rapids --scale 100

The timing results will be appended to a benchmark.db file. I've been using the following Python code to convert that file to a pandas DataFrame (though I'm sure there is a better way):

import pandas as pd
import sqlite3

# Load the timing results that the benchmark runs appended to benchmark.db
con = sqlite3.connect("./benchmark.db")
df = pd.read_sql_query("SELECT * FROM test_run", con)
print(df[["name", "duration"]])
con.close()
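As an aside, if pandas isn't handy, the stdlib sqlite3 module alone is enough to pull out the same columns. A minimal sketch (the in-memory database and sample rows below are stand-ins for benchmark.db; I'm assuming the test_run schema matches the columns queried above):

```python
import sqlite3

# Build a throwaway database shaped like benchmark.db's test_run table
# (schema here is a guess based on the columns queried above).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_run (name TEXT, duration REAL)")
con.executemany(
    "INSERT INTO test_run VALUES (?, ?)",
    [("test_query_1", 9.35), ("test_query_2", None)],  # NaN durations round-trip as NULL/None
)

# Iterate rows directly instead of going through a DataFrame
for name, duration in con.execute("SELECT name, duration FROM test_run"):
    print(name, duration)
con.close()
```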

Important notes on the data:

I believe I needed to jump through a few annoying hoops to both generate the data and modify the code to handle differences between my data and the s3 data used by Coiled. For example, my data currently uses a single parquet file for each table, while Coiled uses a directory of files (which definitely makes more sense beyond sf100). This is why my benchmarks branch has a _read_parquet "hack" in benchmarks/tests/tpch/dask_queries.py, where I add a ".parquet" suffix to the table path.
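For what it's worth, that kind of shim might look roughly like the following; the function name and exact behavior are my guess at the idea described, not the actual code in the branch:

```python
def to_table_path(base_path: str, single_file: bool = True) -> str:
    """Hypothetical helper: adapt a TPC-H table path to the data layout.

    Coiled's S3 data stores each table as a directory of parquet files,
    while the local data described above is one parquet file per table,
    so the single-file layout needs a ".parquet" suffix appended.
    """
    if single_file and not base_path.endswith(".parquet"):
        return base_path.rstrip("/") + ".parquet"
    return base_path

print(to_table_path("/datasets/tpch-data/lineitem"))          # single-file layout
print(to_table_path("/datasets/tpch-data/lineitem/", False))  # directory layout
```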

The primary goal is dask-expr + cudf on Databricks. Secondary goal is cudf.pandas on Databricks.

I would definitely focus on dask-expr + cudf for now. I believe pandas->arrow->pandas conversion is still a problem in cudf.pandas, and dask/dask-expr will do this a lot.


skirui-source commented Apr 25, 2024

Initial batch of results (tested on dgx14):

(rapids-24.06) skirui@dgx14:/datasets/skirui/GitRepos/coiled-benchmarks$ py.test --benchmark tests/tpch/test_dask.py --local --rapids --scale 10

In [3]: print(df[["name", "duration"]])
             name  duration
1    test_query_1  9.345021
2    test_query_2       NaN
3    test_query_3  6.615860
4    test_query_4  4.828701
5    test_query_5  7.316751
6    test_query_6  5.451433
7    test_query_7  7.176501
8    test_query_8  7.961834
9    test_query_9  6.981508
10  test_query_10  6.611681
11  test_query_11  3.811042
12  test_query_12  5.117650
13  test_query_13       NaN
14  test_query_14  5.732556
15  test_query_15  5.727635
16  test_query_16  4.815220
17  test_query_17  6.078448
18  test_query_18  7.685056
19  test_query_19  5.934201
20  test_query_20  5.236853
21  test_query_21  7.247067
22  test_query_22  4.751128

(rapids-24.06) skirui@dgx14:/datasets/skirui/GitRepos/coiled-benchmarks$ py.test --benchmark tests/tpch/test_dask.py --local --rapids --scale 100

In [3]: print(df[["name", "duration"]])
             name   duration
23   test_query_1  14.323813
24   test_query_2   7.302116
25   test_query_3        NaN
26   test_query_4  18.434659
27   test_query_5  19.057925
28   test_query_6   5.927966
29   test_query_7  16.077179
30   test_query_8  16.161171
31   test_query_9  27.749285
32  test_query_10  15.439956
33  test_query_11   5.020720
34  test_query_12   9.845365
35  test_query_13        NaN
36  test_query_14   6.785045
37  test_query_15  11.023882
38  test_query_16        NaN
39  test_query_17  21.482339
40  test_query_18  28.699488
41  test_query_19   7.627631
42  test_query_20        NaN
43  test_query_21  24.507490
44  test_query_22   5.938541

(rapids-24.06) skirui@dgx14:/datasets/skirui/GitRepos/coiled-benchmarks$ py.test --benchmark tests/tpch/test_dask.py --local --rapids --scale 1

             name     duration
45   test_query_1     7.262008
46   test_query_2     4.105303
47   test_query_3     4.405016
48   test_query_4          NaN
49   test_query_5     4.382684
50   test_query_6     3.901139
51   test_query_7          NaN
52   test_query_8     4.996450
53   test_query_9     4.708150
54  test_query_10          NaN
55  test_query_11     3.718096
56  test_query_12     3.900132
57  test_query_13          NaN
58  test_query_14     4.186984
59  test_query_15     4.032958
60  test_query_16          NaN
61  test_query_17     4.383530
62  test_query_18     4.013830
63  test_query_19  3597.592139
64  test_query_20     5.858231
65  test_query_21     5.473735
66  test_query_22     5.003113
