
Run TPC-H on Databricks #315

Open
jacobtomlinson opened this issue Jan 4, 2024 · 8 comments
jacobtomlinson commented Jan 4, 2024

Reproduce, on Databricks, the TPC-H work that @rjzamora has been doing with Dask+cuDF, as well as the TPC-H work Coiled has been doing with Dask.

https://tpch.coiled.io/


rjzamora commented Jan 4, 2024

cc @charlesbluca

skirui-source self-assigned this Feb 13, 2024
skirui-source commented Feb 13, 2024

Initially I was running into this bug, which I resolved by downgrading pytest from v8.0.1 to v7.4.4.

jacobtomlinson changed the title from "Run TCP-H on Databricks" to "Run TPC-H on Databricks" on Feb 14, 2024


skirui-source commented Apr 25, 2024

@jacobtomlinson just to clarify, is the goal here to:

  1. run TPC-H benchmarks with dask-expr + cudf in Databricks?
  2. run TPC-H benchmarks with dask + coiled in Databricks?

You also mentioned something about running benchmarks for cudf.pandas; is that part of this?

@skirui-source

@rjzamora with your recent work merged into rapidsai/cudf, what are the steps to recreate the test environment you used to produce the benchmark results?

@jacobtomlinson

The primary goal is dask-expr + cudf on Databricks. Secondary goal is cudf.pandas on Databricks.


rjzamora commented Apr 25, 2024

Thanks for working on this @skirui-source!

what are the steps to recreate the test environment you used to produce the benchmark results?

I haven't updated my tpch-rapids branch of coiled/benchmarks to align with main, but the latest version of cudf-24.06 seems to work fine with that branch. You can also use cudf-24.04, but you need to set the DASK_DATAFRAME__QUERY_PLANNING=True environment variable before running in order to use dask-expr under the hood (setting it to False will disable dask-expr in both 24.04 and 24.06; it is enabled by default in 24.06).

To run locally, I started with a 24.04 or 24.06 RAPIDS environment, then did the following:

  • Cloned https://github.com/rjzamora/coiled-benchmarks/tree/tpch-rapids
  • Installed further conda dependencies: conda install -c conda-forge coiled s3fs altair alembic
  • Installed further pip dependencies: pip install pytest-timeout pytest-xdist
  • Changed the LOCAL_SF*_PATH variables in benchmarks/tests/tpch/utils.py to point to your local data
    • For sf100, feel free to use my data on the ase machines: /datasets/rzamora/tpch-data/tables_scale_100

Then, from within benchmarks, you can execute a query:

export DASK_DATAFRAME__QUERY_PLANNING=True
pytest --benchmark tests/tpch/test_dask.py::test_query_1 --local --rapids --scale 100

The timing results will be appended to a benchmark.db file. I've been using the following Python code to convert that file to a pandas DataFrame (though I'm sure there is a better way):

import pandas as pd
import sqlite3

# Load the timing results that the benchmark runs appended to benchmark.db
con = sqlite3.connect("./benchmark.db")
df = pd.read_sql_query("SELECT * FROM test_run", con)
print(df[["name", "duration"]])
con.close()
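As an aside, if pandas isn't handy, the stdlib sqlite3 module alone is enough to pull out the same columns. A minimal sketch (the in-memory database and sample rows below are stand-ins for benchmark.db; I'm assuming the test_run schema matches the columns queried above):

```python
import sqlite3

# Build a throwaway database shaped like benchmark.db's test_run table
# (schema here is a guess based on the columns queried above).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_run (name TEXT, duration REAL)")
con.executemany(
    "INSERT INTO test_run VALUES (?, ?)",
    [("test_query_1", 9.35), ("test_query_2", None)],  # NaN durations round-trip as NULL/None
)

# Iterate rows directly instead of going through a DataFrame
for name, duration in con.execute("SELECT name, duration FROM test_run"):
    print(name, duration)
con.close()
```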

Important notes on the data:

I believe I needed to jump through a few annoying hoops to both generate the data and modify the code to handle differences between my data and the s3 data used by Coiled. For example, my data currently uses a single parquet file for each table, while Coiled uses a directory of files (which definitely makes more sense beyond sf100). This is why my benchmarks branch has a _read_parquet "hack" in benchmarks/tests/tpch/dask_queries.py, where I add a ".parquet" suffix to the table path.
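For what it's worth, that kind of shim might look roughly like the following; the function name and exact behavior are my guess at the idea described, not the actual code in the branch:

```python
def to_table_path(base_path: str, single_file: bool = True) -> str:
    """Hypothetical helper: adapt a TPC-H table path to the data layout.

    Coiled's S3 data stores each table as a directory of parquet files,
    while the local data described above is one parquet file per table,
    so the single-file layout needs a ".parquet" suffix appended.
    """
    if single_file and not base_path.endswith(".parquet"):
        return base_path.rstrip("/") + ".parquet"
    return base_path

print(to_table_path("/datasets/tpch-data/lineitem"))          # single-file layout
print(to_table_path("/datasets/tpch-data/lineitem/", False))  # directory layout
```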

The primary goal is dask-expr + cudf on Databricks. Secondary goal is cudf.pandas on Databricks.

I would definitely focus on dask-expr + cudf for now. I believe pandas->arrow->pandas conversion is still a problem in cudf.pandas, and dask/dask-expr will do this a lot.


skirui-source commented Apr 25, 2024

Initial batch of results (tested on dgx14):

(rapids-24.06) skirui@dgx14:/datasets/skirui/GitRepos/coiled-benchmarks$ py.test --benchmark tests/tpch/test_dask.py --local --rapids --scale 10

In [3]: print(df[["name", "duration"]])
             name  duration
1    test_query_1  9.345021
2    test_query_2       NaN
3    test_query_3  6.615860
4    test_query_4  4.828701
5    test_query_5  7.316751
6    test_query_6  5.451433
7    test_query_7  7.176501
8    test_query_8  7.961834
9    test_query_9  6.981508
10  test_query_10  6.611681
11  test_query_11  3.811042
12  test_query_12  5.117650
13  test_query_13       NaN
14  test_query_14  5.732556
15  test_query_15  5.727635
16  test_query_16  4.815220
17  test_query_17  6.078448
18  test_query_18  7.685056
19  test_query_19  5.934201
20  test_query_20  5.236853
21  test_query_21  7.247067
22  test_query_22  4.751128

(rapids-24.06) skirui@dgx14:/datasets/skirui/GitRepos/coiled-benchmarks$ py.test --benchmark tests/tpch/test_dask.py --local --rapids --scale 100

In [3]: print(df[["name", "duration"]])
             name   duration
23   test_query_1  14.323813
24   test_query_2   7.302116
25   test_query_3        NaN
26   test_query_4  18.434659
27   test_query_5  19.057925
28   test_query_6   5.927966
29   test_query_7  16.077179
30   test_query_8  16.161171
31   test_query_9  27.749285
32  test_query_10  15.439956
33  test_query_11   5.020720
34  test_query_12   9.845365
35  test_query_13        NaN
36  test_query_14   6.785045
37  test_query_15  11.023882
38  test_query_16        NaN
39  test_query_17  21.482339
40  test_query_18  28.699488
41  test_query_19   7.627631
42  test_query_20        NaN
43  test_query_21  24.507490
44  test_query_22   5.938541

(rapids-24.06) skirui@dgx14:/datasets/skirui/GitRepos/coiled-benchmarks$ py.test --benchmark tests/tpch/test_dask.py --local --rapids --scale 1

             name     duration
45   test_query_1     7.262008
46   test_query_2     4.105303
47   test_query_3     4.405016
48   test_query_4          NaN
49   test_query_5     4.382684
50   test_query_6     3.901139
51   test_query_7          NaN
52   test_query_8     4.996450
53   test_query_9     4.708150
54  test_query_10          NaN
55  test_query_11     3.718096
56  test_query_12     3.900132
57  test_query_13          NaN
58  test_query_14     4.186984
59  test_query_15     4.032958
60  test_query_16          NaN
61  test_query_17     4.383530
62  test_query_18     4.013830
63  test_query_19  3597.592139
64  test_query_20     5.858231
65  test_query_21     5.473735
66  test_query_22     5.003113
