
[data] Add databricks table / SQL reader #39852

Merged: 18 commits merged into ray-project:master on Oct 6, 2023

Conversation

WeichenXu123 (Contributor) commented on Sep 26, 2023

Why are these changes needed?

Databricks users have requested a ray.data reader that can easily load data from Databricks UC tables.
This reader is implemented based on the Databricks statement execution API; the overall process is:

  1. read_databricks_tables sends a request to /api/2.0/sql/statements on the Databricks shard; the request payload contains the authentication token and the query information (a minimal sketch of this step appears after this list).
  2. It gets a response that contains the query result chunk information. The query result is split into multiple chunks, each around 10 to 20 MB in size.
  3. It generates a number of Ray read tasks; the number of read tasks is determined by the parallelism argument. Each read task fetches the data of one or more chunks. Fetching chunk data uses this API.
  4. The Ray read tasks are dispatched to remote Ray workers and executed in a distributed way.
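
A minimal sketch of steps 1 and 2 is shown below. It assumes the 'DATABRICKS_HOST' and 'DATABRICKS_TOKEN' environment variables described in the API doc; the endpoint is the public statement execution API, but the exact payload fields the reader sends are an assumption for illustration, not the merged implementation.

import os

import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. adb-<workspace-id>.<random-number>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # workspace access token

# Step 1: submit the SQL statement to the statement execution endpoint.
resp = requests.post(
    f"https://{host}/api/2.0/sql/statements",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "warehouse_id": "a885ad08b64951ad",                  # warehouse the statement runs on
        "statement": "select id from table_1 limit 750000",
        "disposition": "EXTERNAL_LINKS",                     # results exposed as downloadable chunk links
        "format": "ARROW_STREAM",                            # each chunk is an Arrow IPC stream
    },
)
resp.raise_for_status()
result = resp.json()

statement_id = result["statement_id"]
# Step 2: once the statement succeeds (possibly after polling its status), the
# manifest lists the result chunks (each roughly 10 to 20 MB) that the read
# tasks later fetch.
chunks = result.get("manifest", {}).get("chunks", [])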

API:

    """
    Read from a Databricks UC table or Databricks SQL execution result that queries
    from Databricks UC tables.
    If this API is not called inside the Databricks runtime, you need to set the
    environment variables 'DATABRICKS_HOST' and 'DATABRICKS_TOKEN' first.

    This reader is implemented based on the
    [Databricks statement execution API](https://docs.databricks.com/api/workspace/statementexecution).

    Examples:
        import ray
        ds = ray.data.read_databricks_tables(
            warehouse_id='a885ad08b64951ad',
            catalog='catalog_1',
            schema='db_1',
            query='select id from table_1 limit 750000',
        )

    Args:
        warehouse_id: The ID of the Databricks warehouse; the query statement is
            executed on this warehouse.
        table: The name of the UC table you want to read. If this argument is set,
            you can't set the 'query' argument, and the reader generates a query
            of 'select * from {table}' under the hood.
        query: The query you want to execute. If this argument is set,
            you can't set the 'table' argument.
        catalog: (Optional) The default catalog name used by the query.
        schema: (Optional) The default schema used by the query.
        parallelism: The requested parallelism of the read. Defaults to -1,
            which automatically determines the optimal parallelism for your
            configuration. You should not need to manually set this value in most cases.
            For details on how the parallelism is automatically determined and guidance
            on how to tune it, see :ref:`Tuning read parallelism
            <read_parallelism>`.
        ray_remote_args: kwargs passed to :meth:`~ray.remote` in the read tasks.

    Returns:
        A :class:`Dataset` containing the queried data.
    """

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 1764 to 1766
catalog: str,
schema: str,
table_name: Optional[str] = None,
Contributor:

Can we create a single argument called table? If users have already set the current catalog / schema in their notebook environment via USE CATALOG / USE SCHEMA, can we infer the catalog and schema? If it's too tricky, not a big deal for now

Contributor Author:

We can, but I prefer to make the catalog and schema arguments optional. In the Databricks shard, catalog and schema already have default values, but with the optional arguments the user can easily override them.

Comment on lines 69 to 77
while state in ["PENDING", "RUNNING"]:
    time.sleep(1)
    response = requests.get(
        urljoin(url_base, statement_id) + "/",
        auth=req_auth,
        headers=req_headers,
    )
    response.raise_for_status()
    state = response.json()["status"]["state"]

no timeout?


I see, we don't know how long the query takes.

Contributor Author:

We allow the user to press Ctrl + C to cancel it; this should cover most cases.

Contributor Author:

How long a query takes is hard to estimate; it depends on SQL workloads, query queues, warehouse resources, etc.
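
For reference, a bounded variant of the polling loop quoted above could look like the sketch below; the timeout parameter and the TimeoutError it raises are hypothetical additions for illustration, not part of this PR.

import time
from urllib.parse import urljoin

import requests

def wait_for_statement(url_base, statement_id, req_auth, req_headers, timeout=None):
    # Poll until the statement leaves PENDING/RUNNING, with an optional deadline.
    # The merged reader polls without a timeout and relies on Ctrl + C to cancel;
    # `timeout` here is a hypothetical extension.
    deadline = None if timeout is None else time.monotonic() + timeout
    state = "PENDING"
    while state in ["PENDING", "RUNNING"]:
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError(f"Statement {statement_id} is still {state} after {timeout}s")
        time.sleep(1)
        response = requests.get(
            urljoin(url_base, statement_id) + "/",
            auth=req_auth,
            headers=req_headers,
        )
        response.raise_for_status()
        state = response.json()["status"]["state"]
    return state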

python/ray/data/read_api.py (outdated review thread, resolved)
raw_response = requests.get(external_url, auth=None, headers=None)
raw_response.raise_for_status()

arrow_table = pyarrow.ipc.open_stream(raw_response.content).read_all()

Is raw_response.content large?

Contributor Author:

Usually 10 to 20 MB per chunk.

WeichenXu123 (Contributor Author):

Gentle ping @c21

bveeramani (Member):

Hey @WeichenXu123, thanks for contributing! I'll take a look at this PR sometime this week

bveeramani (Member) left a comment:

Implementation overall LGTM.

Two concerns:

  • How do we handle retries, especially for rate limiting errors?
  • IIRC Databricks marks a statement as closed when the last statement has been read. Is this still an issue, and if so, how do we work around it?

python/ray/data/read_api.py (outdated review threads, resolved)
else:
    raise ValueError(
        "You are not in databricks runtime, please set environment variable "
        "'DATABRICKS_HOST' and 'DATABRICKS_TOKEN'."
Member:

Is there any relevant documentation we could link to? Specifically, how would a user find DATABRICKS_HOST and DATABRICKS_TOKEN?

Contributor Author:

It is already in this API doc.

Member:

As in the docstring? I think we just link to https://docs.databricks.com/api/workspace/statementexecution, which I don't think covers how to find the environment variables?

Also, to make the error message more actionable, I think it'd be great if we could say something like:

"Please set environment variable ... To get these values, see [relevant Databricks page]"

WeichenXu123 (Contributor Author), Oct 6, 2023:

Oh, I get it: these 2 environment variable keys are defined in the read_databricks_tables code, so there is no Databricks doc for them.

I refined the read_databricks_tables API doc to add descriptions for the 2 environment variables:

  • 'DATABRICKS_HOST' means the Databricks workspace URL, like adb-<workspace-id>.<random-number>.azuredatabricks.net.
  • 'DATABRICKS_TOKEN' means the Databricks workspace access token; users will know how to get it once they see the doc description.

python/ray/data/datasource/databricks_uc_datasource.py (outdated review threads, resolved)
WeichenXu123 (Contributor Author):

@bveeramani

Your comments are all addressed or answered.

How do we handle retries, especially for rate limiting errors?

Without considering rate limiting, I think we don't need to add retry code, because Ray tasks should automatically retry on failure, if I understand Ray correctly.

Regarding the rate limit, I need to ask our warehouse engineer and then I will get back to you.

How do we handle retries, especially for rate limiting errors?
IIRC Databricks marks a statement as closed when the last statement has been read. Is this still an issue, and if so, how do we work around it?

I replied to it in #39852 (comment)

WeichenXu123 (Contributor Author):

@bveeramani

I got a reply from our warehouse engineer about the rate limit:

For concurrent-request limiters (concurrent chunk-data fetching) of the EXTERNAL_LINKS type (the type used in my PR), it's around 750 requests per second per workspace.

So I think we will rarely hit the rate limit, because one chunk of data is at least 10 MB and each request takes some time to complete; if we hit the rate limit, we can just let the Ray task retry.

bveeramani (Member) left a comment:

Awesome! Overall LGTM!

Once we've resolved the remaining review comments, this should be good to merge.

For concurrent-request limiters (concurrent chunk-data fetching) of the EXTERNAL_LINKS type (the type used in my PR), it's around 750 requests per second per workspace.

Got it. I think this should be okay in most cases, but if you have a cluster with more than 750 cores I can see this being an issue. IIRC we ran into some rate limiting errors when we performed internal testing.

In any case, I think we can defer this for now and address this as a follow up if people run into issues.

if we hit the rate limit, we can just let the Ray task retry.

Ray Tasks don't retry application-level exceptions by default. So, if you get a rate limiting error, I don't think it'd get retried.
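
One possible mitigation, sketched below, is for callers to opt the read tasks into retrying application-level exceptions through ray_remote_args, which the docstring says is forwarded to ray.remote; the specific retry settings are illustrative, not something this PR configures.

import ray

# Sketch: let the read tasks retry application-level exceptions, e.g. HTTP
# errors raised when a chunk fetch is rate limited. max_retries and
# retry_exceptions are standard ray.remote task options; applying them here
# is an illustration of a possible follow-up, not this PR's behavior.
ds = ray.data.read_databricks_tables(
    warehouse_id="a885ad08b64951ad",
    query="select id from table_1 limit 750000",
    ray_remote_args={
        "max_retries": 5,
        "retry_exceptions": True,  # or a list of specific exception types
    },
)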

WeichenXu123 (Contributor Author):

if you have a cluster with more than 750 cores I can see this being an issue. IIRC we ran into some rate limiting errors when we performed internal testing.

Ray Tasks don't retry application-level exceptions by default. So, if you get a rate limiting error, I don't think it'd get retried.

We can address them in follow-up PRs, since this case rarely happens

bveeramani (Member):

We can address them in follow-up PRs, since this case rarely happens

Sounds good. Thanks for your work on this PR!

bveeramani merged commit 9b3d63d into ray-project:master on Oct 6, 2023
41 of 44 checks passed
bveeramani (Member):

@WeichenXu123 When you're back from vacation, could you add tests? My bad -- I should've caught this in an earlier review and before merging.

WeichenXu123 (Contributor Author):

@WeichenXu123 When you're back from vacation, could you add tests? My bad -- I should've caught this in an earlier review and before merging.

The e2e test requires a Databricks environment; we will add an e2e test in our Databricks code repo to monitor that the Ray reader works.

For a unit test in the Ray repo, we can only add a mocking test; we can add it if you need it.

Zandew pushed a commit to Zandew/ray that referenced this pull request on Oct 10, 2023
vymao pushed a commit to vymao/ray that referenced this pull request on Oct 11, 2023
bveeramani (Member):

@WeichenXu123 ah, gotcha! Didn't realize you were running E2E tests on your side.

Given that we'd likely need extensive mocks to unit test this feature, I think the additional complexity may outweigh the benefit of the tests.
