
[Data] Add Dataset.write_sql #38544

Merged

bveeramani merged 10 commits into master from bveeramani/write-sql on Aug 23, 2023

Conversation

@bveeramani (Member) commented Aug 17, 2023

Why are these changes needed?

Writing data back to databases is common for many applications like LLMs. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API to write Datasets to SQL databases.
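For a concrete sense of the API, here is a minimal sketch based on the sqlite3 examples that appear later in this thread; the table name, schema, and file path are illustrative:

import sqlite3

import ray

# The destination table must already exist; write_sql only executes the given
# INSERT statement for each row of the dataset.
connection = sqlite3.connect("/tmp/db.sqlite")
connection.execute("CREATE TABLE IF NOT EXISTS test (string TEXT, number INTEGER)")
connection.commit()
connection.close()

dataset = ray.data.from_items(
    [{"string": "spam", "number": 0}, {"string": "ham", "number": 1}]
)

# Each row's values are bound to the placeholders in column order.
dataset.write_sql(
    "INSERT INTO test VALUES(?, ?)", lambda: sqlite3.connect("/tmp/db.sqlite")
)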

Related issue number

Closes #38242

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@c21 (Contributor) left a comment


Looks solid to me as a starting point. Can you publish and trigger the wheel build so we can test it out in a real workload?

python/ray/data/tests/test_sql.py (resolved)
@bveeramani bveeramani marked this pull request as ready for review August 17, 2023 06:34
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
…/ray into bveeramani/write-sql

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
GokuMohandas added a commit to ray-project/llm-applications that referenced this pull request Aug 18, 2023
Use ray-project/ray#38544 to write out the
embeddings. This has some advantages: the user doesn't need to specify
the parallelism, and they don't need to trigger the computation with
`.count()` since a write is an action -- so this is less error-prone.
It is also less code.

We should wait to merge this PR until Ray 2.7 is released with
ray-project/ray#38544 merged.

Signed-off-by: Goku Mohandas <gokumd@gmail.com>
Co-authored-by: Goku Mohandas <gokumd@gmail.com>
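
A rough before/after sketch of the simplification described in this commit message; the embed_batch UDF, table name, and sqlite3 connection are placeholders rather than what llm-applications actually uses:

import sqlite3

import numpy as np
import ray


def embed_batch(batch: dict) -> dict:
    # Placeholder for the real embedding UDF.
    batch["embedding"] = batch["id"].astype(np.float64)
    return batch


# The destination table has to exist before the write (see the error-handling
# discussion further down in this thread).
conn = sqlite3.connect("/tmp/embeddings.db")
conn.execute("CREATE TABLE IF NOT EXISTS embeddings (id INTEGER, embedding REAL)")
conn.commit()
conn.close()

ds = ray.data.range(4).map_batches(embed_batch, batch_format="numpy")

# Previously a dummy action such as ds.count() was needed to force the lazy
# pipeline to run; write_sql is itself an action, so it triggers the
# computation and persists the results in one step.
ds.write_sql(
    "INSERT INTO embeddings VALUES(?, ?)",
    lambda: sqlite3.connect("/tmp/embeddings.db"),
)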
@pcmoritz (Contributor) commented Aug 19, 2023

So this is now working for me, thanks a lot for fixing it!

One thing we should improve before merging is the error handling. I noticed that if the table does not exist, the write just silently fails and writes nothing, which is not good and can lead to data loss. To fix it, I think we should do two things: improve the error handling and catch exceptions better, and also potentially return some info about how many rows have been written (e.g. in the form of a status / info dict that we could add more stuff to in the future). The former we should do for sure, the latter only if it doesn't involve too many changes for now :)

Here is a repro of the silent failure in sqlite3 -- let's also add a test for it:

In [1]: import sqlite3

In [2]: import ray

In [3]: connection = sqlite3.connect("/tmp/db.sqlite")

In [4]: dataset = ray.data.from_items(
   ...:     [{"string": "spam", "number": 0}, {"string": "ham", "number": 1}]
   ...: )

In [5]: dataset.write_sql(
   ...:     "INSERT INTO test VALUES(?, ?)", lambda: sqlite3.connect("/tmp/db.sqlite")
   ...: )
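
Along those lines, a rough sketch of the kind of regression test being asked for; the test name and fixture usage are illustrative, not the test that actually landed in this PR:

import sqlite3

import pytest

import ray


def test_write_sql_raises_if_table_does_not_exist(tmp_path):
    db_path = str(tmp_path / "db.sqlite")
    dataset = ray.data.from_items(
        [{"string": "spam", "number": 0}, {"string": "ham", "number": 1}]
    )

    # The table was never created, so the write should surface the underlying
    # sqlite3 error ("no such table: test") rather than silently writing
    # nothing. Matching on the message avoids depending on how Ray wraps
    # task exceptions.
    with pytest.raises(Exception, match="no such table"):
        dataset.write_sql(
            "INSERT INTO test VALUES(?, ?)", lambda: sqlite3.connect(db_path)
        )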

Add chunking

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Remove extra file

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani (Member, Author) commented

@pcmoritz fixed the issue with error messages. Here's what the error looks like now:

import sqlite3
import ray

connection = sqlite3.connect("/tmp/eggs.db")
dataset = ray.data.range(1)

dataset.write_sql(
    "INSERT INTO test VALUES(?)", lambda: sqlite3.connect("/tmp/eggs.db")
)
Traceback (most recent call last):
  ...
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/datasource/sql_datasource.py", line 55, in write
    cursor.executemany(sql, values)
sqlite3.OperationalError: no such table: test

Add docstring

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Address review comments

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Remove max_rows_per_write parameter

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani bveeramani merged commit ef0e80b into master Aug 23, 2023
51 of 53 checks passed
@bveeramani bveeramani deleted the bveeramani/write-sql branch August 23, 2023 00:24
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Writing data back to databases is common for many applications like LLMs. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API to write Datasets to SQL databases.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Writing data back to databases is common for many applications like LLMs. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API to write Datasets to SQL databases.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
bveeramani added a commit that referenced this pull request Oct 18, 2023
Dataset.write_sql was added a while ago in #38544, but it wasn't documented in the API reference. This PR adds it.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
jonathan-anyscale pushed a commit to jonathan-anyscale/ray that referenced this pull request Oct 26, 2023

Dataset.write_sql was added a while ago in ray-project#38544, but it wasn't documented in the API reference. This PR adds it.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
PatchouliTIS pushed a commit to PatchouliTIS/RAG-Bot that referenced this pull request Jul 9, 2024
Use ray-project/ray#38544 to write out the
embeddings. This has some advantages: the user doesn't need to specify
the parallelism, and they don't need to trigger the computation with
`.count()` since a write is an action -- so this is less error-prone.
It is also less code.

We should wait to merge this PR until Ray 2.7 is released with
ray-project/ray#38544 merged.

Signed-off-by: Goku Mohandas <gokumd@gmail.com>
Co-authored-by: Goku Mohandas <gokumd@gmail.com>
Development

Successfully merging this pull request may close these issues.

[Data] Support write_sql()
4 participants