
[Data] Add Dataset.write_sql #38544

Merged

bveeramani merged 10 commits into master from bveeramani/write-sql on Aug 23, 2023

Conversation

@bveeramani (Member) commented Aug 17, 2023

Why are these changes needed?

Writing data back to databases is common for many applications like LLMs. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API to write Datasets to SQL databases.
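For a concrete sense of the API, here is a minimal sketch based on the sqlite3 examples that appear later in this thread; the table name, schema, and file path are illustrative:

import sqlite3

import ray

# The destination table must already exist; write_sql only executes the given
# INSERT statement for each row of the dataset.
connection = sqlite3.connect("/tmp/db.sqlite")
connection.execute("CREATE TABLE IF NOT EXISTS test (string TEXT, number INTEGER)")
connection.commit()
connection.close()

dataset = ray.data.from_items(
    [{"string": "spam", "number": 0}, {"string": "ham", "number": 1}]
)

# Each row's values are bound to the placeholders in column order.
dataset.write_sql(
    "INSERT INTO test VALUES(?, ?)", lambda: sqlite3.connect("/tmp/db.sqlite")
)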

Related issue number

Closes #38242

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@c21 (Contributor) left a comment


Looks solid to me as a starting point. Can you publish and trigger the wheel build so we can test it out in a real workload?

python/ray/data/tests/test_sql.py (resolved)
@bveeramani bveeramani marked this pull request as ready for review August 17, 2023 06:34
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
…/ray into bveeramani/write-sql

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
GokuMohandas added a commit to ray-project/llm-applications that referenced this pull request Aug 18, 2023
Use ray-project/ray#38544 to write out the
embeddings. This has some advantages: the user doesn't need to specify
the parallelism, and they don't need to trigger the computation with
`.count()` since a write is an action -- so this is less error-prone.
It is also less code.

We should wait to merge this PR until Ray 2.7 is released with
ray-project/ray#38544 merged.

Signed-off-by: Goku Mohandas <gokumd@gmail.com>
Co-authored-by: Goku Mohandas <gokumd@gmail.com>
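
A rough before/after sketch of the simplification described in this commit message; the embed_batch UDF, table name, and sqlite3 connection are placeholders rather than what llm-applications actually uses:

import sqlite3

import numpy as np
import ray


def embed_batch(batch: dict) -> dict:
    # Placeholder for the real embedding UDF.
    batch["embedding"] = batch["id"].astype(np.float64)
    return batch


# The destination table has to exist before the write (see the error-handling
# discussion further down in this thread).
conn = sqlite3.connect("/tmp/embeddings.db")
conn.execute("CREATE TABLE IF NOT EXISTS embeddings (id INTEGER, embedding REAL)")
conn.commit()
conn.close()

ds = ray.data.range(4).map_batches(embed_batch, batch_format="numpy")

# Previously a dummy action such as ds.count() was needed to force the lazy
# pipeline to run; write_sql is itself an action, so it triggers the
# computation and persists the results in one step.
ds.write_sql(
    "INSERT INTO embeddings VALUES(?, ?)",
    lambda: sqlite3.connect("/tmp/embeddings.db"),
)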
@pcmoritz (Contributor) commented Aug 19, 2023

So this is now working for me, thanks a lot for fixing it!

One thing we should improve before merging is the error handling. I noticed that if the table does not exist, the write just silently fails and writes nothing, which is not good and can lead to data loss. To fix it, I think we should do two things: improve the error handling and catch exceptions better, and also potentially return some info about how many rows have been written (e.g. in the form of a status / info dict that we could add more stuff to in the future). The former we should do for sure, the latter only if it doesn't involve too many changes for now :)

Here is a repro of the silent failure in sqlite3 -- let's also add a test for it:

In [1]: import sqlite3

In [2]: import ray

In [3]: connection = sqlite3.connect("/tmp/db.sqlite")

In [4]: dataset = ray.data.from_items(
   ...:     [{"string": "spam", "number": 0}, {"string": "ham", "number": 1}]
   ...: )

In [5]: dataset.write_sql(
   ...:     "INSERT INTO test VALUES(?, ?)", lambda: sqlite3.connect("/tmp/db.sqlite")
   ...: )
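
Along those lines, a rough sketch of the kind of regression test being asked for; the test name and fixture usage are illustrative, not the test that actually landed in this PR:

import sqlite3

import pytest

import ray


def test_write_sql_raises_if_table_does_not_exist(tmp_path):
    db_path = str(tmp_path / "db.sqlite")
    dataset = ray.data.from_items(
        [{"string": "spam", "number": 0}, {"string": "ham", "number": 1}]
    )

    # The table was never created, so the write should surface the underlying
    # sqlite3 error ("no such table: test") rather than silently writing
    # nothing. Matching on the message avoids depending on how Ray wraps
    # task exceptions.
    with pytest.raises(Exception, match="no such table"):
        dataset.write_sql(
            "INSERT INTO test VALUES(?, ?)", lambda: sqlite3.connect(db_path)
        )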

Add chunking

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Remove extra file

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani (Member, Author) commented

@pcmoritz fixed the issue with error messages. Here's what the error looks like now:

import sqlite3
import ray

connection = sqlite3.connect("/tmp/eggs.db")
dataset = ray.data.range(1)

dataset.write_sql(
    "INSERT INTO test VALUES(?)", lambda: sqlite3.connect("/tmp/eggs.db")
)
Traceback (most recent call last):
  ...
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/datasource/sql_datasource.py", line 55, in write
    cursor.executemany(sql, values)
sqlite3.OperationalError: no such table: test

Add docstring

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Address review comments

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Remove max_rows_per_write parameter

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani bveeramani merged commit ef0e80b into master Aug 23, 2023
51 of 53 checks passed
@bveeramani bveeramani deleted the bveeramani/write-sql branch August 23, 2023 00:24
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Writing data back to databases is common for many applications like LLMs. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API to write Datasets to SQL databases.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Writing data back to databases is common for many applications like LLMs. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API to write Datasets to SQL databases.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
bveeramani added a commit that referenced this pull request Oct 18, 2023
Dataset.write_sql was added a while ago in #38544, but it wasn't documented in the API reference. This PR adds it.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
jonathan-anyscale pushed a commit to jonathan-anyscale/ray that referenced this pull request Oct 26, 2023

Dataset.write_sql was added a while ago in ray-project#38544, but it wasn't documented in the API reference. This PR adds it.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
PatchouliTIS pushed a commit to PatchouliTIS/RAG-Bot that referenced this pull request Jul 9, 2024
Use ray-project/ray#38544 to write out the
embeddings. This has some advantages: the user doesn't need to specify
the parallelism, and they don't need to trigger the computation with
`.count()` since a write is an action -- so this is less error-prone.
It is also less code.

We should wait to merge this PR until Ray 2.7 is released with
ray-project/ray#38544 merged.

Signed-off-by: Goku Mohandas <gokumd@gmail.com>
Co-authored-by: Goku Mohandas <gokumd@gmail.com>
Development

Successfully merging this pull request may close these issues.

[Data] Support write_sql()
4 participants