API for bulk inserting records into a table #1866

simonw · 2022-10-27T17:19:25Z

Similar to https://github.com/simonw/datasette-insert/blob/0.8/README.md#inserting-data-and-creating-tables

I expect this to become by far the most common way that data gets into a Datasette instance - more so than the individual row API in:

API to insert a single record into an existing table #1851

simonw · 2022-10-27T18:07:02Z

Error handling is really important here.

What should happen if you submit 100 records and one of them has some kind of validation error? How should that error be reported back to you?

I'm inclined to say that it defaults to all-or-nothing in a transaction - but there should be a "continue_on_error": true option (or similar) which causes it to insert the ones that are valid while reporting back the ones that are invalid.

simonw · 2022-10-27T18:09:52Z

Should this API accept CSV/TSV etc in addition to JSON?

I'm torn on this one. My initial instinct is that it should not - and there should instead be a Datasette client library / CLI tool you can use that knows how to turn CSV into batches of JSON calls for when you want to upload a CSV file.

I don't think the usability of curl https://datasette/db/table -F 'data=@path/to/file.csv' -H 'Authentication: Bearer xxx' is particularly great compared to something likedatasette client insert https://datasette/ db table file.csv --csv (where the command version could store API tokens for you too).

simonw · 2022-10-27T18:10:22Z

So for the moment I'm just going to concentrate on the JSON API. I can consider CSV variants later on, or as plugins, or both.

simonw · 2022-10-27T18:11:05Z

Likewise for newline-delimited JSON. While it's tempting to want to accept that as an ingest format (because it's nice to generate and stream) I think it's better to have a client application that can turn a stream of newline-delimited JSON into batched JSON inserts.

simonw · 2022-10-27T18:12:02Z

There's one catch with batched inserts: if your CLI tool fails half way through you could end up with a partially populated table - since a bunch of batches will have succeeded first.

I think that's OK. In the future I may want to come up with a way to run multiple batches of inserts inside a single transaction, but I can ignore that for the first release of this feature.

simonw · 2022-10-27T18:13:00Z

If people care about that kind of thing they could always push all of their inserts to a table called _tablename and then atomically rename that once they've uploaded all of the data (assuming I provide an atomic-rename-this-table mechanism).

simonw · 2022-10-28T01:00:42Z

I'm going to set the limit at 1,000 rows inserted at a time. I'll make this configurable using a new max_insert_rows setting (for consistency with max_returned_rows).

simonw · 2022-10-28T01:22:25Z

Nasty catch on this one: I wanted to return the IDs of the freshly inserted rows. But... the insert_all() method I was planning to use from sqlite-utils doesn't appear to have a way of doing that:

https://github.com/simonw/sqlite-utils/blob/529110e7d8c4a6b1bbf5fb61f2e29d72aa95a611/sqlite_utils/db.py#L2813-L2835

SQLite itself added a RETURNING statement which might help, but that is only available from version 3.35 released in March 2021: https://www.sqlite.org/lang_returning.html - which isn't commonly available yet. https://latest.datasette.io/-/versions right now shows 3.34, and https://lite.datasette.io/#/-/versions shows 3.27.2 (from Feb 2019).

Two options then:

Even for bulk inserts do one insert at a time so I can use cursor.lastrowid to get the ID of the inserted record. This isn't terrible since SQLite is very fast, but it may still be a big performance hit for large inserts.
Don't return the list of inserted rows for bulk inserts
Default to not returning the list of inserted rows for bulk inserts, but allow the user to request that - in which case we use the slower path

That third option might be the way to go here.

I should benchmark first to figure out how much of a difference this actually makes.

simonw · 2022-10-28T01:37:14Z

Quick crude benchmark:

import sqlite3

db = sqlite3.connect(":memory:")

def create_table(db, name):
    db.execute(f"create table {name} (id integer primary key, title text)")

create_table(db, "single")
create_table(db, "multi")
create_table(db, "bulk")

def insert_singles(titles):
    inserted = []
    for title in titles:
        cursor = db.execute(f"insert into single (title) values (?)", [title])
        inserted.append((cursor.lastrowid, title))
    return inserted


def insert_many(titles):
    db.executemany(f"insert into multi (title) values (?)", ((t,) for t in titles))


def insert_bulk(titles):
    db.execute("insert into bulk (title) values {}".format(
        ", ".join("(?)" for _ in titles)
    ), titles)

titles = ["title {}".format(i) for i in range(1, 10001)]

Then in iPython I ran these:

In [14]: %timeit insert_singles(titles)
23.8 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit insert_many(titles)
12 ms ± 520 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit insert_bulk(titles)
2.59 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the bulk insert really is a lot faster - 3ms compared to 24ms for single inserts, so ~8x faster.

simonw · 2022-10-28T01:51:40Z

This needs to support the following:

Rows do not include a primary key - one is assigned by the database
Rows provide their own primary key, any clashes are errors
Rows provide their own primary key, clashes are silently ignored
Rows provide their own primary key, replacing any existing records

simonw · 2022-10-28T16:29:55Z

I wonder if there's something clever I could do here within a transaction?

Start a transaction. Write out a temporary in-memory table with all of the existing primary keys in the table. Run the bulk insert. Then run select pk from table where pk not in (select pk from old_pks) to see what has changed.

I don't think that's going to work well for large tables.

I'm going to go with not returning inserted rows by default, unless you pass a special option requesting that.

Also improved API explorer to show HTTP status of response, refs #1871

simonw · 2022-11-14T05:48:22Z

I changed my mind about the "return_rows": true option - I'm going to rename it to "return": true.

Refs #1866 (comment)

simonw mentioned this issue Oct 27, 2022

Write API in Datasette core #1850

Closed

17 tasks

simonw added the json-api label Oct 27, 2022

simonw added this to the Datasette 1.0 milestone Oct 27, 2022

simonw mentioned this issue Oct 27, 2022

Create a new table from one or more records, sqlite-utils style #1862

Closed

This was referenced Oct 27, 2022

/db/table/-/rename API (also allows atomic replace) #1867

Open

API to insert a single record into an existing table #1851

Closed

simonw added a commit that referenced this issue Oct 30, 2022

API for bulk inserts, closes #1866

c35859a

simonw closed this as completed Oct 30, 2022

This was referenced Oct 30, 2022

API explorer tool #1871

Closed

Ensure insert API has good tests for rowid and compound primark key tables #1873

Open

Make cursor.rowcount accessible (wontfix) simonw/sqlite-utils#506

Closed

simonw added a commit that referenced this issue Nov 1, 2022

Handle database errors in /-/insert, refs #1866, #1873

497290b

Also improved API explorer to show HTTP status of response, refs #1871

simonw added a commit that referenced this issue Nov 14, 2022

Renamed return_rows to return in insert API

264d0ab

Refs #1866 (comment)

simonw modified the milestones: Datasette 1.0, Datasette 1.0a0 Nov 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API for bulk inserting records into a table #1866

API for bulk inserting records into a table #1866

simonw commented Oct 27, 2022 •

edited

Loading

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 28, 2022

simonw commented Oct 28, 2022 •

edited

Loading

simonw commented Oct 28, 2022 •

edited

Loading

simonw commented Oct 28, 2022

simonw commented Oct 28, 2022

simonw commented Nov 14, 2022

API for bulk inserting records into a table #1866

API for bulk inserting records into a table #1866

Comments

simonw commented Oct 27, 2022 • edited Loading

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 27, 2022

simonw commented Oct 28, 2022

simonw commented Oct 28, 2022 • edited Loading

simonw commented Oct 28, 2022 • edited Loading

simonw commented Oct 28, 2022

simonw commented Oct 28, 2022

simonw commented Nov 14, 2022

simonw commented Oct 27, 2022 •

edited

Loading

simonw commented Oct 28, 2022 •

edited

Loading

simonw commented Oct 28, 2022 •

edited

Loading