Add insert --truncate option #118
Conversation
Hmm, while tests pass, this may not work as intended on larger datasets. Looking into it.
Ah, I see the problem. The truncate is inside a loop I didn't realize was there.
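A minimal sketch of that bug, with invented table and data (not code from the PR): a DELETE inside the per-chunk loop erases every previously inserted chunk, so only the last chunk survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x)")
chunks = [[1, 2], [3, 4], [5, 6]]
for chunk in chunks:
    conn.execute("DELETE FROM t")  # bug: truncate runs once per chunk
    conn.executemany("INSERT INTO t VALUES (?)", [(x,) for x in chunk])
print(conn.execute("SELECT x FROM t").fetchall())  # [(5,), (6,)] -- earlier chunks lost
```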
Force-pushed from 595acc3 to 3cf1e8e
I fixed my original oops by moving the DELETE out of the loop. I wanted to make the DELETE + INSERT happen all in the same transaction so it was robust, but that was more complicated than I expected. The transaction handling in the Database/Table classes isn't systematic, and this poses big hurdles to making `insert --truncate` robust.

For example, I wanted to do this (whitespace ignored in diff, so indentation change not highlighted):

```diff
diff --git a/sqlite_utils/db.py b/sqlite_utils/db.py
index d6b9ecf..4107ceb 100644
--- a/sqlite_utils/db.py
+++ b/sqlite_utils/db.py
@@ -1028,6 +1028,11 @@ class Table(Queryable):
         batch_size = max(1, min(batch_size, SQLITE_MAX_VARS // num_columns))
         self.last_rowid = None
         self.last_pk = None
+        with self.db.conn:
+            # Explicit BEGIN is necessary because Python's sqlite3 doesn't
+            # issue implicit BEGINs for DDL, only DML. We mix DDL and DML
+            # below and might execute DDL first, e.g. for table creation.
+            self.db.conn.execute("BEGIN")
         if truncate and self.exists():
             self.db.conn.execute("DELETE FROM [{}];".format(self.name))
         for chunk in chunks(itertools.chain([first_record], records), batch_size):
@@ -1038,7 +1043,11 @@ class Table(Queryable):
                 # Use the first batch to derive the table names
                 column_types = suggest_column_types(chunk)
                 column_types.update(columns or {})
-                self.create(
+                # Not self.create() because that is wrapped in its own
+                # transaction and Python's sqlite3 doesn't support
+                # nested transactions.
+                self.db.create_table(
+                    self.name,
                     column_types,
                     pk,
                     foreign_keys,
@@ -1139,7 +1148,6 @@ class Table(Queryable):
         flat_values = list(itertools.chain(*values))
         queries_and_params = [(sql, flat_values)]
-        with self.db.conn:
         for query, params in queries_and_params:
             try:
                 result = self.db.conn.execute(query, params)
```

but that fails in tests because other methods call […].

Stepping back, it would be nice to make the transaction handling systematic and predictable. One way to do this is to make the […]. There is also the caveat that, for each transaction, an explicit BEGIN is necessary, since Python's sqlite3 only issues implicit BEGINs for DML, not DDL (as the comment in the diff above notes).
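To make the idea concrete, here is a rough sketch of what a systematic wrapper could look like. The `transaction` helper and its shape are invented for illustration, not sqlite-utils API; it funnels every write through one explicit BEGIN/COMMIT:

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def transaction(conn: sqlite3.Connection):
    # Hypothetical wrapper. The explicit BEGIN matters because Python's
    # sqlite3 issues implicit BEGINs only for DML, and a truncating insert
    # may run DDL (CREATE TABLE) before any DML.
    conn.execute("BEGIN")
    try:
        yield conn
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # hand all transaction control to the wrapper

with transaction(conn):
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")  # DDL
    conn.execute("DELETE FROM t")                                       # DML
    conn.execute("INSERT INTO t VALUES (1, 'a')")
```

Nested use would still fail with "cannot start a transaction within a transaction", which is exactly the hurdle the diff above runs into with `self.create()`.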
…file

This moves the update from the filesystem layer into the SQL layer, thus allowing multiple processes to coordinate. Datasette holds open SQLite connections and was keeping references to the deleted files.

The new --truncate option to `sqlite-utils insert` is added in a PR I submitted <simonw/sqlite-utils#118>. For now, sqlite-utils is installed from our fork. When/if --truncate is released officially, we can switch back to installing from PyPI.

Resolves #10.
This is a really good idea - and thank you for the detailed discussion in the pull request. I'm keen to discuss how transactions can work better. I tend to use this pattern in my own code:
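The snippet itself is elided above; one plausible shape for that pattern, with invented table and data, is the sqlite-utils `db.conn` sqlite3 connection used as a context manager, which commits on clean exit and rolls back on an exception:

```python
import sqlite_utils

db = sqlite_utils.Database("data.db")
rows = [{"id": 1, "name": "one"}, {"id": 2, "name": "two"}]

with db.conn:  # commit on success, roll back on exception
    db["records"].insert_all(rows, pk="id")
```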
But it's not documented and I've not thought very hard about it! I like having inserts that handle 10,000+ rows commit on every chunk so I can watch their progress from another process, but the library should absolutely support people who want to commit all of the rows in a single transaction - or combine changes with DML. Lots to discuss here. I'll start a new issue.
Thoughts on transactions would be much appreciated in #121
Oops, didn't mean to click "close" there.
The only thing missing from this PR is updates to the documentation. Those need to go in two places: the CLI documentation and the Python API documentation.
Here's an example of a previous commit that includes updates to both CLI and API documentation: f9473ac#diff-e3e2a9bfd88566b05001b02a3f51d286
Deletes all rows in the table (if it exists) before inserting new rows.

SQLite doesn't implement a TRUNCATE TABLE statement but does optimize an unqualified DELETE FROM.

This can be handy if you want to refresh the entire contents of a table but a) don't have a PK (so can't use --replace), b) don't want the table to disappear (even briefly) for other connections, and c) have to handle records that used to exist being deleted.

Ideally the replacement of rows would appear instantaneous to other connections by putting the DELETE + INSERT in a transaction, but this is very difficult without breaking other code as the current transaction handling is inconsistent and non-systematic. There exists the possibility for the DELETE to succeed but the INSERT to fail, leaving an empty table. This is not much worse, however, than the current possibility of one chunked INSERT succeeding and being committed while the next chunked INSERT fails, leaving a partially complete operation.
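As a hedged illustration of that behavior (database, table, and row names invented here), refreshing a table's contents from Python with the option this PR adds looks like:

```python
import sqlite_utils

db = sqlite_utils.Database("data.db")
fresh = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

# truncate=True issues an unqualified DELETE FROM before inserting,
# so the table itself never disappears for other connections.
db["lookups"].insert_all(fresh, truncate=True)
```

The CLI equivalent would be `sqlite-utils insert data.db lookups rows.json --truncate`.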
Ah, yes, thanks for this reminder! I've repushed with doc bits added.
Awesome, thank you very much.
…file

This moves the update from the filesystem layer into the SQL layer, thus allowing multiple processes to coordinate. Datasette holds open SQLite connections and was keeping references to the deleted files.

The new --truncate option to `sqlite-utils insert` is added in sqlite-utils 2.11, from a PR I submitted.¹

Resolves #10.

¹ simonw/sqlite-utils#118