
insert_all(..., alter=True) should work for new columns introduced after the first 100 records #139

Closed
simonwiles opened this issue Aug 27, 2020 · 7 comments · Fixed by #142
Labels
enhancement New feature or request

Comments

@simonwiles
Contributor

Is there a way to make .insert_all() work properly when new columns are introduced outside the first 100 records (with or without the alter=True argument)?

I'm using .insert_all() to bulk insert ~3-4k records at a time and it is common for records to need to introduce new columns. However, if new columns are introduced after the first 100 records, sqlite_utils doesn't even raise the OperationalError: table ... has no column named ... exception; it just silently drops the extra data and moves on.

It took me a while to find this little snippet in the documentation for .insert_all() (it's not mentioned under Adding columns automatically on insert/update):

The column types used in the CREATE TABLE statement are automatically derived from the types of data in that first batch of rows. Any additional or missing columns in subsequent batches will be ignored.

I tried changing the batch_size argument to the total number of records, but it seems only to affect the number of rows that are committed at a time, and has no influence on this problem.

Is there a way around this that you would suggest? It seems like it should raise an exception at least.
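To make the failure mode concrete, here is a minimal stdlib-`sqlite3` sketch of the behaviour described above (an illustration of the logic, not sqlite-utils internals): column names are derived from the first batch of 100 records only, so an extra key appearing in record 101 is silently dropped rather than raising an error.

```python
import sqlite3

records = [{"i": i, "word": "word_{}".format(i)} for i in range(100)]
records.append({"i": 101, "extra": "this value is lost"})

conn = sqlite3.connect(":memory:")

# Schema is derived from the first batch of 100 records only
first_batch = records[:100]
columns = list(first_batch[0].keys())  # ["i", "word"] -- no "extra"
conn.execute("CREATE TABLE test ({})".format(", ".join(columns)))

placeholders = ", ".join("?" for _ in columns)
for record in records:
    # .get() quietly supplies None for missing keys and ignores extras,
    # so no OperationalError is ever raised for the "extra" key
    conn.execute(
        "INSERT INTO test VALUES ({})".format(placeholders),
        [record.get(col) for col in columns],
    )

row = conn.execute("SELECT * FROM test WHERE i = 101").fetchone()
print(row)  # (101, None) -- the "extra" data is gone
```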

@simonwiles
Contributor Author

I tried changing the batch_size argument to the total number of records, but it seems only to affect the number of rows that are committed at a time, and has no influence on this problem.

So the reason for this is that the batch_size for import is limited (of necessity) here: https://github.com/simonw/sqlite-utils/blob/main/sqlite_utils/db.py#L1048

With regard to the issue of ignoring columns, however, I made a fork and hacked a temporary fix that looks like this:
simonwiles@3901f43

It doesn't seem to affect performance enormously (though I've not tested it thoroughly), and it now does what I need (and would expect, tbh), but it fails the test here:
https://github.com/simonw/sqlite-utils/blob/main/tests/test_create.py#L710-L716

The existence of this test suggests that insert_all() is behaving as intended, of course. It seems odd to me that this would be a desirable default behaviour (let alone the only behaviour), and it's not very prominently flagged up, either.

@simonw is this something you'd be willing to look at a PR for? I assume you wouldn't want to change the default behaviour at this point, but perhaps an option could be provided, or at least a bit more of a warning in the docs. Are there oversights in my implementation?

Would be grateful for your thoughts! Thanks!

@simonw
Owner

simonw commented Aug 28, 2020

This is deliberate behaviour, but I'm not at all attached to it - you're right in pointing out that it's actually pretty unexpected.

I'd be happy to change this behaviour so if you pass alter=True and then use .insert_all() on more than 100 rows it works as you would expect, instead of silently ignoring new columns past the first 100 rows. I don't expect that anyone would be depending on the current behaviour (ignore new columns after the first 100) such that this should be considered a backwards incompatible change.
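A hedged sketch of what that expected behaviour looks like at the SQL level (this is an illustration, not sqlite-utils internals, and `insert_chunk_with_alter` is a hypothetical helper name): before inserting each batch after the first, any columns the table does not yet have are added via `ALTER TABLE ... ADD COLUMN`, so later records with new keys are preserved instead of dropped.

```python
import sqlite3

def insert_chunk_with_alter(conn, table, chunk):
    # Discover the table's current columns (name is field 1 of table_info)
    existing = {row[1] for row in conn.execute("PRAGMA table_info({})".format(table))}
    # Add any column seen in this chunk that the table lacks
    for record in chunk:
        for column in record:
            if column not in existing:
                conn.execute("ALTER TABLE {} ADD COLUMN {}".format(table, column))
                existing.add(column)
    # Insert each record using only the columns it actually provides
    for record in chunk:
        cols = list(record.keys())
        qs = ", ".join("?" for _ in cols)
        conn.execute(
            "INSERT INTO {} ({}) VALUES ({})".format(table, ", ".join(cols), qs),
            list(record.values()),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (i, word)")
insert_chunk_with_alter(conn, "test", [{"i": 101, "extra": "preserved"}])
print(conn.execute("SELECT i, word, extra FROM test").fetchone())
# (101, None, 'preserved')
```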

@simonw simonw changed the title insert_all() silently drops data? insert_all(..., alter=True) should work for new columns introduced after the first 100 records Aug 28, 2020
@simonw simonw added the enhancement New feature or request label Aug 28, 2020
@simonw
Owner

simonw commented Aug 28, 2020

I'd be happy to accept a PR for this, provided it included updated unit tests that illustrate it working. I think this is a really good improvement.

@simonw
Owner

simonw commented Aug 28, 2020

Here's a suggested test update:

diff --git a/sqlite_utils/db.py b/sqlite_utils/db.py
index a8791c3..12fa2f2 100644
--- a/sqlite_utils/db.py
+++ b/sqlite_utils/db.py
@@ -1074,6 +1074,13 @@ class Table(Queryable):
                 all_columns = list(sorted(all_columns))
                 if hash_id:
                     all_columns.insert(0, hash_id)
+            else:
+                all_columns += [
+                    column
+                    for record in chunk
+                    for column in record
+                    if column not in all_columns
+                ]
             validate_column_names(all_columns)
             first = False
             # values is the list of insert data that is passed to the
diff --git a/tests/test_create.py b/tests/test_create.py
index a84eb8d..3a7fafc 100644
--- a/tests/test_create.py
+++ b/tests/test_create.py
@@ -707,13 +707,15 @@ def test_insert_thousands_using_generator(fresh_db):
     assert 10000 == fresh_db["test"].count
 
 
-def test_insert_thousands_ignores_extra_columns_after_first_100(fresh_db):
+def test_insert_thousands_adds_extra_columns_after_first_100(fresh_db):
+    # https://github.com/simonw/sqlite-utils/issues/139
     fresh_db["test"].insert_all(
         [{"i": i, "word": "word_{}".format(i)} for i in range(100)]
-        + [{"i": 101, "extra": "This extra column should cause an exception"}]
+        + [{"i": 101, "extra": "Should trigger ALTER"}],
+        alter=True,
     )
     rows = fresh_db.execute_returning_dicts("select * from test where i = 101")
-    assert [{"i": 101, "word": None}] == rows
+    assert [{"i": 101, "word": None, "extra": "Should trigger ALTER"}] == rows
 
 
 def test_insert_ignore(fresh_db):

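The core of the db.py change in that diff is the per-chunk column scan. Run in isolation (outside the rest of the Table machinery), it extends all_columns with any keys not yet seen, preserving first-seen order:

```python
# all_columns as derived from the first chunk of 100 records
all_columns = ["i", "word"]

# A later chunk introduces keys the table hasn't seen yet
chunk = [
    {"i": 100, "extra": "Should trigger ALTER"},
    {"i": 101, "another": 1},
]

# The logic from the diff: collect every not-yet-seen column in the chunk
all_columns += [
    column
    for record in chunk
    for column in record
    if column not in all_columns
]
print(all_columns)  # ['i', 'word', 'extra', 'another']
```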
@simonw
Owner

simonw commented Aug 28, 2020

That pull request should update this section of the documentation too:

If you have more than one record to insert, the insert_all() method is a much more efficient way of inserting them. Just like insert() it will automatically detect the columns that should be created, but it will inspect the first batch of 100 items to help decide what those column types should be.


@simonwiles
Contributor Author

Thanks! And yeah, I had updating the docs on my list too :) Will try to get to it this afternoon (budgeting time is fraught with uncertainty at the moment!).

simonwiles added a commit to simonwiles/sqlite-utils that referenced this issue Aug 28, 2020
simonw pushed a commit that referenced this issue Aug 28, 2020
…100 records

* Insert all columns for every chunk
* Update unit test to reflect new behaviour
* Test that exception is raised
* Update documentation

Closes #139. Thanks, Simon Wiles!
simonw added a commit that referenced this issue Aug 28, 2020
@simonw
Owner

simonw commented Aug 28, 2020

Thanks @simonwiles, this is now released in 2.16.1: https://sqlite-utils.readthedocs.io/en/stable/changelog.html
