Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FTS table with 7 rows has _fts_docsize table with 9,141 rows #149

Closed
simonw opened this issue Sep 7, 2020 · 10 comments
Closed

FTS table with 7 rows has _fts_docsize table with 9,141 rows #149

simonw opened this issue Sep 7, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@simonw
Copy link
Owner

simonw commented Sep 7, 2020

I'm seeing a weird issue with some of the SQLite databases that I am using with the FTS5 module.

I have a database with a licenses table that contains 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses

The FTS table also has 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses_fts

Somehow the accompanying licenses_fts_docsize shadow table now has 9,141 rows in it! https://github-to-sqlite.dogsheep.net/github/licenses_fts_docsize

And licenses_fts_data has 41 rows - should I expect that to have 7 rows? https://github-to-sqlite.dogsheep.net/github/licenses_fts_data

I have a hunch that it might be a problem with the triggers. These are the triggers that are updating that FTS table: https://github-to-sqlite.dogsheep.net/github?sql=select+*+from+sqlite_master+where+type+%3D+%27trigger%27+and+tbl_name+%3D+%27licenses%27

type name tbl_name rootpage sql
trigger licenses_ai licenses 0 CREATE TRIGGER [licenses_ai] AFTER INSERT ON [licenses] BEGIN INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END
trigger licenses_ad licenses 0 CREATE TRIGGER [licenses_ad] AFTER DELETE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); END
trigger licenses_au licenses 0 CREATE TRIGGER [licenses_au] AFTER UPDATE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END
@simonw simonw added the bug Something isn't working label Sep 7, 2020
@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

First posted on SQLite forum here but I'm pretty sure this is a bug in how sqlite-utils created those tables: https://sqlite.org/forum/forumpost/51aada1b45

@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

Another likely culprit: licenses has a text primary key, so it's not using rowid:

CREATE TABLE [licenses] (
   [key] TEXT PRIMARY KEY,
   [name] TEXT,
   [spdx_id] TEXT,
   [url] TEXT,
   [node_id] TEXT
);

@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

Even though that table doesn't declare an integer primary key it does have a rowid column: https://github-to-sqlite.dogsheep.net/github?sql=select+rowid%2C+%5Bkey%5D%2C+name%2C+spdx_id%2C+url%2C+node_id+from+licenses+order+by+%5Bkey%5D+limit+101

rowid key name spdx_id url node_id
9150 apache-2.0 Apache License 2.0 Apache-2.0 https://api.github.com/licenses/apache-2.0 MDc6TGljZW5zZTI=
112 bsd-3-clause BSD 3-Clause "New" or "Revised" License BSD-3-Clause https://api.github.com/licenses/bsd-3-clause MDc6TGljZW5zZTU=

https://www.sqlite.org/rowidtable.html explains has this clue:

If the rowid is not aliased by INTEGER PRIMARY KEY then it is not persistent and might change. In particular the VACUUM command will change rowids for tables that do not declare an INTEGER PRIMARY KEY. Therefore, applications should not normally access the rowid directly, but instead use an INTEGER PRIMARY KEY.

@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

Aha! I have managed to replicate the bug:

(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 35},
 {"table": "licenses_fts_idx", "count": 16},
 {"table": "licenses_fts_docsize", "count": 9151},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
(github-to-sqlite) /tmp % github-to-sqlite repos github.db dogsheep       
(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 45},
 {"table": "licenses_fts_idx", "count": 26},
 {"table": "licenses_fts_docsize", "count": 9161},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},

Note that the number of records in licenses_fts_docsize went from 9151 to 9161.

@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

reading through the code for github-to-sqlite repos - one of the things it does is calls save_license for each repo:

https://github.com/dogsheep/github-to-sqlite/blob/39b2234253096bd579feed4e25104698b8ccd2ba/github_to_sqlite/utils.py#L259-L262

def save_license(db, license):
    if license is None:
        return None
    return db["licenses"].insert(license, pk="key", replace=True).last_pk

@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

Using replace=True there executes INSERT OR REPLACE - and Dan Kennedy (SQLite maintainer) on the SQLite forums said this:

Are you using "REPLACE INTO", or "UPDATE OR REPLACE" on the "licenses" table without having first executed "PRAGMA recursive_triggers = 1"? The docs note that delete triggers will not be fired in this case, which would explain things. Second paragraph under "REPLACE" here:

https://www.sqlite.org/lang_conflict.html

@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

And the SQLite documentation says:

When the REPLACE conflict resolution strategy deletes rows in order to satisfy a constraint, delete triggers fire if and only if recursive triggers are enabled.

@simonw simonw changed the title FTS5 table with 7 rows has _fts_docsize table with 9,141 rows FTS table with 7 rows has _fts_docsize table with 9,141 rows Sep 7, 2020
@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

This replicates the problem:

(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 35},
 {"table": "licenses_fts_idx", "count": 16},
 {"table": "licenses_fts_docsize", "count": 9151},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
(github-to-sqlite) /tmp % github-to-sqlite repos github.db dogsheep       
(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 45},
 {"table": "licenses_fts_idx", "count": 26},
 {"table": "licenses_fts_docsize", "count": 9161},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},

Note how the number of rows in licenses_fts_docsize goes from 9151 to 9161.

The number went up by ten. I used tracing from #151 to show that the following SQL executed ten times:

INSERT OR REPLACE INTO [licenses] ([key], [name], [node_id], [spdx_id], [url]) VALUES 
    (?, ?, ?, ?, ?);

Then I tried executing PRAGMA recursive_triggers=on; at the start of the script. This fixed the problem - running the script did not increase the number of rows in licenses_fts_docsize.

@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

https://www.sqlite.org/pragma.html#pragma_recursive_triggers says:

Prior to SQLite version 3.6.18 (2009-09-11), recursive triggers were not supported. The behavior of SQLite was always as if this pragma was set to OFF. Support for recursive triggers was added in version 3.6.18 but was initially turned OFF by default, for compatibility. Recursive triggers may be turned on by default in future versions of SQLite.

So I think the fix is to turn on recursive_triggers globally by default for sqlite-utils.

@simonw
Copy link
Owner Author

simonw commented Sep 7, 2020

The second challenge here is cleaning up all of those junk rows in existing *_fts_docsize tables. Doing that just to the demo database from https://github-to-sqlite.dogsheep.net/github.db dropped its size from 22MB to 16MB! Here's the SQL:

DELETE FROM [licenses_fts_docsize] WHERE id NOT IN (
  SELECT rowid FROM [licenses_fts]);

I can do that as part of the existing table.optimize() method, which optimizes FTS tables.

simonw added a commit that referenced this issue Sep 7, 2020
@simonw simonw closed this as completed in 3e87500 Sep 7, 2020
simonw added a commit that referenced this issue Sep 7, 2020
simonw added a commit to dogsheep/github-to-sqlite that referenced this issue Sep 7, 2020
simonw added a commit to dogsheep/github-to-sqlite that referenced this issue Sep 11, 2020
simonw added a commit that referenced this issue Dec 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant