Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

table.transform() method for advanced alter table #114

Closed
simonw opened this issue May 20, 2020 · 26 comments · Fixed by #161
Closed

table.transform() method for advanced alter table #114

simonw opened this issue May 20, 2020 · 26 comments · Fixed by #161
Labels
enhancement New feature or request
Milestone

Comments

@simonw
Copy link
Owner

simonw commented May 20, 2020

SQLite's ALTER TABLE can only do the following:

  • Rename a table
  • Rename a column
  • Add a column

Notably, it cannot drop columns - so tricks like "add a float version of this text column, populate it, then drop the old one and rename" won't work.

The docs here https://www.sqlite.org/lang_altertable.html#making_other_kinds_of_table_schema_changes describe a way of implementing full alters safely within a transaction, but it's fiddly.

  1. Create new table
  2. Copy data
  3. Drop old table
  4. Rename new into old

It would be great if sqlite-utils provided an abstraction to help make these kinds of changes safely.

@simonw simonw added the enhancement New feature or request label May 20, 2020
@simonw
Copy link
Owner Author

simonw commented May 30, 2020

Idea: use a chained API to define a complex transition and then execute it all at once. For example:

db["mytable"].transform().rename("col1", "col_1") \
  .change_type("col1", float) \
  .execute()

@simonw
Copy link
Owner Author

simonw commented Jul 8, 2020

Ideally this would all happen in a single transaction, such that other processes talking to the database would not see any inconsistent state while the table copy was taking place. Need to confirm that this is possible. Also refs transactions thoughts in #121.

@simonw
Copy link
Owner Author

simonw commented Jul 8, 2020

I'm not so keen on that chained API - it's pretty complicated.

Here's an idea for a much simpler interface. Essentially it lets you say "take table X and migrate its contents to a new table with this structure - then atomically rename the tables to switch them":

db["mytable"].migrate_table({"id": int, "name": str"}, pk="id")

The migrate_table() method would take the same exact signature as the table.create() method:

def create(
self,
columns,
pk=None,
foreign_keys=None,
column_order=None,
not_null=None,
defaults=None,
hash_id=None,
extracts=None,
):

@simonw
Copy link
Owner Author

simonw commented Jul 8, 2020

Alternative possible names:

  • .transform_table()
  • .migrate()
  • .transform()

I'm torn between .migrate_table() and .transform_table().

@simonw
Copy link
Owner Author

simonw commented Jul 8, 2020

Since neither the term "transform" or "migrate" are used in the codebase at the moment, I think I'll go with .transform_table() - that leaves the term "migrate" available for any future database migrations system (similar to Django's).

@simonw
Copy link
Owner Author

simonw commented Jul 8, 2020

Don't forget this step:

If foreign key constraints are enabled, disable them using PRAGMA foreign_keys=OFF.

@simonw
Copy link
Owner Author

simonw commented Jul 8, 2020

Thinking about the method signature:

    def transform_table(
        self,
        columns,
        pk=None,
        foreign_keys=None,
        column_order=None,
        not_null=None,
        defaults=None,
        hash_id=None,
        extracts=None,
    ):

This requires the caller to provide the exact set of columns for the new table.

It would be useful if this was optional - if you could omit the columns and have it automatically use the previous columns. This would let you change things like the primary key or the column order using the other arguments.

Even better: allow column renaming using an optional rename={...} argument:

db["dogs"].transform_table(rename={"name": "dog_name"})

@simonw
Copy link
Owner Author

simonw commented Jul 8, 2020

I can have a convenient change_type={...} parameter for changing column types too.

@simonw
Copy link
Owner Author

simonw commented Jul 8, 2020

Work in progress: not quite right yet, I need smarter logic for how renamed columns are reflected in the generated INSERT INTO ... SELECT ... query:

    def transform_table(
        self,
        columns=None,
        rename=None,
        change_type=None,
        pk=None,
        foreign_keys=None,
        column_order=None,
        not_null=None,
        defaults=None,
        hash_id=None,
        extracts=None,
    ):
        assert self.exists(), "Cannot transform a table that doesn't exist yet"
        columns = columns or self.columns_dict
        if rename is not None or change_type is not None:
            columns = {rename.get(key, key): change_type.get(key, value) for key, value in columns.items()}
        new_table_name = "{}_new_{}".format(self.name, os.urandom(6).hex())
        previous_columns = set(self.columns_dict.keys())
        with self.db.conn:
            columns = {name: value for (name, value) in columns.items()}
            new_table = self.db.create_table(
                new_table_name,
                columns,
                pk=pk,
                foreign_keys=foreign_keys,
                column_order=column_order,
                not_null=not_null,
                defaults=defaults,
                hash_id=hash_id,
                extracts=extracts,
            )
            # Copy across data - but only for columns that exist in both
            new_columns = set(columns.keys())
            columns_to_copy = new_columns.intersection(previous_columns)
            copy_sql = "INSERT INTO [{new_table}] ({new_cols}) SELECT {old_cols} FROM [{old_table}]".format(
                new_table=new_table_name,
                old_table=self.name,
                old_cols=", ".join("[{}]".format(col) for col in columns_to_copy),
                new_cols=", ".join("[{}]".format(rename.get(col, col)) for col in columns_to_copy),
            )
            self.db.conn.execute(copy_sql)
            # Drop the old table
            self.db.conn.execute("DROP TABLE [{}]".format(self.name))
            # Rename the new one
            self.db.conn.execute(
                "ALTER TABLE [{}] RENAME TO [{}]".format(new_table_name, self.name)
            )
        return self

@simonw simonw changed the title Advanced alter table support table.transform_table() method for advanced alter table Jul 8, 2020
@simonw
Copy link
Owner Author

simonw commented Jul 8, 2020

According to https://www.sqlite.org/lang_altertable.html#making_other_kinds_of_table_schema_changes the hardest bits to consider are how to deal with existing foreign key relationships, triggers and views.

I'm OK leaving views as an exercise for the caller - many of these transformations may not need any view changes at all.

Foreign key relationships are important: it should handle these automatically as effectively as possible.

Likewise trigger changes: need to think about what this means.

@simonw
Copy link
Owner Author

simonw commented Jul 9, 2020

I'm going to add a second method .transform_table_sql(...) - which returns the SQL that would have been executed but does NOT execute it.

Advanced callers can use this to include their own additional steps in the same transaction - e.g. recreating views or triggers.

More importantly it gives me a useful hook for writing some unit tests against the generated SQL.

@simonw
Copy link
Owner Author

simonw commented Jul 27, 2020

Work in progress in transform branch here: https://github.com/simonw/sqlite-utils/tree/transform

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

I've decided to call this table.transform() - I was over-thinking whether people would remember that .transform() actually transforms the table, but that's what documentation is for.

@simonw simonw changed the title table.transform_table() method for advanced alter table table.transform() method for advanced alter table Sep 21, 2020
@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

I'm rethinking the API design now. Maybe it could look like this:

To change the type of the author_id column from text to int:

books.transform({"author_id": int})

This would leave the existing columns alone, but would change the type of this column.

To rename author_id to author_identifier:

books.transform(rename={"author_id": "author_identifier"})

To drop a column:

books.transform(drop=["author_id"])

Since the parameters all operate on columns they don't need to be called drop_column and rename_column.

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

I'm going to sketch out a prototype of this new API design in that branch.

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

For FTS tables associated with the table that is being transformed, should I automatically drop the old FTS table and recreate it against the new one or will it just magically continue to work after the table is renamed?

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

I may need to do something special for rowid tables to ensure that the rowid values in the transformed table match those from the old table.

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

To expand on what that first argument - the columns argument - does. Say you have a table like this:

id integer
name text
age text

Any columns omitted from the columns= argument are left alone - they have to be explicitly dropped using drop= if you want to drop them.

Any new columns are added (at the end of the table):

table.tranform({"size": float})

Any columns that have their type changed will have their type changed:

table.tranform({"age": int})

Should I also re-order columns if the order doesn't match? I think so. Open question as to what happens to columns that aren't mentioned at all in the dictionary though - what order should they go in?

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

If you want to both change the type of a column AND rename it in the same operation, how would you do that? I think like this:

table.transform({"age": int}, rename={"age": "dog_age"})

So any rename logic is applied at the end, after the type transformation or re-ordering logic.

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

The columns= argument is optional - so you can do just a rename operation like so:

table.transform(rename={"age": "dog_age"})

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

Since I have a column_order=None argument already, maybe I can ignore the order of the columns in that first argument and use that instead?

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

Does it make sense to support the pk= argument for changing the primary key?

If the user requests a primary key that doesn't make sense I think an integrity error will be raised when the SQL is being executed, which should hopefully cancel the transaction and raise an error. Need to check that this is what happens.

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

A test that confirms that this mechanism can turn a rowid into a non-rowid table would be good too.

@simonw
Copy link
Owner Author

simonw commented Sep 21, 2020

I think the fiddliest part of the implementation here is code that takes the existing columns_dict of the table and the incoming columns= and drop= and rename= parameters and produces the columns dictionary for the new table, ready to be fed to .create_table().

This logic probably also needs to return a structure that can be used to build the INSERT INTO ... SELECT ... FROM query.

@simonw
Copy link
Owner Author

simonw commented Sep 22, 2020

The reason I'm working on this now is that I'd like to support many more options for data cleanup in the Datasette ecosystem - so being able to do things like convert the type of existing columns becomes increasingly important.

@simonw simonw linked a pull request Sep 22, 2020 that will close this issue
6 tasks
simonw added a commit that referenced this issue Sep 22, 2020
@simonw
Copy link
Owner Author

simonw commented Sep 22, 2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant