Migrate keywords data structure to Postgres ARRAY type #12996
base: main
Conversation
Meant to replace `Release.keywords`, add the column and populate with the existing data. Signed-off-by: Mike Fiedler <miketheman@gmail.com>
To preserve the previous behavior, provide a mechanism for callers to get what they came for. Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Uses some nifty SQLAlchemy behavior to help suss out usages - both in creation and calling. Back this out before we ship, since by then we should have 0 warnings. Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Signed-off-by: Mike Fiedler <miketheman@gmail.com>
This reverts commit 125f3cb.
Will drop the column via migration at some later date. Signed-off-by: Mike Fiedler <miketheman@gmail.com>
stripped_keywords = [keyword.strip() for keyword in split_keywords]
slimmed_keywords = [keyword for keyword in stripped_keywords if keyword]
I thought about making this a nested comprehension to save a variable, but figured that might hamper readability so did this instead.
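To make the trade-off concrete, here is a minimal sketch of the two-step version quoted above alongside the nested comprehension it replaces (the wrapper function name is illustrative, not from the PR):

```python
def normalize_keywords(keywords: str) -> list[str]:
    """Split a comma-separated keywords string into a list of
    non-empty, whitespace-trimmed keywords."""
    split_keywords = keywords.split(",")
    stripped_keywords = [keyword.strip() for keyword in split_keywords]
    # Drop empty strings left over from stray commas ("a,,b" or "a, ,b").
    slimmed_keywords = [keyword for keyword in stripped_keywords if keyword]
    return slimmed_keywords


# The single nested comprehension alluded to would be equivalent:
#   [k for k in (part.strip() for part in keywords.split(",")) if k]
```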
UPDATE releases
SET keywords_array = (
    SELECT ARRAY(
        SELECT TRIM(
            UNNEST(
                STRING_TO_ARRAY(keywords, ',')
            )
        )
    )
)
WHERE keywords IS NOT NULL AND keywords != ''
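For sanity-checking, here is a pure-Python model of what this SQL computes per row, under the assumption that the input is the legacy comma-separated `keywords` string. Note one subtle difference from the upload-path normalization: `TRIM` over `UNNEST(STRING_TO_ARRAY(...))` trims each element but does not drop elements that become empty.

```python
def sql_backfill_equivalent(keywords: str) -> list[str]:
    """Mirror STRING_TO_ARRAY(keywords, ',') followed by TRIM on each
    unnested element, as in the data-migration UPDATE above."""
    # Unlike the upload-path normalization, empty elements survive:
    # "a,,b" yields ['a', '', 'b'].
    return [part.strip() for part in keywords.split(",")]
```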
This is the data migration I'm a little wary about, since it'll need to update somewhere around 4M records (depending on the WHERE clause in prod). Any timing estimates that could be run on the TestPyPI DB would be helpful to gauge production impact.
We have a statement timeout and a lock timeout in migrations to prevent long-running migrations from blocking the production database for a long period of time (unless the migration explicitly opts into that by setting the statement timeout to something else). See:
warehouse/warehouse/migrations/env.py
Lines 53 to 54 in 61a14ce
connection.execute("SET statement_timeout = 5000")
connection.execute("SET lock_timeout = 4000")
In the past, what we've done is "break" Alembic's migration isolation by chunking up the data migration and manually calling COMMIT in the migration, see:
warehouse/warehouse/migrations/versions/4490777c984f_migrate_existing_data_for_release_is_.py
Lines 28 to 58 in 61a14ce
def _get_num_rows(conn):
    return list(
        conn.execute(
            sa.text("SELECT COUNT(id) FROM releases WHERE is_prerelease IS NULL")
        )
    )[0][0]


def upgrade():
    conn = op.get_bind()
    conn.execute("SET statement_timeout = 60000")
    total_rows = _get_num_rows(conn)
    max_loops = total_rows / 100000 * 2
    loops = 0
    while _get_num_rows(conn) > 0 and loops < max_loops:
        loops += 1
        conn.execute(
            sa.text(
                """
                UPDATE releases
                SET is_prerelease = pep440_is_prerelease(version)
                WHERE id IN (
                    SELECT id
                    FROM releases
                    WHERE is_prerelease IS NULL
                    LIMIT 100000
                )
                """
            )
        )
        conn.execute("COMMIT")
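A hypothetical adaptation of that chunked-COMMIT pattern to the `keywords_array` backfill might look like the sketch below. The function name, batch size, and the `keywords_array IS NULL` progress predicate are illustrative assumptions, not code from the PR; the connection is assumed to be a SQLAlchemy-style connection exposing `execute()` with a `rowcount`.

```python
BATCH_SIZE = 100_000

BACKFILL_SQL = f"""
UPDATE releases
SET keywords_array = ARRAY(
    SELECT TRIM(UNNEST(STRING_TO_ARRAY(keywords, ',')))
)
WHERE id IN (
    SELECT id FROM releases
    WHERE keywords IS NOT NULL
      AND keywords != ''
      AND keywords_array IS NULL
    LIMIT {BATCH_SIZE}
)
"""


def backfill_keywords_array(conn) -> int:
    """Apply BACKFILL_SQL repeatedly, committing between batches so
    locks are released; stop when a batch updates zero rows.
    Returns the total number of rows updated."""
    total = 0
    while True:
        cursor = conn.execute(BACKFILL_SQL)
        # Break out of Alembic's single-transaction isolation so each
        # batch is durable and locks do not accumulate.
        conn.execute("COMMIT")
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount
```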
Perfect, thanks for the example.
I haven't read the entire PR yet, but you probably want this split over multiple deploys.
The important bit is that the old code will still be running after the data migration has completed, so anything uploaded in that window of time via the old code will have an unpopulated keywords_array. So I would refactor this into multiple PRs that we can deploy in separate deployments:
Yep! Going to dismantle and rebuild, now that all the pieces are known.
Meant to replace `Release.keywords`, add the column and populate with newly uploaded data. Part 1 of a few - see pypi#12996 for background. Signed-off-by: Mike Fiedler <miketheman@gmail.com>
This will be taken apart and shipped in parts.
🚨 Note: Data migration contained within
Does not drop existing column, but removes references to it.
Once merged, no new Releases will populate their `keywords` column; only the `keywords_array` column will be filled.