
Migrate keywords data structure to Postgres ARRAY type #12996

Draft

miketheman wants to merge 8 commits into main
Conversation

miketheman (Member) commented Feb 15, 2023

This will be taken apart and shipped in parts.

🚨 Note: Data migration contained within.

Does not drop the existing column, but removes references to it.

Once merged, no new Releases will populate their keywords column; only the keywords_array column will be filled.

Meant to replace `Release.keywords`, add the column and populate with
the existing data.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
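
For context, a minimal sketch of how the new column might look on the model. The keywords_array name comes from this PR, but the types, options, and trimmed-down model here are assumptions, not the PR's actual code:

import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Release(Base):
    __tablename__ = "releases"  # trimmed to the relevant columns for illustration
    id = sa.Column(sa.Integer, primary_key=True)  # hypothetical; the real model differs
    keywords = sa.Column(sa.Text)                 # legacy comma-separated string
    keywords_array = sa.Column(ARRAY(sa.Text))    # new Postgres ARRAY column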
To preserve the previous behavior, provide a mechanism for callers to
get what they came for.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
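
Continuing the hypothetical sketch above, one way to give existing callers the comma-separated value they expect (the property name is invented; the PR may implement this differently):

    @property
    def keywords_string(self):
        # Render the new array back into the legacy comma-separated form.
        return ", ".join(self.keywords_array or [])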
Uses some nifty SQLAlchemy behavior to help suss out usages - both in
creation and calling.

Back this out before we ship, since by then we should have 0 warnings.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
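
The commit doesn't name the SQLAlchemy feature it uses; one plausible sketch is an ORM attribute event that raises a DeprecationWarning whenever the legacy column is written, building on the hypothetical Release model above:

import warnings
from sqlalchemy import event

@event.listens_for(Release.keywords, "set")
def _warn_on_keywords_set(target, value, oldvalue, initiator):
    # Surface any remaining writers of the legacy column during the transition.
    warnings.warn(
        "Release.keywords is deprecated; use Release.keywords_array",
        DeprecationWarning,
        stacklevel=2,
    )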
Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Will drop the column via migration at some later date.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Comment on lines +24 to +25
stripped_keywords = [keyword.strip() for keyword in split_keywords]
slimmed_keywords = [keyword for keyword in stripped_keywords if keyword]
miketheman (Member, Author):
I thought about making this a nested comprehension to save a variable, but figured that might hamper readability so did this instead.
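
For comparison, a plausible form of the nested comprehension the comment decided against — the same result in one statement (sample input invented for illustration):

split_keywords = "web, http, , framework ".split(",")  # sample input
slimmed_keywords = [k for k in (kw.strip() for kw in split_keywords) if k]
# -> ['web', 'http', 'framework']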

Comment on lines +38 to +48
UPDATE releases
SET keywords_array = (
    SELECT ARRAY(
        SELECT TRIM(
            UNNEST(
                STRING_TO_ARRAY(keywords, ',')
            )
        )
    )
)
WHERE keywords IS NOT NULL AND keywords != ''
miketheman (Member, Author):

This is the data migration I'm a little wary about - since it'll need to update somewhere around 4M records (depending on the WHERE clause in prod). Any timing estimates that could be run on TestPyPI DB would be helpful to gauge production impact.
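
One way to get those numbers (a sketch, not part of the PR): time the same backfill over a bounded sample on the TestPyPI database and extrapolate. The DSN and sample size here are placeholders:

import time
import sqlalchemy as sa

engine = sa.create_engine("postgresql:///warehouse")  # placeholder DSN
with engine.begin() as conn:
    start = time.monotonic()
    conn.execute(sa.text("""
        UPDATE releases
        SET keywords_array = (
            SELECT ARRAY(SELECT TRIM(UNNEST(STRING_TO_ARRAY(keywords, ','))))
        )
        WHERE id IN (
            SELECT id FROM releases
            WHERE keywords IS NOT NULL AND keywords != ''
            LIMIT 10000
        )
    """))
    print(f"10,000 rows in {time.monotonic() - start:.1f}s")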

A project Member replied:

We have a statement timeout and a lock timeout in migrations to prevent long-running migrations from blocking the production database for a long period (unless a migration explicitly opts out by setting the statement timeout to something else). See:

connection.execute("SET statement_timeout = 5000")
connection.execute("SET lock_timeout = 4000")

In the past what we've done is "break" Alembic's migration isolation by chunking up the data migration and manually calling COMMIT in the migration, see:

import sqlalchemy as sa
from alembic import op


def _get_num_rows(conn):
    return list(
        conn.execute(
            sa.text("SELECT COUNT(id) FROM releases WHERE is_prerelease IS NULL")
        )
    )[0][0]


def upgrade():
    conn = op.get_bind()
    conn.execute("SET statement_timeout = 60000")
    total_rows = _get_num_rows(conn)
    max_loops = total_rows / 100000 * 2
    loops = 0
    while _get_num_rows(conn) > 0 and loops < max_loops:
        loops += 1
        conn.execute(
            sa.text(
                """
                UPDATE releases
                SET is_prerelease = pep440_is_prerelease(version)
                WHERE id IN (
                    SELECT id
                    FROM releases
                    WHERE is_prerelease IS NULL
                    LIMIT 100000
                )
                """
            )
        )
        conn.execute("COMMIT")

miketheman (Member, Author):

Perfect, thanks for the example.

dstufft (Member) commented Feb 15, 2023

I haven't read the entire PR yet, but you probably want this split over multiple deploys. Currently, what's going to happen is:

  1. Column will be added, data migration will be run.
  2. New deployment will occur with the new code.
  3. Old deployment will be spun down, shutting down old code.

The important bit of that is that the old code will still be running after the data migration has completed, so anything uploaded in that window of time via the old code will have keywords populated, but not keywords_array.

So I would refactor this into multiple PRs, that we can deploy in separate deployments:

  1. Adds the keywords_array column, sets up the upload code to write both columns.
  2. Does the data migration, guarded so it doesn't touch rows that (1) would have already written (see the sketch after this list), and sets up the code to use only the keywords_array column.
    • Optional: This could also drop the keywords column from the code, but not from the database, making SQLAlchemy completely ignore that column.
  3. If the keywords column has been dropped from the code, drop it from the database as well.
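
A sketch of the guard in step 2 — assuming rows dual-written by step 1 already have keywords_array set, the backfill simply skips them (reusing the conn/sa setup from the chunked migration example above; not the PR's actual code):

conn.execute(
    sa.text(
        """
        UPDATE releases
        SET keywords_array = (
            SELECT ARRAY(SELECT TRIM(UNNEST(STRING_TO_ARRAY(keywords, ','))))
        )
        WHERE keywords IS NOT NULL
          AND keywords != ''
          AND keywords_array IS NULL  -- guard: skip rows step 1 already wrote
        """
    )
)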

miketheman (Member, Author):

probably want this split over multiple deploys

Yep! Going to dismantle and rebuild, now that all the pieces are known.

miketheman added the blocked label ("Issues we can't or shouldn't get to yet") Feb 15, 2023
miketheman added a commit to miketheman/warehouse that referenced this pull request Feb 15, 2023
Meant to replace `Release.keywords`, add the column and populate with
newly uploaded data.

Part 1 of a few - see pypi#12996 for background.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
miketheman added a commit to miketheman/warehouse that referenced this pull request Feb 22, 2023, with the same message as above.