
Migrate keywords data structure to Postgres ARRAY type #12996

Draft

miketheman wants to merge 8 commits into main
Conversation

miketheman (Member) commented Feb 15, 2023

This will be taken apart and shipped in parts.

🚨 Note: Data migration contained within.

Does not drop the existing column, but removes references to it.

Once merged, no new Releases will populate their keywords column; only the keywords_array column will be filled.

Meant to replace `Release.keywords`, add the column and populate with
the existing data.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
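
For context, a minimal sketch of how the new column might look on the model. The keywords_array name comes from this PR, but the types, options, and trimmed-down model here are assumptions, not the PR's actual code:

import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Release(Base):
    __tablename__ = "releases"  # trimmed to the relevant columns for illustration
    id = sa.Column(sa.Integer, primary_key=True)  # hypothetical; the real model differs
    keywords = sa.Column(sa.Text)                 # legacy comma-separated string
    keywords_array = sa.Column(ARRAY(sa.Text))    # new Postgres ARRAY column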
To preserve the previous behavior, provide a mechanism for callers to
get what they came for.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
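
Continuing the hypothetical sketch above, one way to give existing callers the comma-separated value they expect (the property name is invented; the PR may implement this differently):

    @property
    def keywords_string(self):
        # Render the new array back into the legacy comma-separated form.
        return ", ".join(self.keywords_array or [])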
Uses some nifty SQLAlchemy behavior to help suss out usages - both in
creation and calling.

Back this out before we ship, since by then we should have 0 warnings.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
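
The commit doesn't name the SQLAlchemy feature it uses; one plausible sketch is an ORM attribute event that raises a DeprecationWarning whenever the legacy column is written, building on the hypothetical Release model above:

import warnings
from sqlalchemy import event

@event.listens_for(Release.keywords, "set")
def _warn_on_keywords_set(target, value, oldvalue, initiator):
    # Surface any remaining writers of the legacy column during the transition.
    warnings.warn(
        "Release.keywords is deprecated; use Release.keywords_array",
        DeprecationWarning,
        stacklevel=2,
    )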
Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Will drop the column via migration at some later date.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Comment on lines +24 to +25
stripped_keywords = [keyword.strip() for keyword in split_keywords]
slimmed_keywords = [keyword for keyword in stripped_keywords if keyword]
miketheman (Member, Author):
I thought about making this a nested comprehension to save a variable, but figured that might hamper readability so did this instead.
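
For comparison, a plausible form of the nested comprehension the comment decided against — the same result in one statement (sample input invented for illustration):

split_keywords = "web, http, , framework ".split(",")  # sample input
slimmed_keywords = [k for k in (kw.strip() for kw in split_keywords) if k]
# -> ['web', 'http', 'framework']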

Comment on lines +38 to +48
UPDATE releases
SET keywords_array = (
    SELECT ARRAY(
        SELECT TRIM(
            UNNEST(
                STRING_TO_ARRAY(keywords, ',')
            )
        )
    )
)
WHERE keywords IS NOT NULL AND keywords != ''
miketheman (Member, Author):

This is the data migration I'm a little wary about - since it'll need to update somewhere around 4M records (depending on the WHERE clause in prod). Any timing estimates that could be run on TestPyPI DB would be helpful to gauge production impact.
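
One way to get those numbers (a sketch, not part of the PR): time the same backfill over a bounded sample on the TestPyPI database and extrapolate. The DSN and sample size here are placeholders:

import time
import sqlalchemy as sa

engine = sa.create_engine("postgresql:///warehouse")  # placeholder DSN
with engine.begin() as conn:
    start = time.monotonic()
    conn.execute(sa.text("""
        UPDATE releases
        SET keywords_array = (
            SELECT ARRAY(SELECT TRIM(UNNEST(STRING_TO_ARRAY(keywords, ','))))
        )
        WHERE id IN (
            SELECT id FROM releases
            WHERE keywords IS NOT NULL AND keywords != ''
            LIMIT 10000
        )
    """))
    print(f"10,000 rows in {time.monotonic() - start:.1f}s")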

A project Member replied:

We have a statement timeout and a lock timeout in migrations to prevent long-running migrations from blocking the production database for a long period (unless a migration explicitly opts out by setting the statement timeout to something else). See:

connection.execute("SET statement_timeout = 5000")
connection.execute("SET lock_timeout = 4000")

In the past what we've done is "break" Alembic's migration isolation by chunking up the data migration and manually calling COMMIT in the migration, see:

import sqlalchemy as sa
from alembic import op


def _get_num_rows(conn):
    return list(
        conn.execute(
            sa.text("SELECT COUNT(id) FROM releases WHERE is_prerelease IS NULL")
        )
    )[0][0]


def upgrade():
    conn = op.get_bind()
    conn.execute("SET statement_timeout = 60000")
    total_rows = _get_num_rows(conn)
    max_loops = total_rows / 100000 * 2
    loops = 0
    while _get_num_rows(conn) > 0 and loops < max_loops:
        loops += 1
        conn.execute(
            sa.text(
                """
                UPDATE releases
                SET is_prerelease = pep440_is_prerelease(version)
                WHERE id IN (
                    SELECT id
                    FROM releases
                    WHERE is_prerelease IS NULL
                    LIMIT 100000
                )
                """
            )
        )
        conn.execute("COMMIT")

miketheman (Member, Author):

Perfect, thanks for the example.

dstufft (Member) commented Feb 15, 2023

I haven't read the entire PR yet, but you probably want this split over multiple deploys. Currently, what's going to happen is:

  1. Column will be added, data migration will be run.
  2. New deployment will occur with the new code.
  3. Old deployment will be spun down, shutting down old code.

The important bit of that is that the old code will still be running after the data migration has completed, so anything uploaded in that window of time via the old code will have keywords populated, but not keywords_array.

So I would refactor this into multiple PRs, that we can deploy in separate deployments:

  1. Adds the keywords_array column, sets up the upload code to write both columns.
  2. Does the data migration, guarded so it doesn't touch rows that (1) would have already written (see the sketch after this list), and sets up the code to use only the keywords_array column.
    • Optional: This could also drop the keywords column from the code, but not from the database, making SQLAlchemy completely ignore that column.
  3. If the keywords column has been dropped from the code, drop it from the database as well.
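
A sketch of the guard in step 2 — assuming rows dual-written by step 1 already have keywords_array set, the backfill simply skips them (reusing the conn/sa setup from the chunked migration example above; not the PR's actual code):

conn.execute(
    sa.text(
        """
        UPDATE releases
        SET keywords_array = (
            SELECT ARRAY(SELECT TRIM(UNNEST(STRING_TO_ARRAY(keywords, ','))))
        )
        WHERE keywords IS NOT NULL
          AND keywords != ''
          AND keywords_array IS NULL  -- guard: skip rows step 1 already wrote
        """
    )
)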

miketheman (Member, Author):

probably want this split over multiple deploys

Yep! Going to dismantle and rebuild, now that all the pieces are known.

miketheman added the blocked label ("Issues we can't or shouldn't get to yet") Feb 15, 2023
miketheman added a commit to miketheman/warehouse that referenced this pull request Feb 15, 2023
Meant to replace `Release.keywords`, add the column and populate with
newly uploaded data.

Part 1 of a few - see pypi#12996 for background.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
miketheman added a commit to miketheman/warehouse that referenced this pull request Feb 22, 2023, with the same message as above.