v4 migration: App level notifications, queue status tracking, pause/resume #301

bgentry · 2024-04-15T02:23:34Z

There's a lot in this PR, all of it somewhat related and dependent on the new v4 database migration.

v4 migration summary

Make river_job.args non-null (there should be no null values in here anyway)
Make river_job.metadata non-null (there should be no null values in here anyway)
Drop job insert trigger notifications so this can be done at the application level
Stricter finalized_at constraint
Add pending job state. Currently unused, but will be used to build higher level functionality by staging jobs that are not yet ready to run (for some reason other than their scheduled time being in the future).
Add a river_queue table for queue state tracking, pause/resume.

I believe all of these are totally safe to perform on an actively running River installation with no interruptions. We should carefully consider this though. The removal of insert triggers does increase new job execution latency until a subsequent deploy of updated River clients with app-level notification logic, but this is a minor and temporary degradation.

Application level insert notifications

The initial design for River utilized a trigger on job insert that issued notifications (NOTIFY) so that listening clients could quickly pick up the work if they were idle. While this is good for lowering latency, it does have the side effect of emitting a large amount of notifications any time there are lots of jobs being inserted. This adds overhead, particularly to high-throughput installations.

To improve this situation and reduce overhead in high-throughput installations, the notifications have been refactored to be emitted at the application level. A client-level debouncer ensures that these notifications are not emitted more often than they could be useful. If a queue is due for an insert notification (on a particular Postgres schema), the notification is piggy-backed onto the insert query within the transaction. While this has the impact of increasing insert latency for a certain percentage of cases, the effect should be small.

Additionally, the initial release of River did not properly scope notification topics within the global LISTEN/NOTIFY namespace. If two River installations were operating on the same Postgres database but within different schemas (search paths), their notifications would be emitted on a shared topic name. This is no longer the case and all notifications are prefixed with a {schema_name}. string.

Queue status tracking, pause and resume

A useful operational lever is the ability to pause and resume a queue without shutting down clients. In addition to pause/resume being a feature request from #54, as part of my work on River's UI I've encountered the need to list out the active queues so that they can be displayed and manipulated.

A new river_queue table is introduced in the migration for this purpose. Upon startup, every producer in each River Client will make an UPSERT query to the database to either register the queue as being active, or if it already exists it will instead bump the timestamp to keep it active. This query will be run periodically as long as the Client is alive, even if the queue is paused. A separate query will delete/purge any queues which have not been active in awhile (currently fixed to 24 hours). This currently happens independently in each producer, but it really doesn't need to (see TODO below).

This table provides a place to pause and resume a single queue by name, or all queues using the special * value. Each producer will watch for notifications on the relevant LISTEN/NOTIFY topic unless operating in poll-only mode, in which case they will periodically poll for changes to this record in the database.

TODOs

Make queue purging a standalone maintenance process not connected to the producers.
Changelog entry
update rivertype.JobStates to include pending state

Closes #54.

bgentry · 2024-04-15T02:28:49Z

Nice, I made some race conditions 😆 Well actually they appear to be just in the tests as a result of the SharedTx thing we were just discussing, not necessarily a problem with the actual code.

bgentry · 2024-04-21T19:08:49Z

@brandur alright, I think your fix in #311 did the trick on all the SharedTx issues, so this is good for review.

Some notes: we did get a single flaky failure on this run on the Client start/stop stress test. And there are some other issues to work through in this latest run.

I have to break for food but will pick back up on those shortly. Feel free to review meanwhile if you're available.

brandur

Great!!! Lots of useful new stuff in here. River's going to be looking pretty feature complete with this in ...

rivertype/river_type.go

client.go

insert_opts.go

client.go

queue_pause_opts.go

riverdriver/riverpgxv5/internal/dbsqlc/pg_misc.sql

riverdriver/riverpgxv5/internal/dbsqlc/river_job.sql

internal/riverinternaltest/riverdrivertest/riverdrivertest.go

internal/maintenance/queue_cleaner.go

client.go

* Make job args non-null * Make job metadata non-null * Drop insert notification trigger and function as this will be done at the application level going forward. * Add 'pending' state

bgentry · 2024-04-28T21:40:20Z

There are just a few items left here, mainly the name of the pubsub topic(s) we'll use for job and queue controls. Please lmk your thoughts on all the unresolved items! 🙏

brandur

There are just a few items left here, mainly the name of the pubsub topic(s) we'll use for job and queue controls. Please lmk your thoughts on all the unresolved items! 🙏

Thanks for getting through all that! I added a couple more tiny comments, but looks solid.

Changelog looks great.

Let's get this shipped.

brandur · 2024-04-30T01:29:08Z

CHANGELOG.md

+- Add `pending` job state. This is currently unused, but will be used to build higher level functionality for staging jobs that are not yet ready to run (for some reason other than their scheduled time being in the future). Pending jobs will never be run or deleted and must first be moved to another state by external code. [PR #301](https://github.com/riverqueue/river/pull/301).
+- Queue status tracking, pause and resume. [PR #301](https://github.com/riverqueue/river/pull/301).
+
+  A useful operational lever is the ability to pause and resume a queue without shutting down clients. In addition to pause/resume being a feature request from #54, as part of the work on River's UI it's been useful to list out the active queues so that they can be displayed and manipulated.


Sorry for the super nit, but mind linking up 54 since it won't be in the rendered changelog.

Suggested change

A useful operational lever is the ability to pause and resume a queue without shutting down clients. In addition to pause/resume being a feature request from #54, as part of the work on River's UI it's been useful to list out the active queues so that they can be displayed and manipulated.

A useful operational lever is the ability to pause and resume a queue without shutting down clients. In addition to pause/resume being a feature request from [#54](https://github.com/riverqueue/river/pull/54), as part of the work on River's UI it's been useful to list out the active queues so that they can be displayed and manipulated.

brandur · 2024-04-30T01:30:40Z

CHANGELOG.md

@@ -7,6 +7,32 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+⚠️ Version 0.5.0 contains a new database migration, version 4. This migration is backward compatible with any River installation running the v3 migration. Be sure to run the v4 migration prior to deploying the code from this release.


We may want to put together a little upgrade guide on the website because people's familiarity with the last time they migrated may have become long atrophied at this point (i.e. how to use the CLI, the fact that the CLI will need to be updated separately to know about the new migration, etc.).

Instead of emitting notifications for all inserts with a database trigger, use application-level notification logic with a per-queue debouncer. This enables us to drastically reduce the amount of notifications emitted to 1 per inserting client per cooldown interval per queue. The `JobScheduler` has been reworked to utilize this same mechanism. In addition, the scheduler has been updated to use a "look ahead" concept, meaning any jobs that are scheduled before its next planned run should be preemptively marked as available (while preserving their `ScheduledAt` time). This makes jobs available for scheduled work sooner and without delay, with the caveat that a notification may not be emitted right when a job is scheduled. On queues without any ongoing insert activity, this could potentially increase the latency before a scheduled job gets picked up by a worker, which is a worthwhile tradeoff for the performance optimization we gain on high throughput queues. This can be counteracted if desired by lowering the client's polling interval.

This is used for both controlling jobs and controlling queues, so a more generalized name is appropriate.

Small bit of follow up from #301.

…mn to single value With the addition of #301, and more specifically the schema-based namespacing that it brings in around listen/notify, we're fully moving towards a world where the recommendation for running multiple Rivers in a single database is very definitive: isolate them by schema. The elector's had a long-standing parameter called "instance name" that's stored into the `river_leader` table, and which was prospectively going to be used for River namespacing, but something we never made use of it. In a world of schema isolation, each schema will have its own `river_leader` table, and no kind of additional namespacing is needed within the table. I was originally going to try and approach herein we drop `name` out of `river_leader` completely, but looking more closely, it's the table's primary key. We could add a new column that'd act as a primary key instead (e.g. imagine a boolean primary key column with a check constraint that makes sure it's always true), but nothing we'd add would be that much better. Instead, I elected to give `name` a default value of `default` (matching the previous default instance name), and add a check constraint verifying that it's always `default`, making it effectively a single row table. The nice part about this approach is that we can put these changes into the V4 migration in #301, and we won't require any additional changes in any future migration. With `name` now constrained to a single value, we can simplify all the `river_leader`-based queries by removing their name parameters, then remove instance name completely from elector code and drivers, giving us a thorough overall cleanup.

…mn to single value (#325) With the addition of #301, and more specifically the schema-based namespacing that it brings in around listen/notify, we're fully moving towards a world where the recommendation for running multiple Rivers in a single database is very definitive: isolate them by schema. The elector's had a long-standing parameter called "instance name" that's stored into the `river_leader` table, and which was prospectively going to be used for River namespacing, but something we never made use of it. In a world of schema isolation, each schema will have its own `river_leader` table, and no kind of additional namespacing is needed within the table. I was originally going to try and approach herein we drop `name` out of `river_leader` completely, but looking more closely, it's the table's primary key. We could add a new column that'd act as a primary key instead (e.g. imagine a boolean primary key column with a check constraint that makes sure it's always true), but nothing we'd add would be that much better. Instead, I elected to give `name` a default value of `default` (matching the previous default instance name), and add a check constraint verifying that it's always `default`, making it effectively a single row table. The nice part about this approach is that we can put these changes into the V4 migration in #301, and we won't require any additional changes in any future migration. With `name` now constrained to a single value, we can simplify all the `river_leader`-based queries by removing their name parameters, then remove instance name completely from elector code and drivers, giving us a thorough overall cleanup.

Tees up version `v0.5.0`, which mainly contains the changes in #301, but is notably because it also contains the first ever new database migration beyond the original line.

bgentry requested a review from brandur April 15, 2024 02:23

bgentry force-pushed the bg-app-level-notify-limiter branch from 053c288 to b69cc03 Compare April 15, 2024 02:24

bgentry mentioned this pull request Apr 15, 2024

Add rivertype.JobStates with full list of job states #297

Merged

bgentry force-pushed the bg-app-level-notify-limiter branch from b69cc03 to 16b2f05 Compare April 15, 2024 14:04

bgentry changed the base branch from master to brandur-job-states April 15, 2024 14:04

bgentry mentioned this pull request Apr 15, 2024

Web UI #127

Closed

brandur force-pushed the brandur-job-states branch from e2b098b to 59acac6 Compare April 16, 2024 00:16

Base automatically changed from brandur-job-states to master April 16, 2024 00:19

bgentry force-pushed the bg-app-level-notify-limiter branch 5 times, most recently from 55d08f6 to 5396448 Compare April 21, 2024 19:03

bgentry marked this pull request as ready for review April 21, 2024 19:06

brandur reviewed Apr 21, 2024

View reviewed changes

bgentry force-pushed the bg-app-level-notify-limiter branch 7 times, most recently from 179cf03 to 1da690b Compare April 25, 2024 23:42

brandur reviewed Apr 26, 2024

View reviewed changes

client.go Outdated Show resolved Hide resolved

bgentry force-pushed the bg-app-level-notify-limiter branch 2 times, most recently from bf44a3b to 93fa962 Compare April 28, 2024 20:41

bgentry added 2 commits April 28, 2024 15:50

extract testfactory.Job_Build for building job params

d4bb5dc

version 4 migration

047f975

* Make job args non-null * Make job metadata non-null * Drop insert notification trigger and function as this will be done at the application level going forward. * Add 'pending' state

bgentry force-pushed the bg-app-level-notify-limiter branch from 93fa962 to 6236fce Compare April 28, 2024 20:50

bgentry force-pushed the bg-app-level-notify-limiter branch 3 times, most recently from 15cde97 to 2e27de9 Compare April 28, 2024 21:38

bgentry requested a review from brandur April 28, 2024 21:39

bgentry force-pushed the bg-app-level-notify-limiter branch from 2e27de9 to dc110d4 Compare April 29, 2024 02:18

brandur approved these changes Apr 30, 2024

View reviewed changes

bgentry force-pushed the bg-app-level-notify-limiter branch from 8dcf9f4 to 1442de8 Compare April 30, 2024 02:26

bgentry added 8 commits April 29, 2024 21:27

comprehensive tests for finalized_at constraint

090c038

fix tests broken by strict db constraints

e6be89e

add a notifylimiter for debouncing app-level insert notifications

cd150d4

queue status tracking, pause + resume

678b759

fix other minor preexisting lint issues

31a49d6

changelog entry

cf0c205

rename "job control" topic to "control"

f5332c7

This is used for both controlling jobs and controlling queues, so a more generalized name is appropriate.

bgentry force-pushed the bg-app-level-notify-limiter branch from 1442de8 to f5332c7 Compare April 30, 2024 02:27

bgentry merged commit 167ab69 into master Apr 30, 2024
10 checks passed

bgentry deleted the bg-app-level-notify-limiter branch April 30, 2024 02:41

bgentry added a commit that referenced this pull request Apr 30, 2024

rename all instances of execTx to tx

b8f86b6

Small bit of follow up from #301.

bgentry mentioned this pull request Apr 30, 2024

rename all instances of execTx to tx #323

Merged

bgentry added a commit that referenced this pull request Apr 30, 2024

rename all instances of execTx to tx (#323)

5a1a3e0

Small bit of follow up from #301.

brandur mentioned this pull request May 2, 2024

Remove all in-code use of leadership "instance name" + constrain column to single value #325

Merged

brandur added a commit that referenced this pull request May 3, 2024

Prepare version v0.5.0

daf59d3

Tees up version `v0.5.0`, which mainly contains the changes in #301, but is notably because it also contains the first ever new database migration beyond the original line.

brandur mentioned this pull request May 3, 2024

Prepare version v0.5.0 #333

Merged

brandur added a commit that referenced this pull request May 3, 2024

Prepare version v0.5.0 (#333)

0ff658c

Tees up version `v0.5.0`, which mainly contains the changes in #301, but is notably because it also contains the first ever new database migration beyond the original line.

bgentry mentioned this pull request May 7, 2024

V4 Job Completion Duration Performance Hit #338

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v4 migration: App level notifications, queue status tracking, pause/resume #301

v4 migration: App level notifications, queue status tracking, pause/resume #301

bgentry commented Apr 15, 2024 •

edited

Loading

bgentry commented Apr 15, 2024

bgentry commented Apr 21, 2024

brandur left a comment

bgentry commented Apr 28, 2024

brandur left a comment

brandur Apr 30, 2024

brandur Apr 30, 2024

		@@ -7,6 +7,32 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

		## [Unreleased]

		⚠️ Version 0.5.0 contains a new database migration, version 4. This migration is backward compatible with any River installation running the v3 migration. Be sure to run the v4 migration prior to deploying the code from this release.

v4 migration: App level notifications, queue status tracking, pause/resume #301

v4 migration: App level notifications, queue status tracking, pause/resume #301

Conversation

bgentry commented Apr 15, 2024 • edited Loading

v4 migration summary

Application level insert notifications

Queue status tracking, pause and resume

TODOs

bgentry commented Apr 15, 2024

bgentry commented Apr 21, 2024

brandur left a comment

Choose a reason for hiding this comment

bgentry commented Apr 28, 2024

brandur left a comment

Choose a reason for hiding this comment

brandur Apr 30, 2024

Choose a reason for hiding this comment

brandur Apr 30, 2024

Choose a reason for hiding this comment

bgentry commented Apr 15, 2024 •

edited

Loading