Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v4 migration: App level notifications, queue status tracking, pause/resume #301

Merged
merged 10 commits into from
Apr 30, 2024

Conversation

bgentry
Copy link
Contributor

@bgentry bgentry commented Apr 15, 2024

There's a lot in this PR, all of it somewhat related and dependent on the new v4 database migration.

v4 migration summary

  • Make river_job.args non-null (there should be no null values in here anyway)
  • Make river_job.metadata non-null (there should be no null values in here anyway)
  • Drop job insert trigger notifications so this can be done at the application level
  • Stricter finalized_at constraint
  • Add pending job state. Currently unused, but will be used to build higher level functionality by staging jobs that are not yet ready to run (for some reason other than their scheduled time being in the future).
  • Add a river_queue table for queue state tracking, pause/resume.

I believe all of these are totally safe to perform on an actively running River installation with no interruptions. We should carefully consider this though. The removal of insert triggers does increase new job execution latency until a subsequent deploy of updated River clients with app-level notification logic, but this is a minor and temporary degradation.

Application level insert notifications

The initial design for River utilized a trigger on job insert that issued notifications (NOTIFY) so that listening clients could quickly pick up the work if they were idle. While this is good for lowering latency, it does have the side effect of emitting a large amount of notifications any time there are lots of jobs being inserted. This adds overhead, particularly to high-throughput installations.

To improve this situation and reduce overhead in high-throughput installations, the notifications have been refactored to be emitted at the application level. A client-level debouncer ensures that these notifications are not emitted more often than they could be useful. If a queue is due for an insert notification (on a particular Postgres schema), the notification is piggy-backed onto the insert query within the transaction. While this has the impact of increasing insert latency for a certain percentage of cases, the effect should be small.

Additionally, the initial release of River did not properly scope notification topics within the global LISTEN/NOTIFY namespace. If two River installations were operating on the same Postgres database but within different schemas (search paths), their notifications would be emitted on a shared topic name. This is no longer the case and all notifications are prefixed with a {schema_name}. string.

Queue status tracking, pause and resume

A useful operational lever is the ability to pause and resume a queue without shutting down clients. In addition to pause/resume being a feature request from #54, as part of my work on River's UI I've encountered the need to list out the active queues so that they can be displayed and manipulated.

A new river_queue table is introduced in the migration for this purpose. Upon startup, every producer in each River Client will make an UPSERT query to the database to either register the queue as being active, or if it already exists it will instead bump the timestamp to keep it active. This query will be run periodically as long as the Client is alive, even if the queue is paused. A separate query will delete/purge any queues which have not been active in awhile (currently fixed to 24 hours). This currently happens independently in each producer, but it really doesn't need to (see TODO below).

This table provides a place to pause and resume a single queue by name, or all queues using the special * value. Each producer will watch for notifications on the relevant LISTEN/NOTIFY topic unless operating in poll-only mode, in which case they will periodically poll for changes to this record in the database.

TODOs

  • Make queue purging a standalone maintenance process not connected to the producers.
  • Changelog entry
  • update rivertype.JobStates to include pending state

Closes #54.

@bgentry
Copy link
Contributor Author

bgentry commented Apr 15, 2024

Nice, I made some race conditions 😆 Well actually they appear to be just in the tests as a result of the SharedTx thing we were just discussing, not necessarily a problem with the actual code.

@bgentry bgentry force-pushed the bg-app-level-notify-limiter branch from b69cc03 to 16b2f05 Compare April 15, 2024 14:04
@bgentry bgentry changed the base branch from master to brandur-job-states April 15, 2024 14:04
@bgentry bgentry mentioned this pull request Apr 15, 2024
Base automatically changed from brandur-job-states to master April 16, 2024 00:19
@bgentry bgentry force-pushed the bg-app-level-notify-limiter branch 5 times, most recently from 55d08f6 to 5396448 Compare April 21, 2024 19:03
@bgentry bgentry marked this pull request as ready for review April 21, 2024 19:06
@bgentry
Copy link
Contributor Author

bgentry commented Apr 21, 2024

@brandur alright, I think your fix in #311 did the trick on all the SharedTx issues, so this is good for review.

Some notes: we did get a single flaky failure on this run on the Client start/stop stress test. And there are some other issues to work through in this latest run.

I have to break for food but will pick back up on those shortly. Feel free to review meanwhile if you're available.

Copy link
Contributor

@brandur brandur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!!! Lots of useful new stuff in here. River's going to be looking pretty feature complete with this in ...

rivertype/river_type.go Show resolved Hide resolved
client.go Outdated Show resolved Hide resolved
insert_opts.go Show resolved Hide resolved
client.go Outdated Show resolved Hide resolved
client.go Show resolved Hide resolved
queue_pause_opts.go Outdated Show resolved Hide resolved
riverdriver/riverpgxv5/internal/dbsqlc/river_job.sql Outdated Show resolved Hide resolved
internal/maintenance/queue_cleaner.go Show resolved Hide resolved
@bgentry bgentry force-pushed the bg-app-level-notify-limiter branch 7 times, most recently from 179cf03 to 1da690b Compare April 25, 2024 23:42
client.go Outdated Show resolved Hide resolved
@bgentry bgentry force-pushed the bg-app-level-notify-limiter branch 2 times, most recently from bf44a3b to 93fa962 Compare April 28, 2024 20:41
* Make job args non-null
* Make job metadata non-null
* Drop insert notification trigger and function as this will be done at
  the application level going forward.
* Add 'pending' state
@bgentry bgentry force-pushed the bg-app-level-notify-limiter branch from 93fa962 to 6236fce Compare April 28, 2024 20:50
@bgentry bgentry force-pushed the bg-app-level-notify-limiter branch 3 times, most recently from 15cde97 to 2e27de9 Compare April 28, 2024 21:38
@bgentry bgentry requested a review from brandur April 28, 2024 21:39
@bgentry
Copy link
Contributor Author

bgentry commented Apr 28, 2024

There are just a few items left here, mainly the name of the pubsub topic(s) we'll use for job and queue controls. Please lmk your thoughts on all the unresolved items! 🙏

@bgentry bgentry force-pushed the bg-app-level-notify-limiter branch from 2e27de9 to dc110d4 Compare April 29, 2024 02:18
Copy link
Contributor

@brandur brandur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are just a few items left here, mainly the name of the pubsub topic(s) we'll use for job and queue controls. Please lmk your thoughts on all the unresolved items! 🙏

Thanks for getting through all that! I added a couple more tiny comments, but looks solid.

Changelog looks great.

Let's get this shipped.

CHANGELOG.md Outdated
- Add `pending` job state. This is currently unused, but will be used to build higher level functionality for staging jobs that are not yet ready to run (for some reason other than their scheduled time being in the future). Pending jobs will never be run or deleted and must first be moved to another state by external code. [PR #301](https://github.com/riverqueue/river/pull/301).
- Queue status tracking, pause and resume. [PR #301](https://github.com/riverqueue/river/pull/301).

A useful operational lever is the ability to pause and resume a queue without shutting down clients. In addition to pause/resume being a feature request from #54, as part of the work on River's UI it's been useful to list out the active queues so that they can be displayed and manipulated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the super nit, but mind linking up 54 since it won't be in the rendered changelog.

Suggested change
A useful operational lever is the ability to pause and resume a queue without shutting down clients. In addition to pause/resume being a feature request from #54, as part of the work on River's UI it's been useful to list out the active queues so that they can be displayed and manipulated.
A useful operational lever is the ability to pause and resume a queue without shutting down clients. In addition to pause/resume being a feature request from [#54](https://github.com/riverqueue/river/pull/54), as part of the work on River's UI it's been useful to list out the active queues so that they can be displayed and manipulated.

@@ -7,6 +7,32 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

⚠️ Version 0.5.0 contains a new database migration, version 4. This migration is backward compatible with any River installation running the v3 migration. Be sure to run the v4 migration prior to deploying the code from this release.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to put together a little upgrade guide on the website because people's familiarity with the last time they migrated may have become long atrophied at this point (i.e. how to use the CLI, the fact that the CLI will need to be updated separately to know about the new migration, etc.).

@bgentry bgentry force-pushed the bg-app-level-notify-limiter branch from 8dcf9f4 to 1442de8 Compare April 30, 2024 02:26
Instead of emitting notifications for all inserts with a database
trigger, use application-level notification logic with a per-queue
debouncer. This enables us to drastically reduce the amount of
notifications emitted to 1 per inserting client per cooldown interval
per queue.

The `JobScheduler` has been reworked to utilize this same mechanism. In
addition, the scheduler has been updated to use a "look ahead" concept,
meaning any jobs that are scheduled before its next planned run should
be preemptively marked as available (while preserving their
`ScheduledAt` time). This makes jobs available for scheduled work sooner
and without delay, with the caveat that a notification may not be
emitted right when a job is scheduled. On queues without any ongoing
insert activity, this could potentially increase the latency before a
scheduled job gets picked up by a worker, which is a worthwhile tradeoff
for the performance optimization we gain on high throughput queues. This
can be counteracted if desired by lowering the client's polling
interval.
This is used for both controlling jobs and controlling queues, so a more
generalized name is appropriate.
@bgentry bgentry force-pushed the bg-app-level-notify-limiter branch from 1442de8 to f5332c7 Compare April 30, 2024 02:27
@bgentry bgentry merged commit 167ab69 into master Apr 30, 2024
10 checks passed
@bgentry bgentry deleted the bg-app-level-notify-limiter branch April 30, 2024 02:41
bgentry added a commit that referenced this pull request Apr 30, 2024
Small bit of follow up from #301.
bgentry added a commit that referenced this pull request Apr 30, 2024
brandur added a commit that referenced this pull request May 2, 2024
…mn to single value

With the addition of #301, and more specifically the schema-based
namespacing that it brings in around listen/notify, we're fully moving
towards a world where the recommendation for running multiple Rivers in
a single database is very definitive: isolate them by schema.

The elector's had a long-standing parameter called "instance name"
that's stored into the `river_leader` table, and which was prospectively
going to be used for River namespacing, but something we never made use
of it. In a world of schema isolation, each schema will have its own
`river_leader` table, and no kind of additional namespacing is needed
within the table.

I was originally going to try and approach herein we drop `name` out of
`river_leader` completely, but looking more closely, it's the table's
primary key. We could add a new column that'd act as a primary key
instead (e.g. imagine a boolean primary key column with a check
constraint that makes sure it's always true), but nothing we'd add would
be that much better. Instead, I elected to give `name` a default value
of `default` (matching the previous default instance name), and add a
check constraint verifying that it's always `default`, making it
effectively a single row table. The nice part about this approach is
that we can put these changes into the V4 migration in #301, and we
won't require any additional changes in any future migration.

With `name` now constrained to a single value, we can simplify all the
`river_leader`-based queries by removing their name parameters, then
remove instance name completely from elector code and drivers, giving us
a thorough overall cleanup.
brandur added a commit that referenced this pull request May 2, 2024
…mn to single value

With the addition of #301, and more specifically the schema-based
namespacing that it brings in around listen/notify, we're fully moving
towards a world where the recommendation for running multiple Rivers in
a single database is very definitive: isolate them by schema.

The elector's had a long-standing parameter called "instance name"
that's stored into the `river_leader` table, and which was prospectively
going to be used for River namespacing, but something we never made use
of it. In a world of schema isolation, each schema will have its own
`river_leader` table, and no kind of additional namespacing is needed
within the table.

I was originally going to try and approach herein we drop `name` out of
`river_leader` completely, but looking more closely, it's the table's
primary key. We could add a new column that'd act as a primary key
instead (e.g. imagine a boolean primary key column with a check
constraint that makes sure it's always true), but nothing we'd add would
be that much better. Instead, I elected to give `name` a default value
of `default` (matching the previous default instance name), and add a
check constraint verifying that it's always `default`, making it
effectively a single row table. The nice part about this approach is
that we can put these changes into the V4 migration in #301, and we
won't require any additional changes in any future migration.

With `name` now constrained to a single value, we can simplify all the
`river_leader`-based queries by removing their name parameters, then
remove instance name completely from elector code and drivers, giving us
a thorough overall cleanup.
brandur added a commit that referenced this pull request May 2, 2024
…mn to single value (#325)

With the addition of #301, and more specifically the schema-based
namespacing that it brings in around listen/notify, we're fully moving
towards a world where the recommendation for running multiple Rivers in
a single database is very definitive: isolate them by schema.

The elector's had a long-standing parameter called "instance name"
that's stored into the `river_leader` table, and which was prospectively
going to be used for River namespacing, but something we never made use
of it. In a world of schema isolation, each schema will have its own
`river_leader` table, and no kind of additional namespacing is needed
within the table.

I was originally going to try and approach herein we drop `name` out of
`river_leader` completely, but looking more closely, it's the table's
primary key. We could add a new column that'd act as a primary key
instead (e.g. imagine a boolean primary key column with a check
constraint that makes sure it's always true), but nothing we'd add would
be that much better. Instead, I elected to give `name` a default value
of `default` (matching the previous default instance name), and add a
check constraint verifying that it's always `default`, making it
effectively a single row table. The nice part about this approach is
that we can put these changes into the V4 migration in #301, and we
won't require any additional changes in any future migration.

With `name` now constrained to a single value, we can simplify all the
`river_leader`-based queries by removing their name parameters, then
remove instance name completely from elector code and drivers, giving us
a thorough overall cleanup.
brandur added a commit that referenced this pull request May 3, 2024
Tees up version `v0.5.0`, which mainly contains the changes in #301, but
is notably because it also contains the first ever new database
migration beyond the original line.
@brandur brandur mentioned this pull request May 3, 2024
brandur added a commit that referenced this pull request May 3, 2024
Tees up version `v0.5.0`, which mainly contains the changes in #301, but
is notably because it also contains the first ever new database
migration beyond the original line.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants