Skip to content

Improve database fallbacks #4316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from
Closed

Conversation

Turbo87
Copy link
Member

@Turbo87 Turbo87 commented Dec 22, 2021

As suggested on Discord (see https://discord.com/channels/442252698964721669/835156566746595386/923210089085665381), this PR:

  • implements a primary database fallback if the replica is unavailable
  • renames the existing connection pool methods to db_read() and db_write()
  • implements an additional db_read_prefer_primary() method
  • uses that new method for endpoints that prefer the primary database, but don't need to write anything to the database

This should improve our resilience a little bit in case of database issues.

@Turbo87 Turbo87 added A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear labels Dec 22, 2021
@Turbo87 Turbo87 requested review from jtgeibel and pietroalbini and removed request for jtgeibel December 22, 2021 21:24
@Turbo87 Turbo87 changed the title Database pools Improve database fallbacks Dec 22, 2021
@Turbo87
Copy link
Member Author

Turbo87 commented Dec 23, 2021

comments should all be addressed. I've removed the warning for now. we can still add the metrics thing you mentioned later.

@pietroalbini
Copy link
Member

Could you add a metric as part of this PR? Otherwise we have no visibility when this happens.


fn db_read_prefer_primary(&self) -> Result<DieselPooledConn<'_>, PoolError> {
match (
self.app().primary_database.get(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, just noticed that this returns a read/write connection, not a read-only one. We should make sure connections returned by this method are read-only even when using the primary database, otherwise we might start relying on those being read/write accidentally, causing bugs during failover.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we could do is store a ReadOnlyConnection unit struct in the connection extensions, indicating whether that connection is currently read-only (unit struct in the extensions) or not. Then, whenever we get a connection from the database we check whether the presence of the struct is correct (present for a RO conn, missing for a RW conn), and otherwise issue the SQL query to switch the connection mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other notes that come through mind:

  • We need to make sure this doesn't get into an inconsistent state, so we should also have a LastSwitchFailed or something to indicate whether the previous attempt to switch between RO and RW failed, indicating it should be performed unconditionally in the next attempt.
  • This could replace the "read-only" section in the pool.
  • Do we need to cache whether the connection is RW or RO, or can we just send the query to switch the connection mode every time we checkout the connection from the pool? That could be way simpler and less failure prone, but I don't know what the performance impact of it would be off my head.

Copy link
Member Author

@Turbo87 Turbo87 Dec 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I generally like the idea of having a type safe way to distinguishing between read-only and writeable database connections, this request is increasing the complexity of this PR by multiple orders of magnitude. I guess ideally this sort of thing would be supported by Diesel itself, instead of us having to monkey patch something like this on top.

I'd appreciate it if we could merge this as is for now and then work on further improvements afterwards, instead of blocking this on achieving absolute perfection first. https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While in general I agree, for something this crucial to our reliability I'd like to get it right before we merge it.

src/db.rs Outdated
Some(Ok(connection)) => Ok(connection),

// Replica is not available, but primary might be available
Some(Err(PoolError::UnhealthyPool)) => self.app().primary_database.get(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also return a read-only connection.

@Turbo87 Turbo87 force-pushed the database-pools branch 4 times, most recently from 20b9c27 to 633187e Compare December 31, 2021 12:15
@pietroalbini
Copy link
Member

Thinking more about it, this should also have a chaosproxy test to make sure the fallback actually happens as we want.

@Turbo87 Turbo87 closed this Jan 2, 2022
@Turbo87
Copy link
Member Author

Turbo87 commented Jan 2, 2022

feel free to continue this PR, if you want. my goal was to incrementally improve the codebase, but all these additional requirements just to get a small improvement merged make it impossibly hard to contribute.

@Turbo87 Turbo87 deleted the database-pools branch October 17, 2022 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants