Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry known idempotent SELECT queries on connection-related exceptions #51336

Merged

Conversation

adrianna-chang-shopify
Copy link
Contributor

@adrianna-chang-shopify adrianna-chang-shopify commented Mar 15, 2024

Motivation / Background

Take 2 of #51166, but rather than assuming that any SQL coming from the #select methods is safe to retry, we retry only queries we have constructed and thus know to be idempotent.

Detail

This PR makes two types of queries retry-able by opting into our allow_retry flag:

  1. SELECT queries we construct by walking the Arel tree via #to_sql_and_binds. We use a new retryable attribute on collector classes, which defaults to true for most node types, but will be set to false for non-idempotent node types (functions, SQL literals, update / delete / insert statements, etc). The retryable value is returned from #to_sql_and_binds and used by #select_all and passed down the call stack, eventually reaching the adapter's #internal_exec_query method.

  2. #find and #find_by queries with known attributes. We set allow_retry: true in #cached_find_by, and pass this down to #find_by_sql and #_query_by_sql.

These changes ensure that queries we know are safe to retry can be retried automatically.

Additional information

To verify:

  • Is test coverage sufficient? It's tricky to test every possible Active Record API that could lead to a given SQL query. I've tried to cover the most commonly used ones. I was hoping to test that inserts / updates / deletes are non-retryable, but since these are wrapped in transactions, we end up reconnecting on the BEGIN.
  • Are there Arel node types I've missed that might be non-idempotent? Note that we're considering all NamedFunction and SqlLiteral nodes to be non-idempotent, which likely excludes a lot of potential queries that would be retry-able, but this is the safest option for now.
  • Is cached_find_by always guaranteed to produce an idempotent query? This is the only way to ensure Post.find / Post.find_by(title: "foo") etc. are retryable, and I definitely think we should make these types of queries retryable. find_by with a SQL fragment supplied as args will be flagged as non-retryable.

cc @matthewd

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

@adrianna-chang-shopify
Copy link
Contributor Author

Test failures seem unrelated.

Copy link
Member

@rafaelfranca rafaelfranca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach. I'm not sure if the by not retrying queries that have SQL literals we will have any practical effect with this PR.

Did you investigate how many queries Rails generates with SQLLiteral nodes that we know are retriable?

@matthewd
Copy link
Member

This looks fantastic, thank you!

It'd be nice to have some test coverage at the Arel level, where we could just focus on "does visiting this node make us un-retryable?", and consequently be exhaustive without all the work of actually constructing queries and observing real reconnect/retries [in addition to having some of that full-stack coverage, as you do now]... but given we don't seem to have so-shaped tests for prepareable, maybe that's tricky to set up. (Read: if you can see a straightforward way of doing so, IMO it would be a nice-to-have... but if not, that's fine too.)

I have read through to_sql.rb and don't see any other nodes that strike me as retry-relevant. I've also confirmed that none of the affected visit methods are overridden in the dialect-specific subclasses (cc @yahonda).

cached_find_by should indeed be safe: it's only called in the most "plain" circumstances -- e.g. it's not used if there's a scope active on the class.

I do hope that we can independently figure out anything that's making Trilogy connections more prone to errors that would benefit from retrying (#51335?) -- this improves our ability to retry when necessary, but that still feels more like an improved workaround than a proper solution to the problem you're describing in the changelog.

Scrolling through, I'm spotting a couple of places where allow_retry doesn't get passed straight along... for example, does the public select_all method need/want to expose it at all?

@matthewd
Copy link
Member

Did you investigate how many queries Rails generates with SQLLiteral nodes that we know are retriable?

I think there are a few places we internally use Arel.sql, where we'll want to follow up with either (ideally) a new purpose-specific Arel node, or otherwise may need to introduce an internal-only "trust me" variety of SqlLiteral.

@adrianna-chang-shopify
Copy link
Contributor Author

It'd be nice to have some test coverage at the Arel level,

Done!

I do hope that we can independently figure out anything that's making Trilogy connections more prone to errors that would benefit from retrying

Agree, this mitigates the issue slightly but is ultimately a bandaid solution. It seems like the root of the issue is that it's easy for Trilogy to end up with a closed connection -- when an error (e.g. timeout) is encountered, in the case of infra blips or deploys, etc. We've generally been seeing connection errors pop up in long running background jobs, where Rails' ability to reconnect between requests / units of work doesn't help us. Mysql2 had an auto reconnect feature (though this appears to be deprecated for MySQL 8?), which was possibly why this wasn't a problem for us before.

I think there are a few places we internally use Arel.sql, where we'll want to follow up with either (ideally) a new purpose-specific Arel node, or otherwise may need to introduce an internal-only "trust me" variety of SqlLiteral.

Agree, I think a lot of the internal SQL literals that are generated can be marked as "known idempotent", but this likely requires introducing a new Arel node or a safe SQL literal variant. Still, I think a good portion of queries will benefit from the changes in this PR, so this feels like a good first step. We might also want to make updates / deletes retryable eventually..

@adrianna-chang-shopify adrianna-chang-shopify force-pushed the ac-retry-idempotent-queries-2 branch 2 times, most recently from 7d9c15d to 98de07f Compare March 19, 2024 19:14
@adrianna-chang-shopify
Copy link
Contributor Author

Actually, let me whip up a safe-SqlLiteral variant as part of this PR. Should be straightforward.

@adrianna-chang-shopify
Copy link
Contributor Author

Okay, I've added f2a8bf9. If we're okay merging this as a first pass, we can expand what we consider to be safe in future PRs.

@rafaelfranca
Copy link
Member

Test are failing

Copy link
Member

@matthewd matthewd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates and deletes are not safely retryable, because if we don't know whether the first one occurred, we can't trust that e.g. we aren't deleting a row that someone else inserted in between.

@@ -93,7 +93,7 @@ def strict_loading?
def append_constraints(connection, join, constraints)
if join.is_a?(Arel::Nodes::StringJoin)
join_string = Arel::Nodes::And.new(constraints.unshift join.left)
join.left = Arel.sql(connection.visitor.compile(join_string))
join.left = Arel.sql(connection.visitor.compile(join_string), retryable: true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is join_string never app-supplied?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT no, these are only constructed internally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI when join.is_a?(Arel::Nodes::StringJoin), the join is a user supplied string.
e.g.

has_many :taggings_with_no_tag, -> { joins("LEFT OUTER JOIN tags ON tags.id = taggings.tag_id").where("tags.id": nil) }, as: :taggable, class_name: "Tagging"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks @kamipo , I was struggling to find which codepath would test this, and settled for trying to walk through the code to deduce whether this could come from the app. I'll cut a quick PR to fix :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -446,7 +446,7 @@ def aggregate_column(column_name)
return column_name if Arel::Expressions === column_name

arel_column(column_name.to_s) do |name|
Arel.sql(column_name == :all ? "*" : name)
Arel.sql(column_name == :all ? "*" : name, retryable: true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure name here is an arbitrary string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arf, yeah you're right, I thought we always resolved arel_column from the columns hash on the class, but there's an else branch where we just yield the field directly:

def arel_column(field)
field = klass.attribute_aliases[field] || field
from = from_clause.name || from_clause.value
if klass.columns_hash.key?(field) && (!from || table_name_matches?(from))
table[field]
elsif field.match?(/\A\w+\.\w+\z/)
table, column = field.split(".")
predicate_builder.resolve_arel_attribute(table, column) do
lookup_table_klass_from_join_dependencies(table)
end
else
yield field
end
end

activerecord/lib/active_record/relation/query_methods.rb Outdated Show resolved Hide resolved
activerecord/lib/arel/alias_predication.rb Show resolved Hide resolved
@@ -3,7 +3,7 @@
module Arel # :nodoc: all
module Collectors
class Composite
attr_accessor :preparable
attr_accessor :preparable, :retryable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this (and preparable) should propagate to the composed left and right collectors when assigned 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (retryable only, don't want to mix changes to preparable into this PR 😅 )

activerecord/lib/arel/select_manager.rb Outdated Show resolved Hide resolved
@adrianna-chang-shopify adrianna-chang-shopify force-pushed the ac-retry-idempotent-queries-2 branch 3 times, most recently from 83d129b to 7327f20 Compare March 20, 2024 18:45
@adrianna-chang-shopify
Copy link
Contributor Author

@matthewd thanks for flagging a couple codepaths where user-generated SQL could be passed through -- I've removed the retryable option from any such nodes. Let me know if you think this is missing anything else.

This commit makes two types of queries retry-able by opting into our `allow_retry` flag:
1) SELECT queries we construct by walking the Arel tree via `#to_sql_and_binds`. We use a
new `retryable` attribute on collector classes, which defaults to true for most node types,
but will be set to false for non-idempotent node types (functions, SQL literals, etc). The
`retryable` value is returned from  `#to_sql_and_binds` and used by `#select_all` and
passed down the call stack, eventually reaching the adapter's `#internal_exec_query` method.

Internally-generated SQL literals are marked as retryable via a new `retryable` attribute on
`Arel::Nodes::SqlLiteral`.

2) `#find` and `#find_by` queries with known attributes. We set `allow_retry: true` in `#cached_find_by`,
and pass this down to `#find_by_sql` and `#_query_by_sql`.

These changes ensure that queries we know are safe to retry can be retried automatically.
@matthewd matthewd merged commit 530d771 into rails:main Mar 26, 2024
4 checks passed
@matthewd
Copy link
Member

This is great! Thanks again for pushing through the much more complicated approach! ❤️

It'll be interesting to see what portion of real-world queries this makes "naturally retryable".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants