Retry known idempotent SELECT queries on connection-related exceptions #51336

adrianna-chang-shopify · 2024-03-15T21:13:37Z

Motivation / Background

Take 2 of #51166, but rather than assuming that any SQL coming from the #select methods is safe to retry, we retry only queries we have constructed and thus know to be idempotent.

Detail

This PR makes two types of queries retry-able by opting into our allow_retry flag:

SELECT queries we construct by walking the Arel tree via #to_sql_and_binds. We use a new retryable attribute on collector classes, which defaults to true for most node types, but will be set to false for non-idempotent node types (functions, SQL literals, update / delete / insert statements, etc). The retryable value is returned from #to_sql_and_binds and used by #select_all and passed down the call stack, eventually reaching the adapter's #internal_exec_query method.
#find and #find_by queries with known attributes. We set allow_retry: true in #cached_find_by, and pass this down to #find_by_sql and #_query_by_sql.

These changes ensure that queries we know are safe to retry can be retried automatically.

Additional information

To verify:

Is test coverage sufficient? It's tricky to test every possible Active Record API that could lead to a given SQL query. I've tried to cover the most commonly used ones. I was hoping to test that inserts / updates / deletes are non-retryable, but since these are wrapped in transactions, we end up reconnecting on the BEGIN.
Are there Arel node types I've missed that might be non-idempotent? Note that we're considering all NamedFunction and SqlLiteral nodes to be non-idempotent, which likely excludes a lot of potential queries that would be retry-able, but this is the safest option for now.
Is cached_find_by always guaranteed to produce an idempotent query? This is the only way to ensure Post.find / Post.find_by(title: "foo") etc. are retryable, and I definitely think we should make these types of queries retryable. find_by with a SQL fragment supplied as args will be flagged as non-retryable.

cc @matthewd

Checklist

Before submitting the PR make sure the following are checked:

This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
Tests are added or updated if you fix a bug or add a feature.
CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

adrianna-chang-shopify · 2024-03-18T16:20:55Z

Test failures seem unrelated.

rafaelfranca

I like this approach. I'm not sure if the by not retrying queries that have SQL literals we will have any practical effect with this PR.

Did you investigate how many queries Rails generates with SQLLiteral nodes that we know are retriable?

matthewd · 2024-03-19T05:18:04Z

This looks fantastic, thank you!

It'd be nice to have some test coverage at the Arel level, where we could just focus on "does visiting this node make us un-retryable?", and consequently be exhaustive without all the work of actually constructing queries and observing real reconnect/retries [in addition to having some of that full-stack coverage, as you do now]... but given we don't seem to have so-shaped tests for prepareable, maybe that's tricky to set up. (Read: if you can see a straightforward way of doing so, IMO it would be a nice-to-have... but if not, that's fine too.)

I have read through to_sql.rb and don't see any other nodes that strike me as retry-relevant. I've also confirmed that none of the affected visit methods are overridden in the dialect-specific subclasses (cc @yahonda).

cached_find_by should indeed be safe: it's only called in the most "plain" circumstances -- e.g. it's not used if there's a scope active on the class.

I do hope that we can independently figure out anything that's making Trilogy connections more prone to errors that would benefit from retrying (#51335?) -- this improves our ability to retry when necessary, but that still feels more like an improved workaround than a proper solution to the problem you're describing in the changelog.

Scrolling through, I'm spotting a couple of places where allow_retry doesn't get passed straight along... for example, does the public select_all method need/want to expose it at all?

matthewd · 2024-03-19T05:36:32Z

Did you investigate how many queries Rails generates with SQLLiteral nodes that we know are retriable?

I think there are a few places we internally use Arel.sql, where we'll want to follow up with either (ideally) a new purpose-specific Arel node, or otherwise may need to introduce an internal-only "trust me" variety of SqlLiteral.

adrianna-chang-shopify · 2024-03-19T18:58:45Z

It'd be nice to have some test coverage at the Arel level,

Done!

I do hope that we can independently figure out anything that's making Trilogy connections more prone to errors that would benefit from retrying

Agree, this mitigates the issue slightly but is ultimately a bandaid solution. It seems like the root of the issue is that it's easy for Trilogy to end up with a closed connection -- when an error (e.g. timeout) is encountered, in the case of infra blips or deploys, etc. We've generally been seeing connection errors pop up in long running background jobs, where Rails' ability to reconnect between requests / units of work doesn't help us. Mysql2 had an auto reconnect feature (though this appears to be deprecated for MySQL 8?), which was possibly why this wasn't a problem for us before.

I think there are a few places we internally use Arel.sql, where we'll want to follow up with either (ideally) a new purpose-specific Arel node, or otherwise may need to introduce an internal-only "trust me" variety of SqlLiteral.

Agree, I think a lot of the internal SQL literals that are generated can be marked as "known idempotent", but this likely requires introducing a new Arel node or a safe SQL literal variant. Still, I think a good portion of queries will benefit from the changes in this PR, so this feels like a good first step. We might also want to make updates / deletes retryable eventually..

adrianna-chang-shopify · 2024-03-19T19:24:24Z

Actually, let me whip up a safe-SqlLiteral variant as part of this PR. Should be straightforward.

adrianna-chang-shopify · 2024-03-19T21:34:43Z

Okay, I've added f2a8bf9. If we're okay merging this as a first pass, we can expand what we consider to be safe in future PRs.

rafaelfranca · 2024-03-19T23:53:50Z

Test are failing

matthewd

Updates and deletes are not safely retryable, because if we don't know whether the first one occurred, we can't trust that e.g. we aren't deleting a row that someone else inserted in between.

matthewd · 2024-03-20T07:44:10Z

activerecord/lib/active_record/associations/join_dependency/join_association.rb

@@ -93,7 +93,7 @@ def strict_loading?
          def append_constraints(connection, join, constraints)
            if join.is_a?(Arel::Nodes::StringJoin)
              join_string = Arel::Nodes::And.new(constraints.unshift join.left)
-              join.left = Arel.sql(connection.visitor.compile(join_string))
+              join.left = Arel.sql(connection.visitor.compile(join_string), retryable: true)


Is join_string never app-supplied?

AFAICT no, these are only constructed internally.

FYI when join.is_a?(Arel::Nodes::StringJoin), the join is a user supplied string.
e.g.

rails/activerecord/test/models/rating.rb

Line 7 in 284baa1

has_many :taggings_with_no_tag, -> { joins("LEFT OUTER JOIN tags ON tags.id = taggings.tag_id").where("tags.id": nil) }, as: :taggable, class_name: "Tagging"

Ah, thanks @kamipo , I was struggling to find which codepath would test this, and settled for trying to walk through the code to deduce whether this could come from the app. I'll cut a quick PR to fix :)

matthewd · 2024-03-20T07:46:13Z

activerecord/lib/active_record/relation/calculations.rb

@@ -446,7 +446,7 @@ def aggregate_column(column_name)
        return column_name if Arel::Expressions === column_name

        arel_column(column_name.to_s) do |name|
-          Arel.sql(column_name == :all ? "*" : name)
+          Arel.sql(column_name == :all ? "*" : name, retryable: true)


I'm pretty sure name here is an arbitrary string

Arf, yeah you're right, I thought we always resolved arel_column from the columns hash on the class, but there's an else branch where we just yield the field directly:

rails/activerecord/lib/active_record/relation/query_methods.rb

Lines 1863 to 1877 in 7c68c52

def arel_column(field)

field = klass.attribute_aliases[field] || field

from = from_clause.name || from_clause.value

if klass.columns_hash.key?(field) && (!from || table_name_matches?(from))

table[field]

elsif field.match?(/\A\w+\.\w+\z/)

table, column = field.split(".")

predicate_builder.resolve_arel_attribute(table, column) do

lookup_table_klass_from_join_dependencies(table)

end

else

yield field

end

end

activerecord/lib/active_record/relation/predicate_builder.rb

activerecord/lib/active_record/relation/query_methods.rb

activerecord/lib/arel/alias_predication.rb

matthewd · 2024-03-20T07:53:05Z

activerecord/lib/arel/collectors/composite.rb

@@ -3,7 +3,7 @@
 module Arel # :nodoc: all
  module Collectors
    class Composite
-      attr_accessor :preparable
+      attr_accessor :preparable, :retryable


Technically this (and preparable) should propagate to the composed left and right collectors when assigned 🤔

Done (retryable only, don't want to mix changes to preparable into this PR 😅 )

activerecord/lib/arel/select_manager.rb

adrianna-chang-shopify · 2024-03-22T13:51:59Z

@matthewd thanks for flagging a couple codepaths where user-generated SQL could be passed through -- I've removed the retryable option from any such nodes. Let me know if you think this is missing anything else.

This commit makes two types of queries retry-able by opting into our `allow_retry` flag: 1) SELECT queries we construct by walking the Arel tree via `#to_sql_and_binds`. We use a new `retryable` attribute on collector classes, which defaults to true for most node types, but will be set to false for non-idempotent node types (functions, SQL literals, etc). The `retryable` value is returned from `#to_sql_and_binds` and used by `#select_all` and passed down the call stack, eventually reaching the adapter's `#internal_exec_query` method. Internally-generated SQL literals are marked as retryable via a new `retryable` attribute on `Arel::Nodes::SqlLiteral`. 2) `#find` and `#find_by` queries with known attributes. We set `allow_retry: true` in `#cached_find_by`, and pass this down to `#find_by_sql` and `#_query_by_sql`. These changes ensure that queries we know are safe to retry can be retried automatically.

matthewd · 2024-03-26T15:21:45Z

This is great! Thanks again for pushing through the much more complicated approach! ❤️

It'll be interesting to see what portion of real-world queries this makes "naturally retryable".

See rails/rails#51336

rails-bot bot added the activerecord label Mar 15, 2024

adrianna-chang-shopify mentioned this pull request Mar 15, 2024

Retry SELECT queries on connection-related exceptions #51166

Closed

4 tasks

adrianna-chang-shopify force-pushed the ac-retry-idempotent-queries-2 branch 2 times, most recently from 39a2400 to 4323c84 Compare March 15, 2024 21:33

rafaelfranca approved these changes Mar 18, 2024

View reviewed changes

adrianna-chang-shopify force-pushed the ac-retry-idempotent-queries-2 branch from 4323c84 to c5164f6 Compare March 19, 2024 18:27

adrianna-chang-shopify force-pushed the ac-retry-idempotent-queries-2 branch 2 times, most recently from 7d9c15d to 98de07f Compare March 19, 2024 19:14

adrianna-chang-shopify force-pushed the ac-retry-idempotent-queries-2 branch from 56915c3 to f2a8bf9 Compare March 19, 2024 21:34

matthewd reviewed Mar 20, 2024

View reviewed changes

adrianna-chang-shopify force-pushed the ac-retry-idempotent-queries-2 branch 3 times, most recently from 83d129b to 7327f20 Compare March 20, 2024 18:45

adrianna-chang-shopify force-pushed the ac-retry-idempotent-queries-2 branch from 7327f20 to eabcff2 Compare March 26, 2024 13:26

matthewd merged commit 530d771 into rails:main Mar 26, 2024
4 checks passed

adrianna-chang-shopify deleted the ac-retry-idempotent-queries-2 branch March 26, 2024 15:28

allan-pires mentioned this pull request Apr 1, 2024

fix: allow ActiveRecord::Querying find_by_sql to accept any arguments open-telemetry/opentelemetry-ruby-contrib#918

Closed

aidanharan added a commit to rails-sqlserver/activerecord-sqlserver-adapter that referenced this pull request Apr 18, 2024

Added allow_retry argument

58893ba

See rails/rails#51336

aidanharan mentioned this pull request Apr 18, 2024

Eliminate uses of lease_connection rails-sqlserver/activerecord-sqlserver-adapter#1157

Merged

adrianna-chang-shopify mentioned this pull request Apr 23, 2024

Join association built from string join should not retry #51644

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry known idempotent SELECT queries on connection-related exceptions #51336

Retry known idempotent SELECT queries on connection-related exceptions #51336

adrianna-chang-shopify commented Mar 15, 2024 •

edited

adrianna-chang-shopify commented Mar 18, 2024

rafaelfranca left a comment

matthewd commented Mar 19, 2024

matthewd commented Mar 19, 2024

adrianna-chang-shopify commented Mar 19, 2024

adrianna-chang-shopify commented Mar 19, 2024

adrianna-chang-shopify commented Mar 19, 2024

rafaelfranca commented Mar 19, 2024

matthewd left a comment

matthewd Mar 20, 2024

adrianna-chang-shopify Mar 20, 2024

kamipo Apr 21, 2024

adrianna-chang-shopify Apr 23, 2024

adrianna-chang-shopify Apr 23, 2024

matthewd Mar 20, 2024

adrianna-chang-shopify Mar 20, 2024

matthewd Mar 20, 2024

adrianna-chang-shopify Mar 20, 2024

adrianna-chang-shopify commented Mar 22, 2024

matthewd commented Mar 26, 2024

	def arel_column(field)
	field = klass.attribute_aliases[field] \|\| field
	from = from_clause.name \|\| from_clause.value

	if klass.columns_hash.key?(field) && (!from \|\| table_name_matches?(from))
	table[field]
	elsif field.match?(/\A\w+\.\w+\z/)
	table, column = field.split(".")
	predicate_builder.resolve_arel_attribute(table, column) do
	lookup_table_klass_from_join_dependencies(table)
	end
	else
	yield field
	end
	end

Retry known idempotent SELECT queries on connection-related exceptions #51336

Retry known idempotent SELECT queries on connection-related exceptions #51336

Conversation

adrianna-chang-shopify commented Mar 15, 2024 • edited

Motivation / Background

Detail

Additional information

Checklist

adrianna-chang-shopify commented Mar 18, 2024

rafaelfranca left a comment

Choose a reason for hiding this comment

matthewd commented Mar 19, 2024

matthewd commented Mar 19, 2024

adrianna-chang-shopify commented Mar 19, 2024

adrianna-chang-shopify commented Mar 19, 2024

adrianna-chang-shopify commented Mar 19, 2024

rafaelfranca commented Mar 19, 2024

matthewd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adrianna-chang-shopify commented Mar 22, 2024

matthewd commented Mar 26, 2024

adrianna-chang-shopify commented Mar 15, 2024 •

edited