Postgres: use ANY instead of IN for array inclusion queries #49388
There are failing tests that call e.g.

There is also a test for a binary column that's somehow generating this SQL:

SELECT "binaries".* FROM "binaries" WHERE "binaries"."data" = any('{"{:value=>\"\\xF0\\x9F\\xA5\\xA6\", :format=>1}","{:value=>\"\\xF0\\x9F\\x8D\\xA6\", :format=>1}"}')

There are also tests that fail because this query apparently isn't being cached. Thoughts on how best to resolve these?

One option for the out-of-range errors is to skip values that fail to serialize:

diff --git a/activerecord/lib/active_record/connection_adapters/postgresql/oid/array.rb b/activerecord/lib/active_record/connection_adapters/postgresql/oid/array.rb
index e46e47102b..eab716d010 100644
--- a/activerecord/lib/active_record/connection_adapters/postgresql/oid/array.rb
+++ b/activerecord/lib/active_record/connection_adapters/postgresql/oid/array.rb
@@ -9,12 +9,13 @@ class Array < Type::Value # :nodoc:
Data = Struct.new(:encoder, :values) # :nodoc:
- attr_reader :subtype, :delimiter
+ attr_reader :subtype, :delimiter, :ignore_serialize_errors
delegate :type, :user_input_in_time_zone, :limit, :precision, :scale, to: :subtype
- def initialize(subtype, delimiter = ",")
+ def initialize(subtype, delimiter = ",", ignore_serialize_errors = false)
@subtype = subtype
@delimiter = delimiter
+ @ignore_serialize_errors = ignore_serialize_errors
@pg_encoder = PG::TextEncoder::Array.new name: "#{type}[]", delimiter: delimiter
@pg_decoder = PG::TextDecoder::Array.new name: "#{type}[]", delimiter: delimiter
@@ -79,7 +80,13 @@ def force_equality?(value)
private
def type_cast_array(value, method)
if value.is_a?(::Array)
- value.map { |item| type_cast_array(item, method) }
+ result = []
+ value.each do |item|
+ result << type_cast_array(item, method)
+ rescue
+ raise unless ignore_serialize_errors
+ end
+ result
else
@subtype.public_send(method, value)
end
diff --git a/activerecord/lib/arel/visitors/postgresql.rb b/activerecord/lib/arel/visitors/postgresql.rb
index e2699ebe38..20cd198c84 100644
--- a/activerecord/lib/arel/visitors/postgresql.rb
+++ b/activerecord/lib/arel/visitors/postgresql.rb
@@ -89,7 +89,7 @@ def visit_Arel_Nodes_HomogeneousIn(o, collector)
collector << " != all("
end
- type_caster = ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Array.new(o.attribute.type_caster, ",")
+ type_caster = ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Array.new(o.attribute.type_caster, ",", true)
values = [type_caster.serialize(o.values)]
proc_for_binds = -> value { ActiveModel::Attribute.with_cast_value(o.attribute.name, value, type_caster) }
collector.add_binds(values, proc_for_binds, &bind_block) |
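The idea in the diff above can be tried in isolation. The following is a minimal sketch, not the actual Rails classes: `Int4Subtype` and `ArrayCaster` are hypothetical stand-ins for the column subtype and for `OID::Array`, and the 4-byte integer range is an assumption chosen only to trigger an out-of-range error.

```ruby
# Hypothetical stand-in for a column subtype that rejects out-of-range values.
class Int4Subtype
  RANGE = (-2**31...2**31)

  def serialize(value)
    raise RangeError, "#{value} is out of range for int4" unless RANGE.cover?(value)
    value
  end
end

# Simplified, illustrative version of the array caster from the diff.
# When ignore_serialize_errors is true, elements that fail to serialize
# are dropped instead of aborting the whole cast.
class ArrayCaster
  def initialize(subtype, ignore_serialize_errors = false)
    @subtype = subtype
    @ignore_serialize_errors = ignore_serialize_errors
  end

  def serialize(value)
    type_cast_array(value, :serialize)
  end

  private

  def type_cast_array(value, method)
    if value.is_a?(::Array)
      value.each_with_object([]) do |item, result|
        result << type_cast_array(item, method)
      rescue StandardError
        # Re-raise unless the caller opted in to skipping bad elements.
        raise unless @ignore_serialize_errors
      end
    else
      @subtype.public_send(method, value)
    end
  end
end

lenient = ArrayCaster.new(Int4Subtype.new, true)
p lenient.serialize([1, 2**40, 3])
```

For an inclusion query, a value that cannot be represented in the column's type cannot match any row, which is presumably why dropping it is acceptable in this code path; the strict default keeps the existing behavior everywhere else.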
This is the generated SQL in |
This might be related? #49050 |
Apart from |
https://pganalyze.com/blog/5mins-postgres-performance-in-lists-vs-any-operator-bind-parameters https://www.postgresql.org/docs/current/functions-comparisons.html |
We have a pretty large Postgres codebase (and love pganalyze!). I just ran our CI against this branch, and everything passed 👍
I was searching PRs labeled PostgreSQL and noticed that tag wasn't on this PR, and isn't getting much use in general. Can that tag be added to this PR? Can anyone do that? Are there more open PRs that can be tagged with PostgreSQL? Thanks. |
@seanlinsley At work we have a fairly large Rails 6.1 codebase (~100K LOC) and use PostgreSQL, PGSS, PgHero, and I advocate for more usage of it. Commonly we have these "giant IN clause" queries that are problematic because they're using hundreds or even thousands of values. If this PR improves the visibility on that and helps us fix those queries, that would be valuable to the platform health and performance. Typically these queries time out/get cancelled. I'm interested in this patch! Would there be any plans to back-port this functionality to Rails 6.1 after it's merged and stable? Thanks. |
Hi @andyatkinson, yes this change will collapse the hundreds / thousands of separate PGSS entries into a single one:

select pg_stat_statements_reset();
select count(*) from users where id in (1);
select count(*) from users where id in (1,2);
select count(*) from users where id in (1,2,3);
select count(*) from users where id = any('{1}');
select count(*) from users where id = any('{1,2}');
select count(*) from users where id = any('{1,2,3}');

select calls, query from pg_stat_statements order by query;
 calls | query
-------+---------------------------------------------------
     3 | select count(*) from users where id = any($1)
     1 | select count(*) from users where id in ($1)
     1 | select count(*) from users where id in ($1,$2)
     1 | select count(*) from users where id in ($1,$2,$3)
     1 | select pg_stat_statements_reset()

We're currently testing this on production using a 6.1 backport: https://github.com/pganalyze/rails/commits/any-backport. Feel free to use it, though I'd recommend keeping it on your own fork so you won't be impacted if that branch goes away in the future.
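The grouping effect shown above can be imitated without a database. This is a rough sketch only: `normalize` is a hypothetical approximation that replaces each numeric or quoted literal with a numbered placeholder, which is far cruder than what pg_stat_statements actually does, and `in_query`, `any_query`, and `statement_entries` are illustrative helper names.

```ruby
# Hypothetical approximation of query normalization: every numeric or
# single-quoted literal becomes a numbered placeholder ($1, $2, ...).
def normalize(sql)
  index = 0
  sql.gsub(/'[^']*'|\b\d+\b/) { "$#{index += 1}" }
end

# Build the two query shapes from the example above.
def in_query(ids)
  "select count(*) from users where id in (#{ids.join(',')})"
end

def any_query(ids)
  "select count(*) from users where id = any('{#{ids.join(',')}}')"
end

# Count how many calls land on each normalized query text.
def statement_entries(queries)
  queries.map { |query| normalize(query) }.tally
end

id_lists = [[1], [1, 2], [1, 2, 3]]
puts statement_entries(id_lists.map { |ids| in_query(ids) }).size
puts statement_entries(id_lists.map { |ids| any_query(ids) }).size
```

Under this approximation each distinct list length yields its own `in (...)` entry, while every `any(...)` call shares a single entry because the whole array is one literal.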
At this point 6.1 only receives security fixes, and bug fixes are only backported to 7.0. This is a pretty large change so I wouldn't expect it to be available until 7.2 |
IMHO this is too "drastic" change just because of |
@simi while it would be great for Postgres to fix this upstream, there have been years of discussions over that (and previous attempts), so it's not guaranteed that patch will be committed. Even if it is committed, this PR may be worthwhile because it allows the use of prepared statements, and is a good deal faster as seen in the benchmark. |
@ghiculescu I work with @seanlinsley and I e-mailed the Postgres developer list about this and got a response. The gist of it is
As far as I can tell, non-constant expressions (e.g., The Postgres e-mail also had the suggestion to add an explicit cast here and some recommended reading in the Postgres source code; I will review that. And for what it's worth, I agree with Sean that the fate of the Postgres patch is uncertain and that the performance improvements are worthwhile. Plus, in the best case, the patch won't be out until Postgres 17 (which will take years to upgrade to). |
I've read through the function mentioned on the Postgres list that transforms an IN expression to an array expression. My C is limited, but it looks like the cases that can't be translated directly have to do with edge cases like mixed types in the IN list. I don't think that's an issue here? |
@seanlinsley Performance boost seems like a good argument in here. 🚀 Would it be possible to remove
@ghiculescu I was also thinking about need of making this configurable and switch this as default behaviour later, but I'm not sure if SQL queries are guaranteed to be stable across |
@msakrejda this PR patches
@simi well, the soonest Postgres might normalize
@simi unless we can find a reason why users would need the old behavior, I do not think this should be configurable since |
Yup, that's clear. I mean to make it clear this is mainly performance update and also as a side-effect it helps with pg_stat_statements grouping of queries. |
While I understand it is important to avoid polluting pg_stat_statements, I am concerned about the potential unforeseen side effects of rewriting all 'in' clauses to 'any' for this purpose. This topic has also been recognized by PostgreSQL developers. If it gets addressed in a future PostgreSQL release, users dealing with pg_stat_statements pollution will have the option to upgrade to a newer version. |
Hello
@yahonda do you see any? There is some point mentioned before in https://pganalyze.com/blog/5mins-postgres-performance-in-lists-vs-any-operator-bind-parameters but it goes in the ANY direction, similar to this PR. At the moment we are using the where_any gem because we want clean We saw no performance increase after upgrading, but the app only does 20k rpm. Waiting for a PostgreSQL update seems like the more complicated route: upgrading a database is a more complicated operation with potential downtime, while rewriting a query is smoother. |
Crunchy Data also has an example comparing IN with ANY:
Relevant PostgreSQL docs: Given that information, maybe this PR should be evaluated more heavily (or solely) on the performance benefit of switching from IN to ANY. The benefit to pg_stat_statements could be secondary. The author included benchmarks in the original description:
If these are repeatable, and there are no regressions, and IN to ANY works equivalently for operators but offers better performance, this is a win right? Since PostgreSQL 16 is now available (including Docker https://hub.docker.com/_/postgres), the benchmark could be tested on 16 too. I also saw the suggestion to typecast the array of integer values. It would be interesting to add that to the benchmark to see if it removes any latency (possibly from client query parsing?). |
@seanlinsley |
@yahonda I would encourage you to help us find issues with this PR so we can be confident there are no unforeseen side effects. I don't think it's acceptable to wait a year for a potential fix in Postgres 17 when Rails can so easily resolve the issue. Summarizing the discussion so far:
@andyatkinson I haven't done this because I'm not confident that Rails provides a Postgres-compatible version of the data type's name at this point in the code. After adding the test suite for all data types, I'll give that a try.
@andyatkinson as I understand it, |
Sorry, I've been distracted with conference+travel. I haven't looked at the diff yet, but just to cut through some of the meta-conversation:
The PR description and changelog entry have been updated to highlight the benefit to prepared statements. I still intend on adding tests for other data types, but now the code no longer falls back to
Hi, where are we on that topic? |
This is still waiting on review from Rails maintainers. It could use additional tests for the many Postgres data types supported, but I'm not sure if there is a list of what types are and aren't supported by Rails. |
In case it's useful, I wanted to add one report of running the provided benchmark. I copied the contents of the benchmark above into

andy@Andrews-Laptop ~/P/r/g/bug_report_templates (array-any)> BENCHMARK_PREPARED=1 ruby any_benchmark.rb
Generating data...
Inserting 20000 users...
..............................................................................................................................................................................................
..........Done!
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-darwin23]
Warming up --------------------------------------
Model.where(id: [...])
                       130.000 i/100ms
Calculating -------------------------------------
Model.where(id: [...])
                       623.762 (±21.2%) i/s - 12.090k in 20.125447s

While running, I confirmed my 400-600 i/s is what I'm seeing, and I'm not seeing much of a reliable difference between

I have pg_stat_statements enabled locally, and I reset the data

I see examples of

16384 | 5 | t | -4050608859221672199 | SELECT COUNT(*) FROM "users" WHERE "users"."id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9,.....
def visit_Arel_Nodes_HomogeneousIn(o, collector)
  visit o.left, collector
  collector << (o.type == :in ? " = ANY (" : " != ALL (")
  type_caster = ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Array.new(o.attribute.type_caster, ",")
This is coupling the arel visitor with the active record database adapter. We can't do this. Maybe we should store the type in the attribute like we do with the type caster?
require "active_model/attribute"
require "active_record"
require "active_record/connection_adapters/postgresql_adapter"
Arel should not depend on active record/model
end

- def type_cast(value) # :nodoc:
+ def type_cast(value, in_array: false) # :nodoc:
What does in_array mean here? And why do we need to pass it to this method? Shouldn't this information be in the value?
Hello there, brought here by the PgAnalyze about query text. I wonder what the status is here? Seems like it's been almost a year since the last development, thanks! |
@seanlinsley this is a welcome change, my 🙏 it gets merged eventually. I've had to do this more-or-less hackily for every high-traffic Rails app I've worked on. I'm about to do it yet again, because using |
@seanlinsley In the current implementation, does As far as I know, this case is actually optimized by PostgreSQL to |
Looks like PG 18 might finally do something about this on the db side - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c0962a113d1f2f94cb7222a7ca025a67e9ce3860
|
@vladkosarev that commit seems to be for However, there's a different commit (and followup commit removing the GUC) that addresses the query normalization issue. Postgres 18 is expected to support pg_stat_statements normalization like this:

SELECT pg_stat_statements_reset();
SELECT * FROM test WHERE a IN (1, 2, 3, 4, 5, 6, 7);
SELECT * FROM test WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8);
SELECT query, calls FROM pg_stat_statements WHERE query LIKE 'SELECT%';
-[ RECORD 1 ]------------------------------
query | SELECT * FROM test WHERE a IN ($1 /*, ... */)
calls | 2
Oops, you are right. That's the one. This solves the issue for people that can upgrade when the time comes. |
Motivation
ANY has several advantages over IN:
- pg_stat_statements churn can be avoided
- pg_stat_activity (when using prepared statements)

Detail
Currently, array inclusion queries like where(id: [1,2]) generate the SQL id IN (1, 2). This PR replaces that with id = ANY ('{1,2}'), or id = ANY ($1) when prepared statements are enabled.

NOT IN is implemented using != ALL: https://stackoverflow.com/a/11730845

See also the forum post: https://discuss.rubyonrails.org/t/83667
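To make the before and after shapes concrete, here is a small sketch that builds the SQL fragments described above from a list of integer ids. It is illustrative only and is not how Rails or Arel generate SQL: `pg_int_array_literal`, `in_sql`, and `any_sql` are hypothetical helpers, and only integers are handled so that no quoting rules are needed.

```ruby
# Hypothetical helper: render integers as a PostgreSQL array literal body,
# e.g. [1, 2] -> "{1,2}". Integer() rejects anything that is not an integer.
def pg_int_array_literal(ids)
  elements = ids.map { |id| Integer(id) }
  "{#{elements.join(',')}}"
end

# The current shape: one literal (or one bind) per list element.
def in_sql(column, ids)
  "#{column} IN (#{ids.map { |id| Integer(id) }.join(', ')})"
end

# The shape proposed in this PR: a single array operand, either inlined
# as a literal or passed as the one bind parameter $1.
def any_sql(column, ids, negate: false, prepared: false)
  operator = negate ? "!= ALL" : "= ANY"
  operand = prepared ? "$1" : "'#{pg_int_array_literal(ids)}'"
  "#{column} #{operator} (#{operand})"
end

puts in_sql("id", [1, 2])
puts any_sql("id", [1, 2])
puts any_sql("id", [1, 2], prepared: true)
puts any_sql("id", [1, 2], negate: true)
```

The point of the single operand is that the statement text no longer depends on how many ids are passed, so one prepared statement can serve every list length.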
Additional information
This is measurably faster on Postgres 15.4. Here are the min and max iterations per second from several benchmark runs:
BENCHMARK_PREPARED=1
BENCHMARK_PREPARED=0
Here's the benchmark: