Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres: use ANY instead of IN for array inclusion queries #49388

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

seanlinsley
Copy link
Contributor

@seanlinsley seanlinsley commented Sep 26, 2023

Motivation

ANY has several advantages over IN:

  • using a single bind param allows the use of prepared statements
  • query parsing is faster
  • pg_stat_statements churn can be avoided
  • queries are less likely to be truncated in pg_stat_activity (when using prepared statements)

Detail

Currently, array inclusion queries like where(id: [1,2]) generate the SQL id IN (1, 2). This PR replaces that with id = ANY ('{1,2}'), or id = ANY ($1) when prepared statements are enabled.

NOT IN is implemented using != ALL https://stackoverflow.com/a/11730845

See also the forum post: https://discuss.rubyonrails.org/t/83667

Additional information

This is measurably faster on Postgres 15.4. Here are the min and max iterations per second from several benchmark runs:

  • 490 - 600 on the main branch
  • 750 - 950 on this branch with BENCHMARK_PREPARED=1
  • 780 - 1020 on this branch with BENCHMARK_PREPARED=0
Here's the benchmark:
# Adapted from activerecord/examples/performance.rb

require "active_record"
require "benchmark/ips"

TIME    = (ENV["BENCHMARK_TIME"] || 20).to_i
RECORDS = (ENV["BENCHMARK_RECORDS"] || TIME * 1000).to_i

conn = {
  adapter: "postgresql",
  database: "postgres",
  prepared_statements: ENV["BENCHMARK_PREPARED"] == "1"
}

ActiveRecord::Base.establish_connection(conn)

class User < ActiveRecord::Base
  connection.create_table :users, force: true do |t|
    t.string :name, :email
    t.timestamps
  end
end

def progress_bar(int); print "." if (int % 100).zero? ; end

puts "Generating data..."

module ActiveRecord
  class Faker
    LOREM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse non aliquet diam. Curabitur vel urna metus, quis malesuada elit.
     Integer consequat tincidunt felis. Etiam non erat dolor. Vivamus imperdiet nibh sit amet diam eleifend id posuere diam malesuada. Mauris at accumsan sem.
     Donec id lorem neque. Fusce erat lorem, ornare eu congue vitae, malesuada quis neque. Maecenas vel urna a velit pretium fermentum. Donec tortor enim,
     tempor venenatis egestas a, tempor sed ipsum. Ut arcu justo, faucibus non imperdiet ac, interdum at diam. Pellentesque ipsum enim, venenatis ut iaculis vitae,
     varius vitae sem. Sed rutrum quam ac elit euismod bibendum. Donec ultricies ultricies magna, at lacinia libero mollis aliquam. Sed ac arcu in tortor elementum
     tincidunt vel interdum sem. Curabitur eget erat arcu. Praesent eget eros leo. Nam magna enim, sollicitudin vehicula scelerisque in, vulputate ut libero.
     Praesent varius tincidunt commodo".split

    def self.name
      LOREM.grep(/^\w*$/).sort_by { rand }.first(2).join " "
    end

    def self.email
      LOREM.grep(/^\w*$/).sort_by { rand }.first(2).join("@") + ".com"
    end
  end
end

begin
  puts "Inserting #{RECORDS} users..."
  RECORDS.times do |record|
    user = User.create(
      created_at: Date.today,
      name: ActiveRecord::Faker.name,
      email: ActiveRecord::Faker.email
    )
    progress_bar(record)
  end
  puts "Done!\n"

  Benchmark.ips(TIME) do |x|
    x.report "Model.where(id: [...])" do
      User.where(id: (1..100).to_a).count
    end
  end
ensure
  User.connection.drop_table :users
end

@seanlinsley
Copy link
Contributor Author

seanlinsley commented Sep 26, 2023

There are failing tests that call e.g. where(id: [3, 9223372036854775808]) on a bigint column, using values that are larger than the bigint column can contain. I assume this was relying on Postgres to fall back to numeric? Since the column can't contain values that large I'm not sure what the right thing to do here is.

There is also a test for a binary column that's somehow generating this SQL:

SELECT "binaries".* FROM "binaries" WHERE "binaries"."data" = any('{"{:value=>\"\\xF0\\x9F\\xA5\\xA6\", :format=>1}","{:value=>\"\\xF0\\x9F\\x8D\\xA6\", :format=>1}"}')

There are also tests that fail because this query apparently isn't being cached.

Thoughts on how best to resolve these?

One option for the out-of-range errors is to skip values that fail to serialize:
diff --git a/activerecord/lib/active_record/connection_adapters/postgresql/oid/array.rb b/activerecord/lib/active_record/connection_adapters/postgresql/oid/array.rb
index e46e47102b..eab716d010 100644
--- a/activerecord/lib/active_record/connection_adapters/postgresql/oid/array.rb
+++ b/activerecord/lib/active_record/connection_adapters/postgresql/oid/array.rb
@@ -9,12 +9,13 @@ class Array < Type::Value # :nodoc:
 
           Data = Struct.new(:encoder, :values) # :nodoc:
 
-          attr_reader :subtype, :delimiter
+          attr_reader :subtype, :delimiter, :ignore_serialize_errors
           delegate :type, :user_input_in_time_zone, :limit, :precision, :scale, to: :subtype
 
-          def initialize(subtype, delimiter = ",")
+          def initialize(subtype, delimiter = ",", ignore_serialize_errors = false)
             @subtype = subtype
             @delimiter = delimiter
+            @ignore_serialize_errors = ignore_serialize_errors
 
             @pg_encoder = PG::TextEncoder::Array.new name: "#{type}[]", delimiter: delimiter
             @pg_decoder = PG::TextDecoder::Array.new name: "#{type}[]", delimiter: delimiter
@@ -79,7 +80,13 @@ def force_equality?(value)
           private
             def type_cast_array(value, method)
               if value.is_a?(::Array)
-                value.map { |item| type_cast_array(item, method) }
+                result = []
+                value.each do |item|
+                  result << type_cast_array(item, method)
+                rescue
+                  raise unless ignore_serialize_errors
+                end
+                result
               else
                 @subtype.public_send(method, value)
               end
diff --git a/activerecord/lib/arel/visitors/postgresql.rb b/activerecord/lib/arel/visitors/postgresql.rb
index e2699ebe38..20cd198c84 100644
--- a/activerecord/lib/arel/visitors/postgresql.rb
+++ b/activerecord/lib/arel/visitors/postgresql.rb
@@ -89,7 +89,7 @@ def visit_Arel_Nodes_HomogeneousIn(o, collector)
             collector << " != all("
           end
 
-          type_caster = ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Array.new(o.attribute.type_caster, ",")
+          type_caster = ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Array.new(o.attribute.type_caster, ",", true)
           values = [type_caster.serialize(o.values)]
           proc_for_binds = -> value { ActiveModel::Attribute.with_cast_value(o.attribute.name, value, type_caster) }
           collector.add_binds(values, proc_for_binds, &bind_block)

@ghiculescu
Copy link
Member

I assume this was relying on Postgres to fall back to numeric?

This is the generated SQL in test_to_sql_with_large_number: SELECT "authors".* FROM "authors" WHERE "authors"."id" IN (3). So it looks like something in Rails is filtering out the impossible value. I think your code should do that too.

@ghiculescu
Copy link
Member

This might be related? #49050

@seanlinsley seanlinsley force-pushed the array-any branch 2 times, most recently from 4a349b5 to af2fced Compare September 27, 2023 03:02
@ghiculescu
Copy link
Member

Apart from pg_stat_statements handles them, is there any difference between id = any('{1,2}') and id IN (1, 2)? I'm wondering if we need a config for this or if it's safe to make it the only behaviour for postgres.

@seanlinsley
Copy link
Contributor Author

seanlinsley commented Sep 27, 2023

IN and any() are similar in their behavior, though the performance varies. The ~70% performance boost seen in the benchmark is likely caused by Postgres being faster at parsing parameter values than parsing SQL.

https://pganalyze.com/blog/5mins-postgres-performance-in-lists-vs-any-operator-bind-parameters

https://www.postgresql.org/docs/current/functions-comparisons.html

Copy link
Member

@ghiculescu ghiculescu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a pretty large Postgres codebase (and love pganalyze!). I just ran our CI against this branch, and everything passed 👍

@rafaelfranca rafaelfranca added the ready PRs ready to merge label Sep 29, 2023
@andyatkinson
Copy link
Member

I was searching PRs labeled PostgreSQL and noticed that tag wasn't on this PR, and isn't getting much use in general.

Can that tag be added to this PR? Can anyone do that?

Are there more open PRs that can be tagged with PostgreSQL? Thanks.

@andyatkinson
Copy link
Member

andyatkinson commented Oct 1, 2023

@seanlinsley At work we have a fairly large Rails 6.1 codebase (~100K LOC) and use PostgreSQL, PGSS, PgHero, and I advocate for more usage of it. Commonly we have these "giant IN clause" queries that are problematic because they're using hundreds or even thousands of values. If this PR improves the visibility on that and helps us fix those queries, that would be valuable to the platform health and performance. Typically these queries time out/get cancelled. I'm interested in this patch!

Would there be any plans to back-port this functionality to Rails 6.1 after it's merged and stable?

Thanks.

@seanlinsley
Copy link
Contributor Author

seanlinsley commented Oct 1, 2023

Hi @andyatkinson, yes this change will collapse the hundreds / thousands of separate PGSS entries into a single one:

select pg_stat_statements_reset();
select count(*) from users where id in (1);
select count(*) from users where id in (1,2);
select count(*) from users where id in (1,2,3);
select count(*) from users where id = any('{1}');
select count(*) from users where id = any('{1,2}');
select count(*) from users where id = any('{1,2,3}');
select calls, query from pg_stat_statements order by query;
 calls |                       query                       
-------+---------------------------------------------------
     3 | select count(*) from users where id = any($1)
     1 | select count(*) from users where id in ($1)
     1 | select count(*) from users where id in ($1,$2)
     1 | select count(*) from users where id in ($1,$2,$3)
     1 | select pg_stat_statements_reset()

We're currently testing this on production using a 6.1 backport: https://github.com/pganalyze/rails/commits/any-backport. Feel free to use it, though I'd recommend keeping it on your own fork so you won't be impacted if that branch goes away in the future.

@skipkayhil
Copy link
Member

Would there be any plans to back-port this functionality to Rails 6.1 after it's merged and stable?

At this point 6.1 only receives security fixes, and bug fixes are only backported to 7.0. This is a pretty large change so I wouldn't expect it to be available until 7.2

@simi
Copy link
Contributor

simi commented Oct 3, 2023

IMHO this is too "drastic" change just because of pg_stat_statements use-case and should be fixed at pg_stat_statements itself. It is in progress at https://commitfest.postgresql.org/45/2837/. I can try to do review in there and ask around how to move that forward.

@seanlinsley
Copy link
Contributor Author

seanlinsley commented Oct 3, 2023

@simi while it would be great for Postgres to fix this upstream, there have been years of discussions over that (and previous attempts), so it's not guaranteed that patch will be committed.

Even if it is committed, this PR may be worthwhile because it allows the use of prepared statements, and is a good deal faster as seen in the benchmark.

@msakrejda
Copy link

Apart from pg_stat_statements handles them, is there any difference between id = any('{1,2}') and id IN (1, 2)? I'm wondering if we need a config for this or if it's safe to make it the only behaviour for postgres.

@ghiculescu I work with @seanlinsley and I e-mailed the Postgres developer list about this and got a response. The gist of it is

If all the values in the IN form were being sent to the backend as constants of the same datatype, I think you're okay to consider it as exactly equivalent to =ANY.

As far as I can tell, non-constant expressions (e.g., User.where(id: User.select(:id))) still end up as IN lists (please correct me if I'm wrong), so this should be fine.

The Postgres e-mail also had the suggestion to add an explicit cast here and some recommended reading in the Postgres source code; I will review that.

And for what it's worth, I agree with Sean that the fate of the Postgres patch is uncertain and that the performance improvements are worthwhile. Plus, in the best case, the patch won't be out until Postgres 17 (which will take years to upgrade to).

@msakrejda
Copy link

I've read through the function mentioned on the Postgres list that transforms an IN expression to an array expression. My C is limited, but it looks like the cases that can't be translated directly have to do with edge cases like mixed types in the IN list. I don't think that's an issue here?

@simi
Copy link
Contributor

simi commented Oct 4, 2023

@simi while it would be great for Postgres to fix this upstream, there have been years of discussions over that (and previous attempts), so it's not guaranteed that patch will be committed.

Even if it is committed, this PR may be worthwhile because it allows the use of prepared statements, and is a good deal faster as seen in the benchmark.

@seanlinsley Performance boost seems like a good argument in here. 🚀 Would it be possible to remove pg_stat_statements comments from code then?

Apart from pg_stat_statements handles them, is there any difference between id = any('{1,2}') and id IN (1, 2)? I'm wondering if we need a config for this or if it's safe to make it the only behaviour for postgres.

@ghiculescu I was also thinking about need of making this configurable and switch this as default behaviour later, but I'm not sure if SQL queries are guaranteed to be stable across ActiveRecord minor/patch versions. Also, currently, this config will make it a little easier to test since branching would not be needed at various places.

@seanlinsley
Copy link
Contributor Author

I've read through the function mentioned on the Postgres list that transforms an IN expression to an array expression. My C is limited, but it looks like the cases that can't be translated directly have to do with edge cases like mixed types in the IN list. I don't think that's an issue here?

@msakrejda this PR patches HomogeneousIn, which like the name suggests only contains values of the same type. For example, where(id: [1, 2, nil]) generates WHERE (id = any('{1,2}') OR id IS NULL). So we should be good there.

@seanlinsley Performance boost seems like a good argument in here. 🚀 Would it be possible to remove pg_stat_statements comments from code then?

@simi well, the soonest Postgres might normalize IN list queries properly is in ~11 months for v17, so it's likely that users upgrading to Rails 7.2 will see this PR reduce their pg_stat_statements churn. But I will update the code comment and changelog to point out the performance improvement.

@ghiculescu I was also thinking about need of making this configurable and switch this as default behaviour later, but I'm not sure if SQL queries are guaranteed to be stable across ActiveRecord minor/patch versions. Also, currently, this config will make it a little easier to test since branching would not be needed at various places.

@simi unless we can find a reason why users would need the old behavior, I do not think this should be configurable since any looks to be better than IN in several ways, with no downsides.

@simi
Copy link
Contributor

simi commented Oct 7, 2023

@simi well, the soonest Postgres might normalize IN list queries properly is in ~11 months for v17, so it's likely that users upgrading to Rails 7.2 will see this PR reduce their pg_stat_statements churn. But I will update the code comment and changelog to point out the performance improvement.

Yup, that's clear. I mean to make it clear this is mainly performance update and also as a side-effect it helps with pg_stat_statements grouping of queries.

@yahonda
Copy link
Member

yahonda commented Oct 9, 2023

While I understand it is important to avoid polluting pg_stat_statements, I am concerned about the potential unforeseen side effects of rewriting all 'in' clauses to 'any' for this purpose.

This topic has also been recognized by PostgreSQL developers. If it gets addressed in a future PostgreSQL release, users dealing with pg_stat_statements pollution will have the option to upgrade to a newer version.

@benoittgt
Copy link
Contributor

benoittgt commented Oct 9, 2023

Hello

I am concerned about the potential unforeseen side effects of rewriting all 'in' clauses to 'any' for this purpose.

@yahonda do you see any? There is some point mentioned before in https://pganalyze.com/blog/5mins-postgres-performance-in-lists-vs-any-operator-bind-parameters but it goes into the ANY direction, similar to this PR.

At the moment we are using where_any gem because we want clean pg_stat_statements but also we use prepared statements and want to enable it for ANY operator too.

We saw no performance increase after upgrading. But the app only does 20k/rpm.

About waiting for PostgreSQL update seems to be a more complicated way. Upgrading a database is more complicated operation with potential downtime while re-writing query is smoother.

@andyatkinson
Copy link
Member

Crunchy Data also has an example comparing IN with ANY:

From the definitions, expression IN (...) is equivalent to expression = ANY (...) but for the parameter type!
https://www.crunchydata.com/blog/postgres-query-boost-using-any-instead-of-in

Relevant PostgreSQL docs:
https://www.postgresql.org/docs/current/functions-comparisons.html#FUNCTIONS-COMPARISONS-ANY-SOME

Given that information, maybe this PR should be evaluated more heavily (or solely) on the performance benefit of switching from IN to ANY. The benefit to pg_stat_statements could be secondary.

The author included benchmarks in the original description:

This turns out to be a good deal faster (using Postgres 15.4). Here are the min and max iterations per > second from several benchmark runs:

490 - 600 on the main branch
750 - 950 on this branch with BENCHMARK_PREPARED=1
780 - 1020 on this branch with BENCHMARK_PREPARED=0

If these are repeatable, and there are no regressions, and IN to ANY works equivalently for operators but offers better performance, this is a win right?

Since PostgreSQL 16 is now available (including Docker https://hub.docker.com/_/postgres), the benchmark could be tested on 16 too.

I also saw the suggestion to typecast the array of integer values. It would be interesting to add that to the benchmark to see if it removes any latency (possibly from client query parsing?).

@andyatkinson
Copy link
Member

@seanlinsley
Would you need to update https://github.com/rails/rails/blob/main/activerecord/lib/arel/visitors/to_sql.rb#L338 to handle the ANY keyword?

@seanlinsley
Copy link
Contributor Author

While I understand it is important to avoid polluting pg_stat_statements, I am concerned about the potential unforeseen side effects of rewriting all 'in' clauses to 'any' for this purpose.

@yahonda I would encourage you to help us find issues with this PR so we can be confident there are no unforeseen side effects. I don't think it's acceptable to wait a year for a potential fix in Postgres 17 when Rails can so easily resolve the issue. Summarizing the discussion so far:

  • beyond fixing PGSS churn, this PR improves performance and allows more queries to use prepared statements
  • my team is running this as a 6.1 backport on production, while @benoittgt and others are using the where_any gem
  • I am currently writing a suite of tests to confirm the behavior for every Postgres data type that Rails supports, so we know which data types should fall back to IN because of serialization issues
  • I have confirmed that subqueries work as expected, though there are no tests for that currently

I also saw the suggestion to typecast the array of integer values. It would be interesting to add that to the benchmark to see if it removes any latency (possibly from client query parsing?).

@andyatkinson I haven't done this because I'm not confident that Rails provides a Postgres-compatible version of the data type's name at this point in the code. After adding the test suite for all data types, I'll give that a try.

Would you need to update https://github.com/rails/rails/blob/main/activerecord/lib/arel/visitors/to_sql.rb#L338 to handle the ANY keyword?

@andyatkinson as I understand it, to_sql.rb is the generic implementation that can be specialized for some databases. That's why the postgresql.rb file modified in this PR inherits from it: class PostgreSQL < Arel::Visitors::ToSql.

@matthewd
Copy link
Member

Sorry, I've been distracted with conference+travel. I haven't looked at the diff yet, but just to cut through some of the meta-conversation:

  • I want this change for the prepared-statements benefits, regardless of more niche factors (pg_stat_statements, but also DB-side parsing times, unless we're making them egregiously slower)
  • It will need to work with all data types (opt-in of individual types is unacceptable)
  • I would encourage you to find more productive ways to engage with the first feedback from a committer than "I would encourage you to help us find issues with this PR"

@seanlinsley seanlinsley changed the title Reduce pg_stat_statements churn using = any() instead of IN () Postgres: use ANY instead of IN for array inclusion queries Oct 11, 2023
@seanlinsley seanlinsley force-pushed the array-any branch 2 times, most recently from 2059a69 to eb95793 Compare October 11, 2023 22:24
@seanlinsley
Copy link
Contributor Author

seanlinsley commented Oct 11, 2023

The PR description and changelog entry have been updated to highlight the benefit to prepared statements.

I still intend on adding tests for other data types, but now the code no longer falls back to IN for data types with tests. Notes on the changes:

  • properly serializing encrypted values and JSON required duplicating Array#serialize and HomogeneousIn#casted_values to avoid double-serializing
  • #49050 has been copied into this PR so the added test passes for SQLite, so it should be merged first
  • type_cast now accepts in_array: true to determine if a binary value should be serialized as a string, or passed to the pg gem in the special format: 1 structure

@CedricCouton
Copy link

Hi, where are we on that topic?

@seanlinsley
Copy link
Contributor Author

This is still waiting on review from Rails maintainers.

It could use additional tests for the many Postgres data types supported, but I'm not sure if there is a list of what types are and aren't supported by Rails.

@andyatkinson
Copy link
Member

In case it's useful, I wanted to add one report of running the provided benchmark.

I copied the contents of the benchmark above into any_benchmark.rb in the bug_report_templates directory, checked out the array-any branch, running PostgreSQL 16.1 installed locally on a macOS laptop:

andy@Andrews-Laptop ~/P/r/g/bug_report_templates (array-any)> BENCHMARK_PREPARED=1 ruby any_benchmark.rb
Generating data...
Inserting 20000 users...
..............................................................................................................................................................................................
..........Done!
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-darwin23]
Warming up --------------------------------------
Model.where(id: [...])
                       130.000 i/100ms
Calculating -------------------------------------
Model.where(id: [...])
                        623.762 (±21.2%) i/s -     12.090k in  20.125447s

While running, I confirmed my postgres database had the users table that the benchmark script adds, before it's removed at the end. I review activity in pg_stat_activity and saw the Ruby client while it was running.

400-600 i/s is what I'm seeing, and I'm not seeing much of a reliable difference between main and this branch.

I have pg_stat_statements enabled locally, and I reset the data SELECT pg_stat_statements_reset();

I see examples of IN lists like, but none of =ANY, not sure whether that was expected.

16384 |    5 | t        | -4050608859221672199 | SELECT COUNT(*) FROM "users" WHERE "users"."id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9,.....

def visit_Arel_Nodes_HomogeneousIn(o, collector)
visit o.left, collector
collector << (o.type == :in ? " = ANY (" : " != ALL (")
type_caster = ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Array.new(o.attribute.type_caster, ",")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is coupling the arel visitor with the active record database adapter. We can't do this. Maybe we should store the type in the attribute like we do with the type caster?

Comment on lines +3 to +5
require "active_model/attribute"
require "active_record"
require "active_record/connection_adapters/postgresql_adapter"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arel should not depend on active record/model

@@ -130,13 +130,18 @@ def quote_default_expression(value, column) # :nodoc:
end
end

def type_cast(value) # :nodoc:
def type_cast(value, in_array: false) # :nodoc:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What in_array means here? and why we need to pass it to this method, should not this information be in the value?

@rafaelfranca rafaelfranca removed the ready PRs ready to merge label Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.