Simplify and optimize Postgres query for primary_keys() #27961

Merged
merged 1 commit into rails:master from jordanlewis:simplify-postgres-primary-keys-v2 on Feb 14, 2017

Conversation

@jordanlewis
Contributor

jordanlewis commented Feb 10, 2017

Take two on #27949, which was flawed because unnest doesn't always return rows in a consistent order. This version uses generate_subscripts, which should always return rows in a consistent order and has been available since Postgres 8.4.

In summary, this pull request simplifies the query used to determine the primary keys of a table in Postgres, yielding a roughly 66% speedup per primary-key lookup during application startup.

cc @kamipo (thanks for your patience!)

Here's the EXPLAIN data with the new and old queries on a sample table:

CREATE TABLE comp (a int, b int, c text, d numeric, primary key(a, b));

Before:

jordan=# EXPLAIN ANALYZE WITH pk_constraint AS (
  SELECT conrelid, unnest(conkey) AS connum FROM pg_constraint
  WHERE contype = 'p'
    AND conrelid = '"comp"'::regclass
), cons AS (
  SELECT conrelid, connum, row_number() OVER() AS rownum FROM pk_constraint
)
SELECT attr.attname FROM pg_attribute attr
INNER JOIN cons ON attr.attrelid = cons.conrelid AND attr.attnum = cons.connum
ORDER BY cons.rownum;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=101.59..101.61 rows=8 width=72) (actual time=0.900..0.900 rows=2 loops=1)
   Sort Key: cons.rownum
   Sort Method: quicksort  Memory: 25kB
   CTE pk_constraint
     ->  Seq Scan on pg_constraint  (cost=0.00..1.53 rows=100 width=6) (actual time=0.016..0.018 rows=2 loops=1)
           Filter: ((contype = 'p'::"char") AND (conrelid = '25096'::oid))
           Rows Removed by Filter: 11
   CTE cons
     ->  WindowAgg  (cost=0.00..3.25 rows=100 width=14) (actual time=0.020..0.026 rows=2 loops=1)
           ->  CTE Scan on pk_constraint  (cost=0.00..2.00 rows=100 width=6) (actual time=0.017..0.021 rows=2 loops=1)
   ->  Hash Join  (cost=3.50..96.70 rows=8 width=72) (actual time=0.869..0.875 rows=2 loops=1)
         Hash Cond: ((attr.attrelid = cons.conrelid) AND (attr.attnum = cons.connum))
         ->  Seq Scan on pg_attribute attr  (cost=0.00..73.78 rows=2578 width=70) (actual time=0.007..0.277 rows=2615 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=14) (actual time=0.047..0.047 rows=2 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  CTE Scan on cons  (cost=0.00..2.00 rows=100 width=14) (actual time=0.026..0.032 rows=2 loops=1)
 Planning time: 0.249 ms
 Execution time: 1.009 ms
(18 rows)

After:

jordan=# EXPLAIN ANALYZE SELECT a.attname
                  FROM pg_index i
            CROSS JOIN generate_subscripts(i.indkey, 1) k
                  JOIN pg_attribute a
                    ON a.attrelid=i.indrelid
                   AND a.attnum=i.indkey[k]
                 WHERE i.indrelid='comp'::regclass
                   AND i.indisprimary;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=14.26..46.47 rows=20 width=64) (actual time=0.061..0.064 rows=2 loops=1)
   Hash Cond: (i.indkey[k.k] = a.attnum)
   ->  Nested Loop  (cost=0.00..24.52 rows=1000 width=35) (actual time=0.035..0.037 rows=2 loops=1)
         ->  Seq Scan on pg_index i  (cost=0.00..4.51 rows=1 width=31) (actual time=0.026..0.028 rows=1 loops=1)
               Filter: (indisprimary AND (indrelid = '16903'::oid))
               Rows Removed by Filter: 123
         ->  Function Scan on generate_subscripts k  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.007..0.007 rows=2 loops=1)
   ->  Hash  (cost=14.20..14.20 rows=4 width=70) (actual time=0.017..0.017 rows=8 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..14.20 rows=4 width=70) (actual time=0.011..0.012 rows=8 loops=1)
               Index Cond: (attrelid = '16903'::oid)
 Planning time: 0.183 ms
 Execution time: 0.095 ms
(13 rows)
@rails-bot


rails-bot commented Feb 10, 2017

Thanks for the pull request, and welcome! The Rails team is excited to review your changes, and you should hear from @schneems (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

This repository is being automatically checked for code quality issues using Code Climate. You can see results for this analysis in the PR status below. Newly introduced issues should be fixed before a Pull Request is considered ready to review.

Please see the contribution instructions for more information.

activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb Outdated
JOIN pg_attribute a
ON a.attrelid=i.indrelid
AND a.attnum=i.indkey[k]
WHERE i.indrelid=#{quote(quote_table_name(table_name))}::regclass


@kamipo

kamipo Feb 10, 2017

Member

I confirmed that generate_subscripts behaves as expected.

% psql -E activerecord_unittest
psql (9.5.4)
Type "help" for help.

activerecord_unittest=# create table comp3 (a int, b int, c int, primary key (a,c, b));
CREATE TABLE
activerecord_unittest=# select a.attname from pg_index i cross join unnest(i.indkey) as k join pg_attribute a on a.attrelid = i.indrelid and a.attnum = k where indrelid='comp3'::regclass and i.indisprimary;
 attname 
---------
 a
 b
 c
(3 rows)

activerecord_unittest=# select a.attname from pg_index i cross join generate_subscripts(i.indkey, 1) k join pg_attribute a on a.attrelid = i.indrelid and a.attnum = i.indkey[k] where indrelid='comp3'::regclass and i.indisprimary;
 attname 
---------
 a
 c
 b
(3 rows)

activerecord_unittest=# 


@kamipo

kamipo Feb 10, 2017

Member

[nit] I prefer spaces around the equals sign, even in SQL, as in Style/SpaceAroundOperators.

rails/.rubocop.yml

Lines 66 to 67 in 55d66e2

Style/SpaceAroundOperators:
Enabled: true

We have a few lines missing spaces around the equals sign.

% git grep -n "\w=[\w#']" lib/active_record/connection_adapters
lib/active_record/connection_adapters/abstract_mysql_adapter.rb:556:        raw_table_options.sub!(/(ENGINE=\w+)(?: AUTO_INCREMENT=\d+)/, '\1')
lib/active_record/connection_adapters/abstract_mysql_adapter.rb:561:        if raw_table_options.sub!(/ COMMENT='.+'/, "")
lib/active_record/connection_adapters/abstract_mysql_adapter.rb:845:          # to work with MySQL 5.7.6 which sets optimizer_switch='derived_merge=on'
lib/active_record/connection_adapters/sqlite3_adapter.rb:348:            WHERE name=#{quote(row['name'])} AND type='index'
lib/active_record/connection_adapters/sqlite3_adapter.rb:352:            WHERE name=#{quote(row['name'])} AND type='index'

% git grep -n "}=[\w#']" lib/active_record/connection_adapters
lib/active_record/connection_adapters/abstract_mysql_adapter.rb:486:          execute("UPDATE #{quote_table_name(table_name)} SET #{quote_column_name(column_name)}=#{quote(default)} WHERE #{quote_column_name(column_name)} IS NULL")
lib/active_record/connection_adapters/postgresql/schema_statements.rb:528:            execute("UPDATE #{quote_table_name(table_name)} SET #{quote_column_name(column_name)}=#{quote_default_expression(default, column)} WHERE #{quote_column_name(column_name)} IS NULL") if column
lib/active_record/connection_adapters/sqlite3_adapter.rb:418:          exec_query("UPDATE #{quote_table_name(table_name)} SET #{quote_column_name(column_name)}=#{quote(default)} WHERE #{quote_column_name(column_name)} IS NULL")


@jordanlewis

jordanlewis Feb 10, 2017

Contributor

Added the spaces.

@sgrif


Member

sgrif commented Feb 10, 2017

While we're touching these, can we just refactor it to use information_schema, so we can share the code between every adapter except SQLite and Oracle? The query is roughly:

SELECT column_name FROM information_schema.key_column_usage
    WHERE constraint_name IN (
        SELECT table_constraints.constraint_name FROM information_schema.table_constraints WHERE constraint_type = 'PRIMARY KEY'
    ) AND table_name = $1 AND table_schema = $2

(If anyone wanted to refactor all of our schema lookup code to use information_schema, you can see examples of what all the queries would need to be here. The only things that need to differ between databases are the name of the column type column and what the value of table_schema needs to be scoped to when no schema is explicitly given.)
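
For illustration, a minimal sketch of what such a shared method might look like. This is not code from this PR: select_values and quote are the existing adapter helpers, while default_schema_scope is a hypothetical hook that each adapter would implement to return its default scoping expression (for example current_schema() on PostgreSQL or DATABASE() on MySQL).

# Sketch only: a shared primary_keys built on information_schema.
# default_schema_scope is hypothetical; it stands in for the per-adapter
# answer to "which schema should be searched when none was given".
def primary_keys(table_name) # :nodoc:
  select_values(<<-SQL.strip_heredoc, "SCHEMA")
    SELECT kcu.column_name
      FROM information_schema.key_column_usage kcu
      JOIN information_schema.table_constraints tc
        ON kcu.table_name = tc.table_name
       AND kcu.constraint_name = tc.constraint_name
     WHERE tc.constraint_type = 'PRIMARY KEY'
       AND kcu.table_name = #{quote(table_name)}
       AND kcu.table_schema = #{default_schema_scope}
  SQL
end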

@jordanlewis jordanlewis force-pushed the jordanlewis:simplify-postgres-primary-keys-v2 branch Feb 10, 2017

@jordanlewis


Contributor

jordanlewis commented Feb 10, 2017

@sgrif That seems like a pretty good idea. The query can't be optimized quite as well this way, but I suppose the DRY gain in the Rails codebase could be more important. What do you think?

I'm happy to update my patch as you suggest, but I think I'll stick to just modifying primary_keys for now to keep the scope of this change reasonable.

The query diff (before extending the refactoring to the rest of the adapters) would be:

diff --git a/activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb b/activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb
index 9072745200..c8785a06dd 100644
--- a/activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb
+++ b/activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb
@@ -426,14 +426,13 @@ def pk_and_sequence_for(table) #:nodoc:

         def primary_keys(table_name) # :nodoc:
           select_values(<<-SQL.strip_heredoc, "SCHEMA")
-                SELECT a.attname
-                  FROM pg_index i
-            CROSS JOIN generate_subscripts(i.indkey, 1) k
-                  JOIN pg_attribute a
-                    ON a.attrelid = i.indrelid
-                   AND a.attnum = i.indkey[k]
-                 WHERE i.indrelid = #{quote(quote_table_name(table_name))}::regclass
-                   AND i.indisprimary
+            SELECT column_name
+              FROM information_schema.key_column_usage kcu
+              JOIN information_schema.table_constraints tc
+                ON kcu.table_name = tc.table_name
+               AND kcu.constraint_name = tc.constraint_name
+             WHERE constraint_type='PRIMARY KEY'
+               AND kcu.table_name=#{quote(quote_table_name(table_name))}
           SQL
         end

And here's the EXPLAIN ANALYZE output:

jordan=# EXPLAIN ANALYZE SELECT column_name FROM information_schema.key_column_usage kcu JOIN information_schema.table_constraints tc ON kcu.table_name = tc.table_name AND kcu.constraint_name = tc.constraint_name WHERE constraint_type='PRIMARY KEY' AND kcu.table_name='comp3';
                                                                                                                                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=57.65..97.05 rows=1 width=32) (actual time=0.269..0.299 rows=3 loops=1)
   ->  Hash Join  (cost=57.37..89.09 rows=15 width=40) (actual time=0.257..0.276 rows=3 loops=1)
         Hash Cond: (((c.conname)::information_schema.sql_identifier)::text = ("*SELECT* 1".constraint_name)::text)
         ->  Nested Loop  (cost=0.27..16.84 rows=1000 width=341) (actual time=0.198..0.214 rows=3 loops=1)
               Join Filter: (c.connamespace = nc.oid)
               Rows Removed by Join Filter: 5
               ->  Nested Loop  (cost=0.27..10.46 rows=1 width=172) (actual time=0.057..0.061 rows=1 loops=1)
                     Join Filter: (r.relnamespace = nr.oid)
                     Rows Removed by Join Filter: 3
                     ->  Nested Loop  (cost=0.27..9.34 rows=1 width=176) (actual time=0.044..0.046 rows=1 loops=1)
                           ->  Seq Scan on pg_constraint c  (cost=0.00..1.03 rows=1 width=104) (actual time=0.014..0.017 rows=3 loops=1)
                                 Filter: (contype = ANY ('{p,u,f}'::"char"[]))
                                 Rows Removed by Filter: 2
                           ->  Index Scan using pg_class_oid_index on pg_class r  (cost=0.27..8.30 rows=1 width=76) (actual time=0.008..0.008 rows=0 loops=3)
                                 Index Cond: (oid = c.conrelid)
                                 Filter: ((relkind = 'r'::"char") AND (((relname)::information_schema.sql_identifier)::text = 'comp3'::text))
                                 Rows Removed by Filter: 1
                     ->  Seq Scan on pg_namespace nr  (cost=0.00..1.07 rows=4 width=4) (actual time=0.007..0.011 rows=4 loops=1)
                           Filter: (NOT pg_is_other_temp_schema(oid))
                           Rows Removed by Filter: 2
               ->  Seq Scan on pg_namespace nc  (cost=0.00..1.06 rows=6 width=4) (actual time=0.002..0.005 rows=6 loops=1)
         ->  Hash  (cost=57.06..57.06 rows=3 width=64) (actual time=0.051..0.051 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Append  (cost=0.27..57.06 rows=3 width=64) (actual time=0.043..0.049 rows=1 loops=1)
                     ->  Subquery Scan on "*SELECT* 1"  (cost=0.27..11.66 rows=1 width=64) (actual time=0.043..0.048 rows=1 loops=1)
                           ->  Nested Loop  (cost=0.27..11.65 rows=1 width=288) (actual time=0.042..0.046 rows=1 loops=1)
                                 Join Filter: (r_1.relnamespace = nr_1.oid)
                                 Rows Removed by Join Filter: 3
                                 ->  Nested Loop  (cost=0.27..10.52 rows=1 width=132) (actual time=0.034..0.037 rows=1 loops=1)
                                       Join Filter: (c_1.connamespace = nc_1.oid)
                                       Rows Removed by Join Filter: 5
                                       ->  Nested Loop  (cost=0.27..9.39 rows=1 width=136) (actual time=0.030..0.031 rows=1 loops=1)
                                             ->  Seq Scan on pg_constraint c_1  (cost=0.00..1.05 rows=1 width=72) (actual time=0.011..0.013 rows=3 loops=1)
                                                   Filter: ((contype <> ALL ('{t,x}'::"char"[])) AND (((CASE contype WHEN 'c'::"char" THEN 'CHECK'::text WHEN 'f'::"char" THEN 'FOREIGN KEY'::text WHEN 'p'::"char" THEN 'PRIMARY KEY'::text WHEN 'u'::"char" THEN 'UNIQUE'::text ELSE NULL::text END)::information_schema.character_data)::text = 'PRIMARY KEY'::text))
                                                   Rows Removed by Filter: 2
                                             ->  Index Scan using pg_class_oid_index on pg_class r_1  (cost=0.27..8.33 rows=1 width=72) (actual time=0.005..0.005 rows=0 loops=3)
                                                   Index Cond: (oid = c_1.conrelid)
                                                   Filter: ((relkind = 'r'::"char") AND (((relname)::information_schema.sql_identifier)::text = 'comp3'::text) AND (pg_has_role(relowner, 'USAGE'::text) OR has_table_privilege(oid, 'INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER'::text) OR has_any_column_privilege(oid, 'INSERT, UPDATE, REFERENCES'::text)))
                                                   Rows Removed by Filter: 1
                                       ->  Seq Scan on pg_namespace nc_1  (cost=0.00..1.06 rows=6 width=4) (actual time=0.003..0.004 rows=6 loops=1)
                                 ->  Seq Scan on pg_namespace nr_1  (cost=0.00..1.07 rows=4 width=4) (actual time=0.003..0.005 rows=4 loops=1)
                                       Filter: (NOT pg_is_other_temp_schema(oid))
                                       Rows Removed by Filter: 2
                     ->  Subquery Scan on "*SELECT* 2"  (cost=0.28..45.40 rows=2 width=64) (actual time=0.001..0.001 rows=0 loops=1)
                           ->  Result  (cost=0.28..45.38 rows=2 width=288) (actual time=0.001..0.001 rows=0 loops=1)
                                 One-Time Filter: ((('CHECK'::character varying)::information_schema.character_data)::text = 'PRIMARY KEY'::text)
                                 ->  Nested Loop  (cost=0.28..45.38 rows=2 width=288) (never executed)
                                       ->  Nested Loop  (cost=0.00..27.13 rows=1 width=72) (never executed)
                                             Join Filter: (nr_2.oid = r_2.relnamespace)
                                             ->  Seq Scan on pg_class r_2  (cost=0.00..26.01 rows=1 width=72) (never executed)
                                                   Filter: ((relkind = 'r'::"char") AND (((relname)::information_schema.sql_identifier)::text = 'comp3'::text) AND (pg_has_role(relowner, 'USAGE'::text) OR has_table_privilege(oid, 'INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER'::text) OR has_any_column_privilege(oid, 'INSERT, UPDATE, REFERENCES'::text)))
                                             ->  Seq Scan on pg_namespace nr_2  (cost=0.00..1.07 rows=4 width=4) (never executed)
                                                   Filter: (NOT pg_is_other_temp_schema(oid))
                                       ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a_1  (cost=0.28..18.16 rows=2 width=6) (never executed)
                                             Index Cond: ((attrelid = r_2.oid) AND (attnum > 0))
                                             Filter: (attnotnull AND (NOT attisdropped))
   ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..0.52 rows=1 width=70) (actual time=0.005..0.005 rows=1 loops=3)
         Index Cond: ((attrelid = r.oid) AND (attnum = ((information_schema._pg_expandarray(c.conkey))).x))
         Filter: ((NOT attisdropped) AND (pg_has_role(r.relowner, 'USAGE'::text) OR has_column_privilege(r.oid, attnum, 'SELECT, INSERT, UPDATE, REFERENCES'::text)))
 Planning time: 1.487 ms
 Execution time: 0.574 ms
(61 rows)
@sgrif


Member

sgrif commented Feb 10, 2017

Seems fine to me.

@jeremy


Member

jeremy commented Feb 10, 2017

I'm a fan of leaning on information_schema by default. If the motivation is to speed these up, though, information_schema performance will often disappoint.

@sgrif


Member

sgrif commented Feb 10, 2017

Do we have an actual use case where this is a hot spot? Ultimately rake db:schema:cache:dump makes the performance of schema lookups pretty much moot, doesn't it?

@jordanlewis jordanlewis force-pushed the jordanlewis:simplify-postgres-primary-keys-v2 branch 2 times, most recently Feb 13, 2017

@jordanlewis


Contributor

jordanlewis commented Feb 13, 2017

@sgrif Good point. My motivation here had more to do with simplifying the query than optimizing it, to improve compatibility for Postgres backends that might not support common table expressions, a feature that is only used in this query; the speedup was just a side effect.

I've updated the patch so the query uses information_schema. I removed a few tests that check that primary_keys throws an invalid-query exception if the table doesn't exist or isn't available from the current search path, since that's not easy to accomplish without the old approach (::regclass casts). I don't think that's an important semantic of the API, though, since none of the other adapters have that behavior, which is why I removed the tests instead of trying to fix them.

I spent a little time seeing what it would take to move the primary_keys method up to the abstract adapter. I think it would take a decent amount of work, since the Postgres adapter has its own logic in extract_schema_qualified_name, and I'm not familiar enough with the codebase to take that on right now. So I've submitted this patch as-is, and hopefully someone with more expertise can take on unifying the implementations.

@sgrif


Member

sgrif commented Feb 13, 2017

Yeah, the definition of "schema" varies by backend (and among the in-tree adapters it's only relevant to Postgres), so you'll probably want to have the abstract version ask the adapter to split the string into schema and table name (with a default implementation returning [nil, table_name]). This looks good to me once CI is green. Thanks!
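
A minimal sketch of that suggestion (the method name is hypothetical, and the PostgreSQL override is only a guess at one possible shape, reusing the adapter's existing Utils.extract_schema_qualified_name and the schema/identifier readers on the name object it returns):

# Sketch only. In the abstract adapter: no schema notion by default.
def split_schema_and_table(table_name)
  [nil, table_name]
end

# In the PostgreSQL adapter: one possible override that reuses the
# existing name parsing (assumed shape, not code from this PR).
def split_schema_and_table(table_name)
  name = Utils.extract_schema_qualified_name(table_name.to_s)
  [name.schema, name.identifier]
end

The shared information_schema query could then compare kcu.table_schema against the returned schema when one was given, and fall back to the adapter's default scope otherwise.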

@jordanlewis jordanlewis force-pushed the jordanlewis:simplify-postgres-primary-keys-v2 branch Feb 13, 2017

Simplify and optimize Postgres query for primary_keys()
primary_keys(table) needs to query various metadata tables in Postgres
to determine the primary key for the table. Previously, it did so using
a complex common table expression against pg_constraint and
pg_attribute.

This patch rewrites the query to use information_schema tables. That
makes the query far easier to understand and avoids an expensive
unnest, a window function, and a common table expression.

@jordanlewis jordanlewis force-pushed the jordanlewis:simplify-postgres-primary-keys-v2 branch to b8e3af7 Feb 13, 2017

@jordanlewis


Contributor

jordanlewis commented Feb 13, 2017

Cool, the build's green now. Thanks for the tips, @sgrif!

@jeremy jeremy requested a review from sgrif Feb 14, 2017

@sgrif sgrif merged commit b9661c7 into rails:master Feb 14, 2017

2 checks passed

codeclimate: no new or fixed issues
continuous-integration/travis-ci/pr: The Travis CI build passed

kamipo added a commit to kamipo/rails that referenced this pull request Feb 14, 2017
