Simplify and optimize Postgres query for primary_keys() #27961

Merged
merged 1 commit into rails:master from jordanlewis:simplify-postgres-primary-keys-v2 on Feb 14, 2017

Conversation

@jordanlewis
Contributor

jordanlewis commented Feb 10, 2017

Take two on #27949, which was flawed because unnest doesn't always return rows in a consistent order. This version uses generate_subscripts, which should always return rows in a consistent order and has been available since Postgres 8.4.

In summary, this pull request simplifies the query used to determine the primary keys of a table in Postgres, yielding a roughly 66% speedup per primary-key lookup during application startup.

cc @kamipo (thanks for your patience!)

Here's the EXPLAIN data with the new and old queries on a sample table:

CREATE TABLE comp (a int, b int, c text, d numeric, primary key(a, b));

Before:

jordan=# EXPLAIN ANALYZE WITH pk_constraint AS (
  SELECT conrelid, unnest(conkey) AS connum FROM pg_constraint
  WHERE contype = 'p'
    AND conrelid = '"comp"'::regclass
), cons AS (
  SELECT conrelid, connum, row_number() OVER() AS rownum FROM pk_constraint
)
SELECT attr.attname FROM pg_attribute attr
INNER JOIN cons ON attr.attrelid = cons.conrelid AND attr.attnum = cons.connum
ORDER BY cons.rownum;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=101.59..101.61 rows=8 width=72) (actual time=0.900..0.900 rows=2 loops=1)
   Sort Key: cons.rownum
   Sort Method: quicksort  Memory: 25kB
   CTE pk_constraint
     ->  Seq Scan on pg_constraint  (cost=0.00..1.53 rows=100 width=6) (actual time=0.016..0.018 rows=2 loops=1)
           Filter: ((contype = 'p'::"char") AND (conrelid = '25096'::oid))
           Rows Removed by Filter: 11
   CTE cons
     ->  WindowAgg  (cost=0.00..3.25 rows=100 width=14) (actual time=0.020..0.026 rows=2 loops=1)
           ->  CTE Scan on pk_constraint  (cost=0.00..2.00 rows=100 width=6) (actual time=0.017..0.021 rows=2 loops=1)
   ->  Hash Join  (cost=3.50..96.70 rows=8 width=72) (actual time=0.869..0.875 rows=2 loops=1)
         Hash Cond: ((attr.attrelid = cons.conrelid) AND (attr.attnum = cons.connum))
         ->  Seq Scan on pg_attribute attr  (cost=0.00..73.78 rows=2578 width=70) (actual time=0.007..0.277 rows=2615 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=14) (actual time=0.047..0.047 rows=2 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  CTE Scan on cons  (cost=0.00..2.00 rows=100 width=14) (actual time=0.026..0.032 rows=2 loops=1)
 Planning time: 0.249 ms
 Execution time: 1.009 ms
(18 rows)

After:

jordan=# EXPLAIN ANALYZE SELECT a.attname
                  FROM pg_index i
            CROSS JOIN generate_subscripts(i.indkey, 1) k
                  JOIN pg_attribute a
                    ON a.attrelid=i.indrelid
                   AND a.attnum=i.indkey[k]
                 WHERE i.indrelid='comp'::regclass
                   AND i.indisprimary;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=14.26..46.47 rows=20 width=64) (actual time=0.061..0.064 rows=2 loops=1)
   Hash Cond: (i.indkey[k.k] = a.attnum)
   ->  Nested Loop  (cost=0.00..24.52 rows=1000 width=35) (actual time=0.035..0.037 rows=2 loops=1)
         ->  Seq Scan on pg_index i  (cost=0.00..4.51 rows=1 width=31) (actual time=0.026..0.028 rows=1 loops=1)
               Filter: (indisprimary AND (indrelid = '16903'::oid))
               Rows Removed by Filter: 123
         ->  Function Scan on generate_subscripts k  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.007..0.007 rows=2 loops=1)
   ->  Hash  (cost=14.20..14.20 rows=4 width=70) (actual time=0.017..0.017 rows=8 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..14.20 rows=4 width=70) (actual time=0.011..0.012 rows=8 loops=1)
               Index Cond: (attrelid = '16903'::oid)
 Planning time: 0.183 ms
 Execution time: 0.095 ms
(13 rows)
@rails-bot


rails-bot commented Feb 10, 2017

Thanks for the pull request, and welcome! The Rails team is excited to review your changes, and you should hear from @schneems (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

This repository is being automatically checked for code quality issues using Code Climate. You can see results for this analysis in the PR status below. Newly introduced issues should be fixed before a Pull Request is considered ready to review.

Please see the contribution instructions for more information.

activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb Outdated
JOIN pg_attribute a
ON a.attrelid=i.indrelid
AND a.attnum=i.indkey[k]
WHERE i.indrelid=#{quote(quote_table_name(table_name))}::regclass


@kamipo

kamipo Feb 10, 2017

Member

I confirmed that generate_subscripts behaves as expected.

% psql -E activerecord_unittest
psql (9.5.4)
Type "help" for help.

activerecord_unittest=# create table comp3 (a int, b int, c int, primary key (a,c, b));
CREATE TABLE
activerecord_unittest=# select a.attname from pg_index i cross join unnest(i.indkey) as k join pg_attribute a on a.attrelid = i.indrelid and a.attnum = k where indrelid='comp3'::regclass and i.indisprimary;
 attname 
---------
 a
 b
 c
(3 rows)

activerecord_unittest=# select a.attname from pg_index i cross join generate_subscripts(i.indkey, 1) k join pg_attribute a on a.attrelid = i.indrelid and a.attnum = i.indkey[k] where indrelid='comp3'::regclass and i.indisprimary;
 attname 
---------
 a
 c
 b
(3 rows)

activerecord_unittest=# 


@kamipo

kamipo Feb 10, 2017

Member

[nit] I prefer spaces around the equals sign, even in SQL, as in Style/SpaceAroundOperators.

rails/.rubocop.yml

Lines 66 to 67 in 55d66e2

Style/SpaceAroundOperators:
Enabled: true

We have a few lines missing spaces around the equals sign.

% git grep -n "\w=[\w#']" lib/active_record/connection_adapters
lib/active_record/connection_adapters/abstract_mysql_adapter.rb:556:        raw_table_options.sub!(/(ENGINE=\w+)(?: AUTO_INCREMENT=\d+)/, '\1')
lib/active_record/connection_adapters/abstract_mysql_adapter.rb:561:        if raw_table_options.sub!(/ COMMENT='.+'/, "")
lib/active_record/connection_adapters/abstract_mysql_adapter.rb:845:          # to work with MySQL 5.7.6 which sets optimizer_switch='derived_merge=on'
lib/active_record/connection_adapters/sqlite3_adapter.rb:348:            WHERE name=#{quote(row['name'])} AND type='index'
lib/active_record/connection_adapters/sqlite3_adapter.rb:352:            WHERE name=#{quote(row['name'])} AND type='index'

% git grep -n "}=[\w#']" lib/active_record/connection_adapters
lib/active_record/connection_adapters/abstract_mysql_adapter.rb:486:          execute("UPDATE #{quote_table_name(table_name)} SET #{quote_column_name(column_name)}=#{quote(default)} WHERE #{quote_column_name(column_name)} IS NULL")
lib/active_record/connection_adapters/postgresql/schema_statements.rb:528:            execute("UPDATE #{quote_table_name(table_name)} SET #{quote_column_name(column_name)}=#{quote_default_expression(default, column)} WHERE #{quote_column_name(column_name)} IS NULL") if column
lib/active_record/connection_adapters/sqlite3_adapter.rb:418:          exec_query("UPDATE #{quote_table_name(table_name)} SET #{quote_column_name(column_name)}=#{quote(default)} WHERE #{quote_column_name(column_name)} IS NULL")


@jordanlewis

jordanlewis Feb 10, 2017

Contributor

Added the spaces.

@sgrif


Member

sgrif commented Feb 10, 2017

While we're touching these, can we just refactor it to use information_schema, so we can share the code between every adapter except SQLite and Oracle? The query is roughly:

SELECT column_name FROM information_schema.key_column_usage
    WHERE constraint_name IN (
        SELECT table_constraints.constraint_name FROM information_schema.table_constraints WHERE constraint_type = 'PRIMARY KEY'
    ) AND table_name = $1 AND table_schema = $2

(If anyone wanted to refactor all of our schema lookup code to use information_schema, you can see examples of what all the queries would need to be here. The only things that need to differ between databases are the name of the column type column and what the value of table_schema needs to be scoped to when no schema is explicitly given.)
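
For illustration, a minimal sketch of what such a shared method might look like. This is not code from this PR: select_values and quote are the existing adapter helpers, while default_schema_scope is a hypothetical hook that each adapter would implement to return its default scoping expression (for example current_schema() on PostgreSQL or DATABASE() on MySQL).

# Sketch only: a shared primary_keys built on information_schema.
# default_schema_scope is hypothetical; it stands in for the per-adapter
# answer to "which schema should be searched when none was given".
def primary_keys(table_name) # :nodoc:
  select_values(<<-SQL.strip_heredoc, "SCHEMA")
    SELECT kcu.column_name
      FROM information_schema.key_column_usage kcu
      JOIN information_schema.table_constraints tc
        ON kcu.table_name = tc.table_name
       AND kcu.constraint_name = tc.constraint_name
     WHERE tc.constraint_type = 'PRIMARY KEY'
       AND kcu.table_name = #{quote(table_name)}
       AND kcu.table_schema = #{default_schema_scope}
  SQL
end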

@jordanlewis jordanlewis force-pushed the jordanlewis:simplify-postgres-primary-keys-v2 branch Feb 10, 2017

@jordanlewis


Contributor

jordanlewis commented Feb 10, 2017

@sgrif That seems like a pretty good idea. The query can't be optimized quite as well this way, but I suppose the DRY gain in the Rails codebase could be more important. What do you think?

I'm happy to update my patch as you suggest, but I think I'll stick to just modifying primary_keys for now to keep the scope of this change reasonable.

The query diff (before extending the refactoring to the rest of the adapters) would be:

diff --git a/activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb b/activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb
index 9072745200..c8785a06dd 100644
--- a/activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb
+++ b/activerecord/lib/active_record/connection_adapters/postgresql/schema_statements.rb
@@ -426,14 +426,13 @@ def pk_and_sequence_for(table) #:nodoc:

         def primary_keys(table_name) # :nodoc:
           select_values(<<-SQL.strip_heredoc, "SCHEMA")
-                SELECT a.attname
-                  FROM pg_index i
-            CROSS JOIN generate_subscripts(i.indkey, 1) k
-                  JOIN pg_attribute a
-                    ON a.attrelid = i.indrelid
-                   AND a.attnum = i.indkey[k]
-                 WHERE i.indrelid = #{quote(quote_table_name(table_name))}::regclass
-                   AND i.indisprimary
+            SELECT column_name
+              FROM information_schema.key_column_usage kcu
+              JOIN information_schema.table_constraints tc
+                ON kcu.table_name = tc.table_name
+               AND kcu.constraint_name = tc.constraint_name
+             WHERE constraint_type='PRIMARY KEY'
+               AND kcu.table_name=#{quote(quote_table_name(table_name))}
           SQL
         end

And here's the EXPLAIN ANALYZE output:

jordan=# EXPLAIN ANALYZE SELECT column_name FROM information_schema.key_column_usage kcu JOIN information_schema.table_constraints tc ON kcu.table_name = tc.table_name AND kcu.constraint_name = tc.constraint_name WHERE constraint_type='PRIMARY KEY' AND kcu.table_name='comp3';
                                                                                                                                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=57.65..97.05 rows=1 width=32) (actual time=0.269..0.299 rows=3 loops=1)
   ->  Hash Join  (cost=57.37..89.09 rows=15 width=40) (actual time=0.257..0.276 rows=3 loops=1)
         Hash Cond: (((c.conname)::information_schema.sql_identifier)::text = ("*SELECT* 1".constraint_name)::text)
         ->  Nested Loop  (cost=0.27..16.84 rows=1000 width=341) (actual time=0.198..0.214 rows=3 loops=1)
               Join Filter: (c.connamespace = nc.oid)
               Rows Removed by Join Filter: 5
               ->  Nested Loop  (cost=0.27..10.46 rows=1 width=172) (actual time=0.057..0.061 rows=1 loops=1)
                     Join Filter: (r.relnamespace = nr.oid)
                     Rows Removed by Join Filter: 3
                     ->  Nested Loop  (cost=0.27..9.34 rows=1 width=176) (actual time=0.044..0.046 rows=1 loops=1)
                           ->  Seq Scan on pg_constraint c  (cost=0.00..1.03 rows=1 width=104) (actual time=0.014..0.017 rows=3 loops=1)
                                 Filter: (contype = ANY ('{p,u,f}'::"char"[]))
                                 Rows Removed by Filter: 2
                           ->  Index Scan using pg_class_oid_index on pg_class r  (cost=0.27..8.30 rows=1 width=76) (actual time=0.008..0.008 rows=0 loops=3)
                                 Index Cond: (oid = c.conrelid)
                                 Filter: ((relkind = 'r'::"char") AND (((relname)::information_schema.sql_identifier)::text = 'comp3'::text))
                                 Rows Removed by Filter: 1
                     ->  Seq Scan on pg_namespace nr  (cost=0.00..1.07 rows=4 width=4) (actual time=0.007..0.011 rows=4 loops=1)
                           Filter: (NOT pg_is_other_temp_schema(oid))
                           Rows Removed by Filter: 2
               ->  Seq Scan on pg_namespace nc  (cost=0.00..1.06 rows=6 width=4) (actual time=0.002..0.005 rows=6 loops=1)
         ->  Hash  (cost=57.06..57.06 rows=3 width=64) (actual time=0.051..0.051 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Append  (cost=0.27..57.06 rows=3 width=64) (actual time=0.043..0.049 rows=1 loops=1)
                     ->  Subquery Scan on "*SELECT* 1"  (cost=0.27..11.66 rows=1 width=64) (actual time=0.043..0.048 rows=1 loops=1)
                           ->  Nested Loop  (cost=0.27..11.65 rows=1 width=288) (actual time=0.042..0.046 rows=1 loops=1)
                                 Join Filter: (r_1.relnamespace = nr_1.oid)
                                 Rows Removed by Join Filter: 3
                                 ->  Nested Loop  (cost=0.27..10.52 rows=1 width=132) (actual time=0.034..0.037 rows=1 loops=1)
                                       Join Filter: (c_1.connamespace = nc_1.oid)
                                       Rows Removed by Join Filter: 5
                                       ->  Nested Loop  (cost=0.27..9.39 rows=1 width=136) (actual time=0.030..0.031 rows=1 loops=1)
                                             ->  Seq Scan on pg_constraint c_1  (cost=0.00..1.05 rows=1 width=72) (actual time=0.011..0.013 rows=3 loops=1)
                                                   Filter: ((contype <> ALL ('{t,x}'::"char"[])) AND (((CASE contype WHEN 'c'::"char" THEN 'CHECK'::text WHEN 'f'::"char" THEN 'FOREIGN KEY'::text WHEN 'p'::"char" THEN 'PRIMARY KEY'::text WHEN 'u'::"char" THEN 'UNIQUE'::text ELSE NULL::text END)::information_schema.character_data)::text = 'PRIMARY KEY'::text))
                                                   Rows Removed by Filter: 2
                                             ->  Index Scan using pg_class_oid_index on pg_class r_1  (cost=0.27..8.33 rows=1 width=72) (actual time=0.005..0.005 rows=0 loops=3)
                                                   Index Cond: (oid = c_1.conrelid)
                                                   Filter: ((relkind = 'r'::"char") AND (((relname)::information_schema.sql_identifier)::text = 'comp3'::text) AND (pg_has_role(relowner, 'USAGE'::text) OR has_table_privilege(oid, 'INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER'::text) OR has_any_column_privilege(oid, 'INSERT, UPDATE, REFERENCES'::text)))
                                                   Rows Removed by Filter: 1
                                       ->  Seq Scan on pg_namespace nc_1  (cost=0.00..1.06 rows=6 width=4) (actual time=0.003..0.004 rows=6 loops=1)
                                 ->  Seq Scan on pg_namespace nr_1  (cost=0.00..1.07 rows=4 width=4) (actual time=0.003..0.005 rows=4 loops=1)
                                       Filter: (NOT pg_is_other_temp_schema(oid))
                                       Rows Removed by Filter: 2
                     ->  Subquery Scan on "*SELECT* 2"  (cost=0.28..45.40 rows=2 width=64) (actual time=0.001..0.001 rows=0 loops=1)
                           ->  Result  (cost=0.28..45.38 rows=2 width=288) (actual time=0.001..0.001 rows=0 loops=1)
                                 One-Time Filter: ((('CHECK'::character varying)::information_schema.character_data)::text = 'PRIMARY KEY'::text)
                                 ->  Nested Loop  (cost=0.28..45.38 rows=2 width=288) (never executed)
                                       ->  Nested Loop  (cost=0.00..27.13 rows=1 width=72) (never executed)
                                             Join Filter: (nr_2.oid = r_2.relnamespace)
                                             ->  Seq Scan on pg_class r_2  (cost=0.00..26.01 rows=1 width=72) (never executed)
                                                   Filter: ((relkind = 'r'::"char") AND (((relname)::information_schema.sql_identifier)::text = 'comp3'::text) AND (pg_has_role(relowner, 'USAGE'::text) OR has_table_privilege(oid, 'INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER'::text) OR has_any_column_privilege(oid, 'INSERT, UPDATE, REFERENCES'::text)))
                                             ->  Seq Scan on pg_namespace nr_2  (cost=0.00..1.07 rows=4 width=4) (never executed)
                                                   Filter: (NOT pg_is_other_temp_schema(oid))
                                       ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a_1  (cost=0.28..18.16 rows=2 width=6) (never executed)
                                             Index Cond: ((attrelid = r_2.oid) AND (attnum > 0))
                                             Filter: (attnotnull AND (NOT attisdropped))
   ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..0.52 rows=1 width=70) (actual time=0.005..0.005 rows=1 loops=3)
         Index Cond: ((attrelid = r.oid) AND (attnum = ((information_schema._pg_expandarray(c.conkey))).x))
         Filter: ((NOT attisdropped) AND (pg_has_role(r.relowner, 'USAGE'::text) OR has_column_privilege(r.oid, attnum, 'SELECT, INSERT, UPDATE, REFERENCES'::text)))
 Planning time: 1.487 ms
 Execution time: 0.574 ms
(61 rows)
@sgrif


Member

sgrif commented Feb 10, 2017

Seems fine to me.

@jeremy


Member

jeremy commented Feb 10, 2017

I'm a fan of leaning on information_schema by default. If the motivation is to speed these up, though, information_schema performance will often disappoint.

@sgrif


Member

sgrif commented Feb 10, 2017

Do we have an actual use case where this is a hot spot? Ultimately rake db:schema:cache:dump makes the performance of schema lookups pretty much moot, doesn't it?

@jordanlewis jordanlewis force-pushed the jordanlewis:simplify-postgres-primary-keys-v2 branch 2 times, most recently Feb 13, 2017

@jordanlewis


Contributor

jordanlewis commented Feb 13, 2017

@sgrif Good point. My motivation here had more to do with simplifying the query than optimizing it, to improve compatibility for Postgres backends that might not support common table expressions, a feature that is only used in this query; the speedup was just a side effect.

I've updated the patch so the query uses information_schema. I removed a few tests that check that primary_keys throws an invalid-query exception if the table doesn't exist or isn't available from the current search path, since that's not easy to accomplish without the old approach (::regclass casts). I don't think that's an important semantic of the API, though, since none of the other adapters have that behavior, which is why I removed the tests instead of trying to fix them.

I spent a little time seeing what it would take to move the primary_keys method up to the abstract adapter. I think it would take a decent amount of work, since the Postgres adapter has its own logic in extract_schema_qualified_name, and I'm not familiar enough with the codebase to take that on right now. So I've submitted this patch as-is, and hopefully someone with more expertise can take on unifying the implementations.

@sgrif


Member

sgrif commented Feb 13, 2017

Yeah, the definition of "schema" varies by backend (and among the in-tree adapters it's only relevant to Postgres), so you'll probably want to have the abstract version ask the adapter to split the string into schema and table name (with a default implementation returning [nil, table_name]). This looks good to me once CI is green. Thanks!
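
A minimal sketch of that suggestion (the method name is hypothetical, and the PostgreSQL override is only a guess at one possible shape, reusing the adapter's existing Utils.extract_schema_qualified_name and the schema/identifier readers on the name object it returns):

# Sketch only. In the abstract adapter: no schema notion by default.
def split_schema_and_table(table_name)
  [nil, table_name]
end

# In the PostgreSQL adapter: one possible override that reuses the
# existing name parsing (assumed shape, not code from this PR).
def split_schema_and_table(table_name)
  name = Utils.extract_schema_qualified_name(table_name.to_s)
  [name.schema, name.identifier]
end

The shared information_schema query could then compare kcu.table_schema against the returned schema when one was given, and fall back to the adapter's default scope otherwise.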

@jordanlewis jordanlewis force-pushed the jordanlewis:simplify-postgres-primary-keys-v2 branch Feb 13, 2017

Simplify and optimize Postgres query for primary_keys()
primary_keys(table) needs to query various metadata tables in Postgres
to determine the primary key for the table. Previously, it did so using
a complex common table expression against pg_constraint and
pg_attribute.

This patch rewrites the query to use information_schema tables. That
makes the query far easier to understand and avoids an expensive
unnest, a window function, and a common table expression.

@jordanlewis jordanlewis force-pushed the jordanlewis:simplify-postgres-primary-keys-v2 branch to b8e3af7 Feb 13, 2017

@jordanlewis


Contributor

jordanlewis commented Feb 13, 2017

Cool, the build's green now. Thanks for the tips, @sgrif!

@jeremy jeremy requested a review from sgrif Feb 14, 2017

@sgrif sgrif merged commit b9661c7 into rails:master Feb 14, 2017

2 checks passed

codeclimate: no new or fixed issues
continuous-integration/travis-ci/pr: The Travis CI build passed

kamipo added a commit to kamipo/rails that referenced this pull request Feb 14, 2017
