Count behavior #201

Merged
merged 3 commits into rails:master on Mar 30, 2011

Contributor

jmileham commented Mar 4, 2011

Hello, this may be the wrong forum for a patch like this, but Lighthouse isn't accepting my wordy writeup due to apparent spamminess, so I'm submitting it as a pull request instead. Let me know if I'm out of line.

This patch is related to:

https://rails.lighthouseapp.com/projects/8994-ruby-on-rails/tickets/5060

I've got a different approach to the problem that seems (to me) to be more semantically "correct" and has some performance bennies as well.

I propose changing the semantics to make :limit and :offset happen before count, which makes them consistent with the other ActiveRecord aggregate calculations -- e.g. sum(:limit => foo) operates only on the limited rows.

If we do that, then there are two performance wins:

  1. If the app developer desires a count of a limited set (e.g. the number of records that will be shown on the next page) of a ginormous table, they won't have to choose between instantiating the records (to use Array#length) and having the database count every row in the table that matches before returning. With this approach, the database will shortcut its count as soon as it collects offset + limit rows.
  2. If an app ever does a count(:limit => 0) in some corner case, we don't have to hit the database at all. We can just return zero.

It also means that count == length == size, so this issue is resolved.
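Concretely, the proposed behavior would look something like this (the model and conditions are just illustrative, not taken from the patch's tests):

page = Product.where("name LIKE 'some%'").limit(25).offset(50)
page.count                  # => at most 25; the database can stop counting once it has seen offset + limit matching rows
Product.count(:limit => 0)  # => 0, returned without hitting the database at all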

Attached is a patch that implements what I described. Note that I also changed the tests for issue 6268, since, if this approach is acceptable, those constraints would no longer apply.

On the "why'd he do that?" side of things, since Arel doesn't support subqueries, there's some trickery to make that happen in calculations.build_count_subquery. I'd really like to be able to do this inside Arel, but it appears that's not possible right now. Given that relational algebra doesn't define limit and offset, Arel could alternatively change the behavior of its own count operation for future versions in a similar way without opening the floodgates to generic subqueries.

Admittedly this patch is pretty edgy because of the Arel workaround, so even if you buy the semantic change, I could see wanting to change the behavior at the Arel layer instead. I just wanted to submit this as a proof of concept (all tests pass on the 4 default DB drivers) and see what people think. Likewise, there may be legitimate reasons that this approach isn't desirable. Please let me know what you think and I'll be glad to move the ball forward.

Thanks,
-john

Member

amatsuda commented Mar 10, 2011

+1 for this approach. Works perfectly for me (with my own pagination library).

Owner

pixeltrix commented Mar 12, 2011

-1 on this from me. Doing the count on a subquery forces the database to do a full scan of the subquery. This can be very slow even on quite small datasets (e.g. on my MacBook it takes 0.5 seconds to run 'SELECT COUNT(*) FROM (SELECT * FROM products LIMIT 500 OFFSET 0) AS t' where the products table contains 3000 records).

If a developer needs to know the number of records on the next page then this can be calculated very cheaply from a will_paginate or thinking-sphinx result, e.g:

# Given a results instance r
[[r.total_entries - r.offset, r.per_page].min, 0].max 
Contributor

jmileham commented Mar 12, 2011

Hi Andrew,

That's a pretty surprising result. Here's a Postgres 8.4.4 query plan for the two forms against a 2 million row table running on my MacBook. Note that the number of rows touched by the sequential scan in the subquery form is cut short as soon as the LIMIT is reached. Also, the runtimes appear to strongly favor the subquery approach. Obviously each database's query optimizer is different, but this is an operation that at least Postgres can optimize extremely well. I believe Sqlite3 and MySQL can do the same, but I'll do more research.

I definitely don't disagree that the count of the next page's records is easy to arrive at once you've done a count of the unlimited query (which, as the example below shows, can be quite expensive in and of itself) and min'd that against your page size. Alternatively, you can avoid hitting all the matching rows in the table by calling to_a and instantiating the whole next pageful of records (which could be moderately expensive if there are a number of branchy includes). Otherwise, though, the present state of ActiveRecord doesn't let you get at what I see as a very useful optimization provided by the database, and it's also not semantically in line with the rest of ActiveRecord's aggregate calculations.

Cheers,
-john

count_test=# select count(*) from stuff;
  count  
---------
 2097152
(1 row)

count_test=# explain analyze select count(*) from stuff;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=50046.40..50046.41 rows=1 width=0) (actual time=623.414..623.414 rows=1 loops=1)
   ->  Seq Scan on stuff  (cost=0.00..44803.52 rows=2097152 width=0) (actual time=0.017..391.077 rows=2097152 loops=1)
 Total runtime: 623.484 ms
(3 rows)


count_test=# explain analyze select count(*) from (select 1 from stuff limit 100) as t;

                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3.39..3.40 rows=1 width=0) (actual time=0.054..0.054 rows=1 loops=1)
   ->  Limit  (cost=0.00..2.14 rows=100 width=0) (actual time=0.010..0.046 rows=100 loops=1)
         ->  Seq Scan on stuff  (cost=0.00..44803.52 rows=2097152 width=0) (actual time=0.009..0.024 rows=100 loops=1)
 Total runtime: 0.078 ms
(4 rows)
Contributor

jmileham commented Mar 12, 2011

Also, as a quick note: in the patch, no subquery is generated unless there's a :limit or :offset on the relation, so there is no penalty for existing code that doesn't pass :limit or :offset. relation.except(:limit, :offset).count will result in the SQL you'd expect.
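For illustration, with an assumed products table (the generated SQL here is approximate):

Product.where("name LIKE 'some%'").limit(100).count
# SELECT COUNT(*) FROM (SELECT 1 FROM products WHERE name LIKE 'some%' LIMIT 100) AS t

Product.where("name LIKE 'some%'").limit(100).except(:limit, :offset).count
# SELECT COUNT(*) FROM products WHERE name LIKE 'some%'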

-john

Contributor

jmileham commented Mar 15, 2011

I checked sqlite3 and MySQL and both will shortcut the count of a limited subquery, returning extremely quickly when counting 100 matching rows of a 2 million-row table. Note that MySQL appears to cache the count of a table, which lets it avoid a full table scan for a trivial select count(*) from tablename. But when you make it do a little bit of work (in the example below, the LIKE expression matches every row), which is very likely to be the case in any scoped relation, the performance win of the subquery becomes clear:

This looks fast:

mysql> select count(*) from stuff;
+----------+
| count(*) |
+----------+
|  2097152 |
+----------+
1 row in set (0.00 sec)

But when you make the database perform some manner of scan (in this case a full table scan, though in less-contrived use cases, you'd be dealing with index scans and a smaller performance hit), things get sluggish:

mysql> select count(*) from stuff where value like 'some%';
+----------+
| count(*) |
+----------+
|  2097152 |
+----------+
1 row in set (0.47 sec)

Though if you don't care to know how many matching rows there are in the whole table, and just want to know how many results will show up on a given page, then this is really fast:

mysql> select count(*) from (select 1 from stuff where value like 'some%' limit 100) t;
+----------+
| count(*) |
+----------+
|      100 |
+----------+
1 row in set (0.00 sec)

Sqlite3 exhibits similar behavior:

sqlite> select count(*) from stuff;
2097152
CPU Time: user 0.043846 sys 0.102457
sqlite> select count(*) from stuff where value like 'some%';
2097152
CPU Time: user 1.080249 sys 0.117463
sqlite> select count(*) from (select 1 from stuff where value like 'some%' limit 100) t;
100
CPU Time: user 0.000237 sys 0.000073

From reading Sqlite's documentation, it appears its subqueries are implemented by materializing the subquery into a temporary table and then driving the main query off that result. I suspect that characteristic may have contributed to the slowness mentioned on Andrew's database (though a time of 500ms for a scan of 500 rows is still a surprise). Since Sqlite performs the subquery in its entirety, if the products table has a lot of columns or large TEXT columns, reading that data and burning it down to a temporary table would be an expensive process with the "SELECT *" from his example. Andrew's example SQL is not what the patch I submitted generates, though -- my patch performs a simple "SELECT 1" inside the subquery (if no column is explicitly requested) or a select on the requested column if one is.

Andrew, does this address your concerns? I'd be glad to do more research if there's an angle I'm overlooking. If anybody has any feedback, I'd appreciate your input.

Thanks,
-john

Owner

pixeltrix commented Mar 15, 2011

I didn't see the "SELECT 1" - that does make it quicker

Owner

tenderlove commented Mar 23, 2011

Adding this on master seems fine to me.

@jmileham arel supports subqueries just fine. Is there a particular feature you're looking for?

Contributor

jmileham commented Mar 23, 2011

Thanks Aaron,

Cool... the ability to use subqueries with #in is no doubt far more valuable in everyday use than what I needed it to do. :) From a quick browse of the Arel code, though, it appears that Arel supports subqueries in the where clause as arguments to #in and #not_in, and as arguments of #union, #intersect and #except, but it can't use a SelectManager as the argument of #from. That's a kind of subquery that may not be philosophically in line with relational algebra, so maybe that's not a bad thing. But basically Arel can't presently do things like this without rendering the SQL of the subquery out first:

https://github.com/jmileham/rails/blob/d5994ee48af14d67f0eec7d23863d4b19211b078/activerecord/lib/active_record/relation/calculations.rb#L324-325

This could probably be done if Arel could visit SelectManager, but I'm not sure whether that's the right thing or not. Alternatively, Arel could probably wrap this into the behavior of its own count operation, because it's arguable that Arel shouldn't even be capable of rendering the SQL SELECT COUNT(*) FROM t LIMIT 3; since the LIMIT is noise in that context.
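The workaround amounts to something like this (a simplified sketch with assumed variable and alias names, not the linked lines verbatim):

# Render the limited relation's SQL ourselves, then hand it back to Arel as a
# raw fragment, since #from won't currently accept a SelectManager.
subquery_sql = relation.arel.to_sql
outer = Arel::SelectManager.new(Arel::Table.engine)
outer.project(Arel.sql('COUNT(*)'))
outer.from(Arel.sql("(#{subquery_sql}) AS subquery_for_count"))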

Owner

tenderlove commented Mar 24, 2011

@jmileham you can just call ast on the select manager. ARel does not do "relational algebra"; it merely manages an SQL AST. Since one AST is a valid subtree of another AST, you should be able to pull the AST from one select manager and pass it to another select manager.

As for aliasing your subquery, we can construct an AS node. Off the top of my head, you could do something like this:

# Parenthesize sm1's AST and alias it so it can be used as a subquery.
as = Arel::Nodes::As.new sm1.grouping(sm1.ast), Arel.sql('omg')
# Use the aliased subquery as the FROM clause of the second manager.
sm2.project('whatever').from(as)

Probably we need a factory method for As nodes, but I see no reason why Arel cannot handle this use case today.
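For the count use case in this pull request, that pattern might look something like this (a sketch with assumed names, not the eventual patch):

products = Arel::Table.new(:products)
limited  = products.project(Arel.sql('1')).take(100)   # the subquery we want to count
as       = Arel::Nodes::As.new limited.grouping(limited.ast), Arel.sql('subquery_for_count')
counter  = Arel::SelectManager.new(Arel::Table.engine)
counter.project(Arel.sql('COUNT(*)')).from(as)
counter.to_sql
# => roughly: SELECT COUNT(*) FROM (SELECT 1 FROM "products" LIMIT 100) AS subquery_for_count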

Owner

tenderlove commented Mar 24, 2011

This commit should do the trick.

Contributor

jmileham commented Mar 24, 2011

Awesome. I didn't know it was legit to touch the ast from outside of ARel, so I didn't pursue the avenue of unwrapping the select manager and playing with its goodies; I figured that if that were meant to be, ARel would let me simply pass the select manager in directly. This will certainly lead to a cleaner implementation. I'll wrap my head around it and submit a revised patch.

Owner

tenderlove commented Mar 24, 2011

Yes. The AST is all yours. I am happy to accept pull requests that make the methods accept manager objects and extract the ast rather than passing the ast. ;-)

Contributor

jmileham commented Mar 24, 2011

Note that this latest commit won't work unless you're running the from_select_mgr branch of jmileham/arel or my Arel patch gets accepted... :) If that doesn't fly, I'll refactor my commit. @tenderlove's Arel commit from yesterday was more than enough to get the job done.

Owner

tenderlove commented Mar 30, 2011

I've merged this into master, so I'm closing now. Thanks!

tenderlove merged commit 28c73f0 into rails:master Mar 30, 2011

Owner

tenderlove commented Mar 30, 2011

oops, clicked the wrong button.

Contributor

jmileham commented Mar 30, 2011

Thank you!

bogdan referenced this pull request Feb 5, 2016: Introduced AR::Relation more_than? and similar methods #23477 (Closed)

hisas pushed a commit to hisas/rails that referenced this pull request May 9, 2017:

urbanautomaton: YAML serialization preserves crucial instance variables. Follows from pull request #201 (ec2c8f0)