New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group trashes select values #16291

Closed
davmac314 opened this Issue Jul 25, 2014 · 14 comments

Comments

Projects
None yet
5 participants
@davmac314

davmac314 commented Jul 25, 2014

When I do a query with both select and group, the group trashes the values I specified with select, eg:

User.select('a', 'sum(b) as sum_b').group('a')

... will not include a sum_b column in the result. This appears to be caused by activerecord/lib/active_record/relation/calculation.rb, line ~299 (in both 4.1.4 and master):

  select_values = [
    operation_over_aggregate_column(
      aggregate_column(column_name),
      operation,
      distinct).as(aggregate_alias)
  ]
  select_values += select_values unless having_values.empty?

  select_values.concat group_fields.zip(group_aliases).map { |field,aliaz|
    if field.respond_to?(:as)
      field.as(aliaz)
    else
      "#{field} AS #{aliaz}"
    end
  }

  relation = except(:group)
  relation.group_values  = group
  relation.select_values = select_values

Essentially the last line trashes the previously set relation.select_values. One possible fix appears to be to change the last line to:

  relation.select_values = select_values

... but I am not particularly certain that this is correct in all cases, and anyway I'm not sure if the select_values should really be added to if they have already been specified.

@hmaurer

This comment has been minimized.

Show comment
Hide comment
@hmaurer

hmaurer Jul 27, 2014

Contributor

Could you provide a bug report in the form of an executable test case, using this script ? Thanks ❤️

Contributor

hmaurer commented Jul 27, 2014

Could you provide a bug report in the form of an executable test case, using this script ? Thanks ❤️

@davmac314

This comment has been minimized.

Show comment
Hide comment
@davmac314

davmac314 Aug 1, 2014

I'm so far unable to reproduce the problem when I try to boil it down to a script, but I am still looking into it.

davmac314 commented Aug 1, 2014

I'm so far unable to reproduce the problem when I try to boil it down to a script, but I am still looking into it.

@davmac314

This comment has been minimized.

Show comment
Hide comment
@davmac314

davmac314 Aug 1, 2014

Ok, got it. The problem occurs when calling empty? on a relation clause.

unless File.exist?('Gemfile')
  File.write('Gemfile', <<-GEMFILE)
source 'https://rubygems.org'
gem 'rails', github: 'rails/rails'
gem 'arel', github: 'rails/arel'
gem 'rack', github: 'rack/rack'
gem 'i18n', github: 'svenfuchs/i18n'
gem 'sqlite3'
GEMFILE

  system 'bundle'
end

require 'bundler'
Bundler.setup(:default)

require 'active_record'
require 'minitest/autorun'
require 'logger'

# This connection will do for database-independent bug reports.
ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: ':memory:')
ActiveRecord::Base.logger = Logger.new(STDOUT)

ActiveRecord::Schema.define do
  create_table :posts do |t|
    t.integer :groupid
  end
end

class Post < ActiveRecord::Base
end

class BugTest < Minitest::Test
  def test_association_stuff
    post = Post.create!(:groupid => 1)

    posts = Post.select('posts.id', 'SUM(posts.groupid) AS xxx')
        .group('posts.id')
        .order('xxx')

    assert_equal false, posts.empty?
  end
end

davmac314 commented Aug 1, 2014

Ok, got it. The problem occurs when calling empty? on a relation clause.

unless File.exist?('Gemfile')
  File.write('Gemfile', <<-GEMFILE)
source 'https://rubygems.org'
gem 'rails', github: 'rails/rails'
gem 'arel', github: 'rails/arel'
gem 'rack', github: 'rack/rack'
gem 'i18n', github: 'svenfuchs/i18n'
gem 'sqlite3'
GEMFILE

  system 'bundle'
end

require 'bundler'
Bundler.setup(:default)

require 'active_record'
require 'minitest/autorun'
require 'logger'

# This connection will do for database-independent bug reports.
ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: ':memory:')
ActiveRecord::Base.logger = Logger.new(STDOUT)

ActiveRecord::Schema.define do
  create_table :posts do |t|
    t.integer :groupid
  end
end

class Post < ActiveRecord::Base
end

class BugTest < Minitest::Test
  def test_association_stuff
    post = Post.create!(:groupid => 1)

    posts = Post.select('posts.id', 'SUM(posts.groupid) AS xxx')
        .group('posts.id')
        .order('xxx')

    assert_equal false, posts.empty?
  end
end
@rafaelfranca

This comment has been minimized.

Show comment
Hide comment
@rafaelfranca

rafaelfranca Aug 1, 2014

Member

empty? removed the select values because it needs to call count(:all) or it will fail in some conditions. If we fix your problem we will break other cases. Since we can't fix both I recommend to use count instead of empty?

Member

rafaelfranca commented Aug 1, 2014

empty? removed the select values because it needs to call count(:all) or it will fail in some conditions. If we fix your problem we will break other cases. Since we can't fix both I recommend to use count instead of empty?

@davmac314

This comment has been minimized.

Show comment
Hide comment
@davmac314

davmac314 Aug 2, 2014

You cannot be serious.

This is clearly a bug. You can't close it just because you can't immediately see how to fix it. My suggested change (which I made only tentatively) might break other cases, but that doesn't mean that there is no possible fix.

If, as you say, empty? needs to remove the select values, then it should at least throw an exception when select values have been specified, with a string to the effect of "'empty?' doesn't work when 'select' has been specified", rather than forcing users to spend time debugging an issue that isn't even a fault in their own code.

However, I simply cannot believe that there is no possible fix. For instance, it should be possible for empty? to just set limit 1 and run the query, returning true if there was 0 rows in the result. Using count(*) is just inefficient; it forces the rows the be counted even though we don't care about the number.

In any case, please re-open this bug. There is presently no satisfactory resolution.

davmac314 commented Aug 2, 2014

You cannot be serious.

This is clearly a bug. You can't close it just because you can't immediately see how to fix it. My suggested change (which I made only tentatively) might break other cases, but that doesn't mean that there is no possible fix.

If, as you say, empty? needs to remove the select values, then it should at least throw an exception when select values have been specified, with a string to the effect of "'empty?' doesn't work when 'select' has been specified", rather than forcing users to spend time debugging an issue that isn't even a fault in their own code.

However, I simply cannot believe that there is no possible fix. For instance, it should be possible for empty? to just set limit 1 and run the query, returning true if there was 0 rows in the result. Using count(*) is just inefficient; it forces the rows the be counted even though we don't care about the number.

In any case, please re-open this bug. There is presently no satisfactory resolution.

@rafaelfranca

This comment has been minimized.

Show comment
Hide comment
@rafaelfranca

rafaelfranca Aug 2, 2014

Member

However, I simply cannot believe that there is no possible fix. For instance, it should be possible for empty? to just set limit 1 and run the query, returning true if there was 0 rows in the result.

This is exactly how exists? works. As I said, you are just using the wrong method.

I'm not saying that there is no way to fix it. I'm saying that fixing it will break other people application that expect this exactly behaviour.

We could raise an exception when there are select values specified but again this will break some valid use cases that use select values that will not give any problem.

Member

rafaelfranca commented Aug 2, 2014

However, I simply cannot believe that there is no possible fix. For instance, it should be possible for empty? to just set limit 1 and run the query, returning true if there was 0 rows in the result.

This is exactly how exists? works. As I said, you are just using the wrong method.

I'm not saying that there is no way to fix it. I'm saying that fixing it will break other people application that expect this exactly behaviour.

We could raise an exception when there are select values specified but again this will break some valid use cases that use select values that will not give any problem.

@rafaelfranca rafaelfranca reopened this Aug 2, 2014

@davmac314

This comment has been minimized.

Show comment
Hide comment
@davmac314

davmac314 Aug 3, 2014

Thank you for re-opening.

It is my understanding that a query object should behave where possible like an Array or similar colleciton; that's why it has an 'empty?' method in the first place. The documentation here:

http://api.rubyonrails.org/classes/ActiveRecord/Relation.html#method-i-empty-3F

Says that empty? "returns true if there are no records", which it clearly fails to do in the test case I have provided. If other people's applications depend on the current behavior, then surely they are depending on undocumented behavior. Could you give me an example where the current behavior is desired? I just don't understand why you'd want this method to be broken.

davmac314 commented Aug 3, 2014

Thank you for re-opening.

It is my understanding that a query object should behave where possible like an Array or similar colleciton; that's why it has an 'empty?' method in the first place. The documentation here:

http://api.rubyonrails.org/classes/ActiveRecord/Relation.html#method-i-empty-3F

Says that empty? "returns true if there are no records", which it clearly fails to do in the test case I have provided. If other people's applications depend on the current behavior, then surely they are depending on undocumented behavior. Could you give me an example where the current behavior is desired? I just don't understand why you'd want this method to be broken.

@davmac314

This comment has been minimized.

Show comment
Hide comment
@davmac314

davmac314 Aug 3, 2014

I mean, just to clarify, you're saying that it shouldn't be fixed because some other people are relying on the broken behavior, right? Or am I misunderstanding?

davmac314 commented Aug 3, 2014

I mean, just to clarify, you're saying that it shouldn't be fixed because some other people are relying on the broken behavior, right? Or am I misunderstanding?

@matthewd

This comment has been minimized.

Show comment
Hide comment
@matthewd

matthewd Aug 3, 2014

Member

a query object should behave where possible like an Array or similar collection

group makes this vastly less true: see the different behaviour of the aggregate methods (including count), for example.

However, I see no reason that count(:all) (or any other implementation of empty?) shouldn't trash the select list: it is, by definition, not going to use it. Perhaps you're actually reporting that if we're going to trash the select list, we should also trash the equally-useless order list, because 1) it's useless, and 2) it may be depending on something in the select.

Member

matthewd commented Aug 3, 2014

a query object should behave where possible like an Array or similar collection

group makes this vastly less true: see the different behaviour of the aggregate methods (including count), for example.

However, I see no reason that count(:all) (or any other implementation of empty?) shouldn't trash the select list: it is, by definition, not going to use it. Perhaps you're actually reporting that if we're going to trash the select list, we should also trash the equally-useless order list, because 1) it's useless, and 2) it may be depending on something in the select.

@davmac314

This comment has been minimized.

Show comment
Hide comment
@davmac314

davmac314 Aug 3, 2014

However, I see no reason that count(:all) (or any other implementation of empty?) shouldn't trash the select list

See my testcase. If you trash the select list, you break anything which relies on the existence of an alias in that list. Now, in this case, you could certainly have empty? remove both the select and order lists, and that would make the test case above pass. I'm not sure that it would be valid in all cases, though; a 'where' clause can also reference aliases from the select list.

Why can't empty? (and count) just do a count from a nested subquery? I.e. if the original query is "XXX" and you call empty? it should do something like:

SELECT COUNT(*) FROM (XXX LIMIT 1) xxx_table;

davmac314 commented Aug 3, 2014

However, I see no reason that count(:all) (or any other implementation of empty?) shouldn't trash the select list

See my testcase. If you trash the select list, you break anything which relies on the existence of an alias in that list. Now, in this case, you could certainly have empty? remove both the select and order lists, and that would make the test case above pass. I'm not sure that it would be valid in all cases, though; a 'where' clause can also reference aliases from the select list.

Why can't empty? (and count) just do a count from a nested subquery? I.e. if the original query is "XXX" and you call empty? it should do something like:

SELECT COUNT(*) FROM (XXX LIMIT 1) xxx_table;
@matthewd

This comment has been minimized.

Show comment
Hide comment
@matthewd

matthewd Aug 3, 2014

Member

a 'where' clause can also reference aliases from the select list

Hmm... I don't know of a DB where that's true -- and AFAIK, it's disallowed for semantic reasons, not because it's difficult.

Why can't empty? (and count) just do a count from a nested subquery?

They could... the most obvious trade-off that comes to mind for me is that you're now (probably -- depending on how clever your DBMS is) evaluating all the entries in the SELECT clause, only to then discard them. And that can be very expensive.

.. LIMIT 1 ..

empty? deliberately does a plain count (instead of a LIMIT 1, or an EXISTS, say) as a performance optimisation (read: trade-off): we bet that if there are enough rows for the difference to matter, we'll probably have need of the actual count as well. Thus we trade a potentially-slower query against the probability of needing to run two.

Member

matthewd commented Aug 3, 2014

a 'where' clause can also reference aliases from the select list

Hmm... I don't know of a DB where that's true -- and AFAIK, it's disallowed for semantic reasons, not because it's difficult.

Why can't empty? (and count) just do a count from a nested subquery?

They could... the most obvious trade-off that comes to mind for me is that you're now (probably -- depending on how clever your DBMS is) evaluating all the entries in the SELECT clause, only to then discard them. And that can be very expensive.

.. LIMIT 1 ..

empty? deliberately does a plain count (instead of a LIMIT 1, or an EXISTS, say) as a performance optimisation (read: trade-off): we bet that if there are enough rows for the difference to matter, we'll probably have need of the actual count as well. Thus we trade a potentially-slower query against the probability of needing to run two.

@davmac314

This comment has been minimized.

Show comment
Hide comment
@davmac314

davmac314 Aug 4, 2014

Hmm... I don't know of a DB where that's true

Ok, that may be true, although it surprises me. I just tested with Mysql and it does not allow using the column aliases in the where clause. In any case aliases (in Mysql) can be referenced in both 'group by' and 'order by' clauses, and 'group by' clauses cannot be thrown away for a count operation without changing the semantics.

They could... the most obvious trade-off that comes to mind for me is that you're now (probably -- depending on how clever your DBMS is) evaluating all the entries in the SELECT clause, only to then discard them. And that can be very expensive.

I would think any DBMS worth it's salt would be able to optimise this, but ok, let's suppose it's better to keep the current behavior (no subquery) in cases where it actually does the right thing.

empty? deliberately does a plain count (instead of a LIMIT 1, or an EXISTS, say) as a performance optimisation (read: trade-off): we bet that if there are enough rows for the difference to matter, we'll probably have need of the actual count as well. Thus we trade a potentially-slower query against the probability of needing to run two.

Ok, that's fine; but the current trade-off seems to be incorrect behavior (and a failing query) rather than just slower performance.

In general, I don't think you can add a 'count' clause when a 'group by' clause is present and expect it to return a count of result rows. You will instead get a set of rows which are grouped as per the 'group by' clause and which contains a count of the items in each group. (That is to say, losing the 'select' list isn't the only issue demonstrated with the test case.) It's necessary to use a subquery in this case, I think. Perhaps a subquery could be used if-and-only-if there is a 'group by' clause.

davmac314 commented Aug 4, 2014

Hmm... I don't know of a DB where that's true

Ok, that may be true, although it surprises me. I just tested with Mysql and it does not allow using the column aliases in the where clause. In any case aliases (in Mysql) can be referenced in both 'group by' and 'order by' clauses, and 'group by' clauses cannot be thrown away for a count operation without changing the semantics.

They could... the most obvious trade-off that comes to mind for me is that you're now (probably -- depending on how clever your DBMS is) evaluating all the entries in the SELECT clause, only to then discard them. And that can be very expensive.

I would think any DBMS worth it's salt would be able to optimise this, but ok, let's suppose it's better to keep the current behavior (no subquery) in cases where it actually does the right thing.

empty? deliberately does a plain count (instead of a LIMIT 1, or an EXISTS, say) as a performance optimisation (read: trade-off): we bet that if there are enough rows for the difference to matter, we'll probably have need of the actual count as well. Thus we trade a potentially-slower query against the probability of needing to run two.

Ok, that's fine; but the current trade-off seems to be incorrect behavior (and a failing query) rather than just slower performance.

In general, I don't think you can add a 'count' clause when a 'group by' clause is present and expect it to return a count of result rows. You will instead get a set of rows which are grouped as per the 'group by' clause and which contains a count of the items in each group. (That is to say, losing the 'select' list isn't the only issue demonstrated with the test case.) It's necessary to use a subquery in this case, I think. Perhaps a subquery could be used if-and-only-if there is a 'group by' clause.

@rails-bot

This comment has been minimized.

Show comment
Hide comment
@rails-bot

rails-bot Nov 19, 2014

This issue has been automatically marked as stale because it has not been commented on for at least
three months.

The resources of the Rails team are limited, and so we are asking for your help.

If you can still reproduce this error on the 4-1-stable, 4-0-stable branches or on master,
please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions.

rails-bot commented Nov 19, 2014

This issue has been automatically marked as stale because it has not been commented on for at least
three months.

The resources of the Rails team are limited, and so we are asking for your help.

If you can still reproduce this error on the 4-1-stable, 4-0-stable branches or on master,
please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions.

@rails-bot rails-bot added the stale label Nov 19, 2014

@rails-bot rails-bot closed this Jan 2, 2015

@rails-bot

This comment has been minimized.

Show comment
Hide comment
@rails-bot

rails-bot Jan 2, 2015

This issue has been automatically closed because of inactivity.

If you can still reproduce this error on the 4-1-stable, 4-0-stable branches or on master,
please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions.

rails-bot commented Jan 2, 2015

This issue has been automatically closed because of inactivity.

If you can still reproduce this error on the 4-1-stable, 4-0-stable branches or on master,
please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment