Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A bug? Query the whole table when using find_in_batches and find_each #25537

Closed
liukgg opened this issue Jun 27, 2016 · 5 comments
Closed

A bug? Query the whole table when using find_in_batches and find_each #25537

liukgg opened this issue Jun 27, 2016 · 5 comments

Comments

@liukgg
Copy link

liukgg commented Jun 27, 2016

Steps to reproduce

Supposing we use MySQL and we have a model named "User" like this:

class User < ActiveRecord::Base
end

The table "users" have 100 million records in all.

When I use this query "User.find_in_batches", it generates a SQL:
"User Load (0.5ms) SELECT users.* FROM users" .

And this will take a very very long time and occupy much memory.

However, it do work well with a "each" and a block followed:

User.find_in_batches.each do |users|
    p users
end

The same problem with method "find_each".

So is this a bug?

Expected behavior

I expect this query "User.find_in_batches" to generate some SQL like this:
"User Load (0.4ms) SELECT users.* FROM users ORDER BY users.id ASC LIMIT 1000"

Actual behavior

Actually it generates a SQL and take a very very long time:
"User Load (0.5ms) SELECT users.* FROM users" .

System configuration

Rails version:
Both Rails 4.2.6 and Rails v5.0.0.rc2.
Ruby version:
2.3.0

Script

begin require 'bundler/inline'
rescue LoadError => e
  $stderr.puts 'Bundler version 1.10 or later is required. Please update your Bundler'
  raise e
end

gemfile(true) do
  source 'https://rubygems.org'
  gem 'rails', github: 'rails/rails'
  gem 'sqlite3'
end

require 'active_record'
require 'minitest/autorun'
require 'logger'

ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: ':memory:')
ActiveRecord::Base.logger = Logger.new(STDOUT)

ActiveRecord::Schema.define do
  create_table :users, force: true do |t|
  end
end

class User < ActiveRecord::Base
end

class BugTest < Minitest::Test
   10.times { User.create }

  # When I try 'User.find_in_batches' in rails console, it generates a SQL like this, which is not expected:
  # D, [2016-06-28T10:15:20.042833 #65205] DEBUG -- :   User Load (0.4ms)  SELECT `users`.* FROM `users`
  def test_find_in_batches
    User.find_in_batches.inspect
  end

  # It generates a SQL like this also, which is not expected:
  # D, [2016-06-28T10:15:20.042833 #65205] DEBUG -- :   User Load (0.4ms)  SELECT `users`.* FROM `users`
  def test_find_each
    User.find_each.inspect
  end
end

Other Information and what I have tried

I try to find the solution to this problem.And these are what I have done:

1, I read the source code(Rails 4.2.6) of this method "find_in_batches":

module ActiveRecord
  module Batches
    def find_in_batches(options = {})
      options.assert_valid_keys(:start, :batch_size)

      relation = self
      start = options[:start]
      batch_size = options[:batch_size] || 1000

      unless block_given?
        return to_enum(:find_in_batches, options) do
          total = start ? where(table[primary_key].gteq(start)).size : size
          (total - 1).div(batch_size) + 1
        end
      end

      if logger && (arel.orders.present? || arel.taken.present?)
        logger.warn("Scoped order and limit are ignored, it's forced to be batch order and batch size")
      end

      relation = relation.reorder(batch_order).limit(batch_size)
      records = start ? relation.where(table[primary_key].gteq(start)).to_a : relation.to_a

      while records.any?
        records_size = records.size
        primary_key_offset = records.last.id
        raise "Primary key not included in the custom select clause" unless primary_key_offset

        yield records

        break if records_size < batch_size

        records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
      end
    end

....
  end
end

And I found the problem is here:

 unless block_given?
        return to_enum(:find_in_batches, options) do
          total = start ? where(table[primary_key].gteq(start)).size : size
          (total - 1).div(batch_size) + 1
        end
   end

2, Then I found that this "to_enum " method is from ruby "Object#to_enum", am I right?
But I cannot figure out what is the essence of this method.

3, I work with "pry", and I found that the problem really occurs here:

 return to_enum(:find_in_batches, options) do
          total = start ? where(table[primary_key].gteq(start)).size : size
          (total - 1).div(batch_size) + 1
        end

Any help will be appreciated. Thanks!

@liukgg liukgg changed the title A bug? Query the whole table when using find_in_batches A bug? Query the whole table when using find_in_batches and find_each Jun 27, 2016
@sgrif
Copy link
Contributor

sgrif commented Jun 27, 2016

Please provide a script that we can use to reproduce the issue using our standard bug report template

@liukgg
Copy link
Author

liukgg commented Jun 28, 2016

Ok. I have provided a script in the description above.

@htanata
Copy link
Contributor

htanata commented Jul 2, 2016

I think this only happens when inspect is used. It will eventually call ActiveRecord::Relation#inspect:

    def inspect
      entries = records.take([limit_value, 11].compact.min).map!(&:inspect)
      entries[10] = '...' if entries.size == 11

      "#<#{self.class.name} [#{entries.join(', ')}]>"
    end

Maybe it's better if inspect only queries the needed records since it only displays the first 10 records?

Something like this:

diff --git a/activerecord/lib/active_record/relation.rb b/activerecord/lib/active_record/relation.rb
index d042fe5..177ceba 100644
--- a/activerecord/lib/active_record/relation.rb
+++ b/activerecord/lib/active_record/relation.rb
@@ -680,7 +680,9 @@ def values
     end

     def inspect
-      entries = records.take([limit_value, 11].compact.min).map!(&:inspect)
+      subject = loaded? ? records : self
+      entries = subject.take([limit_value, 11].compact.min).map!(&:inspect)
+
       entries[10] = '...' if entries.size == 11

       "#<#{self.class.name} [#{entries.join(', ')}]>"

I can create a pull request if this solution is acceptable.

@liukgg
Copy link
Author

liukgg commented Jul 4, 2016

@htanata ,yeah, it only happens when inspect is used, but I think it can be better for inspect.

@stale stale bot added the stale label Mar 27, 2017
@stale
Copy link

stale bot commented Mar 27, 2017

This issue has been automatically marked as stale because it has not been commented on for at least three months.
The resources of the Rails team are limited, and so we are asking for your help.
If you can still reproduce this error on the 5-1-stable branch or on master, please reply with all of the information you have about it in order to keep the issue open.
Thank you for all your contributions.

htanata added a commit to htanata/rails that referenced this issue Mar 29, 2017
Instead of loading all records and returning only a subset of those,
just load the records as needed.

Fixes rails#25537.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants