Add pluck_each and pluck_in_batches batching methods #47894

@fatkodima fatkodima commented Apr 8, 2023

Example:

Person.pluck_in_batches(:name, :email) do |batch|
  jobs = batch.map { |name, email| PartyReminderJob.new(name, email) }
  ActiveJob.perform_all_later(jobs)
end

Person.pluck_each(:email) do |email|
  PartyMailer.with(email: email).welcome_email.deliver_later
end

Plucking in batches is a very popular feature that I have seen many projects reimplement themselves to gain some performance.
I have seen it in two of my previous projects, in OSS projects (I was able to find it in Mastodon), and in a few popular gems.
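
For illustration, the kind of helper projects tend to hand-roll can be sketched roughly as below. This is not the implementation from this PR: `pluck_in_batches_sketch` and its `fetch` callable are hypothetical names, and an in-memory array stands in for the table so the snippet runs without a database.

```ruby
# Hypothetical helper, not the PR's code: keyset batching over any source
# that can return rows ordered by primary key.
#
# fetch: callable taking (after_id, limit) and returning an array of rows,
#        each row being [id, *values], sorted by id ascending.
def pluck_in_batches_sketch(fetch, batch_size: 1000)
  after_id = nil
  loop do
    rows = fetch.call(after_id, batch_size)
    break if rows.empty?

    after_id = rows.last.first            # remember the last primary key seen
    yield rows.map { |row| row.drop(1) }  # hand out only the requested values
    break if rows.size < batch_size       # a short batch means nothing is left
  end
end

# In-memory stand-in for a table (assumption, for demonstration only).
# In a Rails app the lambda body would be roughly:
#   scope = Person.order(:id).limit(limit)
#   scope = scope.where("id > ?", after_id) if after_id
#   scope.pluck(:id, :name, :email)
people = (1..5).map { |i| [i, "name#{i}", "user#{i}@example.com"] }
fetch_people = lambda do |after_id, limit|
  people.select { |row| after_id.nil? || row.first > after_id }.first(limit)
end

pluck_in_batches_sketch(fetch_people, batch_size: 2) do |batch|
  puts batch.inspect
end
```

Each iteration issues one fetch and never instantiates model objects, which is where the performance gain comes from.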

Benchmarks

Tested on a table with 50M records.
Compared to the recently introduced optimization for range batching.

CREATE TABLE users (id bigserial PRIMARY KEY, val integer);
INSERT INTO users (val) SELECT floor(random() * 30 + 1)::int FROM generate_series(1, 50000000) AS i;
ANALYZE users;

Whole table batching

Using ranges:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.in_batches(use_ranges: true) do |batch|
  batch.pluck(:id, :val)
end

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 209.20533800008707s

Plucking in batches:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.pluck_in_batches(:id, :val) { }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 113.7704949999461s 🔥

Batching with conditions

Using ranges:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.where("val = 21").in_batches(use_ranges: true) do |batch|
  batch.pluck(:id, :val)
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 28.136486999923363s

No ranges:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.where("val = 21").in_batches do |batch|
  batch.pluck(:id, :val)
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 39.96518399997149s

Plucking in batches:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.where("val = 21").pluck_in_batches(:id, :val) do |batch|
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 16.415813000057824s 🔥

These numbers are from the database on my local machine. The improvement will be much larger in production, because the queries are simpler and the number of SQL queries is cut in half.
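
To make the "cut in half" point concrete, here is a toy model that counts round trips against an in-memory table instead of a real database. All names (`FakeTable`, `two_step`, `one_step`) are hypothetical. It assumes that batching with in_batches plus a pluck inside the block costs two queries per batch (one to find the batch, one to fetch the values), while plucking in batches costs one.

```ruby
# Toy model only: counts "queries" against an in-memory table.
class FakeTable
  attr_reader :query_count

  def initialize(rows)
    @rows = rows          # array of [id, val], sorted by id
    @query_count = 0
  end

  # One round trip: ids of the next batch after a given id.
  def next_ids(after_id, limit)
    @query_count += 1
    @rows.map(&:first).select { |id| id > after_id }.first(limit)
  end

  # One round trip: full rows for a known list of ids.
  def rows_for(ids)
    @query_count += 1
    @rows.select { |id, _| ids.include?(id) }
  end

  # One round trip: the next batch of full rows after a given id.
  def next_rows(after_id, limit)
    @query_count += 1
    @rows.select { |id, _| id > after_id }.first(limit)
  end
end

# in_batches-style: locate the batch, then pluck inside the block.
def two_step(table, batch_size)
  after_id = 0
  loop do
    ids = table.next_ids(after_id, batch_size)
    break if ids.empty?
    table.rows_for(ids)
    after_id = ids.last
    break if ids.size < batch_size
  end
  table.query_count
end

# pluck_in_batches-style: each batch is a single query.
def one_step(table, batch_size)
  after_id = 0
  loop do
    rows = table.next_rows(after_id, batch_size)
    break if rows.empty?
    after_id = rows.last.first
    break if rows.size < batch_size
  end
  table.query_count
end

rows = (1..95).map { |id| [id, id % 30 + 1] }
puts two_step(FakeTable.new(rows), 10)  # prints 20
puts one_step(FakeTable.new(rows), 10)  # prints 10
```

On a local database each round trip is cheap, so the saving shows up mostly as query execution time. With network latency between the application and the database, every avoided round trip counts for more.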

Also, implementing this feature would make #47466 unneeded.

The logic in pluck_in_batches looks similar to in_batches, but trying to DRY it up (by extracting the shared logic into helper methods or by reusing pluck_in_batches inside in_batches) would make the code more complex and harder to understand.

cc @nvasilevski (as we discussed it in https://discuss.rubyonrails.org/t/yield-record-ids-to-in-batches-block/81102)

@fatkodima
Member Author

For anyone interested, this is currently available as a gem (https://github.com/fatkodima/pluck_in_batches).
