Add `pluck_each` and `pluck_in_batches` batching methods #47894

fatkodima · 2023-04-08T11:12:06Z

Example:

Person.pluck_in_batches(:name, :email) do |batch|
  jobs = batch.map { |name, email| PartyReminderJob.new(name, email) }
  ActiveJob.perform_all_later(jobs)
end

Person.pluck_each(:email) do |email|
  PartyMailer.with(email: email).welcome_email.deliver_later
end

Plucking in batches is a very popular feature that I have seen many projects reimplement themselves to gain some performance.
I have seen it in two of my previous projects, in OSS projects (I was able to find it in Mastodon), and in a few popular gems.
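
To make the idea concrete, here is a minimal sketch of the kind of loop such hand-rolled versions tend to use. It is plain Ruby over an in-memory array, not the code from this PR. `ROWS`, `fetch_page`, `pluck_in_batches_sketch` and `pluck_each_sketch` are hypothetical names invented for illustration, and paging by "id greater than the last seen id" is an assumption about the approach.

```ruby
# Hypothetical in-memory stand-in for a table; each hash is one row.
ROWS = (1..10).map { |i| { id: i, name: "user#{i}", email: "user#{i}@example.com" } }.freeze

# Simulates one query:
#   SELECT id, <columns> FROM rows WHERE id > last_id ORDER BY id LIMIT limit
def fetch_page(columns, last_id, limit)
  ROWS.select { |row| row[:id] > last_id }
      .sort_by { |row| row[:id] }
      .first(limit)
      .map { |row| [row[:id], *row.values_at(*columns)] }
end

# Yields one array of plucked values per batch (requires a block).
def pluck_in_batches_sketch(*columns, batch_size: 1000)
  last_id = 0
  loop do
    page = fetch_page(columns, last_id, batch_size)
    break if page.empty?

    # The id is fetched only to serve as the cursor for the next page.
    last_id = page.last.first
    # Drop the leading id; unwrap single-column rows to bare values.
    yield page.map { |row| columns.size == 1 ? row[1] : row.drop(1) }
    break if page.size < batch_size
  end
end

# Yields one plucked row at a time, built on top of the batch method.
def pluck_each_sketch(*columns, batch_size: 1000, &block)
  pluck_in_batches_sketch(*columns, batch_size: batch_size) do |batch|
    batch.each(&block)
  end
end
```

Each batch costs a single simulated query, and no model objects are built along the way.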

Benchmarks

Tested on a table with 50M records.
Compared against the recently introduced optimization for range batching (`use_ranges: true`).

CREATE TABLE users (id bigserial PRIMARY KEY, val integer);
INSERT INTO users (val) SELECT floor(random() * 30 + 1)::int FROM generate_series(1, 50000000) AS i;
ANALYZE users;

Whole table batching

Using ranges:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.in_batches(use_ranges: true) do |batch|
  batch.pluck(:id, :val)
end

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 209.20533800008707s

Plucking in batches:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.pluck_in_batches(:id, :val) { }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 113.7704949999461s 🔥

Batching with conditions

Using ranges:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.where("val = 21").in_batches(use_ranges: true) do |batch|
  batch.pluck(:id, :val)
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 28.136486999923363s

No ranges:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.where("val = 21").in_batches do |batch|
  batch.pluck(:id, :val)
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 39.96518399997149s

Plucking in batches:

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
User.where("val = 21").pluck_in_batches(:id, :val) do |batch|
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

Elapsed: 16.415813000057824s 🔥

These numbers are for the database on my local machine. I expect the improvement to be much larger in production, because the queries are simpler and there are half as many of them.
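
The halving of SQL queries can be illustrated by counting them in a toy model. This is a sketch under stated assumptions, not a measurement of Rails itself. `FakeTable`, `two_step` and `one_step` are hypothetical names, and the model assumes that batching followed by `pluck` issues one query to locate the batch and a second one to read its values.

```ruby
# Hypothetical in-memory table with a counter standing in for SQL queries.
class FakeTable
  attr_reader :query_count

  def initialize(size)
    @rows = (1..size).map { |i| { id: i, val: (i % 30) + 1 } }
    @query_count = 0
  end

  # One "query": ids of the next page after last_id.
  def ids_after(last_id, limit)
    @query_count += 1
    @rows.select { |r| r[:id] > last_id }.first(limit).map { |r| r[:id] }
  end

  # One "query": id and val for the given ids.
  def values_for(ids)
    @query_count += 1
    @rows.select { |r| ids.include?(r[:id]) }.map { |r| [r[:id], r[:val]] }
  end

  # One "query": id and val of the next page, fetched together.
  def page_after(last_id, limit)
    @query_count += 1
    @rows.select { |r| r[:id] > last_id }.first(limit).map { |r| [r[:id], r[:val]] }
  end
end

# Batch first, then pluck: two queries per batch.
def two_step(table, batch_size)
  last_id = 0
  loop do
    ids = table.ids_after(last_id, batch_size)
    break if ids.empty?

    yield table.values_for(ids)
    last_id = ids.last
    break if ids.size < batch_size
  end
end

# Pluck while batching: one query per batch.
def one_step(table, batch_size)
  last_id = 0
  loop do
    page = table.page_after(last_id, batch_size)
    break if page.empty?

    yield page
    last_id = page.last.first
    break if page.size < batch_size
  end
end

two = FakeTable.new(90)
two_step(two, 25) { |_batch| }
one = FakeTable.new(90)
one_step(one, 25) { |_batch| }
puts "two-step queries: #{two.query_count}, one-step queries: #{one.query_count}"
```

In this model both loops yield the same data. The only difference is the number of round trips per batch.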

Also, implementing this feature would make #47466 unneeded.

The logic in pluck_in_batches looks similar to in_batches, but trying to DRY it up (by extracting the shared logic into helper methods, or by reusing pluck_in_batches inside in_batches) would make the code more complex and harder to understand.

cc @nvasilevski (as we discussed it in https://discuss.rubyonrails.org/t/yield-record-ids-to-in-batches-block/81102)

fatkodima · 2023-05-16T11:14:31Z

For anyone interested: this is currently released as a gem (https://github.com/fatkodima/pluck_in_batches).

Add pluck_each and pluck_in_batches batching methods

ee362c0

rails-bot bot added the activerecord label Apr 8, 2023

fatkodima mentioned this pull request Sep 21, 2023

Avoid second query on in_batches.pluck #47462

Open
