Add total_record_limit option for find_in_batches #5696

Closed

Conversation

Contributor

softwaregravy commented Apr 1, 2012

Discussion from: #5502

This change will actually make it possible to divide up your record space across workers.

Contributor

softwaregravy commented Apr 1, 2012

Updated guide

I wonder if the option name could be max_records or max_record_count, thoughts?

@carlosantoniodasilva carlosantoniodasilva and 1 other commented on an outdated diff Apr 2, 2012

activerecord/lib/active_record/relation/batches.rb
relation = apply_finder_options(finder_options)
end
start = options.delete(:start).to_i
batch_size = options.delete(:batch_size) || 1000
+ remaining_record_limit = options.delete(:total_record_limit) || nil
@carlosantoniodasilva

carlosantoniodasilva Apr 2, 2012

Owner

No need to || nil.

@softwaregravy

softwaregravy Apr 2, 2012

Contributor

cleaned up

@carlosantoniodasilva carlosantoniodasilva and 1 other commented on an outdated diff Apr 2, 2012

activerecord/lib/active_record/relation/batches.rb
if primary_key_offset
- records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
+ if remaining_record_limit && remaining_record_limit < batch_size
+ records = relation.where(table[primary_key].gt(primary_key_offset)).limit(remaining_record_limit).to_a
+ else
+ records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
+ end
@carlosantoniodasilva

carlosantoniodasilva Apr 2, 2012

Owner

I think you can refactor it to something like:

records = relation.where(table[primary_key].gt(primary_key_offset))
if remaining_record_limit && remaining_record_limit < batch_size
  records = records.limit(remaining_record_limit)
end
records = records.to_a 
@softwaregravy

softwaregravy Apr 2, 2012

Contributor

suggestion taken verbatim. Didn't notice any further ways to improve.

Contributor

softwaregravy commented Apr 2, 2012

@carlosantoniodasilva Changed the param name to max_records. I thought it was better than total_record_limit, still open to a better name.

Change is ready for further review

@softwaregravy After some thought on the original discussion, I think that with the max_records approach it may be hard for you to figure out where to start the other batches / workers.

I think there might be a way to achieve the results you expect with a where approach, something like this:

Worker A: where(id: 1..20_000).find_in_batches...
Worker B: where(id: 20_001..40_000).find_in_batches...
...
up to 5 workers (or more if you want).

That satisfies your requirement of roughly 20K records per worker (or whatever count you choose). I've already built a similar approach that slices by :id with different priorities, and it worked quite well. Wdyt?

Contributor

softwaregravy commented Apr 3, 2012

The use case that specifically got me onto this wouldn't be served well by your proposed solution. I realize this is application specific, but my record set is not contiguous (I often create and delete records, so there are gaps in the ids). I think having the max_records option is more flexible.

The same problem exists in either case though -- you need to calculate the starting indexes manually. In your suggested approach, I think it's clunkier outside of the happy case of contiguous record blocks.

If you care about balanced batches:

max_records = 5000
(0...Post.count).step(max_records) do |offset|
  Post.find_in_batches(:start => Post.offset(offset).limit(1).first.id, :max_records => max_records) do ...
end 

vs.

max_records = 5000
previous_group_last_id = -1
(0...Post.count).step(max_records) do |offset|
  group_last_id = Post.offset(offset + max_records - 1).limit(1).first.id
  Post.where(id: (previous_group_last_id + 1)..group_last_id).find_in_batches ...
  previous_group_last_id = group_last_id
end

(maybe I'm not seeing a cleaner way to do this?)

I get where you're coming from. And I'm probably not representative of the app-space out there, but for my particular use case, I prefer my approach. That said, I think your suggestion might fit a bit more cleanly and be more intuitive with the overall AR query framework. Going down that road though, maybe something like this would be even more intuitive:

Post.where('id > ?', start_index).limit(max_records).find_in_batches do ...

None of these seem clearly better than the others to me.

Member

NZKoz commented Apr 3, 2012

So I'm having trouble seeing how your proposed implementation would be used to satisfy your use case:

Person.where("age > 21").find_in_batches(:batch_size => 200, : total_record_limit => 1000) do |group|
  ???
end

What do you anticipate being yielded there? 5 batches, of size 200. What happens to the other thousands of People? How do you anticipate parallelizing this?

Contributor

softwaregravy commented Apr 3, 2012

@NZKoz

What do you anticipate being yielded there? 5 batches, of size 200.
Yes
What happens to the other thousands of People?
You'd have to make other calls to get them. The param would be most useful when paired with the :start option
How do you anticipate parallelizing this?
In conjunction with the :start param. Right now, if you have a database with 10_000_000 records, and you need to iterate through them, there's no current way to use the existing batching code to go through them 1_000_000 at a time, say, as you might want to do if you had 10 workers.

To further illustrate, building on my example above and using Delayed::Job as my means of parallelizing:

max_records = 5000
(0...Post.count).step(max_records) do |offset|
  self.delay.process_chunk(Post.offset(offset).limit(1).first.id, max_records)
end 

def process_chunk(start_id, max_records)
  Post.find_in_batches(:start => start_id, :max_records => max_records) do |batch|
    #something with the batch here
  end 
end

(Delayed::Job uses the delay method to hand a job off to a worker's queue, so in my example each worker would run the process_chunk method with a different start_id.)

Member

NZKoz commented Apr 3, 2012

So your code still needs to iterate to find the batching points. You can still do that iteration, get the list of ids you're splitting on, and then do something like this in each of the workers:

  where(id: this_id..(next_id-1)).find_in_batches

Muddying up the API with something like :max_records for such an edge case doesn't seem worth it to me. find_in_batches lets you iterate over a result set without allocating a crazy amount of memory; everything else should be done in the code which generates the result sets.
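
To make that split-points idea concrete, here is a minimal sketch (mine, not from the thread; the Post model, the 5_000 group size, and the batch size of 200 are illustrative) of collecting boundary ids in one pass and then letting each worker batch over its own id range, e.g. via Delayed::Job as in the earlier example:

group_size = 5_000

# One cheap pass over the table to collect boundary ids (only ids are selected).
boundary_ids = []
Post.select(:id).find_in_batches(batch_size: group_size) do |batch|
  boundary_ids << batch.last.id
end

# Each id range can then be handed to its own worker; inside the worker,
# find_in_batches keeps memory usage flat as usual.
previous_id = 0
boundary_ids.each do |last_id|
  Post.where(id: (previous_id + 1)..last_id).find_in_batches(batch_size: 200) do |batch|
    # process the batch
  end
  previous_id = last_id
end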

Contributor

softwaregravy commented Apr 25, 2012

@NZKoz I'm willing to accept that

If no one else sees something we might want to salvage, we can go ahead and close this.

@softwaregravy closing, thanks for your contribution.

tmikoss commented Sep 19, 2012

How about if this was implemented using the usual .limit syntax? For example,

User.where(:name => 'Bob').limit(10).find_in_batches(:batch_size => 10)

would yield one batch of 10, while

User.where(:name => 'Bob').limit(25).find_in_batches(:batch_size => 10)

would result in 2 batches of 10 and one batch of 5.

Unless I'm missing something, you should be able to extract the limit condition from the base relation, then loop until it has been fulfilled.
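
A rough sketch of that idea (this is not the actual Rails implementation, just an illustration assuming the relation's limit_value can be read and the limit stripped before batching; the helper name is made up):

def find_in_batches_respecting_limit(relation, batch_size: 1000)
  remaining = relation.limit_value # nil when no .limit was applied

  # Strip the limit so find_in_batches doesn't ignore it,
  # then stop yielding once the requested number of records has been seen.
  relation.except(:limit).find_in_batches(batch_size: batch_size) do |batch|
    if remaining
      batch = batch.first(remaining)
      remaining -= batch.size
    end
    yield batch
    break if remaining == 0
  end
end

# Usage, following the examples above:
# find_in_batches_respecting_limit(User.where(name: 'Bob').limit(25), batch_size: 10) { |batch| ... }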

My use case is slightly different - I ran into this while developing a background task against a well-populated test DB. I tried to throw a limit condition on the query as the 'cleanest' way to make it run faster while I was still not 100% sure about the implementation and would be running it repeatedly.

As I see it, my approach would get rid of an unintuitive limitation without introducing new parameters.

Are you fundamentally against such changes, or should I turn this comment into a pull request?

jmehnle commented Feb 7, 2016

@tmikoss, this approach was exactly the idea I had when I just ran into this use case. Would you still be up for posting a pull request?

FWIW I just got burned by changing:

MyModel.some_scope.limit(VERY_LARGE).each do |x| ... end

to

MyModel.some_scope.limit(VERY_LARGE).find_each do |x| ... end

It's a little scary that composing these ideas works most of the time, but can occasionally burn you and result in some instructions being completely ignored, without warning.

There is a warning when you accidentally cause Rails to ignore the limit you gave it; however, when I ran my tests I didn't happen to catch the print statement as it flew by. I see there is an option to make this raise an error rather than warn on stdout, but I'm not sure how you would become aware of that option other than by realizing on your own that Rails is behaving unexpectedly and then seeking it out in the docs.
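
For reference, the option being alluded to is presumably Active Record's setting for turning that "ignored order/limit" warning into an exception. In the Rails versions of roughly that era it was spelled like the following, though the exact name has changed across versions, so check the configuring guide for yours:

# config/application.rb
config.active_record.error_on_ignored_order_or_limit = true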

mahemoff commented Dec 14, 2016 edited

I don't think limit is an edge case for find_in_batches. I've had various needs for it when using Rails. One important use case for limits is dev/debug/testing, ie to quickly process some items or start some background jobs and check if it works okay, without blasting through the entire table.

@mahemoff I've also faced a situation where I needed a max_records option to pass to find_in_batches. I would describe my case as limiting logic (like the 10,000-message history limit in Slack for free accounts, which is loaded in batches of 100; I know that in Slack's case I would use offset and limit, but it's still a good example IMO). So I'm sure max_records should definitely be supported.

A possible workaround to make that happen:

    batchsize = 25
    limit = 100000
    batchsize = limit if batchsize > limit
   
    batch_count = 0
    max_batches = (limit / batchsize.to_f).ceil

    Post.all.find_in_batches(batch_size: batchsize) do |batch|
      batch_count += 1
      batch.each do |post|
        # do something with 'post'
      end

      break if max_batches == batch_count   
    end
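
One caveat with that workaround (my note, not from the thread): when limit isn't an exact multiple of batchsize, the final batch still yields a full batchsize of records, so slightly more than limit records get processed. Trimming the last batch keeps the count exact, for example:

    Post.all.find_in_batches(batch_size: batchsize) do |batch|
      batch_count += 1
      # On the final batch, drop any records beyond the overall limit.
      batch = batch.first(limit - (batch_count - 1) * batchsize) if batch_count == max_batches

      batch.each do |post|
        # do something with 'post'
      end

      break if batch_count == max_batches
    end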