in_batches using ids #20933

Merged
1 commit merged into rails:master on Aug 7, 2015

Conversation

10 participants
@siadat
Contributor

siadat commented Jul 18, 2015

Introduction

This PR implements a new method, ActiveRecord::Relation#in_batches, which works like the existing find_in_batches but instead of yielding arrays of records, it yields ActiveRecord::Relation objects.

It is an alternative to #20899, as described in my comment at #20899 (comment).

It fetches the primary key of the records in each batch and uses that to create a new relation.
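
Conceptually, each iteration works roughly like the following sketch (simplified and illustrative only; names such as batch_size and last_id are not the actual internals):

last_id = 0
loop do
  # Fetch only the primary keys of the next batch, keyed off the last id seen so far.
  ids = Post.where('id > ?', last_id).order(:id).limit(batch_size).pluck(:id)
  break if ids.empty?

  # Yield a fresh relation explicitly scoped to this batch's ids, so any query
  # method chained onto it stays inside the batch.
  yield Post.where(id: ids)

  last_id = ids.last
end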

Examples

People.in_batches.each do |people|
  people.update_all(sleep: true)
end

Query methods can be called on the yielded relations:

People.in_batches(of: 100) do |people|
  people.where('id % 2 = 0').update_all(sleep: true)
  people.where('id % 2 = 1').each(&:party_all_night!)
end

Comparison to #20899 - using ids vs using offsets

Executed queries when load: false

This PR (using ids):
SELECT  "posts"."id" FROM "posts" ORDER BY "posts"."id" ASC LIMIT 2
SELECT  "posts"."id" FROM "posts" WHERE ("posts"."id" > 2) ORDER BY "posts"."id" ASC LIMIT 2
SELECT  "posts"."id" FROM "posts" WHERE ("posts"."id" > 4) ORDER BY "posts"."id" ASC LIMIT 2
Benchmark times (50M rows, no index, batch size 1000):
batch    time
1        194.7s
2        0.016s
3        0.006s
4        0.008s
5        0.007s
...
#20899 (using offsets):
SELECT COUNT(count_column) FROM (SELECT  1 AS count_column FROM "posts" LIMIT 2 OFFSET 0) subquery_for_count
SELECT COUNT(count_column) FROM (SELECT  1 AS count_column FROM "posts" LIMIT 2 OFFSET 2) subquery_for_count
SELECT COUNT(count_column) FROM (SELECT  1 AS count_column FROM "posts" LIMIT 2 OFFSET 4) subquery_for_count
Benchmark times (50M rows, no index, batch size 1000):
batch    time
1        192.5s
2        242.0s
3        257.5s
4        256.8s
5        258.7s
... (average times increase on each iteration, because of higher offset)

Executed queries when load: true

This PR (using ids):
SELECT  "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 2
SELECT  "posts".* FROM "posts" WHERE ("posts"."id" > 2) ORDER BY "posts"."id" ASC LIMIT 2
SELECT  "posts".* FROM "posts" WHERE ("posts"."id" > 4) ORDER BY "posts"."id" ASC LIMIT 2
Benchmark times (50M rows, no index, batch size 1000):
batch    time
1        257.6s (*)
2        0.065s
3        0.080s
4        0.047s
5        0.049s
...
(*) the total time of the first iteration could be reduced to about 70% of its current value with an additional query that only selects the ids. More on this later.
#20899 (using offsets):
SELECT  "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 2 OFFSET 0
SELECT  "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 2 OFFSET 2
SELECT  "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 2 OFFSET 4
Benchmark times (50M rows, no index, batch size 1000):
batch    time
1        273.1s
2        268.0s
3        277.5s
... (average times increase on each iteration, because of higher offset)

yielded_relation.to_sql

This PR (using ids):
SELECT "posts".* FROM "posts" WHERE "posts"."id" IN (1, 2) ORDER BY "posts"."id" ASC
SELECT "posts".* FROM "posts" WHERE "posts"."id" IN (3, 4) ORDER BY "posts"."id" ASC
SELECT "posts".* FROM "posts" WHERE "posts"."id" IN (5, 6) ORDER BY "posts"."id" ASC
#20899 (using offsets):
SELECT  "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 2 OFFSET 0
SELECT  "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 2 OFFSET 2
SELECT  "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT 2 OFFSET 4

Observations

This PR's implementation is consistent and forgiving. By consistent I mean that it iterates through the batches using an id > last_id_of_previous_batch condition, as the original find_in_batches implementation did. It is forgiving because it allows the programmer to call any query method on the yielded relation, and the result will be exactly as expected. So even this example works with no unexpected side effects:

People.in_batches.each do |relation|
  relation.where(age: 21).update_all('age = age + 1')
end

This is not possible in #20899; see #20899 (comment) for the explanation.

In this PR, the method fetches the IDs of the records in each batch, whereas #20899 executes a COUNT query on a subquery for each batch. That PR also uses OFFSET, which can be very slow for large offset values. See this as an example.

Update

Notice that the benchmark times for the current PR are very similar to the original find_each and superior to the other PR (using offsets). The reason is that the current implementation iterates through batches using the primary key. Apart from the first batch, which executes the same query as the other PR, the remaining batches are always faster, because their queries are built using the primary key.

@siadat siadat changed the title from "Add ActiveRecord::Relation#in_batches using ids" to "in_batches using ids" on Jul 18, 2015

@sgrif

Member

sgrif commented Jul 18, 2015

I've been traveling so I've been light on being able to participate, but I think with_batches is too close to in_batches, and the difference between them is entirely unclear from the names.

@siadat

Contributor

siadat commented Jul 18, 2015

@sgrif I agree, I was waiting for a better suggestion.

I just removed with_batches to make sure this PR only focuses on in_batches. Please note that find_each is deprecated in this PR as suggested by @dhh. Let me know if it is too soon to deprecate it now, or if I need to do anything else.

@sgrif

Member

sgrif commented Jul 18, 2015

OK. In regards to the implementation, I just need some time to catch up on the conversation.

@dhh

Member

dhh commented Jul 18, 2015

Let's not deprecate find_each right now. We can just change the implementation to use in_batches.

@siadat

Contributor

siadat commented Jul 18, 2015

@dhh Done!

find_each and find_in_batches are back, but re-implemented to use the new in_batches.

@bogdan

Contributor

bogdan commented Jul 19, 2015

I am against this ids trick being enabled by default. This is something advanced and should only be enabled with an option. I am still OK if that option is true by default, as long as I can disable it.

@siadat

Contributor

siadat commented Jul 19, 2015

@bogdan could you explain a bit more? The API is exactly like the other PR.

@siadat

Contributor

siadat commented Jul 19, 2015

I suspect your concern might be regarding manually loading records into the relation. If so, I should clarify that the loading is done inside the method definition, and the callers do not need to do anything extra. The API is kept as simple as the other PR.

@siadat

Contributor

siadat commented Jul 19, 2015

I just made a lot of edits to the description of this PR to include queries and comparisons with the other PR.

@siadat

Contributor

siadat commented Jul 20, 2015

@sgrif @bogdan @dhh Please have a look at the benchmark times I added to the PR description to see why it is faster than #20899.

@siadat siadat closed this Jul 20, 2015

@siadat siadat reopened this Jul 20, 2015

@siadat

Contributor

siadat commented Jul 20, 2015

I can make further improvements to both the performance (*) and readability; however, I would rather get this reviewed before making any new changes.

(*) The performance of the load: true scenario could be improved by running an additional query before the batches are fetched. This additional query only SELECTs the primary key, instead of *, in order to get the smallest id and start iterating batches from there. It would take about 70% of the current execution time according to my benchmarks. I will submit another PR with that improvement after this PR is merged.
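
A rough sketch of that extra step, based on the description above (illustrative only, not the final implementation):

# Warm-up query that selects only the primary key:
#   SELECT "posts"."id" FROM "posts" ORDER BY "posts"."id" ASC LIMIT 1
start_id = Post.order(:id).limit(1).pluck(:id).first

# The first batch can then be fetched with an id lower bound instead of an
# unbounded SELECT * ... LIMIT batch_size:
first_batch = Post.where('id >= ?', start_id).order(:id).limit(batch_size)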

@dhh

Member

dhh commented Jul 20, 2015

I'll let others review the implementation, but seeing the examples, I'd like to have a delegate method called all or something similar that allows this:

People.in_batches.each do |people|
  people.update_all(sleep: true)
end

To become this:

People.in_batches.update_all(sleep: true)

Basically, if there's no block_given?, then we perform the following call(s) on the batches one at a time. That's overwhelmingly the default setup.

We could expand that to something like:

People.in_batches.each_record { |record| ... }

That's getting pretty close to People.find_each.
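
A rough sketch of what such a delegator could look like (illustrative; the class and method names were still being discussed at this point):

class BatchDelegator
  def initialize(relation, of:)
    @relation = relation
    @of = of
  end

  # Calls like update_all are forwarded to each batch relation in turn.
  [:update_all, :delete_all, :destroy_all].each do |method|
    define_method(method) do |*args|
      @relation.in_batches(of: @of) { |batch| batch.public_send(method, *args) }
    end
  end

  # Yields individual records, which gets pretty close to find_each.
  def each_record(&block)
    @relation.in_batches(of: @of) { |batch| batch.each(&block) }
  end
end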

@bogdan

Contributor

bogdan commented Jul 20, 2015

I don't like the current PR because it goes too far from the current implementation. It has several weak points:

  • We have to rely on a dubious benchmark that:
    • wasn't run on all databases
    • cannot cover all possible database states and DB schemas
  • We spawn a lot of queries for each batch

I think this PR has gone in the wrong direction since the implementation was changed to remove the Relation#last call. The reasoning was right but the solution is wrong. Maybe it won't be wrong some day, but not in this PR.

I propose coming back to last as a concept and fetching it using offset.
I think this is an algorithm we can use:

# last = A || B
last = records.offset(records.offset_value + batch_size - 1).first ||
  records.offset(nil).limit(nil).last

  • A fetches the last element of a batch when it is full (it has batch_size elements).
  • B fetches the last element of the final, short batch.
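
For comparison, a concrete restatement of that snippet (my reading of the proposal, assuming a batch size of 2 and the third batch; not code from either PR):

batch_size = 2
records = Post.order(:id).limit(batch_size).offset(4)   # e.g. the third batch of size 2

# A: when the batch is full, the element at (current offset + batch_size - 1) exists
last = records.offset(records.offset_value + batch_size - 1).first
# B: for the final, short batch, fall back to the last record of the unbounded relation
last ||= records.offset(nil).limit(nil).last
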
@siadat

Contributor

siadat commented Jul 20, 2015

@dhh Thanks, I really like the idea of People.in_batches.update_all(sleep: true). I will work on it.

@bogdan Thank you for your feedback.

I don't like the current PR because it goes too far from the current implementation.

Could you explain a bit more? This implementation iterates through the batches using ids, similar to the original find_in_batches. The other PR uses the new offset-based approach. So, in this respect, this PR is closer to the original implementation. And offset queries are slow. Please correct me if I am missing something here.

It has several weak points:
We have to rely on a dubious benchmark that:
wasn't run on all databases
cannot cover all possible database states and DB schemas

Sorry, I didn't get this part. I ran my tests on a MySQL table with 50 million rows. The other PR is very slow compared to this PR and compared to the original find_in_batches. I will be happy to create more formal benchmarks for both PRs if you could make it a bit clearer what you are looking for and what I need to do.

We spawn a lot of queries for each batch

We execute exactly the same number of queries in both pull requests, and it is also the same number of queries as the original implementation. The queries this PR executes run faster than the other PR's, and queries with OFFSET become even slower as we go through more batches, as I mentioned above. Let me know if I am missing anything.

I think this is an algorithm we can use:

👍 I liked this, I will try it on the other PR.

Thank you! 😄

@dhh

Member

dhh commented Jul 21, 2015

I'm not sure I understand the criticism either. This PR looks well-implemented, tested, and benchmarked. Happy to evaluate another PR as well, but that shouldn't stop progress here. As long as we settle on the public API, we can continue to refine the implementation behind the scenes. I like the public API here, particularly if we get the batch delegator added as well 👍

@bogdan

Contributor

bogdan commented Jul 21, 2015

@siadat I don't understand why the following have to happen:

          relation_yielded = self.where(primary_key => ids).reorder(batch_order)
          relation_yielded.instance_variable_set :@records, records
          relation_yielded.instance_variable_set :@loaded, true

I've replaced these 3 lines with: yielded_relation = batch_relation and all tests passed. Can you explain why this is needed?

Same thing for load: false; the following does the job:

relation_yielded = batch_relation

Instead of:

relation_yielded = self.where(primary_key => ids).reorder(batch_order)
@siadat

Contributor

siadat commented Jul 22, 2015

@bogdan Thank you for reviewing the code. I just pushed a new test that will fail if you do that.

The relation_yielded is built using the record ids. It is not the same as the relation used internally to iterate through the records. E.g., if the internal relation to_sql is

... WHERE ("posts"."id" > 2) ORDER BY "posts"."id" ASC LIMIT 2

the yielded relation should be

... WHERE "posts"."id" IN (3, 4) ORDER BY "posts"."id" ASC

This lets us call all query methods on yielded relations safely. Otherwise we would have to somehow ban the programmer from calling query methods, e.g. where, on the yielded relation because the result could match records outside the current batch. Please see my comment in the other PR #20899 (comment) and this PR's description for an example.

As for the reason why records are loaded like that (instance_variable_set), it is because records are already loaded and we want to avoid making an extra query.
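
To make this concrete, chaining a condition onto the two kinds of yielded relation produces roughly the following (a sketch, assuming a posts table with an author_id column as in the tests):

batch = Post.where(id: [3, 4]).order(:id)          # shape of the relation this PR yields
batch.where(author_id: 1).to_sql
# => SELECT "posts".* FROM "posts" WHERE "posts"."id" IN (3, 4) AND "posts"."author_id" = 1 ORDER BY "posts"."id" ASC

window = Post.order(:id).limit(2).offset(2)        # shape of an offset-based batch
window.where(author_id: 1).to_sql
# => SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = 1 ORDER BY "posts"."id" ASC LIMIT 2 OFFSET 2
# The added condition changes which rows fall inside the window, so it can match
# records outside the intended batch.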

@bogdan

Contributor

bogdan commented Jul 22, 2015

@siadat

The test doesn't test the actual end-user scenario: it looks completely bound to your internal implementation. If we resolved the same problem in a different way, your test would fail, but it should not.

I don't like the implementation with instance_variable_set. It is a hack that breaks OOP principles and the motivation for such hacks should be extremely strong. I don't see such motivation here.

I vote for moving this code to a different PR and merging the code we all agree on, because it will never be perfect...

@dhh

Member

dhh commented Jul 22, 2015

I appreciate the vigorous review, @bogdan, but my take is that @siadat has answered the concerns and explained why the implementation needs to be the way it is. We're going to need more rigorous arguments against the current implementation to prevent it from being merged.

@siadat

Contributor

siadat commented Jul 22, 2015

@bogdan Thank you for taking the time to review this PR 👍

The test doesn't test the actual end user scenario

The new test ensures that the returned relations are built to have all the primary keys explicitly specified in the condition. I think it does make a difference to the end-user whether the relation they are receiving is where(id: [5, 6]) (this PR) or offset(4).limit(2) (the other PR). But I will welcome a better alternative to assert this.

I don't like the implementation with instance_variable_set.

I agree, I don't like it either, but it works! I've seen it done in other parts of Rails. I would appreciate it if someone could review it and/or suggest an alternative. Maybe later we could add a private setter method for relations in order to allow building and manually loading records the OOP way.

I vote for the current PR, because it performs better than the other PR for a large number of rows, it is as fast as the original find_in_batches (or faster, when records are not loaded), and the yielded relations are more useful than the other PR's in my opinion.

@dhh

Member

dhh commented Jul 22, 2015

That approach has my vote.


@bogdan

Contributor

bogdan commented Jul 22, 2015

@dhh @siadat OK, take your turn.

This is my last test concern. I promise.

@siadat is the following solved:

People.in_batches.each do |relation|
  relation.where(age: 21).update_all('age = age + 1')
end

I would recommend using this example as an idea for a test. It describes the problem we are solving more closely. I don't think a relation with where(id: [5, 6]) can be considered something the end user wants. The end user just wants fewer problems like the one in the code example above, no matter how they are solved.

@siadat

Contributor

siadat commented Jul 22, 2015

@bogdan That's a good idea, thanks! I just pushed a new test for that.

+ # If you do not provide a block to #in_batches, it will return an Enumerator
+ # which yields ActiveRecord::Relation objects for chaining with other methods:
+ #
+ # Person.in_batches.with_index do |relation, batch|

@kaspth

kaspth Jul 22, 2015

Member

Let's change batch to batch_index.

+ #
+ # Person.in_batches.with_index do |relation, batch|
+ # puts "Processing relation ##{batch}"
+ # relation.each{|relation| relation.delete_all }

@kaspth

kaspth Jul 22, 2015

Member

This should have more spaces: relation.each { |relation| relation.delete_all }

+ # This is especially useful if you want to work with the
+ # ActiveRecord::Relation object instead of the individual records.
+ #
+ # This is especially useful if you want multiple workers dealing with

@kaspth

kaspth Jul 22, 2015

Member

"This is especially useful" is used twice in short order. Let's rephrase one of them.

+ #
+ # NOTE: It's not possible to set the order. That is automatically set to
+ # ascending on the primary key ("id ASC") to make the batch ordering
+ # work. This also means that this method only works when the primary key is

@kaspth

kaspth Jul 22, 2015

Member

Would consistent be better than work in "to make the batch ordering work"? That way we also explain why it works.

@kaspth

kaspth Jul 22, 2015

Member

I'd change the last sentence to "Therefore the primary key must be orderable, e.g. an integer or a string."

activerecord/test/cases/batches_test.rb
+
+ def test_in_batches_should_not_use_records_after_yielding_them_in_case_original_array_is_modified
+ not_a_post = "not a post"
+ not_a_post.stubs(:id).raises(StandardError, "not_a_post had #id called on it")

@kaspth

kaspth Jul 22, 2015

Member

Can we use Minitest stubbing please? 😄

@siadat

siadat Jul 22, 2015

Contributor

not_a_post doesn't have an id method, so according to Minitest's README I'll have to define a singleton method:

def not_a_post.id
  raise StandardError.new("not_a_post had #id called on it")
end

Shall I do that?

@matthewd

matthewd Jul 22, 2015

Member

This test doesn't seem to actually be enforcing anything useful, IMO

@kaspth

kaspth Jul 25, 2015

Member

@siadat Yes, define the singleton method 👍

@kaspth

Member

kaspth commented Jul 22, 2015

Well done @siadat 😄

@siadat siadat closed this Jul 22, 2015

@siadat siadat reopened this Jul 22, 2015

@siadat

Contributor

siadat commented Jul 22, 2015

Thank you @kaspth! 😄

@siadat siadat closed this Jul 22, 2015

@siadat siadat reopened this Jul 22, 2015

+ person.update_attributes(author_id: 1)
+
+ Post.in_batches(of: 2) do |batch|
+ batch.where('author_id >= 1').update_all('author_id = author_id + 1')

@siadat

siadat Jul 22, 2015

Contributor

I would have preferred to increment a more meaningful number here instead of author_id. Any suggestions?

@matthewd

matthewd Jul 22, 2015

Member

It's not quite the same demonstration of potential disaster, but you could do something more like:

seen_records = []
Post.in_batches(of: 2) do |batch|
  seen_records += batch.where(some_cond).to_a
end
assert Post.where(some_cond).to_a <is suitably similar to> seen_records
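
A concrete version of that sketch might look like this (assuming some_cond stands in for any condition, e.g. author_id >= 1, and comparing ids instead of the placeholder comparison):

seen_records = []
Post.in_batches(of: 2) do |batch|
  seen_records += batch.where('author_id >= 1').to_a
end
assert_equal Post.where('author_id >= 1').order(:id).pluck(:id), seen_records.map(&:id)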

@siadat

siadat Jul 22, 2015

Contributor

👍 Thanks @matthewd, I will add this as well.

activerecord/test/cases/batches_test.rb
+ assert_equal 2, person.reload.author_id # incremented only once
+ end
+
+ def test_in_batches_should_return_a_relation_with_ids_in_range

@bogdan

bogdan Jul 23, 2015

Contributor

@siadat I still want this test to be deleted. There are others that really ensure that problem is solved now.

siadat added a commit to siadat/rails that referenced this pull request Jul 25, 2015

Add ActiveRecord::Relation#in_batches
`in_batches` yields Relation objects if a block is given, otherwise it
returns an instance of `BatchesDelegator`. The existing `find_each` and
`find_in_batches` methods work with batches of records. The new API
allows working with relation batches as well.

Examples:

    Person.in_batches.each_record(&:party_all_night!)
    Person.in_batches.update_all(awesome: true)
    Person.in_batches.delete_all
    Person.in_batches.map do |relation|
      relation.delete_all
      sleep 10 # Throttles the delete queries
    end

Closes #20933.
@siadat

Contributor

siadat commented Jul 25, 2015

Done. The conflict is resolved until CHANGELOG is updated again in master! 😄

@dhh

Member

dhh commented Jul 26, 2015

@sgrif Did you have any comments or are we merging?

@@ -0,0 +1,67 @@
+module ActiveRecord
+ module Batches
+ class BatchesDelegator

@sgrif

sgrif Jul 27, 2015

Member

What do you think about naming this something like BatchEnumerator instead?

+ break if ids.empty?
+
+ primary_key_offset = ids.last
+ raise "Primary key not included in the custom select clause" unless primary_key_offset

@sgrif

sgrif Jul 27, 2015

Member

Should this be an ArgumentError?

+ ids = records.map(&:id)
+ relation_yielded = self.where(primary_key => ids).reorder(batch_order)
+ relation_yielded.instance_variable_set :@records, records
+ relation_yielded.instance_variable_set :@loaded, true

@sgrif

sgrif Jul 27, 2015

Member

What do you think about having a protected method that we can call, instead of reaching into the object's internals?

@sgrif

sgrif Jul 27, 2015

Member

I might be missing something, but it looks like the loaded relation would completely lose the proper ordering of the records.

@siadat

siadat Jul 27, 2015

Contributor

@sgrif 👍 I like the idea of a method for loading records into a relation.

it looks like the loaded relation would completely lose the proper ordering of the records

All relations are ordered by primary key. I just added a test to ensure each record is returned in that order.

+ loop do
+ if load
+ records = batch_relation.to_a
+ ids = records.map(&:id)

@sgrif

sgrif Jul 27, 2015

Member

Shouldn't we still use pluck here?

@siadat

siadat Jul 27, 2015

Contributor

@sgrif records.map(&:id) was used here to make sure no extra query is executed. Since pluck on a loaded relation doesn't execute a new query we might as well replace:

records.map(&:id)

with

batch_relation.pluck(primary_key)
+ def each_record
+ return to_enum(:each_record) unless block_given?
+
+ @relation.to_enum(:in_batches, of: @of, begin_at: @begin_at, end_at: @end_at, load: true).map do |relation|

@sgrif

sgrif Jul 27, 2015

Member

Shouldn't this be each, not map?

+ # People.in_batches.update_all('age = age + 1')
+ [:delete_all, :update_all, :destroy_all].each do |method|
+ define_method(method) do |*args, &block|
+ @relation.to_enum(:in_batches, of: @of, begin_at: @begin_at, end_at: @end_at, load: false).map do |relation|

@sgrif

sgrif Jul 27, 2015

Member

Shouldn't this be each, and not map?

+ # end
+ def each
+ enum = @relation.to_enum(:in_batches, of: @of, begin_at: @begin_at, end_at: @end_at, load: false)
+ return enum.map { |relation| yield relation } if block_given?

@sgrif

sgrif Jul 27, 2015

Member

Shouldn't this be each, and not map?

activerecord/test/cases/batches_test.rb
+ assert_kind_of ActiveRecord::Relation, relation
+ end
+ end
+ end

@sgrif

sgrif Jul 27, 2015

Member

I really don't think we have sufficient test coverage here. I don't think any of the tests that are only doing assert_kind_of add any significant value, and the remaining tests don't feel comprehensive enough. Can we replace all of the assert_kind_of tests with something a bit more thorough?

siadat added a commit to siadat/rails that referenced this pull request Jul 27, 2015

Add ActiveRecord::Relation#in_batches
`in_batches` yields Relation objects if a block is given, otherwise it
returns an instance of `BatchEnumerator`. The existing `find_each` and
`find_in_batches` methods work with batches of records. The new API
allows working with relation batches as well.

Examples:

    Person.in_batches.each_record(&:party_all_night!)
    Person.in_batches.update_all(awesome: true)
    Person.in_batches.delete_all
    Person.in_batches.map do |relation|
      relation.delete_all
      sleep 10 # Throttles the delete queries
    end

Closes #20933.

siadat added a commit to siadat/rails that referenced this pull request Jul 27, 2015

Add ActiveRecord::Relation#in_batches

siadat added a commit to siadat/rails that referenced this pull request Jul 27, 2015

Add ActiveRecord::Relation#in_batches
@@ -641,6 +641,13 @@ def inspect
"#<#{self.class.name} [#{entries.join(', ')}]>"
end
+ protected
+
+ def load_records(records)

@siadat

siadat Jul 27, 2015

Contributor

@sgrif Please have a look at this protected method regarding loading records into the relation.
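
The diff excerpt above only shows the signature; a minimal sketch of what such a protected loader might contain (an assumption, based on the instance variables discussed earlier):

protected

def load_records(records)
  @records = records   # records were already fetched by the caller
  @loaded  = true      # mark the relation as loaded so no extra query runs
end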

@siadat

siadat Jul 27, 2015

Contributor

@sgrif BTW I am sorry if amending made the review difficult. Let me know if you prefer separate commits from now on, and I'll squash them later again. 😄

siadat added a commit to siadat/rails that referenced this pull request Aug 4, 2015

Add ActiveRecord::Relation#in_batches
@siadat

Contributor

siadat commented Aug 6, 2015

Any particular reason why this PR is abandoned?

/cc @sgrif

@dhh

Member

dhh commented Aug 6, 2015

Not on my end. If @sgrif or @matthewd wants to do a review, let's make that happen this week. Otherwise, we'll merge on Monday and any subsequent reviews can happen once it's part of rails/master. The API is good, so any additional tuning on the implementation can happen after a merge too.

+ loop do
+ if load
+ records = batch_relation.to_a
+ ids = batch_relation.pluck(primary_key)

@sgrif

sgrif Aug 6, 2015

Member

Shouldn't this use .map to avoid the second query?

@siadat

siadat Aug 6, 2015

Contributor

There is no extra query; the relation is loaded, and the current implementation of pluck checks for that and uses its cached @records instance variable [1]. However, I like your suggestion, because this way we won't depend on the implementation of pluck anymore, e.g.:

ids = records.map(&:id)

[1] https://github.com/rails/rails/blob/master/activerecord/lib/active_record/relation/calculations.rb#L165

+ records = batch_relation.to_a
+ ids = batch_relation.pluck(primary_key)
+ relation_yielded = self.where(primary_key => ids).reorder(batch_order)
+ relation_yielded.load_records(records)

@sgrif

sgrif Aug 6, 2015

Member

Won't this cause it to ignore the reorder on the line above?

@siadat

siadat Aug 6, 2015

Contributor

Interesting point. 👍 The reorder here could be removed and everything will still work. The reason it was added is that the loaded records array is fetched using this exact order clause, and I thought it might be useful to include that in the yielded relation. I will remove it.

@sgrif

Member

sgrif commented Aug 6, 2015

The PR wasn't abandoned, we just don't have unlimited time. I left a few more comments, this looks fine beyond those two things.

siadat added a commit to siadat/rails that referenced this pull request Aug 6, 2015

Add ActiveRecord::Relation#in_batches

siadat added a commit to siadat/rails that referenced this pull request Aug 6, 2015

Add ActiveRecord::Relation#in_batches

siadat added a commit to siadat/rails that referenced this pull request Aug 7, 2015

Add ActiveRecord::Relation#in_batches
@siadat

Contributor

siadat commented Aug 7, 2015

@dhh @sgrif Thank you for your comments! Commit updated. 😄

dhh added a commit that referenced this pull request Aug 7, 2015

@dhh dhh merged commit b3f5d3c into rails:master Aug 7, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@dhh

Member

dhh commented Aug 7, 2015

👏

@siadat

Contributor

siadat commented Aug 7, 2015

Thank you! 🎉

@reshadman

reshadman Aug 17, 2015

Great discussion and PR 🎯

+ yielded_relation.load_records(records)
+ else
+ ids = batch_relation.pluck(primary_key)
+ yielded_relation = self.where(primary_key => ids)

@YorickPeterse

YorickPeterse Sep 6, 2016

@siadat I'm very late to the party, but what's the motivation for using pluck / map here instead of using a sub-query? Given a large enough batch size loading the data (even a single column) into memory could have a negative impact. To use a sub-query one only would have to do:

self.where(primary_key => batch_relation.select(primary_key))

Rails will automatically take care of the rest.
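
For reference, the sub-query form would produce SQL along these lines (a sketch, reusing the id-bounded batch relation from the PR description):

batch_relation = Post.where('id > ?', 2).order(:id).limit(2)
Post.where(id: batch_relation.select(:id)).to_sql
# => SELECT "posts".* FROM "posts" WHERE "posts"."id" IN
#      (SELECT "posts"."id" FROM "posts" WHERE (id > 2) ORDER BY "posts"."id" ASC LIMIT 2)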

@bogdan

bogdan Sep 6, 2016

Contributor

Using a subquery is slower on MySQL. I am not sure what kind of batch size it needs to be to be larger than 10 AR::Base objects.... and also most iterators would load the batch size of AR::Base into memory which is a terrible idea anyway.

@YorickPeterse

YorickPeterse Sep 6, 2016

@bogdan

Using a subquery is slower on MySQL.

This is not a given and depends on your version of MySQL and the query you're running. For example, MySQL 5.6 features a bunch of improvements to the query planner so it can better handle sub-queries. Even then, it feels weird for Rails to arrange things so that the least powerful database out there performs best at the cost of better databases performing worse.

I am not sure what kind of batch size it needs to be to be larger than 10 AR::Base objects

e.g. a batch size of 5,000 or 10,000; really anything above 1,000.

and also most iterators would load the batch size of AR::Base into memory which is a terrible idea anyway.

Even then the sub-query approach would be better as it removes the need for first plucking IDs into memory.

Having said all that, if it's better to discuss this in a separate issue I'll gladly create one.
