Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of duplicates for replace on has_many-through #33954

Conversation

febeling
Copy link
Contributor

@febeling febeling commented Sep 23, 2018

There is a bug in the handling of duplicates when assigning (replacing) associated records, which made the result dependent on whether a given record was associated already, before being assigned anew. E.g.

post.people = [person, person]
post.people.count
# => 2

while

post.people = [person]
post.people = [person, person]
post.people.count
# => 1

This change adds a test to provoke the former incorrect behavior, and fixes it.

Cause of the bug was the handling of record collections as sets, and using - (difference) and & (intersection) operations on them indiscriminately. This temporary conversion to sets would eliminate duplicates.

The fix calculates an occurrence distribution hash, with counts for each element. Based on these counts items are kept or removed in the difference and intersection operations.

Fixes #33942.

@rails-bot
Copy link

Thanks for the pull request, and welcome! The Rails team is excited to review your changes, and you should hear from @georgeclaghorn (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

This repository is being automatically checked for code quality issues using Code Climate. You can see results for this analysis in the PR status below. Newly introduced issues should be fixed before a Pull Request is considered ready to review.

Please see the contribution instructions for more information.

@febeling
Copy link
Contributor Author

Just fixed the naming mixup between union and intersection.

@petestreet
Copy link

petestreet commented Sep 26, 2018

Hi @febeling! This is a great find. There were a couple things I noticed that might be improved:

  • typo in occurences method name :)
  • not sure about the "from_set" / "as_set" terminology, as these methods are dealing with arrays and hashes rather than actual Sets.

I also took inspiration from this stackoverflow answer, which accomplishes the same work that this PR is doing with the map and zip operations, but a little more succinctly. A potential implementation would be to modify the new code in has_many_through_association.rb to the following:

def difference(a, b)
  counts_b = occurrences(b)

  a.reject do |object|
    occurrence?(counts_b, object)
  end
end

def intersection(a, b)
  counts_b = occurrences(b)

  a.select do |object|
    occurrence?(counts_b, object)
  end
end

def occurrences(array)
  array.each_with_object(Hash.new(0)) do |object, counts|
    counts[object] += 1
  end
end

def occurrence?(counts, object)
  counts[object] > 0 && counts[object] -= 1
end

I ran benchmarks of the original version against the modified code in this gist. The speed of execution is comparable (within a ~5% margin of error), but this modified code uses roughly 5x less memory.

@febeling
Copy link
Contributor Author

Thanks for your thorough review, @petestreet. The optimization you have researched looks absolutely valid, and I've changed the code to apply it. I also changed the method names to some I find more descriptive of the now changed algorithm. I hope you follow with these. What do you think?

@petestreet
Copy link

Looks good, @febeling. Potentially would want to have comments indicating that mark_occurrence and distribution go together with difference/intersection, but Rails core members might have better opinions on style there.

@febeling
Copy link
Contributor Author

febeling commented Sep 29, 2018

I realize the methods intersection and difference are a bit alien entities in this class. They might be better located in an Array extension. Because they have little connection with the rest of this AR class, I also abstained from adding tests for them (which would be alien to the test suite) - something that otherwise seems called for, given the complexity of the implementation. (Edit: just added some, as I had a change of heart)

So I'd love to hear an opinion by a core member if, firstly, this should move over to, say activesupport/core_ext/array or some other more appropriate place. I'm sure there are comparable cases. Cases where code is general in applicability, but rarely used. And if that's considered good material for AS, or not.

Secondly, what can be done to help this get merged? I do realize this is plenty of code for a bug fix. The reason for that is, duplicates weren't actually using the correct operators up to now, meaning the implementation was actually missing (for a very narrow edge case, to be fair: partial replacement while also using duplicates - something that's uncommon generally, and not even possible with regular has_many).

@febeling
Copy link
Contributor Author

Just added very basic test examples for illustration of intersection and difference for arrays in a separate commit.

@febeling
Copy link
Contributor Author

@georgeclaghorn Any thoughts on this fix?

@febeling febeling force-pushed the inconsistent-assignment-has-many-through-33942 branch from c528c54 to ffface6 Compare November 4, 2018 12:06
@foucist
Copy link

foucist commented Nov 5, 2018

I'm curious if perhaps it makes sense to define & and - methods directly in ActiveRecord::Associations::HasManyThroughAssociation rather than introducing the intersection/difference methods.

Copy link
Member

@tenderlove tenderlove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation in this PR looks good to me, but can we change the test around a little? Thanks!

assert_equal association.send(:difference,
[1, 1, 2, 3, 3],
[1, 3, 3, 3, 4]
), [1, 2]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer we don't use send on an internal API. Is the above integration test not enough to prevent regression? If not, can we add another integration test? That way we are free to change the internal API later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review!

Yes, good point. I looked at the code more, and I think we're good with only the integration test.

I tossed the respective commit and pushed again.

There was a bug in the handling of duplicates when
assigning (replacing) associated records, which made the result
dependent on whether a given record was associated already before
being assigned anew. E.g.

    post.people = [person, person]
    post.people.count
    # => 2

while

    post.people = [person]
    post.people = [person, person]
    post.people.count
    # => 1

This change adds a test to provoke the former incorrect behavior, and
fixes it.

Cause of the bug was the handling of record collections as sets, and
using `-` (difference) and `&` (union) operations on them
indiscriminately. This temporary conversion to sets would eliminate
duplicates.

The fix is to decorate record collections for these operations, and
only for the `has_many :through` case. It is done by counting
occurrences, and use the record together with the occurrence number as
element, in order to make them work well in sets. Given

    a, b = *Person.all

then the collection used for finding the difference or union of
records would be internally changed from

     [a, b, a]

to

     [[a, 1], [b, 1], [a, 2]]

for these operations. So a first occurrence and a second occurrence
would be distinguishable, which is all that is necessary for this
task.

Fixes rails#33942.
@febeling febeling force-pushed the inconsistent-assignment-has-many-through-33942 branch from ffface6 to f915758 Compare November 6, 2018 19:09
@febeling
Copy link
Contributor Author

@tenderlove I think the concerns are addressed, wdyt?

Copy link
Member

@tenderlove tenderlove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Sorry for the delayed feedback.

@tenderlove tenderlove merged commit 023a840 into rails:master Nov 20, 2018
@febeling
Copy link
Contributor Author

@tenderlove thanks!

@febeling
Copy link
Contributor Author

@tenderlove Actually, I’m really stoked about suddenly becoming a Rails contributor 🎉😁

rafaelfranca pushed a commit that referenced this pull request Nov 20, 2018
…any-through-33942

Fix handling of duplicates for `replace` on has_many-through
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicates lost on collection=objects and collection_singular_ids=ids for has_many :through association
8 participants