Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashSet should implement union #6253

Closed
scabug opened this issue Aug 16, 2012 · 17 comments
Closed

HashSet should implement union #6253

scabug opened this issue Aug 16, 2012 · 17 comments

Comments

@scabug
Copy link

@scabug scabug commented Aug 16, 2012

HashSet does not implement union. That means that creating the union of two sets is very inefficient. Adding a large set to a small set creates the entire set anew.

scala> val a = Set.empty[Int]
a: scala.collection.immutable.Set[Int] = Set()

scala> val b = Set(1,2,3,4,5)
b: scala.collection.immutable.Set[Int] = Set(5, 1, 2, 3, 4)

scala> val c = a union b
c: scala.collection.immutable.Set[Int] = Set(5, 1, 2, 3, 4)

scala> c eq b
res0: Boolean = false

c is a completely new set even though it contains the exact same elements as b and could therefore be eq b.

Clearly this is not good. A set should have the most efficient possible implementation for intersection and union. The implementation of filter for set (#6196) allows for a fast intersection. The missing operation union is already implemented in HashSet as merge. So all that needs to be done is to adapt the merge algorithm of HashMap to HashSet.

@scabug
Copy link
Author

@scabug scabug commented Aug 16, 2012

Imported From: https://issues.scala-lang.org/browse/SI-6253?orig=1
Reporter: @rklaehn
Assignee: @rklaehn
Affected Versions: 2.10.0-M6
See #6196

@scabug
Copy link
Author

@scabug scabug commented Aug 31, 2012

Simon (symon) said:
I disagree with Rüdiger on the fact that c should be equal to b in the previous example, given that eq is the pysical equality, not the verification that the content is the same. Nevertheless, there is indeed a major performance issue on Set. I'm using the mutable version of HashSet, and here is an example snippet :

def grade: Double = {
  val setA: HashSet = // get from somewhere else
  val setB: HashSet = // get from somewhere else
  if ((setA size) == 0 || (setB size) == 0) return 0
  else return (setA & setB size) / (setA | set B size)
}

This piece of code is repeated a lot of times in a loop, and the entire loop is executed in around 4.5 sec.

But when I replace the union with somethin else...

def grade: Double = {
  val setA: HashSet = // get from somewhere else
  val setB: HashSet = // get from somewhere else
  if ((setA size) == 0 || (setB size) == 0) return 0
  else return (setA & setB size) / ((setA size) + (setB size))  // just a gross approximation for testing purpose
}

... then the execution time goes down to around 0.35 sec.

So, I think this is a faulty behavior and should be corrected.

As a quick patch, I used size(A) + size(B) - size(A intersect B), but this only works because I just want the size.

@scabug
Copy link
Author

@scabug scabug commented Aug 31, 2012

@soc said:
I disagree on that and agree with Rüdiger. It is imho completely reasonable to expect that a no-op like union with an empty set doesn't cause needless copying, that's one of the key points of immutable, persistent collections.

@scabug
Copy link
Author

@scabug scabug commented Aug 31, 2012

Simon (symon) said:
hmmm, I think I see your point... There is nevertheless a big performance issue in the case I presented. Should I open a new ticket with it or is it close enough to this one ?

@scabug
Copy link
Author

@scabug scabug commented Aug 31, 2012

@soc said:
Yes, I agree on the performance part of course.
As I think that the fix for one issue doesn't necessarily come with the fix for the other issue, I would suggest to first search whether there already exists an open ticket for your issue and if not, creating a new ticket and linking this one to the new one, so that the discussion is not lost.

@scabug
Copy link
Author

@scabug scabug commented Aug 31, 2012

@rklaehn said:
I just wanted to let you know that I am working on additions to HashSet to implement all set operations (union, intersection, diff and subsetOf) as bulk Tree/Tree operations. This gives a pretty significant speedup in many cases, especially when the objects in question have an expensive hashCode and equals implementation.

Unfortunately Josh Suereth told me that right now is not a good time to get this included to the scala library, since they are very busy with getting 2.10 out of the door. I managed to get a few small improvements in, but the bulk operations will have to wait.

@simon: you are requiring union (|) and intersection (&). The bulk union operation would take care of making union as fast as possible. So maybe write a ticket for intersection?

Regarding the eq thing: as a general rule, whenever you call a method on an immutable collection that returns an object that is equal to the original object, you should try to return the original object whenever this is possible without changing the asymptotic complexity. There are some operations where this is not possible, but for the set operations on HashSets it should always be possible! I even have a dedicated test for that in my HashSet improvements.

val x1 = x.someOp(...) // if(x==x1) then x1 should be eq to x!

@scabug
Copy link
Author

@scabug scabug commented Mar 10, 2013

@vigdorchik said:
@rüdiger Can you submit a PR for 2.11?

@scabug
Copy link
Author

@scabug scabug commented Mar 29, 2013

@rklaehn said:
Eugene: I have just updated my local fork and am checking if the tests still pass. Against which branch should I submit a pull request once I am convinced that everything works? master?

@scabug
Copy link
Author

@scabug scabug commented Mar 30, 2013

@rklaehn said:
I decided to tweak these methods in a separate repo before submitting them. If anybody wants to take a look, https://github.com/rklaehn/hashset

The current version is significantly faster in most cases, especially for elements with expensive equals and hashcode and obviously has much better structural sharing. But it is still slower in some degenerate cases, so I am not done yet.

@scabug
Copy link
Author

@scabug scabug commented Oct 31, 2013

@Ichoran said:
Are you done yet? Now would be the time to get improvements into 2.11.

@scabug
Copy link
Author

@scabug scabug commented Nov 5, 2013

@rklaehn said:
I have a version of all set/set operations (union, intersection, subsetOf, ...) that is well tested and is much faster on the average case (factor 2 to 4). More importantly, it uses structural sharing whenever possible. But it is slightly slower in some degenerate cases, and I am using some implementation techniques which might be controversial.

For example I use null as a guard value for the empty set and then convert null to HashSet.empty only in the publicly accessible methods. I also use a mutable buffer to build the new arrays of the set to be created. This is safe since the buffer is only accessed from a single thread, but it is not exactly straightforward.

So all in all it is a pretty major code change. If you think there is a chance to get this into scala 2.11 I will try to find some time to integrate it. But I am very busy at work right now.

@scabug
Copy link
Author

@scabug scabug commented Nov 5, 2013

@Ichoran said:
Well, I'm also going to be pressed to get the additions I want into the collections library. I'll try to get this in also if you submit it with tests, but if you're too busy with work let's defer it to 2.12 (I can't promise inclusion, only attention).

@scabug
Copy link
Author

@scabug scabug commented Nov 5, 2013

@rklaehn said:
When is scala 2.11 due? I have a week of holiday left this year, so maybe I can work on it then.

@scabug
Copy link
Author

@scabug scabug commented Nov 6, 2013

@Ichoran said:
Any significant changes would have to be in (including review and testing) by Friday to make 2.11 without a very compelling argument that we are missing an essential feature.

@scabug
Copy link
Author

@scabug scabug commented Dec 26, 2013

@rklaehn said:
Damn. Seems that I missed the deadline. Do you anticipate any major collection redesign, or would it be OK to just add a pull request against 2.12?

@scabug
Copy link
Author

@scabug scabug commented Dec 26, 2013

@Ichoran said:
You can add a pull request against 2.11 and I'll evaluate. There's a reasonable chance that it will be fine for M8 (and if not we can just delay it to 2.12).

@scabug
Copy link
Author

@scabug scabug commented Jan 16, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant