
Performance tweaks for Seq#diff and Seq#intersect #9365

Merged
merged 1 commit into scala:2.13.x on May 10, 2021

Conversation

martijnhoekstra
Contributor

@martijnhoekstra martijnhoekstra commented Dec 5, 2020

Special case diff on List by structurally sharing when possible (later reverted)

Benchmark results

Baseline

[info] # Run complete. Total time: 00:20:29
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ListBenchmark.diff_first_half 0 avgt 20 40.875 ± 0.393 ns/op
[info] ListBenchmark.diff_first_half 1 avgt 20 65.350 ± 1.487 ns/op
[info] ListBenchmark.diff_first_half 10 avgt 20 592.473 ± 59.870 ns/op
[info] ListBenchmark.diff_first_half 100 avgt 20 6392.043 ± 136.179 ns/op
[info] ListBenchmark.diff_first_half 1000 avgt 20 69990.051 ± 1484.883 ns/op
[info] ListBenchmark.diff_identical 0 avgt 20 42.340 ± 2.373 ns/op
[info] ListBenchmark.diff_identical 1 avgt 20 89.342 ± 1.600 ns/op
[info] ListBenchmark.diff_identical 10 avgt 20 745.050 ± 28.070 ns/op
[info] ListBenchmark.diff_identical 100 avgt 20 8239.964 ± 326.827 ns/op
[info] ListBenchmark.diff_identical 1000 avgt 20 96104.255 ± 4346.676 ns/op
[info] ListBenchmark.diff_last_half 0 avgt 20 44.855 ± 5.776 ns/op
[info] ListBenchmark.diff_last_half 1 avgt 20 87.086 ± 2.926 ns/op
[info] ListBenchmark.diff_last_half 10 avgt 20 551.048 ± 9.701 ns/op
[info] ListBenchmark.diff_last_half 100 avgt 20 6052.273 ± 216.878 ns/op
[info] ListBenchmark.diff_last_half 1000 avgt 20 67614.880 ± 710.706 ns/op
[info] ListBenchmark.diff_notIncluded 0 avgt 20 71.985 ± 0.874 ns/op
[info] ListBenchmark.diff_notIncluded 1 avgt 20 85.513 ± 3.370 ns/op
[info] ListBenchmark.diff_notIncluded 10 avgt 20 429.533 ± 20.771 ns/op
[info] ListBenchmark.diff_notIncluded 100 avgt 20 1790.248 ± 40.322 ns/op
[info] ListBenchmark.diff_notIncluded 1000 avgt 20 18322.706 ± 555.524 ns/op
[info] ListBenchmark.diff_single_last 0 avgt 20 74.841 ± 4.999 ns/op
[info] ListBenchmark.diff_single_last 1 avgt 20 97.721 ± 5.290 ns/op
[info] ListBenchmark.diff_single_last 10 avgt 20 274.728 ± 13.519 ns/op
[info] ListBenchmark.diff_single_last 100 avgt 20 2324.095 ± 362.581 ns/op
[info] ListBenchmark.diff_single_last 1000 avgt 20 27409.479 ± 1529.480 ns/op
[info] ListBenchmark.diff_single_mid 0 avgt 20 69.905 ± 3.296 ns/op
[info] ListBenchmark.diff_single_mid 1 avgt 20 96.262 ± 1.628 ns/op
[info] ListBenchmark.diff_single_mid 10 avgt 20 288.531 ± 23.111 ns/op
[info] ListBenchmark.diff_single_mid 100 avgt 20 2725.829 ± 108.093 ns/op
[info] ListBenchmark.diff_single_mid 1000 avgt 20 26140.306 ± 312.435 ns/op

List through the new Seq implementation

[info] # Run complete. Total time: 00:20:27
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ListBenchmark.diff_first_half 0 avgt 20 4.105 ± 0.028 ns/op
[info] ListBenchmark.diff_first_half 1 avgt 20 4.093 ± 0.029 ns/op
[info] ListBenchmark.diff_first_half 10 avgt 20 264.260 ± 10.194 ns/op
[info] ListBenchmark.diff_first_half 100 avgt 20 3364.628 ± 70.029 ns/op
[info] ListBenchmark.diff_first_half 1000 avgt 20 39072.612 ± 2671.788 ns/op
[info] ListBenchmark.diff_identical 0 avgt 20 3.887 ± 0.042 ns/op
[info] ListBenchmark.diff_identical 1 avgt 20 63.033 ± 1.916 ns/op
[info] ListBenchmark.diff_identical 10 avgt 20 368.556 ± 10.840 ns/op
[info] ListBenchmark.diff_identical 100 avgt 20 5861.276 ± 146.805 ns/op
[info] ListBenchmark.diff_identical 1000 avgt 20 69268.715 ± 1607.349 ns/op
[info] ListBenchmark.diff_last_half 0 avgt 20 4.266 ± 0.072 ns/op
[info] ListBenchmark.diff_last_half 1 avgt 20 65.870 ± 3.425 ns/op
[info] ListBenchmark.diff_last_half 10 avgt 20 272.576 ± 8.440 ns/op
[info] ListBenchmark.diff_last_half 100 avgt 20 3802.349 ± 81.459 ns/op
[info] ListBenchmark.diff_last_half 1000 avgt 20 52942.003 ± 9855.334 ns/op
[info] ListBenchmark.diff_notIncluded 0 avgt 20 5.940 ± 0.207 ns/op
[info] ListBenchmark.diff_notIncluded 1 avgt 20 65.802 ± 2.616 ns/op
[info] ListBenchmark.diff_notIncluded 10 avgt 20 194.225 ± 2.471 ns/op
[info] ListBenchmark.diff_notIncluded 100 avgt 20 1635.484 ± 43.555 ns/op
[info] ListBenchmark.diff_notIncluded 1000 avgt 20 15961.686 ± 342.366 ns/op
[info] ListBenchmark.diff_single_last 0 avgt 20 4.305 ± 0.036 ns/op
[info] ListBenchmark.diff_single_last 1 avgt 20 66.875 ± 0.930 ns/op
[info] ListBenchmark.diff_single_last 10 avgt 20 224.710 ± 10.345 ns/op
[info] ListBenchmark.diff_single_last 100 avgt 20 1723.869 ± 25.487 ns/op
[info] ListBenchmark.diff_single_last 1000 avgt 20 17669.288 ± 820.160 ns/op
[info] ListBenchmark.diff_single_mid 0 avgt 20 4.184 ± 0.036 ns/op
[info] ListBenchmark.diff_single_mid 1 avgt 20 68.890 ± 2.777 ns/op
[info] ListBenchmark.diff_single_mid 10 avgt 20 228.758 ± 5.395 ns/op
[info] ListBenchmark.diff_single_mid 100 avgt 20 1731.472 ± 22.302 ns/op
[info] ListBenchmark.diff_single_mid 1000 avgt 20 16487.683 ± 389.199 ns/op

New List:

[info] # Run complete. Total time: 00:20:26
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ListBenchmark.diff_first_half 0 avgt 20 2.706 ± 0.107 ns/op
[info] ListBenchmark.diff_first_half 1 avgt 20 2.671 ± 0.027 ns/op
[info] ListBenchmark.diff_first_half 10 avgt 20 176.566 ± 12.005 ns/op
[info] ListBenchmark.diff_first_half 100 avgt 20 2724.257 ± 69.220 ns/op
[info] ListBenchmark.diff_first_half 1000 avgt 20 30617.560 ± 2649.887 ns/op
[info] ListBenchmark.diff_identical 0 avgt 20 2.489 ± 0.060 ns/op
[info] ListBenchmark.diff_identical 1 avgt 20 2.989 ± 0.084 ns/op
[info] ListBenchmark.diff_identical 10 avgt 20 359.982 ± 17.850 ns/op
[info] ListBenchmark.diff_identical 100 avgt 20 5488.378 ± 121.688 ns/op
[info] ListBenchmark.diff_identical 1000 avgt 20 68650.204 ± 3627.994 ns/op
[info] ListBenchmark.diff_last_half 0 avgt 20 2.762 ± 0.027 ns/op
[info] ListBenchmark.diff_last_half 1 avgt 20 3.540 ± 0.078 ns/op
[info] ListBenchmark.diff_last_half 10 avgt 20 290.114 ± 4.281 ns/op
[info] ListBenchmark.diff_last_half 100 avgt 20 3755.760 ± 108.688 ns/op
[info] ListBenchmark.diff_last_half 1000 avgt 20 45800.311 ± 878.631 ns/op
[info] ListBenchmark.diff_notIncluded 0 avgt 20 4.710 ± 0.171 ns/op
[info] ListBenchmark.diff_notIncluded 1 avgt 20 9.982 ± 0.182 ns/op
[info] ListBenchmark.diff_notIncluded 10 avgt 20 241.703 ± 6.063 ns/op
[info] ListBenchmark.diff_notIncluded 100 avgt 20 1977.344 ± 23.295 ns/op
[info] ListBenchmark.diff_notIncluded 1000 avgt 20 20022.380 ± 375.221 ns/op
[info] ListBenchmark.diff_single_last 0 avgt 20 2.944 ± 0.078 ns/op
[info] ListBenchmark.diff_single_last 1 avgt 20 4.286 ± 0.060 ns/op
[info] ListBenchmark.diff_single_last 10 avgt 20 268.102 ± 4.289 ns/op
[info] ListBenchmark.diff_single_last 100 avgt 20 2273.511 ± 70.090 ns/op
[info] ListBenchmark.diff_single_last 1000 avgt 20 21951.718 ± 226.428 ns/op
[info] ListBenchmark.diff_single_mid 0 avgt 20 2.842 ± 0.046 ns/op
[info] ListBenchmark.diff_single_mid 1 avgt 20 4.417 ± 0.063 ns/op
[info] ListBenchmark.diff_single_mid 10 avgt 20 162.779 ± 1.520 ns/op
[info] ListBenchmark.diff_single_mid 100 avgt 20 1207.562 ± 25.183 ns/op
[info] ListBenchmark.diff_single_mid 1000 avgt 20 11083.343 ± 129.916 ns/op

Diff becomes much faster (~20 times) for the special cases of length 0 and 1. For cases where the argument "runs out", structural sharing from that point onwards not only means using less memory, it also means a speedup proportional to the part not traversed -- traversing half the list before the other runs out means a speedup of about 50%.
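Roughly, the structural-sharing idea looks like the sketch below (illustrative names only, not the PR's exact code): build an occurrence count of the argument, walk the list, and once the counts are exhausted return the remaining tail as-is instead of rebuilding it.

```scala
import scala.collection.mutable

// Sketch of structural sharing in a List diff (not the actual stdlib source).
def diffSketch[A, B >: A](xs: List[A], that: collection.Seq[B]): List[A] =
  if (that.isEmpty) xs
  else {
    val occ = mutable.HashMap.empty[B, Int]
    for (y <- that) occ.updateWith(y) {
      case None    => Some(1)
      case Some(n) => Some(n + 1)
    }
    var kept = List.empty[A]   // kept elements, in reverse order
    var rest = xs
    while (rest.nonEmpty && occ.nonEmpty) {
      val h = rest.head
      var keep = false
      occ.updateWith(h) {
        case None    => keep = true; None  // not in `that`: keep the element
        case Some(1) => None               // last occurrence consumed: drop it
        case Some(n) => Some(n - 1)        // consume one occurrence: drop it
      }
      if (keep) kept = h :: kept
      rest = rest.tail
    }
    // Once `occ` is empty nothing more can be filtered out,
    // so the remaining tail is shared rather than copied.
    kept reverse_::: rest
  }
```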

Two benchmarks show an unexpected slowdown of about 10% for lists of 100 or 1000 items where the argument contains a single element not in the list (ListBenchmark.diff_notIncluded). I don't know what's up with that. I expected those to roughly converge with the case where the argument is a list consisting of only the last value, but those are 15%-20% faster in this PR. I ran the benchmarks on my laptop, so maybe it's some weird power-management artifact?

Also special case diff and intersect on Seq for the empty case on either side.
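For the Seq part, the empty-side special cases amount to something like the following sketch (hypothetical helper names; the real change lives in the standard library's SeqOps):

```scala
// Sketch only: skip building the occurrence map when either side is empty.
def diffSpecialCased[A](xs: Seq[A], that: Seq[A]): Seq[A] =
  if (xs.isEmpty || that.isEmpty) xs      // nothing to remove, or nothing left
  else xs.diff(that)                      // fall back to the general algorithm

def intersectSpecialCased[A](xs: Seq[A], that: Seq[A]): Seq[A] =
  if (xs.isEmpty || that.isEmpty) xs.take(0)  // intersection is necessarily empty
  else xs.intersect(that)
```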

@scala-jenkins scala-jenkins added this to the 2.13.5 milestone Dec 5, 2020
@martijnhoekstra martijnhoekstra added the library:collections PRs involving changes to the standard collection library label Dec 5, 2020
@martijnhoekstra martijnhoekstra marked this pull request as draft December 5, 2020 21:23

@martijnhoekstra martijnhoekstra marked this pull request as ready for review December 6, 2020 09:44
@lrytz lrytz requested a review from retronym December 7, 2020 10:27
@SethTisue
Member

@Ichoran interested in looking this one over?

Contributor

@NthPortal NthPortal left a comment


it looks correct to me, but I would like another review as well because I don't fully trust myself

@SethTisue SethTisue added the performance the need for speed. usually compiler performance, sometimes runtime performance. label Jan 15, 2021
None
}
case Some(1) => None
case Some(n) => Some(n - 1)
Member


So are these implementations trading off faster size=0/size=1 operations at the cost of these Some allocations? Because that doesn't sound performant. Perhaps I'm misunderstanding, but I thought I would ask.

Contributor Author

@martijnhoekstra martijnhoekstra Jan 15, 2021


This should at worst have the same number of allocations (when there are no shared elements), and in most cases fewer (potentially far fewer) allocations and fewer lookups.

In this PR, updateWith to decrement (n > 1) costs a hash lookup (to find the bucket), a Some allocation (to return the result), and another Some allocation (to hold the decremented result).

In the situation before, there was a hash lookup (to find the bucket), a Some allocation (to get the value to apply), then another hash lookup to find the bucket again, and then another default map allocation.

So the decrement costs an extra Some allocation, but saves a WithDefaultValue allocation and a hash lookup.

The other cases only look better: when the stored value is 1, we now skip the extra Some allocation and clear the bucket. For List, if that empties the map, we short-circuit and structurally share the remaining tail. That last part is unavailable on Seq, but there the remaining elements just do lookups on an empty HashMap, which, I just looked up, is actually no cheaper than a lookup in a non-empty HashMap, but it totally could be.

The last case is when the element is not in the underlying map. In the old code, that cost a lookup plus a Some allocation iff there was already something in the map; in the new code it also costs just the lookup.

The 0 and 1 special casing is only at the start, where this avoids building the map entirely and could be done independently of the code in this part.
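To make the trade-off concrete, here is a rough side-by-side of the two decrement strategies described above (illustrative only, not the exact before/after stdlib sources; the old variant is approximated with getOrElse rather than a withDefaultValue map):

```scala
import scala.collection.mutable

// New style: a single updateWith call per element, i.e. one hash lookup.
def decrementNew(occ: mutable.HashMap[Any, Int], y: Any): Option[Int] =
  occ.updateWith(y) {
    case None    => None          // element was never counted: map untouched
    case Some(1) => None          // last occurrence consumed: entry removed
    case Some(n) => Some(n - 1)   // one Some allocated for the new count
  }

// Old style: a read (with a default) followed by a separate write,
// i.e. two hash lookups per decremented element.
def decrementOld(occ: mutable.Map[Any, Int], y: Any): Unit = {
  val n = occ.getOrElse(y, 0)     // first lookup
  if (n > 0) occ(y) = n - 1       // second lookup plus the store
}
```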

@dwijnand
Member

@mkeskells you might find this interesting and perhaps even fancy reviewing it?

for (y <- sq) occ.updateWith(y) {
  case None => Some(1)
  case Some(n) => Some(n + 1)
}
Member

@joshlemer joshlemer Jan 15, 2021


Maybe we could avoid some allocations if we cache the Somes up to (for example) 10?

private object CachedSome {
  private[this] val cached: Array[Some[Int]] = Array.tabulate(10)(Some(_))
  def apply(i: Int): Some[Int] = if (0 <= i && i < 10) cached(i) else Some(i)
}

(Would have to measure it though)
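For instance, the counting loop from the excerpt above could then become something like this (hypothetical; CachedSome is the sketch from this comment, not stdlib code):

```scala
import scala.collection.mutable

def countOccurrences[B](sq: collection.Seq[B]): mutable.HashMap[B, Int] = {
  val occ = mutable.HashMap.empty[B, Int]
  for (y <- sq) occ.updateWith(y) {
    case None    => CachedSome(1)      // reuses a cached Some for small counts
    case Some(n) => CachedSome(n + 1)
  }
  occ
}
```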

Contributor Author


I don't trust my laptop with the benchmarks, or, honestly, want to dedicate it to running them (I don't think it's a coincidence that scala/scala-dev#753 was opened shortly after @NthPortal looked at this PR).

If this is worth it, it's plausible it's worth doing globally, on Some.apply, so that any map that keeps a count of things takes advantage of it (though at the cost of a probably correctly branch-predicted instanceOf check), which would be orthogonal to this PR.

On the gripping hand, maybe that's me internally rationalizing my way out of just writing and running the damn benchmarks.

@SethTisue
Member

@scala/collections opinions here? (and/or volunteers to do some benchmarking?)

@SethTisue SethTisue modified the milestones: 2.13.5, 2.13.6 Feb 9, 2021
@SethTisue SethTisue merged commit 76d189d into scala:2.13.x May 10, 2021
@SethTisue
Member

SethTisue commented May 10, 2021

thanks Martijn (and reviewers)!

@@ -614,6 +614,36 @@ sealed abstract class List[+A]
}
}

// Override for performance: traverse only as much as needed
// and share tail when nothing needs to be filtered out anymore
override def diff[B >: A](that: collection.Seq[B]): List[A] = {
Member


This override (I assume) causes a MiMa failure, see recent commit status in https://github.com/scala/scala/commits/2.13.x

Not sure why PR validation was green, maybe we're using a newer MiMa now. Also not sure who's right, the error could be a false positive.

Member


MiMa 0.8.0, the version this PR had, gives the same error.

On Martijn's branch, STARR was still 2.13.4, so I guess we should look for some 2.13.4 vs 2.13.5 difference?

Member


Ah, I bet the problem was the lack of #9459! cc @dwijnand

@SethTisue
Member

#9624 reverts the List#diff part, because bincompat; MiMa missed the problem because our build was misconfigured (which was fixed by #9459)

@SethTisue SethTisue changed the title Performance tweaks for List#diff, Seq#diff and Seq#intersect Performance tweaks for Seq#diff and Seq#intersect May 11, 2021
@som-snytt
Contributor

It seems there was a minor regression. It wasn't caught because Ichoran did not care to look this one over. ^

I see there was a bit of smoke around benchmarking and MiMa. I don't know if the collections correctness tests would have seen anything. This was one of those deeply buried lies, i.e., an asInstanceOf. paulp mentions in the commit how many array allocations it saves. Maybe the takeaway is that anything done in the name of performance should be kept in a special place, where the accumulated wisdom (or folly) can be consulted by future generations.

@martijnhoekstra
Contributor Author

@som-snytt I suspect the performance wins disappeared in #9624, where much of this PR was lost to MiMa's relentlessness.

@martijnhoekstra
Contributor Author

martijnhoekstra commented May 25, 2021

Is your regression in terms of performance or correctness, @som-snytt? Do you have some more details? Edit: I now see the other ticket.

@som-snytt
Contributor

som-snytt commented May 25, 2021

Yes, in the linked ticket: Array.empty[Double] intersect xs throws, because even an empty array knows its component type. The optimization on isEmpty uses ArraySeq.empty. Edit: I don't know if that partial partial reversion suffices for now. We need a surgical term for that. But not to worry, there were actually a few breaking changes that made Spark skip 2.13.6.
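For reference, the failing call being described (taking the report above at face value; not re-verified here):

```scala
val xs: Seq[Double] = Seq(1.0, 2.0)
// Reported to throw at runtime in the affected versions: per the comment, the
// isEmpty shortcut went through ArraySeq.empty, so the backing array no longer
// had Double as its component type when cast back to Array[Double].
Array.empty[Double] intersect xs
```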
