Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prune polymorphic implicits more aggressively #6580

Merged
merged 2 commits into from
Aug 10, 2018

Conversation

milessabin
Copy link
Contributor

@milessabin milessabin commented Apr 30, 2018

In rankImplicits, before we attempt to fully typecheck the pending candidate implicit, we first attempt to partially instantiate type variables in both the candidate and the target type and check for compatibility. If the compatibility check fails we can immediately prune the the candidate without having to fully typecheck it.

In the kinds of implicit searches typical of the inductive style found in shapeless and related libraries this can result in a drastic reduction in the search space and a corresponding reduction in compile times.

This commit is much simpler, more generally applicable, and less invasive than my earlier work on inductive implicits in #6481 which was doing similar pruning almost by accident. It turns out that almost all of the speedups in that earlier PR were due to the pruning rather than to any special casing of inductive patterns.

The compilation benchmark (a shapeless-style "select element by type from a large HList") from that PR is carried over here (in test/induction) and shows the same performance improvements,

1) baseline - scalac 2.13.x
2)            scalac 2.13.x with matchesPtInst

               (1)   (2)
  HList Size
   50           4     3
  100           7     3
  150          15     4
  200          28     4
  250          48     5
  300          81     6
  350         126     8
  400         189    11
  450         322    13
  500         405    16

           Compile time in seconds

As an added bonus users of shapeless and shapeless-based libraries which use shapeless's Lazy type will see benefits immediately without needing to wait for and port to byname implicit arguments.

@scala-jenkins scala-jenkins added this to the 2.13.0-M5 milestone Apr 30, 2018
@milessabin milessabin added the performance the need for speed. usually compiler performance, sometimes runtime performance. label Apr 30, 2018
@milessabin milessabin modified the milestones: 2.13.0-M5, 2.13.0-M4 Apr 30, 2018
milessabin added a commit to milessabin/scala that referenced this pull request Apr 30, 2018
This is a backport to 2.12.x of,

  scala#6580

In rankImplicits, before we attempt to fully typecheck the pending
candidate implicit, we first attempt to partially instantiate type
variables in both the candidate and the target type and check for
compatibility. If the compatibility check fails we can immediately prune
the the candidate without having to fully typecheck it.

In the kinds of implicit searches typical of the inductive style found
in shapeless and related libraries this can result in a drastic
reduction in the search space and a corresponding reduction in compile
times.

This commit is much simpler, more generally applicable, and less
invasive than my earlier work on inductive implicits in,

  scala#6481

which was doing similar pruning almost by accident. It turns out that
almost all of the speedups in that earlier PR were due to the pruning
rather than to any special casing of inductive patterns.

The compilation benchmark (a shapeless-style "select element by type
from a large HList") from that PR is carried over here (in
test/induction) and shows the same performance improvements,

  1) baseline - scalac 2.12.5
  2)            scalac 2.12.5 with matchesPtInst

               (1)   (2)
  HList Size
   50           4     3
  100           7     4
  150          14     4
  200          25     5
  250          46     5
  300          74     6
  350         118     8
  400         183    10
  450         298    12
  500         421    15

           Compile time in seconds

As an added bonus users of shapeless and shapeless based libraries which
use shapeless's Lazy type will see benefits immediately without needing
to wait for and port to byname implicit arguments.
Int ::
Int ::
Int ::
Int ::
Copy link
Member

@dwijnand dwijnand Apr 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could write these 10 per line..
easier to read, easier to count, lose the every-10-lines separator.. win win win

(edit: if you rebase or otherwise need to re-write this commit)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, but 55j works pretty well in vim ;-)

@milessabin
Copy link
Contributor Author

Ported to Dotty here: scala/scala3#4426.

Copy link
Contributor

@adriaanm adriaanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very exciting! I'd like to collect some more data, and have a few suggestions for code tweaks.

// can't be solved for. Rather than attempt to patch things up later we just skip those cases
// altogether.
def isInstantiable =
!isView || (wildPt.isGround && !wildPt.exists { case _: BoundedWildcardType => true ; case _ => false })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd pull out the isInstantiable check to the matchesPtInst call below, and cache whether wildPt contains TypeVars (!isGround) or BoundedWildcardTypes. I believe that test could be written more directly as !wildPt.exists { case _: BoundedWildcardType | _: TypeVar => true ; case _ => false }.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

try {
val tvars = allUndetparams map freshVar
val tpInstantiated = tp.instantiateTypeParams(allUndetparams, tvars)
if(!matchesPt(tpInstantiated, wildPt, allUndetparams)) false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see some statistics of which branch prunes the most in a few projects. Maybe we don't even need to solve types to still get some nice pruning? Could we first (1) check whether tp.instantiateTypeParams(allUndetparams, allUndetparams map (_ => WildcardType)) matches wildPt, then (2) check with type vars, and (3) finally solve?

Would like to see hit counts for branches (1), (2), (3).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried various subsets of what's being done here to see if I could reduce the amount of work being done. I have one more thing to try, but I'm confident that the solving is required. I did some logging of the branches and anecdotally I'd say that pruning occurred mainly post-solving and then about equally between the checkbounds call and the final instantiated matchesPt. Agreed that it'd be useful to quantify this, but obviously it's very dependent on the code being compiled ... it'd be good if we could agree on a suitable corpus.

@@ -0,0 +1,619 @@
// Compiled with ./build/pack/bin/scalac -J-Xss4M -J-Xmx2G test/induction/inductive-implicits-bench.scala
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move this to our compiler benchmark project. I don't want tests that are not run in CI -- they will just bitrot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Whereabouts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got the project, unsure of the form in which to include it. Currently asking here ... pointers welcome.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ...

@milessabin
Copy link
Contributor Author

I added some statistics counters to the 2.12.x branch and tested a few projects,

The top-level count is the number of calls of matchesPtInst and the subcounts are of mismatches at the first matchesPt, the checkbounds and the final matchesPt respectively.

test/induction/inductive.sh: // One big induction
implicits instantiated for pruning: 1002
  instantiated mismatches     : 500 (49.9%)

shapeless/coreJVM: compile // Library code, mostly explicit
implicits instantiated for pruning: 9801
  immediate mismatches        : 15 (0.2%)
  checkbounds failures        : 6732 (68.7%)
  instantiated mismatches     : 20 (0.2%)

shapeless/coreJVM: test:compile // Lots of small inductions
implicits instantiated for pruning: 80510
  immediate mismatches        : 6666 (8.3%)
  checkbounds failures        : 5436 (6.8%)
  instantiated mismatches     : 4840 (6.0%)

cats/coreJVM: compile // Library code, mostly explicit
implicits instantiated for pruning: 2208
  immediate mismatches        : 8 (0.4%)
  checkbounds failures        : 43 (1.9%)
  instantiated mismatches     : 4 (0.2%)

cats/coreJVM: test:compile // Surprisingly explicit
implicits instantiated for pruning: 2671
  immediate mismatches        : 28 (1.0%)
  instantiated mismatches     : 1 (0.0%)

@adriaanm I can add the counters on this branch if you think that would be valuable.

@milessabin
Copy link
Contributor Author

Benchmark moved to the compiler-benchmark project here. Also incorporated @dwijnand's layout suggestion ;-)

smarter added a commit to dotty-staging/dotty that referenced this pull request May 2, 2018
When narrowing a constraint, we check if its bounds are now equal to
remove it from the constraint set. This allows the corresponding type
variable to be instantiated in TyperState#gc instead of accumulating
uninstantiated type variables that we traverse again and again in
Inferencing#interpolateTypeVars.

This reduces the running time of the added testcase (copied from
scala/scala#6580) from ~6 seconds to ~4 seconds.
If we fully uncomment the testcase (and increase the JVM stack size),
the running time goes from > 300 seconds to ~24 seconds. This is not as
good as what scala/scala#6580 but is already
much more reasonable that what we did before.
smarter added a commit to dotty-staging/dotty that referenced this pull request May 2, 2018
When narrowing a constraint, we check if its bounds are now equal to
remove it from the constraint set. This allows the corresponding type
variable to be instantiated in TyperState#gc instead of accumulating
uninstantiated type variables that we traverse again and again in
Inferencing#interpolateTypeVars.

This reduces the running time of the added testcase (copied from
scala/scala#6580) from ~6 seconds to ~4 seconds.
If we fully uncomment the testcase (and increase the JVM stack size),
the running time goes from > 300 seconds to ~24 seconds. This is not as
good as what scala/scala#6580 achieves but is
already much more reasonable that what we did before.
milessabin added a commit to milessabin/scala that referenced this pull request May 3, 2018
This is a backport to 2.12.x of,

  scala#6580

In rankImplicits, before we attempt to fully typecheck the pending
candidate implicit, we first attempt to partially instantiate type
variables in both the candidate and the target type and check for
compatibility. If the compatibility check fails we can immediately prune
the the candidate without having to fully typecheck it.

In the kinds of implicit searches typical of the inductive style found
in shapeless and related libraries this can result in a drastic
reduction in the search space and a corresponding reduction in compile
times.

This commit is much simpler, more generally applicable, and less
invasive than my earlier work on inductive implicits in,

  scala#6481

which was doing similar pruning almost by accident. It turns out that
almost all of the speedups in that earlier PR were due to the pruning
rather than to any special casing of inductive patterns.

The compilation benchmark (a shapeless-style "select element by type
from a large HList") from that PR is carried over here (in
test/induction) and shows the same performance improvements,

  1) baseline - scalac 2.12.5
  2)            scalac 2.12.5 with matchesPtInst

               (1)   (2)
  HList Size
   50           4     3
  100           7     4
  150          14     4
  200          25     5
  250          46     5
  300          74     6
  350         118     8
  400         183    10
  450         298    12
  500         421    15

           Compile time in seconds

As an added bonus users of shapeless and shapeless based libraries which
use shapeless's Lazy type will see benefits immediately without needing
to wait for and port to byname implicit arguments.
@milessabin milessabin force-pushed the topic/implicit-poly-prune-2.13.x branch from 765b3ed to 75e7803 Compare May 3, 2018 18:36
@milessabin milessabin dismissed adriaanm’s stale review May 3, 2018 18:38

Addressed issues raised in review.

@milessabin
Copy link
Contributor Author

Rebased, addressed issues raised in review, moved benchmark to compiler-benchmark and fixed an issue with Specs2 revealed in a 2.12.x community build run.

smarter added a commit to smarter/scala that referenced this pull request May 3, 2018
@smarter
Copy link
Member

smarter commented May 3, 2018

This PR puzzled me: in the benchmark file, the only optimization this can possibly be doing is skipping the typechecking of 499 inHead calls, surely that can't be that expensive?

I'm not a JVM performance expert but I know how to use jstack, that was enough to realize that we spent most of our time not typechecking, or doing implicit searches, or anything useful like that. It turns out the real culprit is computing error messages!. These messages will never be displayed since some other branch of the implicit search will succeed, but they're eagerly computed (in Dotty, error messages are always lazy to avoid this sort of things).

As a proof of concept, here's a one-liner fix that gives the same speed-up as this PR: smarter@4b9b95c

@milessabin
Copy link
Contributor Author

@smarter blimey!

Your one-liner doesn't account for all of the speed up of this PR. For me it takes the time down from > 400s to about 25s, whereas this PR takes it down to 14s. But as you say, it does appear that the error message generation accounts for a remarkably large proportion of the time!

@smarter
Copy link
Member

smarter commented May 3, 2018

This was fun. For the record I got a very similar speed-up in Dotty by fixing a completely different issue: scala/scala3#4447 :)

@smarter
Copy link
Member

smarter commented May 3, 2018

Recommendation for an actual fix without a huge refactoring of how error messages are handled: change Implicits#fail and Implicits#failure to take a => String instead of a String.

@mkeskells
Copy link
Contributor

Ah - 94% of the CPU taken generating 'helpful' details.
@smarter 🥇 That must be a record or some form in the ratio of smallest fix to largest benefit.
@milessabin 🥈 at another 40% of what is left that a pretty healthy second place

I am very happy if I can get a solid 1% or 2% in a PR 😢 . I could not find a medal that low to compare

@smarter
Copy link
Member

smarter commented May 3, 2018

I am very happy if I can get a solid 1% or 2% in a PR cry

I think the secret is finding an unusual test case that stresses the compiler in some way :).

@milessabin
Copy link
Contributor Author

I can't reproduce the Jenkins failure locally. Finger's X'd it's spurious ...

@adriaanm
Copy link
Contributor

adriaanm commented Aug 3, 2018

Ping? I don't think this is crucial for M5. It can happen post 2.13.0 as far as I'm concerned.

@milessabin
Copy link
Contributor Author

I'm still trying to track down an issue which showed up in the 2.12.x community build. I'll update here as soon as I have a fix there.

@SethTisue SethTisue modified the milestones: 2.13.0-M5, 2.13.x Aug 7, 2018
@SethTisue
Copy link
Member

(this has now missed M5, but maybe it could still make 2.13.0-RC1. regardless, officially milestoning as 2.13.x based on Adriaan's remark)

@milessabin
Copy link
Contributor Author

Can you give me until COB tomorrow? I think I've fixed the issue revealed in the 2.12.x community build, but just need a little more time to confirm that.

@adriaanm
Copy link
Contributor

adriaanm commented Aug 7, 2018

Sure, ping me when you feel this is good to go and we'll try to get it in.

@milessabin milessabin force-pushed the topic/implicit-poly-prune-2.13.x branch from 1f44f82 to ff6fe9a Compare August 7, 2018 18:45
@milessabin
Copy link
Contributor Author

@adriaanm this is ready to go now.

It roughly halves compilation time for the large hlist select benchmark which is now in scala/compiler-benchmark here. On my machine the compile time is ~32s compared with ~61s on the prior commit.

@milessabin
Copy link
Contributor Author

@adriaan FYI, if you're wondering why this is still quite a bit slower than I was reporting earlier, so was I ...

I've just done a rather laborious git bisect and discovered that it's down to #6530. Currently on plain 2.13.x the benchmark compiles in 22s with #6530 reverted, and this PR compiles the benchmark in 14s with #6530 reverted, which is back to the numbers I originally reported.

That slowdown is completely orthogonal to this PR ... I'll see if there's anything I can do about it.

@milessabin
Copy link
Contributor Author

The slowdown caused by #6530 is fixed in #7012, and combined with this PR the "select from large hlist" benchmark compiles in ~16s again.

// Nothings which can't be solved for. Rather than attempt to patch things up later we
// just skip those cases altogether.
lazy val wildPtIsInstantiable =
!wildPt.exists { case _: BoundedWildcardType | _: TypeVar => true ; case tp if typeIsNothing(tp) => true; case _ => false }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced we should look for Nothing here. Also, we could remove the double negation (it's only used as !wildPtIsInstantiable) and call this wildPtNotFullyDefined. That then leads to another simplification: the RHS could be just !isFullyDefined(wildPt).

Copy link
Contributor Author

@milessabin milessabin Aug 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point taken on the double negation ... I'll fix.

We really do need the check for Nothing though: see test/files/pos/prune-poly-infer-nothing.scala ... without the check for Nothing the correctly inferred Nothing for O would be retracted resulting in a false negative. That test was minimized from a failure in http4s in the 2.12 community build.

Aside from that, isFullyDefined is too restrictive ... it rules out cases which should be allowed (eg. any case where the pt contains a vanilla wildcard).

In rankImplicits, before we attempt to fully typecheck the pending
candidate implicit, we first attempt to partially instantiate type
variables in both the candidate and the target type and check for
compatibility. If the compatibility check fails we can immediately prune
the the candidate without having to fully typecheck it.

In the kinds of implicit searches typical of the inductive style found
in shapeless and related libraries this can result in a drastic
reduction in the search space and a corresponding reduction in compile
times.

As an added bonus users of shapeless and shapeless based libraries which
use shapeless's Lazy type will see benefits immediately without needing
to wait for and port to byname implicit arguments.
@milessabin milessabin force-pushed the topic/implicit-poly-prune-2.13.x branch from ff6fe9a to ce3b37c Compare August 8, 2018 12:23
@milessabin
Copy link
Contributor Author

Also rebased.

@SethTisue
Copy link
Member

does this need a community build before merging?

@milessabin
Copy link
Contributor Author

@SethTisue It wouldn't hurt, but I'm confident that I worked through the issues on the 2.12.x community build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance the need for speed. usually compiler performance, sometimes runtime performance. release-notes worth highlighting in next release notes
Projects
None yet
9 participants