
subtypes: fast path for Union/Union subtype check #14277

Merged: 1 commit, Dec 28, 2022

Conversation

huguesb (Contributor) commented Dec 10, 2022

Enums are exploded into a Union of Literals when narrowed.

Conditional branches on enum values can result in multiple distinct narrowings
of the same enum, which are later subject to subtype checks (most notably via
is_same_type, when exiting a frame context in the binder). Such checks had
quadratic complexity, O(N*M), where N and M are the number of entries in each
narrowed enum variable, leading to drastic slowdowns when any of the enums
involved has a large number of values.
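A hedged sketch of the kind of code that triggers this (enum name and members hypothetical): each comparison narrows the variable to a Union of the remaining Literal members, and merging branches in the binder then runs subtype checks between those unions.

```python
from enum import Enum

class Color(Enum):
    # Imagine hundreds of members here; narrowing explodes the type of a
    # Color-typed variable into a Union of the remaining Literal members.
    RED = 1
    GREEN = 2
    BLUE = 3

def handle(c: Color) -> int:
    # Each branch narrows `c` further; when the branches merge, the
    # binder compares the narrowed Union types against each other.
    if c is Color.RED:
        return 0
    elif c is Color.GREEN:
        return 1
    else:
        return 2
```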

Implement a linear-time fast path where literals are quickly filtered, with
a fallback to the slow path for more complex values.
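A minimal sketch of the linear-time idea (simplified stand-ins, not mypy's actual types or code): hashable literal members of the right-hand union are collected into a set, so each literal on the left is matched in O(1), and the pairwise slow path is reserved for whatever remains.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Lit:
    """Hypothetical stand-in for a Literal type member."""
    value: object

def union_subtype_fast_path(
    left: Sequence[object],
    right: Sequence[object],
    slow_is_subtype: Callable[[object, object], bool],
) -> bool:
    """Return True if every member of `left` matches some member of
    `right`, filtering literal-vs-literal matches via a set lookup."""
    right_literals = {item for item in right if isinstance(item, Lit)}
    for item in left:
        if isinstance(item, Lit) and item in right_literals:
            continue  # fast path: O(1) set membership
        # Slow path: fall back to pairwise subtype checks.
        if not any(slow_is_subtype(item, r) for r in right):
            return False
    return True
```

With N left members and M right members, literal-heavy unions are checked in roughly O(N + M) rather than O(N*M).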

In our codebase there is one method with a chain of a dozen if statements
operating on instances of an enum with hundreds of values. Before the
regression it was typechecked in less than 1s; after the regression it took
over 13min. This patch fully fixes the regression for us.

Fixes #13821


mypy/subtypes.py Outdated
@@ -269,6 +270,11 @@ def is_same_type(
)


class _SubtypeCheck(Protocol):
def __call__(self, left: Type, right: Type, *, subtype_context: SubtypeContext) -> bool:
Contributor

I'm not too used to mypy's codebase -- is this the convention or is the convention to use mypy's extensions (which include a kwarg type for callables)

Member

Let's use protocols, the extensions are soft-deprecated in my view in favor of callback protocols.

AlexWaygood (Member) commented Dec 11, 2022

> the extensions are soft-deprecated in my view in favor of callback protocols.

Not just in your view — mypy's documentation officially states that they're deprecated :) https://mypy.readthedocs.io/en/stable/additional_features.html#extended-callable-types
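For illustration, a minimal callback protocol along the lines of the `_SubtypeCheck` class in the diff above (names and signature simplified, purely hypothetical): unlike a plain `Callable[...]` annotation, a protocol's `__call__` can declare keyword-only parameters.

```python
from typing import Protocol

class _Check(Protocol):
    # Callable[...] cannot express the keyword-only `strict` parameter,
    # but a callback protocol can spell out the full signature.
    def __call__(self, left: int, right: int, *, strict: bool = False) -> bool: ...

def leq(left: int, right: int, *, strict: bool = False) -> bool:
    return left < right if strict else left <= right

check: _Check = leq  # type-checks: the signatures are compatible
```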

huguesb (Contributor, Author) commented Dec 21, 2022

Added some numbers about the perf impact to hopefully motivate review/merge: <1s to >13min due to the regression, and back to <1s after the fix.

huguesb (Contributor, Author) commented Dec 21, 2022

Rebased to resolve a merge conflict with @JukkaL's recent changes to _is_subtype in #14325.


JukkaL (Collaborator) commented Dec 21, 2022

Could you try the newly added tool misc/perf_compare.py to compare performance against master? I don't doubt that this PR makes things faster, but it would be great to see if the tool works as expected.

huguesb (Contributor, Author) commented Dec 22, 2022

> Could you try the newly added tool misc/perf_compare.py to compare performance against master? I don't doubt that this PR makes things faster, but it would be great to see if the tool works as expected.

I did. A few observations:

  • it's very slow
  • the timing is very noisy on my system (macOS 12.5.1, 2018-vintage MacBook Pro): each run takes ~16.7 ± 0.8s (i.e. the noise floor is ~10%)
  • the final calculation doesn't flag outliers (the first run for my test branch was >1s off because another CPU-intensive process was finishing up at the same time)
  • the reported impact is +2.6% (adjusted to +1.8% taking the outlier into account), but this is below the noise floor for my system...
  • have you considered using hyperfine?

JukkaL (Collaborator) commented Dec 22, 2022

I ran perf_compare on my Linux desktop and here are the results (HEAD is this PR):

master                    9.654s (0.0%)
HEAD                      9.815s (+1.7%)

This is similar to what you measured and seems real. Can you try to optimize this further -- I bet you'd need to use a faster code path for small unions or something? I can also give it a try, but it might take a while because of holidays.

> it's very slow

Yeah, that can't be avoided right now, unfortunately. Compilation is kind of slow, and we need to compile to get realistic results. Also without multiple measurements the noise level is too high.

> the timing is very noisy on my system (macOS 12.5.1, 2018-vintage MacBook Pro): each run takes ~16.7 ± 0.8s (i.e. the noise floor is ~10%)

On my 2019-vintage Linux desktop noise can be ~5%, and we can work around that to some degree by taking 15 measurements. Even if each measurement has 10% noise, the average over many measurements will typically have much less expected noise.
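The standard-error argument can be sanity-checked with a quick simulation (synthetic numbers, purely illustrative): the spread of a 15-sample mean shrinks by roughly sqrt(15) ≈ 3.9x relative to a single run.

```python
import random
import statistics

random.seed(0)

def one_run() -> float:
    # Simulated benchmark run: true cost 10s with ~10% Gaussian noise.
    return random.gauss(10.0, 1.0)

singles = [one_run() for _ in range(1000)]
means15 = [statistics.mean(one_run() for _ in range(15)) for _ in range(1000)]

print(f"single-run stdev:  {statistics.stdev(singles):.3f}s")
print(f"15-run-mean stdev: {statistics.stdev(means15):.3f}s")
```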

If you don't have access to a desktop computer, a big enough cloud instance might produce more precise results and wouldn't be expensive, since you'd only need it for less than one hour. (Obviously this may not be practical.)

> the final calculation doesn't flag outliers

Good point, I should add automatic outlier filtering to the tool.

> the reported impact is +2.6% (adjusted to +1.8% taking the outlier into account), but this is below the noise floor for my system...

As I discussed above, the result when outliers are taken out may actually be fairly accurate due to averaging over multiple samples.

> have you considered using hyperfine?

I haven't seen it before, but it looks interesting. We could add it as an optional "backend" to the script. (I don't want to require any extra deps by default.)

huguesb (Contributor, Author) commented Dec 22, 2022

> I ran perf_compare on my Linux desktop and here are the results (HEAD is this PR):
>
> master                    9.654s (0.0%)
> HEAD                      9.815s (+1.7%)
>
> This is similar to what you measured and seems real. Can you try to optimize this further -- I bet you'd need to use a faster code path for small unions or something? I can also give it a try, but it might take a while because of holidays.

Probably the biggest thing to do is to decouple the Instance/Union and Union/Union code paths, which I combined to reduce duplication. The former is more common by far, and the extra set creations and few extra branches are probably to blame for any observable regression.

I can definitely push an updated version for that. I can also play with further micro-optimizations, but that honestly seems a little silly when weighing a slowdown below the noise floor in the common case against a 7800% speedup in the slow case.

> > it's very slow
>
> Yeah, that can't be avoided right now, unfortunately. Compilation is kind of slow, and we need to compile to get realistic results. Also without multiple measurements the noise level is too high.

The repeat checkout/compilation cost is a real problem for repeated measurements when iterating through possible approaches though. At the very least you'd want to keep the reference binary around for as long as possible instead of recompiling it every time.

huguesb (Contributor, Author) commented Dec 22, 2022

Result for the newest commit:

master                    15.925s (0.0%)
gh-13821                  15.973s (+0.3%)

@github-actions (bot) commented

According to mypy_primer, this change has no effect on the checked open source code. 🤖🎉

JukkaL (Collaborator) commented Dec 28, 2022

> I can also play with further micro-optimizations, but that honestly seems a little silly when weighing a slowdown below the noise floor in the common case against a 7800% speedup in the slow case.

Context: I've spent several days recently reducing the impact of many (mostly) minor performance regressions. The 7800% speedup is amazing, and if we can have the speedup without impacting performance in the average case, it's even better!

A slowdown has an impact even if it's below the measurement noise floor. Mypy accumulated maybe 15-20% overall slowdown in 2022 because of many minor performance regressions (see #14358). If we had similar yearly slowdowns for 5 years, mypy runtimes would increase by 100%. These small regressions are hard to fix months later, so if I can find a potential regression, my preference is to fix it even before merging the PR.
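The compounding arithmetic behind that estimate (illustrative figures only): a ~15% slowdown per year, repeated for 5 years, roughly doubles runtimes.

```python
# 15% yearly slowdown compounded over 5 years (hypothetical figures)
growth = 1.15 ** 5
print(f"total slowdown after 5 years: {(growth - 1) * 100:.0f}%")  # prints 101%
```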

> The repeat checkout/compilation cost is a real problem for repeated measurements when iterating through possible approaches though. At the very least you'd want to keep the reference binary around for as long as possible instead of recompiling it every time.

A good idea. Added an issue to track this: #14359

JukkaL (Collaborator) left a review comment


Thanks for the PR and for persisting with the final tweaks! I love the huge speedup.

Successfully merging this pull request may close these issues.

mypy 0.982 run 15 minutes (previous version needs only one minute)
7 participants