[mypyc] Speed up in operations for list/tuple #9004

Merged
merged 17 commits into from Sep 28, 2020

Contributor

@jdahlin jdahlin commented Jun 15, 2020

When the right-hand side of an in/not in operation is a literal
list/tuple, simplify it into direct equality comparison
expressions and use binary and/or to join them.
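
To illustrate the rewrite in plain Python terms, here is a small sketch of the semantics only. It is not the IR that mypyc generates, and the function names are made up for this example. For plain int operands the literal and expanded forms agree:

```python
def in_literal(x: int) -> bool:
    # Original form: membership test against a literal tuple.
    return x in (1, 2, 3)


def in_expanded(x: int) -> bool:
    # Expanded form: equality comparisons joined with `or`.
    return x == 1 or x == 2 or x == 3


def not_in_literal(x: int) -> bool:
    # Original form: negated membership test against a literal list.
    return x not in [1, 2, 3]


def not_in_expanded(x: int) -> bool:
    # `not in` expands to inequality comparisons joined with `and`.
    return x != 1 and x != 2 and x != 3
```

The expanded forms short-circuit on the first match (or first mismatch for `and`), and they avoid building and scanning a container at runtime.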

Part of mypyc/mypyc#726, but this only speeds up list/tuple.

This is my first contribution to mypy/mypyc, so please let me know if there's anything I can do to improve the pull request. I ended up creating new tree nodes (OpExpr/ComparisonExpr) inside IRBuilder, which probably doesn't generate IR as efficient as it could be. If that needs to change, let me know and please give me some pointers on how to build more efficient IR. I'm happy to add tests for the specific IR generated as well if needed.

I didn't run any macro benchmarks on mypy itself; I'd be happy to learn what's normally benchmarked and how to do so.

# before (x = 10)
2000000 loops, best of 5: 113 nsec per loop  # x in [1]
2000000 loops, best of 5: 116 nsec per loop  # x in [1, 2]
2000000 loops, best of 5: 128 nsec per loop  # x in [1, 2, 3]
2000000 loops, best of 5: 136 nsec per loop  # x in [1, 2, 3, 4]
2000000 loops, best of 5: 145 nsec per loop  # x in [1, 2, 3, 4, 5]

5000000 loops, best of 5: 88.7 nsec per loop  # x in (1)
5000000 loops, best of 5: 97.6 nsec per loop  # x in (1, 2)
2000000 loops, best of 5: 108 nsec per loop  # x in (1, 2, 3)
2000000 loops, best of 5: 118 nsec per loop  # x in (1, 2, 3, 4)
2000000 loops, best of 5: 129 nsec per loop  # x in (1, 2, 3, 4, 5)

# after (x = 10)
5000000 loops, best of 5: 54.8 nsec per loop  # x in [1]
5000000 loops, best of 5: 55.9 nsec per loop  # x in [1, 2]
5000000 loops, best of 5: 56 nsec per loop  # x in [1, 2, 3]
5000000 loops, best of 5: 55.5 nsec per loop  # x in [1, 2, 3, 4]
5000000 loops, best of 5: 55.1 nsec per loop  # x in [1, 2, 3, 4, 5]

5000000 loops, best of 5: 55 nsec per loop  # x in (1)
5000000 loops, best of 5: 55.1 nsec per loop  # x in (1, 2)
5000000 loops, best of 5: 54.9 nsec per loop  # x in (1, 2, 3)
5000000 loops, best of 5: 55.8 nsec per loop  # x in (1, 2, 3, 4)
5000000 loops, best of 5: 55.6 nsec per loop  # x in (1, 2, 3, 4, 5)

For reference, using CPython 3.8.2:

5000000 loops, best of 5: 57 nsec per loop  # x in (1)
5000000 loops, best of 5: 61.6 nsec per loop  # x in (1, 2)
5000000 loops, best of 5: 72 nsec per loop  # x in (1, 2, 3)
5000000 loops, best of 5: 76.8 nsec per loop  # x in (1, 2, 3, 4)
5000000 loops, best of 5: 84.6 nsec per loop  # x in (1, 2, 3, 4, 5)
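
The output format above resembles that of Python's timeit module; how the compiled code was actually timed is not stated in this thread. As a rough sketch, assuming plain interpreted CPython (not a mypyc-compiled module), similar per-loop numbers could be collected like this. The helper name `best_nsec_per_loop` and its defaults are made up for this example:

```python
import timeit


def best_nsec_per_loop(stmt: str, setup: str = "x = 10",
                       number: int = 200_000, repeat: int = 5) -> float:
    """Return the best time per loop in nanoseconds over `repeat` runs."""
    # timeit.repeat returns one total time (in seconds) per run of `number` loops.
    timings = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    return min(timings) / number * 1e9


if __name__ == "__main__":
    for stmt in ("x in [1]", "x in [1, 2, 3, 4, 5]", "x in (1, 2, 3, 4, 5)"):
        print(f"{best_nsec_per_loop(stmt):.1f} nsec per loop  # {stmt}")
```

Absolute numbers will differ by machine and Python version, so only the trend across sequence lengths is comparable.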

Collaborator

@TH3CHARLie TH3CHARLie left a comment

I like this specialization. cc @JukkaL @msullivan

mypyc/irbuild/expression.py (outdated)
Collaborator

@msullivan msullivan left a comment

This looks great, thanks! I've just got one thing, which you can either address now or leave a TODO for.

elif n_items < 16:
    bin_op = 'or' if e.operators[0] == 'in' else 'and'
    lhs = e.operands[0]
    exprs = (ComparisonExpr([cmp_op], [lhs, item]) for item in items)
Collaborator

I /think/ that since these expressions don't appear in the type table, they will get coerced to object, leading to some kind of pointless boxing. I guess that isn't actually that expensive for bools, but it's worth cleaning up. Fine to just put a TODO in for now if you don't feel like cleaning it up now.

Contributor Author

Thanks for the review, @msullivan. I'd be happy to fix this, but I would probably need some more specific pointers on what needs to be modified for that to work.

This comment was marked as outdated.

Contributor Author

@jdahlin jdahlin Jun 19, 2020

@msullivan I ended up with a perhaps somewhat hacky shortcut: treating all OpExpr/ComparisonExpr nodes without types as bool_primitive. That removed the excessive boxing/unboxing. With these changes it's significantly faster; I measured somewhere between 46% and 78% in micro benchmarks. It seems to trigger some happy path in the C compiler, as the length of the tuple/list is no longer relevant for performance (tested sequences of up to 16 int items).

Collaborator

Happy to hear it is a lot faster! I don't like this hack, though.

I think there are two ways forward:

  • Directly generate the code without creating an AST for it first. One way to do this would involve using shortcircuit_helper, though also it could probably just be done directly.
  • Put all of the generated expressions into the type table. This could probably be done ergonomically by adding a helper method that takes an expression and a type, adds it to the table, and returns the expression.

Historically though we haven't really done AST generation as part of compilation (I think mostly because it would require populating the type table), but it seems fine to allow if it is done tastefully.
(@JukkaL, do you agree?)
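
A toy sketch of the second option above, a helper that takes an expression and a type, records the pair in the type table, and returns the expression. All class and method names here (`ToyExpr`, `ToyComparison`, `ToyBuilder`, `typed`) are hypothetical stand-ins, not mypy or mypyc APIs, and types are plain strings for brevity:

```python
from typing import Dict


class ToyExpr:
    """Stand-in for an AST expression node (hypothetical)."""


class ToyComparison(ToyExpr):
    """Stand-in for a synthesized comparison such as `lhs == item`."""

    def __init__(self, operator: str, left: ToyExpr, right: ToyExpr) -> None:
        self.operator = operator
        self.left = left
        self.right = right


class ToyBuilder:
    """Toy builder holding a type table keyed by expression node."""

    def __init__(self) -> None:
        self.types: Dict[ToyExpr, str] = {}

    def typed(self, expr: ToyExpr, typ: str) -> ToyExpr:
        # Record the type of the synthesized node, then return the node so
        # the call can be used inline where the node is constructed.
        self.types[expr] = typ
        return expr


builder = ToyBuilder()
lhs, item = ToyExpr(), ToyExpr()
cmp_expr = builder.typed(ToyComparison("==", lhs, item), "bool")
```

Returning the expression is what keeps the call sites compact: each generated node can be wrapped at the point of construction instead of being registered in a separate statement.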

Contributor Author

No problem, I'll populate the type table with these types then. shortcircuit_helper scared me a bit; generating AST nodes seemed more straightforward to me for this issue.

Johan Dahlin added 8 commits June 25, 2020 20:08
When the right-hand side of an in/not in operation is a literal
list/tuple, simplify it into direct equality comparison
expressions and use binary and/or to join them.

Yields speedup of up to 46% in micro benchmarks.
This makes it easier to add type annotations for subtypes.
@msullivan
Collaborator

Looks good. Check out the lint failure though

mypyc/irbuild/expression.py (outdated)
Ensure we only operate on ComparisonExpr with at most one operator.

Co-authored-by: Tomer Chachamu <tomer.chachamu@gmail.com>
@JukkaL
Collaborator

JukkaL commented Sep 26, 2020

@msullivan @jdahlin I wonder what needs to be done (beyond fixing merge conflicts) to get this merged? This would be really nice to have.

@jdahlin
Contributor Author

jdahlin commented Sep 26, 2020 via email

@jdahlin
Contributor Author

jdahlin commented Sep 28, 2020

@JukkaL I've finished merging it with latest master, all tests pass now.

Collaborator

@TH3CHARLie TH3CHARLie left a comment

LGTM! Thanks for speeding up mypyc!

mypyc/test-data/irbuild-tuple.test (outdated)
Co-authored-by: Xuanda Yang <th3charlie@gmail.com>
@JukkaL JukkaL merged commit 8bf770d into python:master Sep 28, 2020
@JukkaL
Collaborator

JukkaL commented Sep 28, 2020

Thanks! I expect to see some nice improvements in benchmark results after the next nightly run (https://github.com/mypyc/mypyc-benchmarks).

@jdahlin
Contributor Author

jdahlin commented Sep 28, 2020

Thanks for merging it @JukkaL! Is there a way to see the results for the nightly build somewhere? I'm also curious about the effect of this on larger pieces of code.

@TH3CHARLie
Collaborator

> Thanks for merging it @JukkaL! Is there a way to see the results for the nightly build somewhere? I'm also curious about the effect of this on larger pieces of code.

https://github.com/mypyc/mypyc-benchmark-results/blob/master/reports/summary-microbenchmarks.md is the place to find the related results. The performance boost will be less significant on larger code, though.

@JukkaL
Collaborator

JukkaL commented Sep 28, 2020

For many optimizations, microbenchmarks work well to estimate the level of improvement. Even for a very good optimization, the impact on most major benchmarks can be below the measurement noise floor, or we might have no major benchmarks that happen to use the targeted feature in sufficient volume to be affected.

This should get gradually better as we add more benchmarks. If we have an optimization that isn't reflected in any existing benchmarks, it may be a good idea to add another (micro)benchmark to catch regressions in the future.

@TH3CHARLie
Collaborator

Judging from the results of the in_list and in_tuple microbenchmarks, the performance boost from this PR is huge.
