Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize int division #90564

Closed
gpshead opened this issue Jan 16, 2022 · 5 comments
Closed

optimize int division #90564

gpshead opened this issue Jan 16, 2022 · 5 comments
Labels
3.11 only security fixes performance Performance or resource usage

Comments

@gpshead
Copy link
Member

gpshead commented Jan 16, 2022

BPO 46406
Nosy @tim-one, @rhettinger, @gpshead, @mdickinson
PRs
  • bpo-46406: Faster single digit int division. #30626
  • bpo-46504: faster code for trial quotient in x_divrem() #30856
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2022-01-23.10:01:22.679>
    created_at = <Date 2022-01-16.23:38:29.594>
    labels = ['3.11', 'performance']
    title = 'optimize int division'
    updated_at = <Date 2022-01-25.01:06:10.163>
    user = 'https://github.com/gpshead'

    bugs.python.org fields:

    activity = <Date 2022-01-25.01:06:10.163>
    actor = 'tim.peters'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-01-23.10:01:22.679>
    closer = 'mark.dickinson'
    components = []
    creation = <Date 2022-01-16.23:38:29.594>
    creator = 'gregory.p.smith'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46406
    keywords = ['patch']
    message_count = 5.0
    messages = ['410735', '410736', '410739', '411358', '411541']
    nosy_count = 4.0
    nosy_names = ['tim.peters', 'rhettinger', 'gregory.p.smith', 'mark.dickinson']
    pr_nums = ['30626', '30856']
    priority = 'normal'
    resolution = None
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue46406'
    versions = ['Python 3.11']

    @gpshead
    Copy link
    Member Author

    gpshead commented Jan 16, 2022

    Based on a python-dev thread, we've come up with faster int division code for CPython's bignums.

    https://mail.python.org/archives/list/python-dev@python.org/thread/ZICIMX5VFCX4IOFH5NUPVHCUJCQ4Q7QM/#NEUNFZU3TQU4CPTYZNF3WCN7DOJBBTK5

    filing this issue for starters to attach a PR to. details forthcoming.

    @gpshead gpshead added 3.11 only security fixes performance Performance or resource usage labels Jan 16, 2022
    @gpshead
    Copy link
    Member Author

    gpshead commented Jan 16, 2022

    The PR was directly inspired by Mark Dickinson's code in the email thread directly using __asm__ to get the instruction he wanted. There is usually a way to make the compiler actually do what you intend. This appears to be it.

    Interestingly, experimenting with small code snippets rather than the entire longobject.c on gotbolt.org to check various compilers output does not always yield as nice of a result. (clang 11+ showed promise there, but this change benefits gcc equally as well in real world CPython microbenchmark timeit tests). https://godbolt.org/z/63eWPczjx was my playground code.

    $ ./b-clang13/python -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
    1500000 loops, best of 5: 450 nsec per loop
    $ ./b-clang13-new-basic-divrem1/python -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
    1500000 loops, best of 5: 375 nsec per loop
    $ ./b-gcc9/python -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
    1500000 loops, best of 5: 448 nsec per loop
    $ ./b-gcc9-new-basic-divrem1/python -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
    1500000 loops, best of 5: 370 nsec per loop
    

    That's on an AMD zen3 (x86_64). Also tested with other divisors, 17 is not specialized by the compiler. These were not --enable-optimizations builds, though the results remain similar on those for non-specialized values as x//10 turns into when using -fprofile-values on gcc9.

    Performance tests using other architectures forthcoming.

    A pyperformance suite run on a benchmark-stable host is worthwhile. I don't actually expect this to show up as significant in most things there; we'll see.

    The new code is not any more difficult to maintain than the previous code regardless.

    @gpshead
    Copy link
    Member Author

    gpshead commented Jan 17, 2022

    I tested my PR branch on 32-bit arm (raspbian bullseye) and the microbenchmark timing shows no change (within the noise across repeated runs). Unsurprising as division is entirely different on 32-bit arm.

    Raspbian uses armv6 for compatibility with the original rpi and rpi0. armv6 does not have an integer division instruction. (how RISCy of it) But that doesn't make a difference in this code as the final 32-bit arm ISA, armv7-a, only has a 32:32 divider. (armv8 aka aarch64 is 64-bit and uses a UDIV as one would expect)

    anyways, that satisfies me that it isn't making anything worse elsewhere.

    @mdickinson
    Copy link
    Member

    New changeset c7f20f1 by Gregory P. Smith in branch 'main':
    bpo-46406: Faster single digit int division. (bpo-30626)
    c7f20f1

    @tim-one
    Copy link
    Member

    tim-one commented Jan 25, 2022

    New changeset 7c26472 by Tim Peters in branch 'main':
    bpo-46504: faster code for trial quotient in x_divrem() (GH-30856)
    7c26472

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants