
long int bitwise ops speedup (patch included) #41337

Closed
gregsmith mannequin opened this issue Dec 18, 2004 · 17 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

gregsmith mannequin commented Dec 18, 2004

BPO 1087418
Nosy @tim-one, @loewis, @rhettinger, @mdickinson, @pitrou, @tiran
Files
  • long.diff: diffs to longobject.c (rel 2.4 version)
  • long_bitwise_ops_metd.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/tim-one'
    closed_at = <Date 2009-10-25.20:46:06.436>
    created_at = <Date 2004-12-18.05:22:50.000>
    labels = ['interpreter-core', 'performance']
    title = 'long int bitwise ops speedup (patch included)'
    updated_at = <Date 2009-10-25.20:46:06.434>
    user = 'https://bugs.python.org/gregsmith'

    bugs.python.org fields:

    activity = <Date 2009-10-25.20:46:06.434>
    actor = 'mark.dickinson'
    assignee = 'tim.peters'
    closed = True
    closed_date = <Date 2009-10-25.20:46:06.436>
    closer = 'mark.dickinson'
    components = ['Interpreter Core']
    creation = <Date 2004-12-18.05:22:50.000>
    creator = 'gregsmith'
    dependencies = []
    files = ['6389', '15195']
    hgrepos = []
    issue_num = 1087418
    keywords = ['patch']
    message_count = 17.0
    messages = ['47377', '47378', '47379', '47380', '47381', '47382', '59323', '94400', '94402', '94404', '94405', '94416', '94420', '94435', '94451', '94457', '94461']
    nosy_count = 8.0
    nosy_names = ['tim.peters', 'loewis', 'gregsmith', 'rhettinger', 'mark.dickinson', 'pitrou', 'christian.heimes', 'sopoforic']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue1087418'
    versions = ['Python 2.7', 'Python 3.2']

    gregsmith mannequin commented Dec 18, 2004

    The 'inner loop' for applying bitwise ops to longs is quite
    inefficient.

    The improvements in the attached diff are:

    • 'a' is never shorter than 'b' (result: only 1 loop index
      condition is tested instead of 3)
    • each operation ( & | ^ ) has its own loop, instead of a
      switch inside the loop
    • I found that, when this is done, a lot of things can be
      simplified, resulting in further speedup, and the resulting
      code is not much longer than before (my libpython2.4.dll
      .text got 140 bytes longer).

    Operations on longs of a few thousand bits appear
    to be 2 ... 2.5 times faster with this patch.
    I'm not 100% sure the code is right, but it passes
    test_long.py, anyway.
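The refactoring can be modelled in pure Python (a hypothetical sketch over radix-2^15 digit lists, not the actual longobject.c code): each operator gets its own specialized loop, and 'a' is arranged to be at least as long as 'b'.

```python
# Hypothetical model of the patch's structure: one loop per operator
# over little-endian radix-2**15 digit lists, with len(a) >= len(b).
SHIFT = 15
MASK = (1 << SHIFT) - 1

def to_digits(n):
    """Split a non-negative int into little-endian 15-bit digits."""
    digits = []
    while True:
        digits.append(n & MASK)
        n >>= SHIFT
        if not n:
            return digits

def from_digits(digits):
    n = 0
    for d in reversed(digits):
        n = (n << SHIFT) | d
    return n

def digit_and(a, b):
    # len(a) >= len(b): digits of 'a' beyond len(b) would AND to zero.
    return [x & y for x, y in zip(a, b)]

def digit_or(a, b):
    # len(a) >= len(b): the tail of 'a' is simply copied.
    return [x | y for x, y in zip(a, b)] + a[len(b):]

def bitwise(op, x, y):
    hi, lo = max(x, y), min(x, y)      # guarantees len(a) >= len(b)
    a, b = to_digits(hi), to_digits(lo)
    return from_digits(digit_and(a, b) if op == '&' else digit_or(a, b))
```

Dispatching on the operator once, outside the digit loop, is the point: the per-digit work becomes a tight loop with a single index test.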

    @gregsmith gregsmith mannequin assigned tim-one Dec 18, 2004
    @gregsmith gregsmith mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Dec 18, 2004
    gregsmith mannequin commented Jan 3, 2005


    I originally timed this on a Cygwin system; I've since found
    that Cygwin timings tend to be strange and possibly
    misleading. On a RH8 system, I'm seeing a speedup of x3.5 with
    longs of ~1500 bits and larger, and a x1.5 speedup with only
    about 300 bits. Times were measured with timeit.Timer(
    'a|b', 'a=...; b=...').
    The increase in .text size is likewise about 120 bytes.
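The measurement can be reproduced with something like the following (a sketch; the exact operand values from the original runs are not recorded, so the bit patterns here are assumptions):

```python
# Time 'a | b' for longs of a given bit length, as described above.
import timeit

def time_or(bits, number=1000):
    # Hypothetical operands: an all-ones value and a mixed bit pattern.
    setup = "a = (1 << %d) - 1; b = ((1 << %d) - 1) // 3" % (bits, bits)
    return timeit.timeit('a | b', setup=setup, number=number)

for bits in (300, 1500):
    print(bits, time_or(bits))
```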

    @rhettinger
    Contributor


    Patch Review
    ------------

    On Windows using MSC 6.0, I could only reproduce a small
    speedup at around 300 bits.

    While the patch is short, it adds quite a bit of complexity
    to the routine. Its correctness is not self-evident or
    certain. Even if correct, it is likely to encumber future
    maintenance.

    Unless you have important use cases and feel strongly about
    it, I think this one should probably not go in.

    An alternative is to submit a patch that limits its scope to
    factoring out the innermost switch/case. I tried that and
    found that the speedup is microscopic. I suspect that one
    unpredictable branch is not much of a bottleneck; more
    time is likely spent on creating z.

    gregsmith mannequin commented Feb 11, 2005


    I started by just factoring out the inner switch loop. But then
    it becomes evident that when op == '^', you always have
    maska == maskb, so there's no point in doing the ^mask at all.
    And when op == '|', then maska == maskb == 0. So likewise.
    And if you put in a check so that len(a) >= len(b), then the
    calculation of len_z can be simplified. It also becomes easy
    to break the end off the loops, so that, say, or'ing a small
    number with a really long one becomes mostly a copy, etc.
    It was just a series of small simple changes following
    from the refactoring of the loop/switch.
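The mask cancellation being described can be checked directly in Python; here m stands in for the complement masks in longobject.c (an illustrative sketch, not the patched code):

```python
# When maska == maskb == m, the '^' masks cancel:
#   (a ^ m) ^ (b ^ m) == a ^ b
# and when maska == maskb == 0, the masks are no-ops for '|' too.
a, b, m = 0b1011, 0b0110, 0b1111
assert (a ^ m) ^ (b ^ m) == a ^ b
assert (a ^ 0) | (b ^ 0) == a | b
```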

    I see a repeatable 1.5x speedup at 300 bits, which
    I think is significant (I wasn't using negative numbers, which
    of course have their own extra overhead). The difference
    should be even higher on CPUs that don't have several hundred
    milliwatts of branch-prediction circuitry.

    One use case is that you can simulate an array
    of hundreds or thousands of simple 1-bit processors
    in pure python using long operations, and get very
    good performance, even better with this fix. This app
    involves all logical ops, with the occasional shift.
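That use case can be illustrated with a one-dimensional cellular automaton: each bit of a single Python long is one "processor", and one shift-and-xor updates all of them at once (an illustrative example, not from the original report):

```python
# Rule 90 cellular automaton: every cell becomes the XOR of its two
# neighbours. One long holds the whole row of cells, so a single
# expression updates every "processor" in parallel.
def rule90_step(row, width):
    mask = (1 << width) - 1            # keep the row a fixed width
    return ((row << 1) ^ (row >> 1)) & mask

row = 1 << 32                          # one live cell in a 65-cell row
for _ in range(4):
    row = rule90_step(row, 65)
```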

    IMHO, the changed code is not more complex; it's a little longer,
    but it's more explicit about what is really being done, and it
    doesn't roll together 3 cases, which don't really have that much
    in common, for the sake of brevity. It wasn't obvious to me that
    the masks were redundant until after I did the factoring, and this
    is my point - rolling the cases together hides that. The original
    author may not have noticed the redundancy.

    I see a lot of effort being expended on very complex multiply
    operations; why should the logical ops be left behind for the sake
    of a few lines?

    tim-one commented May 25, 2006


    Assigned to me, changed Category to Performance.

    loewis mannequin commented Dec 11, 2006

    Tim, what is your view on this patch?

    tiran commented Jan 5, 2008

    It's probably useful for Python 3.0 since the old int type is gone.

    @devdanzin devdanzin mannequin added performance Performance or resource usage labels Apr 26, 2009
    gregsmith mannequin commented Oct 24, 2009

    Actually, my view for 3.x is this: I do agree hugely with the 'top
    level' decision to have only one type that handles everything, and
    obviously the easiest way to get there is to just use the existing long
    type. However, though the radix-2^15 implementation of the 'long' type
    was a fine thing for 2.x Python, where it's only used in relatively
    unusual cases, I think it's too inefficient to use as the primary
    integer type, so I'm wondering if there's a plan to replace it with
    something more efficient (thus rendering this particular issue irrelevant).

    I did spend some time delving into python internals years ago, but
    haven't really looked into 3.x, so bear with me. I am aware of the
    following 'speed tweaks' in Python 2.x integers, aren't these lost now?

    (1) integers are all the same size, allocated from a linked-list pool
    instead of from malloc
    (2) the 'add' and 'sub' byte codes do a special check in the interpreter
    core to see if ints are being added, and go directly to PyInt_FromLong
    where possible

    It would seem to me that a more suitable implementation would be to
    store numbers as 32*k-bit 2's complement integers; I've used this in C++
    libraries. Compared to the radix-2^15 sign/magnitude approach, it may
    seem trickier to do carries, but in practice it's not that big a deal. It
    also basically assumes you have a decently fast 64-bit operation to do
    multiplies and divides with, which I think is now reasonable. This
    implementation would be much easier to optimize for the (now) extremely
    common case where numbers fit into 32 bits.
    It could also be possible to store 'wee' integers in a special pool and
    only use malloc for larger ones.
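A minimal Python model of the proposed limb layout (an assumption-laden sketch, not CPython code) shows that carry handling across 32-bit limbs is straightforward:

```python
# Adding two magnitudes stored as little-endian 32-bit limbs,
# propagating the carry limb by limb.
LIMB = 32
BASE = 1 << LIMB

def add_limbs(a, b):
    """a, b: equal-length limb lists. Returns (limbs, carry_out)."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & (BASE - 1))     # low 32 bits of this limb
        carry = s >> LIMB              # 0 or 1 into the next limb
    return out, carry
```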

    I know there's a lot of code in that module, virtually all of which
    would be affected by this. But isn't this type now a big performance
    factor in 3.x?

    pitrou commented Oct 24, 2009

    The type is an important performance factor but most uses of it are for
    small ints (< 2**32 or 2**64), so your approach wouldn't make much of a
    difference.
    Besides, there are already some improvements in the py3k branch (for
    example, longs now use 30-bit "digits" on 64-bit systems).

    I am aware of the
    following 'speed tweaks' in Python 2.x integers, aren't these lost now?

    Yes, they are. As a result, calculations on small ints have become a bit
    slower.

    @mdickinson
    Member

    [Greg]

    It would seem to me that a more suitable implementation would be to
    store numbers as 32*k-bit 2's complement integers; I've used this in
    C++ libraries. Compared to the rad-15 sign/magnitude approach, it may
    seem trickier to do carries but in practice it's not that big a deal.

    Do you know of any publicly available code that takes this approach,
    while remaining simple and portable?

    For 32-bit limbs (on a 32-bit platform, say, with no C99 support and no
    uint64_t available), how *do* you deal with carries? You can write
    portable C89 that does the correct thing, but you then have to cross
    your fingers and hope that the compiler can turn it into sensible
    assembler, and it often can't (I'll post an example of this that I ran
    into just yesterday in a second). And even if your compiler can
    optimize this effectively, what about all the other compilers out there?
    The alternative is to use inline assembler for selected platforms; at
    that point, you lose simplicity.

    The sign-magnitude versus two's complement question is an orthogonal
    issue, I think. I'm not convinced of the value of two's complement. Certainly
    two's complement would be more convenient for the bitwise operations
    under discussion, and probably also works well for addition and
    subtraction. But it's less convenient for negation, multiplication,
    division, power, conversion to and from human-readable strings, etc.
    It's worth noting that GMP uses sign-magnitude, so I can't believe there
    could be much performance gain in moving to two's complement.
    (Actually, I'm not sure I've seen any arbitrary-precision arithmetic
    package that uses two's complement. Has anyone here?)

    Here's the example promised earlier: yesterday I wanted to add two 128-
    bit unsigned integers a and b, each represented as a pair of type
    uint64_t integers, on a 64-bit machine. In portable C99, the code looks
    something like:

    #include <stdint.h>

    /* *sumhigh:*sumlow = ahigh:alow + bhigh:blow */

    void
    add_128(uint64_t *sumhigh, uint64_t *sumlow,
               uint64_t ahigh, uint64_t alow,
               uint64_t bhigh, uint64_t blow)
    {
      alow += blow;
      ahigh += bhigh + (alow < blow);
      *sumlow = alow;
      *sumhigh = ahigh;
    }
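The carry detection here relies on unsigned wraparound: after `alow += blow` wraps mod 2^64, a carry occurred exactly when the sum is smaller than an addend. The same logic can be modelled in Python with explicit masking (an illustrative model of the C above):

```python
# Python model of the C add_128 above, using a 64-bit mask to mimic
# uint64_t wraparound; the carry out of the low half is (alow < blow).
M64 = (1 << 64) - 1

def add_128(ahigh, alow, bhigh, blow):
    alow = (alow + blow) & M64
    ahigh = (ahigh + bhigh + (alow < blow)) & M64
    return ahigh, alow
```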

    Ideally, the compiler would manage to optimize this to a simple 'addq,
    adcq' pair of instructions. But gcc-4.4 -O3, on OS X 10.6/x86_64 gives
    me:

    _add_128:
    LFB0:
    leaq (%r9,%rcx), %rcx
    xorl %eax, %eax
    leaq (%r8,%rdx), %rdx
    pushq %rbp
    LCFI0:
    cmpq %r9, %rcx
    movq %rcx, (%rsi)
    setb %al
    movq %rsp, %rbp
    LCFI1:
    addq %rax, %rdx
    movq %rdx, (%rdi)
    leave
    ret

    (Here it looks like alow and blow are in r9 and rcx, ahigh and bhigh are
    in r8 and rdx, and rsi and rdi are sumlow and sumhigh.)

    How do you write the C code in such a way that gcc produces the right
    result? I eventually gave up and just put the assembler in directly.

    @mdickinson
    Member

    Here's the inline assembly version, for comparison:

    #define SUM2_BIN64(sumhigh, sumlow, ahigh, alow, bhigh, blow)    \
      __asm__ ("addq\t%5, %1\n\t"                                    \
               "adcq\t%3, %0"                                        \
               : "=r" (sumhigh), "=&r" (sumlow)                      \
               : "0" ((uint64_t)(ahigh)), "rm" ((uint64_t)(bhigh)),  \
                 "%1" ((uint64_t)(alow)), "rm" ((uint64_t)(blow))    \
               : "cc")
    
    void
    add_128_asm(uint64_t *sumhigh, uint64_t *sumlow,
               uint64_t ahigh, uint64_t alow,
               uint64_t bhigh, uint64_t blow)
    {
      SUM2_BIN64(ahigh, alow, ahigh, alow, bhigh, blow);
      *sumlow = alow;
      *sumhigh = ahigh;
    }

    And the generated output (again gcc-4.4 with -O3):

    _add_128_asm:
    LFB1:
    pushq %rbp
    LCFI2:
    # 29 "test.c" 1
    addq %r9, %rcx
    adcq %r8, %rdx
    # 0 "" 2
    movq %rsp, %rbp
    LCFI3:
    movq %rcx, (%rsi)
    movq %rdx, (%rdi)
    leave
    ret

    gregsmith mannequin commented Oct 24, 2009

    Antoine, "most uses of it are for small ints (< 2**32 or 2**64)" -
    that's my point exactly: the current long type is quite slow for those
    cases because of the sign-magnitude implementation.

    I don't see a problem with sign-magnitude for old long ints, or for GMP,
    since in those cases the assumption is that you have a fair percentage
    of large numbers; this is not valid in Py3.0, where most are small. Most
    operations are add and subtract, and in each such operation you need to
    look at both signs and decide if you want to really add or subtract;
    if you are subtracting, you then have to do a magnitude test to see
    which way - all of that before you do any actual computation. That's a
    lot of overhead for e.g. 'i -= 1', especially given that the sign and
    length information are rolled together into a single value; you need a
    fair number of tests and conditional branches. With a 2's complement
    implementation you would test for the most common case where both values
    are 1 digit (which would include the value 0) and in that case do an add
    (or subtract) followed by a check for overflow, and that's the whole
    thing. I'm guessing this is 10% - 25% as much time as in the current
    implementation - this is one of those cases where the time for tests and
    setup dominates over the actual computation.
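The fast path being described might look like this in outline (a hypothetical sketch with 32-bit limbs, not a proposed patch):

```python
# Single-limb fast path for addition: if both operands and the sum fit
# in one signed 32-bit limb, one add plus one range check suffices.
LIMB_MIN, LIMB_MAX = -(1 << 31), (1 << 31) - 1

def fast_add(a, b):
    if LIMB_MIN <= a <= LIMB_MAX and LIMB_MIN <= b <= LIMB_MAX:
        s = a + b
        if LIMB_MIN <= s <= LIMB_MAX:
            return s                   # common case: no sign/length juggling
    return a + b                       # stand-in for the multi-limb path
```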

    Mark, what you've written looks fine to me. It would be a bit faster
    with an add-with-carry, but there are no conditional branches in the
    generated code, so it's not bad; it's still faster than it would be if
    you were extracting carries from the msb of each result and masking them
    afterwards. Bear in mind that some RISC processors don't have an
    add-with-carry (lacking condition codes altogether), so the method of
    comparing the sum to the inputs is the only one available. Given that
    in Python the operation is in a loop, I think it's unreasonable to
    expect a compiler to identify an add-with-carry application, but at the
    same time it's not necessary. Now that the long int is the *only* int
    type, multi-digit ops will be relatively uncommon, and you want to focus
    on reducing the overhead before and after the operation.

    (And, for speed freaks who want to use long ints to implement large bit
    arrays and want fast bit ops - my original motivation for this issue -
    it would be possible to use SSE2 vector instructions on specific
    processors. That can also be done with the 15-bit implementation, but it
    works much better with 32.)

    The 'algorithmic C' package
    (http://www.mentor.com/products/esl/high_level_synthesis/ac_datatypes)
    is a C++ template class for arbitrary-length signed/unsigned integers.
    It's not suitable for use in Python because of license issues and
    because its integers are controlled by templates and are fixed-size at
    compile time - but it uses Kx32-bit 2's complement format, and
    implements adds, shifts, multiplies, logic ops, and limited division
    ops. It works very well, with extremely low overhead on small values -
    often the code is exactly the same as if you used int type - that would
    not be possible with a sign/magnitude approach.

    As for multiplies and divides - it's certainly possible to proceed
    through a multiply entirely in 2's complement, so no overhead is needed
    for abs value, or changing sign at the end. For division, it's possible
    if the denominator is > 0, and it may be possible if <0 (but probably
    not worth the hassle).

    I would be happy to do a design doc for this and write some of the inner
    loops, but if (a) it's already being done or (b) there's no chance of it
    being deployed then it would be a waste of time...
    if there was definite interest in it (and reasonable schedule
    expectations) I could write a lot more of it.

    loewis mannequin commented Oct 24, 2009

    I would be happy to do a design doc for this and write some of the inner
    loops, but if (a) it's already being done or (b) there's no chance of it
    being deployed then it would be a waste of time...
    if there was definite interest in it (and reasonable schedule
    expectations) I could write a lot more of it.

    I think the only realistic chance in which something may change is that
    somebody sits down and does all the work, from beginning to end. I doubt
    that a design doc and some inner loops alone will make any difference.

    It's not being done (to my knowledge), for the reason alone that it
    would be a lot of work. It has a chance to be deployed only if a)
    it is significantly faster, for small numbers, b) correct, and c)
    only slightly more complicated than the current code.

    I note that this discussion is off-topic for this bug report, which
    is about your original patch. All we have to decide here is whether to
    accept it (perhaps with modifications) or to reject it.

    @mdickinson
    Member

    Most operations are add and subtract, and in each such operation you
    need to look at both signs, and decide if you want to really add or
    subtract, and if you are subtracting, you then have to do a magnitude
    test to see which way - all of that before you do any actual
    computation. That's a lot of overhead for e.g. 'i -= 1'.

    Hmm. I agree this isn't ideal, and I now see the attraction of
    two's complement. Thanks.

    It would be interesting to see timings from such an approach.
    Maybe one could just implement the basic operations (+, -, *, /)
    to get an idea of whether it's worth considering more seriously.

    @mdickinson
    Member

    I think Greg's patch looks fine, modulo updating it to apply cleanly to
    py3k.

    I couldn't resist tinkering a bit, though: factoring out the complement
    operations (for a, b and z) into a separate function gives (IMO) cleaner
    and more direct code that's free of mask operations.

    Here's the patch.

    loewis mannequin commented Oct 25, 2009

    Mark, if you want to get reviews, it might be useful to upload the patch
    to Rietveld (as the diff is difficult to read). However, I would trust
    you to get it right, anyway, so feel free to go ahead and apply it.

    @mdickinson
    Member

    Applied in r75697 (trunk) and r75698 (py3k).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022