
Pythran may make a function slower? #1753

Closed
charlotte12l opened this issue Apr 7, 2021 · 6 comments
@charlotte12l (Contributor)

I tried to use Pythran to improve this function:

import numpy as np
import math

#pythran export my_kendall_p_exact(int,int)
def my_kendall_p_exact(n, c):
    if n <= 0:
        raise ValueError(f'n ({n:d}) must be positive')
    elif c < 0 or 4*c > n*(n-1):
        raise ValueError(f'c ({c:d}) must satisfy 0 <= 4c <= n(n-1) = {n*(n-1):d}.')
    elif n == 1:
        prob = 1.0
    elif n == 2:
        prob = 1.0
    elif c == 0:
        prob = 2.0/math.factorial(n) if n < 171 else 0.0
    elif c == 1:
        prob = 2.0/math.factorial(n-1) if n < 172 else 0.0
    elif 4*c == n*(n-1):
        prob = 1.0
    elif n < 171:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3,n+1):
            new = np.cumsum(new)
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = 2.0*np.sum(new)/math.factorial(n)
    else:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3, n+1):
            new = np.cumsum(new)/j
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = np.sum(new)

    return np.clip(prob, 0, 1)

and the test function is

from new_kendall_p_exact import my_kendall_p_exact
from numpy.testing import assert_almost_equal

def test_pythran_kendall_p_exact():
    expectations = {(400, 38965): 0.48444283672113314099,
                    (401, 39516): 0.66363159823474837662,
                    (800, 156772): 0.42265448483120932055,
                    (801, 157849): 0.53437553412194416236,
                    (1600, 637472): 0.84200727400323538419,
                    (1601, 630304): 0.34465255088058593946}

    for nc, expected in expectations.items():
        res = my_kendall_p_exact(nc[0], nc[1])
        assert_almost_equal(res, expected)

However, when I compare its performance with the pure Python version, I'm surprised to find it is slower:

>>>%timeit -n 10 -r 3 test_orig_kendall_p_exact()
14.5 s ± 247 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 test_pythran_kendall_p_exact()
16.3 s ± 369 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

Analyzing kendall_p_exact(n, c) with line_profiler showed that np.cumsum consumes about 75% of the total time, so I guessed that the Pythran-accelerated cumsum might be slower than the original np.cumsum, which would explain why test_pythran_kendall_p_exact() is slower than test_orig_kendall_p_exact(). But a small test showed the Pythran cumsum is actually faster than np.cumsum:

>>> a = np.random.random((10000))
>>> %timeit -n 100 -r 3 np.cumsum(a)
56.7 µs ± 11.7 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
>>> %timeit -n 100 -r 3 my_cumsum(a)
46.7 µs ± 7.31 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
where my_cumsum is defined as:

import numpy as np

#pythran export my_cumsum(int64[:])
#pythran export my_cumsum(float64[:])
def my_cumsum(n):
    return np.cumsum(n)
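For completeness, the per-line breakdown can also be approximated without line_profiler by accumulating wall-clock time around each statement of the hot loop. A rough sketch of the n >= 171 branch (profile_kendall_loop is a hypothetical helper, not part of the code above):

```python
import time
import numpy as np

def profile_kendall_loop(n, c):
    """Accumulate wall-clock time spent in the cumsum line versus the
    slice-subtraction line of the n >= 171 branch."""
    t_cumsum = t_slice = 0.0
    new = np.zeros(c + 1)
    new[0:2] = 1.0
    for j in range(3, n + 1):
        t0 = time.perf_counter()
        new = np.cumsum(new) / j
        t_cumsum += time.perf_counter() - t0
        if j <= c:
            t0 = time.perf_counter()
            new[j:] -= new[:c + 1 - j]
            t_slice += time.perf_counter() - t0
    return t_cumsum, t_slice

tc, ts = profile_kendall_loop(400, 38965)
print(f"cumsum: {tc:.4f}s  slice subtraction: {ts:.4f}s")
```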
@serge-sans-paille (Owner)

Thanks @charlotte12l for the detailed report. Your test case is interesting. Although I fail to reproduce your timings (I still get a slight performance improvement with pythran, see below), we can probably do better, and it's great to explore that.

So first, here is how I run the benchmark:

I modified your test function to remove noise:

def t():
    param = (1601, 630304)
    return my_kendall_p_exact(*param)

and run the test through

python -m timeit -s 'from t import t' 't()'

python kernel: 10 loops, best of 3: 4.36 sec per loop
pythran version: 10 loops, best of 3: 4.24 sec per loop

to understand where the pythran version spends time, I use

perf record python -m timeit -s 'from t import t' 't()'
perf report

And then it's some Pythran internals: roughly a third of the time is spent in the cumsum and another third in the NumPy expression new[j:] -= new[:c+1-j], where Pythran could detect (through range analysis) that there's no overlap, so an optimized copy could be used, as hinted in #1737.
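To see why overlap matters for that expression: both slices are views of the same buffer, and NumPy resolves the aliasing internally (buffering an operand when needed), which is exactly the work a compiler could skip once it proves the slices disjoint. A minimal illustration with hypothetical values of c and j:

```python
import numpy as np

# Hypothetical sizes; in the real loop j runs from 3 to n and the
# array has length c + 1.
c, j = 10, 4

a = np.arange(c + 1, dtype=np.float64)
b = a.copy()

# In-place subtraction between two views of the same buffer: NumPy
# detects the overlap and buffers an operand so the result is correct.
a[j:] -= a[:c + 1 - j]

# Same operation with an explicit temporary copy of the read slice.
b[j:] -= b[:c + 1 - j].copy()

print(np.array_equal(a, b))  # the two variants agree
```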

@charlotte12l (Contributor, Author)

Thanks for your reply! I think optimizing array copy will be very useful! I can help test it after you implement it :)

As for this function, I think the speed may vary across machines, inputs, and even across different timeit runs. Here is my full report: sometimes the Pythran version is better, sometimes it is worse.

The good cases:

>>>%timeit -n 10 -r 3 orig_kendall_p_exact(400, 38965)
63.7 ms ± 2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(400, 38965)
62.8 ms ± 2.83 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 orig_kendall_p_exact(401, 39516)
72.9 ms ± 2.96 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(401, 39516)
70.9 ms ± 1.52 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 orig_kendall_p_exact(800, 156772)
584 ms ± 9.92 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(800, 156772)
541 ms ± 12.1 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

The bad cases:

>>>%timeit -n 10 -r 3 orig_kendall_p_exact(801, 157849)
547 ms ± 8.58 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(801, 157849)
601 ms ± 12.4 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 orig_kendall_p_exact(1600, 637472)
6.13 s ± 238 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(1600, 637472)
6.15 s ± 83.9 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 orig_kendall_p_exact(1601, 630304)
6.05 s ± 115 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(1601, 630304)
6.32 s ± 301 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

@serge-sans-paille (Owner)

Can you rerun your test now that #1867 is merged in? It may change the performance of the above code.

@charlotte12l (Contributor, Author)

The timing results are now as follows; it seems the Pythran version is slightly faster than the Python version. Does that meet your expectations?

| (n, c) | Python | Pythran |
| --- | --- | --- |
| (400, 38965) | 0.654699879 (f0) | 0.528766466 (f1) |
| (401, 39516) | 0.624093375 (f2) | 0.53689726 (f3) |
| (800, 156772) | 4.659593703 (f4) | 4.300814603 (f5) |
| (801, 157849) | 4.722684181 (f6) | 4.353008163 (f7) |
| (1600, 637472) | 42.491141752 (f8) | 57.9583183899999 (f9) |
| (1601, 630304) | 59.37831872 (f10) | 57.78442073 (f11) |

For (1600, 637472): when only timing f8 and f9 in the script, the time of f8 stays similar, but f9 becomes much different (about ~43 s rather than ~58 s). I'm not sure why.
# my timing script
import numpy as np
import math

def orig_kendall_p_exact(n, c):
    if n <= 0:
        raise ValueError('n must be positive')
    elif c < 0 or 4*c > n*(n-1):
        raise ValueError('c must satisfy 0 <= 4c <= n(n-1).')
    elif n == 1:
        prob = 1.0
    elif n == 2:
        prob = 1.0
    elif c == 0:
        prob = 2.0/math.factorial(n) if n < 171 else 0.0
    elif c == 1:
        prob = 2.0/math.factorial(n-1) if n < 172 else 0.0
    elif 4*c == n*(n-1):
        prob = 1.0
    elif n < 171:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3,n+1):
            new = np.cumsum(new)
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = 2.0*np.sum(new)/math.factorial(n)
    else:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3, n+1):
            new = np.cumsum(new)/j
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = np.sum(new)

    return np.clip(prob, 0, 1)

from new_kendall_p_exact import my_kendall_p_exact
from timeit import timeit

f0 = lambda: orig_kendall_p_exact(400, 38965)
f1 = lambda: my_kendall_p_exact(400, 38965)

print(timeit(f0,number=10))
print(timeit(f1,number=10))


f2 = lambda: orig_kendall_p_exact(401, 39516)
f3 = lambda: my_kendall_p_exact(401, 39516)

print(timeit(f2,number=10))
print(timeit(f3,number=10))


f4 = lambda: orig_kendall_p_exact(800, 156772)
f5 = lambda: my_kendall_p_exact(800, 156772)

print(timeit(f4,number=10))
print(timeit(f5,number=10))


f6 = lambda: orig_kendall_p_exact(801, 157849)
f7 = lambda: my_kendall_p_exact(801, 157849)

print(timeit(f6,number=10))
print(timeit(f7,number=10))

f8 = lambda: orig_kendall_p_exact(1600, 637472)
f9 = lambda: my_kendall_p_exact(1600, 637472)

print(timeit(f8,number=10))
print(timeit(f9,number=10))

f10 = lambda: orig_kendall_p_exact(1601, 630304)
f11 = lambda: my_kendall_p_exact(1601, 630304)

print(timeit(f10,number=10))
print(timeit(f11,number=10))

The Pythran function:

# new_kendall_p_exact.py
import numpy as np
import math

#pythran export my_kendall_p_exact(int,int)
def my_kendall_p_exact(n, c):
    if n <= 0:
        raise ValueError('n must be positive')
    elif c < 0 or 4*c > n*(n-1):
        raise ValueError('c must satisfy 0 <= 4c <= n(n-1).')
    elif n == 1:
        prob = 1.0
    elif n == 2:
        prob = 1.0
    elif c == 0:
        prob = 2.0/math.factorial(n) if n < 171 else 0.0
    elif c == 1:
        prob = 2.0/math.factorial(n-1) if n < 172 else 0.0
    elif 4*c == n*(n-1):
        prob = 1.0
    elif n < 171:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3,n+1):
            new = np.cumsum(new)
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = 2.0*np.sum(new)/math.factorial(n)
    else:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3, n+1):
            new = np.cumsum(new)/j
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = np.sum(new)

    return np.clip(prob, 0, 1)

@serge-sans-paille (Owner)

Yeah, as long as we no longer have a slowdown, the bug can be considered fixed. Let's discuss the profitability of swapping implementations in a SciPy ticket.

@serge-sans-paille (Owner)

Addendum: considering f8/f9, there may be cache effects; it's probably better to test each function independently.
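One way to rule out such cross-run interference is to time each case in a fresh interpreter, e.g. by launching `python -m timeit` in a subprocess per function (a sketch; time_in_fresh_process is a hypothetical helper, and the workload shown is a placeholder, not the kendall functions):

```python
import subprocess
import sys

def time_in_fresh_process(setup: str, stmt: str) -> str:
    """Run one `python -m timeit` measurement in its own interpreter,
    so successive measurements cannot share warmed caches or allocator
    state."""
    result = subprocess.run(
        [sys.executable, "-m", "timeit", "-s", setup, stmt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Placeholder workload; for the script above one would use e.g.
# setup="from t import t", stmt="t()" for each case in turn.
print(time_in_fresh_process("xs = list(range(1000))", "sum(xs)"))
```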
