
Pythran may make a function slower? #1753

Closed
charlotte12l opened this issue Apr 7, 2021 · 6 comments
@charlotte12l (Contributor)

I tried to use Pythran to improve this function:

import numpy as np
import math

#pythran export my_kendall_p_exact(int,int)
def my_kendall_p_exact(n, c):
    if n <= 0:
        raise ValueError(f'n ({n:d}) must be positive')
    elif c < 0 or 4*c > n*(n-1):
        raise ValueError(f'c ({c:d}) must satisfy 0 <= 4c <= n(n-1) = {n*(n-1):d}.')
    elif n == 1:
        prob = 1.0
    elif n == 2:
        prob = 1.0
    elif c == 0:
        prob = 2.0/math.factorial(n) if n < 171 else 0.0
    elif c == 1:
        prob = 2.0/math.factorial(n-1) if n < 172 else 0.0
    elif 4*c == n*(n-1):
        prob = 1.0
    elif n < 171:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3,n+1):
            new = np.cumsum(new)
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = 2.0*np.sum(new)/math.factorial(n)
    else:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3, n+1):
            new = np.cumsum(new)/j
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = np.sum(new)

    return np.clip(prob, 0, 1)

and the test function is

from new_kendall_p_exact import my_kendall_p_exact
from numpy.testing import assert_almost_equal

def test_pythran_kendall_p_exact():
    expectations = {(400, 38965): 0.48444283672113314099,
                    (401, 39516): 0.66363159823474837662,
                    (800, 156772): 0.42265448483120932055,
                    (801, 157849): 0.53437553412194416236,
                    (1600, 637472): 0.84200727400323538419,
                    (1601, 630304): 0.34465255088058593946}

    for nc, expected in expectations.items():
        res = my_kendall_p_exact(nc[0], nc[1])
        assert_almost_equal(res, expected)

However, when I compare its performance with the pure Python version, I'm surprised to find it is slower:

>>>%timeit -n 10 -r 3 test_orig_kendall_p_exact()
14.5 s ± 247 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 test_pythran_kendall_p_exact()
16.3 s ± 369 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

Analyzing kendall_p_exact(n, c) with line_profiler showed that np.cumsum consumes about 75% of the total time, so I guessed that the Pythran-accelerated cumsum might be slower than the original np.cumsum, which would explain why test_pythran_kendall_p_exact() is slower than test_orig_kendall_p_exact(). But a small test showed the Pythran cumsum is actually faster than np.cumsum:

>>> a = np.random.random((10000))
>>> %timeit -n 100 -r 3 np.cumsum(a)
56.7 µs ± 11.7 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
>>> %timeit -n 100 -r 3 my_cumsum(a)
46.7 µs ± 7.31 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
where my_cumsum is defined as:

import numpy as np

#pythran export my_cumsum(int64[:])
#pythran export my_cumsum(float64[:])
def my_cumsum(n):
    return np.cumsum(n)
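For completeness, the per-line breakdown can also be approximated without line_profiler by accumulating wall-clock time around each statement of the hot loop. A rough sketch of the n >= 171 branch (profile_kendall_loop is a hypothetical helper, not part of the code above):

```python
import time
import numpy as np

def profile_kendall_loop(n, c):
    """Accumulate wall-clock time spent in the cumsum line versus the
    slice-subtraction line of the n >= 171 branch."""
    t_cumsum = t_slice = 0.0
    new = np.zeros(c + 1)
    new[0:2] = 1.0
    for j in range(3, n + 1):
        t0 = time.perf_counter()
        new = np.cumsum(new) / j
        t_cumsum += time.perf_counter() - t0
        if j <= c:
            t0 = time.perf_counter()
            new[j:] -= new[:c + 1 - j]
            t_slice += time.perf_counter() - t0
    return t_cumsum, t_slice

tc, ts = profile_kendall_loop(400, 38965)
print(f"cumsum: {tc:.4f}s  slice subtraction: {ts:.4f}s")
```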
@serge-sans-paille (Owner)

Thanks @charlotte12l for the detailed report. Your test case is interesting. Although I fail to reproduce your timings (I still get a slight performance improvement with pythran, see below), we can probably do better, and it's great to explore that.

So first, here is how I run the benchmark:

I modified your test function to remove noise:

def t():
    param = (1601, 630304)
    return my_kendall_p_exact(*param)

and run the test through

python -m timeit -s 'from t import t' 't()'

python kernel: 10 loops, best of 3: 4.36 sec per loop
pythran version: 10 loops, best of 3: 4.24 sec per loop

to understand where the pythran version spends time, I use

perf record python -m timeit -s 'from t import t' 't()'
perf report

And then it's some Pythran internals: roughly a third of the time is spent in the cumsum and another third in the NumPy expression new[j:] -= new[:c+1-j], where Pythran could detect (through range analysis) that there's no overlap, so an optimized copy could be used, as hinted in #1737.
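To see why overlap matters for that expression: both slices are views of the same buffer, and NumPy resolves the aliasing internally (buffering an operand when needed), which is exactly the work a compiler could skip once it proves the slices disjoint. A minimal illustration with hypothetical values of c and j:

```python
import numpy as np

# Hypothetical sizes; in the real loop j runs from 3 to n and the
# array has length c + 1.
c, j = 10, 4

a = np.arange(c + 1, dtype=np.float64)
b = a.copy()

# In-place subtraction between two views of the same buffer: NumPy
# detects the overlap and buffers an operand so the result is correct.
a[j:] -= a[:c + 1 - j]

# Same operation with an explicit temporary copy of the read slice.
b[j:] -= b[:c + 1 - j].copy()

print(np.array_equal(a, b))  # the two variants agree
```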

@charlotte12l (Contributor, Author)

Thanks for your reply! I think optimizing array copy will be very useful! I can help test it after you implement it :)

As for this function, I think the speed may vary across machines, inputs, and even across different timeit runs. Here is my full report: sometimes the Pythran version is better, sometimes it is worse.

The good cases:

>>>%timeit -n 10 -r 3 orig_kendall_p_exact(400, 38965)
63.7 ms ± 2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(400, 38965)
62.8 ms ± 2.83 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 orig_kendall_p_exact(401, 39516)
72.9 ms ± 2.96 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(401, 39516)
70.9 ms ± 1.52 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 orig_kendall_p_exact(800, 156772)
584 ms ± 9.92 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(800, 156772)
541 ms ± 12.1 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

The bad cases:

>>>%timeit -n 10 -r 3 orig_kendall_p_exact(801, 157849)
547 ms ± 8.58 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(801, 157849)
601 ms ± 12.4 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 orig_kendall_p_exact(1600, 637472)
6.13 s ± 238 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(1600, 637472)
6.15 s ± 83.9 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 orig_kendall_p_exact(1601, 630304)
6.05 s ± 115 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
>>>%timeit -n 10 -r 3 my_kendall_p_exact(1601, 630304)
6.32 s ± 301 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

@serge-sans-paille (Owner)

Can you rerun your test now that #1867 is merged in? It may change the performance of the above code.

@charlotte12l (Contributor, Author)

The timing results are now as follows; it seems the Pythran version is slightly faster than the Python version. Does that meet your expectations?

| (n, c) | Python | Pythran |
| --- | --- | --- |
| (400, 38965) | 0.654699879 (f0) | 0.528766466 (f1) |
| (401, 39516) | 0.624093375 (f2) | 0.53689726 (f3) |
| (800, 156772) | 4.659593703 (f4) | 4.300814603 (f5) |
| (801, 157849) | 4.722684181 (f6) | 4.353008163 (f7) |
| (1600, 637472) | 42.491141752 (f8) | 57.9583183899999 (f9) |
| (1601, 630304) | 59.37831872 (f10) | 57.78442073 (f11) |

For (1600, 637472): when only timing f8 and f9 in the script, the time of f8 stays similar, but f9 becomes much different (about ~43 s rather than ~58 s). I'm not sure why.
# my timing script
import numpy as np
import math

def orig_kendall_p_exact(n, c):
    if n <= 0:
        raise ValueError('n must be positive')
    elif c < 0 or 4*c > n*(n-1):
        raise ValueError('c must satisfy 0 <= 4c <= n(n-1).')
    elif n == 1:
        prob = 1.0
    elif n == 2:
        prob = 1.0
    elif c == 0:
        prob = 2.0/math.factorial(n) if n < 171 else 0.0
    elif c == 1:
        prob = 2.0/math.factorial(n-1) if n < 172 else 0.0
    elif 4*c == n*(n-1):
        prob = 1.0
    elif n < 171:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3,n+1):
            new = np.cumsum(new)
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = 2.0*np.sum(new)/math.factorial(n)
    else:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3, n+1):
            new = np.cumsum(new)/j
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = np.sum(new)

    return np.clip(prob, 0, 1)

from new_kendall_p_exact import my_kendall_p_exact
from timeit import timeit

f0 = lambda: orig_kendall_p_exact(400, 38965)
f1 = lambda: my_kendall_p_exact(400, 38965)

print(timeit(f0,number=10))
print(timeit(f1,number=10))


f2 = lambda: orig_kendall_p_exact(401, 39516)
f3 = lambda: my_kendall_p_exact(401, 39516)

print(timeit(f2,number=10))
print(timeit(f3,number=10))


f4 = lambda: orig_kendall_p_exact(800, 156772)
f5 = lambda: my_kendall_p_exact(800, 156772)

print(timeit(f4,number=10))
print(timeit(f5,number=10))


f6 = lambda: orig_kendall_p_exact(801, 157849)
f7 = lambda: my_kendall_p_exact(801, 157849)

print(timeit(f6,number=10))
print(timeit(f7,number=10))

f8 = lambda: orig_kendall_p_exact(1600, 637472)
f9 = lambda: my_kendall_p_exact(1600, 637472)

print(timeit(f8,number=10))
print(timeit(f9,number=10))

f10 = lambda: orig_kendall_p_exact(1601, 630304)
f11 = lambda: my_kendall_p_exact(1601, 630304)

print(timeit(f10,number=10))
print(timeit(f11,number=10))

The Pythran function:

# new_kendall_p_exact.py
import numpy as np
import math

#pythran export my_kendall_p_exact(int,int)
def my_kendall_p_exact(n, c):
    if n <= 0:
        raise ValueError('n must be positive')
    elif c < 0 or 4*c > n*(n-1):
        raise ValueError('c must satisfy 0 <= 4c <= n(n-1).')
    elif n == 1:
        prob = 1.0
    elif n == 2:
        prob = 1.0
    elif c == 0:
        prob = 2.0/math.factorial(n) if n < 171 else 0.0
    elif c == 1:
        prob = 2.0/math.factorial(n-1) if n < 172 else 0.0
    elif 4*c == n*(n-1):
        prob = 1.0
    elif n < 171:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3,n+1):
            new = np.cumsum(new)
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = 2.0*np.sum(new)/math.factorial(n)
    else:
        new = np.zeros(c+1)
        new[0:2] = 1.0
        for j in range(3, n+1):
            new = np.cumsum(new)/j
            if j <= c:
                new[j:] -= new[:c+1-j]
        prob = np.sum(new)

    return np.clip(prob, 0, 1)

@serge-sans-paille (Owner)

Yeah, as long as we no longer have a slowdown, the bug can be considered fixed. Let's discuss the profitability of swapping implementations in a SciPy ticket.

@serge-sans-paille (Owner)

Addendum: considering f8/f9, there may be cache effects; it's probably better to test each function independently.
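One way to rule out such cross-run interference is to time each case in a fresh interpreter, e.g. by launching `python -m timeit` in a subprocess per function (a sketch; time_in_fresh_process is a hypothetical helper, and the workload shown is a placeholder, not the kendall functions):

```python
import subprocess
import sys

def time_in_fresh_process(setup: str, stmt: str) -> str:
    """Run one `python -m timeit` measurement in its own interpreter,
    so successive measurements cannot share warmed caches or allocator
    state."""
    result = subprocess.run(
        [sys.executable, "-m", "timeit", "-s", setup, stmt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Placeholder workload; for the script above one would use e.g.
# setup="from t import t", stmt="t()" for each case in turn.
print(time_in_fresh_process("xs = list(range(1000))", "sum(xs)"))
```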
