ENH: add stirling2 function to scipy.special
#18103
Conversation
It looks like the Windows builds are failing:
I suspect this is because I'm using
Also note this failure:
You can fix it by adding the new file to the listing in
Got it, removed.
Done in commit
scipy/special/cephes/stirling.c
Outdated
```c
int arraySize = n + 1;
int prev[arraySize];
int curr[arraySize];
```
Just a quick comment; I won't have time to review the math until the weekend.

We should `calloc` these instead of declaring them on the stack (making sure to call `free` before returning). VLAs are part of the C99 standard, but became optional as of C11. I checked the toolchain roadmap, and it looks like we shouldn't be using optional features like this.
Also, by declaring them on the stack, we can blow up the stack and get a segfault even with sane input. On my 64-bit Linux machine I found that

```python
import scipy.special as sc
sc.stirling2(2000000, 1999999)
# Segmentation fault (core dumped)
```
> Also, by declaring them on the stack, we can blow up the stack and get a segfault even with sane input. On my 64-bit Linux machine I found that `stirling2(2000000, 1999999)` gives `Segmentation fault (core dumped)`.
Ok. FWIW the calculation will overflow long before this point, so maybe it makes sense to raise an error if n > X, where X is the value at which overflow first occurs (17 on my machine, with `int` being 32-bit, IIRC)?
> We should calloc these instead of declaring them on the stack (making sure to call free before returning). VLAs are part of the C99 standard, but became optional as of C11. I checked the toolchain roadmap and it looks like we shouldn't be using optional features like this.
Hmm, I'm somewhat confused by the toolchain roadmap... I'll look at the `calloc` implementation.
@steppi Thanks very much for your comments
I switched to heap memory (calling `free` before returning) per @steppi's comments. I ran the new version through valgrind locally, so I think everything should be good from a memory-leak perspective AFAICS. Still not sure what we want to do about the overflow issue (large n or k input), though?
FWIW it looks like the failing Windows builds are the same issue for which Tyler opened #18108.
Actually, I'm pretty frightened by having all these allocations in the inner loop of a ufunc. This Stack Exchange answer seems to give an efficient solution based on the inclusion-exclusion formula.
I had initially looked into IE (see the issue). In practice IE doesn't seem to be faster, because authors ignore the cost of computing the binomial coefficients: one is needed for each entry from i=1 to k. As for the memory allocations, it's really only one per call, which doesn't seem too bad to me. The alternative is to do everything in pure Python and compute up through an array. We are duplicating work in the ufunc approach, though, if an array contains two of the same input.
The IE-based implementation from this gist recomputes the binomial coefficient from scratch at each iteration instead of using the recurrence from the above link to update it with a few arithmetic operations at each step. The algorithm from the CS Stack Exchange link uses fewer overall operations than the allocation-heavy approach.
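For concreteness, here's a sketch of the IE formula with the binomial coefficient updated by the one-step recurrence rather than recomputed each iteration. It uses exact Python integers, so overflow is not a concern here; `stirling2_ie` is just an illustrative name, not anything in the PR.

```python
import math

def stirling2_ie(n, k):
    # S(n, k) = (1/k!) * sum_{j=0}^{k-1} (-1)^j * C(k, j) * (k - j)^n,
    # with C(k, j) updated in O(1) per term via
    # C(k, j+1) = C(k, j) * (k - j) / (j + 1), instead of being
    # recomputed from scratch at each iteration.
    if n == 0 and k == 0:
        return 1
    if k <= 0 or k > n:
        return 0
    total = 0
    binom = 1  # C(k, 0)
    for j in range(k):  # the j = k term vanishes since 0^n = 0 for n > 0
        total += (-1)**j * binom * (k - j)**n
        binom = binom * (k - j) // (j + 1)  # -> C(k, j+1), exact division
    return total // math.factorial(k)  # the sum is exactly divisible by k!
```

With big ints the division by k! is exact, so this is handy as a reference implementation for testing the C code.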
I've come up with a plan that might work.
OK, so about the size of the integer type: the expression won't overflow when using 64-bit ints, but would overflow when using 32-bit ints. We will want to support 64-bit. The current special infrastructure allows types `int` and `long`. On 64-bit Linux a `long` is 64 bits, but on Windows it's still 32 bits. We could support both.

I still think we should be using the fast IE-based algorithm from here, not the allocation-based one. If one wants to compute many Stirling numbers, the allocation-based one will duplicate too much effort, so it isn't very efficient actually.

For overflow, I think we can check for it inside the compiled function and return a sentinel value.

Thoughts?
Oh, I see. Even when speeding up the binomial coefficient calculation, the IE formula is still unworkable because you need to divide by the factorial at the end, and the numerator and denominator will both overflow even for reasonable input.
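Concretely (the particular n and k below are chosen just for illustration): in double precision, a power term like k^n from the IE numerator, and k! in the denominator, both leave the representable range (~1.8e308) well within the regime this function should support.

```python
import math

# A numerator term like k**n overflows a double for moderate n:
try:
    20.0 ** 300
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True

# ... and k! itself exceeds the double range once k > 170:
print(math.factorial(170) < 2**1024)   # True: 170! (~7.3e306) still fits in a double
print(math.factorial(171) > 2**1024)   # True: 171! cannot be stored in a double
```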
I'm open to other opinions on this, but I think we should make the return type `long` instead of `int`, so we can handle larger return values on platforms where `long` is 64-bit.
Also, we don't actually need to compute the full triangle, so we can save some effort and memory. There's a way to set it up so only the necessary values are computed; see the relevant comment. I've posted a pure Python implementation you can refer to.
I'm open to more suggestions on this too, but I think it would be good to return `-1` in case of overflow or failure to allocate memory. Since the actual result should always be non-negative, this would signal to users that the value is incorrect. We'd just have to document this behavior well.
scipy/special/cephes/stirling.c
Outdated
```c
for (int i=1; i<arraySize; i++){
    // build up next row in curr
    for (int j=1; j<arraySize; j++){
        curr[j] = prev[j - 1];
        curr[j] += prev[j] * j;
    }
```
We're doing too much work here and allocating more than we need to. To compute S(n, k) we don't need the full triangle.

If k ≤ n − k + 1, we can allocate two arrays of size k and compute the values in the parallelogram left to right and then top to bottom. If k > n − k + 1, we can allocate two arrays of size n − k + 1 and compute the values top to bottom and then left to right.

This also has the benefit of making it so that if an overflow occurs at any time during execution, then the final result will overflow. If we compute the full triangle this won't be the case. We need to be able to detect overflow, and this will simplify that.
```python
import numpy as np

def stirling2(n, k):
    if k <= 0 or k > n or n < 0:
        if n == 0 and k == 0:
            return 1
        return 0
    if k <= n - k + 1:
        current = [1]*k
        for i in range(1, n - k + 1):
            for j in range(1, k):
                current[j] = (j+1) * current[j] + current[j-1]
    else:
        current = [1]*(n - k + 1)
        for i in range(1, k):
            for j in range(1, n - k + 1):
                current[j] = (i+1) * current[j-1] + current[j]
    return current[-1]
```
@rlucas7 Here's a Python implementation that only computes the values in the triangle that are actually needed. It should be relatively straightforward to translate into C. Note that we will want `malloc` instead of `calloc` now, because we no longer initialize with zero entries.
I've just updated the implementation above to only use one array instead of two.
> We're doing too much work here and allocating more than we need to. To compute S(n, k) we don't need the full triangle. ... If k ≤ n − k + 1 we can allocate two arrays of size k and compute the values in the parallelogram left to right and then top to bottom. If k > n − k + 1 we can allocate two arrays of size n − k + 1 and compute the values top to bottom and then left to right.
I understand this as taking the LHS or the top (angular) array of `1`s in the Stirling triangle and proceeding to do the calculations these ways, along the angled movement of a row.

Clever.
> This also has the benefit of making it so that if an overflow occurs at any time during execution, then the final result will overflow.
I agree.
> If we compute the full triangle this won't be the case. We need to be able to detect overflow, and this will simplify that.
I don't follow the reasoning of these last two sentences. It seems to me that in either case we need to check for overflow. Unless I'm mistaken, the same n, k values overflow in either case (using array-based dynamic programs). Did I miss something?
> I don't follow the reasoning of these last 2 sentences. It seems to me that either case it's relatively straightforward to check. Unless I'm mistaken the same n,k values overflow in either case (using array based dynamic programs). Did I miss something?
I just mean that in the reduced schedule, where we only compute Stirling numbers over the parallelogram, we just need to check whether overflow occurs, since every value computed will take part in the final result. If we do the whole triangle up to
... it's just that we need to check for overflow and also that the value will be used in the final result. I prefer the simpler condition where we just have to check for overflow.

Got it, thanks.
FWIW, after switching to `long` outputs in C (on my Mac), the overflow bumps up from first happening at 17-18 to 25-26:
```
stirling2(25,0) = 0
stirling2(25,1) = 1
stirling2(25,2) = 16777215
stirling2(25,3) = 141197991025
stirling2(25,4) = 46771289738810
stirling2(25,5) = 2436684974110751
stirling2(25,6) = 37026417000002430
stirling2(25,7) = 227832482998716310
stirling2(25,8) = 690223721118368580
stirling2(25,9) = 1167921451092973005
stirling2(25,10) = 1203163392175387500
stirling2(25,11) = 802355904438462660
stirling2(25,12) = 362262620784874680
stirling2(25,13) = 114485073343744260
stirling2(25,14) = 25958110360896000
stirling2(25,15) = 4299394655347200
stirling2(25,16) = 526655161695960
stirling2(25,17) = 48063331393110
stirling2(25,18) = 3275678594925
stirling2(25,19) = 166218969675
stirling2(25,20) = 6220194750
stirling2(25,21) = 168519505
stirling2(25,22) = 3200450
stirling2(25,23) = 40250
stirling2(25,24) = 300
stirling2(25,25) = 1
stirling2(25,26) = 0
stirling2(26,0) = 0
stirling2(26,1) = 1
stirling2(26,2) = 33554431
stirling2(26,3) = 423610750290
stirling2(26,4) = 187226356946265
stirling2(26,5) = 12230196160292565
stirling2(26,6) = 224595186974125331
stirling2(26,7) = 1631853797991016600
stirling2(26,8) = 5749622251945664950
stirling2(26,9) = -7245227292754425991
stirling2(26,10) = -5247188700862703611
stirling2(26,11) = -8417665732711074856
stirling2(26,12) = 5149507353856958820
stirling2(26,13) = 1850568574253550060
stirling2(26,14) = 477898618396288260
stirling2(26,15) = 90449030191104000
stirling2(26,16) = 12725877242482560
stirling2(26,17) = 1343731795378830
stirling2(26,18) = 107025546101760
stirling2(26,19) = 6433839018750
stirling2(26,20) = 290622864675
stirling2(26,21) = 9759104355
stirling2(26,22) = 238929405
stirling2(26,23) = 4126200
stirling2(26,24) = 47450
stirling2(26,25) = 325
stirling2(26,26) = 1
stirling2(26,27) = 0
```
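As a sanity check with Python's arbitrary-precision integers (`stirling2_row` here is a throwaway helper, not the scipy function), the negative entries above appear exactly where the exact values leave the signed 64-bit range and wrap modulo 2^64:

```python
def stirling2_row(n):
    # Build row n of the Stirling triangle with the recurrence
    # S(n, k) = k*S(n-1, k) + S(n-1, k-1), using Python's big ints.
    row = [1]  # row 0: S(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            above = row[k] if k < len(row) else 0
            new[k] = k * above + row[k - 1]
        row = new
    return row

INT64_MAX = 2**63 - 1
row25, row26 = stirling2_row(25), stirling2_row(26)
print(max(row25) <= INT64_MAX)  # True: all of row 25 fits in a 64-bit signed long
print(row26[9] > INT64_MAX)     # True: S(26, 9) overflows
print(row26[9] - 2**64)         # -7245227292754425991, the wrapped value printed above
```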
Yes. I noticed this too, the
@steppi The latest commit has your memory optimization, using only one array. I'll think a bit more about the overflow checking. Also, to disambiguate the potential (albeit unlikely) malloc error, I'm thinking of returning a
Overflow of signed integers in C leads to undefined behavior. See the suggestion for how to check properly. I also suggested a long comment to explain the reduced version of the dynamic programming algorithm. Let me know what you think of it.

Also, can you add some tests that check that overflow is being handled properly?
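To make the "check properly" idea concrete, here's a Python sketch that mirrors what the guarded C loop would do with a 64-bit `long`: test each multiply-add against the maximum value *before* performing it, and return `-1` (the proposed sentinel) on overflow. The function name is illustrative, and the two branches follow the reduced parallelogram schedule discussed above.

```python
INT64_MAX = 2**63 - 1

def stirling2_checked(n, k):
    # Guard before each (c)*a + b step: a > (MAX - b) // c  iff  c*a + b > MAX.
    if k <= 0 or k > n or n < 0:
        return 1 if n == 0 and k == 0 else 0
    if k <= n - k + 1:
        current = [1] * k
        for i in range(1, n - k + 1):
            for j in range(1, k):
                if current[j] > (INT64_MAX - current[j - 1]) // (j + 1):
                    return -1  # (j+1)*current[j] + current[j-1] would overflow
                current[j] = (j + 1) * current[j] + current[j - 1]
    else:
        current = [1] * (n - k + 1)
        for i in range(1, k):
            for j in range(1, n - k + 1):
                if current[j - 1] > (INT64_MAX - current[j]) // (i + 1):
                    return -1  # (i+1)*current[j-1] + current[j] would overflow
                current[j] = (i + 1) * current[j - 1] + current[j]
    return current[-1]
```

Because every value computed in the reduced schedule feeds the final result, the guard only fires when the answer genuinely cannot be represented, which is exactly the property discussed earlier.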
@rlucas7, I thought it would be good to also support arbitrarily large ints using a pure Python implementation. I've submitted a nested PR rlucas7#1 to make
I implemented the first and second order asymptotic approximations from Temme in Python just to see how well they work. I've been having fun with this. With the second order approximation the relative error doesn't seem terrible (on the order of ...). To implement this, we can use `lambertw`:

```python
import numpy as np
import scipy.special as sc

def stirling2_asymptotic(n, k):
    mu = k/n
    k, n = float(k), float(n)
    def f(x):
        return mu * x - 1 + np.exp(-x)
    def df(x):
        return mu - np.exp(-x)
    def d2f(x):
        return np.exp(-x)
    delta = 1/mu * np.exp(-1/mu)
    x0 = 1/mu + sc.lambertw(-delta)
    t0 = (n - k) / k
    F = np.sqrt(t0/((1 + t0)*(x0 - t0)))
    A = -n*np.log(x0) + k*np.log(np.exp(x0) - 1) - k*t0 + (n - k)*np.log(t0)
    try:
        result = np.exp(A)*k**(n - k)*F*sc.binom(n, k)
    except OverflowError:
        result = np.inf
    return result

def stirling2_asymptotic_order2(n, k):
    mu = k/n
    k, n = float(k), float(n)
    def f(x):
        return mu * x - 1 + np.exp(-x)
    def df(x):
        return mu - np.exp(-x)
    def d2f(x):
        return np.exp(-x)
    delta = 1/mu * np.exp(-1/mu)
    x0 = 1/mu + sc.lambertw(-delta)
    t0 = (n - k) / k
    F = np.sqrt(t0/((1 + t0)*(x0 - t0)))
    A = -n*np.log(x0) + k*np.log(np.exp(x0) - 1) - k*t0 + (n - k)*np.log(t0)
    F1 = (
        -2*x0**3 + 2*t0**5 + 4*t0**3 + 4*t0**4 + 3*x0**2 * t0 - 6*x0*t0**4
        - 5*x0**2 * t0**2 + 2*x0**4 * t0 + x0**3*t0 - 6*x0**3 * t0**2
        + 8*x0**2 * t0**3
    ) / (24*F*(1 + t0)**2*(x0 - t0)**4)
    try:
        result = np.exp(A)*k**(n - k)*sc.binom(n, k)*(F - F1/k)
    except OverflowError:
        result = np.inf
    return result

for n, k in ((100, 20), (112, 45), (200, 33)):
    exact = sc.stirling2(n, k)
    asymptotic = stirling2_asymptotic(n, k)
    asymptotic2 = stirling2_asymptotic_order2(n, k)
    rel_error = abs(float(exact) - asymptotic) / abs(float(exact))
    order2_rel_error = abs(float(exact) - asymptotic2) / abs(float(exact))
    print("-------------------")
    print(f"stirling2({n}, {k})")
    print("--------------------")
    print("rounded exact val:", float(exact))
    print("asymptotic:", asymptotic)
    print("relative error:", rel_error)
    print("asymptotic order2:", asymptotic2)
    print("relative error order2:", order2_rel_error)
```
The output is:
IIRC the docstring does need to be raw style.
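To illustrate why (the snippet is generic, not the actual docstring): in a non-raw string, sequences like `\a` in `\atop` and `\r` in `\right` are interpreted as escape characters, so the LaTeX brace notation never reaches Sphinx intact.

```python
# Non-raw string: "\a" becomes a BEL character and "\r" a carriage return.
plain = ":math:`\left\{ {n \atop k} \right\}`"
# Raw string: every backslash is kept literal, as Sphinx needs.
raw = r":math:`\left\{ {n \atop k} \right\}`"

print(plain == raw)   # False: the escapes mangled the markup
print("\r" in plain)  # True: \right lost its backslash to a carriage return
print("\r" in raw)    # False
```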
Docs failures have been resolved. The arpack test,
Hopefully this can help to get started: #18736 :)
@steppi I worked through Temme's paper this weekend and reproduced the results in that paper. In particular, his equation 4.4 on page 10 matches what you have here in the details code dropdown. I reproduced the max error values with my own implementation (I missed the code in the dropdown tab in your comment), so I independently reproduced both the code and the max error values Temme notes for n=10, 20, etc. for the second order approximation. I think that's what we should use here, unless you've got a more accurate one. The choice of threshold comes down to the (max rel error, K that achieves max error, N) values ... so somewhere between 275 and 300 seems OK as a threshold. Incidentally, if you want me to look at a PR for the Temme approximation, just drop me an email once it's up; I'm comfortable reviewing a PR on that now. Here is the code I used to generate the values noted here:

```python
import numpy as np
from scipy import optimize
from math import sqrt, comb

## calculated via the exact method using the dp soln above
## (`stirling2` is the exact implementation from this PR)...
# setting n=10 we can verify that the max relative error occurs @ m=4
# w/value = 0.000466 (rounds to the value given in the Temme paper)
# also n=20 verifies the max relative error in the paper @ m=3
# w/value = 0.000116 (rounds to 0.00012 in the paper)
n = 275
actual = stirling2(n, range(n+1))
m2 = m1 = (-np.inf, -1)
for m in range(1, n):
    # calculate t0 as described on page 235 of Temme;
    # this is noted in the writing just after equation 2.6 on page 235
    t0 = (n-m)/m
    def func(x):
        return (m/n)*x - 1 + np.exp(-x)
    # calculate x0 by solving the root finding problem, equation 2.5 in Temme
    res = optimize.root(func, n/m, method='broyden1', tol=1e-14)
    x0 = res.x.take(0)
    def phi(x):
        return -n * np.log(x) + m*np.log(np.exp(x) - 1)
    A = phi(x0) - m*t0 + (n-m)*np.log(t0)
    # typically f() will be called w/argument t0
    def f(x):
        return sqrt(x / ((1+x) * (x0-x)))
    # needed for the second order approx
    def f1(x):
        num = (-2*x0**3 + 2*x**5 + 4*x**4 + 4*x**3 + 3*x*x0**2 - 6*x0*x**4
               - 5*x0**2*x**2 + 2*x0**4*x + x0**3*x - 6*x0**3*x**2
               + 8*x0**2*x**3)
        denom = 24*f(x)*(1+x)**2*(x0-x)**4
        return num/denom
    # first order approx gives
    first = np.exp(A)*m**(n-m)*f(t0)*comb(n,m)
    # second order approx gives
    second = np.exp(A)*m**(n-m)*(f(t0)-f1(t0)/m)*comb(n,m)
    rel_err1 = (actual[m] - first) / first
    rel_err2 = (actual[m] - second) / second
    if abs(rel_err1) > m1[0]:
        m1 = (abs(rel_err1), m)
    if abs(rel_err2) > m2[0]:
        m2 = (abs(rel_err2), m)
    print(f"n={n}, m={m}, first={first}, second={second}, "
          f"actual={actual[m]}, rel_err1={rel_err1}, rel_err2={rel_err2}")
print(m1, m2)
```
@rlucas7 Cool. Yeah, I agree we should start with the second order approximation. Later on, if we get bandwidth, we can try to independently work out higher order approximations. I also have the following approximation based on the inclusion/exclusion formula, using the Lanczos approximation to tame the factorial and cancelling as much as possible. This can be used for smaller values. It doesn't work as k approaches n; I think we can just do a floating point version of the recurrence in those cases, when n is still too small for the asymptotic expansion to be usable.

```python
def stirling2_double(n, k):
    result = 0
    k_choose_j = 1
    for j in range(k):
        term = (
            (-1)**j * k_choose_j * (1 - j/k)**(k + 0.5)
            * (k - j)**(n - k - 0.5)
        )
        result += term
        k_choose_j = k_choose_j * (k - j) / (j + 1)
    # lanczos_g and _lanczos_sum_expg_scaled come from scipy's
    # Lanczos approximation machinery
    return (
        result
        * (np.e / (1 + (lanczos_g + 0.5)/k))**(k + 0.5)
        / _lanczos_sum_expg_scaled(k + 1)
    )
```

Let's merge the current PR this week, and I'll submit a PR for the approximation next weekend.

Update: Improved the approximation above by pulling a constant factor out of the sum.
It looks like the docs still aren't rendering the brace notation (and the ...). There are also approximation method(s) to be added. @steppi, if you want, I can put together what I had for Temme's second order and add it here to the existing PR?

I spent some time studying the Lanczos approaches in Pugh's thesis. I'm still working through the Chebyshev approximation part, but I get the gist of that particular approach. I didn't have access to the reference, so I needed to work out the derivation for Lemma 3.2. In any case, I feel comfortable reviewing a PR for that work if you want to put up something for lower values of ...

I'm also not sure what the consensus is on the 0-dimension numpy array; do we need to change the behavior in this PR? I think the answer is yes, but want to confirm before adding another change.
Let’s follow @h-vetinari’s precedent for the factorial functions. He made a good case for it in gh-18768.
Let’s save it for a followup PR. Once the doc issues are resolved and the behavior lines up with the factorial functions regarding 0d arrays, I think it’ll be ready to merge.
Besides the 0d array thing, we made
The reason we used
Sorry, that’s not what I meant. The factorial functions also work with Python ints internally and return an object array, but the input array must be some kind of numpy int dtype. I’d made a suggestion earlier that would let
Ah, I see what you mean. IIRC the current implementation raises an error when the input is an object array. There is a test that confirms this behavior in this PR. Is there something that needs to be done for object input beyond what's already done?
should fix the 0-dim case and not break the handling for the existing scalar case.

Co-authored-by: Daniel Schmitz <40656107+dschmitz89@users.noreply.github.com>
Co-authored-by: Albert Steppi <albert.steppi@gmail.com>

change test w/decision to *not* accept input array w/dtype object
If it helps (for example with the whole 0d vs. scalar thing), you can probably take a look at the test infrastructure for factorial2 (factorial itself has a bunch of legacy stuff in there and is a bit more distracting), particularly the test for corner cases.

PS. The factorial functions (with exact=True) also switch to object dtype when the respective integer types are too small.
Looks good, I think this is ready. Thanks @rlucas7. If no one objects I'll merge tomorrow evening UTC-4.
Yes, that would be awesome.
Reference issue
Closes gh-17890
What does this implement/fix?
Implements a ufunc in scipy.special for Stirling numbers of the second kind, in scipy/special/cephes.
.Additional information
The method follows the discussion in the linked issue.