Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kendall tau implementation uses Python mergesort #5533

Closed
iskandr opened this issue Nov 23, 2015 · 11 comments · Fixed by #5548
Closed

Kendall tau implementation uses Python mergesort #5533

iskandr opened this issue Nov 23, 2015 · 11 comments · Fixed by #5548
Labels
enhancement A new feature or improvement scipy.stats
Milestone

Comments

@iskandr
Copy link

iskandr commented Nov 23, 2015

I just noticed that the implementation of scipy.stats.kendalltau implements its own mergesort. I suspect that any sorting algorithm, especially one that performs recursive function calls, would be orders of magnitude faster using a C implementation.

@rgommers
Copy link
Member

Is this a bottleneck for you? If so, shouldn't be hard to move mergesort to Cython (that's preferred over C here).

@argriffing
Copy link
Contributor

numpy has a few sorts including mergesort I think, but notice that the kendalltau mergesort is instrumented to track the number of exchanges. This exchange count is an ingredient in the tau calculation. Although it wouldn't make sense to switch to an existing non-exchange-counting mergesort implementation, the instrumented mergesort could indeed be implemented in Cython for more speed.

@josef-pkt
Copy link
Member

@argriffing Yes, I remember the algorithm was quite tricky. I wouldn't change anything except literal translation into cython.

@iskandr
Copy link
Author

iskandr commented Nov 24, 2015

It's not a serious bottleneck, the statistic took long enough that I
noticed the delay but it's happening at the end of a much larger
computational pipeline. If the sorting logic is nontrivial then please
leave it as is!
On Nov 24, 2015 5:09 PM, "Josef Perktold" notifications@github.com wrote:

@argriffing https://github.com/argriffing Yes, I remember the algorithm
was quite tricky. I wouldn't change anything except literal translation
into cython.


Reply to this email directly or view it on GitHub
#5533 (comment).

@rgommers rgommers added the enhancement A new feature or improvement label Nov 24, 2015
@rgommers
Copy link
Member

If you noticed it was slow, then it can be sped up. Let's leave this issue open and see if someone feels like moving the thing to Cython without changing the algorithm. Should be a lot faster then.

@behzadnouri
Copy link
Contributor

I opened a pull-request which removes the exchange counting mergesort entirely: #5548

@sturlamolden
Copy link
Contributor

When kendalltau was written we actually experimented with different implementations (three versions, IIRC). One of them was in Cython (which I wrote). I don't recall exactly the reason, but @josef-pkt decided to go with the current. So just a friendly warning, Cython has been tried, but it was not any better when we did.

@josef-pkt
Copy link
Member

@sturlamolden One of the reasons not to go with the cython version was that I was the only stats maintainer, and without any experience with cython I didn't want to get the extra maintenance work. Plus, this was supposed to be a computationally efficient implementation.
Otherwise, I don't remember much of the details. IIRC, there should still be somewhere your cython version for the contingency tables version, or Kendall's D, or something like that.

Now, there several maintainers and they have enough experience with cython to implement or maintain whatever works best.

(Aside, I'm currently in a corner with power and sample size calculations where I just want to get the things to work, without expanding any effort on high performance in large samples. But in that case, we can at least resort to fast asymptotic results for large sample data.)

@josef-pkt
Copy link
Member

(Another aside:
I have currently no intuition for high speed. I partially following the Julia user mailing list, and they are very strongly into using loops with JIT speedup to avoid temporaries and memory allocation. Some examples here seem to indicate that keeping things in numpy with smarter algorithms can still improve performance quite a bit.)

@sturlamolden
Copy link
Contributor

I suggested some improvements for #5548

@sturlamolden
Copy link
Contributor

For the record, here are the Cython versions I wrote in 2009. They cannot be used as they stand (e.g. using C int instead of intp_t), but we can look at them for comparison with the current implementations:
http://projects.scipy.org/scipy/attachment/ticket/893/tau.pyx

@ev-br ev-br added this to the 0.18.0 milestone Apr 10, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants