ENH: cluster: rewrite and optimize `vq` in Cython #3683
Conversation
Looks like a good speedup for `vq`.
In the Cython code, the parts for float64 and float32 are the same except for the BLAS calls and dtypes, so I would suggest to use fused types. Did you consider or try that?
Oh, I've considered using fused types, but it seems to be an experimental feature according to the Cython docs.
It's still not 100% reliable, but for this simple case it should work. And otherwise manual templating would still be preferable to duplicating the functions I'd think. @pv indeed had an issue with fused types when replacing the SWIG wrappers in scipy.sparse. Pauli, can you comment on whether or not to use fused types here?
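For reference, a minimal sketch of the fused-types approach under discussion; `vq_type` and `_vq_kernel` are illustrative names here, not the PR's actual code:

```cython
cimport numpy as np

# One fused type covering both supported dtypes; Cython generates a
# specialization of each function per member, so the float32/float64
# bodies need not be duplicated by hand.
ctypedef fused vq_type:
    np.float32_t
    np.float64_t

cdef void _vq_kernel(vq_type *obs, vq_type *code_book,
                     int ncodes, int nfeat) nogil:
    # The BLAS dispatch can branch on the fused type at compile time:
    #   if vq_type is np.float32_t: <call the single-precision routine>
    #   else:                       <call the double-precision routine>
    pass
```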
When I test this in a setup where cblas.h is not found, the build works but I get three test failures:
We should be able to get rid of the conditional import. Just not sure yet what the best way is.
@rgommers: the problems in scipy.sparse were mainly due to the large number of data types needed to be supported for scipy.sparse, and the fact that the Cython dispatch code was directly user-facing. I don't think those problems would manifest here (and if they do, templating as in scipy/sparse/_csparsetools.pyx.in may be a better solution than manual code duplication).
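For comparison, a rough sketch of what the Tempita-style templating used by `scipy/sparse/_csparsetools.pyx.in` could look like here; the dtype list and generated function are illustrative only:

```cython
cimport numpy as np

{{py:
# (dtype name, BLAS prefix) pairs the template expands over
dtypes = [('float32', 's'), ('float64', 'd')]
}}

{{for DTYPE, PREFIX in dtypes}}
cdef np.{{DTYPE}}_t _sqdist_{{DTYPE}}(int n, np.{{DTYPE}}_t *x,
                                      np.{{DTYPE}}_t *y):
    # a real implementation might call the {{PREFIX}}dot/{{PREFIX}}gemm
    # BLAS routines here instead of a plain loop
    cdef np.{{DTYPE}}_t s = 0
    cdef int i
    for i in range(n):
        s += (x[i] - y[i]) * (x[i] - y[i])
    return s
{{endfor}}
```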
Thanks, that's what I thought.
Regarding the test failures, the first two should have shown up before if importing …

Reading up on the threads on using function pointers exposed by …
@richardtsai here is the change needed for the Bento build: rgommers@22e70ba12a349b
numpy has its own …
I just noticed that MKL provides a …
On OS X this also doesn't work out of the box. There are also warnings from Cython there that are easy to resolve:

After …
The plain BLAS isn't optimized well for … I think we should optimize the performance for the case where an accelerated BLAS is available.
Agreed that performance is not important in that case, but it should work. So the plain Python code needs a fix for those test failures, and a sanity check (it should also not be unusably slow for reasonable input sizes).
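A minimal sketch of what a vectorized pure-NumPy fallback could look like (an illustration of the idea, not the PR's actual `py_vq`):

```python
import numpy as np

def py_vq_sketch(obs, code_book):
    # pairwise Euclidean distances, shape (nobs, ncodes); vectorized so
    # the fallback stays usable for reasonable input sizes
    diff = obs[:, np.newaxis, :] - code_book[np.newaxis, :, :]
    dist = np.sqrt((diff * diff).sum(axis=2))
    codes = dist.argmin(axis=1)                        # nearest code word
    return codes, dist[np.arange(obs.shape[0]), codes]
```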
@pv do you think just adding …
I think we can't assume the C BLAS interface is always available, as we only require Fortran BLAS.
Also, since …
Updated. It uses fused types now and should be a bit clearer.
It seems that …
Hmm, totally missed that. I usually work with numpy 1.5.1, but apparently when I checked …
@richardtsai I would suggest starting to make slightly smaller commits. The last one I would have split up into a change for …
```cython
        int32_t *codes, vq_type *low_dist):
    """
    Vector quantization for float32 using naive algorithm.
    This is perfered when nfeat is small.
```
typo: perfered
@rgommers What is SciPy's policy for documenting private functions? I know we don't tend to document C functions, but we should.
LGTM overall. My only comment would be that the N=5 cutoff point for the naive algorithm is probably going to be processor-dependent. This might be an okay heuristic, but I'm not sure. However, this excerpt from the Zen of Python comes to mind: …
Perhaps a way to manually control the strategy would be desirable, with the default being to use the heuristic?
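A hypothetical sketch of that suggestion; `vq_dispatch`, `NFEAT_CUTOFF`, and the `_vq_naive`/`_vq_blas` helpers are all illustrative names, not the PR's API:

```python
NFEAT_CUTOFF = 5  # the heuristic under discussion; possibly CPU-dependent

def vq_dispatch(obs, code_book, strategy="auto"):
    """Pick the naive or BLAS-based path; 'auto' uses the heuristic."""
    nfeat = obs.shape[1]
    if strategy == "naive" or (strategy == "auto" and nfeat < NFEAT_CUTOFF):
        return _vq_naive(obs, code_book)  # plain loops, no BLAS call overhead
    return _vq_blas(obs, code_book)       # gemm-based path for larger nfeat
```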
```cython
    offset += nfeat
...
def vq(np.ndarray obs, np.ndarray codes):
```
This is the visible function, correct? If so, it needs more documentation.
Users would not be aware of this function; they should call `cluster.vq.vq` instead. But of course more documentation would make it more maintainable.
This looks OK to me.
@dwf The magic cutoff number doesn't bother me, although it should be set someplace prominent and documented. I suspect the time difference is due to function call overhead -- the BLAS calls are rather heavy -- and thus not too sensitive to processor type.
@charris @dwf It seems that when nfeat is small, the BLAS calls are not the hotspots. I suspect that it is the code that updates …
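An illustrative Cython sketch of the naive path (not the PR's exact code): for small nfeat the inner distance loop is trivial, so the per-observation bookkeeping below and any per-call BLAS overhead dominate the runtime:

```cython
cimport numpy as np
from libc.math cimport sqrt

cdef void _vq_naive_sketch(double *obs, double *code_book,
                           int nobs, int ncodes, int nfeat,
                           np.int32_t *codes, double *low_dist) nogil:
    cdef int i, j, k
    cdef double d, diff
    for i in range(nobs):
        low_dist[i] = 1e308                  # effectively +inf
        for j in range(ncodes):
            d = 0
            for k in range(nfeat):           # cheap when nfeat is small
                diff = obs[i * nfeat + k] - code_book[j * nfeat + k]
                d += diff * diff
            if d < low_dist[i]:              # the update code in question
                low_dist[i] = d
                codes[i] = j
        low_dist[i] = sqrt(low_dist[i])
```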
Rewrite the underlying part of cluster.vq in Cython. In addition, an optimized algorithm is implemented that improves performance for datasets with large nfeat when built with an optimized BLAS library.
Split up some commits.
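For context, the BLAS-friendly formulation presumably behind the "optimized algorithm" expands ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2, so the cross term becomes a single matrix multiply. A NumPy sketch of that idea (the PR's Cython code may differ in details):

```python
import numpy as np

def vq_gemm_sketch(obs, code_book):
    obs_sq = (obs * obs).sum(axis=1)[:, np.newaxis]  # ||x||^2, (nobs, 1)
    code_sq = (code_book * code_book).sum(axis=1)    # ||c||^2, (ncodes,)
    # the -2 x.c cross term is one gemm; this is where large nfeat pays off
    dist_sq = obs_sq - 2.0 * obs @ code_book.T + code_sq
    codes = dist_sq.argmin(axis=1)
    # clip tiny negative values caused by floating-point cancellation
    low_dist = np.sqrt(np.maximum(dist_sq[np.arange(len(obs)), codes], 0))
    return codes, low_dist
```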
@richardtsai no problem. This looks good to me.
Re …
Sounds fine to me. Chuck was suggesting it probably isn't that CPU-dependent.
OK, let's call this good. Thanks Richard, all.
Rewrite the underlying part of cluster.vq in Cython. In addition, an optimized algorithm is implemented that improves the performance for datasets with large nfeat when built with an optimized BLAS library.

I tested the new algorithm on datasets of different sizes and noticed that there is almost no speedup (sometimes even a slowdown) when nfeat is small, so I switched back to the naive algorithm when nfeat < 5.

Performance comparison is as follows (`vq.kmeans(obs, k)`, ATLAS 3.11.13, i5-3470; N is #obs, M is #feat, K is #codes). The speedup of `vq.kmeans` is not as significant as that of `vq.vq` because the update step of kmeans is still time-consuming.
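For context, the rough shape of such a benchmark (the sizes here are arbitrary illustrations, not the author's actual test data):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

np.random.seed(0)
obs = whiten(np.random.randn(10000, 30))  # N=10000 observations, M=30 features
code_book, _ = kmeans(obs, 10)            # K=10 code words
codes, dist = vq(obs, code_book)          # the rewritten hot path
```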