
ENH: cluster: rewrite and optimize vq in Cython #3683

Merged: 4 commits merged into scipy:master on Jun 1, 2014

Conversation

@cairijun (Contributor)

Rewrite the underlying part of cluster.vq in Cython.

In addition, an optimized algorithm is implemented that improves performance for datasets with large nfeat when SciPy is built against an optimized BLAS library.

I tested the new algorithm on datasets of different sizes and noticed that there is almost no speedup (sometimes even a slowdown) when nfeat is small, so I switch back to the naive algorithm when nfeat < 5. A sketch of the two strategies is given below.
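For reference, a minimal NumPy sketch of the dispatch between the two strategies; it assumes the optimized path is the standard expansion ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2, whose cross term np.dot() hands to an optimized BLAS gemm. The names vq_sketch and NFEAT_CUTOFF are hypothetical, not the PR's code:

```python
import numpy as np

NFEAT_CUTOFF = 5  # hypothetical name for the nfeat < 5 switch

def vq_sketch(obs, code_book):
    nobs, nfeat = obs.shape
    if nfeat < NFEAT_CUTOFF:
        # Naive path: direct distances, no BLAS call overhead.
        dist_sqr = ((obs[:, None, :] - code_book[None, :, :]) ** 2).sum(axis=2)
    else:
        # Optimized path: ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, where
        # the cross term is one matrix product dispatched to BLAS gemm.
        obs_sqr = (obs * obs).sum(axis=1)[:, None]
        code_sqr = (code_book * code_book).sum(axis=1)[None, :]
        dist_sqr = obs_sqr - 2.0 * np.dot(obs, code_book.T) + code_sqr
    codes = dist_sqr.argmin(axis=1)
    # Guard against tiny negative values from floating-point cancellation.
    low_dist = np.sqrt(np.maximum(dist_sqr[np.arange(nobs), codes], 0.0))
    return codes.astype(np.int32), low_dist
```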

[speedup_test plot: measured speedup vs. dataset shape; N is #obs, M is #feat, K is #codes]

Performance comparison is as follows. (vq.kmeans(obs, k), ATLAS 3.11.13, i5-3470)

| dataset                      | old (ms)   | new (ms)   | speedup |
|------------------------------|------------|------------|---------|
| 30000 x 100, k = 10, float64 | 28636.2594 | 16950.3884 | 1.69    |
| 30000 x 100, k = 10, float32 | 13575.0532 | 6483.4734  | 2.09    |
| 10000 x 30, k = 10, float64  | 1923.3946  | 1146.7348  | 1.68    |
| 10000 x 30, k = 10, float32  | 1491.8790  | 989.2090   | 1.51    |
| 100000 x 3, k = 4, float64   | 1663.1430  | 1181.5912  | 1.41    |
| 100000 x 3, k = 4, float32   | 1278.0330  | 967.4396   | 1.32    |
| 10000 x 1, k = 3, float64    | 117.6908   | 94.0622    | 1.25    |
| 10000 x 1, k = 3, float32    | 110.2874   | 98.6826    | 1.12    |
| 1000 x 3, k = 4, float64     | 29.2300    | 28.6970    | 1.02    |
| 1000 x 3, k = 4, float32     | 27.5242    | 29.2488    | 0.94    |

The speedup of vq.kmeans is not as significant as that of vq.vq because the update step of kmeans is still time-consuming; a sketch of that step follows.
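For context, a minimal NumPy sketch of that update step (hypothetical function name; assumes no cluster ends up empty):

```python
import numpy as np

def update_step_sketch(obs, codes, k):
    # Recompute each centroid as the mean of the observations assigned
    # to it; this group-by-label pass is what vq() does not accelerate.
    return np.array([obs[codes == j].mean(axis=0) for j in range(k)])
```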

@rgommers added this to the 0.15.0 milestone on May 26, 2014
@rgommers (Member)

Looks like a good speedup for vq. I like the detailed benchmark here.

@rgommers (Member)

In the Cython code, the parts for float64 and float32 are the same except for the BLAS calls and dtypes, so I would suggest using fused types. Did you consider or try that?
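For illustration, a minimal Cython sketch of what fused types could look like here (hypothetical code, not the PR's, though a vec_sqr helper by this name does appear in the profiling output later in the thread):

```cython
cimport numpy as np

ctypedef fused vq_type:
    np.float32_t
    np.float64_t

cdef inline vq_type vec_sqr(int n, vq_type *p):
    # Cython generates one specialization per member of the fused type,
    # so the float32/float64 bodies no longer need to be duplicated.
    cdef vq_type result = 0.0
    cdef int i
    for i in range(n):
        result += p[i] * p[i]
    return result
```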

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling 812bcf1 on richardtsai:vq_rewrite into ceb1a12 on scipy:master.

@cairijun (Contributor, Author)

Oh, I considered using fused types, but it seems to be an experimental feature according to the Cython docs.

@rgommers (Member)

It's still not 100% reliable, but for this simple case it should work. And otherwise, manual templating would still be preferable to duplicating the functions, I'd think.

@pv indeed had an issue with fused types when replacing the SWIG wrappers in scipy.sparse. Pauli, can you comment on whether or not to use fused types here?

@rgommers (Member)

When I test this without cblas.h being found, the build works but I get three test failures:

======================================================================
ERROR: Testing simple call to kmeans2 with rank 1 data.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/Code/scipy/scipy/cluster/tests/test_vq.py", line 191, in test_kmeans2_rank1
    code1 = kmeans2(data1, code, iter=1)[0]
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 723, in kmeans2
    return _kmeans2(data, clusters, iter, nc, _valid_miss_meth[missing])
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 735, in _kmeans2
    label = vq(data, code)[0]
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 213, in vq
    results = py_vq(obs, code_book)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 256, in py_vq
    return _py_vq_1d(obs, code_book)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 299, in _py_vq_1d
    raise RuntimeError("_py_vq_1d buggy, do not use rank 1 arrays for now")
RuntimeError: _py_vq_1d buggy, do not use rank 1 arrays for now

======================================================================
ERROR: Testing simple call to kmeans2 with rank 1 data.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/Code/scipy/scipy/cluster/tests/test_vq.py", line 200, in test_kmeans2_rank1_2
    code1 = kmeans2(data1, 2, iter=1)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 723, in kmeans2
    return _kmeans2(data, clusters, iter, nc, _valid_miss_meth[missing])
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 735, in _kmeans2
    label = vq(data, code)[0]
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 213, in vq
    results = py_vq(obs, code_book)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 256, in py_vq
    return _py_vq_1d(obs, code_book)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 299, in _py_vq_1d
    raise RuntimeError("_py_vq_1d buggy, do not use rank 1 arrays for now")
RuntimeError: _py_vq_1d buggy, do not use rank 1 arrays for now

======================================================================
ERROR: test_vq_large_nfeat (test_vq.TestVq)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/Code/scipy/scipy/cluster/tests/test_vq.py", line 118, in test_vq_large_nfeat
    codes0, dis0 = _vq.vq(X, code_book)
NameError: global name '_vq' is not defined

----------------------------------------------------------------------

We should be able to get rid of the conditional import. Just not sure yet what the best way is.

@pv (Member)

pv commented May 26, 2014

@rgommers: the problems in scipy.sparse were mainly due to the large number of data types that needed to be supported there, and the fact that the Cython dispatch code was directly user-facing. I don't think those problems would manifest here (and if they do, templating as in scipy/sparse/_csparsetools.pyx.in may be a better solution than manual code duplication).

@rgommers (Member)

Thanks, that's what I thought.

@rgommers (Member)

Regarding the test failures: the first two should have shown up whenever importing _vq failed, which apparently never happened before. Now that cblas.h is used, I guess the import may fail, though I'm not sure when.

Reading up on the threads about using function pointers exposed by scipy.linalg, that also doesn't sound like fun.

@rgommers (Member)

@richardtsai here is the change needed for the Bento build: rgommers@22e70ba12a349b

@cairijun (Contributor, Author)

numpy has its own cblas.h in numpy/core/blasdot, and that is what I generated cblas.pxd from. Maybe we could also keep a cblas.h (containing only the essential declarations) in scipy/cluster? BLAS is a build dependency of scipy, so these BLAS routines will always be available.

@cairijun (Contributor, Author)

I just noticed that MKL provides a mkl_cblas.h instead of cblas.h, so the current version may fail with MKL.

@rgommers (Member)

numpy.core._dotblas only gets built with ATLAS, not with plain BLAS/LAPACK. I think we have the same issue here.

@rgommers (Member)

On OS X this also doesn't work out of the box. There are also warnings from Cython that are easy to resolve:

Cythonizing sources
Processing scipy/cluster/_vq.pyx
warning: _vq.pyx:46:19: Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.
warning: _vq.pyx:46:27: Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.
warning: _vq.pyx:142:19: Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.
warning: _vq.pyx:142:27: Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.

@rgommers (Member)

After cp ../numpy/numpy/core/blasdot/cblas.h scipy/cluster/, the build works fine on OS X and all tests pass.

@rgommers changed the title from "ENH: cluster: rewrite and optimized vq in Cython" to "ENH: cluster: rewrite and optimize vq in Cython" on May 26, 2014
@cairijun (Contributor, Author)

The plain reference BLAS isn't well optimized for gemm, so the new algorithm may be useless there due to overhead. We could just fall back to the naive algorithm via conditional compilation when users don't have ATLAS, OpenBLAS, Accelerate, or MKL.

@pv (Member)

pv commented May 26, 2014

I think we should optimize performance for the case where an accelerated BLAS is available. Not having an optimized BLAS installed amounts to deciding that performance does not matter at all, and in that case we don't need to care about performance numbers.

@rgommers (Member)

Agreed that performance is not important in that case, but it should work. So the plain Python code needs a fix for those test failures, and a sanity check (it should also not be unusably slow for reasonable input sizes).

@rgommers (Member)

@pv do you think just adding cblas.h (named vq_cblas.h I guess, like in spatial) is enough?

@pv (Member)

pv commented May 26, 2014

I think we can't assume the C BLAS interface is always available, as we only require Fortran BLAS; scipy.spatial calls the Fortran ABI.

@pv (Member)

pv commented May 26, 2014

Also, since sdot is used here, one needs to replace sdot -> wsdot and use our BLAS wrappers.

@cairijun (Contributor, Author)

Updated. It uses fused types now, which should make it a bit clearer. I also switched to the BLAS Fortran ABI, so it no longer needs cblas; I've tested it with plain BLAS/LAPACK.
As for sdot/ddot, I dropped them because calling a Fortran function from Cython is not as easy as calling a Fortran subroutine, and they don't seem to be a performance hotspot. I wrote an inline substitute and it works fine without any performance loss.
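For reference, using the Fortran ABI from Cython amounts to hand-declaring the mangled symbols, roughly like this (a hypothetical sketch, not the PR's .pxd; the trailing underscore follows the common Fortran name-mangling convention, and the extension must be linked against a BLAS library):

```cython
cdef extern from *:
    # Fortran passes every argument by pointer, including scalars.
    void dgemm_(char *transa, char *transb, int *m, int *n, int *k,
                double *alpha, double *a, int *lda,
                double *b, int *ldb, double *beta,
                double *c, int *ldc) nogil
```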

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling c14f29c on richardtsai:vq_rewrite into 569b1af on scipy:master.

@cairijun (Contributor, Author)

It seems that the copy argument of ndarray.astype is not supported prior to numpy 1.7.

@rgommers (Member)

Hmm, I totally missed that. I usually work with numpy 1.5.1, but apparently when I checked astype I happened to have an IPython shell with numpy master open. So your code was the right thing to do; maybe just add a comment to replace it with astype(ct, copy=False) once we require numpy 1.7.0. A sketch of the equivalent behaviour follows.
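Until then, a minimal sketch of an equivalent that works on older numpy (the helper name is hypothetical; np.asarray only copies when the dtype actually has to change):

```python
import numpy as np

def _astype_no_copy(arr, ct):
    # Pre-numpy-1.7 stand-in for arr.astype(ct, copy=False): asarray
    # returns arr unchanged when the dtype already matches.
    return np.asarray(arr, dtype=ct)
```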

@rgommers (Member)

@richardtsai I would suggest starting to make slightly smaller commits. I would have split the last one into one change for astype, one for fused types, and one for switching to Fortran BLAS. If one of them then turns out not to be OK, it's easier to undo or change.

int32_t *codes, vq_type *low_dist):
"""
Vector quantization for float32 using naive algorithm.
This is perfered when nfeat is small.
Member

typo: perfered

Member

@rgommers What is scipy policy for documenting private functions? I know we don't tend to document C functions, but we should.

@rgommers (Member)

@dwf @charris any comments on the algorithm changes?

@dwf (Contributor)

dwf commented May 27, 2014

LGTM overall. My only comment would be that the nfeat = 5 cutoff point for the naive algorithm is probably going to be processor dependent. This might be an okay heuristic, but I'm not sure. However, this excerpt from the Zen of Python comes to mind:

In the face of ambiguity, refuse the temptation to guess.

Perhaps a way to manually control the strategy would be desirable, with the default being to use the heuristic?

offset += nfeat


def vq(np.ndarray obs, np.ndarray codes):
Member

This is the visible function, correct? If so, it needs more documentation.

Contributor Author

Users would not be aware of this function; they should call cluster.vq.vq instead. But of course more documentation would make it more maintainable.

Member

This looks OK to me.

@charris (Member)

charris commented May 29, 2014

@dwf The magic cutoff number doesn't bother me, although it should be set someplace prominent and documented. I suspect the time difference is due to function call overhead -- the BLAS calls are rather heavy -- and thus not too sensitive to processor type.

@cairijun (Contributor, Author)

@charris @dwf
I've done some profiling.

In [19]: obs.shape
Out[19]: (100000, 5)

In [20]: codes.shape
Out[20]: (30, 5)

In [21]: %prun vq.vq(obs, codes)
         200022 function calls in 0.062 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.044    0.044    0.062    0.062 _vq.pyx:66(__pyx_fuse_1_vq)
   100030    0.008    0.000    0.008    0.000 _vq.pyx:41(__pyx_fuse_1vec_sqr)
    99970    0.006    0.000    0.006    0.000 _vq.pyx:34(__pyx_fuse_1_sqrt)
        1    0.004    0.004    0.004    0.004 _vq.pyx:49(__pyx_fuse_1cal_M)
        1    0.000    0.000    0.062    0.062 vq.py:142(vq)
...
In [24]: obs.shape
Out[24]: (100000, 4)

In [25]: codes.shape
Out[25]: (30, 4)

In [26]: %prun vq.vq(obs, codes)
         100022 function calls in 0.045 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.039    0.039    0.045    0.045 _vq.pyx:122(__pyx_fuse_1_vq_small_nf)
   100000    0.006    0.000    0.006    0.000 _vq.pyx:34(__pyx_fuse_1_sqrt)
        1    0.000    0.000    0.045    0.045 _vq.pyx:165(vq)
        1    0.000    0.000    0.045    0.045 vq.py:142(vq)
...

It seems that when nfeat is small, the BLAS calls are not the hotspots. I suspect that the code updating low_dist and codes becomes the most time-consuming part, and since that part is shared by the two algorithms, I think the cutoff is not very processor dependent; a sketch of that shared pass follows.
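To make that concrete, a minimal sketch of that shared pass (hypothetical code, not the PR's):

```python
import math
import numpy as np

def assign_codes(dist_sqr):
    # Whichever way dist_sqr was computed, each observation still scans
    # its row for the nearest code; this scan dominates for small nfeat.
    nobs, ncodes = dist_sqr.shape
    codes = np.empty(nobs, dtype=np.int32)
    low_dist = np.empty(nobs)
    for i in range(nobs):
        best = 0
        for j in range(1, ncodes):
            if dist_sqr[i, j] < dist_sqr[i, best]:
                best = j
        codes[i] = best
        low_dist[i] = math.sqrt(dist_sqr[i, best])
    return codes, low_dist
```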

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling a3736e2 on richardtsai:vq_rewrite into 93656a8 on scipy:master.

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling ed1d7a2 on richardtsai:vq_rewrite into 93656a8 on scipy:master.

@cairijun (Contributor, Author)

cairijun commented Jun 1, 2014

- Split up some commits.
- Added comments.
- Manually import NPY_INFINITY from numpy/arrayobject.h, since numpy/math.pxd is not available prior to Cython 0.20 (see the sketch below).

@rgommers I tried to split the fused-types and BLAS-Fortran-ABI changes into two commits, but failed since I wrote them together.
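The manual import presumably boils down to an extern declaration along these lines (a sketch under that assumption):

```cython
cdef extern from "numpy/arrayobject.h":
    # NPY_INFINITY is a C macro; declaring it as a double lets Cython
    # treat it as a read-only constant.
    double NPY_INFINITY
```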

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling dd8a93f on richardtsai:vq_rewrite into 93656a8 on scipy:master.

@rgommers (Member)

rgommers commented Jun 1, 2014

@richardtsai no problem. This looks good to me.

@rgommers (Member)

rgommers commented Jun 1, 2014

@charris @dwf are you OK with the cutoff of 5 as is, given the profiling results above?

@rgommers (Member)

rgommers commented Jun 1, 2014

Re NPY_INFINITY: this is fine, but I wouldn't be opposed to just bumping the Cython version to 0.20. That prevents possible bug hunting for already fixed issues (wouldn't be the first time), and Python 3.4 support needs 0.20 anyway: https://github.com/cython/cython/blob/0.20.x/CHANGES.rst

@dwf (Contributor)

dwf commented Jun 1, 2014

Sounds fine to me. Chuck was suggesting it probably isn't that CPU dependent anyway.

rgommers added a commit referencing this pull request on Jun 1, 2014: "ENH: cluster: rewrite and optimize `vq` in Cython"

@rgommers merged commit 99aff1e into scipy:master on Jun 1, 2014
@rgommers (Member)

rgommers commented Jun 1, 2014

OK, let's call this good. Thanks Richard, all.

@cairijun deleted the vq_rewrite branch on June 3, 2014
Labels: enhancement (A new feature or improvement), scipy.cluster