
ENH: cluster: rewrite and optimize vq in Cython #3683

Merged: 4 commits merged into scipy:master on Jun 1, 2014

Conversation

@cairijun (Contributor)

Rewrite the underlying part of cluster.vq in Cython.

In addition, an optimized algorithm is implemented that improves performance for datasets with large nfeat when SciPy is built against an optimized BLAS library.

I tested the new algorithm on datasets of different sizes and noticed that there is almost no speedup (sometimes even a slowdown) when nfeat is small, so I switch back to the naive algorithm when nfeat < 5. A sketch of the two strategies is given below.
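For reference, a minimal NumPy sketch of the dispatch between the two strategies; it assumes the optimized path is the standard expansion ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2, whose cross term np.dot() hands to an optimized BLAS gemm. The names vq_sketch and NFEAT_CUTOFF are hypothetical, not the PR's code:

```python
import numpy as np

NFEAT_CUTOFF = 5  # hypothetical name for the nfeat < 5 switch

def vq_sketch(obs, code_book):
    nobs, nfeat = obs.shape
    if nfeat < NFEAT_CUTOFF:
        # Naive path: direct distances, no BLAS call overhead.
        dist_sqr = ((obs[:, None, :] - code_book[None, :, :]) ** 2).sum(axis=2)
    else:
        # Optimized path: ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, where
        # the cross term is one matrix product dispatched to BLAS gemm.
        obs_sqr = (obs * obs).sum(axis=1)[:, None]
        code_sqr = (code_book * code_book).sum(axis=1)[None, :]
        dist_sqr = obs_sqr - 2.0 * np.dot(obs, code_book.T) + code_sqr
    codes = dist_sqr.argmin(axis=1)
    # Guard against tiny negative values from floating-point cancellation.
    low_dist = np.sqrt(np.maximum(dist_sqr[np.arange(nobs), codes], 0.0))
    return codes.astype(np.int32), low_dist
```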

[speedup_test plot: measured speedup vs. dataset shape; N is #obs, M is #feat, K is #codes]

Performance comparison is as follows. (vq.kmeans(obs, k), ATLAS 3.11.13, i5-3470)

| dataset                      | old (ms)   | new (ms)   | speedup |
|------------------------------|------------|------------|---------|
| 30000 x 100, k = 10, float64 | 28636.2594 | 16950.3884 | 1.69    |
| 30000 x 100, k = 10, float32 | 13575.0532 | 6483.4734  | 2.09    |
| 10000 x 30, k = 10, float64  | 1923.3946  | 1146.7348  | 1.68    |
| 10000 x 30, k = 10, float32  | 1491.8790  | 989.2090   | 1.51    |
| 100000 x 3, k = 4, float64   | 1663.1430  | 1181.5912  | 1.41    |
| 100000 x 3, k = 4, float32   | 1278.0330  | 967.4396   | 1.32    |
| 10000 x 1, k = 3, float64    | 117.6908   | 94.0622    | 1.25    |
| 10000 x 1, k = 3, float32    | 110.2874   | 98.6826    | 1.12    |
| 1000 x 3, k = 4, float64     | 29.2300    | 28.6970    | 1.02    |
| 1000 x 3, k = 4, float32     | 27.5242    | 29.2488    | 0.94    |

The speedup of vq.kmeans is not as significant as that of vq.vq because the update step of kmeans is still time-consuming; a sketch of that step follows.
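For context, a minimal NumPy sketch of that update step (hypothetical function name; assumes no cluster ends up empty):

```python
import numpy as np

def update_step_sketch(obs, codes, k):
    # Recompute each centroid as the mean of the observations assigned
    # to it; this group-by-label pass is what vq() does not accelerate.
    return np.array([obs[codes == j].mean(axis=0) for j in range(k)])
```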

@rgommers added this to the 0.15.0 milestone on May 26, 2014
@rgommers (Member)

Looks like a good speedup for vq. I like the detailed benchmark here.

@rgommers (Member)

In the Cython code, the parts for float64 and float32 are the same except for the BLAS calls and dtypes, so I would suggest using fused types. Did you consider or try that?
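For illustration, a minimal Cython sketch of what fused types could look like here (hypothetical code, not the PR's, though a vec_sqr helper by this name does appear in the profiling output later in the thread):

```cython
cimport numpy as np

ctypedef fused vq_type:
    np.float32_t
    np.float64_t

cdef inline vq_type vec_sqr(int n, vq_type *p):
    # Cython generates one specialization per member of the fused type,
    # so the float32/float64 bodies no longer need to be duplicated.
    cdef vq_type result = 0.0
    cdef int i
    for i in range(n):
        result += p[i] * p[i]
    return result
```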

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling 812bcf1 on richardtsai:vq_rewrite into ceb1a12 on scipy:master.

@cairijun (Contributor, Author)

Oh, I considered using fused types, but it seems to be an experimental feature according to the Cython docs.

@rgommers (Member)

It's still not 100% reliable, but for this simple case it should work. And otherwise, manual templating would still be preferable to duplicating the functions, I'd think.

@pv indeed had an issue with fused types when replacing the SWIG wrappers in scipy.sparse. Pauli, can you comment on whether or not to use fused types here?

@rgommers (Member)

When I test this without cblas.h being found, the build works but I get three test failures:

======================================================================
ERROR: Testing simple call to kmeans2 with rank 1 data.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/Code/scipy/scipy/cluster/tests/test_vq.py", line 191, in test_kmeans2_rank1
    code1 = kmeans2(data1, code, iter=1)[0]
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 723, in kmeans2
    return _kmeans2(data, clusters, iter, nc, _valid_miss_meth[missing])
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 735, in _kmeans2
    label = vq(data, code)[0]
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 213, in vq
    results = py_vq(obs, code_book)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 256, in py_vq
    return _py_vq_1d(obs, code_book)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 299, in _py_vq_1d
    raise RuntimeError("_py_vq_1d buggy, do not use rank 1 arrays for now")
RuntimeError: _py_vq_1d buggy, do not use rank 1 arrays for now

======================================================================
ERROR: Testing simple call to kmeans2 with rank 1 data.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/Code/scipy/scipy/cluster/tests/test_vq.py", line 200, in test_kmeans2_rank1_2
    code1 = kmeans2(data1, 2, iter=1)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 723, in kmeans2
    return _kmeans2(data, clusters, iter, nc, _valid_miss_meth[missing])
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 735, in _kmeans2
    label = vq(data, code)[0]
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 213, in vq
    results = py_vq(obs, code_book)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 256, in py_vq
    return _py_vq_1d(obs, code_book)
  File "/home/rgommers/Code/scipy/scipy/cluster/vq.py", line 299, in _py_vq_1d
    raise RuntimeError("_py_vq_1d buggy, do not use rank 1 arrays for now")
RuntimeError: _py_vq_1d buggy, do not use rank 1 arrays for now

======================================================================
ERROR: test_vq_large_nfeat (test_vq.TestVq)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rgommers/Code/scipy/scipy/cluster/tests/test_vq.py", line 118, in test_vq_large_nfeat
    codes0, dis0 = _vq.vq(X, code_book)
NameError: global name '_vq' is not defined

----------------------------------------------------------------------

We should be able to get rid of the conditional import. Just not sure yet what the best way is.

@pv (Member)

pv commented May 26, 2014

@rgommers: the problems in scipy.sparse were mainly due to the large number of data types that needed to be supported there, and the fact that the Cython dispatch code was directly user-facing. I don't think those problems would manifest here (and if they do, templating as in scipy/sparse/_csparsetools.pyx.in may be a better solution than manual code duplication).

@rgommers (Member)

Thanks, that's what I thought.

@rgommers (Member)

Regarding the test failures: the first two should have shown up whenever importing _vq failed, which apparently never happened before. Now that cblas.h is used, I guess the import may fail, though I'm not sure when.

Reading up on the threads about using function pointers exposed by scipy.linalg, that also doesn't sound like fun.

@rgommers (Member)

@richardtsai here is the change needed for the Bento build: rgommers@22e70ba12a349b

@cairijun (Contributor, Author)

numpy has its own cblas.h in numpy/core/blasdot, and that is what I generated cblas.pxd from. Maybe we could also keep a cblas.h (containing only the essential declarations) in scipy/cluster? BLAS is a build dependency of scipy, so these BLAS routines will always be available.

@cairijun (Contributor, Author)

I just noticed that MKL provides a mkl_cblas.h instead of cblas.h, so the current version may fail with MKL.

@rgommers (Member)

numpy.core._dotblas only gets built with ATLAS, not with plain BLAS/LAPACK. I think we have the same issue here.

@rgommers (Member)

On OS X this also doesn't work out of the box. There are also warnings from Cython that are easy to resolve:

Cythonizing sources
Processing scipy/cluster/_vq.pyx
warning: _vq.pyx:46:19: Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.
warning: _vq.pyx:46:27: Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.
warning: _vq.pyx:142:19: Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.
warning: _vq.pyx:142:27: Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.

@rgommers (Member)

After cp ../numpy/numpy/core/blasdot/cblas.h scipy/cluster/, the build works fine on OS X and all tests pass.

@rgommers changed the title from "ENH: cluster: rewrite and optimized vq in Cython" to "ENH: cluster: rewrite and optimize vq in Cython" on May 26, 2014
@cairijun (Contributor, Author)

The plain reference BLAS isn't well optimized for gemm, so the new algorithm may be useless there due to overhead. We could just fall back to the naive algorithm via conditional compilation when users don't have ATLAS, OpenBLAS, Accelerate, or MKL.

@pv (Member)

pv commented May 26, 2014

I think we should optimize performance for the case where an accelerated BLAS is available. Not having an optimized BLAS installed amounts to deciding that performance does not matter at all, and in that case we don't need to care about performance numbers.

@rgommers (Member)

Agreed that performance is not important in that case, but it should work. So the plain Python code needs a fix for those test failures, and a sanity check (it should also not be unusably slow for reasonable input sizes).

@rgommers (Member)

@pv do you think just adding cblas.h (named vq_cblas.h I guess, like in spatial) is enough?

@pv (Member)

pv commented May 26, 2014

I think we can't assume the C BLAS interface is always available, as we only require Fortran BLAS; scipy.spatial calls the Fortran ABI.

@pv (Member)

pv commented May 26, 2014

Also, since sdot is used here, one needs to replace sdot -> wsdot and use our BLAS wrappers.

@cairijun (Contributor, Author)

Updated. It uses fused types now, which should make it a bit clearer. I also switched to the BLAS Fortran ABI, so it no longer needs cblas; I've tested it with plain BLAS/LAPACK.
As for sdot/ddot, I dropped them because calling a Fortran function from Cython is not as easy as calling a Fortran subroutine, and they don't seem to be a performance hotspot. I wrote an inline substitute and it works fine without any performance loss.
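For reference, using the Fortran ABI from Cython amounts to hand-declaring the mangled symbols, roughly like this (a hypothetical sketch, not the PR's .pxd; the trailing underscore follows the common Fortran name-mangling convention, and the extension must be linked against a BLAS library):

```cython
cdef extern from *:
    # Fortran passes every argument by pointer, including scalars.
    void dgemm_(char *transa, char *transb, int *m, int *n, int *k,
                double *alpha, double *a, int *lda,
                double *b, int *ldb, double *beta,
                double *c, int *ldc) nogil
```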

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling c14f29c on richardtsai:vq_rewrite into 569b1af on scipy:master.

@cairijun (Contributor, Author)

It seems that the copy argument of ndarray.astype is not supported prior to numpy 1.7.

@rgommers (Member)

Hmm, I totally missed that. I usually work with numpy 1.5.1, but apparently when I checked astype I happened to have an IPython shell with numpy master open. So your code was the right thing to do; maybe just add a comment to replace it with astype(ct, copy=False) once we require numpy 1.7.0. A sketch of the equivalent behaviour follows.
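Until then, a minimal sketch of an equivalent that works on older numpy (the helper name is hypothetical; np.asarray only copies when the dtype actually has to change):

```python
import numpy as np

def _astype_no_copy(arr, ct):
    # Pre-numpy-1.7 stand-in for arr.astype(ct, copy=False): asarray
    # returns arr unchanged when the dtype already matches.
    return np.asarray(arr, dtype=ct)
```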

@rgommers (Member)

@richardtsai I would suggest starting to make slightly smaller commits. I would have split the last one into one change for astype, one for fused types, and one for switching to Fortran BLAS. If one of them then turns out not to be OK, it's easier to undo or change.

int32_t *codes, vq_type *low_dist):
"""
Vector quantization for float32 using naive algorithm.
This is perfered when nfeat is small.
Member

typo: perfered

Member

@rgommers What is scipy policy for documenting private functions? I know we don't tend to document C functions, but we should.

@rgommers (Member)

@dwf @charris any comments on the algorithm changes?

@dwf (Contributor)

dwf commented May 27, 2014

LGTM overall. My only comment would be that the nfeat = 5 cutoff point for the naive algorithm is probably going to be processor dependent. This might be an okay heuristic, but I'm not sure. However, this excerpt from the Zen of Python comes to mind:

In the face of ambiguity, refuse the temptation to guess.

Perhaps a way to manually control the strategy would be desirable, with the default being to use the heuristic?

offset += nfeat


def vq(np.ndarray obs, np.ndarray codes):
Member

This is the visible function, correct? If so, it needs more documentation.

Contributor Author

Users would not be aware of this function; they should call cluster.vq.vq instead. But of course more documentation would make it more maintainable.

Member

This looks OK to me.

@charris (Member)

charris commented May 29, 2014

@dwf The magic cutoff number doesn't bother me, although it should be set someplace prominent and documented. I suspect the time difference is due to function call overhead -- the BLAS calls are rather heavy -- and thus not too sensitive to processor type.

@cairijun (Contributor, Author)

@charris @dwf
I've done some profiling.

In [19]: obs.shape
Out[19]: (100000, 5)

In [20]: codes.shape
Out[20]: (30, 5)

In [21]: %prun vq.vq(obs, codes)
         200022 function calls in 0.062 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.044    0.044    0.062    0.062 _vq.pyx:66(__pyx_fuse_1_vq)
   100030    0.008    0.000    0.008    0.000 _vq.pyx:41(__pyx_fuse_1vec_sqr)
    99970    0.006    0.000    0.006    0.000 _vq.pyx:34(__pyx_fuse_1_sqrt)
        1    0.004    0.004    0.004    0.004 _vq.pyx:49(__pyx_fuse_1cal_M)
        1    0.000    0.000    0.062    0.062 vq.py:142(vq)
...
In [24]: obs.shape
Out[24]: (100000, 4)

In [25]: codes.shape
Out[25]: (30, 4)

In [26]: %prun vq.vq(obs, codes)
         100022 function calls in 0.045 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.039    0.039    0.045    0.045 _vq.pyx:122(__pyx_fuse_1_vq_small_nf)
   100000    0.006    0.000    0.006    0.000 _vq.pyx:34(__pyx_fuse_1_sqrt)
        1    0.000    0.000    0.045    0.045 _vq.pyx:165(vq)
        1    0.000    0.000    0.045    0.045 vq.py:142(vq)
...

It seems that when nfeat is small, the BLAS calls are not the hotspots. I suspect that the code updating low_dist and codes becomes the most time-consuming part, and since that part is shared by the two algorithms, I think the cutoff is not very processor dependent; a sketch of that shared pass follows.
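To make that concrete, a minimal sketch of that shared pass (hypothetical code, not the PR's):

```python
import math
import numpy as np

def assign_codes(dist_sqr):
    # Whichever way dist_sqr was computed, each observation still scans
    # its row for the nearest code; this scan dominates for small nfeat.
    nobs, ncodes = dist_sqr.shape
    codes = np.empty(nobs, dtype=np.int32)
    low_dist = np.empty(nobs)
    for i in range(nobs):
        best = 0
        for j in range(1, ncodes):
            if dist_sqr[i, j] < dist_sqr[i, best]:
                best = j
        codes[i] = best
        low_dist[i] = math.sqrt(dist_sqr[i, best])
    return codes, low_dist
```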

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling a3736e2 on richardtsai:vq_rewrite into 93656a8 on scipy:master.

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling ed1d7a2 on richardtsai:vq_rewrite into 93656a8 on scipy:master.

@cairijun (Contributor, Author)

cairijun commented Jun 1, 2014

- Split up some commits.
- Added comments.
- Manually import NPY_INFINITY from numpy/arrayobject.h, since numpy/math.pxd is not available prior to Cython 0.20 (see the sketch below).

@rgommers I tried to split the fused-types and BLAS-Fortran-ABI changes into two commits, but failed since I wrote them together.
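The manual import presumably boils down to an extern declaration along these lines (a sketch under that assumption):

```cython
cdef extern from "numpy/arrayobject.h":
    # NPY_INFINITY is a C macro; declaring it as a double lets Cython
    # treat it as a read-only constant.
    double NPY_INFINITY
```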

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling dd8a93f on richardtsai:vq_rewrite into 93656a8 on scipy:master.

@rgommers (Member)

rgommers commented Jun 1, 2014

@richardtsai no problem. This looks good to me.

@rgommers (Member)

rgommers commented Jun 1, 2014

@charris @dwf are you OK with the cutoff of 5 as is, given the profiling results above?

@rgommers (Member)

rgommers commented Jun 1, 2014

Re NPY_INFINITY: this is fine, but I wouldn't be opposed to just bumping the Cython version to 0.20. That prevents possible bug hunting for already fixed issues (wouldn't be the first time), and Python 3.4 support needs 0.20 anyway: https://github.com/cython/cython/blob/0.20.x/CHANGES.rst

@dwf (Contributor)

dwf commented Jun 1, 2014

Sounds fine to me. Chuck was suggesting it probably isn't that CPU dependent anyway.

rgommers added a commit referencing this pull request on Jun 1, 2014: "ENH: cluster: rewrite and optimize `vq` in Cython"

@rgommers merged commit 99aff1e into scipy:master on Jun 1, 2014
@rgommers (Member)

rgommers commented Jun 1, 2014

OK, let's call this good. Thanks Richard, all.

@cairijun deleted the vq_rewrite branch on June 3, 2014
Labels: enhancement (A new feature or improvement), scipy.cluster