
Conversation

@larsmans (Member) commented Sep 7, 2013

Here are some safe bits from #2419, which can be pulled without waiting for my threading experiments.

  • nogil declarations where possible (but no with nogil blocks, since the performance benefit of those is not clear)
  • use BLAS where appropriate
  • replaced DOUBLE by double, and INTEGER/int32 by int/np.intc, since those are the types used by scipy.sparse
  • explicit checks that dataset dims don't exceed INT_MAX
  • replaced an np.all by an explicit loop (using the npymath library for C99 isfinite)

I also cleaned up the code a little bit (e.g. min and max no longer need to be defined for current Cython).
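Two of the bullets above (the INT_MAX dimension check and the explicit isfinite loop) can be sketched in plain Python; the function names here are illustrative, not the actual scikit-learn code:

```python
import ctypes
import math

# Largest value a C int can hold on this platform (what np.intc wraps).
INT_MAX = 2 ** (8 * ctypes.sizeof(ctypes.c_int) - 1) - 1

def check_dims(shape):
    # scipy.sparse index arrays are C ints, so any dimension larger
    # than INT_MAX would overflow when handed to the Cython/BLAS code.
    for dim in shape:
        if dim > INT_MAX:
            raise ValueError("dimension %d exceeds INT_MAX (%d)"
                             % (dim, INT_MAX))

def all_finite(values):
    # Explicit loop standing in for np.all(np.isfinite(values)):
    # it can bail out early and needs no temporary boolean array.
    for v in values:
        if not math.isfinite(v):
            return False
    return True
```

In the Cython code the same checks run over C arrays, with npy_isfinite from npymath standing in for math.isfinite.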

@larsmans (Author) commented Sep 7, 2013

Wow. The first version of this PR used Cython memoryviews, which added a whopping 14 kLOC of generated C to the repo. Switched back to bare pointers to keep the compile times within bounds.

@pprett (Member) commented Sep 8, 2013

@larsmans while you're in the process of looking at the dtypes - we should also keep #2393 in mind.

@GaelVaroquaux (Member):

> Wow. The first version of this PR used Cython memoryviews, which added a
> whopping 14 kLOC of generated C to the repo. Switched back to bare pointers to
> keep the compile times within bounds.

It would be interesting to hear the Cython team's point of view on that.
@sturlamolden @robertwb @markflorisson88 @dagss

@sturlamolden (Contributor):

I am not surprised. Typed memoryviews are bloatware generators if you look at lines of C code (but so is the rest of Cython as well). But keep in mind that the C compiler will remove most of this bloat from the binary. You can expect the execution speed to be at least 90% of hand-written C. One solution is to use Cython as a build-time dependency, so you don't need the generated C files in the repo.

@larsmans (Author) commented Sep 8, 2013

Cython as a build-time dependency has been discussed before, but it's not acceptable (we want to use the latest released version without forcing users to upgrade all the time).

Anyway, I just replaced the memoryview with a pointer and integer size. It was a trivial change.

@sturlamolden (Contributor):

I would suggest that you use PyArray_DATA instead of the deprecated NumPy syntax. Your code is bound to break before the release of Cython 1.0.

@sturlamolden (Contributor):

nogil declarations do nothing without nogil blocks. nogil declarations just say it is ok to use this function without owning the GIL. But it is a nogil block that actually releases and re-acquires the GIL.

Performance-wise, releasing and acquiring the GIL can be quite expensive. It is not just a simple critical section. Serial code will be faster if you don't release the GIL. But if you do release it, you don't lock up the interpreter and can use Python threads for concurrency.

@larsmans (Author) commented Sep 8, 2013

@sturlamolden: I am aware of that. I'm modifying the code piece by piece so we can later figure out a way to do multithreading with nogil blocks. Getting as many functions as possible to run without the gil also forces us to not call into Python code when it's not necessary (hence the NumPy calls that were replaced with CBLAS calls).

@larsmans (Author) commented Sep 8, 2013

@sturlamolden Which part of the syntax is going to break? The .data attribute syntax?

@sturlamolden (Contributor):

Anything of this sort will break:

cdef np.ndarray[double] foo

That is, this buffer syntax is deprecated in favor of typed memoryviews. Once np.ndarray becomes just an alias for PyArrayObject*, the .data attribute will break too, and indexing will return a new ndarray object rather than an element of the array. So get rid of it as soon as you can.

Note that a pointer obtained with PyArray_DATA, unlike the .data attribute, is impervious to changes in the NumPy C API that make array->data an error.

@ogrisel (Member) commented Sep 8, 2013

@larsmans Have you benchmarked the effect of calling BLAS directly vs. master? What is the speed difference?

This PR LGTM. We can always address the switch to typed memory views in another PR if we decide to do so.

@amueller (Member) commented Sep 8, 2013

On 09/08/2013 03:06 PM, Sturla Molden wrote:

> nogil declarations do nothing without nogil blocks. nogil declarations
> just say it is ok to use this function without owning the GIL. But it
> is a nogil block that actually releases and re-acquires the GIL.

I am pretty sure that is not true. I thought that, too, and had an argument with @temporaer about it once, but he proved me wrong. Look at the generated C code: adding the nogil declaration simplifies the C code. In particular, it removes the refnannies.

This allows the C compiler to do better optimization. @temporaer had an example where introducing nogil allowed the compiler to optimize a tail recursion away, leading to a huge speedup. While this will probably not happen in practice, it certainly doesn't make the code slower.

@larsmans (Author) commented Sep 8, 2013

@pprett I closed #2393 as unrelated, but there is now a ValueError if you throw in more than INT_MAX features or samples. There's no test for that, as a test would need to allocate a vector of several GB.

@pprett (Member) commented Sep 8, 2013

Thanks!

@larsmans (Author) commented Sep 8, 2013

@sturlamolden Thanks for the warning; we use the NumPy array syntax practically everywhere. Perhaps we should switch, or perhaps we should fork numpy.pxd and maintain it ourselves. I must confess I don't feel like doing the latter, but the memoryview switch is going to be heavy on our project.

(But why does Cython not warn when this syntax is used?)

@sturlamolden (Contributor):

@amueller The refnanny is used to test Cython itself. It consists of preprocessor macros that expand to nothing when Cython is used for anything other than running Cython's test suite. If these empty macros prevent optimizing a tail recursion, I suggest you try a different C compiler. Note that because the refnanny is macros, it is not sufficient to just look at the generated C; you must look at the compiled machine code. But if what you are saying is true, I would really like to see an example where this happens.

@sturlamolden (Contributor):

By the way, is the problem with the typed memoryviews actually the number of pyx files? I.e. that the utility code is repeated in multiple C files?

@larsmans (Author) commented Sep 8, 2013

The problem with typed memoryviews is simply that they're different from what we were doing. We'll have to change a lot of code and habits, being careful not to lose performance due to memoryview overhead.

@robertwb (Contributor):

I started a discussion about memoryview overhead at https://mail.python.org/pipermail/cython-devel/2013-September/003807.html

@larsmans (Author):

@ogrisel On covertype, this isn't any faster than the old version. Still, I'd like to merge it because it improves safety and (IMHO) cleans up the code.

@pprett (Member) commented Sep 11, 2013

+1 for safety (I can do a review in the next days if you need one more)


@larsmans (Author):

Yes, please!

@scoder commented Sep 12, 2013

"""
adding the nogil declaration simplifies the C code. In particular it removes the nannys. This allows the C compiler to do better optimization. @temporaer had an example where introducing nogil allowed the compiler to optimize a tail recursion away, leading to a huge speedup.
"""

I agree with Sturla that this sounds rather unrealistic. Maybe adding the "nogil" modifier wasn't the only change. A related change like preventing the return type from propagating exceptions would be a more likely cause.

@sturlamolden (Contributor):

Unless there is a bug in Cython that causes some of the code the "except" modifier should add to cdef functions to always be included unless we use "nogil". But if so, it's a bug, not a feature. I will investigate and report back.

But I don't believe it's the refnanny, if by any unlikelihood this actually happens.

@sturlamolden (Contributor):

Here is my testing code.

tailcallbench.pyx:

```cython
## Three recursive cdef functions for testing tail call optimisation

cdef double recursive_sum(int n, double *x):
    if n == 0:
        return 0
    else:
        return x[0] + recursive_sum(n-1, x+1)


cdef double recursive_sum_nogil(int n, double *x) nogil:
    if n == 0:
        return 0
    else:
        return x[0] + recursive_sum_nogil(n-1, x+1)


cdef double recursive_sum_except(int n, double *x) except -1:
    if n == 0:
        return 0
    else:
        return x[0] + recursive_sum_except(n-1, x+1)


## Manual transformation to an iterative loop

cdef double iterative_sum(int n, double *x):
    cdef double y = 0.0
    for i in range(n):
        y += x[i]
    return y


## Performance timers for Mac OS X

## On Windows, use QueryPerformanceCounter and QueryPerformanceFrequency
## instead of mach_absolute_time and mach_timebase_info.

cdef extern from "mach/mach.h":
    pass

cdef extern from "mach/mach_time.h":
    ctypedef unsigned long uint64_t
    uint64_t mach_absolute_time()

    ctypedef struct mach_timebase_info_data_t:
        uint64_t numer
        uint64_t denom

    void mach_timebase_info(mach_timebase_info_data_t*)

cdef double nanodiff(uint64_t t1, uint64_t t0):
    cdef mach_timebase_info_data_t base
    cdef uint64_t elapsed
    elapsed = t1 - t0
    mach_timebase_info(&base)
    return (<double> elapsed * <double> base.numer / <double> base.denom)


## Here is the actual benchmark we run from Python

import numpy as np

def timeit():

    cdef uint64_t t1, t0
    cdef int n = 100000
    cdef double[::1] buf = np.random.rand(n)
    cdef double s0, s1, s2, s3

    m = 1000
    timings = np.zeros(m)

    print "timing %d tail calls" % (n,)

    for i in range(m):
        t0 = mach_absolute_time()
        s0 = recursive_sum(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "plain cdef:        mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())

    for i in range(m):
        t0 = mach_absolute_time()
        s1 = recursive_sum_except(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "cdef with except:  mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())

    for i in range(m):
        t0 = mach_absolute_time()
        s2 = recursive_sum_nogil(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "cdef with nogil:   mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())

    print "\nbaseline:"

    for i in range(m):
        t0 = mach_absolute_time()
        s3 = iterative_sum(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "manual iteration:  mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())

    return s0, s1, s2, s3
```

setup.py:

```python
#!/usr/bin/env python

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

so = Extension("test", ["tailcallbench.pyx"], extra_compile_args=['-O3'])

setup(
    author="Sturla Molden",
    name="tailcallbench",
    version="2013-09-12",
    license='scipy license (http://scipy.org)',
    description="Cython tail call benchmark",
    url='',
    classifiers=[
        "Development Status :: 3 - alpha, research",
        "Intended Audience :: Scientific programmers",
        "License :: scipy",
        "Operating System :: mac"],
    packages=["tailcallbench"],
    cmdclass={'build_ext': build_ext},
    ext_modules=[so],
)
```

@sturlamolden (Contributor):

Intel i7 @ 2.4 GHz (MacBook Pro)
Cython 0.91.1
Python 2.7.3 64-bit (Enthought Canopy)

GCC version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)

```
$ export CC=gcc
$ python setup.py build_ext --inplace
[...]
$ python -c "import test; test.timeit()"
timing 100000 tail calls
plain cdef:        mean of 1000 = 1.903e+05 ns, std = 2.622e+04 ns
cdef with except:  mean of 1000 = 2.349e+05 ns, std = 1.894e+04 ns
cdef with nogil:   mean of 1000 = 1.903e+05 ns, std = 1.186e+04 ns

baseline:
manual iteration:  mean of 1000 = 2.717e+05 ns, std = 1.008e+04 ns
```

Intel icc version 13.0.2 (gcc version 4.2.1 compatibility)

```
$ export CC=icc
$ rm *.so
$ python setup.py build_ext --inplace
[...]
$ python -c "import test; test.timeit()"
timing 100000 tail calls
plain cdef:        mean of 1000 = 2.473e+05 ns, std = 2.977e+04 ns
cdef with except:  mean of 1000 = 2.931e+05 ns, std = 4.35e+04 ns
cdef with nogil:   mean of 1000 = 2.436e+05 ns, std = 1.731e+04 ns

baseline:
manual iteration:  mean of 1000 = 2.576e+04 ns, std = 1947 ns
```

In conclusion, omitting nogil does not incur any extra overhead: the plain cdef and nogil versions time identically. With GCC the recursive versions even beat the iterative loop, indicating that the tail calls were eliminated.

If the cdef function is declared to propagate Python exceptions, there is a small extra overhead (roughly 20% in these runs), which might be due to a missed tail-call optimisation.
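The transformation benchmarked above — a tail-recursive sum rewritten as a loop — can be mirrored in plain Python (illustrative names, not part of the benchmark itself):

```python
def recursive_sum(xs):
    # Tail-recursive sum, analogous to recursive_sum() in tailcallbench.pyx.
    # CPython never eliminates tail calls, so deep inputs hit the recursion
    # limit; an optimizing C compiler may turn the Cython version into a loop.
    if not xs:
        return 0.0
    return xs[0] + recursive_sum(xs[1:])

def iterative_sum(xs):
    # The hand-applied tail-call elimination: same result, O(1) stack.
    total = 0.0
    for v in xs:
        total += v
    return total
```

This is exactly the rewrite a compiler performs when it optimizes the tail call away, which is why the recursive and iterative GCC timings come out so close.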

@ogrisel (Member) commented Sep 13, 2013

Thanks for checking @sturlamolden!

@sturlamolden (Contributor):

By the way, notice how much faster code (10x speedup) Intel C generates for the iterative loop. This is probably due to autovectorization, which demonstrates the importance of testing with multiple compilers. Otherwise we could be fooled into believing that tailcall optimization is always omitted.

And of course it also shows why it's a good idea to use Intel compilers for scientific programming.

@larsmans (Author):

Yes, but then our main aim is to get scikit-learn to work well with the default compilers that come with OSs. If users install scikit-learn and it's slow, we can't tell them to get an Intel license and recompile the whole thing.

@sturlamolden (Contributor):

It does work well with Apple's GCC/LLVM compiler. The tail calls execute faster than an equivalent iterative loop. The tail calls are even faster with except than with pure iteration.

@larsmans (Author):

Sure. But anyway, we were digressing: there are no tail calls in our SGD code.

Any reviews?

@sturlamolden (Contributor):

The main thing I would change in your code now is to replace array.data with np.PyArray_DATA(array) or &array[0]. Preferably the former.
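From the Python side, the counterpart of PyArray_DATA is the array's ctypes interface, which yields the raw buffer address through supported API rather than the deprecated Cython-level .data attribute. A minimal sketch (assuming NumPy is installed):

```python
import ctypes
import numpy as np

a = np.arange(4, dtype=np.double)

# data_as() returns a typed pointer to the array's buffer, the
# Python-level analogue of PyArray_DATA(a) in the C API.
ptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
first = ptr[0]  # reads a[0] through the raw pointer
```

In Cython itself, np.PyArray_DATA(a) or &a[0] gives the same pointer directly, as suggested above.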

@sturlamolden (Contributor):

I don't really understand the LOC argument against typed memoryviews. If you cimport numpy you replace the typed memoryview overhead with NumPy headers, which is an even bigger compile-time dependency.

BTW: If you use typed memoryviews you don't need to cimport numpy.

@temporaer (Contributor):

I understand that the discussion about the recursion part is closed now, but I looked it up and would like to clarify what @amueller said I had observed. According to my notes, there was no significant effect of gil/nogil in regular cdef functions. There were problems when I used default arguments in cdef functions, which sometimes prevented optimization (they do not translate directly to C/C++ default arguments). The take-home message for me was: do not use default args in cdef functions, to be on the safe side.

@amueller (Member):

Sorry for creating work for @sturlamolden and thanks a lot for double checking my claims.

@larsmans (Author):

@sturlamolden I still need the NumPy headers for the NumPy core math library (the isfinite function in particular). But I'll switch to the new syntax.

@sturlamolden (Contributor):

That's ok @amueller. If there were such a bug in Cython we would need to get rid of it. Thanks for the clarification, @temporaer, that makes more sense.

@larsmans You only need numpy/npy_math.h for the core math library, not all of the NumPy headers.

http://docs.scipy.org/doc/numpy/reference/c-api.coremath.html

@ogrisel (Member) commented Sep 20, 2013

Shall we merge this? I think this is already a net improvement to the current state. @pprett @mblondel any feedback?

@larsmans could you add an entry in the developer docs to summarize when to use which dtype in Cython code (both for floating-point feature values and for array indices)?

I think the switch to the typed memory view API can be discussed in a distinct PR.

Review comment on the diff (Member), where dtype=np.int32 was changed to dtype=np.intc in `mode='c'] feature_indices = np.arange(0, self.n_features, dtype=np.intc)`:

np.intc is the type used by scipy.sparse? Is it just a typedef for int? Please add a comment so that my future self knows what's going on here.

@larsmans (Author) replied:

Yes, intc is by definition the C type int.
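That equivalence can be checked directly on any platform; a minimal sketch (assuming NumPy is installed):

```python
import ctypes
import numpy as np

# np.intc is NumPy's alias for the platform C `int`, which is the dtype
# scipy.sparse uses for its index arrays; the itemsizes must agree.
itemsize = np.dtype(np.intc).itemsize
assert itemsize == ctypes.sizeof(ctypes.c_int)
```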

@pprett (Member) commented Sep 21, 2013

Ran my RCV1 benchmark [1] - everything fine - thanks @larsmans for cleaning up and taking care of the dtype mess.

+1 for merge

[1] https://gist.github.com/pprett/4150519

@pprett (Member) commented Sep 21, 2013

Regarding the typed memoryview issue: as far as I understand, using the data array directly is deprecated and we should move to memoryviews instead - is this correct? This has implications throughout our code base; we should find a consensus on how to deal with it in the project as a whole!

@larsmans (Author):

If I understand @sturlamolden correctly, we can still get raw pointers to NumPy arrays' contents, it's just the np.ndarray syntax that's disappearing. (What a shame!)

@scoder commented Sep 21, 2013

> it's just the np.ndarray syntax that's disappearing.

I don't think it's decided what will be done with it in Cython (and if, when). But I think it's generally considered legacy syntax.

@dagss commented Sep 21, 2013

There's loads and loads of Cython code out there using the np.ndarray syntax, and Cython has historically been very careful about not breaking backwards compatibility without very good reasons, so its disappearing is not something I would worry about. My advice would be not to spend time switching unless there are real benefits to switching now.

This topic is a bit muddy now. It used to be clear-cut: the memoryview syntax allowed lots of extra optimizations, and @markflorisson88 had a branch that did highly efficient vectorized arithmetic inlined in Cython (more efficient than Intel Fortran in a couple of synthetic cases). However, as no one had the bandwidth to integrate and maintain that work, it was decided not to merge it.

So in a way, it used to be the case that the memoryview syntax was the way of the future, but since the work on "numerical Cython" has been somewhat abandoned, the case is less clear cut now. The memoryviews have some performance benefits in some situations, but np.ndarray can make for more convenient programming (if you want to do arithmetic with NumPy).

@larsmans (Author):

I'm merging this as-is for now. The scikit-learn devs will have to discuss at some point whether we want to switch to memoryviews, maybe after experimenting with them in new Cython extensions. (I saw the vectorized arithmetic PR, and it's a shame that didn't get merged yet. It's a killer feature!)

@pprett I haven't put in the comment; instead of doing it inline in this file, I promise I'll write the "which type goes where in Cython" manual that @ogrisel asked for and put it in the dev docs.

larsmans added a commit that referenced this pull request Sep 21, 2013
@larsmans merged commit aa2d045 into scikit-learn:master Sep 21, 2013
@sturlamolden (Contributor):

Thanks Dag (@dagss). Does this mean we should avoid typed memory views for numerical work? I.e. just use Cython as a wrapping tool, and write plain C (with NumPy's C API) or Fortran? Or something totally different like JIT'ing Python with Numba?

I am getting confused...

But vectorized memoryview expressions probably never were a good idea, given that the competition is Fortran 90 compilers and Intel's cilkplus.

@dagss commented Sep 24, 2013

@sturlamolden I don't have much advice for you. There's no "Cython strategy" for numerics any longer after Mark and I pretty much stopped working on it. I hope Numba will become what we need down the road, but it's probably not there yet. Of course, the rise of Numba led to a loss of interest, but we're in a transition period where people's codebases are written in Cython while only Numba gets further development in this area, and we don't know yet to what degree Numba will succeed... so it is very difficult to give advice. (Perhaps even calling out to Julia is the way of the future?)

I use C or Fortran and use Cython to wrap, myself.

Anyone interested in further discussing this should start a thread on the Cython-dev list I guess.

@scoder commented Sep 27, 2013

Agreed with Dag's last sentence. Could we please stop this FUDdy discussion in this ticket? If you want to discuss the current status and/or future developments of Cython, please move to one of the Cython-specific lists, so that the right people can give proper comments. Cython-dev sounds ok for me here, although it might be generally interesting for users as well.

@larsmans larsmans deleted the sgd-improvements branch October 2, 2015 14:21