
Conversation

@larsmans (Member) commented Sep 7, 2013

Here are some safe bits from #2419, which can be pulled without waiting for my threading experiments.

  • nogil declarations where possible (but no with nogil blocks, since the performance benefit of those is not clear)
  • use BLAS where appropriate
  • replaced DOUBLE by double, and INTEGER/int32 by int/np.intc, since those are the types used by scipy.sparse
  • explicit checks that dataset dims don't exceed INT_MAX
  • replaced an np.all by an explicit loop (using the npymath library for C99 isfinite)

I also cleaned up the code a little bit (e.g. min and max no longer need to be defined for current Cython).
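Two of the bullets above (the INT_MAX dimension check and the explicit isfinite loop) can be sketched in plain Python; the function names here are illustrative, not the actual scikit-learn code:

```python
import ctypes
import math

# Largest value a C int can hold on this platform (what np.intc wraps).
INT_MAX = 2 ** (8 * ctypes.sizeof(ctypes.c_int) - 1) - 1

def check_dims(shape):
    # scipy.sparse index arrays are C ints, so any dimension larger
    # than INT_MAX would overflow when handed to the Cython/BLAS code.
    for dim in shape:
        if dim > INT_MAX:
            raise ValueError("dimension %d exceeds INT_MAX (%d)"
                             % (dim, INT_MAX))

def all_finite(values):
    # Explicit loop standing in for np.all(np.isfinite(values)):
    # it can bail out early and needs no temporary boolean array.
    for v in values:
        if not math.isfinite(v):
            return False
    return True
```

In the Cython code the same checks run over C arrays, with npy_isfinite from npymath standing in for math.isfinite.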

@larsmans (Author) commented Sep 7, 2013

Wow. The first version of this PR used Cython memoryviews, which added a whopping 14 kLOC of generated C to the repo. Switched back to bare pointers to keep the compile times within bounds.

@pprett (Member) commented Sep 8, 2013

@larsmans while you're in the process of looking at the dtypes - we should also keep #2393 in mind.

@GaelVaroquaux (Member):

> Wow. The first version of this PR used Cython memoryviews, which added a
> whopping 14 kLOC of generated C to the repo. Switched back to bare pointers to
> keep the compile times within bounds.

It would be interesting to hear the Cython team's point of view on that.
@sturlamolden @robertwb @markflorisson88 @dagss

@sturlamolden (Contributor):

I am not surprised. Typed memoryviews are bloatware generators if you look at lines of C code (but so is the rest of Cython as well). But keep in mind that the C compiler will remove most of this bloat from the binary. You can expect the execution speed to be at least 90% of hand-written C. One solution is to use Cython as a build-time dependency, so you don't need the generated C files in the repo.

@larsmans (Author) commented Sep 8, 2013

Cython as a build-time dependency has been discussed before, but it's not acceptable (we want to use the latest released version without forcing users to upgrade all the time).

Anyway, I just replaced the memoryview with a pointer and integer size. It was a trivial change.

@sturlamolden (Contributor):

I would suggest that you use PyArray_DATA instead of the deprecated NumPy syntax. Your code is bound to break before the release of Cython 1.0.

@sturlamolden (Contributor):

nogil declarations do nothing without nogil blocks. nogil declarations just say it is ok to use this function without owning the GIL. But it is a nogil block that actually releases and re-acquires the GIL.

Performance-wise, releasing and acquiring the GIL can be quite expensive. It is not just a simple critical section. Serial code will be faster if you don't release the GIL. But if you do release it, you don't lock up the interpreter and can use Python threads for concurrency.

@larsmans (Author) commented Sep 8, 2013

@sturlamolden: I am aware of that. I'm modifying the code piece by piece so we can later figure out a way to do multithreading with nogil blocks. Getting as many functions as possible to run without the gil also forces us to not call into Python code when it's not necessary (hence the NumPy calls that were replaced with CBLAS calls).

@larsmans (Author) commented Sep 8, 2013

@sturlamolden Which part of the syntax is going to break? The .data attribute syntax?

@sturlamolden (Contributor):

Anything of this sort will break:

cdef np.ndarray[double] foo

That is, this buffer syntax is deprecated in favor of typed memoryviews. Once np.ndarray becomes just an alias for PyArrayObject*, the .data attribute will break too, and indexing will return a new ndarray object rather than an element of the array. So get rid of it as soon as you can.

Note that a pointer obtained with PyArray_DATA, unlike the .data attribute, is impervious to changes in the NumPy C API that make array->data an error.

@ogrisel (Member) commented Sep 8, 2013

@larsmans Have you benchmarked the effect of calling BLAS directly vs. master? What is the speed difference?

This PR LGTM. We can always address the switch to typed memory views in another PR if we decide to do so.

@amueller (Member) commented Sep 8, 2013

On 09/08/2013 03:06 PM, Sturla Molden wrote:

> nogil declarations do nothing without nogil blocks. nogil declarations
> just say it is ok to use this function without owning the GIL. But it
> is a nogil block that actually releases and re-acquires the GIL.

I am pretty sure that is not true. I thought that, too, and had an argument with @temporaer about it once, but he proved me wrong. Look at the generated C code: adding the nogil declaration simplifies the C code. In particular, it removes the refnannies.

This allows the C compiler to do better optimization. @temporaer had an example where introducing nogil allowed the compiler to optimize a tail recursion away, leading to a huge speedup. While this will probably not happen in practice, it certainly doesn't make the code slower.

@larsmans (Author) commented Sep 8, 2013

@pprett I closed #2393 as unrelated, but there is now a ValueError if you throw in more than INT_MAX features or samples. There's no test for that, as a test would need to allocate a vector of several GB.

@pprett (Member) commented Sep 8, 2013

Thanks!

@larsmans (Author) commented Sep 8, 2013

@sturlamolden Thanks for the warning; we use the NumPy array syntax practically everywhere. Perhaps we should switch, or perhaps we should fork numpy.pxd and maintain it ourselves. I must confess I don't feel like doing the latter, but the memoryview switch is going to be heavy on our project.

(But why does Cython not warn when this syntax is used?)

@sturlamolden (Contributor):

@amueller The refnanny is used to test Cython itself. It consists of preprocessor macros that expand to nothing when Cython is used for anything other than running Cython's test suite. If these empty macros prevent optimizing a tail recursion, I suggest you try a different C compiler. Note that because the refnanny is macros, it is not sufficient to just look at the generated C; you must look at the compiled machine code. But if what you are saying is true, I would really like to see an example where this happens.

@sturlamolden (Contributor):

By the way, is the problem with the typed memoryviews actually the number of pyx files? I.e. that the utility code is repeated in multiple C files?

@larsmans (Author) commented Sep 8, 2013

The problem with typed memoryviews is simply that they're different from what we were doing. We'll have to change a lot of code and habits, being careful not to lose performance due to memoryview overhead.

@robertwb (Contributor):

I started a discussion about memoryview overhead at https://mail.python.org/pipermail/cython-devel/2013-September/003807.html

@larsmans (Author):

@ogrisel On covertype, this isn't any faster than the old version. Still, I'd like to merge it because it improves safety and (IMHO) cleans up the code.

@pprett (Member) commented Sep 11, 2013

+1 for safety (I can do a review in the next days if you need one more)


@larsmans (Author):

Yes, please!

@scoder commented Sep 12, 2013

"""
adding the nogil declaration simplifies the C code. In particular it removes the nannys. This allows the C compiler to do better optimization. @temporaer had an example where introducing nogil allowed the compiler to optimize a tail recursion away, leading to a huge speedup.
"""

I agree with Sturla that this sounds rather unrealistic. Maybe adding the "nogil" modifier wasn't the only change. A related change like preventing the return type from propagating exceptions would be a more likely cause.

@sturlamolden (Contributor):

Unless there is a bug in Cython that causes some of the code the "except" modifier should add to cdef functions to always be included unless we use "nogil". But if so, it's a bug, not a feature. I will investigate and report back.

But I don't believe it's the refnanny, if by any unlikelihood this actually happens.

@sturlamolden (Contributor):

Here is my testing code.

tailcallbench.pyx:

```cython
## Three recursive cdef functions for testing tail call optimisation

cdef double recursive_sum(int n, double *x):
    if n == 0:
        return 0
    else:
        return x[0] + recursive_sum(n-1, x+1)


cdef double recursive_sum_nogil(int n, double *x) nogil:
    if n == 0:
        return 0
    else:
        return x[0] + recursive_sum_nogil(n-1, x+1)


cdef double recursive_sum_except(int n, double *x) except -1:
    if n == 0:
        return 0
    else:
        return x[0] + recursive_sum_except(n-1, x+1)


## Manual transformation to an iterative loop

cdef double iterative_sum(int n, double *x):
    cdef double y = 0.0
    for i in range(n):
        y += x[i]
    return y


## Performance timers for Mac OS X

## On Windows, use QueryPerformanceCounter and QueryPerformanceFrequency
## instead of mach_absolute_time and mach_timebase_info.

cdef extern from "mach/mach.h":
    pass

cdef extern from "mach/mach_time.h":
    ctypedef unsigned long uint64_t
    uint64_t mach_absolute_time()

    ctypedef struct mach_timebase_info_data_t:
        uint64_t numer
        uint64_t denom

    void mach_timebase_info(mach_timebase_info_data_t*)

cdef double nanodiff(uint64_t t1, uint64_t t0):
    cdef mach_timebase_info_data_t base
    cdef uint64_t elapsed
    elapsed = t1 - t0
    mach_timebase_info(&base)
    return (<double> elapsed * <double> base.numer / <double> base.denom)


## Here is the actual benchmark we run from Python

import numpy as np

def timeit():

    cdef uint64_t t1, t0
    cdef int n = 100000
    cdef double[::1] buf = np.random.rand(n)
    cdef double s0, s1, s2, s3

    m = 1000
    timings = np.zeros(m)

    print "timing %d tail calls" % (n,)

    for i in range(m):
        t0 = mach_absolute_time()
        s0 = recursive_sum(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "plain cdef:        mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())

    for i in range(m):
        t0 = mach_absolute_time()
        s1 = recursive_sum_except(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "cdef with except:  mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())

    for i in range(m):
        t0 = mach_absolute_time()
        s2 = recursive_sum_nogil(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "cdef with nogil:   mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())

    print "\nbaseline:"

    for i in range(m):
        t0 = mach_absolute_time()
        s3 = iterative_sum(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "manual iteration:  mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())

    return s0, s1, s2, s3
```

setup.py:

```python
#!/usr/bin/env python

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

so = Extension("test", ["tailcallbench.pyx"], extra_compile_args=['-O3'])

setup(
    author="Sturla Molden",
    name="tailcallbench",
    version="2013-09-12",
    license='scipy license (http://scipy.org)',
    description="Cython tail call benchmark",
    url='',
    classifiers=[
        "Development Status :: 3 - alpha, research",
        "Intended Audience :: Scientific programmers",
        "License :: scipy",
        "Operating System :: mac"],
    packages=["tailcallbench"],
    cmdclass={'build_ext': build_ext},
    ext_modules=[so],
)
```

@sturlamolden (Contributor):

Intel i7 @ 2.4 GHz (MacBook Pro)
Cython 0.91.1
Python 2.7.3 64-bit (Enthought Canopy)

GCC version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)

```
$ export CC=gcc
$ python setup.py build_ext --inplace
[...]
$ python -c "import test; test.timeit()"
timing 100000 tail calls
plain cdef:        mean of 1000 = 1.903e+05 ns, std = 2.622e+04 ns
cdef with except:  mean of 1000 = 2.349e+05 ns, std = 1.894e+04 ns
cdef with nogil:   mean of 1000 = 1.903e+05 ns, std = 1.186e+04 ns

baseline:
manual iteration:  mean of 1000 = 2.717e+05 ns, std = 1.008e+04 ns
```

Intel icc version 13.0.2 (gcc version 4.2.1 compatibility)

```
$ export CC=icc
$ rm *.so
$ python setup.py build_ext --inplace
[...]
$ python -c "import test; test.timeit()"
timing 100000 tail calls
plain cdef:        mean of 1000 = 2.473e+05 ns, std = 2.977e+04 ns
cdef with except:  mean of 1000 = 2.931e+05 ns, std = 4.35e+04 ns
cdef with nogil:   mean of 1000 = 2.436e+05 ns, std = 1.731e+04 ns

baseline:
manual iteration:  mean of 1000 = 2.576e+04 ns, std = 1947 ns
```

In conclusion, omitting nogil does not incur any extra overhead: the plain cdef and nogil versions time identically. With GCC the recursive versions even beat the iterative loop, indicating that the tail calls were eliminated.

If the cdef function is declared to propagate Python exceptions, there is a small extra overhead (roughly 20% in these runs), which might be due to a missed tail-call optimisation.
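The transformation benchmarked above — a tail-recursive sum rewritten as a loop — can be mirrored in plain Python (illustrative names, not part of the benchmark itself):

```python
def recursive_sum(xs):
    # Tail-recursive sum, analogous to recursive_sum() in tailcallbench.pyx.
    # CPython never eliminates tail calls, so deep inputs hit the recursion
    # limit; an optimizing C compiler may turn the Cython version into a loop.
    if not xs:
        return 0.0
    return xs[0] + recursive_sum(xs[1:])

def iterative_sum(xs):
    # The hand-applied tail-call elimination: same result, O(1) stack.
    total = 0.0
    for v in xs:
        total += v
    return total
```

This is exactly the rewrite a compiler performs when it optimizes the tail call away, which is why the recursive and iterative GCC timings come out so close.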

@ogrisel (Member) commented Sep 13, 2013

Thanks for checking @sturlamolden!

@sturlamolden (Contributor):

By the way, notice how much faster code (10x speedup) Intel C generates for the iterative loop. This is probably due to autovectorization, which demonstrates the importance of testing with multiple compilers. Otherwise we could be fooled into believing that tailcall optimization is always omitted.

And of course it also shows why it's a good idea to use Intel compilers for scientific programming.

@larsmans (Author):

Yes, but then our main aim is to get scikit-learn to work well with the default compilers that come with OSs. If users install scikit-learn and it's slow, we can't tell them to get an Intel license and recompile the whole thing.

@sturlamolden (Contributor):

It does work well with Apple's GCC/LLVM compiler. The tail calls execute faster than an equivalent iterative loop. The tail calls are even faster with except than with pure iteration.

@larsmans (Author):

Sure. But anyway, we were digressing: there are no tail calls in our SGD code.

Any reviews?

@sturlamolden (Contributor):

The main thing I would change in your code now is to replace array.data with np.PyArray_DATA(array) or &array[0]. Preferably the former.
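From the Python side, the counterpart of PyArray_DATA is the array's ctypes interface, which yields the raw buffer address through supported API rather than the deprecated Cython-level .data attribute. A minimal sketch (assuming NumPy is installed):

```python
import ctypes
import numpy as np

a = np.arange(4, dtype=np.double)

# data_as() returns a typed pointer to the array's buffer, the
# Python-level analogue of PyArray_DATA(a) in the C API.
ptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
first = ptr[0]  # reads a[0] through the raw pointer
```

In Cython itself, np.PyArray_DATA(a) or &a[0] gives the same pointer directly, as suggested above.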

@sturlamolden (Contributor):

I don't really understand the LOC argument against typed memoryviews. If you cimport numpy you replace the typed memoryview overhead with NumPy headers, which is an even bigger compile-time dependency.

BTW: If you use typed memoryviews you don't need to cimport numpy.

@temporaer (Contributor):

I understand that the discussion about the recursion part is closed now, but I looked it up and would like to clarify what @amueller said I had observed. According to my notes, there was no significant effect of gil/nogil in regular cdef functions. There were problems when I used default arguments in cdef functions, which sometimes prevented optimization (they do not translate directly to C/C++ default arguments). The take-home message for me was: do not use default args in cdef functions, to be on the safe side.

@amueller (Member):

Sorry for creating work for @sturlamolden and thanks a lot for double checking my claims.

@larsmans (Author):

@sturlamolden I still need the NumPy headers for the NumPy core math library (the isfinite function in particular). But I'll switch to the new syntax.

@sturlamolden (Contributor):

That's ok @amueller. If there were such a bug in Cython we would need to get rid of it. Thanks for the clarification, @temporaer, that makes more sense.

@larsmans You only need numpy/npy_math.h for the core math library, not all of the NumPy headers.

http://docs.scipy.org/doc/numpy/reference/c-api.coremath.html

@ogrisel (Member) commented Sep 20, 2013

Shall we merge this? I think this is already a net improvement to the current state. @pprett @mblondel any feedback?

@larsmans could you add an entry in the developer docs to summarize when to use which dtype in Cython code (both for floating-point feature values and for array indices)?

I think the switch to the typed memory view API can be discussed in a distinct PR.

Review comment on the diff (Member), where dtype=np.int32 was changed to dtype=np.intc in `mode='c'] feature_indices = np.arange(0, self.n_features, dtype=np.intc)`:

np.intc is the type used by scipy.sparse? Is it just a typedef for int? Please add a comment so that my future self knows what's going on here.

@larsmans (Author) replied:

Yes, intc is by definition the C type int.
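That equivalence can be checked directly on any platform; a minimal sketch (assuming NumPy is installed):

```python
import ctypes
import numpy as np

# np.intc is NumPy's alias for the platform C `int`, which is the dtype
# scipy.sparse uses for its index arrays; the itemsizes must agree.
itemsize = np.dtype(np.intc).itemsize
assert itemsize == ctypes.sizeof(ctypes.c_int)
```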

@pprett (Member) commented Sep 21, 2013

Ran my RCV1 benchmark [1] - everything fine - thanks @larsmans for cleaning up and taking care of the dtype mess.

+1 for merge

[1] https://gist.github.com/pprett/4150519

@pprett (Member) commented Sep 21, 2013

Regarding the typed memoryview issue: as far as I understand, using the data array directly is deprecated and we should move to memoryviews instead - is this correct? This has implications throughout our code base; we should find a consensus on how to deal with it in the project as a whole!

@larsmans (Author):

If I understand @sturlamolden correctly, we can still get raw pointers to NumPy arrays' contents, it's just the np.ndarray syntax that's disappearing. (What a shame!)

@scoder commented Sep 21, 2013

> it's just the np.ndarray syntax that's disappearing.

I don't think it's decided what will be done with it in Cython (and if, when). But I think it's generally considered legacy syntax.

@dagss commented Sep 21, 2013

There's loads and loads of Cython code out there using the np.ndarray syntax, and Cython has historically been very careful about not breaking backwards compatibility without very good reasons, so its disappearing is not something I would worry about. My advice would be not to spend time switching unless there are real benefits to switching now.

This topic is a bit muddy now. It used to be clear-cut: the memoryview syntax allowed lots of extra optimizations, and @markflorisson88 had a branch that did highly efficient vectorized arithmetic inlined in Cython (more efficient than Intel Fortran in a couple of synthetic cases). However, as no one had the bandwidth to integrate and maintain that work, it was decided not to merge it.

So in a way, it used to be the case that the memoryview syntax was the way of the future, but since the work on "numerical Cython" has been somewhat abandoned, the case is less clear cut now. The memoryviews have some performance benefits in some situations, but np.ndarray can make for more convenient programming (if you want to do arithmetic with NumPy).

@larsmans (Author):

I'm merging this as-is for now. The scikit-learn devs will have to discuss at some point whether we want to switch to memoryviews, maybe after experimenting with them in new Cython extensions. (I saw the vectorized arithmetic PR, and it's a shame that didn't get merged yet. It's a killer feature!)

@pprett I haven't put in the comment; instead of doing it inline in this file, I promise I'll write the "which type goes where in Cython" manual that @ogrisel asked for and put it in the dev docs.

larsmans added a commit that referenced this pull request Sep 21, 2013
@larsmans merged commit aa2d045 into scikit-learn:master Sep 21, 2013
@sturlamolden (Contributor):

Thanks Dag (@dagss). Does this mean we should avoid typed memory views for numerical work? I.e. just use Cython as a wrapping tool, and write plain C (with NumPy's C API) or Fortran? Or something totally different like JIT'ing Python with Numba?

I am getting confused...

But vectorized memoryview expressions probably never were a good idea, given that the competition is Fortran 90 compilers and Intel's cilkplus.

@dagss commented Sep 24, 2013

@sturlamolden I don't have much advice for you. There's no "Cython strategy" for numerics any longer after Mark and I pretty much stopped working on it. I hope Numba will become what we need down the road, but it's probably not there yet. Of course, the rise of Numba led to a loss of interest, but we're in a transition period where people's codebases are written in Cython while only Numba gets further development in this area, and we don't know yet to what degree Numba will succeed... so it is very difficult to give advice. (Perhaps even calling out to Julia is the way of the future?)

I use C or Fortran and use Cython to wrap, myself.

Anyone interested in further discussing this should start a thread on the Cython-dev list I guess.

@scoder commented Sep 27, 2013

Agreed with Dag's last sentence. Could we please stop this FUDdy discussion in this ticket? If you want to discuss the current status and/or future developments of Cython, please move to one of the Cython-specific lists, so that the right people can give proper comments. Cython-dev sounds ok for me here, although it might be generally interesting for users as well.

@larsmans larsmans deleted the sgd-improvements branch October 2, 2015 14:21