[MRG] SGD Cython improvements #2426
Conversation
* `nogil` declarations where possible (but no `with nogil` blocks, since the performance benefit of those is not clear)
* use BLAS where appropriate
* replaced `DOUBLE` by `double`, and `INTEGER`/`int32` by `int`/`np.intc`, since that's the type used by scipy.sparse
* explicit checks that dataset dims don't exceed `INT_MAX` (sketched below)
* replaced an `np.all` by an explicit loop (npymath library for C99 `isfinite`; sketched below)
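A rough sketch of what the last two points look like (illustrative only; `all_finite` and `check_dims` are made-up names, not the actual functions in the diff):

```cython
from libc.limits cimport INT_MAX

# C99 isfinite via NumPy's core math library.
cdef extern from "numpy/npy_math.h":
    bint npy_isfinite(double x) nogil

# Explicit loop replacing np.all(np.isfinite(...)).
cdef bint all_finite(double *data, Py_ssize_t n) nogil:
    cdef Py_ssize_t i
    for i in range(n):
        if not npy_isfinite(data[i]):
            return False
    return True

# Refuse datasets whose dimensions don't fit in a C int.
def check_dims(X):
    if max(X.shape) > INT_MAX:
        raise ValueError("dataset dimensions exceed INT_MAX")
```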
Wow. The first version of this PR used Cython memoryviews, which added a whopping 14 kLOC of generated C to the repo. Switched back to bare pointers to keep the compile times within bounds.

It would be interesting to hear the Cython team's point of view on that.

I am not surprised. Typed memoryviews are bloatware generators if you look at lines of C code (but so is the rest of Cython as well). But keep in mind that the C compiler will remove most of this bloat from the binary. You can expect the execution speed to be at least 90% of hand-written C. One solution is to use Cython as a build-time dependency, so you don't need the generated C files in the repo.

Cython as a build-time dependency has been discussed before, but it's not acceptable (we want to use the latest released version without forcing users to upgrade all the time). Anyway, I just replaced the memoryview with a pointer and integer size. It was a trivial change.

I would suggest that you use PyArray_DATA instead of the deprecated NumPy syntax. Your code is bound to break before the release of Cython 1.0.
nogil declarations do nothing without nogil blocks. nogil declarations just say it is ok to use this function without owning the GIL. But it is a nogil block that actually releases and re-acquires the GIL. Performance wise, releasing and acquiring the GIL can be quite expensive. It is not just a simple critical section. Serial code will be faster if you don't release the GIL. But if you do release it, you don't lock up the interpreter and can use Python threads for concurrency.
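To illustrate the distinction (a minimal sketch, not code from this PR):

```cython
# A nogil *declaration*: the function may be called without the GIL,
# but calling it does not release anything by itself.
cdef double dot(double *x, double *y, int n) nogil:
    cdef int i
    cdef double s = 0.0
    for i in range(n):
        s += x[i] * y[i]
    return s

def py_dot(double[::1] x, double[::1] y):
    cdef double s
    with nogil:  # only this block actually releases (and re-acquires) the GIL
        s = dot(&x[0], &y[0], <int> x.shape[0])
    return s
```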
@sturlamolden: I am aware of that. I'm modifying the code piece by piece so we can later figure out a way to do multithreading with `with nogil` blocks.

@sturlamolden Which part of the syntax is going to break? The `np.ndarray` buffer syntax?
Anything of this sort will break:
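Something along these lines (an illustrative reconstruction, not the exact snippet):

```cython
cimport numpy as np
np.import_array()

# Deprecated: the np.ndarray buffer syntax and the .data attribute.
def scale(np.ndarray[np.float64_t, ndim=1] x, double a):
    cdef double *p = <double *> x.data   # .data is bound to break
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        p[i] *= a
```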
That is, this buffer syntax is deprecated in favor of typed memoryviews. When np.ndarray is just an alias for PyArrayObject*, the .data attribute will break too, and indexing will return a different ndarray as opposed to an element in the array. So get rid of it as soon as you can. Note that a data pointer acquired with PyArray_DATA is impervious to changes in the NumPy C API that make array->data an error.
@larsmans have you benchmarked the effect of calling BLAS directly vs master? What is the speed difference? This PR LGTM. We can always address the switch to typed memory views in another PR if we decide to do so.
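For reference, by "calling BLAS directly" I mean something like this (a sketch assuming a CBLAS header is available at build time, not the PR's actual code):

```cython
cdef extern from "cblas.h":
    double cblas_ddot(int N, double *X, int incX,
                      double *Y, int incY) nogil

# Dense dot product routed through BLAS instead of a hand-written loop.
cdef double dense_dot(double *x, double *y, int n) nogil:
    return cblas_ddot(n, x, 1, y, 1)
```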
On 09/08/2013 03:06 PM, Sturla Molden wrote:

This allows the C compiler to do better optimization. @temporaer had an example where declaring a function `nogil` seemed to enable tail call optimization.

Thanks!
@sturlamolden Thanks for the warning, we use the NumPy array syntax practically everywhere. Perhaps we should switch, perhaps we should fork. (But why does Cython not warn when this syntax is used?)
@amueller The refnanny is used to test Cython. It is preprocessor macros that expand to nothing when Cython is used for purposes other than running its own test suite. If these empty macros prevent optimizing tail recursion, I suggest you try a different C compiler. Note that since the refnanny consists of macros, it is not sufficient to just look at the C: you must look at the compiled object code. But if what you are saying is true, I would really like to see an example where this happens.
By the way, is the problem with the typed memoryviews actually the number of pyx files? I.e. that the utility code is repeated in multiple C files?

The problem with typed memoryviews is simply that they're different from what we were doing. We'll have to change a lot of code and habits, being careful not to lose performance due to memoryview overhead.

I started a discussion about memoryview overhead at https://mail.python.org/pipermail/cython-devel/2013-September/003807.html

@ogrisel On covertype, this isn't any faster than the old version. Still, I'd like to merge it because it improves safety and (IMHO) cleans up the code.
+1 for safety (I can do a review in the next days if you need one more)

Yes, please!
""" I agree with Sturla that this sounds rather unrealistic. Maybe adding the "nogil" modifier wasn't the only change. A related change like preventing the return type from propagating exceptions would be a more likely cause. |
Unless there is a bug in Cython that causes some of the stuff that the "except" modifer should add to cdef functions to always be included unless we use "nogil". But if it does it's a bug, not a feature. I will investigate and report back. But I don't believe it's the refnanny, if by all unlikelyhood this actually happens. |
Here is my testing code.

tailcallbench.pyx:

```cython
## Three recursive cdef functions for testing tail call optimisation

cdef double recursive_sum(int n, double *x):
    if n == 0:
        return 0
    else:
        return x[0] + recursive_sum(n-1, x+1)

cdef double recursive_sum_nogil(int n, double *x) nogil:
    if n == 0:
        return 0
    else:
        return x[0] + recursive_sum_nogil(n-1, x+1)

cdef double recursive_sum_except(int n, double *x) except -1:
    if n == 0:
        return 0
    else:
        return x[0] + recursive_sum_except(n-1, x+1)

## Manual transformation to an iterative loop

cdef double iterative_sum(int n, double *x):
    cdef double y = 0.0
    cdef int i
    for i in range(n):
        y += x[i]
    return y

## Performance timers for MacOSX.
## On Windows use QueryPerformanceCounter and QueryPerformanceFrequency
## instead of mach_absolute_time and mach_timebase_info.

cdef extern from "mach/mach.h":
    pass

cdef extern from "mach/mach_time.h":
    ctypedef unsigned long uint64_t
    uint64_t mach_absolute_time()
    ctypedef struct mach_timebase_info_data_t:
        uint64_t numer
        uint64_t denom
    void mach_timebase_info(mach_timebase_info_data_t*)

cdef double nanodiff(uint64_t t1, uint64_t t0):
    cdef mach_timebase_info_data_t base
    cdef uint64_t elapsed
    elapsed = t1 - t0
    mach_timebase_info(&base)
    return (<double> elapsed * <double> base.numer / <double> base.denom)

## Here is the actual benchmark we run from Python

import numpy as np

def timeit():
    cdef uint64_t t1, t0
    cdef int n = 100000
    cdef double[::1] buf = np.random.rand(n)
    cdef double s0, s1, s2, s3
    m = 1000
    timings = np.zeros(m)
    print "timing %d tail calls" % (n,)
    for i in range(m):
        t0 = mach_absolute_time()
        s0 = recursive_sum(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "plain cdef: mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())
    for i in range(m):
        t0 = mach_absolute_time()
        s1 = recursive_sum_except(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "cdef with except: mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())
    for i in range(m):
        t0 = mach_absolute_time()
        s2 = recursive_sum_nogil(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "cdef with nogil: mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())
    print "\nbaseline:"
    for i in range(m):
        t0 = mach_absolute_time()
        s3 = iterative_sum(n, &buf[0])
        t1 = mach_absolute_time()
        timings[i] = nanodiff(t1, t0)
    print "manual iteration: mean of %d = %.4g ns, std = %.4g ns" % (m, timings.mean(), timings.std())
    return s0, s1, s2, s3
```

setup.py:

```python
#!/usr/bin/env python
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

so = Extension("test", ["tailcallbench.pyx"], extra_compile_args=['-O3'])

setup(
    author="Sturla Molden",
    name="tailcallbench",
    version="2013-09-12",
    license='scipy license (http://scipy.org)',
    description="Cython tail call benchmark",
    url='',
    classifiers=[
        "Development Status :: 3 - alpha, research",
        "Intended Audience :: Scientific programmers",
        "License :: scipy",
        "Operating System :: mac"],
    packages=["tailcallbench"],
    cmdclass={'build_ext': build_ext},
    ext_modules=[so],
)
```
Intel i7 @ 2.4 GHz (MacBook Pro), GCC version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00):

```
$ export CC=gcc
$ python setup.py build_ext --inplace
[...]
$ python -c "import test; test.timeit()"
timing 100000 tail calls
plain cdef: mean of 1000 = 1.903e+05 ns, std = 2.622e+04 ns
cdef with except: mean of 1000 = 2.349e+05 ns, std = 1.894e+04 ns
cdef with nogil: mean of 1000 = 1.903e+05 ns, std = 1.186e+04 ns

baseline:
manual iteration: mean of 1000 = 2.717e+05 ns, std = 1.008e+04 ns
```

Intel icc version 13.0.2 (gcc version 4.2.1 compatibility):

```
$ export CC=icc
$ rm *.so
$ python setup.py build_ext --inplace
[...]
$ python -c "import test; test.timeit()"
timing 100000 tail calls
plain cdef: mean of 1000 = 2.473e+05 ns, std = 2.977e+04 ns
cdef with except: mean of 1000 = 2.931e+05 ns, std = 4.35e+04 ns
cdef with nogil: mean of 1000 = 2.436e+05 ns, std = 1.731e+04 ns

baseline:
manual iteration: mean of 1000 = 2.576e+04 ns, std = 1947 ns
```

In conclusion, adding `nogil` makes no measurable difference; it is the `except -1` declaration that adds overhead, since every call must check the return value for a pending exception. If the refnanny (or the `nogil` modifier) affected tail call optimization, it would have shown up here.
Thanks for checking @sturlamolden!

By the way, notice how much faster code (10x speedup) Intel C generates for the iterative loop. This is probably due to autovectorization, which demonstrates the importance of testing with multiple compilers. Otherwise we could be fooled into believing that tail call optimization is always omitted. And of course it also shows why it's a good idea to use Intel compilers for scientific programming.

Yes, but then our main aim is to get scikit-learn to work well with the default compilers that come with OSs. If users install scikit-learn and it's slow, we can't tell them to get an Intel license and recompile the whole thing.

It does work well with Apple's GCC/LLVM compiler: the tail calls execute faster than an equivalent iterative loop.

Sure. But anyway, we were digressing: there are no tail calls in our SGD code. Any reviews?
The main thing I would change in your code now is to replace `array.data` with `np.PyArray_DATA(array)` or `&array[0]`. Preferably the former.
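That is, something like this (a sketch; `first_element` is just an illustrative name):

```cython
cimport numpy as np
np.import_array()

def first_element(np.ndarray[double, ndim=1] arr):
    cdef double *p = <double *> np.PyArray_DATA(arr)  # preferred
    cdef double *q = &arr[0]                          # also works
    return p[0] == q[0]
```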
I don't really understand the LOC argument against typed memoryviews. BTW: if you use typed memoryviews, you don't need to cimport NumPy or include its headers.
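For example, this compiles without cimporting NumPy at all (a minimal sketch):

```cython
# A typed memoryview accepts any object exporting the buffer protocol;
# no NumPy headers are needed to build this.
def total(double[::1] x):
    cdef Py_ssize_t i
    cdef double s = 0.0
    for i in range(x.shape[0]):
        s += x[i]
    return s
```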
I understand that the discussion about the recursion part is closed now, but I looked it up and would like to clarify what @amueller said I've observed. According to my notes there was no significant effect of gil/nogil in regular `cdef` functions.

Sorry for creating work for @sturlamolden and thanks a lot for double checking my claims.
@sturlamolden I still need the NumPy headers for the NumPy core math library (the `npy_isfinite` stuff).

That's ok @amueller. If there were such a bug in Cython we would need to get rid of it. Thanks for the clarification, @temporaer, that makes more sense. @larsmans You only need numpy/npy_math.h for the core math library, not all of the NumPy headers. http://docs.scipy.org/doc/numpy/reference/c-api.coremath.html
Shall we merge this? I think this is already a net improvement to the current state. @pprett @mblondel any feedback? @larsmans could you add an entry in the developers doc to summarize when to use which dtype in Cython code (both for floating point feature values and indices for arrays)? I think the switch to the typed memory view API can be discussed in a distinct PR.
```diff
 mode='c'] feature_indices = np.arange(0, self.n_features,
-                                      dtype=np.int32)
+                                      dtype=np.intc)
```
`np.intc` is the type used by scipy.sparse? Is it just a typedef for `int`? Please add a comment so that my future self knows what's going on here.
Yes, `intc` is by definition the C type `int`.
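A quick illustrative check (assuming a typical platform where a C `int` is 32 bits, so scipy.sparse's int32 index arrays match `np.intc`):

```python
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix(np.eye(3))
assert X.indices.dtype == np.intc  # scipy.sparse index arrays are C int
assert X.indptr.dtype == np.intc
```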
Ran my RCV1 benchmark [1] - everything fine - thanks @larsmans for cleaning up and taking care of the dtype mess. +1 for merge
Regarding the typed memory view issue: as far as I understood, using the data array directly is deprecated and we should instead move to memory views - is this correct? This has implications throughout our code base - we should find a consensus on how to deal with it in the project as a whole!
If I understand @sturlamolden correctly, we can still get raw pointers to NumPy arrays' contents; it's just the `.data` attribute (and the buffer syntax) that is deprecated.
I don't think it's decided what will be done with it in Cython (and if, when). But I think it's generally considered legacy syntax.
There's loads and loads of Cython code out there using the np.ndarray syntax, and Cython has been very careful about not breaking backwards compatibility without very good reasons in the past, so it disappearing is not something I would worry about. My advice would be to not spend time switching unless there are real benefits to switching now.

This topic is a bit muddy now. It used to be clear-cut: the memoryview syntax allowed lots of extra optimizations, and @markflorisson88 had a branch that did highly efficient vectorized arithmetic inlined in Cython (more efficient than Intel Fortran in a couple of synthetic cases). However, as no one had the bandwidth to integrate and maintain that work, it was decided not to merge it.

So in a way, it used to be the case that the memoryview syntax was the way of the future, but since the work on "numerical Cython" has been somewhat abandoned, the case is less clear-cut now. The memoryviews have some performance benefits in some situations, but np.ndarray can make for more convenient programming (if you want to do arithmetic with NumPy).
I'm merging this as-is for now. The scikit-learn devs will have to discuss at some point whether we want to switch to memoryviews, maybe after experimenting with them in new Cython extensions. (I saw the vectorized arithmetic PR, and it's a shame that didn't get merged yet. It's a killer feature!) @pprett I haven't put in the comment; I promise I'll write the "which type goes where in Cython" manual that @ogrisel asked for and put it in the dev docs, instead of doing it inline in this file.
Thanks Dag (@dagss). Does this mean we should avoid typed memory views for numerical work? I.e. just use Cython as a wrapping tool, and write plain C (with NumPy's C API) or Fortran? Or something totally different like JIT'ing Python with Numba? I am getting confused... But vectorized memoryview expressions probably never were a good idea, given that the competition is Fortran 90 compilers and Intel's cilkplus.

@sturlamolden I don't have much advice for you. There's no "Cython strategy" for numerics any longer after Mark and I pretty much stopped working on it. I hope Numba will become what we need down the road, but it's probably not there yet. Of course, the rise of Numba led to a loss of interest, but we're in a transition period where people's codebases are written in Cython while only Numba gets further developed in this area, and we don't know yet to what degree Numba will succeed... so it's very difficult to give advice. (Perhaps even calling out to Julia is the way of the future?) I use C or Fortran and wrap it with Cython, myself. Anyone interested in further discussing this should start a thread on the cython-dev list, I guess.

Agreed with Dag's last sentence. Could we please stop this FUDdy discussion in this ticket? If you want to discuss the current status and/or future developments of Cython, please move to one of the Cython-specific lists, so that the right people can give proper comments. cython-dev sounds OK to me here, although it might be generally interesting for users as well.
Here are some safe bits from #2419, which can be pulled without waiting for my threading experiments.
I also cleaned up the code a little bit (e.g. `min` and `max` no longer need to be defined for current Cython).