Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Cythonize _assert_all_finite using stop-on-first strategy #23197

Merged
merged 58 commits into from
Jul 6, 2022

Conversation

Micky774
Copy link
Contributor

@Micky774 Micky774 commented Apr 22, 2022

Reference Issues/PRs
Fixes #11681

What does this implement/fix? Explain your changes.
Implements the code developed by jakirkham and extends it to meet requirements for current _assert_all_finite function.

Any other comments?
Currently struggling to adapt the function to work with np.float16 arrays.

To Do

  • Compare performance as "second pass" algorithm- Compare op speed by replacing w/ equality check
  • Benchmark w/ non-finite arrays of varying density
  • not np.isfinite--> np.isinf() and np.isnan()

@Micky774 Micky774 changed the title ENH Cythonize _assert_all_finite using reduction scheme ENH Cythonize _assert_all_finite using reduction scheme Apr 22, 2022
@Micky774 Micky774 changed the title ENH Cythonize _assert_all_finite using reduction scheme WIP ENH Cythonize _assert_all_finite using reduction scheme Apr 22, 2022
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
raise TypeError("Unsupported array type: %s" % repr(numpy.PyArray_TypeObjectFromType(a_type)))


cdef inline bint c_isfinite(const_fprecision* a_ptr, size_t step, size_t size, bint_enum disallow_nan) nogil:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of indirection for getting disallow_nan passed through. Although, I do not see the motivation for the bint_enum.

I suspect the original implementation is trying to template out the bint_type so the overhead of checking for bint is compiled away in the loop. I am interested to see what the overhead looks like without this optimization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll run some benchmarks once the rest of the pattern is stable -- I'm also interested to see the realized performance difference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context the bint_type was intended to move a deeply nested branch from a repeated runtime check to a one time mostly compiled check. It borrows from C++ templating tricks. That said, there are other ways to achieve the same behavior.

sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
@Micky774
Copy link
Contributor Author

Micky774 commented Apr 23, 2022

Current benchmarks indicate that the optimization is dependent on dtype and that the current implementation is preferable once the number of elements ~>5000. I don't think we need to really offer a choice between the cython/python implementations since the preferable algorithm is fairly clear cut for by far most cases.

Code for benchmark:

from sklearn.utils.validation import _assert_all_finite
import numpy as np

dtype = np.float32
X = np.random.rand(1000,100).astype(dtype)

%timeit _assert_all_finite(X)

@Micky774 Micky774 changed the title WIP ENH Cythonize _assert_all_finite using reduction scheme WIP ENH Cythonize _assert_all_finite using stop-on-first strategy Apr 24, 2022
@ogrisel
Copy link
Member

ogrisel commented Apr 29, 2022

Could you try to use prange / OpenMP parallelism to try to see if parallelism can further increase the processing speed on a multicore machine?

@thomasjpfan
Copy link
Member

Could you try to use prange / OpenMP parallelism to try to see if parallelism can further increase the processing speed on a multicore machine?

Even if it is faster with prange, I prefer not to have a drop in single thread performance compared to main. OMP_NUM_THREADS is commonly set to 1 in libraries such as dask or ray to prevent oversubscription.

We can work around this by using np.isfinite + np.sum for single threaded and use this Cython version if there are enough threads. Although, I prefer not to go down this route.

@ogrisel
Copy link
Member

ogrisel commented Apr 29, 2022

Even if it is faster with prange, I prefer not to have a drop in single thread performance compared to main. OMP_NUM_THREADS is commonly set to 1 in libraries such as dask or ray to prevent oversubscription.

Are you sure we will get a drop in single-threaded performance with prangewhen OMP_NUM_THREADS=1? I think it's worth measuring it.

@thomasjpfan
Copy link
Member

thomasjpfan commented Apr 29, 2022

With a fairly simple implementation in a Jupyter notebook: (Using %load_ext Cython):

Code for only using Cython + prange
%%cython --compile-args=-fopenmp

from libc.math cimport isfinite as c_isfinite
cimport cython
from cython.parallel cimport prange
from cython cimport floating

cdef int cy_isfinite(floating[:, ::1] a, int n_threads):
    cdef:
        int size = a.size
        int i
        floating* a_ptr = &a[0, 0]
        int output = 1
        
    for i in prange(size, num_threads=n_threads, nogil=True):
        if c_isfinite(a_ptr[i]) == 0:
            output = 0
            break
    return output

def my_isfinite(floating[:, ::1] a, int n_threads=1):
    return bool(cy_isfinite(a, n_threads=n_threads))

and these Python version:

from sklearn.utils.extmath import _safe_accumulator_op
import numpy as np

def sk_isfinite(X):
    return np.isfinite(_safe_accumulator_op(np.sum, X))

def np_isfinite_all(X):
    return np.isfinite(X).all()

Running on a fairly sized X:

rng = np.random.RandomState(42)
X = rng.rand(100_000, 100).astype(dtype)

%timeit sk_isfinite(X)
# 1.87 ms ± 8.21 µs per loop

%timeit np_isfinite_all(X)
# 4.71 ms ± 60.5 µs per loop

%timeit my_isfinite(X, n_threads=1)
# 15.8 ms ± 62.8 µs per loop

%timeit my_isfinite(X, n_threads=8)
# 2.47 ms ± 311 µs per loop

# Only using range
%timeit my_isfinite_single_with_range(X)
# 6.25 ms ± 17.6 µs per loop

From the results above scikit-learn's current np.sum + np.isfinite performs better than all Cython implementations. I think it's strange how using range is better compared to using prange with n_threads=1.

Code for only using Cython + range
%%cython

from libc.math cimport isfinite as c_isfinite
cimport cython
from cython cimport floating

cdef int cy_isfinite_with_range(floating[:, ::1] a):
    cdef:
        int size = a.size
        int i
        floating* a_ptr = &a[0, 0]
        int output = 1
        
    with nogil:
        for i in range(size):
            if c_isfinite(a_ptr[i]) == 0:
                output = 0
                break
    return output

def my_isfinite_single_with_range(floating[:, ::1] a):
    return bool(cy_isfinite_with_range(a))

@ogrisel
Copy link
Member

ogrisel commented May 2, 2022

Thanks @thomasjpfan. It's possible that the np.sum function was recently optimized with SIMD instructions that make more efficient than alternatives.

@ogrisel
Copy link
Member

ogrisel commented May 2, 2022

Indeed, using py-spy top --native to monitor a process that runs your sk_isfinite on a float32 data array in a while True loop:

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)                                                                                                                                                      
 60.00%  60.00%   38.92s    38.92s   _aligned_contig_cast_float_to_double (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
 40.00%  40.00%   25.05s    25.05s   DOUBLE_pairwise_sum (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%  60.00%   0.310s    39.23s   npyiter_copy_to_buffers (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%   0.00%   0.210s    0.210s   npyiter_goto_iterindex (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%   0.00%   0.150s    0.150s   npyiter_copy_from_buffers (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%  60.00%   0.090s    39.67s   npyiter_buffered_reduce_iternext_iters2 (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%  40.00%   0.040s    25.09s   DOUBLE_add_AVX512F (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00% 100.00%   0.030s    64.84s   reduce_loop (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%  40.00%   0.030s    25.12s   generic_wrapped_legacy_loop (numpy/core/_multiarray_umath.cpython-310-x86_64-

the DOUBLE_add_AVX512F is SIMD optimized. However I do not understand why numpy converts the float32 data to float64... This seems like a huge waste.

@ogrisel
Copy link
Member

ogrisel commented May 2, 2022

However I do not understand why numpy converts the float32 data to float64... This seems like a huge waste.

Actually this is done on purpose in _safe_accumulator_op. Maybe we could not do that when the goal is to detect nan or inf values. We could call np.sum() directly and only return an exact check if the sum finds a nan or inf.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Cython implementation is already a net improvement and further improvements can be made in follow up PRs.

sklearn/utils/validation.py Outdated Show resolved Hide resolved
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

sklearn/utils/validation.py Outdated Show resolved Hide resolved
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
sklearn/utils/setup.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

sklearn/utils/_isfinite.pyx Outdated Show resolved Hide resolved
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel ogrisel merged commit dfaef0c into scikit-learn:main Jul 6, 2022
@ogrisel
Copy link
Member

ogrisel commented Jul 6, 2022

Merged, thanks @Micky774 and others!

@Micky774 Micky774 deleted the cython_assert_isfinite branch July 6, 2022 16:05
Harsh14901 pushed a commit to Harsh14901/scikit-learn that referenced this pull request Jul 6, 2022
…it-learn#23197)



Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: jakirkham <jakirkham@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@jakirkham
Copy link
Contributor

Did some very minor tidying of the code in PR ( #23849 ). Hope that is ok. Please let me know what you think 🙂

ogrisel added a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
…it-learn#23197)



Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: jakirkham <jakirkham@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimizing assert_all_finite check
5 participants