
[MRG] Faster manhattan_distances() for sparse matrices #15049

Merged 8 commits into scikit-learn:master on Oct 5, 2019

Conversation

@ptocca (Contributor) commented Sep 21, 2019

Reference Issues/PRs

See also PR #14986

What does this implement/fix? Explain your changes.

Provides a faster implementation of manhattan_distances() for sparse matrices. Originally discussed in PR #14986 (which also targeted the dense case).
The Cython implementation in pairwise_fast.pyx iterates over only the non-zero elements of each row.
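
For reference, the public entry point exercised by this PR can be called as follows (a minimal sketch; the matrix sizes and density are illustrative):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import manhattan_distances

rng = np.random.RandomState(0)
X = sparse_random(5, 8, density=0.3, format="csr", random_state=rng)
Y = sparse_random(4, 8, density=0.3, format="csr", random_state=rng)

# With CSR inputs, manhattan_distances takes the sparse code path.
D = manhattan_distances(X, Y)  # D has shape (5, 4)
```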

@jnothman (Member) commented Sep 23, 2019

Why is this WIP? Is there work you intend to do before it is fully considered for merge?

@jnothman (Member) left a comment

Otherwise LGTM

cdef int n = D.shape[1]

# We scan the matrices row by row.
# Given row px in X and row py in Y, we find the positions (i and j respectively), in .indices where the indices

@jnothman (Member), Sep 23, 2019:

Try to keep this under 80 chars per line.

# Below the avoidance of inplace operators is intentional.
# When prange is used, the inplace operator has a special meaning, i.e. it signals a "reduction"

for px in prange(m,nogil=True):

@jnothman (Member), Sep 23, 2019:

space after comma, please

@@ -765,10 +765,12 @@ def manhattan_distances(X, Y=None, sum_over_features=True):

X = csr_matrix(X, copy=False)
Y = csr_matrix(Y, copy=False)
X.sort_indices()

@jnothman (Member), Sep 23, 2019:

Formally you might require sum_duplicates too? Perhaps we should make a copy if the matrix indices are unsorted, rather than modifying in place without permission.

@rth (Member), Sep 24, 2019:

While I agree that in-place modification of the input is bad, I can't think of a case where it would be harmful to call sort_indices and sum_duplicates on a CSR array. scipy is somewhat ambiguous about doing this at initialization: https://github.com/scipy/scipy/blob/26b7d3f40905d85845a3af75a67c318a8d441ed1/scipy/sparse/compressed.py#L198

Maybe adding a note to the docstring about it would be enough? Of course, if the arrays are read-only, a copy still needs to be triggered.

@jnothman (Member), Sep 24, 2019:

So let's sum duplicates here and document the in-place operation.
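
The scipy semantics under discussion can be seen with a small example (a sketch; the matrix values are arbitrary). sum_duplicates operates in place, and sorting the indices falls out as a side effect:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Row 0 holds entries at columns 2, 0, 2: unsorted indices plus a duplicate.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([2, 0, 2])
indptr = np.array([0, 3, 3])
A = csr_matrix((data, indices, indptr), shape=(2, 4))

# In place: merges the two (0, 2) entries into 4.0 and sorts the indices.
A.sum_duplicates()
```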

@rth (Member) commented Sep 24, 2019

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

Added entry to change log
@ptocca ptocca changed the title [WIP] Faster manhattan_distances() for sparse matrices [MRG] Faster manhattan_distances() for sparse matrices Sep 27, 2019
@jnothman (Member) commented Sep 28, 2019

I like this, but you need to resolve merge conflicts with master.

…hattan

# Conflicts:
#	doc/whats_new/v0.22.rst
@thomasjpfan (Member) commented Oct 1, 2019

I am a little concerned with how this will interact with NearestNeighbors and n_jobs. Will this lead to another over-subscription problem?

@rth (Member) commented Oct 1, 2019

I am a little concerned with how this will interact with NearestNeighbors and n_jobs. Will this lead to another over-subscription problem?

Well, it's not too different from Euclidean distances with BLAS in that respect. Anything that applies there (e.g. pairwise_distances being slower with n_jobs>1) will probably also apply here.

Though we could also add a single thread implementation for now and add the prange later once #14196 and related discussions are concluded. I don't have a strong opinion about it.

@jeremiedbb (Member) commented Oct 1, 2019

Though we could also add a single thread implementation for now and add the prange later once #14196 and related discussions are concluded. I don't have a strong opinion about it.

I feel like we need to think about parallelism in pairwise distances. Adding the prange now is good for internal use of manhattan_distances but will probably be detrimental for pairwise_distances (unless n_jobs=1).

pairwise_distances is a wrapper around scipy metrics (sequential) and sklearn metrics (some multi-threaded, some sequential). I think one solution could be to use joblib parallelism only for metrics we know are sequential.

@thomasjpfan (Member) commented Oct 1, 2019

Just noticed the changelog for joblib 0.14.0 (https://github.com/joblib/joblib/blob/master/CHANGES.rst), which includes joblib/joblib#940. This may be a non-issue now.

@jeremiedbb (Member) commented Oct 1, 2019

That is because pairwise_distances uses the threading backend, which is not covered by joblib/joblib#940.

@thomasjpfan (Member) commented Oct 1, 2019

Though we could also add a single thread implementation for now and add the prange later once #14196 and related discussions are concluded. I don't have a strong opinion about it.

pairwise_distances is a wrapper around scipy metrics (sequential) and sklearn metrics (some multi-threaded, some sequential). I think one solution could be to only use joblib parallelism for metrics we now are sequential.

I agree with keeping this PR sequential for now.

@ptocca (Contributor, Author) commented Oct 1, 2019

My initial motivation for this PR was the intense frustration of computing a Laplacian kernel matrix on a 24-core machine and seeing that only one core was used (the Laplacian kernel is computed from the Manhattan distance). When I then looked at the code, I found that the sparse-case implementation could be improved.
Multicore machines are the norm today, so my view is that we should try to make software that distributes computation across the available resources.
If one wants to limit the number of threads, there are ways to do that (OMP_NUM_THREADS, or in this case setting num_threads in the prange call).
By the way, the Gaussian kernel computation appears to be multithreaded, so why should the Laplacian be any different?
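
Capping the OpenMP thread pool can be sketched like this (assumes an OpenMP-enabled build; the variable must be set before the OpenMP runtime is initialized, i.e. before importing the compiled extension in a fresh process):

```python
import os

# Limit the threads used by prange-based routines; this must happen
# before sklearn's compiled extensions start their OpenMP runtime.
os.environ["OMP_NUM_THREADS"] = "4"
```

Alternatively, num_threads can be passed to prange at the call site in the Cython source.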

@thomasjpfan (Member) commented Oct 1, 2019

@ptocca What do you think of the following?

    cdef np.npy_intp px, py, i, j, ix, iy, X_indptr_end, Y_indptr_end
    cdef double d = 0.0

    cdef int m = D.shape[0]
    cdef int n = D.shape[1]
    for px in range(m):
        X_indptr_end = X_indptr[px + 1]
        for py in range(n):
            d = 0.0
            Y_indptr_end = Y_indptr[py + 1]
            i = X_indptr[px]
            j = Y_indptr[py]

            while i < X_indptr_end and j < Y_indptr_end:
                ix = X_indices[i]
                iy = Y_indices[j]

                if ix == iy:
                    d += fabs(X_data[i] - Y_data[j])
                    i += 1
                    j += 1
                elif ix < iy:
                    d += fabs(X_data[i])
                    i += 1
                else:
                    d += fabs(Y_data[j])
                    j += 1

            if i == X_indptr_end:
                while j < Y_indptr_end:
                    d += fabs(Y_data[j])
                    j += 1
            else:
                while i < X_indptr_end:
                    d += fabs(X_data[i])
                    i += 1

            D[px, py] = d
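
For readers who want to experiment without compiling Cython, the same two-pointer merge can be rendered in pure Python (an illustrative sketch; sparse_manhattan is a name invented here, not part of the PR, and like the PR it sorts indices in place):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_manhattan(X, Y):
    # Pure-Python rendering of the two-pointer merge above; the PR
    # ships a compiled (and prange-parallel) Cython version of the idea.
    X, Y = csr_matrix(X), csr_matrix(Y)
    X.sort_indices()
    Y.sort_indices()
    D = np.zeros((X.shape[0], Y.shape[0]))
    for px in range(X.shape[0]):
        x_end = X.indptr[px + 1]
        for py in range(Y.shape[0]):
            y_end = Y.indptr[py + 1]
            i, j = X.indptr[px], Y.indptr[py]
            d = 0.0
            # Walk both sorted index lists, accumulating |x - y| per column.
            while i < x_end and j < y_end:
                ix, iy = X.indices[i], Y.indices[j]
                if ix == iy:
                    d += abs(X.data[i] - Y.data[j])
                    i += 1
                    j += 1
                elif ix < iy:
                    d += abs(X.data[i])  # column present only in X's row
                    i += 1
                else:
                    d += abs(Y.data[j])  # column present only in Y's row
                    j += 1
            # One row is exhausted; fold in the remaining tail of the other.
            while i < x_end:
                d += abs(X.data[i])
                i += 1
            while j < y_end:
                d += abs(Y.data[j])
                j += 1
            D[px, py] = d
    return D
```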
@rth (Member) commented Oct 2, 2019

Multicore machines are the norm today, so my view is that we should try to make software that distributes computation across the available resources.

Indeed. However, the issue starts when multiple levels of parallelism are used. For instance, pairwise_distances has an n_jobs parameter. On a machine with many CPUs, if one uses that parameter to occupy all CPUs and each job in turn starts N_CPU threads, one ends up with N_CPU**2 threads; with N_CPU large enough, that can freeze a machine due to CPU oversubscription.

Some of this is addressed in joblib, but not for the threading backend, as mentioned by @jeremiedbb above.

Overall, I think we should handle this in pairwise_distances in any case, since it's currently an issue for Euclidean distances as well.

+1 to keep the prange, but could you please benchmark this implementation with OMP_NUM_THREADS of 1, 2, 4, 8, 16 (if available) on a reasonably sized dataset? To make sure that the scaling with the number of threads is reasonably good. Thanks!

@thomasjpfan (Member) left a comment

Okay, let's keep the prange and deal with the joblib threading issue at the pairwise_distances level.

Comments based on #15049 (comment)

j = Y_indptr[py]
d = 0.0
while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
if i < X_indptr[px + 1]:

@thomasjpfan (Member), Oct 2, 2019:

This condition can be removed, because it is always true in this loop

@ptocca (Author, Contributor), Oct 3, 2019:

Oops, you're absolutely right!

while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
if i < X_indptr[px + 1]:
ix = X_indices[i]
if j < Y_indptr[py + 1]:

@thomasjpfan (Member), Oct 2, 2019:

This condition can be removed because it is always true in this loop

sklearn/metrics/pairwise_fast.pyx (thread resolved)

if i == X_indptr[px + 1]:
while j < Y_indptr[py + 1]:
iy = Y_indices[j]

@thomasjpfan (Member), Oct 2, 2019:

iy = Y_indices[j] is unneeded

j = j + 1
else:
while i < X_indptr[px + 1]:
ix = X_indices[i]

@thomasjpfan (Member), Oct 2, 2019:

ix = X_indices[i] is unneeded.

i = X_indptr[px]
j = Y_indptr[py]
d = 0.0
while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:

@thomasjpfan (Member), Oct 2, 2019:

Can we define X_indptr[px + 1] before the range(n) loop and Y_indptr[py + 1] before the while loop and then reference them everywhere?

ptocca added 2 commits Oct 3, 2019
Avoided repeated computation of last index
@ptocca (Contributor, Author) commented Oct 3, 2019

@rth: I benchmarked the new implementation with the code in this gist. I ran it on my laptop and on a node of an HPC system.
My laptop (L) has an Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz, with 4 cores.
The HPC node (H) has an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, with 32 cores (but I only had 12 allocated to me).
The benchmark consists of computing manhattan_distances() 10 times for two 2000x3000 random sparse matrices with 1% density.

The old implementation ran in 34s on my laptop and in 44s on the HPC node.

Times are in seconds.

| #cores | User (L) | Wall (L) | User (H) | Wall (H) |
|-------:|---------:|---------:|---------:|---------:|
| 1      | 11.8     | 11.90    | 14.8     | 14.80    |
| 2      | 12.0     | 6.04     | 14.9     | 7.51     |
| 3      | 14.4     | 5.16     | 14.9     | 5.06     |
| 4      | 16.3     | 4.25     | 15.0     | 3.83     |
| 6      | NaN      | NaN      | 15.0     | 2.60     |
| 8      | NaN      | NaN      | 15.1     | 1.97     |
| 10     | NaN      | NaN      | 15.1     | 1.61     |
| 12     | NaN      | NaN      | 15.3     | 1.38     |

Somewhat surprisingly, my puny laptop is faster than the Xeon in the single-threaded case. It does have a faster clock, but I thought that, for starters, the memory subsystem would be much slower.
Where the HPC node really shines is in multi-threaded computation.
At least up to 12 cores, there is almost no multi-threading overhead (compare the "user" totals) and the speed-up is almost linear.
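
A benchmark in this spirit can be sketched as follows (the gist mentioned above is the authoritative version; the sizes here are scaled down from the 2000x3000 matrices so the sketch runs quickly):

```python
import time
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import manhattan_distances

rng = np.random.RandomState(42)
X = sparse_random(200, 300, density=0.01, format="csr", random_state=rng)
Y = sparse_random(200, 300, density=0.01, format="csr", random_state=rng)

# Time repeated calls; run with different OMP_NUM_THREADS settings
# to measure the scaling of the prange-based implementation.
n_repeats = 10
start = time.perf_counter()
for _ in range(n_repeats):
    D = manhattan_distances(X, Y)
elapsed = time.perf_counter() - start
print(f"{n_repeats} runs: {elapsed:.3f}s wall time")
```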

@rth (Member) commented Oct 3, 2019

Thanks for doing the benchmarks @ptocca. I can confirm your conclusions after re-running them:

  • for a (2000, 3000) dataset with a density of 0.01, this implementation is >2x faster than master on my desktop. The speed-up when going from 1 to 10 threads is 8.7x, which is great, particularly on such a small dataset.
  • for a (2000, 10000) dataset with a density of 0.0001, this implementation is ~23x faster than master, still with good scaling with num_threads.
@rth (Member) approved these changes Oct 3, 2019

Very nice work @ptocca !

I wonder if applying a similar approach to euclidean distances would also be faster.

@rth (Member) commented Oct 3, 2019

I wonder if applying a similar approach to euclidean distances would also be faster.

Seems unlikely, as the current euclidean_distances is up to 10-30x faster than the manhattan_distances in this PR. It's a different metric, but it still sounds difficult to do better in Cython. Although I really wish the sparse dot product (used in euclidean_distances) were multi-threaded in scipy.

@thomasjpfan (Member) left a comment

Small nit

for px in prange(m, nogil=True):
for py in range(n):
i = X_indptr[px]
j = Y_indptr[py]
d = 0.0
Comment on lines 74 to 78

@thomasjpfan (Member), Oct 4, 2019:

What do you think of doing:

for px in prange(m, nogil=True):
    X_indptr_end = X_indptr[px + 1]
    for py in range(n):
        Y_indptr_end = Y_indptr[py + 1]

And then using X_indptr_end and Y_indptr_end everywhere?

@thomasjpfan thomasjpfan merged commit 24a50e5 into scikit-learn:master Oct 5, 2019
17 checks passed
LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: doc Your tests passed on CircleCI!
ci/circleci: doc artifact Link to 0/doc/_changed.html
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
ci/circleci: lint Your tests passed on CircleCI!
scikit-learn.scikit-learn Build #20191004.30 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) succeeded
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) succeeded
scikit-learn.scikit-learn (Linux pylatest_conda_mkl) succeeded
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) succeeded
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) succeeded
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) succeeded
scikit-learn.scikit-learn (Windows py37_conda_mkl) succeeded
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) succeeded
@thomasjpfan (Member) commented Oct 5, 2019

Thank you @ptocca !

@ptocca (Contributor, Author) commented Oct 5, 2019

@thomasjpfan It was my privilege! Thanks everyone for your careful checking and your constructive suggestions!
