
Enforce threading in parallel pairwise #13310

Conversation

7 participants
@pierreglaser (Contributor) commented Feb 27, 2019

pairwise_distances makes a direct call to _parallel_pairwise, which computes the distance matrix chunk by chunk. However, it uses loky, the default backend of joblib, which generates a lot of overhead due to data communication. I propose to enforce a threading backend in _parallel_pairwise. I also drop the np.hstack concatenation call and write to the matrix in-place, to further optimize for speed.
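For readers skimming the thread, here is a condensed sketch of the proposed approach, assembled from the diff fragments quoted later in this review (an approximation, not the exact merged code): each worker writes its block of columns directly into a shared, pre-allocated result array under the threading backend, instead of returning chunks that are then np.hstack-ed together.

```python
import numpy as np
from joblib import Parallel, delayed, effective_n_jobs
from sklearn.utils import gen_even_slices


def _dist_wrapper(dist_func, dist_matrix, slice_, *args, **kwargs):
    """Write in-place to a column slice of a distance matrix."""
    dist_matrix[:, slice_] = dist_func(*args, **kwargs)


def _parallel_pairwise(X, Y, func, n_jobs, **kwds):
    """Compute the distance matrix column block by column block, in threads."""
    if Y is None:
        Y = X
    # Threads share memory, so workers can fill `ret` directly instead of
    # pickling partial results back to a parent process (the loky overhead).
    ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype, order='F')
    fd = delayed(_dist_wrapper)
    Parallel(backend="threading", n_jobs=n_jobs)(
        fd(func, ret, s, X, Y[s], **kwds)
        for s in gen_even_slices(Y.shape[0], effective_n_jobs(n_jobs)))
    return ret
```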

Running this benchmark and limiting over-subscription by setting MKL_NUM_THREADS to 1 yielded these results:

[benchmark plot: parallel_pairwise_plot]

@jeremiedbb @zanospi

@zanospi (Contributor) commented Feb 27, 2019

Full benchmark on a server with 40 threads & MKL, using:

  • loky, multiprocessing and threading as backend
  • euclidean, cosine, correlation & manhattan
  • either dense or sparse matrices with increasing number of features
  • either an unlimited number of threads per job, or only 1 thread

(sparse manhattan crashed for dim >= 1000 with loky & multiprocessing; I have not investigated this yet)
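For context, this is roughly how a single cell of that grid could be timed on the pre-PR code, where the joblib backend is still selected by the active parallel_backend context (an illustrative sketch, not the actual benchmark script; MKL_NUM_THREADS would additionally be set in the environment for the limited runs):

```python
import time

import numpy as np
from joblib import parallel_backend
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(10_000, 500)  # the "Samples 10k / Features 500", dense case

for backend in ("loky", "multiprocessing", "threading"):
    for n_jobs in (1, 2, 4, 40):
        with parallel_backend(backend, n_jobs=n_jobs):
            tic = time.perf_counter()
            pairwise_distances(X, metric="cosine", n_jobs=n_jobs)
            print(f"{backend:>15}  n_jobs={n_jobs:<2}  "
                  f"{time.perf_counter() - tic:.2f}s")
```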

==========================================================
=== Joblib_backend = loky, MKL_NUM_THREADS not limited ===
==========================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      646ms     3.99s     3.26s     3.94s
     dense         euclidean     746ms     4.29s     3.23s     3.95s
     dense         manhattan     33.9s     17.8s     11.1s     6.18s
     dense        correlation    15.7s     17.4s     10.6s     5.02s
     sparse          cosine      2.34s     3.83s     3.00s     3.63s
     sparse        euclidean     2.65s     3.90s     3.11s     3.52s
     sparse        manhattan     27.0s     17.9s     11.1s     6.01s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      813ms     4.83s     3.52s     3.82s
     dense         euclidean     971ms     4.91s     3.51s     3.71s
     dense         manhattan     1.13m     37.4s     21.3s     11.6s
     dense        correlation    32.9s     35.9s     20.6s     9.96s
     sparse          cosine      3.42s     3.80s     2.82s     3.07s
     sparse        euclidean     4.18s     4.01s     2.95s     3.21s
     sparse        manhattan     51.3s    failed    failed    failed

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      765ms     4.86s     3.55s     3.71s
     dense         euclidean     974ms     4.91s     3.47s     3.64s
     dense         manhattan     1.14m     39.0s     21.2s     12.4s
     dense        correlation    39.0s     37.8s     20.5s     9.82s
     sparse          cosine      3.64s     3.80s     2.84s     3.62s
     sparse        euclidean     3.82s     4.38s     3.32s     3.74s
     sparse        manhattan     51.3s    failed    failed    failed

=====================================================================
=== Joblib_backend = multiprocessing, MKL_NUM_THREADS not limited ===
=====================================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      658ms     2.99s     2.33s     1.90s
     dense         euclidean     763ms     3.04s     2.35s     1.91s
     dense         manhattan     33.9s     16.3s     9.60s     3.97s
     dense        correlation    15.5s     16.2s     9.60s     3.97s
     sparse          cosine      2.20s     2.57s     1.99s     1.70s
     sparse        euclidean     2.65s     2.81s     2.12s     1.77s
     sparse        manhattan     26.8s     16.2s     9.66s     4.00s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      731ms     4.33s     3.02s     2.34s
     dense         euclidean     976ms     4.34s     3.01s     2.12s
     dense         manhattan     1.13m     36.5s     20.7s     9.94s
     dense        correlation    32.6s     35.8s     20.4s     10.4s
     sparse          cosine      3.29s     3.11s     2.27s     1.66s
     sparse        euclidean     3.75s     3.39s     2.40s     1.78s
     sparse        manhattan     51.0s    failed    failed    failed

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      735ms     4.33s     3.04s     2.28s
     dense         euclidean     966ms     4.26s     3.01s     2.11s
     dense         manhattan     1.13m     36.6s     21.0s     10.2s
     dense        correlation    32.8s     35.8s     20.4s     10.5s
     sparse          cosine      3.30s     3.12s     2.29s     1.69s
     sparse        euclidean     3.73s     3.38s     2.38s     1.72s
     sparse        manhattan     51.0s    failed    failed    failed

===============================================================
=== Joblib_backend = threading, MKL_NUM_THREADS not limited ===
===============================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      709ms     732ms     878ms     3.29s
     dense         euclidean     707ms     889ms     935ms     3.27s
     dense         manhattan     33.9s     13.6s     7.56s     1.98s
     dense        correlation    15.5s     14.0s     7.78s     2.10s
     sparse          cosine      2.19s     1.45s     1.12s     1.66s
     sparse        euclidean     2.64s     1.57s     1.17s     1.83s
     sparse        manhattan     26.8s     13.8s     7.48s     1.79s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      761ms     921ms     1.20s     3.31s
     dense         euclidean     958ms     997ms     1.08s     3.51s
     dense         manhattan     1.13m     34.7s     19.1s     8.46s
     dense        correlation    32.7s     33.6s     18.6s     8.62s
     sparse          cosine      3.33s     1.98s     1.29s     1.62s
     sparse        euclidean     3.71s     2.08s     1.44s     1.66s
     sparse        manhattan     51.0s     26.0s     14.0s     3.03s

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      771ms     972ms     1.03s     3.31s
     dense         euclidean     966ms     1.00s     1.03s     3.51s
     dense         manhattan     1.13m     34.7s     19.1s     8.48s
     dense        correlation    32.8s     33.6s     18.7s     8.59s
     sparse          cosine      3.28s     1.99s     1.30s     1.79s
     sparse        euclidean     3.72s     2.08s     1.39s     1.93s
     sparse        manhattan     51.0s     26.0s     14.1s     3.02s

==================================================
=== Joblib_backend = loky, MKL_NUM_THREADS = 1 ===
==================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      2.38s     5.18s     4.17s     4.03s
     dense         euclidean     3.14s     5.13s     4.22s     4.06s
     dense         manhattan     36.2s     19.4s     12.3s     6.71s
     dense        correlation    17.6s     19.1s     12.1s     5.34s
     sparse          cosine      2.95s     4.69s     3.91s     3.85s
     sparse        euclidean     3.30s     4.62s     3.68s     3.93s
     sparse        manhattan     27.2s     18.5s     12.0s     6.58s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      3.53s     6.01s     3.97s     4.03s
     dense         euclidean     3.70s     6.02s     4.05s     3.82s
     dense         manhattan     1.19m     39.6s     22.8s     12.8s
     dense        correlation    35.8s     38.5s     21.4s     10.3s
     sparse          cosine      3.84s     4.52s     3.00s     3.20s
     sparse        euclidean     4.31s     4.39s     3.66s     3.41s
     sparse        manhattan     51.4s    failed    failed    failed

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      3.59s     6.03s     4.14s     3.94s
     dense         euclidean     3.77s     5.60s     4.28s     3.67s
     dense         manhattan     1.36m     39.3s     22.9s     12.7s
     dense        correlation    37.9s     38.3s     21.8s     9.12s
     sparse          cosine      3.84s     4.91s     3.74s     3.92s
     sparse        euclidean     4.26s     4.88s     3.63s     4.17s
     sparse        manhattan     51.4s    failed    failed    failed

=============================================================
=== Joblib_backend = multiprocessing, MKL_NUM_THREADS = 1 ===
=============================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      2.10s     3.48s     2.66s     2.06s
     dense         euclidean     2.85s     3.70s     2.65s     1.96s
     dense         manhattan     34.0s     17.4s     10.7s     4.58s
     dense        correlation    15.4s     17.6s     10.7s     4.45s
     sparse          cosine      2.56s     3.17s     2.13s     1.87s
     sparse        euclidean     2.72s     3.40s     2.29s     2.14s
     sparse        manhattan     26.9s     18.0s     10.9s     4.38s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      3.39s     4.81s     3.42s     2.54s
     dense         euclidean     3.10s     4.90s     3.30s     2.41s
     dense         manhattan     1.13m     38.3s     22.6s     10.3s
     dense        correlation    32.8s     38.4s     21.9s     10.8s
     sparse          cosine      3.36s     3.96s     2.53s     1.86s
     sparse        euclidean     4.00s     4.10s     3.08s     1.86s
     sparse        manhattan     51.1s    failed    failed    failed

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      3.37s     4.96s     3.35s     2.57s
     dense         euclidean     3.10s     5.12s     3.42s     2.19s
     dense         manhattan     1.13m     37.7s     22.4s     10.4s
     dense        correlation    32.8s     38.4s     21.5s     11.3s
     sparse          cosine      3.34s     3.93s     2.46s     1.69s
     sparse        euclidean     3.87s     4.08s     2.55s     2.12s
     sparse        manhattan     51.1s    failed    failed    failed

=======================================================
=== Joblib_backend = threading, MKL_NUM_THREADS = 1 ===
=======================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      2.27s     1.70s     1.14s     937ms
     dense         euclidean     3.01s     1.81s     1.17s     837ms
     dense         manhattan     35.0s     14.0s     7.97s     2.40s
     dense        correlation    16.8s     14.3s     8.18s     2.55s
     sparse          cosine      2.59s     1.61s     1.06s     825ms
     sparse        euclidean     3.06s     1.85s     1.08s     689ms
     sparse        manhattan     27.1s     14.1s     7.82s     2.07s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      3.44s     2.74s     1.78s     1.11s
     dense         euclidean     3.58s     2.79s     1.80s     904ms
     dense         manhattan     1.16m     35.2s     19.7s     8.69s
     dense        correlation    34.7s     34.2s     19.0s     9.19s
     sparse          cosine      3.83s     2.14s     1.37s     939ms
     sparse        euclidean     4.24s     2.36s     1.40s     659ms
     sparse        manhattan     51.4s     26.2s     14.2s     3.29s

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine      3.49s     2.81s     1.79s     1.13s
     dense         euclidean     3.63s     2.92s     2.03s     914ms
     dense         manhattan     1.18m     34.9s     19.6s     8.46s
     dense        correlation    34.8s     34.2s     19.0s     9.29s
     sparse          cosine      3.78s     2.09s     1.30s     887ms
     sparse        euclidean     4.25s     2.34s     1.48s     784ms
     sparse        manhattan     51.4s     26.0s     14.0s     3.29s

@pierreglaser pierreglaser force-pushed the pierreglaser:enforce-threading-in_parallel_pairwise branch from 34229aa to 6d84837 Feb 27, 2019

@pierreglaser pierreglaser force-pushed the pierreglaser:enforce-threading-in_parallel_pairwise branch from 6d84837 to 7361f64 Feb 27, 2019

@jeremiedbb (Contributor) commented Feb 27, 2019

ping @rth
related to #8216

@jeremiedbb (Contributor) commented Feb 27, 2019

Note that enforcing threading can lead to oversubscription. We should wait for #13297 to dynamically set the maximum number of threads for inner parallelism.

@rth (Member) left a comment

Interesting results and nice benchmarks!

A few comments below

fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs, verbose=0)(
fd(func, ret, s, X, Y[s], **kwds)

@rth (Member) commented Feb 27, 2019

Interesting that it goes faster; I would have expected accessing and writing to a shared array to be limiting with respect to the GIL.

# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs, verbose=0)(

@rth (Member) commented Feb 27, 2019

verbose=0 is the default and could be dropped?

@@ -1049,6 +1049,11 @@ def distance_metrics():
return PAIRWISE_DISTANCE_FUNCTIONS


def _dist_wrapper(dist, mat, slice, *args, **kwargs):

@rth (Member) commented Feb 27, 2019

maybe use more explicit names?

def _dist_wrapper(dist_func, dist_matrix, col_slice, *args, **kwargs):

@pierreglaser (Author, Contributor) commented Feb 27, 2019

OK. Keeping slice rather than col_slice for now, because of my answer to your comment below.

fd(X, Y[s], **kwds)
# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)

@rth (Member) commented Feb 27, 2019

I would be curious to know if allocating this array Fortran ordered for better cache locality would matter for performance (no need to run all benchmarks, just one typical case).

@pierreglaser (Author, Contributor) commented Feb 27, 2019

Nice one. But actually, is there any reason why we slice on Y and not on X? Otherwise we can swap and keep the C order.

@rth (Member) commented Feb 27, 2019

The thing is, in pairwise_distances_chunked we chunk along X, and then X_chunk and Y are sent to pairwise_distances, which can be parallelized with n_jobs. So we cannot chunk pairwise_distances and pairwise_distances_chunked along the same axis.

Maybe this is something we can think about in a follow-up PR.
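To illustrate the point about Fortran order with a small, self-contained example (array sizes are arbitrary): each worker writes to a block of columns, and with column-major storage such a block is one contiguous chunk of memory, which is not the case with the default C order.

```python
import numpy as np

ret_c = np.empty((10_000, 1_000), order='C')
ret_f = np.empty((10_000, 1_000), order='F')

s = slice(0, 250)  # the column block a single worker would fill
print(ret_c[:, s].flags['F_CONTIGUOUS'])  # False: rows are scattered in memory
print(ret_f[:, s].flags['F_CONTIGUOUS'])  # True: one contiguous block
```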

@rth (Member) commented Feb 27, 2019

Note that enforcing threading can lead to oversubscription.

Agreed, but I think it's still an improvement even without setting OMP_NUM_THREADS, unless one is using a 20+ core machine. And in those cases, making sure OMP_NUM_THREADS is limited is probably good practice anyway.
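For reference, a minimal illustration of the practice rth mentions (a hedged sketch: capping the BLAS/OpenMP thread pools so that the joblib-level threads do not oversubscribe the machine; the variables must be set before NumPy/MKL is loaded):

```python
import os

# Cap BLAS/OpenMP threads before importing numpy / scikit-learn.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(2_000, 200)
# Only joblib-level parallelism remains; each worker's BLAS calls stay
# single-threaded, so 4 jobs use roughly 4 cores.
D = pairwise_distances(X, metric="euclidean", n_jobs=4)
print(D.shape)  # (2000, 2000)
```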

@rth rth added this to the 0.21 milestone Feb 27, 2019

@rth (Member) left a comment

Please add a what's new entry.

Overall looks good!

@@ -1049,6 +1049,11 @@ def distance_metrics():
return PAIRWISE_DISTANCE_FUNCTIONS


def _dist_wrapper(dist_func, dist_matrix, slice, *args, **kwargs):
"""Write in-place to a slice of a distance matrix"""
dist_matrix[:, slice] = dist_func(*args, **kwargs)

@rth (Member) commented Feb 27, 2019

OK, then maybe call it slice_ to avoid shadowing the built-in name?

pierreglaser added some commits Feb 27, 2019

@rth rth approved these changes Feb 27, 2019

@@ -182,6 +182,12 @@ Support for Python 3.4 and below has been officially dropped.
:mod:`sklearn.metrics`
......................

- |Efficiency| Force ``joblib`` to use the ``'threading'`` backend in
:func:`metrics.pairwise.pairwise_distances` when ``n_jobs`` is greater than

@rth (Member) commented Feb 27, 2019

This doesn't really explain what changed. Maybe

- |Efficiency| Faster :func:`metrics.pairwise.pairwise_distances` with `n_jobs` > 1
  by using the `threading`, instead of the `multiprocessing`, parallel backend.

?

@jeremiedbb (Contributor) commented Feb 27, 2019

instead of 'loky' :)

@rth (Member) commented Feb 27, 2019

Yeah, I know; I meant process-based parallelism generally...

@rth rth added this to To do in Sprint Paris 2019 via automation Feb 27, 2019

@rth rth moved this from To do to Needs review in Sprint Paris 2019 Feb 27, 2019

@ogrisel (Member) commented Feb 28, 2019

Could you please reorganize the benchmark report to make it easy to contrast loky vs threading for each case?

@ogrisel (Member) commented Feb 28, 2019

You can even use plt.barh (horizontal bar plot) to put all the results in the same plot, using rich labels on the y-axis.
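For example, a minimal sketch of such a plot (the labels and timings below are a hand-picked subset of the dense, 500-feature, n_jobs=4 numbers from the tables above):

```python
import matplotlib.pyplot as plt

labels = ["cosine / loky", "cosine / threading",
          "manhattan / loky", "manhattan / threading"]
times_s = [3.26, 0.878, 11.1, 7.56]  # seconds, from the tables above

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(labels, times_s)
ax.set_xlabel("pairwise_distances time (s), dense, 500 features, n_jobs=4")
fig.tight_layout()
plt.show()
```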

@pierreglaser (Author, Contributor) commented Feb 28, 2019

@zanospi if you got these results from asv and are having a hard time parsing/normalizing them into a table-like format, I wrote a helper that aggregates asv benchmark results into a pandas DataFrame. @jeremiedbb has it on his machine :)

@jnothman (Member) commented Feb 28, 2019

@jeremiedbb (Contributor) commented Feb 28, 2019

I'd say for all metrics, even scipy's. In the benchmarks above there is correlation, which is a scipy metric, and threading improves that too.
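As a concrete example (hedged: at the time of this PR, metrics that are not scikit-learn built-ins, such as scipy's correlation, were dispatched to scipy.spatial.distance but still went through the same parallel code path when n_jobs was set):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(1_000, 50)
# "correlation" is a scipy metric rather than a scikit-learn built-in, yet it
# also benefits from the threading backend when n_jobs > 1.
D = pairwise_distances(X, metric="correlation", n_jobs=2)
print(D.shape)  # (1000, 1000)
```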

@GaelVaroquaux (Member) commented Feb 28, 2019

LGTM. Great! Thank you!

@GaelVaroquaux GaelVaroquaux merged commit 1ada3aa into scikit-learn:master Feb 28, 2019

17 checks passed


Sprint Paris 2019 automation moved this from Needs review to Done Feb 28, 2019

# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype, order='F')
Parallel(backend="threading", n_jobs=n_jobs)(

@ogrisel (Member) commented Feb 28, 2019

We should probably not enforce the backend like this, but instead use the prefer="threads" hinting mechanism, that is, the **_joblib_parallel_args(prefer='threads') trick, to keep backward compatibility with joblib 0.11.
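For comparison, a hedged sketch of the prefer hint in plain joblib (>= 0.12); in scikit-learn at the time this was spelled through the _joblib_parallel_args compatibility helper so that joblib 0.11, which lacks prefer, kept working:

```python
from joblib import Parallel, delayed


def square(i):
    return i * i


# `prefer="threads"` is a soft hint: an outer parallel_backend(...) context
# (e.g. Dask) may still override it, whereas an explicit backend="threading"
# argument, as merged in this PR, takes precedence over the context.
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(square)(i) for i in range(8))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```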

@rth (Member) commented Mar 1, 2019

I was wondering about that, but say you are using Dask: do we really want it to be able to change the backend here? I can't think of a use case where we would want a process-based backend for Euclidean distances, which is probably the most frequent case.

Other metrics might require more fine-grained treatment ...

@GaelVaroquaux (Member) commented Feb 28, 2019

@ogrisel (Member) commented Feb 28, 2019

Note that enforcing threading can lead to oversubscription. We should wait for #13297 to dynamically set the maximum number of threads for inner parallelism.

I believe this should not have been merged before we had the opportunity to rerun the benchmarks with the oversubscription protection currently being implemented in #13297 merged and used in this PR.

@ogrisel (Member) commented Feb 28, 2019

Also, as I said earlier, it's really hard to analyze the benchmark results in their current form. Making comparisons requires a lot of scrolling.

@GaelVaroquaux (Member) commented Feb 28, 2019

Note that enforcing threading can lead to oversubscription. We should wait for #13297 to dynamically set the maximum number of threads for inner parallelism.

I believe this should not have been merged before we had the opportunity to rerun the benchmarks with the oversubscription protection currently being implemented in #13297 merged and used in this PR.

OK. I'll take the blame. I thought that @pierreglaser's first benchmarks were quite conclusive.

@ogrisel (Member) commented Feb 28, 2019

Yes, but @pierreglaser's benchmark had MKL_NUM_THREADS=1 fixed in the environment, which:

  • is not representative of what a naive user will have;
  • makes sequential (n_jobs=1) look worse than it is;
  • hides the fact that the loky backend has built-in oversubscription protection while the threading backend does not, because it is more complicated to get right for threads.

@GaelVaroquaux (Member) commented Feb 28, 2019

Yes, granted. I had not seen that. I should have, given that I am aware of oversubscription problems. But I did not.

Is prioritizing #13297 a way forward (to be able to control for oversubscription, and then redo the benchmarks)?

@ogrisel (Member) commented Feb 28, 2019

@zanospi's benchmarks cover all the interesting cases, but:

  • they are hard to read;
  • they should probably be re-run once this PR leverages manual oversubscription protection as implemented in #13297 (or even if we fix the oversubscription protection for the threading backend of joblib directly upstream in joblib, now that @tomMoral and @jeremiedbb think they have found the fix for the failing loky helpers PR tests).

I don't think it is necessary to revert the merge, as I think this PR probably already improves performance for the majority of our users.

@jeremiedbb (Contributor) commented Feb 28, 2019

We can see oversubscription with the threading backend on 40 cores, but it's still not slower than the loky backend; in any case, it's at least even with it. It will be better when we have the C-lib helpers, but in the meanwhile it's still an improvement, and a really significant one for low n_jobs, e.g. a speedup instead of a slowdown :)

@rth (Member) commented Mar 1, 2019

I agree there may be oversubscription issues, but we would indeed see those more at 20-40 CPU cores, while most users have at most 4-8 CPU cores, where this is a definite improvement.

In the former case, I would argue it also goes a bit into the HPC domain, and while it would certainly be nice if we could handle oversubscription, in that case IMO it is also partly the user's responsibility to know how to use their compute server efficiently and to be aware of the different levels of parallelism going on (at least for BLAS).

@zanospi (Contributor) commented Mar 1, 2019

An easier-to-read benchmark, comparing just the loky and threading backends for different n_jobs > 1, with no constraint on MKL_NUM_THREADS:

Representation   Metric        n_jobs   loky (s)   threading (s)   Ratio (loky / threading)
dense            cosine            2       4            0.7        5.71429
_                _                 4       3.3          0.8        4.125
_                _                40       3.9          3.3        1.18182
_                euclidean         2       4.3          0.9        4.77778
_                _                 4       3.2          0.9        3.55556
_                _                40       4            3.3        1.21212
_                manhattan         2      17.8         13.6        1.30882
_                _                 4      11.1          7.6        1.46053
_                _                40       6.2          2          3.1
_                correlation       2      17.4         14          1.24286
_                _                 4      10.6          7.8        1.35897
_                _                40       5            2.1        2.38095
sparse           cosine            2       3.8          1.5        2.53333
_                _                 4       3            1.1        2.72727
_                _                40       3.6          1.7        2.11765
_                euclidean         2       3.9          1.6        2.4375
_                _                 4       3.1          1.2        2.58333
_                _                40       3.5          1.8        1.94444
_                manhattan         2      17.9         13.8        1.2971
_                _                 4      11.1          7.5        1.48
_                _                40       6            1.8        3.33333

@GaelVaroquaux (Member) commented Mar 1, 2019

xhlulu added a commit to xhlulu/scikit-learn that referenced this pull request Apr 28, 2019

Enforce threading in parallel pairwise (scikit-learn#13310)
* ENH enforce threading backend in _parallel_pairwise

* OPT write inplace to the matrix in the workers

* CLN cosmetics

* CLN renaming, cosmetics

* CLN renaming

* DOC whats_new

* DOC rephrasing

xhlulu added a commit to xhlulu/scikit-learn that referenced this pull request Apr 28, 2019

xhlulu added a commit to xhlulu/scikit-learn that referenced this pull request Apr 28, 2019
