# scikit-learn/scikit-learn


# Euclidean pairwise_distances slower for n_jobs > 1 #8216

Opened Jan 19, 2017 · 22 comments
### rth commented Jan 19, 2017

In a follow-up of issue #8213, it looks like using `n_jobs > 1` in Euclidean `pairwise_distances` makes computations slower instead of speeding them up.

#### Steps to reproduce

```python
# run in IPython / Jupyter (uses the %timeit magic)
from sklearn.metrics import pairwise_distances
import numpy as np

np.random.seed(99999)
n_dim = 200

for n_train, n_test in [(1000, 100000),
                        (10000, 10000),
                        (100000, 1000)]:
    print('\n# n_train={}, n_test={}, n_dim={}\n'.format(
        n_train, n_test, n_dim))

    X_train = np.random.rand(n_train, n_dim)
    X_test = np.random.rand(n_test, n_dim)

    for n_jobs in [1, 2]:
        print('n_jobs=', n_jobs, ' => ', end='')
        %timeit pairwise_distances(X_train, X_test, 'euclidean', n_jobs=n_jobs, squared=True)
```

which on a 2-core CPU returns,

```
# n_train=1000, n_test=100000, n_dim=200

n_jobs= 1  => 1 loop, best of 3: 1.92 s per loop
n_jobs= 2  => 1 loop, best of 3: 4.95 s per loop

# n_train=10000, n_test=10000, n_dim=200

n_jobs= 1  => 1 loop, best of 3: 1.89 s per loop
n_jobs= 2  => 1 loop, best of 3: 4.74 s per loop

# n_train=100000, n_test=1000, n_dim=200

n_jobs= 1  => 1 loop, best of 3: 2 s per loop
n_jobs= 2  => 1 loop, best of 3: 5.6 s per loop
```

While for small datasets it would make sense that parallel processing does not improve performance because of the multiprocessing overhead, this is by no means a small dataset. The compute time also does not decrease when using e.g. n_jobs=4 on a 4-core CPU.

This also holds for other numbers of dimensions:

**n_dim=10**

```
# n_train=1000, n_test=100000, n_dim=10

n_jobs= 1  => 1 loop, best of 3: 873 ms per loop
n_jobs= 2  => 1 loop, best of 3: 4.25 s per loop
```

**n_dim=1000**

```
# n_train=1000, n_test=100000, n_dim=1000

n_jobs= 1  => 1 loop, best of 3: 6.56 s per loop
n_jobs= 2  => 1 loop, best of 3: 8.56 s per loop
```

Running `benchmarks/bench_plot_parallel_pairwise.py` also yields similar results (benchmark plot not reproduced here). This might affect a number of estimators / metrics where `pairwise_distances` is used.

#### Versions

```
Linux-4.6.0-gentoo-x86_64-Intel-R-_Core-TM-_i5-6200U_CPU_@_2.30GHz-with-gentoo-2.3
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1
```

I also get similar results with scikit-learn 0.17.1


### jnothman commented Jan 20, 2017

Parallelism is always fraught with this kind of issue. There are multiple factors in whether it will succeed, and very often it won't. Again, you've not varied n_features, which might be an important factor. n_jobs may be more useful when the calculation is slow, i.e. non-euclidean, or over sparse matrices or whatnot. I can't remember if we experimented with multithreading for this; it might help.
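
For reference, a minimal sketch of the multithreading idea, assuming a standalone joblib install and slicing along `Y` the way `_parallel_pairwise` does (the helper name `threaded_pairwise` is made up for illustration, not scikit-learn API):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices


def threaded_pairwise(X, Y, metric='euclidean', n_jobs=2):
    # dispatch column blocks of the result to threads; the heavy work in
    # the BLAS-backed metrics releases the GIL, and nothing gets pickled
    chunks = Parallel(n_jobs=n_jobs, backend='threading')(
        delayed(pairwise_distances)(X, Y[s], metric, n_jobs=1)
        for s in gen_even_slices(Y.shape[0], n_jobs))
    return np.hstack(chunks)
```

Whether this actually helps for the cheap metrics is exactly what the rest of the thread goes on to measure.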

### rth commented Jan 20, 2017

> Parallelism is always fraught with this kind of issue. There are multiple factors in whether it will succeed, and very often it won't. [..] n_jobs may be more useful when the calculation is slow, i.e. non-euclidean, or over sparse matrices or whatnot.

Yes, I know. I tried a quick parallelization of `pairwise_distances(..., n_jobs=1)`,

```python
import numpy as np
from sklearn.externals.joblib import Parallel, delayed  # or: from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances


def _naive_parallel_pairwise_distances(X, Y, metric, batch_size=100,
                                        n_jobs=1, squared=True):
    queue = []
    n_samples = X.shape[0]
    for k in range(n_samples // batch_size + int(n_samples % batch_size != 0)):
        mslice = slice(k * batch_size, min((k + 1) * batch_size, n_samples))
        X_sl = X[mslice, :]
        queue.append(delayed(pairwise_distances)(X_sl, Y, metric,
                                                 n_jobs=1, squared=squared))
    res = Parallel(n_jobs=n_jobs)(queue)
    # the chunks are blocks of rows of X, so stack them vertically
    return np.vstack(res)
```

and this yields comparable performance, so it must be as you say that `n_jobs` would better improve non-euclidean distances etc. that take longer to compute.

> Again, you've not varied n_features, which might be an important factor.

I have tried `n_features` in [10, 200, 1000] in my original post above (sorry, I called it `n_dim`); in all cases `n_jobs > 1` makes things slower, including for large arrays `X [100000, 1000]`, `Y [1000, 1000]`, which are reaching the limit of what I can compute with 8 GB RAM.

The question is: if `n_jobs > 1` always makes pairwise euclidean distances slower, no matter `n_samples` or `n_features` (as all the benchmarks suggest so far), then shouldn't some warning be printed to the user when `n_jobs > 1` is used? Particularly since euclidean distance is the default (and probably the most frequently used) distance metric.

Also, when `benchmarks/bench_plot_parallel_pairwise.py` was added by @mblondel in 3b8f54e, I imagine it aimed to demonstrate that using `n_jobs > 1` could speed up Euclidean / RBF distance calculations (couldn't find any plots / PR from that time). When I run this benchmark now, it illustrates that using `n_jobs > 1` makes things slower (cf. images in the original post above), which doesn't really make sense to show in a benchmark...
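
For concreteness, a hypothetical call of the sketch above on the arrays from the first comment (names as defined there):

```python
# illustration only: relies on _naive_parallel_pairwise_distances and the
# X_train / X_test arrays defined earlier in this thread
D = _naive_parallel_pairwise_distances(X_train, X_test, 'euclidean',
                                        batch_size=1000, n_jobs=2)
assert D.shape == (X_train.shape[0], X_test.shape[0])
```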

### rth commented Jan 20, 2017 • edited

It looks like the cost of memory copies is the issue here. For instance, if we rewrite the parallel calculations to save the results to a shared array (adapted from this joblib example) in this gist, we get the results below (on an 8-core Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz),

```
# n_train=100000, n_test=1000, n_dim=1000

## Backend : pairwise_distances
n_jobs= 1  => 1 loop, best of 3: 2.63 s per loop
n_jobs= 2  => 1 loop, best of 3: 8.43 s per loop
n_jobs= 4  => 1 loop, best of 3: 9.15 s per loop
n_jobs= 8  => 1 loop, best of 3: 14.5 s per loop

## Backend : mmap_pairwise_distances
n_jobs= 1  => 1 loop, best of 3: 3.7 s per loop
n_jobs= 2  => 1 loop, best of 3: 3.43 s per loop
n_jobs= 4  => 1 loop, best of 3: 2.9 s per loop
n_jobs= 8  => 1 loop, best of 3: 2.83 s per loop
```

So while the compute time of `pairwise_distances` keeps increasing as more CPU cores are used, using a shared array to store the results at least shows some improvement with a larger number of CPU cores (though the scaling is not great and it still does not beat single-core performance)...
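
The gist itself is not reproduced in the thread; below is a rough reconstruction of the shared-output-array idea, loosely following the joblib parallel-memmap example. All names are made up and this is not the code that produced the numbers above:

```python
import os
import tempfile

import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices


def _fill_block(X, Y, out, sl, metric):
    # each worker writes its block of rows directly into the shared memmap,
    # so no result array has to be pickled back to the parent process
    out[sl] = pairwise_distances(X[sl], Y, metric, n_jobs=1)


def mmap_pairwise_distances(X, Y, metric='euclidean', n_jobs=2):
    tmp_dir = tempfile.mkdtemp()
    out = np.memmap(os.path.join(tmp_dir, 'out.mmap'), dtype=np.float64,
                    mode='w+', shape=(X.shape[0], Y.shape[0]))
    Parallel(n_jobs=n_jobs)(
        delayed(_fill_block)(X, Y, out, sl, metric)
        for sl in gen_even_slices(X.shape[0], n_jobs))
    return out, tmp_dir
```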

### jnothman commented Jan 21, 2017

I don't think you're gaining anything there by dumping X and Y: unless some optimisation is involved, there are still refs held by the calling code. However, I agree that creating memmapped storage for the result seems wise. Note that you're deleting the memmapped file before returning it, which will work (since it's an open file) on *NIX, but I'm not sure it'll work on Windows.

### rth commented Jan 23, 2017

> I don't think you're gaining anything there by dumping X and Y: unless some optimisation is involved, there are still refs held by the calling code.

You are right; it looks like joblib with the multiprocessing backend does the automatic memmapping of input arrays anyway if they are larger than 1 MB.

> However, I agree that creating memmapped storage for the result seems wise. Note that you're deleting the memmapped file before returning it, which will work (since it's an open file) on *NIX, but I'm not sure it'll work on Windows.

True, I should have copied the data to memory and closed the mmap file before deleting it...
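
In code, the cleanup described here could look like the following, continuing the hypothetical `mmap_pairwise_distances` sketch from a few comments above (names come from that sketch and from the first comment):

```python
import shutil

import numpy as np

out, tmp_dir = mmap_pairwise_distances(X_train, X_test, n_jobs=2)
result = np.array(out)                      # copy the memmapped result into RAM
del out                                     # release the open memmap first
shutil.rmtree(tmp_dir, ignore_errors=True)  # then delete the backing file
```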

### ogrisel commented Feb 8, 2017

 @rth I wonder if this case would not run faster with the threading backend of joblib. Can you please re-run your benchmarks with this backend as a new option?

### ogrisel commented Feb 8, 2017

You should also run the tests on a larger machine with 16 cores to see the scalability profile as n_jobs is increased.

### rth commented Feb 8, 2017

@ogrisel I looked a bit more into this, and considered the influence of 3 parameters,

• parallel backend in `['multiprocessing', 'threading']`
• memmapping the output array or not
• setting `MKL_NUM_THREADS` in `[1, 8]` to disable multithreading in numpy when using a default conda install (see the snippet below)

This benchmarks the `_parallel_pairwise` function, run for

• dense and sparse CSR arrays
• metrics in `['euclidean', 'cosine', 'l1']`
• input array sizes `(n_x, n_y, n_dim)` in

```
[(100000, 1000, 1000),
 (10000, 10000, 1000),
 (10000, 10000, 10),
 (10000, 10000, 10000)]
```

For sparse arrays, `n_dim` is multiplied by 10 to account for the lower density.
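
For reference, a minimal sketch of how the BLAS thread count is pinned for such runs, assuming an MKL-backed NumPy; the variable has to be set before NumPy initialises its BLAS backend (or exported in the shell before launching the script):

```python
import os

# cap MKL multithreading before numpy is imported, otherwise the setting
# may not be picked up; use '8' for the multithreaded configurations
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np  # noqa: E402  (imported after the environment is set)
```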

The benchmark script can be found in this gist. Here are the results of the runs, using an Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz:

| Metric \ is_sparse_csr | Dense | Sparse CSR |
| --- | --- | --- |
| euclidean | http://pastebin.com/Wvn40Arh | http://pastebin.com/4cMBr2xV |
| cosine | http://pastebin.com/wX0pZiMQ | |
| l1 | in progress.. | |

At least for this dataset of medium size (X array: 0.08 GB, Y array: 0.08 GB, result array: 0.8 GB), the conclusions appear to be that,

• for all possible combinations of parameters, using `n_jobs=1` yields the best performance
• the multiprocessing backend decreases performance significantly with `n_jobs` in this use case
• the threading backend at best improves performance by 10-20% and doesn't significantly outperform `n_jobs=1`
• setting the number of numpy threads with `MKL_NUM_THREADS` doesn't seem to have an impact
• pre-allocating the result array and providing it to joblib so it would be memmapped appears to work worse than doing the same thing but creating the memmapped file by hand (see comment above)

It looks like everything is memory bound, and globally using more cores does not help. Will try testing scipy metrics (e.g. `jaccard`) next..


### rth commented Feb 8, 2017

Parallelizing the jaccard distance works much better, because the computations take much longer and are no longer memory bound. The choice of the parallel backend doesn't seem to matter in this case,

```
(sklearn-env) [ec2-user@ip-172-31-11-89 ~]$ ipython pairwise_distances_benchmark.py False jaccard

# sparse=False, n_x=100000, n_y=1000, n_dim=1000
# X array: 0.8 GB, Y array 0.008 GB, result array 0.8 GB
# metric = jaccard

## {'mmap_result': False, 'backend': 'multiprocessing', 'MKL_NUM_THREADS': 8}
n_jobs= 1  => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2  => 1 loop, best of 1: 2min 40s per loop
n_jobs= 4  => 1 loop, best of 1: 1min 25s per loop
n_jobs= 8  => 1 loop, best of 1: 44.4 s per loop
n_jobs= 16 => 1 loop, best of 1: 33.4 s per loop

## {'mmap_result': True, 'backend': 'multiprocessing', 'MKL_NUM_THREADS': 8}
n_jobs= 1  => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2  => 1 loop, best of 1: 2min 34s per loop
n_jobs= 4  => 1 loop, best of 1: 1min 18s per loop
n_jobs= 8  => 1 loop, best of 1: 49.6 s per loop
n_jobs= 16 => 1 loop, best of 1: 34.3 s per loop

## {'MKL_NUM_THREADS': 8, 'backend': 'threading'}
n_jobs= 1  => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2  => 1 loop, best of 1: 2min 34s per loop
n_jobs= 4  => 1 loop, best of 1: 1min 17s per loop
n_jobs= 8  => 1 loop, best of 1: 42.4 s per loop
n_jobs= 16 => 1 loop, best of 1: 30.7 s per loop
```

Here is a sample script to estimate the relative compute time of different metrics,

```python
import numpy as np
from sklearn.metrics.pairwise import (pairwise_distances,
                                      PAIRWISE_DISTANCE_FUNCTIONS,
                                      PAIRWISE_BOOLEAN_FUNCTIONS)
from IPython import get_ipython

ipython = get_ipython()

n_x, n_y, n_dim = 10000, 1000, 1000

X = np.random.rand(n_x, n_dim)
Y = np.random.rand(n_y, n_dim)

print('\n# n_x={}, n_y={}, n_dim={}'.format(n_x, n_y, n_dim))

for metric in list(PAIRWISE_DISTANCE_FUNCTIONS.keys()) + PAIRWISE_BOOLEAN_FUNCTIONS:
    print(' * {:10} metric => '.format(metric), end='')
    ipython.magic("timeit -n 1 -r 1 pairwise_distances(X, Y, metric, n_jobs=1)")
```

which produces,

```
# n_x=10000, n_y=1000, n_dim=1000
 * precomputed    metric => 1 loop, best of 1: 6.22 ms per loop
 * cityblock      metric => 1 loop, best of 1: 9.98 s per loop
 * euclidean      metric => 1 loop, best of 1: 685 ms per loop
 * manhattan      metric => 1 loop, best of 1: 10.3 s per loop
 * l2             metric => 1 loop, best of 1: 721 ms per loop
 * cosine         metric => 1 loop, best of 1: 836 ms per loop
 * l1             metric => 1 loop, best of 1: 10.4 s per loop
 * dice           metric => 1 loop, best of 1: 20.2 s per loop
 * jaccard        metric => 1 loop, best of 1: 40.7 s per loop
 * kulsinski      metric => 1 loop, best of 1: 19.9 s per loop
 * matching       metric => 1 loop, best of 1: 7.08 s per loop
 * rogerstanimoto metric => 1 loop, best of 1: 17.4 s per loop
 * russellrao     metric => 1 loop, best of 1: 13.3 s per loop
 * sokalmichener  metric => 1 loop, best of 1: 17.4 s per loop
 * sokalsneath    metric => 1 loop, best of 1: 19.7 s per loop
 * yule           metric => 1 loop, best of 1: 25.4 s per loop
```

Maybe it would be worth raising a warning when `pairwise_distances` is used with the faster metrics `['euclidean', 'l2', 'cosine']` and `n_jobs > 1`, saying that the latter option may lead to worse performance?

### jnothman commented Feb 8, 2017

Start with a comment in the documentation?

### rth commented Feb 8, 2017

> Start with a comment in the documentation?

True, but which part of the docs? `pairwise_distances` calls that expose a user-provided `n_jobs` parameter are used (directly or indirectly via `NearestNeighbors`) in:

• `silhouette_score`
• `LocalOutlierFactor`
• `DBSCAN`
• `NearestNeighbors`
• `KNeighborsClassifier`
• `KNeighborsRegressor`
• `Isomap`
• `kneighbors_graph`
• `mean_shift`
• `LabelPropagation`

which means that, if this issue is confirmed, using `n_jobs > 1` may potentially make those estimators / functions (significantly) slower. I imagine the impact of this is somewhat limited by the fact that `NearestNeighbors` uses the kd_tree or ball_tree algorithm by default (cf. #8213), even in cases when brute force with `pairwise_distances` would have been more efficient, but that's another issue..
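
For instance, with a brute-force neighbors search the `n_jobs` value is forwarded down to the pairwise distance computation, so a call like the following (illustration only, arrays as in the first comment) is where the slowdown would surface:

```python
from sklearn.neighbors import NearestNeighbors

# algorithm='brute' routes kneighbors through pairwise distances,
# and n_jobs is forwarded to that computation
nn = NearestNeighbors(n_neighbors=5, algorithm='brute',
                      metric='euclidean', n_jobs=2)
nn.fit(X_train)
dist, ind = nn.kneighbors(X_test)
```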

### jnothman commented Feb 9, 2017

That's a fair comment. I don't mind a warning if we're able to get a good idea of when to warn.

### lesteve commented Feb 9, 2017

> pre-allocating the result array and providing it to joblib so it would be memmapped appears to work worse than doing the same thing but creating the memmapped file by hand (see comment above)

I had a look at this a few weeks ago and AFAICT the reason may be that `sklearn.metrics.pairwise._parallel_pairwise` does:

```python
ret = Parallel(n_jobs=n_jobs, verbose=0)(
    fd(X, Y[s], **kwds)
    for s in gen_even_slices(Y.shape[0], n_jobs))
```

For the auto memmapping to be more useful I think it should do:

```python
ret = Parallel(n_jobs=n_jobs, verbose=0)(
    another_fd(X, Y, s, **kwds)
    for s in gen_even_slices(Y.shape[0], n_jobs))
```
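
A self-contained sketch of this suggestion (the wrapper name is hypothetical; the point is that the full `Y` is passed to every task so joblib's auto-memmapping can share a single copy, and the slicing happens inside the worker):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.utils import gen_even_slices


def _compute_slice(func, X, Y, s, **kwds):
    # slice inside the worker: Y is shared via joblib's auto-memmapping
    # instead of a fresh Y[s] copy being pickled for every task
    return func(X, Y[s], **kwds)


def parallel_pairwise_sketch(X, Y, func, n_jobs, **kwds):
    ret = Parallel(n_jobs=n_jobs, verbose=0)(
        delayed(_compute_slice)(func, X, Y, s, **kwds)
        for s in gen_even_slices(Y.shape[0], n_jobs))
    return np.hstack(ret)
```

It would be called with e.g. `func=sklearn.metrics.pairwise.euclidean_distances`.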

### GaelVaroquaux commented Feb 9, 2017

Yes. Maybe we should document this in joblib. Maybe it's time for joblib to have a set of examples that discuss such efficiency trade-offs.

### rth commented Feb 9, 2017

Thanks @lesteve, I'll test that option. Also, would you have an opinion on why auto memmapping the output array (results with `'mmap_result': True` here) has significantly worse performance in this case than creating the memmap file by hand and providing it to joblib (results here)? Though in the latter case I was using the `squared=True` option, so maybe that has an impact. Or maybe I missed something else...

### lesteve commented Feb 9, 2017

> Also, would you have an opinion on why auto memmapping the output array (results with `'mmap_result': True` here) has significantly worse performance in this case than creating the memmap file by hand and providing it to joblib (results here)?

Hmmm, not off the top of my head, I have to say.

### ogrisel commented Feb 9, 2017 • edited

As threading is not impacted by the GIL for those workloads, there is really no point in using multiprocessing + memmapping: that is just a waste of disk access and memory (multiprocessing + memmapping still does at least one memory copy that is not required when threading). The fact that cosine / euclidean do not get a speed-up with `n_jobs > 1` is because those workloads are memory bound. It might be possible to find clever ways to block the data differently depending on the dimensions, but I think it's primarily BLAS level 1 and 2 operations that dominate, and those are memory bound. So +1 for using the threading backend by default and issuing a warning when using `n_jobs > 1` on memory-bound metrics.
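
A sketch of what such a warning could look like (the metric list and helper name are placeholders, not what eventually landed in scikit-learn):

```python
import warnings

# metrics that reduce to BLAS calls and are memory bound rather than CPU bound
_MEMORY_BOUND_METRICS = {'euclidean', 'l2', 'cosine'}


def _warn_if_pointless_n_jobs(metric, n_jobs):
    if n_jobs is not None and n_jobs > 1 and metric in _MEMORY_BOUND_METRICS:
        warnings.warn("metric=%r is memory bound; n_jobs=%d is unlikely to be "
                      "faster than n_jobs=1 and may be slower." % (metric, n_jobs))
```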

### rth commented Feb 9, 2017 • edited

Thanks @ogrisel, I'll submit a PR. Need to check that threading performs as well or better for all metrics though.. It is interesting that BLAS level 1 and level 2 operations (dot products, matrix-vector multiplications, ..) are memory bound. I imagine this would also explain why parallelizing the predict method (e.g. in linear models) may not yield any speed-up (related issues #7448, #7635)...

### ogrisel commented Feb 9, 2017

Yes, indeed I mean level 1 and 2: vec x vec and mat x vec are memory bound. Only mat x mat has an arithmetic intensity large enough to benefit from multiple cores. But it also depends on how it's "blocked". MKL and OpenBLAS do that for us though.

### ogrisel commented Feb 10, 2017

Actually, for cosine similarities and euclidean distances there is a level 3 GEMM involved, so at least this step should benefit from multiple cores without running into the memory bandwidth limit. But the BLAS implementation (MKL or OpenBLAS) should have no problem multithreading that part, and there is probably no point in trying to manually handle the parallelism at a coarser level for those GEMM-based distances.
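
For context, the GEMM in question comes from the usual expansion of squared euclidean distances; a minimal sketch of the standard trick (not a copy of scikit-learn's `euclidean_distances`):

```python
import numpy as np


def squared_euclidean_gemm(X, Y):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>: the cross term is a single
    # GEMM (X @ Y.T), which MKL / OpenBLAS already parallelise internally
    XX = np.einsum('ij,ij->i', X, X)[:, np.newaxis]
    YY = np.einsum('ij,ij->i', Y, Y)[np.newaxis, :]
    D = XX + YY - 2.0 * np.dot(X, Y.T)
    np.maximum(D, 0, out=D)  # clip small negative values due to round-off
    return D
```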


### jnothman referenced this issue Dec 10, 2017

#### [MRG+1] ENH Add working_memory global config for chunked operations #10280 (merged)


### jnothman commented May 30, 2018

Switching whether to use threading based on the choice of metric may make sense.
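
A sketch of what such a switch could look like (hypothetical helper; the metric grouping is only indicative and not what was ultimately merged):

```python
# BLAS-backed metrics release the GIL and are memory bound, so threads fit
# better; slower Python/SciPy metrics can still justify processes
_GIL_FREE_METRICS = {'euclidean', 'l2', 'cosine'}


def _backend_for_metric(metric):
    return 'threading' if metric in _GIL_FREE_METRICS else 'multiprocessing'
```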



### rth commented Mar 1, 2019

This was resolved by using a threading backend for `pairwise_distances` calculations in #13310.

### rth closed this Mar 1, 2019
