Dictionary learning is slower with n_jobs > 1 #4769

Open

arthurmensch opened this issue May 26, 2015 · 10 comments

@arthurmensch
Contributor

Setting n_jobs > 1 in MiniBatchDictionaryLearning (and in the dict_learning_online function) leads to worse performance than n_jobs = 1.

Multiprocessing is handled in sklearn.decomposition's dict_learning.py, around line 249:

    code_views = Parallel(n_jobs=n_jobs)(
        delayed(_sparse_encode)(
            X[this_slice], dictionary, gram, cov[:, this_slice], algorithm,
            regularization=regularization, copy_cov=copy_cov,
            init=init[this_slice] if init is not None else None,
            max_iter=max_iter)
        for this_slice in slices)
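
For context, slices here comes from splitting the samples into one contiguous chunk per job (via sklearn.utils.gen_even_slices, if I read the source correctly). A quick illustration of why that hurts when the input is only a small batch:

    from sklearn.utils import gen_even_slices

    # With many samples, each job gets a sizeable chunk of work.
    print(list(gen_even_slices(1000, 4)))  # 4 slices of 250 samples each

    # With a batch-sized input, each slice holds a single sample, so the
    # per-job dispatch cost dwarfs the actual work.
    print(list(gen_even_slices(3, 4)))     # only 3 slices of 1 sample each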

Minimal example:
https://gist.github.com/arthurmensch/091d16c135f4a3ba5580

Output with n_jobs = 1:

Distorting image...
Extracting reference patches...
done in 0.05s.
Learning the dictionary...
done in 5.12s.
Extracting noisy patches... 
done in 0.02s.
Lasso LARS...
done in 10.24s.

Output with n_jobs = 2:

Distorting image...
Extracting reference patches...
done in 0.05s.
Learning the dictionary...
done in 78.98s.
Extracting noisy patches... 
done in 0.02s.
Lasso LARS...
done in 6.15s.

Output with n_jobs = 4:

Distorting image...
Extracting reference patches...
done in 0.05s.
Learning the dictionary...
done in 83.24s.
Extracting noisy patches... 
done in 0.02s.
Lasso LARS...
done in 3.82s.

We can see that the transform method of MiniBatchDictionaryLearning (which relies on the sparse_encode function) benefits from multiprocessing as expected: the Lasso LARS step drops from 10.24s with one job to 3.82s with four. Learning the dictionary, however, jumps from 5.12s to about 80s.

Dictionary learning itself relies on successive calls to the sparse_encode function; the slowdown probably comes from there.

@arthurmensch
Contributor Author

The offending line is at line 692 of dict_learning.py:

        this_code = sparse_encode(this_X, dictionary.T, algorithm=method,
                                  alpha=alpha, n_jobs=n_jobs).T

We have

    this_X.shape[0] == batch_size

which is typically of the same order as n_jobs, so the time spent in sparse_encode may be dominated by joblib dispatch overhead rather than by the lasso computation itself.
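
As a minimal timing sketch of that hypothesis (the shapes, alpha and loop count below are made up for illustration, not taken from the gist): calling sparse_encode repeatedly on a tiny batch, the n_jobs=2 variant pays the parallel dispatch cost on every call.

    from time import time
    import numpy as np
    from sklearn.decomposition import sparse_encode

    rng = np.random.RandomState(0)
    dictionary = rng.randn(100, 64)   # (n_components, n_features)
    X_batch = rng.randn(3, 64)        # batch_size is of the same order as n_jobs

    for nj in (1, 2):
        t0 = time()
        for _ in range(200):          # mimic the tight fit loop
            sparse_encode(X_batch, dictionary, algorithm='lasso_lars',
                          alpha=0.1, n_jobs=nj)
        print("n_jobs=%d: %.2fs" % (nj, time() - t0))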

@arthurmensch
Contributor Author

Fix in pull request #4773

@martinosorb

I believe I'm seeing this problem here: #4779

@amueller modified the milestone: 0.19 (Sep 29, 2016)
@jakirkham
Contributor

jakirkham commented Jul 4, 2018

Agreed. Have run into this myself as well. Basically Joblib would normally make sense for finding the sparse code for a precomputed dictionary. However in this specific instance, it is being called in a tight loop as part of dictionary learning. The overhead of starting Joblib in this tight loop is too costly. One is better off using the threaded parallelism that BLAS provides instead.

Edit: Alternatively it may make sense to start the Joblib thread or multiprocessing pool before the tight loop and keep it running so that the tight loop merely schedules jobs for this pool. Have not tested this latter approach, but it may work reasonably well.
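
A rough sketch of that latter idea with joblib's Parallel context manager (the _encode_one helper and the data are placeholders, not the actual scikit-learn code): the worker pool is created once and every call inside the loop reuses it.

    import numpy as np
    from joblib import Parallel, delayed

    def _encode_one(x):
        # Placeholder for the per-sample sparse coding step.
        return float(x.sum())

    rng = np.random.RandomState(0)
    batches = [rng.randn(4, 20) for _ in range(100)]  # small batches, as in the fit loop

    # The pool lives for the whole `with` block instead of being re-created
    # by every Parallel(...) call inside the loop.
    with Parallel(n_jobs=4) as parallel:
        for batch in batches:
            codes = parallel(delayed(_encode_one)(x) for x in batch)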

@amueller
Member

from #4773 (comment)

The joblib context manager API is implemented and available in #5016.

We did some profiling with @arthurmensch and apparently we should benefit a lot from skipping redundant inner input data validation checks by introducing a check_input=False flag for the functions called inside the fit loop (the default being check_input=True). Input validation takes a large fraction of the time and causes unnecessary GIL contention when using the threading backend.
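
For illustration, the pattern would look roughly like this (_inner_step is a hypothetical stand-in for the routines called inside the fit loop, not an actual scikit-learn function):

    import numpy as np
    from sklearn.utils import check_array

    def _inner_step(X, check_input=True):
        # Only validate when the caller has not already done so.
        if check_input:
            X = check_array(X, order='F', dtype=[np.float64, np.float32])
        # ... actual computation on X ...
        return X

    X = np.random.RandomState(0).randn(10, 5)
    X = check_array(X, order='F')          # validate once, before the loop
    for _ in range(100):
        _inner_step(X, check_input=False)  # skip the redundant checks inside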

@jakirkham
Contributor

FWIW there has been a lot of work on this (particularly in the past few weeks) and IMHO things are getting better.

The main things that remain slow on the sparse coding side are this copy and _pre_fit generally (though ~75% of that is just the aforementioned copy). A smaller amount of time is also spent in this asfortranarray call, but that is largely insignificant by comparison. It is worth noting that more time is spent in _pre_fit than enet_path (where the bulk of sparse coding happens). However _pre_fit and enet_path are generally more comparable now, which is a significant improvement.

Most of these improvements have involved skipping the assert_all_finite check, which is rather slow. More details on why that check is slow can be found in issue ( #11681 ).
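
As a user-side workaround, the finiteness check can also be disabled through scikit-learn's global configuration; a minimal sketch, only safe when the data is already known to contain no NaN or inf values:

    import numpy as np
    import sklearn
    from sklearn.decomposition import MiniBatchDictionaryLearning

    X = np.random.RandomState(0).randn(500, 20)

    # assert_all_finite becomes a no-op for code run inside this block.
    with sklearn.config_context(assume_finite=True):
        MiniBatchDictionaryLearning(n_components=15, random_state=0).fit(X)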

@jakirkham
Contributor

On the Joblib side, I would expect PR ( https://github.com/tomMoral/loky/pull/135 ) to end up being very helpful here, as the bigger issue is likely thread contention between the BLAS used and Joblib during sparse coding.
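
In the meantime, one common way to side-step that contention (a generic workaround, not something proposed in that PR) is to pin the BLAS thread pool inside the workers with threadpoolctl, for example:

    import numpy as np
    from joblib import Parallel, delayed
    from threadpoolctl import threadpool_limits

    def _task(a, b):
        # Keep BLAS single-threaded inside each worker so the joblib
        # processes do not oversubscribe the CPU cores.
        with threadpool_limits(limits=1, user_api='blas'):
            return a @ b

    rng = np.random.RandomState(0)
    mats = [(rng.randn(200, 200), rng.randn(200, 200)) for _ in range(8)]
    results = Parallel(n_jobs=4)(delayed(_task)(a, b) for a, b in mats)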

@cmarmo modified the milestones: 0.19, 1.0 (Jan 11, 2021)
@cmarmo
Member

cmarmo commented Jan 11, 2021

The gists are no longer available.
The following code

from time import time
import numpy as np
from sklearn.datasets import make_sparse_coded_signal
from sklearn.decomposition import MiniBatchDictionaryLearning
X, dictionary, code = make_sparse_coded_signal(
    n_samples=1000, n_components=15, n_features=20, n_nonzero_coefs=10,
    random_state=42)

n_jobs = [1, 2, 3, 4, 10]

for nj in n_jobs:
    dict_learner = MiniBatchDictionaryLearning(
        n_components=15, transform_algorithm='lasso_lars', random_state=42,
        n_jobs=nj
    )
    t0 = time()
    X_transformed = dict_learner.fit_transform(X)
    print("done in %0.3fs, with n_jobs = %d." % ((time() - t0), nj))

gives the following output

done in 5.199s, with n_jobs = 1.
done in 6.903s, with n_jobs = 2.
done in 8.619s, with n_jobs = 3.
done in 8.418s, with n_jobs = 4.
done in 9.727s, with n_jobs = 10.

With the following environment:

$ python -c "import sklearn; sklearn.show_versions()"

System:
    python: 3.9.1 (default, Dec  8 2020, 00:00:00)  [GCC 10.2.1 20201125 (Red Hat 10.2.1-9)]
executable: /home/cmarmo/.skldevenv/bin/python
   machine: Linux-5.9.16-200.fc33.x86_64-x86_64-with-glibc2.32

Python dependencies:
          pip: 20.3.3
   setuptools: 49.1.3
      sklearn: 1.0.dev0
        numpy: 1.19.4
        scipy: 1.5.4
       Cython: 0.29.21
       pandas: 1.1.5
   matplotlib: 3.3.3
       joblib: 1.0.0
threadpoolctl: 2.1.0

Built with OpenMP: True

It seems that this issue is still relevant?

@jakirkham
Contributor

Thanks for rechecking this. That is a good data point to have.

I had tried to Cythonize a good chunk of code and release the GIL for the bulk of it with PR ( #11874 ). Maybe that helps? Unfortunately don't have much time to work on that PR these days.

@adrinjalali
Member

Removing the milestone. Maybe @jeremiedbb would like to have a look?
