
ENH add n_jobs to mutual_info_regression and mutual_info_classif #28085

Merged
19 commits merged into scikit-learn:main on Jan 18, 2024

Conversation

netomenoci
Contributor

Reference Issues/PRs

Feat: #27795 (comment)

What does this implement/fix? Explain your changes.

This implements the addition of the parameter n_jobs to sklearn.feature_selection.mutual_info_regression and sklearn.feature_selection.mutual_info_classif.


github-actions bot commented Jan 9, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5a54c2f.

@netomenoci netomenoci changed the title from "add n_jobs to mutual_info_regression and mutual_info_classif" to "FEAT add n_jobs to mutual_info_regression and mutual_info_classif" on Jan 9, 2024
@glemaitre
Member

We will need an entry in doc/whats_new/v1.5.rst to acknowledge adding the parameter.
We also need a test that tries different n_jobs values. For instance, we can test that n_jobs=None provides the same results as n_jobs=2.

Review thread: doc/whats_new/v1.5.rst (outdated)
@netomenoci
Contributor Author

@glemaitre thanks for the suggestions. I have implemented them.


@glemaitre glemaitre left a comment


It almost looks good. Here are a couple of suggestions.

Review threads:
doc/whats_new/v1.5.rst (outdated)
sklearn/feature_selection/_mutual_info.py
sklearn/feature_selection/_mutual_info.py (outdated)
sklearn/feature_selection/_mutual_info.py (outdated)
sklearn/feature_selection/_mutual_info.py (outdated)
sklearn/feature_selection/tests/test_mutual_info.py (outdated)
sklearn/feature_selection/tests/test_mutual_info.py (outdated)
netomenoci and others added 2 commits January 15, 2024 12:09
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@netomenoci
Contributor Author

It almost looks good. Here are a couple of suggestions.

@glemaitre I have applied the suggestions, thanks. One of the CircleCI tests is failing, but I don't know why.

@@ -252,3 +253,18 @@ def test_mutual_info_regression_X_int_dtype(global_random_seed):
expected = mutual_info_regression(X_float, y, random_state=global_random_seed)
result = mutual_info_regression(X, y, random_state=global_random_seed)
assert_allclose(result, expected)


@pytest.mark.parametrize(
Member

@netomenoci I pushed a piece of code that shows how to do the parallelization, in case you are interested.

Contributor Author

awesome, thanks :)



@glemaitre glemaitre left a comment


LGTM. We will need a second review.

@glemaitre glemaitre added the "Quick Review" and "Waiting for Second Reviewer" labels on Jan 15, 2024
@glemaitre glemaitre changed the title from "FEAT add n_jobs to mutual_info_regression and mutual_info_classif" to "ENH add n_jobs to mutual_info_regression and mutual_info_classif" on Jan 15, 2024
@glemaitre
Member

One of the CircleCI tests is failing, but I don't know why.

It was because of a git conflict.

@lesteve
Member

lesteve commented Jan 18, 2024

Here is a quick benchmark, loosely adapted from #27795 (comment) (which unless I am mistaken was not really a benchmark of parallelizing mutual_info_regression).

I get a maximum speedup of about 2.5× on my laptop, which has 8 logical cores (4 physical cores + hyper-threading).

from sklearn.datasets import make_sparse_uncorrelated
from sklearn.feature_selection import mutual_info_regression


n_features = 100
n_samples = int(1e4)
X, y = make_sparse_uncorrelated(random_state=0, n_features=n_features, n_samples=n_samples)

print('n_jobs=1')
%timeit mutual_info_regression(X, y, n_jobs=1)

print('n_jobs=4')
%timeit mutual_info_regression(X, y, n_jobs=4)

print('n_jobs=8')
%timeit mutual_info_regression(X, y, n_jobs=8)

I get:

n_jobs=1
3.56 s ± 127 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=4
1.41 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=8
1.34 s ± 28.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@@ -201,11 +202,13 @@ def _iterate_columns(X, columns=None):
def _estimate_mi(
X,
y,
*,
Member

I used keyword arguments to make the `_estimate_mi` call more readable, and also made the arguments keyword-only in the `_estimate_mi` definition.

I guess that's fine since `_estimate_mi` is private. @glemaitre do you agree?


@glemaitre glemaitre Jan 18, 2024


Yes, I do agree.

@lesteve lesteve merged commit 06e566e into scikit-learn:main Jan 18, 2024
27 checks passed
@lesteve
Member

lesteve commented Jan 18, 2024

Let's merge this one, thanks @netomenoci!

@netomenoci
Contributor Author

Using the sklearn wrapper of Parallel slows it down because there is some extra overhead on top of joblib (e.g. copies).

If you think it's useful, we can change (a one-line change) the import of Parallel to come from joblib instead of sklearn utils, and it should be faster and closer to the previously mentioned benchmark.

@glemaitre
Member

If you think it's useful, we can change (a one-line change) the import of Parallel to come from joblib instead of sklearn utils, and it should be faster and closer to the previously mentioned benchmark.

We cannot do that, because we would then have issues with the global config. That is the reason for having the wrapper.

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024
…kit-learn#28085)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Labels
module:feature_selection; Quick Review (for PRs that are quick to review); Waiting for Second Reviewer (first reviewer is done, need a second one!)

4 participants