Multiprocessing over features rather than CV folds in Sequential Feature Selection (addressing #191) #193

Merged
16 commits merged into rasbt:master on May 18, 2017

Conversation

whalebot-helmsman
Contributor

Description

Use one process per feature (rather than per cross-validation fold) for sequential feature selection and exhaustive feature selection

Related issues or pull requests

Fixes #191
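
In essence, the change moves the joblib parallelism from the cross-validation folds to the candidate feature subsets. A minimal sketch of that scheme (the helper names are hypothetical, not the actual mlxtend internals):

```python
# Sketch of parallelizing over candidate feature subsets, one job per
# subset; helper names are hypothetical, not the mlxtend internals.
from itertools import combinations

from joblib import Parallel, delayed
from sklearn.model_selection import cross_val_score


def _score_subset(estimator, X, y, subset, cv):
    # Each job cross-validates one candidate subset; the CV folds run
    # sequentially inside the job.
    return subset, cross_val_score(estimator, X[:, subset], y, cv=cv).mean()


def best_subset_of_size(estimator, X, y, k, cv=5,
                        n_jobs=-1, pre_dispatch='2*n_jobs'):
    # Exhaustive search over all subsets of size k, evaluated in parallel.
    candidates = [list(c) for c in combinations(range(X.shape[1]), k)]
    results = Parallel(n_jobs=n_jobs, pre_dispatch=pre_dispatch)(
        delayed(_score_subset)(estimator, X, y, subset, cv)
        for subset in candidates)
    return max(results, key=lambda r: r[1])
```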

@pep8speaks

pep8speaks commented May 16, 2017

Hello @whalebot-helmsman! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 18, 2017 at 20:57 Hours UTC

@coveralls

Coverage Status

Coverage increased (+0.01%) to 93.463% when pulling 12f14a2 on whalebot-helmsman:#191 into 861cade on rasbt:master.

@rasbt changed the title from "#191" to "Multiprocessing over features rather than CV folds in Sequential Feature Selection (addressing #191)" on May 17, 2017
@rasbt
Owner

rasbt commented May 18, 2017

Looks great, thanks a lot! I will add a few minor code comments regarding the documentation. (Alternatively, I am happy to add those to this PR myself if you enable the "repository maintainer permissions on existing pull requests" feature.)

@rasbt
Owner

rasbt commented May 18, 2017

Okay, GitHub didn't let me attach the comments directly to the code, since those lines weren't modified in this PR, so let me post them here :)

Could you change the docstrings for n_jobs and pre_dispatch slightly to the following (they currently still reference the old cross_val_score parallelization):

"""
...
    n_jobs : int (default: 1)
        The number of CPUs to use for evaluating different feature subsets
        in parallel. -1 means 'all CPUs'.
    pre_dispatch : int, or string (default: '2*n_jobs')
        Controls the number of jobs that get dispatched
        during parallel execution if `n_jobs > 1` or `n_jobs=-1`.
        Reducing this number can be useful to avoid an explosion of
        memory consumption when more jobs get dispatched than CPUs can process.
        This parameter can be:
        None, in which case all the jobs are immediately created and spawned.
            Use this for lightweight and fast-running jobs,
            to avoid delays due to on-demand spawning of the jobs
        An int, giving the exact number of total jobs that are spawned
        A string, giving an expression as a function
            of n_jobs, as in `2*n_jobs`
...
"""

Also, could you add the following snippet to the Changelog in docs/sources/CHANGELOG.md?

##### Changes

...

- Parallel execution in `mlxtend.feature_selection.SequentialFeatureSelector` and `mlxtend.feature_selection.ExhaustiveFeatureSelector` is now performed over different feature subsets instead of the different cross-validation folds to better utilize machines with multiple processors if the number of features is large ([#193](https://github.com/rasbt/mlxtend/pull/193), via [@whalebot-helmsman](https://github.com/whalebot-helmsman)).

Otherwise, the PR looks great, thanks a lot!
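
For illustration, a minimal usage sketch of the two parameters being documented (the dataset and estimator are chosen arbitrarily and are not part of this PR):

```python
# Minimal usage sketch for n_jobs / pre_dispatch; dataset and estimator
# are arbitrary choices, not taken from this PR.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                k_features=3,
                                cv=5,
                                n_jobs=-1,                # one job per feature subset
                                pre_dispatch='2*n_jobs')  # cap queued jobs
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_, sfs.k_score_)
```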

@whalebot-helmsman
Contributor Author

Added changelog and documentation updates

@@ -66,7 +66,8 @@ class ExhaustiveFeatureSelector(BaseEstimator, MetaEstimatorMixin):
         otherwise.
         No cross-validation if cv is None, False, or 0.
     n_jobs : int (default: 1)
-        The number of CPUs to use for cross validation. -1 means 'all CPUs'.
+        The number of CPUs to use for evaluating different feature subsets
+        in parallel. -1 means 'all CPUs'.
     pre_dispatch : int, or string (default: '2*n_jobs')
         Controls the number of jobs that get dispatched
         during parallel execution in cross_val_score.
@rasbt
Owner

Thanks! Could you also update the pre_dispatch docstring, since it still references cross_val_score, which may be misleading.

@@ -84,7 +84,8 @@ class SequentialFeatureSelector(BaseEstimator, MetaEstimatorMixin):
         exclusion/inclusion if floating=True and
         algorithm gets stuck in cycles.
     n_jobs : int (default: 1)
-        The number of CPUs to use for cross validation. -1 means 'all CPUs'.
+        The number of CPUs to use for evaluating different feature subsets
+        in parallel. -1 means 'all CPUs'.
     pre_dispatch : int, or string (default: '2*n_jobs')
         Controls the number of jobs that get dispatched
         during parallel execution in cross_val_score.
@rasbt
Owner

Also here, thanks!

@coveralls

Coverage Status

Coverage increased (+0.01%) to 93.463% when pulling c75003b on whalebot-helmsman:#191 into 861cade on rasbt:master.

@rasbt
Owner

rasbt commented May 18, 2017

I have to look into whether Travis CI supports multiple processors, to better test PRs like this one. However, I just tried it locally on 8 CPUs, and it yields the same results as the unit tests run with 1 processor -- everything seems to work as expected, so I will merge it now.

Thanks a lot for this really useful contribution!

@rasbt merged commit 1b0decf into rasbt:master on May 18, 2017
@whalebot-helmsman
Contributor Author

Do you use the unit tests for comparison? The tasks in the unit tests are very easy in terms of CPU time required, so the overhead of starting new processes outweighs the gains from multiprocessing.
test.py imitates a hard (CPU-time-intensive) training set by running a slow classifier on a small public dataset. You can measure the speed-up with the commands below (the parameter is n_jobs; see the sketch after these commands):

time python test.py 1
time python test.py 2
time python test.py 4
time python test.py 8 
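
(test.py itself is not included in this thread; a hypothetical reconstruction along the lines described above -- a deliberately slow classifier on a small public dataset, with n_jobs taken from the command line -- might look like this:)

```python
# Hypothetical reconstruction of the benchmark described above; the real
# test.py is not shown in this thread.
import sys
import time

from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

n_jobs = int(sys.argv[1]) if len(sys.argv) > 1 else 1
X, y = load_wine(return_X_y=True)

# A large n_estimators makes each cross-validation run CPU-heavy, so the
# per-process startup overhead is amortized.
clf = GradientBoostingClassifier(n_estimators=500)
sfs = SequentialFeatureSelector(clf, k_features=5, cv=5, n_jobs=n_jobs)

start = time.time()
sfs.fit(X, y)
print('n_jobs=%d: %.1f s' % (n_jobs, time.time() - start))
```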

@rasbt
Owner

rasbt commented May 19, 2017

Do you use the unit tests for comparison?

Yeah, I was primarily looking for correctness, i.e., that the previous results in the unit tests are reproduced with more than one CPU.
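
Expressed as code, that check amounts to something like the following sketch (not the actual mlxtend unit test):

```python
# Sketch of the correctness check: the selector should pick the same
# features and the same score regardless of n_jobs.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

X, y = load_iris(return_X_y=True)
results = []
for n_jobs in (1, 8):
    sfs = SequentialFeatureSelector(KNeighborsClassifier(), k_features=3,
                                    cv=5, n_jobs=n_jobs).fit(X, y)
    results.append((sfs.k_feature_idx_, sfs.k_score_))
assert results[0] == results[1]
```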

@rasbt mentioned this pull request on Jun 23, 2017