Multiprocessing over features rather than CV folds in Sequential Feature Selection (addressing #191) #193

Merged
16 commits merged into rasbt:master on May 18, 2017

Conversation

whalebot-helmsman
Contributor

Description

Use one process per feature (rather than per cross-validation fold) for sequential feature selection and exhaustive feature selection

Related issues or pull requests

Fixes #191
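
In essence, the change moves the joblib parallelism from the cross-validation folds to the candidate feature subsets. A minimal sketch of that scheme (the helper names are hypothetical, not the actual mlxtend internals):

```python
# Sketch of parallelizing over candidate feature subsets, one job per
# subset; helper names are hypothetical, not the mlxtend internals.
from itertools import combinations

from joblib import Parallel, delayed
from sklearn.model_selection import cross_val_score


def _score_subset(estimator, X, y, subset, cv):
    # Each job cross-validates one candidate subset; the CV folds run
    # sequentially inside the job.
    return subset, cross_val_score(estimator, X[:, subset], y, cv=cv).mean()


def best_subset_of_size(estimator, X, y, k, cv=5,
                        n_jobs=-1, pre_dispatch='2*n_jobs'):
    # Exhaustive search over all subsets of size k, evaluated in parallel.
    candidates = [list(c) for c in combinations(range(X.shape[1]), k)]
    results = Parallel(n_jobs=n_jobs, pre_dispatch=pre_dispatch)(
        delayed(_score_subset)(estimator, X, y, subset, cv)
        for subset in candidates)
    return max(results, key=lambda r: r[1])
```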

@pep8speaks

pep8speaks commented May 16, 2017

Hello @whalebot-helmsman! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 18, 2017 at 20:57 Hours UTC

@coveralls

Coverage Status

Coverage increased (+0.01%) to 93.463% when pulling 12f14a2 on whalebot-helmsman:#191 into 861cade on rasbt:master.

@rasbt changed the title from "#191" to "Multiprocessing over features rather than CV folds in Sequential Feature Selection (addressing #191)" on May 17, 2017
@rasbt
Owner

rasbt commented May 18, 2017

Looks great, thanks a lot! I will add a few minor code comments regarding the documentation. (Alternatively, I am happy to add those to this PR myself if you enable the "repository maintainer permissions on existing pull requests" feature.)

@rasbt
Owner

rasbt commented May 18, 2017

Okay, GitHub didn't let me attach the comments directly to the code, since those lines weren't modified in this PR, so let me post them here :)

Could you change the docstrings for n_jobs and pre_dispatch slightly to the following (they currently still reference the old cross_val_score parallelization):

"""
...
    n_jobs : int (default: 1)
        The number of CPUs to use for evaluating different feature subsets
        in parallel. -1 means 'all CPUs'.
    pre_dispatch : int, or string (default: '2*n_jobs')
        Controls the number of jobs that get dispatched
        during parallel execution if `n_jobs > 1` or `n_jobs=-1`.
        Reducing this number can be useful to avoid an explosion of
        memory consumption when more jobs get dispatched than CPUs can process.
        This parameter can be:
        None, in which case all the jobs are immediately created and spawned.
            Use this for lightweight and fast-running jobs,
            to avoid delays due to on-demand spawning of the jobs
        An int, giving the exact number of total jobs that are spawned
        A string, giving an expression as a function
            of n_jobs, as in `2*n_jobs`
...
"""

Also, could you add the following snippet to the Changelog in docs/sources/CHANGELOG.md?

##### Changes

...

- Parallel execution in `mlxtend.feature_selection.SequentialFeatureSelector` and `mlxtend.feature_selection.ExhaustiveFeatureSelector` is now performed over different feature subsets instead of the different cross-validation folds to better utilize machines with multiple processors if the number of features is large ([#193](https://github.com/rasbt/mlxtend/pull/193), via [@whalebot-helmsman](https://github.com/whalebot-helmsman)).

Otherwise, the PR looks great, thanks a lot!
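
For illustration, a minimal usage sketch of the two parameters being documented (the dataset and estimator are chosen arbitrarily and are not part of this PR):

```python
# Minimal usage sketch for n_jobs / pre_dispatch; dataset and estimator
# are arbitrary choices, not taken from this PR.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                k_features=3,
                                cv=5,
                                n_jobs=-1,                # one job per feature subset
                                pre_dispatch='2*n_jobs')  # cap queued jobs
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_, sfs.k_score_)
```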

@whalebot-helmsman
Contributor Author

Added changelog and documentation updates

@@ -66,7 +66,8 @@ class ExhaustiveFeatureSelector(BaseEstimator, MetaEstimatorMixin):
         otherwise.
         No cross-validation if cv is None, False, or 0.
     n_jobs : int (default: 1)
-        The number of CPUs to use for cross validation. -1 means 'all CPUs'.
+        The number of CPUs to use for evaluating different feature subsets
+        in parallel. -1 means 'all CPUs'.
     pre_dispatch : int, or string (default: '2*n_jobs')
         Controls the number of jobs that get dispatched
         during parallel execution in cross_val_score.
@rasbt
Owner

Thanks! Could you also update the pre_dispatch docstring, since it still references cross_val_score, which may be misleading.

@@ -84,7 +84,8 @@ class SequentialFeatureSelector(BaseEstimator, MetaEstimatorMixin):
         exclusion/inclusion if floating=True and
         algorithm gets stuck in cycles.
     n_jobs : int (default: 1)
-        The number of CPUs to use for cross validation. -1 means 'all CPUs'.
+        The number of CPUs to use for evaluating different feature subsets
+        in parallel. -1 means 'all CPUs'.
     pre_dispatch : int, or string (default: '2*n_jobs')
         Controls the number of jobs that get dispatched
         during parallel execution in cross_val_score.
@rasbt
Owner

Also here, thanks!

@coveralls

Coverage Status

Coverage increased (+0.01%) to 93.463% when pulling c75003b on whalebot-helmsman:#191 into 861cade on rasbt:master.

@rasbt
Owner

rasbt commented May 18, 2017

I have to look into whether Travis CI supports multiple processors, to better test PRs like this one. However, I just tried it locally on 8 CPUs, and it yields the same results as the unit tests run with 1 processor -- everything seems to work as expected, so I will merge it now.

Thanks a lot for this really useful contribution!

@rasbt merged commit 1b0decf into rasbt:master on May 18, 2017
@whalebot-helmsman
Contributor Author

Do you use the unit tests for comparison? The tasks in the unit tests are very easy in terms of CPU time required, so the overhead of starting new processes outweighs the gains from multiprocessing.
test.py imitates a hard (CPU-time-intensive) training set by running a slow classifier on a small public dataset. You can measure the speed-up with the commands below (the parameter is n_jobs; see the sketch after these commands):

time python test.py 1
time python test.py 2
time python test.py 4
time python test.py 8 
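
(test.py itself is not included in this thread; a hypothetical reconstruction along the lines described above -- a deliberately slow classifier on a small public dataset, with n_jobs taken from the command line -- might look like this:)

```python
# Hypothetical reconstruction of the benchmark described above; the real
# test.py is not shown in this thread.
import sys
import time

from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

n_jobs = int(sys.argv[1]) if len(sys.argv) > 1 else 1
X, y = load_wine(return_X_y=True)

# A large n_estimators makes each cross-validation run CPU-heavy, so the
# per-process startup overhead is amortized.
clf = GradientBoostingClassifier(n_estimators=500)
sfs = SequentialFeatureSelector(clf, k_features=5, cv=5, n_jobs=n_jobs)

start = time.time()
sfs.fit(X, y)
print('n_jobs=%d: %.1f s' % (n_jobs, time.time() - start))
```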

@rasbt
Owner

rasbt commented May 19, 2017

Do you use the unit tests for comparison?

Yeah, I was primarily looking for correctness, i.e., that the previous results in the unit tests are reproduced with more than one CPU.
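
Expressed as code, that check amounts to something like the following sketch (not the actual mlxtend unit test):

```python
# Sketch of the correctness check: the selector should pick the same
# features and the same score regardless of n_jobs.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

X, y = load_iris(return_X_y=True)
results = []
for n_jobs in (1, 8):
    sfs = SequentialFeatureSelector(KNeighborsClassifier(), k_features=3,
                                    cv=5, n_jobs=n_jobs).fit(X, y)
    results.append((sfs.k_feature_idx_, sfs.k_score_))
assert results[0] == results[1]
```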

@rasbt mentioned this pull request on Jun 23, 2017