Adding Multiprocessing feature #998

Ne-oL · 2022-11-23T13:15:16Z

Code of Conduct

Description

This update to the code allows the plot of Decision Regions to be executed on all available cores, in addition there an option to implement it on cluster-wide nodes using the 'ray' parameter in the calling.

I'm not a programmer by profession so I'm sure my implementation is not pythonic and probably needs a lot of work. so I hope the maintainer can modify or take inspiration of my implementation in case it can't be merged with the main branch as it is.

I found that the speed bottleneck for the decision_regions drawing was due to the slow prediction. to fix it, I added parallelization and distributed the iterative prediction on all available CPUs (local ). I have tested the changes and so far there doesn't seem to be any issues. the speed gain i measured on a small dataset was 21X the original (449 seconds vs 21 seconds on local HPC instance of 36 cores). the large dataset >9,000 input took days(>5 days) to draw, now its taking ~12 minutes on a HPC Cluster (72CPUs). its not their intended goal but it basically addresses issues #801, #804

Related issues or pull requests

Fixes #804, #801

Pull Request Checklist

Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
Checked for style issues by running flake8 ./mlxtend

found that the bottleneck for the decision_regions drawing was due to the slow prediction. to fix it, i added ray parallelization and distributed the iterative prediction on all available CPUs (local or within cluster). i have tested the changes and so far there doesn't seem to be any issues. the speed gain i measured on a small dataset was 21X the original (449 seconds vs 21 seconds on local HPC instance of 36 cores). the large dataset >2,000 input took days(>5 days) to draw, now its taking ~12 minutes on a HPC Cluster (72CPUs). unfortunately, due to not being a programmer, I think my implementation needs a lot of optimization and its definitely not the most pythonic way. I failed to implement the parallelization using python multiprocessing module due to pickling issue. hence using ray algorithm.

Fixed the python native Multiprocessing implementation and made it the default workflow execution. added the parameter for ray utilization in the case of running the code in an HPC cluster.

- Removed Ray support as it would be too complicated for inexperienced users and not many would have access to HPC clusters - Renamed cluster variable to n_jobs to follow Scikit-Learn API - Changed the default workflow to one process with the option of enabling multiprocessing through the n_jobs - Added small test to check if n_jobs values is above the available CPU resources

removed redundant code.

corrected importing mistake

rasbt · 2022-12-13T01:45:18Z

Wow thanks for this! I've been out of office and am a bit tied up right now, but I am looking forward to reviewing this PR. Sounds super awesome and promising. Thanks for submitting this!

codecov · 2022-12-13T02:01:07Z

Codecov Report

Patch coverage has no change and project coverage change: -0.17 ⚠️

Comparison is base (33b9a6c) 77.46% compared to head (806de15) 77.30%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #998      +/-   ##
==========================================
- Coverage   77.46%   77.30%   -0.17%     
==========================================
  Files         200      200              
  Lines       11287    11311      +24     
  Branches     1480     1486       +6     
==========================================
  Hits         8744     8744              
- Misses       2324     2348      +24     
  Partials      219      219

Impacted Files	Coverage Δ
mlxtend/classifier/stacking_cv_classification.py	`100.00% <ø> (ø)`
mlxtend/plotting/decision_regions.py	`0.00% <0.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Ne-oL · 2022-12-13T14:11:37Z

No worries, i really appreciate the work you have done in this library. In all honesty, i copied 'decision_regions' content to my jupyter notebook, modified it, and have been using it as a method ever since. So i didn't do my due diligence in testing it as a package, But I will review the code again and troubleshoot any issues in the weekend. Hopefully, when you have the time, it would be in a much better shape. Thank you again for the awesome library.

Ne-oL · 2022-12-20T05:57:52Z

sorry i was busy and didn't have the time to work on it in the weekend. So it took a bit longer to finish it. here is a summary of what i have done in it:

I fixed all the issues and the package seems to be working fine now.
working with Multiprocessing module complicated the code a bit as the "parallel" function could not be used inside the plotting function (mp can only pickle top funcs) so i had to pass any needed variables as parameters. (this could be alleviated if another parallelization library was used like Multiprocess or ray).
created xtype variable to pass the X.dtype output instead of passing X as a whole.
Logically n_jobs default should be 1, but i kept the n_jobs default as "None" to follow Scikit-learn API, changing the variable value later to 1 inside the function, its ugly but it keeps the API consistent to any users.

As you can see, there are several opinionated choices in the code, however, as i mentioned before I'm not a programmer so my choices are just what seems logical to me (and not necessarily the best or the pythonic way) so please feel free to change it as you see fit.

rasbt · 2023-03-28T22:16:04Z

I am super sorry about this! I must have been out during the holidays when this PR landed, and I am just seeing this now! Thanks for your work on this, and I am definitely excited to merge this into mlxtend before the next version release!

review-notebook-app · 2023-03-28T23:06:11Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

rasbt · 2023-03-28T23:07:21Z

I think this was a bit messy since you made the changes in master, but no worries. This is a fantastic PR btw. Thanks a lot!

Ne-oL · 2023-03-29T08:32:48Z

Sorry about the noob mistake of making changes directly in master, its my first PR in all honesty. is there a way to easily rectify it? and sorry for troubling you with the linter and formatting issues. i didn't know how to do it and just forgot about it with my ongoing research. I'm unsure if there is anything extra on my end required to complete the merge or its all up to you now?

rasbt · 2023-03-29T12:44:36Z

No worries @Ne-oL it's all good now :). Thanks a for the PR!

Ne-oL added 6 commits November 23, 2022 16:38

Update decision_regions.py

4bf4013

Fixed the python native Multiprocessing implementation and made it the default workflow execution. added the parameter for ray utilization in the case of running the code in an HPC cluster.

Merge branch 'rasbt:master' into master

1125f0f

Update decision_regions.py

a2d8741

removed redundant code.

Update decision_regions.py

9db0b63

corrected importing mistake

Ne-oL added 3 commits December 20, 2022 09:37

Update decision_regions.py

8b75856

Update decision_regions.py

b416919

Merge branch 'rasbt:master' into master

0bf0821

Merge branch 'rasbt:master' into master

c8f6472

doc updates and cosmetics

b3e4722

Merge branch 'master' into master

806de15

rasbt merged commit 77d0696 into rasbt:master Mar 29, 2023
2 of 4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Multiprocessing feature #998

Adding Multiprocessing feature #998

Ne-oL commented Nov 23, 2022 •

edited by rasbt

rasbt commented Dec 13, 2022

codecov bot commented Dec 13, 2022 •

edited

Ne-oL commented Dec 13, 2022

Ne-oL commented Dec 20, 2022 •

edited

rasbt commented Mar 28, 2023

review-notebook-app bot commented Mar 28, 2023

rasbt commented Mar 28, 2023

Ne-oL commented Mar 29, 2023 •

edited

rasbt commented Mar 29, 2023

Adding Multiprocessing feature #998

Adding Multiprocessing feature #998

Conversation

Ne-oL commented Nov 23, 2022 • edited by rasbt

Code of Conduct

Description

Related issues or pull requests

Pull Request Checklist

rasbt commented Dec 13, 2022

codecov bot commented Dec 13, 2022 • edited

Codecov Report

Ne-oL commented Dec 13, 2022

Ne-oL commented Dec 20, 2022 • edited

rasbt commented Mar 28, 2023

review-notebook-app bot commented Mar 28, 2023

rasbt commented Mar 28, 2023

Ne-oL commented Mar 29, 2023 • edited

rasbt commented Mar 29, 2023

Ne-oL commented Nov 23, 2022 •

edited by rasbt

codecov bot commented Dec 13, 2022 •

edited

Ne-oL commented Dec 20, 2022 •

edited

Ne-oL commented Mar 29, 2023 •

edited